A single exome

In the age of 50,000+ and 60,000+ whole exome catalogues, it's hard to find processed data for a single exome. At least I had trouble trying to find a single VCF file for a single exome from one individual. After searching for a while, I gave up and decided to generate one myself. This post is on how I generated a single VCF file, which I have hosted on my web server.

I used data from The 1000 Genomes Project, in particular, I chose individual HG00284, which as far as I'm aware does not have any known medical conditions. I downloaded the FASTQ files corresponding to the whole-exome sequencing and a BED file of the targeted regions.

# http://www.internationalgenome.org/data-portal/sample/HG00284
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed

# I renamed the files to the expected format of the pipeline
# you can get this info from the FASTQ header
mv ERR031940_1.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R1.fastq.gz
mv ERR031940_2.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R2.fastq.gz

I called variants using GATK version 3.3-0-g37228af, the 2.8 data bundle, BWA version 0.7.12-r1039, and SAMtools 1.2. The GATK pipeline I used is documented in my GitHub repository.

gunzip -c exome.vcf.gz | grep -v "^#" | wc -l

Actually, before I decided to run the GATK pipeline I thought I could just download the whole-genome sequencing (WGS) VCF file and use GATK's SelectVariants to subset HG00284 based on the exome target BED file. But the WGS VCF file does not contain the individual genotypes! However, I found out later that the individual genotypes are in the VCF files for each individual chromosome, so I downloaded them all and ran SelectVariants on each chromosomal VCF file. Finally, I concatenated all the subsetted VCF files.

# download reference from ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

for file in `ls ALL.chr*.gz`;
   do echo $file
   base=`basename $file .vcf.gz`

   java -Xmx2g -jar ~/src/GenomeAnalysisTK.jar \
   -R $my_genome \
   -T SelectVariants \
   --variant $file \
   -o ${base}_HG00284_exome.vcf \
   -L 20130108.exome.targets.bed \
   --excludeNonVariants \
   --num_threads 64 \
   -sn HG00284


# get one VCF header
zcat ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes_HG00284_exome.vcf | grep "^#" > header

for file in `ls *.vcf`;
   do echo $file;
   cat $file | grep -v "^#" >> blah

cat header blah > merged.vcf
rm -rf header blah

Comparing VCF files

I used SnpSift to compare the two VCF files: the one where I called variants from the WES FASTQ files and the other where I subsetted the WGS VCF based on the targeted exome regions.

java -Xmx1g -jar ~/src/snpEff/SnpSift.jar \
concordance -v HG00284/exome.vcf \
HG00284_subset/merged.vcf > blah

cat concordance_exome_merged.by_sample.txt | transpose
sample  HG00284

MISSING_ENTRY_exome/MISSING_ENTRY_merged        0
MISSING_ENTRY_exome/ALT_1       6010
MISSING_ENTRY_exome/ALT_2       3526
MISSING_GT_exome/MISSING_GT_merged      0
MISSING_GT_exome/REF    0
MISSING_GT_exome/ALT_1  0
MISSING_GT_exome/ALT_2  0
REF/MISSING_ENTRY_merged        0
REF/MISSING_GT_merged   0
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_merged      27376
ALT_1/MISSING_GT_merged 0
ALT_1/REF       0
ALT_1/ALT_1     15553
ALT_1/ALT_2     67
ALT_2/MISSING_ENTRY_merged      14285
ALT_2/MISSING_GT_merged 0
ALT_2/REF       0
ALT_2/ALT_1     564
ALT_2/ALT_2     10187
ERROR   233

There were 9,536 (6,010 + 3,526) variants present in the merged VCF file (from WGS) and not in the exome VCF file (from WES). There were 41,661 (27,376 + 14,285) variants present in the exome VCF file and not in the merged VCF file. There was high concordance (~97.6% or 25,740/26,371*100) in the variants that were present in both files (26,371). I have written a post on how to interpret the SnpSift concordance output.

Variant annotation

After calling the variants, I used vt to decompose and normalise the VCF file, VEP for variant annotation, and GEMINI to store the variants.

vt decompose -s  | vt normalize -r $GENOME - > exome.vt.vcf

perl ~/src/ensembl-tools-release-82/scripts/variant_effect_predictor/variant_effect_predictor.pl -i exome.vt.vcf -o exome.vt.vep.vcf --vcf
         --fork 16 --offline --cache --sift b --polyphen b --symbol --numbers --biotype --total_length
         --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE

gemini load -v exome.vt.vep.vcf -t VEP --cores 16 exome.vt.vep.db

gemini query -q "select count(*) from variants" --header *.db

gemini query -q 'select impact_severity, count(impact_severity) from variants group by impact_severity' --header *.db
impact_severity count(impact_severity)
HIGH    428
LOW     57419
MED     10730

There are 428 high impact variants in this individual; these variants include frameshifts, stop-gains, and splice acceptor/donor variants.

gemini query -q 'select clinvar_disease_name, clinvar_dbsource_id, clinvar_sig, (gts).(*) from variants where clinvar_sig is "pathogenic"' --header *.db 
clinvar_disease_name    clinvar_dbsource_id     clinvar_sig     gts.HG00284
Neutrophil-specific_antigens_na1/na2    610665.0001     pathogenic      T/C
Neutrophil-specific_antigens_na1/na2    610665.0001     pathogenic      T/C
Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|not_provided  NM_000060.2:c.1330G>C|5519|CM980257|609019.0005|609019.0009 pathogenic      G/C
Prekallikrein_deficiency        229000.0005     pathogenic      A/A
Hereditary_pancreatitis .       pathogenic      C/T
Hereditary_pancreatitis 276000.0002|276000.0004 pathogenic      A/T
Hereditary_pancreatitis 276000.0007     pathogenic      A/G
Prostate_cancer,_hereditary,_13 157145.0001     pathogenic      T/C
Serum_amyloid_a_variant 104750.0002     pathogenic      T/T
4-Alpha-hydroxyphenylpyruvate_hydroxylase_deficiency    609695.0005     pathogenic      T/C
Tyrosinase-positive_oculocutaneous_albinism     611409.0003     pathogenic      C/T
BARDET-BIEDL_SYNDROME_2/6,_DIGENIC      606151.0013     pathogenic      T/T
Prostate_cancer,_hereditary,_2  605367.0002     pathogenic      C/T
Prostate_cancer,_hereditary,_2  605367.0001     pathogenic      G/A
Myeloperoxidase_deficiency      606989.0005     pathogenic      G/A
SPASTIC_PARAPLEGIA_75,_AUTOSOMAL_RECESSIVE      159460.0002     pathogenic      C/T

This individual harbours 16 pathogenic variants and is homozygous for three of them (other diseases may be dominant).


I have generated a VCF file based on WES sequencing on one individual (HG00284) from the 1000 Genomes Project. In addition, I generated another VCF file of the same exome regions using the WGS VCF files. The WES VCF file contained more variants owning to its deeper sequencing. Of the variants present in both VCF files, there was a high concordance (~97.6%). Lastly, I annotated the WES VCF file using VEP. Using GEMINI, it was shown that HG00284 has 428 high impact variants and 16 pathogenic ClinVar variants (ClinVar variants can have several annotations; the 16 variants are classified just as pathogenic and nothing else).

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.