A single exome

In the age of 50,000+ and 60,000+ whole exome catalogues, it’s hard to find processed data for a single exome. At least I had trouble trying to find a single VCF file for a single exome from one individual. After searching for a while, I gave up and decided to generate one myself. This post is on how I generated a single VCF file, which I have hosted on my web server.

I used data from The 1000 Genomes Project, in particular, I chose individual HG00284, which as far as I’m aware does not have any known medical conditions. I downloaded the FASTQ files corresponding to the whole-exome sequencing and a BED file of the targeted regions.

# http://www.internationalgenome.org/data-portal/sample/HG00284
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed

# I renamed the files to the expected format of the pipeline
# you can get this info from the FASTQ header
mv ERR031940_1.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R1.fastq.gz
mv ERR031940_2.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R2.fastq.gz

I called variants using GATK version 3.3-0-g37228af, the 2.8 data bundle, BWA version 0.7.12-r1039, and SAMtools 1.2. The GATK pipeline I used is documented in my GitHub repository.

gunzip -c exome.vcf.gz | grep -v "^#" | wc -l

Actually, before I decided to run the GATK pipeline I thought I could just download the whole-genome sequencing (WGS) VCF file and use GATK’s SelectVariants to subset HG00284 based on the exome target BED file. But the WGS VCF file does not contain the individual genotypes! However, I found out later that the individual genotypes are in the VCF files for each individual chromosome, so I downloaded them all and ran SelectVariants on each chromosomal VCF file. Finally, I concatenated all the subsetted VCF files.

# download reference from ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

for file in `ls ALL.chr*.gz`;
   do echo $file
   base=`basename $file .vcf.gz`

   java -Xmx2g -jar ~/src/GenomeAnalysisTK.jar \
   -R $my_genome \
   -T SelectVariants \
   --variant $file \
   -o ${base}_HG00284_exome.vcf \
   -L 20130108.exome.targets.bed \
   --excludeNonVariants \
   --num_threads 64 \
   -sn HG00284


# get one VCF header
zcat ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes_HG00284_exome.vcf | grep "^#" > header

for file in `ls *.vcf`;
   do echo $file;
   cat $file | grep -v "^#" >> blah

cat header blah > merged.vcf
rm -rf header blah

Comparing VCF files

I used SnpSift to compare the two VCF files: the one where I called variants from the WES FASTQ files and the other where I subsetted the WGS VCF based on the targeted exome regions.

java -Xmx1g -jar ~/src/snpEff/SnpSift.jar \
concordance -v HG00284/exome.vcf \
HG00284_subset/merged.vcf > blah

cat concordance_exome_merged.by_sample.txt | transpose
sample  HG00284

MISSING_ENTRY_exome/MISSING_ENTRY_merged        0
MISSING_ENTRY_exome/ALT_1       6010
MISSING_ENTRY_exome/ALT_2       3526
MISSING_GT_exome/MISSING_GT_merged      0
MISSING_GT_exome/REF    0
MISSING_GT_exome/ALT_1  0
MISSING_GT_exome/ALT_2  0
REF/MISSING_ENTRY_merged        0
REF/MISSING_GT_merged   0
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_merged      27376
ALT_1/MISSING_GT_merged 0
ALT_1/REF       0
ALT_1/ALT_1     15553
ALT_1/ALT_2     67
ALT_2/MISSING_ENTRY_merged      14285
ALT_2/MISSING_GT_merged 0
ALT_2/REF       0
ALT_2/ALT_1     564
ALT_2/ALT_2     10187
ERROR   233

There were 9,536 (6,010 + 3,526) variants present in the merged VCF file (from WGS) and not in the exome VCF file (from WES). There were 41,661 (27,376 + 14,285) variants present in the exome VCF file and not in the merged VCF file. There was high concordance (~97.6% or 25,740/26,371*100) in the variants that were present in both files (26,371). I have written a post on how to interpret the SnpSift concordance output.

Variant annotation

After calling the variants, I used vt to decompose and normalise the VCF file, VEP for variant annotation, and GEMINI to store the variants.

vt decompose -s  | vt normalize -r $GENOME - > exome.vt.vcf

perl ~/src/ensembl-tools-release-82/scripts/variant_effect_predictor/variant_effect_predictor.pl -i exome.vt.vcf -o exome.vt.vep.vcf --vcf
         --fork 16 --offline --cache --sift b --polyphen b --symbol --numbers --biotype --total_length
         --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE

gemini load -v exome.vt.vep.vcf -t VEP --cores 16 exome.vt.vep.db

gemini query -q "select count(*) from variants" --header *.db

gemini query -q 'select impact_severity, count(impact_severity) from variants group by impact_severity' --header *.db
impact_severity count(impact_severity)
HIGH    428
LOW     57419
MED     10730

There are 428 high impact variants in this individual; these variants include frameshifts, stop-gains, and splice acceptor/donor variants.

gemini query -q 'select clinvar_disease_name, clinvar_dbsource_id, clinvar_sig, (gts).(*) from variants where clinvar_sig is "pathogenic"' --header *.db 
clinvar_disease_name    clinvar_dbsource_id     clinvar_sig     gts.HG00284
Neutrophil-specific_antigens_na1/na2    610665.0001     pathogenic      T/C
Neutrophil-specific_antigens_na1/na2    610665.0001     pathogenic      T/C
Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|not_provided  NM_000060.2:c.1330G>C|5519|CM980257|609019.0005|609019.0009 pathogenic      G/C
Prekallikrein_deficiency        229000.0005     pathogenic      A/A
Hereditary_pancreatitis .       pathogenic      C/T
Hereditary_pancreatitis 276000.0002|276000.0004 pathogenic      A/T
Hereditary_pancreatitis 276000.0007     pathogenic      A/G
Prostate_cancer,_hereditary,_13 157145.0001     pathogenic      T/C
Serum_amyloid_a_variant 104750.0002     pathogenic      T/T
4-Alpha-hydroxyphenylpyruvate_hydroxylase_deficiency    609695.0005     pathogenic      T/C
Tyrosinase-positive_oculocutaneous_albinism     611409.0003     pathogenic      C/T
BARDET-BIEDL_SYNDROME_2/6,_DIGENIC      606151.0013     pathogenic      T/T
Prostate_cancer,_hereditary,_2  605367.0002     pathogenic      C/T
Prostate_cancer,_hereditary,_2  605367.0001     pathogenic      G/A
Myeloperoxidase_deficiency      606989.0005     pathogenic      G/A
SPASTIC_PARAPLEGIA_75,_AUTOSOMAL_RECESSIVE      159460.0002     pathogenic      C/T

This individual harbours 16 pathogenic variants and is homozygous for three of them (other diseases may be dominant).


I have generated a VCF file based on WES sequencing on one individual (HG00284) from the 1000 Genomes Project. In addition, I generated another VCF file of the same exome regions using the WGS VCF files. The WES VCF file contained more variants owning to its deeper sequencing. Of the variants present in both VCF files, there was a high concordance (~97.6%). Lastly, I annotated the WES VCF file using VEP. Using GEMINI, it was shown that HG00284 has 428 high impact variants and 16 pathogenic ClinVar variants (ClinVar variants can have several annotations; the 16 variants are classified just as pathogenic and nothing else).

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
5 comments Add yours
    1. Disregard my previous comment. For the exomes, make sure they are both using the same technology, i.e. compatible, before combining them together. If they are the same, you can combine them together and call variants.

  1. Hi Dave, I want to clarify my earlier question. I am attempting to download exome samples from 1000 genome phase3. What I realized was that some individuals have multiple files eg. this individual( https://www.internationalgenome.org/data-portal/sample/NA18505


    I am interested in paired-end sequence. The question is do I pick a pair and call the variants from or I have to combine all the files as you suggested earlier.? Thanks

    1. On https://www.internationalgenome.org/data-portal/sample/NA18505 I can only see two sets of raw Exome FASTQ files (I clicked on the “Sequence” and “Exome” checkboxes.

      ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­SRR716/­SRR716649/­SRR716649_1.­fastq.­gz Exome
      ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­SRR716/­SRR716648/­SRR716648_2.­fastq.­gz Exome
      ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­SRR716/­SRR716648/­SRR716648_1.­fastq.­gz Exome
      ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­SRR716/­SRR716649/­SRR716649_2.­fastq.­gz Exome

      If SRR716648 and SRR716649 are produced using the same capture technology, you can merge both the *1.fastq.gz and *2.fastq.gz files together and call variants on the paired end files.

      If they are produced using different technologies, call variants separately but on each set of paired end reads, i.e. SRR716648_1.­fastq.­gz with SRR716648_2.­fastq.­gz and SRR716649_1.­fastq.­gz with SRR716649_2.­fastq.­gz.

  2. Thanks Dave.
    My project seeks to find snps under selection. I have 14 exome samples of a type of cancer (one phenotype). We did not sequence any controls so we want to use a closely related population in the 1000Genome project as control. My plan is to call variants on these two datasets (meaning two different populations) and perform GWAS analysis on them. For now I am at the stage of downloading the datasets. I have also implemented the GATK pipeline using what you have on github as template. I would like to know if the approach I am using is suitable especially using 1KGenome project as control. Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.