In the age of 50,000+ and 60,000+ whole exome catalogues, it’s hard to find processed data for a single exome. At least I had trouble trying to find a single VCF file for a single exome from one individual. After searching for a while, I gave up and decided to generate one myself. This post is on how I generated a single VCF file, which I have hosted on my web server.
I used data from The 1000 Genomes Project, in particular, I chose individual HG00284, which as far as I’m aware does not have any known medical conditions. I downloaded the FASTQ files corresponding to the whole-exome sequencing and a BED file of the targeted regions.
# http://www.internationalgenome.org/data-portal/sample/HG00284 wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed # I renamed the files to the expected format of the pipeline # you can get this info from the FASTQ header mv ERR031940_1.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R1.fastq.gz mv ERR031940_2.filt.fastq.gz ERR031940_A81BHGABXX_CAATAT_L003_R2.fastq.gz
I called variants using GATK version 3.3-0-g37228af, the 2.8 data bundle, BWA version 0.7.12-r1039, and SAMtools 1.2. The GATK pipeline I used is documented in my GitHub repository.
gunzip -c exome.vcf.gz | grep -v "^#" | wc -l 68267
Actually, before I decided to run the GATK pipeline I thought I could just download the whole-genome sequencing (WGS) VCF file and use GATK’s SelectVariants to subset HG00284 based on the exome target BED file. But the WGS VCF file does not contain the individual genotypes! However, I found out later that the individual genotypes are in the VCF files for each individual chromosome, so I downloaded them all and ran SelectVariants on each chromosomal VCF file. Finally, I concatenated all the subsetted VCF files.
# download reference from ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz my_genome=hs37d5.fa for file in `ls ALL.chr*.gz`; do echo $file base=`basename $file .vcf.gz` java -Xmx2g -jar ~/src/GenomeAnalysisTK.jar \ -R $my_genome \ -T SelectVariants \ --variant $file \ -o ${base}_HG00284_exome.vcf \ -L 20130108.exome.targets.bed \ --excludeNonVariants \ --num_threads 64 \ -sn HG00284 done # get one VCF header zcat ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes_HG00284_exome.vcf | grep "^#" > header for file in `ls *.vcf`; do echo $file; cat $file | grep -v "^#" >> blah done cat header blah > merged.vcf rm -rf header blah
Comparing VCF files
I used SnpSift to compare the two VCF files: the one where I called variants from the WES FASTQ files and the other where I subsetted the WGS VCF based on the targeted exome regions.
java -Xmx1g -jar ~/src/snpEff/SnpSift.jar \ concordance -v HG00284/exome.vcf \ HG00284_subset/merged.vcf > blah cat concordance_exome_merged.by_sample.txt | transpose sample HG00284 MISSING_ENTRY_exome/MISSING_ENTRY_merged 0 MISSING_ENTRY_exome/MISSING_GT_merged 0 MISSING_ENTRY_exome/REF 0 MISSING_ENTRY_exome/ALT_1 6010 MISSING_ENTRY_exome/ALT_2 3526 MISSING_GT_exome/MISSING_ENTRY_merged 0 MISSING_GT_exome/MISSING_GT_merged 0 MISSING_GT_exome/REF 0 MISSING_GT_exome/ALT_1 0 MISSING_GT_exome/ALT_2 0 REF/MISSING_ENTRY_merged 0 REF/MISSING_GT_merged 0 REF/REF 0 REF/ALT_1 0 REF/ALT_2 0 ALT_1/MISSING_ENTRY_merged 27376 ALT_1/MISSING_GT_merged 0 ALT_1/REF 0 ALT_1/ALT_1 15553 ALT_1/ALT_2 67 ALT_2/MISSING_ENTRY_merged 14285 ALT_2/MISSING_GT_merged 0 ALT_2/REF 0 ALT_2/ALT_1 564 ALT_2/ALT_2 10187 ERROR 233
There were 9,536 (6,010 + 3,526) variants present in the merged VCF file (from WGS) and not in the exome VCF file (from WES). There were 41,661 (27,376 + 14,285) variants present in the exome VCF file and not in the merged VCF file. There was high concordance (~97.6% or 25,740/26,371*100) in the variants that were present in both files (26,371). I have written a post on how to interpret the SnpSift concordance output.
Variant annotation
After calling the variants, I used vt to decompose and normalise the VCF file, VEP for variant annotation, and GEMINI to store the variants.
vt decompose -s | vt normalize -r $GENOME - > exome.vt.vcf perl ~/src/ensembl-tools-release-82/scripts/variant_effect_predictor/variant_effect_predictor.pl -i exome.vt.vcf -o exome.vt.vep.vcf --vcf --fork 16 --offline --cache --sift b --polyphen b --symbol --numbers --biotype --total_length --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE gemini load -v exome.vt.vep.vcf -t VEP --cores 16 exome.vt.vep.db gemini query -q "select count(*) from variants" --header *.db count(*) 68577 gemini query -q 'select impact_severity, count(impact_severity) from variants group by impact_severity' --header *.db impact_severity count(impact_severity) HIGH 428 LOW 57419 MED 10730
There are 428 high impact variants in this individual; these variants include frameshifts, stop-gains, and splice acceptor/donor variants.
gemini query -q 'select clinvar_disease_name, clinvar_dbsource_id, clinvar_sig, (gts).(*) from variants where clinvar_sig is "pathogenic"' --header *.db clinvar_disease_name clinvar_dbsource_id clinvar_sig gts.HG00284 Neutrophil-specific_antigens_na1/na2 610665.0001 pathogenic T/C Neutrophil-specific_antigens_na1/na2 610665.0001 pathogenic T/C Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|Biotinidase_deficiency|not_provided NM_000060.2:c.1330G>C|5519|CM980257|609019.0005|609019.0009 pathogenic G/C Prekallikrein_deficiency 229000.0005 pathogenic A/A Hereditary_pancreatitis . pathogenic C/T Hereditary_pancreatitis 276000.0002|276000.0004 pathogenic A/T Hereditary_pancreatitis 276000.0007 pathogenic A/G Prostate_cancer,_hereditary,_13 157145.0001 pathogenic T/C Serum_amyloid_a_variant 104750.0002 pathogenic T/T 4-Alpha-hydroxyphenylpyruvate_hydroxylase_deficiency 609695.0005 pathogenic T/C Tyrosinase-positive_oculocutaneous_albinism 611409.0003 pathogenic C/T BARDET-BIEDL_SYNDROME_2/6,_DIGENIC 606151.0013 pathogenic T/T Prostate_cancer,_hereditary,_2 605367.0002 pathogenic C/T Prostate_cancer,_hereditary,_2 605367.0001 pathogenic G/A Myeloperoxidase_deficiency 606989.0005 pathogenic G/A SPASTIC_PARAPLEGIA_75,_AUTOSOMAL_RECESSIVE 159460.0002 pathogenic C/T
This individual harbours 16 pathogenic variants and is homozygous for three of them (other diseases may be dominant).
Summary
I have generated a VCF file based on WES sequencing on one individual (HG00284) from the 1000 Genomes Project. In addition, I generated another VCF file of the same exome regions using the WGS VCF files. The WES VCF file contained more variants owning to its deeper sequencing. Of the variants present in both VCF files, there was a high concordance (~97.6%). Lastly, I annotated the WES VCF file using VEP. Using GEMINI, it was shown that HG00284 has 428 high impact variants and 16 pathogenic ClinVar variants (ClinVar variants can have several annotations; the 16 variants are classified just as pathogenic and nothing else).

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Thanks Dave. I am also planning to download exome data from 1000 genomes phase 3. But I realized some samples have more than 2 files. Do I choose just 2 or all and call the snps . An example is this file
https://www.internationalgenome.org/data-portal/sample/NA18505
Disregard my previous comment. For the exomes, make sure they are both using the same technology, i.e. compatible, before combining them together. If they are the same, you can combine them together and call variants.
Hi Dave, I want to clarify my earlier question. I am attempting to download exome samples from 1000 genome phase3. What I realized was that some individuals have multiple files eg. this individual( https://www.internationalgenome.org/data-portal/sample/NA18505
SRR096669.filt.fastq.gz
SRR096669_1.filt.fastq.gz
SRR096669_2.filt.fastq.gz
SRR096670.filt.fastq.gz
SRR096670_1.filt.fastq.gz
SRR096670_2.filt.fastq.gz
SRR716648.filt.fastq.gz
SRR716648_1.filt.fastq.gz
SRR716648_2.filt.fastq.gz
SRR716649.filt.fastq.gz
SRR716649_1.filt.fastq.gz
SRR716649_2.filt.fastq.gz
I am interested in paired-end sequence. The question is do I pick a pair and call the variants from or I have to combine all the files as you suggested earlier.? Thanks
On https://www.internationalgenome.org/data-portal/sample/NA18505 I can only see two sets of raw Exome FASTQ files (I clicked on the “Sequence” and “Exome” checkboxes.
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR716/SRR716649/SRR716649_1.fastq.gz Exome
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR716/SRR716648/SRR716648_2.fastq.gz Exome
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR716/SRR716648/SRR716648_1.fastq.gz Exome
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR716/SRR716649/SRR716649_2.fastq.gz Exome
If SRR716648 and SRR716649 are produced using the same capture technology, you can merge both the *1.fastq.gz and *2.fastq.gz files together and call variants on the paired end files.
If they are produced using different technologies, call variants separately but on each set of paired end reads, i.e. SRR716648_1.fastq.gz with SRR716648_2.fastq.gz and SRR716649_1.fastq.gz with SRR716649_2.fastq.gz.
Thanks Dave.
My project seeks to find snps under selection. I have 14 exome samples of a type of cancer (one phenotype). We did not sequence any controls so we want to use a closely related population in the 1000Genome project as control. My plan is to call variants on these two datasets (meaning two different populations) and perform GWAS analysis on them. For now I am at the stage of downloading the datasets. I have also implemented the GATK pipeline using what you have on github as template. I would like to know if the approach I am using is suitable especially using 1KGenome project as control. Thanks