The Variant Effect Predictor (VEP) tool can be used for annotating variants with respect to custom annotation sources. This is useful if gene models of interest are not represented in the Ensembl or RefSeq databases. To get started, first install VEP since it takes some time.
git clone https://github.com/Ensembl/ensembl-vep.git cd ensembl-vep perl INSTALL.pl
If all goes well, you will be asked whether you want to install any cache files; it is recommended to use cache files, so select yes. I want to work with the hg19/GRCh37 assembly and I’m not interested in RefSeq models, so I’ll select homo_sapiens_vep_91_GRCh37.tar.gz for my cache. Once you have selected and downloaded the relevant cache, the next step converts the cache (which is performed automatically) so that VEP works more efficiently; for hg19/GRCh37, this step (convert_cache.pl) takes several hours to complete.
The next prompt will ask whether you want to download FASTA files. VEP can use BED, GFF, GTF, VCF, and bigWig files to perform custom annotations. I prefer using GTF files and with this option, VEP requires the genomic sequence in order to generate transcript models; therefore download the relevant reference, which is homo_sapiens for me.
The last prompt will ask whether you want to install any plugins; we won’t be using any plugins for this post so we can skip it; you should now see “All done” and the installation process is complete. You can always run the install script (INSTALL.pl) again to install additional references and plugins.
VEP provides some example VCF files that you can use to test VEP.
vep -i examples/homo_sapiens_GRCh37.vcf --cache --assembly GRCh37 --offline --force_overwrite --no_progress --vcf --output_file my.vcf head my.vcf ##fileformat=VCFv4.0 ##VEP="v91" time="2018-01-04 23:07:28" cache="/home/davetang/.vep/homo_sapiens/91_GRCh37" ensembl-variation=91.c78d8b4 ensembl-io=91.923d668 ensembl=91.18ee742 ensembl-funcgen=91.4681d69 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2" ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID"> #CHROM POS ID REF ALT QUAL FILTER INFO 21 26960070 rs116645811 G A . . CSQ=A|missense_variant|MODERATE|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|10/11||||1043|1001|334|T/M|aCg/aTg|||-1||HGNC|14027,A|intron_variant|MODIFIER|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding||9/9||||||||||-1||HGNC|14027,A|upstream_gene_variant|MODIFIER|LINC00515|ENSG00000260583|Transcript|ENST00000567517|antisense|||||||||||4432|-1||HGNC|16019 21 26965148 rs1135638 G A . . CSQ=A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||876|867|289|G|ggC/ggT|||-1|cds_end_NF|HGNC|14027 21 26965172 rs10576 T C . . CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||852|843|281|P|ccA/ccG|||-1|cds_end_NF|HGNC|14027 21 26965205 rs1057885 T C . . CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||819|810|270|V|gtA/gtG|||-1|cds_end_NF|HGNC|14027 21 26976144 rs116331755 A G . . CSQ=G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|3/11||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|3/10||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|3/8||||393|384|128|L|ctT/ctC|||-1|cds_end_NF|HGNC|14027 21 26976222 rs7278168 C T . . CSQ=T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|3/11||||348|306|102|K|aaG/aaA|||-1||HGNC|14027,T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|3/10||||348|306|102|K|aaG/aaA|||-1||HGNC|14027,T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|3/8||||315|306|102|K|aaG/aaA|||-1|cds_end_NF|HGNC|14027
The hardest part of using VEP for annotating variants with a GTF file is to create a valid GTF file. If you are unfamiliar to GTF files, I have a brief writeup of GTF files. Note that not all transcript biotypes are supported by VEP and unsupported biotypes are ignored. Furthermore, if the ignored biotype leads to an incomplete transcript model, the whole transcript model may be discarded. The following GTF entity types will be parsed by VEP:
- cds (or CDS)
- stop_codon
- exon
- gene
- transcript
Entities need to be linked and unlinked entities, those without parents or children, are discarded. For example, an exon is linked to a transcript by a transcript_id and a transcript is linked to a gene by a gene_id. Below is an example GTF file that I created to use as a custom file for VEP; note that the spaces are tabs.
21 source gene 26960060 26960090 . + 0 gene_id "gene23"; 21 source transcript 26960060 26960090 . + 0 gene_id "gene23"; transcript_id "transcript23" 21 source exon 26960060 26960090 . + 0 gene_id "gene23"; transcript_id "transcript23"; exon_number "1" 21 source cds 26960060 26960090 . + 0 gene_id "gene23"; transcript_id "transcript23"; exon_number "1"
The example above is a simple example, where there is just one exon. The sequence ontology defines an exon as:
A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing.
and a cds as:
A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
My interpretation is that an exon may contain UTRs and a cds is the coding region. In my example the exon and cds are the same; the exon entity is required and cannot be removed or the gene model becomes invalid.
The GTF file needs to be bgzipped and indexed with tabix; clone the htslib repo and follow the installation instructions to install bgzip and tabix if you haven’t already.
bgzip tmp.gtf tabix -p gff tmp.gtf.gz
Next we need an example VCF file to annotate. I’ll use the first entry in the example file homo_sapiens_GRCh37.vcf provided by VEP.
head -3 examples/homo_sapiens_GRCh37.vcf > test.vcf cat test.vcf ##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO 21 26960070 rs116645811 G A . . .
Now to run VEP.
vep -i test.vcf --gtf tmp.gtf.gz --cache --assembly GRCh37 --offline --force_overwrite --no_progress --output_file test_anno.out # annotation is on the last line tail -1 test_anno.out rs116645811 21:26960070 A gene23 transcript23 Transcript missense_variant 11 11 4 R/H cGt/cAt - IMPACT=MODERATE;STRAND=1;SOURCE=tmp.gtf.gz
GTF files use a 1-based system, so 26960070 is 26960069-26960070 in the above screenshot. Our gene model goes GAA-TGA-GTG-CGT-AGT and so on, hence the variant from G->A affects the CGT codon, which is shown in the VEP output file, and causes a missense mutation from arginine (R) to histidine (H).

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Dear Dr. Tang
May I consultant how to format the VEP output to make each mutation only in one line? Currently, one mutation for different samples appears on multi-lines, is there a way to combine them.
Thank you very much!
Best,
Hi Sunny,
can you show an example of your output (and explain your workflow)?
Cheers,
Dave