Annotating variants with a custom file

The Variant Effect Predictor (VEP) tool can be used for annotating variants with respect to custom annotation sources. This is useful if gene models of interest are not represented in the Ensembl or RefSeq databases. To get started, first install VEP since it takes some time.

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

If all goes well, you will be asked whether you want to install any cache files; it is recommended to use cache files, so select yes. I want to work with the hg19/GRCh37 assembly and I’m not interested in RefSeq models, so I’ll select homo_sapiens_vep_91_GRCh37.tar.gz for my cache. Once you have selected and downloaded the relevant cache, the next step converts the cache (which is performed automatically) so that VEP works more efficiently; for hg19/GRCh37, this step (convert_cache.pl) takes several hours to complete.

The next prompt will ask whether you want to download FASTA files. VEP can use BED, GFF, GTF, VCF, and bigWig files to perform custom annotations. I prefer using GTF files and with this option, VEP requires the genomic sequence in order to generate transcript models; therefore download the relevant reference, which is homo_sapiens for me.

The last prompt will ask whether you want to install any plugins; we won’t be using any plugins for this post so we can skip it; you should now see “All done” and the installation process is complete. You can always run the install script (INSTALL.pl) again to install additional references and plugins.

VEP provides some example VCF files that you can use to test VEP.

vep -i examples/homo_sapiens_GRCh37.vcf --cache --assembly GRCh37 --offline --force_overwrite --no_progress --vcf --output_file my.vcf

head my.vcf
##fileformat=VCFv4.0
##VEP="v91" time="2018-01-04 23:07:28" cache="/home/davetang/.vep/homo_sapiens/91_GRCh37" ensembl-variation=91.c78d8b4 ensembl-io=91.923d668 ensembl=91.18ee742 ensembl-funcgen=91.4681d69 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
21      26960070        rs116645811     G       A       .       .       CSQ=A|missense_variant|MODERATE|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|10/11||||1043|1001|334|T/M|aCg/aTg|||-1||HGNC|14027,A|intron_variant|MODIFIER|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding||9/9||||||||||-1||HGNC|14027,A|upstream_gene_variant|MODIFIER|LINC00515|ENSG00000260583|Transcript|ENST00000567517|antisense|||||||||||4432|-1||HGNC|16019
21      26965148        rs1135638       G       A       .       .       CSQ=A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||876|867|289|G|ggC/ggT|||-1|cds_end_NF|HGNC|14027
21      26965172        rs10576 T       C       .       .       CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||852|843|281|P|ccA/ccG|||-1|cds_end_NF|HGNC|14027
21      26965205        rs1057885       T       C       .       .       CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||819|810|270|V|gtA/gtG|||-1|cds_end_NF|HGNC|14027
21      26976144        rs116331755     A       G       .       .       CSQ=G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|3/11||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|3/10||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|3/8||||393|384|128|L|ctT/ctC|||-1|cds_end_NF|HGNC|14027
21      26976222        rs7278168       C       T       .       .       CSQ=T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|3/11||||348|306|102|K|aaG/aaA|||-1||HGNC|14027,T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|3/10||||348|306|102|K|aaG/aaA|||-1||HGNC|14027,T|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|3/8||||315|306|102|K|aaG/aaA|||-1|cds_end_NF|HGNC|14027

The hardest part of using VEP for annotating variants with a GTF file is to create a valid GTF file. If you are unfamiliar to GTF files, I have a brief writeup of GTF files. Note that not all transcript biotypes are supported by VEP and unsupported biotypes are ignored. Furthermore, if the ignored biotype leads to an incomplete transcript model, the whole transcript model may be discarded. The following GTF entity types will be parsed by VEP:

  • cds (or CDS)
  • stop_codon
  • exon
  • gene
  • transcript

Entities need to be linked and unlinked entities, those without parents or children, are discarded. For example, an exon is linked to a transcript by a transcript_id and a transcript is linked to a gene by a gene_id. Below is an example GTF file that I created to use as a custom file for VEP; note that the spaces are tabs.

21      source  gene    26960060        26960090        .       +       0       gene_id "gene23";
21      source  transcript      26960060        26960090        .       +       0       gene_id "gene23"; transcript_id "transcript23"
21      source  exon    26960060        26960090        .       +       0       gene_id "gene23"; transcript_id "transcript23"; exon_number "1"
21      source  cds     26960060        26960090        .       +       0       gene_id "gene23"; transcript_id "transcript23"; exon_number "1"

The example above is a simple example, where there is just one exon. The sequence ontology defines an exon as:

A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing.

and a cds as:

A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

My interpretation is that an exon may contain UTRs and a cds is the coding region. In my example the exon and cds are the same; the exon entity is required and cannot be removed or the gene model becomes invalid.

The GTF file needs to be bgzipped and indexed with tabix; clone the htslib repo and follow the installation instructions to install bgzip and tabix if you haven’t already.

bgzip tmp.gtf
tabix -p gff tmp.gtf.gz

Next we need an example VCF file to annotate. I’ll use the first entry in the example file homo_sapiens_GRCh37.vcf provided by VEP.

head -3 examples/homo_sapiens_GRCh37.vcf > test.vcf

cat test.vcf 
##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
21      26960070        rs116645811     G       A       .       .       .

Now to run VEP.

vep -i test.vcf --gtf tmp.gtf.gz --cache --assembly GRCh37 --offline --force_overwrite --no_progress --output_file test_anno.out

# annotation is on the last line
tail -1 test_anno.out
rs116645811     21:26960070     A       gene23  transcript23    Transcript      missense_variant        11      11      4       R/H     cGt/cAt -       IMPACT=MODERATE;STRAND=1;SOURCE=tmp.gtf.gz

GTF files use a 1-based system, so 26960070 is 26960069-26960070 in the above screenshot. Our gene model goes GAA-TGA-GTG-CGT-AGT and so on, hence the variant from G->A affects the CGT codon, which is shown in the VEP output file, and causes a missense mutation from arginine (R) to histidine (H).

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.