From Dave's wiki

Variant Effect Predictor

Setting up Perl

tar -xzf ActivePerl-
cd ActivePerl-
ppm install Archive::Extract
ppm install DBD::mysql

Downloading and installing

cd ensembl-tools-release-82/scripts/variant_effect_predictor 

Downloading cache

# you can either use the script
# or do it manually as below
mkdir ~/.vep
cd ~/.vep/
wget -c
tar -xzf homo_sapiens_vep_82_GRCh37.tar.gz
# run VEP with the cache option
perl --cache -i input.txt -o output.txt

Running VEP

See -i example_GRCh37.vcf --cache --assembly GRCh37 --offline --force_overwrite --no_progress --format vcf --output_file my.vcf

In newer versions of VEP (such as version 91), the Perl script has been renamed to vep and the example files are in the examples directory.

vep -i examples/homo_sapiens_GRCh37.vcf --cache --assembly GRCh37 --offline --force_overwrite --no_progress --vcf --output_file my.vcf

Custom annotation

VEP can integrate custom annotation from standard format files into your results by using the --custom flag. These files may be hosted locally or remotely, with no limit to the number or size of the files. BED, GFF, GTF, and VCF files need to be indexed using tabix; bigWig files contain their own indices.
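As a rough sketch of the preparation step, the functions below sort a BED file, compress and index it, and build the comma-separated value for --custom. They assume bgzip and tabix (from htslib) are on the PATH; the function names and any file names are made up for illustration. The positional order used here (file, short name, format, annotation type, coordinates flag) is the older comma-separated form; newer releases may expect a different syntax, so check the documentation for your version.

```shell
# Sort a BED file by chromosome, then by start position, as tabix expects.
# Lines starting with "#", "track" or "browser" are treated as headers and dropped.
sort_bed() {
    grep -v -E '^(#|track|browser)' "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Compress the sorted file with bgzip and index it with tabix (both from htslib).
index_bed() {
    sort_bed "$1" | bgzip -c > "$1.gz" && tabix -p bed "$1.gz"
}

# Build the value for --custom: file, short name, format, annotation type, coordinates flag.
custom_arg() {
    printf '%s,%s,bed,overlap,0\n' "$1" "$2"
}
```

With a hypothetical myFeatures.bed, this would be used as index_bed myFeatures.bed, followed by adding --custom "$(custom_arg myFeatures.bed.gz myFeatures)" to the vep command line.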

Annotations typically appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.
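To pull one custom annotation back out of the default tab-delimited output, something like the following should work. It is only a sketch: it assumes the Extra column is the last column and that pairs are separated by semicolons, and the function name extra_value is made up here.

```shell
# Print "<variant ID><TAB><value>" for every output line whose Extra column has the key.
# $1 = annotation key (e.g. the short name given to --custom), $2 = VEP output file
extra_value() {
    awk -F '\t' -v key="$1" '
        /^#/ { next }                      # skip header lines
        {
            n = split($NF, pairs, ";")     # Extra is assumed to be the last column
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
                    print $1 "\t" substr(pairs[i], eq + 1)
            }
        }
    ' "$2"
}
```

For example, extra_value myFeatures output.txt lists each variant that overlaps a feature from the hypothetical myFeatures file, one line per output row.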

VEP can use transcript annotations defined in GFF or GTF files, but it requires a FASTA file containing the genomic sequence in order to generate transcript models. Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines, so it is safe to remove them.

Annotations from GFF/GTF files are distinguished by the SOURCE field in the VEP output.
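The sorting and header removal described above can be sketched as follows. This assumes bgzip and tabix from htslib are installed; the function names and file names are placeholders.

```shell
# Drop header/comment lines and sort by chromosome, then start, then end.
sort_gff() {
    grep -v '^#' "$1" | sort -t "$(printf '\t')" -k1,1 -k4,4n -k5,5n
}

# Compress and index the sorted file (requires bgzip and tabix from htslib).
prepare_gff() {
    sort_gff "$1" | bgzip -c > "$1.gz" && tabix -p gff "$1.gz"
}
```

After prepare_gff data.gff, releases that provide the --gff and --fasta options can be run along the lines of vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz (use --gtf for a GTF file; the gff preset of tabix should work for both formats).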

VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Because the GFF specification is inconsistent and not always adhered to, VEP may encounter problems parsing some GFF files. Not all transcript biotypes defined in your GFF may be supported by VEP. Lines with unsupported entity types are ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded.

Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF. Unlinked entities (i.e. those with no parents or children) are discarded. Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.

In GTF files, entities are linked by an attribute named for the parent entity type, e.g. exon is linked to transcript by transcript_id, and transcript is linked to gene by gene_id.
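Before handing a GFF to VEP it can help to look for broken Parent links. The check below is a rough pre-flight sketch, not VEP's own parser: it assumes GFF3-style key=value attributes separated by semicolons, and the function name orphan_features is made up. It reports features whose Parent (or parent) value names an ID that does not occur anywhere in the file.

```shell
# List features whose Parent (or parent) attribute names an ID that is absent from the file.
# $1 = GFF file; output is "<missing parent ID><TAB><feature type><TAB><start>"
orphan_features() {
    awk -F '\t' '
        # Return the value of attribute "key" from a GFF3 attributes string, or "".
        function attr(s, key,    n, parts, i, eq) {
            n = split(s, parts, ";")
            for (i = 1; i <= n; i++) {
                eq = index(parts[i], "=")
                if (eq > 0 && substr(parts[i], 1, eq - 1) == key)
                    return substr(parts[i], eq + 1)
            }
            return ""
        }
        /^#/ || NF < 9 { next }
        # First pass over the file: remember every ID.
        NR == FNR { id = attr($9, "ID"); if (id != "") seen[id] = 1; next }
        # Second pass: report each parent reference that was never seen as an ID.
        {
            p = attr($9, "Parent"); if (p == "") p = attr($9, "parent")
            m = split(p, parents, ",")
            for (j = 1; j <= m; j++)
                if (!(parents[j] in seen)) print parents[j] "\t" $3 "\t" $4
        }
    ' "$1" "$1"
}
```

An empty result from orphan_features data.gff means every Parent reference resolves; it does not guarantee that VEP will accept every transcript model.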

The following GTF entity types will be parsed by VEP:

  • cds (or CDS)
  • stop_codon
  • exon
  • gene
  • transcript

Transcripts require a Sequence Ontology biotype to be defined in order to be parsed by VEP. The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for VEP to be able to parse GFF files from NCBI and other sources.
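A quick way to spot transcripts without a "biotype" attribute is sketched below. It assumes GFF3-style key=value attributes, and the function name missing_biotype is made up. A hit is not necessarily an error, since other configurations are supported, but it shows where to look if transcripts go missing.

```shell
# Print the attributes column of every feature of type $2 (default: transcript)
# in GFF file $1 that carries no biotype attribute.
missing_biotype() {
    awk -F '\t' -v type="${2:-transcript}" '
        /^#/ || NF < 9 { next }
        $3 == type && $9 !~ /(^|;)biotype=/ { print $9 }
    ' "$1"
}
```

For files that use a different feature type for transcripts, pass it as the second argument, e.g. missing_biotype data.gff mRNA.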
