ANNOVAR

From Dave's wiki

ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected in diverse genomes (including the human genome builds hg18, hg19 and hg38, as well as mouse, worm, fly, yeast and many others). ANNOVAR can perform:

  1. Gene-based annotation - identify whether variants cause protein coding changes
  2. Region-based annotation - identify whether variants occur in specific genomic regions
  3. Filter-based annotation - identify whether variants are documented in specific databases
  4. Other functionalities - retrieve, in batch, the nucleotide sequence at any user-specified genomic positions

http://annovar.openbioinformatics.org/en/latest/user-guide/input/

Downloading

http://annovar.openbioinformatics.org/en/latest/user-guide/download/ -> http://www.openbioinformatics.org/annovar/annovar_download_form.php

tar -xzf annovar.latest.tar.gz
cd annovar

Many of the databases that ANNOVAR uses can be retrieved directly from the UCSC Genome Browser Annotation Database using the -downdb argument.

Quick start

For beginners, the easiest way to use ANNOVAR is through the table_annovar.pl program. First, download the appropriate database files using annotate_variation.pl:

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
annotate_variation.pl -buildver hg19 -downdb genomicSuperDups humandb/ 
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp6500siv2_all humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2014oct humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp138 humandb/ 
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb26_all humandb/
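The downloads above differ only in the database name and in whether -webfrom annovar is given. A minimal sketch of generating them from a list is shown below; the helper name downdb_cmd and the "name:source" notation are made up for this example and are not part of ANNOVAR. The sketch only prints the commands so that they can be reviewed before anything is downloaded.

```shell
#!/bin/sh
# Hypothetical helper: print the download command for one database.
# $1 = database name
# $2 = "annovar" to add -webfrom annovar, anything else to omit it
downdb_cmd() {
    if [ "$2" = "annovar" ]; then
        echo "annotate_variation.pl -buildver hg19 -downdb -webfrom annovar $1 humandb/"
    else
        echo "annotate_variation.pl -buildver hg19 -downdb $1 humandb/"
    fi
}

# Same databases and sources as in the commands above,
# written as name:source pairs (illustrative notation only).
downdb_plan=$(
    for entry in refGene:annovar cytoBand:ucsc genomicSuperDups:ucsc \
                 esp6500siv2_all:annovar 1000g2014oct:annovar \
                 snp138:annovar ljb26_all:annovar; do
        downdb_cmd "${entry%%:*}" "${entry##*:}"
    done
)

# Print the plan; nothing is downloaded at this point.
printf '%s\n' "$downdb_plan"
```

Once the printed commands look right, they could be run by piping the plan to a shell, for example printf '%s\n' "$downdb_plan" | sh.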

Running ANNOVAR on the test data:

table_annovar.pl example/ex1.avinput humandb/ \
-buildver hg19 \
-out myanno \
-remove \
-protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all \
-operation g,r,r,f,f,f,f,f,f,f \
-nastring . \
-csvout

On the test VCF file:

table_annovar.pl example/ex2.vcf humandb/ \
-buildver hg19 \
-out myanno \
-remove \
-protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all \
-operation g,r,r,f,f,f,f,f,f,f \
-nastring . \
-vcfinput
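In both commands, -protocol and -operation are parallel comma-separated lists: the nth operation (g for gene-based, r for region-based, f for filter-based) is applied to the nth protocol, so the two lists need the same number of entries. The following is a small sanity check that could be run before launching a long annotation; the function name count_fields is just for this sketch.

```shell
#!/bin/sh
# The two lists used in the table_annovar.pl commands above.
protocol="refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all"
operation="g,r,r,f,f,f,f,f,f,f"

# Hypothetical helper: count the comma-separated entries in a string.
count_fields() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

n_protocol=$(count_fields "$protocol")
n_operation=$(count_fields "$operation")

# Warn if the lists are out of step.
if [ "$n_protocol" -eq "$n_operation" ]; then
    echo "OK: $n_protocol protocols and $n_operation operations"
else
    echo "Mismatch: $n_protocol protocols but $n_operation operations" >&2
fi
```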

http://annovar.openbioinformatics.org/en/latest/user-guide/startup/

annotate_variation.pl

The annotate_variation.pl program is the core program in ANNOVAR. It requires a simple text-based format, the ANNOVAR input format, in which each line corresponds to one variant.

Prepare input files

The ANNOVAR input format is a space- or tab-delimited file in which each line corresponds to one variant. The first five columns are the chromosome, start position, end position, reference nucleotide(s) and observed nucleotide(s). Additional columns can be supplied and will be printed out unchanged. A VCF file can be converted to this format with convert2annovar.pl:

convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 test.vcf > test.avinput
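For a handful of variants the input can also be written by hand. The sketch below builds three records (copied from the example output further down this page) and runs a basic check on them: at least five columns, and a start position no greater than the end position. The check is my own minimal validation, not something ANNOVAR provides.

```shell
#!/bin/sh
# Three variants in ANNOVAR input format: chromosome, start, end, ref, obs.
# A SNV, a 2-bp deletion and an insertion; "-" stands for the absent allele.
avinput=$(printf '%s\n' \
    '1 948921 948921 T C' \
    '1 13211293 13211294 TC -' \
    '1 11403596 11403596 - AT')

# Collect the line numbers of records that fail the basic checks.
bad_lines=$(printf '%s\n' "$avinput" |
    awk 'NF < 5 || ($2 + 0) > ($3 + 0) { print NR }')

if [ -z "$bad_lines" ]; then
    echo "all records look well-formed"
else
    echo "suspicious records on lines: $bad_lines" >&2
fi
```

Redirecting the records to a file, for example printf '%s\n' "$avinput" > demo.avinput (demo.avinput being an arbitrary name), gives something that can be passed to annotate_variation.pl.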

Downloading annotation files

See http://annovar.openbioinformatics.org/en/latest/user-guide/download/

Gene-based annotations:

# whole genome FASTA files
annotate_variation.pl -downdb -buildver hg19 seq humandb/hg19_seq/
# RefSeq
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/
# UCSC known gene
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar knownGene humandb/
# Ensembl gene
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ensGene humandb/

Region-based annotations:

annotate_variation.pl -buildver hg19 -downdb phastConsElements46way humandb/
annotate_variation.pl -buildver hg19 -downdb tfbsConsSites humandb/
annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
annotate_variation.pl -buildver hg19 -downdb wgRna humandb/
annotate_variation.pl -buildver hg19 -downdb targetScanS humandb/
annotate_variation.pl -buildver hg19 -downdb genomicSuperDups humandb/
annotate_variation.pl -buildver hg19 -downdb dgvMerged humandb/
annotate_variation.pl -buildver hg19 -downdb gwasCatalog humandb/
annotate_variation.pl -downdb wgEncodeCaltechRnaSeqRawSignalRep1Gm12878CellLongpolyaBb12x75 humandb/
annotate_variation.pl -downdb wgEncodeBroadChipSeqPeaksGm12878H3k4me1 humandb/
annotate_variation.pl -downdb wgEncodeRegDnaseClustered humandb/
annotate_variation.pl -downdb wgEncodeRegTfbsClustered humandb/

Filter-based annotations:

# 1000 Genomes Project (2015 Aug) allele frequencies
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 1000g2015aug humandb/
# dbSNP build 138
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 snp138 humandb/
# dbNSFP version 3.0a (functional predictions for non-synonymous variants)
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 dbnsfp30a humandb/
# NHLBI ESP6500 exome allele frequencies (all populations)
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 esp6500siv2_all humandb/
# ExAC version 0.3 exome allele frequencies
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 exac03 humandb/
# GERP++ conservation scores greater than 2
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 gerp++gt2 humandb/
# maximum allele frequency across several population databases
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 popfreq_max_20150413 humandb/
# ClinVar (2015-06-29 release)
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 clinvar_20150629 humandb/

Usage

Below are some basic usage examples on the example data; I've removed the comments from the output because they don't fit in the code block on my small screen.

# annotates the variants in the ex1.avinput file and
# classifies them as intergenic, intronic, non-synonymous SNP,
# frameshift deletion, large-scale duplication, etc.
annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
cat ex1.avinput.variant_function 
UTR5    ISG15(NM_005101:c.-33T>C)       1       948921  948921  T       C
UTR3    ATAD3C(NM_001039211:c.*91G>T)   1       1404001 1404001 G       T
splicing        NPHP4(NM_001291593:exon18:c.1279-2T>A,NM_001291594:exon17:c.1282-2T>A,NM_015102:exon21:c.2818-2T>A)     1       5935162 5935162 A       T
intronic        DDR2    1       162736463       162736463       C       T
intronic        DNASE2B 1       84875173        84875173        C       T
intergenic      LOC645354(dist=11566),LOC391003(dist=116902)    1       13211293        13211294        TC      -
intergenic      UBIAD1(dist=55105),PTCHD2(dist=135699)  1       11403596        11403596        -       AT
intergenic      LOC100129138(dist=872538),LOC101928476(dist=640085)     1       105492231       105492231       A       ATAAA
exonic  IL23R   1       67705958        67705958        G       A
exonic  ATG16L1 2       234183368       234183368       A       G
exonic  NOD2    16      50745926        50745926        C       T
exonic  NOD2    16      50756540        50756540        G       C
exonic  NOD2    16      50763778        50763778        -       C
exonic  GJB2    13      20763686        20763686        G       -
exonic  CRYL1,GJB6      13      20797176        21105944        0       -
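The first column of the .variant_function file is the functional class, so a quick summary of a call set can be made by tallying that column. The sketch below does this with awk on a few lines copied from the output above; count_classes is a name chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper: tally the first column (functional class) of a
# .variant_function file read from standard input.
count_classes() {
    awk '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' | sort
}

# Five records copied from the output above.
class_counts=$(count_classes <<'EOF'
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C
intronic DDR2 1 162736463 162736463 C T
intronic DNASE2B 1 84875173 84875173 C T
exonic IL23R 1 67705958 67705958 G A
exonic NOD2 16 50745926 50745926 C T
EOF
)

# One line per class: class name, a tab, then the number of variants.
printf '%s\n' "$class_counts"
```

On a real run the same function can read the file directly, for example count_classes < ex1.avinput.variant_function.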

# identifies the cytogenetic band for these variants
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
cat ex1.avinput.hg19_cytoBand 
cytoBand        1p36.33 1       948921  948921  T       C
cytoBand        1p36.33 1       1404001 1404001 G       T
cytoBand        1p36.31 1       5935162 5935162 A       T
cytoBand        1q23.3  1       162736463       162736463       C       T
cytoBand        1p31.1  1       84875173        84875173        C       T
cytoBand        1p36.21 1       13211293        13211294        TC      -
cytoBand        1p36.22 1       11403596        11403596        -       AT
cytoBand        1p21.1  1       105492231       105492231       A       ATAAA
cytoBand        1p31.3  1       67705958        67705958        G       A
cytoBand        2q37.1  2       234183368       234183368       A       G
cytoBand        16q12.1 16      50745926        50745926        C       T
cytoBand        16q12.1 16      50756540        50756540        G       C
cytoBand        16q12.1 16      50763778        50763778        -       C
cytoBand        13q12.11        13      20763686        20763686        G       -
cytoBand        13q12.11        13      20797176        21105944        0       -

# identifies a subset of variants that are not observed in 1000G 
annotate_variation.pl -filter -dbtype 1000g2015aug_all -buildver hg19 example/ex1.avinput humandb/
cat ex1.avinput.hg19_ALL.sites.2015_08_filtered 
1       11403596        11403596        -       AT
1       105492231       105492231       A       ATAAA
13      20797176        21105944        0       -
1       13211293        13211294        TC      -

# these are "dropped", i.e. removed, because they already exist in the 1000 Genomes Project
cat ex1.avinput.hg19_ALL.sites.2015_08_dropped 
1000g2015aug_all        0.0676917       1       1404001 1404001 G       T
1000g2015aug_all        0.620607        1       162736463       162736463       C       T
1000g2015aug_all        0.843251        1       5935162 5935162 A       T
1000g2015aug_all        0.0227636       1       67705958        67705958        G       A
1000g2015aug_all        0.548922        1       84875173        84875173        C       T
1000g2015aug_all        0.903155        1       948921  948921  T       C
1000g2015aug_all        0.00239617      13      20763686        20763686        G       -
1000g2015aug_all        0.014377        16      50745926        50745926        C       T
1000g2015aug_all        0.00459265      16      50756540        50756540        G       C
1000g2015aug_all        0.00599042      16      50763778        50763778        -       C
1000g2015aug_all        0.395966        2       234183368       234183368       A       G
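The second column of the _dropped file holds the 1000 Genomes allele frequency, so known variants can be narrowed down further to rare ones. The following sketch keeps records below a frequency cut-off; the 0.01 threshold is an arbitrary choice for illustration, and the four input lines are copied from the output above.

```shell
#!/bin/sh
# Assumed cut-off for "rare"; adjust to taste.
max_freq=0.01

# Keep records whose allele frequency (column 2) is below the cut-off.
rare_variants=$(awk -v max="$max_freq" '($2 + 0) < (max + 0)' <<'EOF'
1000g2015aug_all 0.0676917 1 1404001 1404001 G T
1000g2015aug_all 0.00239617 13 20763686 20763686 G -
1000g2015aug_all 0.014377 16 50745926 50745926 C T
1000g2015aug_all 0.00459265 16 50756540 50756540 G C
EOF
)

printf '%s\n' "$rare_variants"
```

The same awk expression can be pointed at the full ex1.avinput.hg19_ALL.sites.2015_08_dropped file instead of the here-document.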

Available databases

Check out http://annovar.openbioinformatics.org/en/latest/user-guide/download/

Gene annotation

cat test.vcf
##fileformat=VCFv4.0
##reference=hg19
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Bob
11      5246715 rs1     T       C       99      PASS    AA=T    GT      1/1
11      5246954 rs2     G       A       99      PASS    AA=G    GT      0/0
11      5246955 rs3     A       G       99      PASS    AA=A    GT      0/0
11      5246956 rs4     G       A       99      PASS    AA=G    GT      0/0
11      5246957 rs5     C       A       99      PASS    AA=C    GT      1/1
11      5246958 rs6     T       C       99      PASS    AA=T    GT      0/0
11      5246959 rs7     G       A       99      PASS    AA=G    GT      0/0
11      5247841 rs8     C       T       99      PASS    AA=C    GT      0/0
11      5248075 rs9     A       G       99      PASS    AA=A    GT      0/1
11      5248266 rs10    G       T       99      PASS    AA=C    GT      0/0

convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 test.vcf > test.avinput
annotate_variation.pl -geneanno -buildver hg19 test.avinput humandb/

cat test.avinput.variant_function 
UTR3    HBB(NM_000518:c.*113A>G)        11      5246715 5246715 T       C       1       99      .       11      5246715 rs1     T       C       99      PASS    AA=T    GT      1/1
exonic  HBB     11      5246954 5246954 G       A       0       99      .       11      5246954 rs2     G       A       99      PASS    AA=G    GT      0/0
exonic  HBB     11      5246955 5246955 A       G       0       99      .       11      5246955 rs3     A       G       99      PASS    AA=A    GT      0/0
exonic  HBB     11      5246956 5246956 G       A       0       99      .       11      5246956 rs4     G       A       99      PASS    AA=G    GT      0/0
splicing        HBB(NM_000518:exon3:c.316-1G>T) 11      5246957 5246957 C       A       1       99      .       11      5246957 rs5     C       A       99      PASS    AA=C    GT      1/1
splicing        HBB(NM_000518:exon3:c.316-2A>G) 11      5246958 5246958 T       C       0       99      .       11      5246958 rs6     T       C       99      PASS    AA=T    GT      0/0
intronic        HBB     11      5246959 5246959 G       A       0       99      .       11      5246959 rs7     G       A       99      PASS    AA=G    GT      0/0
exonic  HBB     11      5247841 5247841 C       T       0       99      .       11      5247841 rs8     C       T       99      PASS    AA=C    GT      0/0
intronic        HBB     11      5248075 5248075 A       G       0.5     99      .       11      5248075 rs9     A       G       99      PASS    AA=A    GT      0/1
UTR5    HBB(NM_000518:c.-15C>A) 11      5248266 5248266 G       T       0       99      .       11      5248266 rs10    G       T       99      PASS    AA=C    GT      0/0
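In the output above, the column that follows the observed allele (1, 0 or 0.5) appears to be the alternative allele frequency added by -withfreq: with the single sample Bob, a 1/1 genotype gives 1, 0/1 gives 0.5 and 0/0 gives 0. As a rough illustration of that arithmetic for one genotype string (gt_alt_freq is a hypothetical helper written for this page, not ANNOVAR code):

```shell
#!/bin/sh
# Hypothetical helper: fraction of non-reference alleles in one VCF
# genotype string such as 0/1 or 1|1. Missing alleles (.) are skipped.
gt_alt_freq() {
    printf '%s\n' "$1" | awk -F'[/|]' '{
        alt = 0; called = 0
        for (i = 1; i <= NF; i++) {
            if ($i == ".") continue      # uncalled allele
            called++
            if ($i != "0") alt++         # anything but 0 is non-reference
        }
        if (called == 0) print "."; else print alt / called
    }'
}

# The three genotypes that occur in test.vcf.
for gt in 1/1 0/1 0/0; do
    printf '%s\t%s\n' "$gt" "$(gt_alt_freq "$gt")"
done
```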

cat test.avinput.exonic_variant_function 
line2   synonymous SNV  HBB:NM_000518:exon3:c.C318T:p.L106L,    11      5246954 5246954 G       A       0       99      .       11      5246954 rs2     G       A       99      PASS    AA=G    GT      0/0
line3   nonsynonymous SNV       HBB:NM_000518:exon3:c.T317C:p.L106P,    11      5246955 5246955 A       G       0       99      .       11      5246955 rs3     A       G       99      PASS    AA=A    GT      0/0
line4   nonsynonymous SNV       HBB:NM_000518:exon3:c.C316T:p.L106F,    11      5246956 5246956 G       A       0       99      .       11      5246956 rs4     G       A       99      PASS    AA=G    GT      0/0
line8   nonsynonymous SNV       HBB:NM_000518:exon2:c.G281A:p.C94Y,     11      5247841 5247841 C       T       0       99      .       11      5247841 rs8     C       T       99      PASS    AA=C    GT      0/0
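The third column of the .exonic_variant_function file packs gene, transcript, exon, cDNA change and protein change into one colon-separated string. A sketch of pulling out the gene and protein change for non-synonymous variants is given below. It assumes the file is tab-delimited (the consequence column contains a space) and looks only at the first transcript listed; the sample records are rebuilt from the output above.

```shell
#!/bin/sh
# Three records from the output above, rebuilt with explicit tabs
# (only the first three columns are needed here).
exonic_sample=$(printf '%s\t%s\t%s\n' \
    'line2' 'synonymous SNV' 'HBB:NM_000518:exon3:c.C318T:p.L106L,' \
    'line3' 'nonsynonymous SNV' 'HBB:NM_000518:exon3:c.T317C:p.L106P,' \
    'line8' 'nonsynonymous SNV' 'HBB:NM_000518:exon2:c.G281A:p.C94Y,')

# For non-synonymous SNVs, print the gene and the protein change of the
# first transcript in column 3.
nonsyn_changes=$(printf '%s\n' "$exonic_sample" | awk -F'\t' '
    $2 == "nonsynonymous SNV" {
        split($3, transcripts, ",")        # transcripts are comma-separated
        split(transcripts[1], parts, ":")  # gene:transcript:exon:cDNA:protein
        print parts[1] "\t" parts[5]
    }')

printf '%s\n' "$nonsyn_changes"
```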

Note that an exonic variant within 2 bp of an exon-intron junction is only listed as an exonic variant. Use the -exonicsplicing parameter to annotate exonic variants close to exon-intron junctions; rerunning the gene annotation with it reports them as exonic;splicing:

cat test.avinput.variant_function 
UTR3    HBB(NM_000518:c.*113A>G)        11      5246715 5246715 T       C       1       99      .       11      5246715 rs1     T       C       99      PASS    AA=T    GT      1/1
exonic  HBB     11      5246954 5246954 G       A       0       99      .       11      5246954 rs2     G       A       99      PASS    AA=G    GT      0/0
exonic;splicing HBB;HBB 11      5246955 5246955 A       G       0       99      .       11      5246955 rs3     A       G       99      PASS    AA=A    GT      0/0
exonic;splicing HBB;HBB 11      5246956 5246956 G       A       0       99      .       11      5246956 rs4     G       A       99      PASS    AA=G    GT      0/0
splicing        HBB(NM_000518:exon3:c.316-1G>T) 11      5246957 5246957 C       A       1       99      .       11      5246957 rs5     C       A       99      PASS    AA=C    GT      1/1
splicing        HBB(NM_000518:exon3:c.316-2A>G) 11      5246958 5246958 T       C       0       99      .       11      5246958 rs6     T       C       99      PASS    AA=T    GT      0/0
intronic        HBB     11      5246959 5246959 G       A       0       99      .       11      5246959 rs7     G       A       99      PASS    AA=G    GT      0/0
exonic  HBB     11      5247841 5247841 C       T       0       99      .       11      5247841 rs8     C       T       99      PASS    AA=C    GT      0/0
intronic        HBB     11      5248075 5248075 A       G       0.5     99      .       11      5248075 rs9     A       G       99      PASS    AA=A    GT      0/1
UTR5    HBB(NM_000518:c.-15C>A) 11      5248266 5248266 G       T       0       99      .       11      5248266 rs10    G       T       99      PASS    AA=C    GT      0/0
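Since the class column can now hold several values joined by a semicolon, a plain match on "splicing" at the start of the line would miss the exonic;splicing records. One possible way to list every variant with a splicing annotation is sketched here, using four records copied from the output above:

```shell
#!/bin/sh
# Select records whose first column contains "splicing" as a whole
# semicolon-separated term, and print chromosome:position plus the class.
splice_hits=$(awk '$1 ~ /(^|;)splicing(;|$)/ { print $3 ":" $4 "\t" $1 }' <<'EOF'
exonic HBB 11 5246954 5246954 G A
exonic;splicing HBB;HBB 11 5246955 5246955 A G
splicing HBB(NM_000518:exon3:c.316-1G>T) 11 5246957 5246957 C A
intronic HBB 11 5246959 5246959 G A
EOF
)

printf '%s\n' "$splice_hits"
```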

Further reading

Why Are There More Non-Synonymous SNPs Than Synonymous SNPs In The 1000 Genomes Data? https://www.biostars.org/p/48604/

Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR http://www.nature.com/nprot/journal/v10/n10/abs/nprot.2015.105.html