Exomiser

The Exomiser is a Java program that functionally annotates variants from whole-exome sequencing data starting from a VCF file (version 4). The functional annotation code is based on Jannovar and uses UCSC KnownGene transcript definitions and hg19 genomic coordinates.

http://www.sanger.ac.uk/science/tools/exomiser

Getting started

# version 7
wget -c ftp://ftp.sanger.ac.uk/pub/resources/software/exomiser/downloads/exomiser/exomiser-cli-7.2.1.sha256
wget -c ftp://ftp.sanger.ac.uk/pub/resources/software/exomiser/downloads/exomiser/exomiser-cli-7.2.1-distribution.zip
wget -c ftp://ftp.sanger.ac.uk/pub/resources/software/exomiser/downloads/exomiser/exomiser-cli-7.2.1-data.zip

# check
sha256sum -c exomiser-cli-7.2.1.sha256
exomiser-cli-7.2.1-distribution.zip: OK
exomiser-cli-7.2.1-data.zip: OK

unzip exomiser-cli-7.2.1-distribution.zip
unzip exomiser-cli-7.2.1-data.zip

# test run
java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --analysis NA19722_601952_AUTOSOMAL_RECESSIVE_POMP_13_29233225_5UTR_38.yml
ls -1 results/
NA19722_601952_AUTOSOMAL_RECESSIVE_POMP_13_29233225_5UTR_38.genes.tsv
NA19722_601952_AUTOSOMAL_RECESSIVE_POMP_13_29233225_5UTR_38.html
NA19722_601952_AUTOSOMAL_RECESSIVE_POMP_13_29233225_5UTR_38.variants.tsv
NA19722_601952_AUTOSOMAL_RECESSIVE_POMP_13_29233225_5UTR_38.vcf

# version 6
wget ftp://ftp.sanger.ac.uk/pub/resources/software/exomiser/downloads/exomiser/exomiser-cli-6.0.0-distribution.zip
wget ftp://ftp.sanger.ac.uk/pub/resources/software/exomiser/downloads/exomiser/h2_db_dumps/exomiser-6.0.1.h2.db.gz
unzip exomiser-cli-6.0.0-distribution.zip
gunzip exomiser-6.0.1.h2.db.gz
mv exomiser-6.0.1.h2.db exomiser-cli-6.0.0/data/exomiser.h2.db
cd exomiser-cli-6.0.0
java -jar exomiser-cli-6.0.0.jar --help > help.txt

Suggested workflow

Figure (exomiser.png): suggested workflow, from http://www.ncbi.nlm.nih.gov/pubmed/26562621

Help

java -jar exomiser-cli-7.2.1.jar --help

  Welcome to:               
  _____ _            _____                     _               
 |_   _| |__   ___  | ____|_  _____  _ __ ___ (_)___  ___ _ __ 
   | | | '_ \ / _ \ |  _| \ \/ / _ \| '_ ` _ \| / __|/ _ \ '__|
   | | | | | |  __/ | |___ >  < (_) | | | | | | \__ \  __/ |   
   |_| |_| |_|\___| |_____/_/\_\___/|_| |_| |_|_|___/\___|_|   
                                                               
 A Tool to Annotate and Prioritize Exome Variants     v7.2.1

usage: java -jar exomizer-cli-7.2.1.jar [...]
    --analysis <file>                          Path to analysis script
                                               file. This should be in
                                               yaml format.
    --analysis-batch <file>                    Path to analysis batch
                                               file. This should be in
                                               plain text file with the
                                               path to a single analys
                                               script file in yaml format
                                               on each line.
    --batch-file <file>                        Path to batch file. This
                                               should contain a list of
                                               fully qualified path names
                                               for the settings files you
                                               wish to process. There
                                               should be one file name on
                                               each line.
    --candidate-gene <arg>                     Gene symbol of known or
                                               suspected gene association
                                               e.g. FGFR2
 -D,--disease-id <arg>                         OMIM ID for disease being
                                               sequenced. e.g. OMIM:101600
 -E,--hiphive-params <type>                    Comma separated list of
                                               optional parameters for
                                               hiphive: human, mouse,
                                               fish, ppi. e.g.
                                               --hiphive-params=human or
                                               --hiphive-params=human,mous
                                               e,ppi
 -F,--max-freq <arg>                           Maximum frequency threshold
                                               for variants to be
                                               retained. e.g. 100.00 will
                                               retain all variants.
                                               Default: 100.00
 -f,--out-format <type>                        Comma separated list of
                                               format options: HTML, VCF,
                                               TAB-GENE or TAB-VARIANT,.
                                               Defaults to HTML if not
                                               specified. e.g.
                                               --out-format=TAB-VARIANT or
                                               --out-format=TAB-GENE,TAB-V
                                               ARIANT,HTML,VCF
    --full-analysis <true/false>               Run the analysis such that
                                               all variants are run
                                               through all filters. This
                                               will take longer, but give
                                               more complete results.
                                               Default is false
    --genes-to-keep <Entrez geneId>            Comma separated list of
                                               seed genes (Entrez gene
                                               IDs) for filtering
 -H,--help                                     Shows this help
 -h,--help                                     Shows this help
    --hpo-ids <HPO ID>                         Comma separated list of HPO
                                               IDs for the sample being
                                               sequenced e.g.
                                               HP:0000407,HP:0009830,HP:00
                                               02858
 -I,--inheritance-mode <arg>                   Filter variants for
                                               inheritance pattern (AR,
                                               AD, X)
    --num-genes <arg>                          Number of genes to show in
                                               output
 -o,--out-prefix <arg>                         Out file prefix. Will
                                               default to
                                               vcf-filename-exomiser-resul
                                               ts
    --output-pass-variants-only <true/false>   Only write out PASS
                                               variants in TSV and VCF
                                               files.
 -P,--keep-non-pathogenic <true/false>         Keep the predicted
                                               non-pathogenic variants
                                               that are normally removed
                                               by default. These are
                                               defined as syonymous,
                                               intergenic, intronic,
                                               upstream, downstream or
                                               intronic ncRNA variants.
                                               This setting can optionally
                                               take a true/false argument.
                                               Not including the argument
                                               is equivalent to specifying
                                               'false'.
 -p,--ped <file>                               Path to pedigree (ped)
                                               file. Required if the vcf
                                               file is for a family.
    --prioritiser <name>                       Name of the prioritiser
                                               used to score the genes.
                                               Can be one of:
 -Q,--min-qual <arg>                           Mimimum quality threshold
                                               for variants as specifed in
                                               VCF 'QUAL' column.
                                               Default: 0
 -R,--restrict-interval <arg>                  Restrict to region/interval
                                               (e.g., chr2:12345-67890)
    --remove-known-variants <true/false>       Filter out all variants
                                               with an entry in
                                               dbSNP/ESP/ExAC (regardless
                                               of frequency).
 -S,--seed-genes <Entrez geneId>               Comma separated list of
                                               seed genes (Entrez gene
                                               IDs) for random walk
    --settings-file <file>                     Path to settings file. Any
                                               settings specified in the
                                               file will be overidden by
                                               parameters added on the
                                               command-line.
 -T,--keep-off-target <true/false>             Keep the off-target
                                               variants that are normally
                                               removed by default. These
                                               are defined as intergenic,
                                               intronic, upstream,
                                               downstream or intronic
                                               ncRNA variants. This
                                               setting can optionally take
                                               a true/false argument. Not
                                               including the argument is
                                               equivalent to specifying
                                               'true'.
 -v,--vcf <file>                               Path to VCF file with
                                               mutations to be analyzed.
                                               Can be either for an
                                               individual or a family.
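
According to the help text above, --analysis-batch takes a plain text file with the path to one analysis YAML file on each line. A minimal sketch of building such a file is shown below. The function name make_analysis_batch and the directory and file names are hypothetical, and writing absolute paths is an assumption made for safety (the help text only asks for fully qualified paths for --batch-file).

```shell
# Hypothetical helper: write the absolute path of every *.yml file found
# directly inside a directory to a batch file, one path per line.
make_analysis_batch() {
    local analysis_dir=$1 batch_file=$2
    # Resolve the directory to an absolute path so the batch file can be
    # used from any working directory; sort for a stable order.
    find "$(cd "$analysis_dir" && pwd)" -maxdepth 1 -type f -name '*.yml' \
        | sort > "$batch_file"
}

# Example usage (placeholder names):
#   make_analysis_batch analyses batch.txt
#   java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --analysis-batch batch.txt
```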

Usage

(a) Exomiser hiPHIVE algorithm - phenotype comparisons to human, mouse and fish involving disruption of the gene itself, or of nearby genes in the interactome found using a random walk

java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --prioritiser=hiphive -I AD -F 1 -D OMIM:101600 -v data/Pfeiffer.vcf

# the HPO list must reach the program as one argument, so there are no spaces after the commas or before the line continuations
java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --prioritiser=hiphive -I AD -F 1 --hpo-ids \
HP:0000006,HP:0000174,HP:0000194,HP:0000218,HP:0000238,HP:0000244,HP:0000272,HP:0000303,HP:0000316,\
HP:0000322,HP:0000324,HP:0000327,HP:0000348,HP:0000431,HP:0000452,HP:0000453,HP:0000470,HP:0000486,\
HP:0000494,HP:0000508,HP:0000586,HP:0000678,HP:0001156,HP:0001249,HP:0002308,HP:0002676,HP:0002780,\
HP:0003041,HP:0003070,HP:0003196,HP:0003272,HP:0003307,HP:0003795,HP:0004209,HP:0004322,HP:0004440,\
HP:0005048,HP:0005280,HP:0005347,HP:0006101,HP:0006110,HP:0009602,HP:0009773,HP:0010055,HP:0010669,\
HP:0011304 -v data/Pfeiffer.vcf

(b) Exomiser PHIVE algorithm - phenotype comparisons to mice with disruption of the gene

java -Xmx2g -jar exomiser-cli-7.2.1.jar --prioritiser=phive -I AD -F 1 -D OMIM:101600 -v data/Pfeiffer.vcf

(c) Exomiser Phenix algorithm - phenotype comparisons to known human disease genes

# the HPO list must reach the program as one argument, so there are no spaces after the commas or before the line continuations
java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --prioritiser=phenix -v data/Pfeiffer.vcf -I AD -F 1 --hpo-ids \
HP:0000006,HP:0000174,HP:0000194,HP:0000218,HP:0000238,HP:0000244,HP:0000272,HP:0000303,HP:0000316,\
HP:0000322,HP:0000324,HP:0000327,HP:0000348,HP:0000431,HP:0000452,HP:0000453,HP:0000470,HP:0000486,\
HP:0000494,HP:0000508,HP:0000586,HP:0000678,HP:0001156,HP:0001249,HP:0002308,HP:0002676,HP:0002780,\
HP:0003041,HP:0003070,HP:0003196,HP:0003272,HP:0003307,HP:0003795,HP:0004209,HP:0004322,HP:0004440,\
HP:0005048,HP:0005280,HP:0005347,HP:0006101,HP:0006110,HP:0009602,HP:0009773,HP:0010055,HP:0010669,\
HP:0011304

(d) Exomiser ExomeWalker algorithm - prioritisation by proximity in interactome to the seed genes

java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --prioritiser exomewalker -v data/Pfeiffer.vcf -I AD -F 1 -S 2260
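
Long --hpo-ids lists like the ones used for hiPHIVE and Phenix are easy to break with a stray space, because the shell then splits the list into several arguments. One way to avoid this is to keep the IDs in a text file, one per line, and join them on the fly. The sketch below assumes such a file exists; the function name join_hpo_ids and the file name hpo_ids.txt are hypothetical.

```shell
# Hypothetical helper: join HPO IDs stored one per line into a single
# comma-separated string. Spaces, tabs and carriage returns are stripped
# and blank lines are skipped.
join_hpo_ids() {
    tr -d ' \t\r' < "$1" | grep -v '^$' | paste -s -d, -
}

# Example usage (placeholder file name):
#   hpo=$(join_hpo_ids hpo_ids.txt)
#   java -Xms2g -Xmx4g -jar exomiser-cli-7.2.1.jar --prioritiser=hiphive \
#       -I AD -F 1 --hpo-ids "$hpo" -v data/Pfeiffer.vcf
```

Quoting "$hpo" keeps the whole list as one argument.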

Web tool

https://www.sanger.ac.uk/resources/software/exomiser/submit

Download test file from https://www.sanger.ac.uk/resources/software/exomiser/submit/resources/Pfeiffer.vcf

Issues

  1. A PED file is required for VCF files with multiple samples; I have a script at https://github.com/davetang/learning_vcf_file/blob/master/script/vcf_to_ped.R that produces a PED file from a VCF file
  2. For the PED file processed by The Exomiser, a zero is not allowed in the sex column (make everyone male [one] or female [two] instead)
  3. For the PED file processed by The Exomiser, a negative nine is not allowed in the phenotype column (use a zero instead)
  4. If you run The Exomiser in a directory that doesn't contain a results folder, no results will be written; create a results folder before you run your analysis
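
The workarounds above can be sketched in shell as follows. This is only an illustration: the function names count_vcf_samples and fix_ped are hypothetical, the PED file is assumed to use the usual six-column layout (family, individual, father, mother, sex, phenotype), and the VCF header is assumed to be tab-delimited with sample names starting at column 10.

```shell
# Count the samples in a VCF by looking at the #CHROM header line, where
# the first nine columns are fixed and any further columns are samples.
count_vcf_samples() {
    awk -F'\t' '/^#CHROM/ { n = NF - 9; print (n > 0 ? n : 0); exit }' "$1"
}

# Print a copy of a PED file in which a sex of 0 is replaced by the given
# value (1 = male, 2 = female; default 1) and a phenotype of -9 is
# replaced by 0. Comment lines and short lines are passed through as-is.
fix_ped() {
    local ped=$1 sex=${2:-1}
    awk -v sex="$sex" 'BEGIN { OFS = "\t" }
        /^#/ || NF < 6 { print; next }
        {
            if ($5 == "0")  $5 = sex   # sex column
            if ($6 == "-9") $6 = "0"   # phenotype column
            $1 = $1                    # force tab-separated output
            print
        }' "$ped"
}

# Example usage (placeholder file names):
#   count_vcf_samples family.vcf        # more than 1 means a PED file is needed
#   fix_ped family.ped 2 > family.fixed.ped
#   mkdir -p results                    # make sure the results folder exists
```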

Other info

https://sangerinstitute.wordpress.com/2013/11/28/the-rare-diseases-of-mice-and-men/