Exome sequencing

From Dave's wiki
Jump to navigation Jump to search

http://en.wikipedia.org/wiki/Exome_sequencing

TruSeq Exome Enrichment Kit

The TruSeq Exome Enrichment Kit is an in-solution sequence capture method for isolating exonic regions of interest in the human genome using hybrid selection[1]. The highlights include:

  • Most Comprehensive Exome Coverage:

Highly uniform coverage across 62 Mb of exomic sequence, including 5' UTR, 3' UTR, microRNA, and other non-coding RNA.

  • Most Cost-Effective Exome Sequencing:

Streamlined protocol for pre-enrichment pooling of up to six samples dramatically reduces hands-on time and cost.

  • Integrated Solution:

Optimised for use with the TruSeq DNA Sample Preparation Kit, providing a gel-free protocol that requires the lowest DNA input.

  • Simplest and Most Scalable Workflow:

Automation-friendly with master-mixed reagents and plate-based processing for up to 96 reactions.

Illumina TruSeq DNA Sample Preparation Kits

Explains how to prepare up to 96 pooled indexed paired-end libraries of genomic DNA (gDNA) for subsequent cluster generation and DNA sequencing using the reagents provided in the Illumina TruSeq DNA Sample Preparation Kits (low- throughput (LT) and high-throughput (HT)). The goal of this kit is to add adapter sequences onto the ends of DNA fragments to generate indexed single read or paired- end sequencing libraries.[2]

Capture Target Oligos

The kit includes > 340,000 95mer probes, each constructed against the human NCBI37/hg19 reference genome. The probe set was designed to enrich > 200,000 exons, spanning 20,794 genes of interest. While the sum length of these probes is 32 Mb, the kit actually targets 62 Mb of the human genome (117.5 Mb if the 150 bp regions captured upstream and downstream of target are also considered). Each 95mer probe targets libraries of 300–400 bp (insert size of 180–280 bp), enriching 265–465 bases centered symmetrically around the midpoint of the probe. This means that, in addition to comprehensive coverage of the major exon databases, the kit also provides broad coverage of non-coding DNA in exon-flanking regions (promoters and UTRs). Genome-wide association studies suggest that > 80% of disease-associated variants fall outside coding regions. Analysis of these regions enables researchers to discover variants that effect gene function, at a more affordable price than whole-genome sequencing.

BED file

Download the BED from here: https://app.box.com/s/p2m1ahb7w3xhme5rd01rx5na5lw8ch95 (see https://www.biostars.org/p/144554/)

gzip TruSeq_exome_targeted_regions.hg19.bed
zcat TruSeq_exome_targeted_regions.hg19.bed.gz | wc -l
  201071

I wrote a script to convert the BED file into an ANNOVAR input file, at a single nucleotide resolution. The script is below:

#!/usr/bin/env perl

use strict;
use warnings;
use lib '/home/san/dtang/script';
use Genome;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

my %genome = Genome::store_hg19();

if ($infile =~ /\.gz$/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

while(<IN>){
   chomp;
   my ($chr, $start, $end, @rest) = split(/\t/);
   $chr =~ s/^chr//;
   for(my $i = $start; $i < $end; ++$i){
      my $base = substr($genome{$chr}, $i, 1);
      # BED file coordinates are 0 based
      # ANNOVAR file coordinates should be 1 based
      print join("\t", $chr, $i+1, $i+1, $base, $base), "\n";
   }
}
close(IN);

exit(0);

__END__

Then to create the avinput

bed_to_annovar.pl TruSeq_exome_targeted_regions.hg19.bed.gz | gzip > TruSeq_exome_targeted_regions.hg19.avinput
zcat TruSeq_exome_targeted_regions.hg19.avinput.gz | head
1       14363   14363   C       C
1       14364   14364   C       C
1       14365   14365   T       T
1       14366   14366   G       G
1       14367   14367   C       C
1       14368   14368   A       A
1       14369   14369   C       C
1       14370   14370   A       A
1       14371   14371   G       G
1       14372   14372   C       C

# corresponds to roughly 62 M of the genome
cat TruSeq_exome_targeted_regions.hg19.avinput | wc -l
62085295

Annotate using ANNOVAR

# annotating using ANNOVAR
annotate_variation.pl --outfile annotation -geneanno -buildver hg19 TruSeq_exome_targeted_regions.hg19.avinput humandb/

head annotation.variant_function 
ncRNA_exonic    DDX11L1,WASH7P  1       14363   14363   C       C
ncRNA_exonic    DDX11L1,WASH7P  1       14364   14364   C       C
ncRNA_exonic    DDX11L1,WASH7P  1       14365   14365   T       T
ncRNA_exonic    DDX11L1,WASH7P  1       14366   14366   G       G
ncRNA_exonic    DDX11L1,WASH7P  1       14367   14367   C       C
ncRNA_exonic    DDX11L1,WASH7P  1       14368   14368   A       A
ncRNA_exonic    DDX11L1,WASH7P  1       14369   14369   C       C
ncRNA_exonic    DDX11L1,WASH7P  1       14370   14370   A       A
ncRNA_exonic    DDX11L1,WASH7P  1       14371   14371   G       G
ncRNA_exonic    DDX11L1,WASH7P  1       14372   14372   C       C

gunzip -c annotation.variant_function.gz | cut -f1 | sort | uniq -c | sort -k1rn
31990011 exonic
21915423 UTR3
3690240 UTR5
3156649 ncRNA_exonic
 885584 ncRNA_intronic
 210679 intronic
 156276 intergenic
  30989 downstream
  27935 upstream
   7827 exonic;splicing
   6143 splicing
   3236 UTR5;UTR3
   2550 upstream;downstream
   1005 ncRNA_exonic;splicing
    748 ncRNA_splicing

Note that these classifications are based on RefSeq transcripts; an intronic region according to RefSeq may be an exonic region for another transcript database.

SureSelect

Annotated as above

cat annotation_v5.variant_function | cut -f1 | sort | uniq -c | sort -k1rn
29931881 exonic
17834548 intronic
17525345 UTR3
2173091 UTR5
2055831 intergenic
1528023 ncRNA_intronic
1094249 downstream
1041816 ncRNA_exonic
 737490 upstream
 558600 splicing
  59053 upstream;downstream
  10407 UTR5;UTR3
  10055 exonic;splicing
   6674 ncRNA_splicing
   2463 ncRNA_exonic;splicing

cat annotation_v6.variant_function | cut -f1 | sort | uniq -c | sort -k1rn
29993606 exonic
21351051 intronic
1660004 intergenic
1354793 UTR3
1260625 ncRNA_intronic
1002459 ncRNA_exonic
 956114 UTR5
 569328 splicing
 344824 upstream
 165866 downstream
  31106 upstream;downstream
   9985 exonic;splicing
   8932 ncRNA_splicing
   3767 UTR5;UTR3
   2646 ncRNA_exonic;splicing

References