How deep should we sequence?

Updated 2013 November 12th.

High throughput sequencers are continually increasing their output of reads; according to Illumina, the HiSeq2500/1500 can output a maximum of 187 million single end reads per lane/flow cell. The question is “How deep should we sequence our samples?” Obviously it depends on the aim; if we wish to profile lowly expressed genes, the deeper the better. For this post I am interested in how deep I should sequence a CAGE library to reach saturation i.e. profiled all (or nearly all) RNA species in the sample. I had some samples that had a very low complexity, i.e. low number of unique reads, and I wanted to estimate the sequencing depth I would require to theoretically profile all the RNA species in my sample.

During my quest for an answer, I came across an interesting paper, which stated that for RNA sequencing:

While 100 million reads are sufficient to detect most expressed genes and transcripts, about 500 million reads are needed to measure accurately their expression levels.

I found this blog post that raised some issues about this estimate. Basically what the authors of the paper did was combine all 20 of the samples (each sample had roughly 40 million reads) together, totalling ~800 million and used this as the population. Obviously this is not the same as sequencing one sample to a depth of 800 million reads, since the samples are not homogeneous and thus the complexity is artificially higher. So in the end, this 500 million read estimation may be an over-estimation due to this inflated complexity. See the very well written blog post I linked above for more elaboration.

So I continued my search and found preseq, which is a tool aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. It’s published in Nature Methods and announced on SEQanswers. From the estimates made by preseq, one can then use this information to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples. Just what I wanted!

In the preseq paper, they define complexity as the number of distinct molecules that can be observed in a set of sequenced reads. The complexity is proportional to the make up of molecules in a sample and the sequencing depth. There is this concept of “sequencing real estate”, which is that the throughput of a sequencer is limited. If you have an abundant RNA molecule that dominates the sample, everything else is under sampled because the sequencing real estate was used to mainly sequence the abundant RNA molecule. In terms of complexity, this would be very low because there are few distinct molecules as observed by the reads and this is exactly the type of library that I have. The aim of preseq is to estimate the number of previously unsequenced molecules if we sequenced deeper based purely on the distribution of observed counts.

How does preseq work? It’s a highly non-parametric method that relies on bootstrapping the observed counts for estimating the variance. The bootstrap process samples counts with replacement from the observed counts until the number of sampled reads is equal to the total observed reads. The technical explanation (by Timothy, the developer of preseq) for why we sample reads until we equal the total observed reads is because the distribution of the observed histogram, condition upon the observed total number of reads, is multinomial with probabilities proportional to the expected counts (for which the observed counts are unbiased estimators). The way I understood that, is that the sampling has to be the same size as the observed total number of reads because the distribution of the observed counts is conditional on the total number of reads (but not sure how accurate that interpretation is).

Now a word about CAGE libraries and in particular the library I’m analysing. The distribution of CAGE reads usually follows a power law distribution; see the linked post for a graph that shows this relationship. The CAGE libraries I am analysing does not fit this distribution particularly well towards the tail because I have 3 RNA molecules that make up ~90% of the sequenced reads. My library is extremely sparse and thus when bootstrapping, one or more of these counts might be missing, which drives up the number of sampled distinct dramatically. In their recent update of the preseq program, they have addressed this issue.

Running preseq

Firstly install CMake and BamTools (assuming you don’t have them):

tar -xzf cmake-
cd cmake-
./bootstrap --prefix=`pwd`
make; make install
cp bin/* ~/bin
cd ..
git clone git://
cd bamtools
cmake --version
#cmake version
mkdir build
cd build
cmake ..
cd ..
cp bin/* ~/bin
#create a lib directory if you don't already have one
mkdir ~/lib
cp lib/ lib/ ~/lib/

Add this line to your .bashrc if you use the Bourne Again SHell:

export LD_LIBRARY_PATH=/home/tan118/lib:$LD_LIBRARY_PATH

Now to get started with preseq, download the source and compile according to the instructions in the manual:

#install the GNU Scientific Library
tar -xzf gsl-1.16.tar.gz
cd gsl-1.16
make; make install
cp .libs/* ~/lib
cp .libs/* ~/lib

wget -O preseq-0.0.4.tar.bz2
md5sum preseq-0.0.4.tar.bz2
fc280ccc6d0cc66a4adc48d27af56ef3  preseq-0.0.4.tar.bz2
tar -xjf preseq-0.0.4.tar.bz2
cd preseq-0.0.4

make all BAMTOOLS_ROOT=/home/davetang/src/bamtools

The complexity plot, also known as a discovery plot, can be generated by using c_curve:

c_curve -B -v -o blah.txt blah.bam 2> blah_verbose.txt
#extremely low complexity libraries
cat blah.txt
0	0
1000000	63083
2000000	108190
3000000	147340
4000000	182888
5000000	214696
6000000	244278
7000000	272133
8000000	298105

head blah_verbose.txt
loading mapped locations
TOTAL READS     = 8459464
MAX COUNT       = 2920545
COUNTS OF 1     = 207315
1	207315
2	50300
3	17102
4	8376

Let’s plot this complexity curve:

data <- read.table("blah.txt", header=T, stringsAsFactors=F)
1           0              0
2     1000000          63083
3     2000000         108190
4     3000000         147340
5     4000000         182888
6     5000000         214696
7     6000000         244278
8     7000000         272133
9     8000000         298105
plot(data$TOTAL_READS, data$DISTINCT_READS, type="l")

preseq_c_curve_004Saturation is in a land far far away.

To estimate the future yield of a genomic library, use the lc_extrap program.

#default number of bootstraps
lc_extrap -B -o blah2.txt -v blah.bam 2> blah2_verbose.txt
head -15 blah2.txt 
0	0	0	0
1000000.0	62845.0	16449.7	240094.9
2000000.0	108363.0	28389.1	413628.8
3000000.0	147745.0	38742.5	563426.9
4000000.0	182733.0	47869.2	697553.8
5000000.0	214827.0	56295.6	819791.8
6000000.0	244833.0	64282.0	932503.6
7000000.0	272172.0	71342.2	1038342.6
8000000.0	298357.0	78287.2	1137055.2
9000000.0	322657.8	84613.2	1230399.1
10000000.0	345792.6	90679.4	1318629.0
11000000.0	367741.3	96434.3	1402339.8
12000000.0	388616.0	101907.3	1481958.3
13000000.0	408510.0	107122.9	1557840.8

#old results
#0       0       0       0
#1000000.0       129734.5        227553.6        73965.2
#2000000.0       223709.5        392356.6        127552.2
#3000000.0       304335.0        533908.6        173475.0
#4000000.0       376565.5        660613.1        214651.5
#5000000.0       442678.0        776371.0        252410.0
#6000000.0       503531.5        883310.7        287038.3
#7000000.0       560750.5        983555.3        319698.5
#8000000.0       614028.0        1077223.2       350002.1

#using 10 bootstraps
~/src/preseq/lc_extrap -b 10 -B -o blah3.txt -v blah.bam 2> blah3_verbose.txt
head -15 blah3.txt 
0	0	0	0
1000000.0	62734.0	12627.2	311671.7
2000000.0	108665.0	21989.7	536982.5
3000000.0	147589.0	29793.0	731129.2
4000000.0	183162.0	37052.4	905428.5
5000000.0	214899.0	43414.0	1063747.8
6000000.0	244733.0	49517.2	1209565.2
7000000.0	272240.0	54993.0	1347709.5
8000000.0	298140.0	60224.5	1475934.3
9000000.0	322657.8	65202.5	1596688.6
10000000.0	345792.6	69875.8	1711215.9
11000000.0	367741.3	74308.8	1819887.9
12000000.0	388616.0	78524.0	1923262.9
13000000.0	408510.0	82540.2	2021808.9

#old results
#0       0       0       0
#1000000.0       139784.5        417770.4        46771.4
#2000000.0       241243.0        720321.0        80794.8
#3000000.0       328504.5        980548.0        110056.0
#4000000.0       407053.0        1213359.2       136556.5
#5000000.0       477678.0        1426316.5       159975.9
#6000000.0       544632.5        1623012.4       182761.7
#7000000.0       605724.5        1806452.7       203106.4
#8000000.0       663932.5        1978728.6       222772.5

Based on the old results:

From the output of the c_curve program, at 8 million reads I had only 298,314 distinct reads. From the output of lc_extrap using 100 bootstraps, at 8 million reads, it expected 614028.0 distinct reads; that’s more than twice what I had in the library! So here’s the explanation from Tim:

What is happening in this example is that a lot of the sampled histograms are missing at least one, and possibly more, of the larger counts, when they should always be in the histogram (or at least some close value since we’re pretty sure the expected value is close to the observed). If the histogram is full, this isn’t a problem since there is a large probability we sample a count close to any particular count. This has a profound effect on the estimates and drives them upward (making the library seem more complex than it is).

Actually Tim, presented me with lc_extrap results that were estimated without performing the bootstrapping and the expected results were very similar to the observed counts (as expected). They will be making updates to the software to accommodate for sparse datasets like mine and I will update this post when the updated software is available.

The updated post running preseq-0.0.4 shows that the expected reads is close to the number of observed reads at 8 million. Let’s plot the extrapolated reads in R:

data <- read.table("blah2.txt",header=T,stringsAsFactors=F)
1      0.0e+00               0.0                   0.0                   0.0
2      1.0e+06           62845.0               16449.7              240094.9
3      2.0e+06          108363.0               28389.1              413628.8
4      3.0e+06          147745.0               38742.5              563426.9
5      4.0e+06          182733.0               47869.2              697553.8
6      5.0e+06          214827.0               56295.6              819791.8
7      6.0e+06          244833.0               64282.0              932503.6
8      7.0e+06          272172.0               71342.2             1038342.6
9      8.0e+06          298357.0               78287.2             1137055.2
10     9.0e+06          322657.8               84613.2             1230399.1
11     1.0e+07          345792.6               90679.4             1318629.0
12     1.1e+07          367741.3               96434.3             1402339.8
13     1.2e+07          388616.0              101907.3             1481958.3
14     1.3e+07          408510.0              107122.9             1557840.8
15     1.4e+07          427502.1              112101.9             1630285.5
16     1.5e+07          445657.8              116861.6             1699539.4
17     1.6e+07          463034.3              121417.0             1765821.6
18     1.7e+07          479751.6              125801.5             1829561.6
19     1.8e+07          495747.7              129978.6             1890817.4
20     1.9e+07          511232.0              134048.6             1949727.9
plot(data$TOTAL_READS[1:700], data$EXPECTED_DISTINCT[1:700], type="l")

lc_extrap_preseq_004Approaching saturation at 200 million reads.

So if I sequenced this library again doubling the depth (16 million), I should expect roughly 1.5 times more distinct reads. If I tripled the sequencing depth, I should expect to see 2 times more distinct reads.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
4 comments Add yours
  1. Very interesting post, but I don’t understand the definition of “distinct reads”. It means that those reads are mapped on a specific gene without duplicates? In other words, if I remove the duplicates reads, I will find few reads on few genes?
    Thanks in advance.

    1. Distinct reads are unique reads. For example if my reads were “AA”, “AC”, “AA”, “AA”, and “GG”, I would have three distinct reads (“AA”, “AC”, and “GG”).

  2. Hi, congrats on the post. I’m interested to plot on Y axis the number of different contigs.
    I have sequenced a transcriptome and now I am wondering how to plot a rarefaction curve to check if the number of reads sequenced are enough to cover the entire transcriptome or alternatively if I sequence more reads I would assemble more transcripts.

    Many thanks

    1. Hi Daniel,

      I think obtaining a rarefaction curve based on your transcriptome data is straight-forward using preseq. However, knowing whether a randomly sampled read can be used to assemble a transcript is not straight-forward. I guess you could make simple assumptions, such as 10x coverage (I just made this number up; I’m not sure how much actual coverage is necessary) over the entire transcript is required for assembly, and you could run some simulations based on this. Perhaps you will have better luck asking the experts on bioinformatics forums.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.