Updated 12 November 2013.
High throughput sequencers are continually increasing their output of reads; according to Illumina, the HiSeq2500/1500 can output a maximum of 187 million single end reads per lane/flow cell. The question is: how deep should we sequence our samples? Obviously it depends on the aim; if we wish to profile lowly expressed genes, the deeper the better. For this post I am interested in how deep I should sequence a CAGE library to reach saturation, i.e. until all (or nearly all) RNA species in the sample have been profiled. I had some samples with very low complexity, i.e. a low number of unique reads, and I wanted to estimate the sequencing depth I would theoretically require to profile all the RNA species in my sample.
During my quest for an answer, I came across an interesting paper, which stated that for RNA sequencing:
While 100 million reads are sufficient to detect most expressed genes and transcripts, about 500 million reads are needed to measure accurately their expression levels.
I found this blog post that raised some issues about this estimate. Basically, the authors of the paper combined all 20 samples (each with roughly 40 million reads), totalling ~800 million reads, and used this as the population. Obviously this is not the same as sequencing one sample to a depth of 800 million reads, since the samples are not homogeneous and the complexity is therefore artificially higher. So this 500 million read figure may be an over-estimate due to the inflated complexity. See the very well written blog post linked above for more elaboration.
So I continued my search and found preseq, a tool for predicting the yield of distinct reads from a genomic library based on an initial sequencing experiment. It's published in Nature Methods and was announced on SEQanswers. The estimates made by preseq can be used to examine the utility of further sequencing, to optimise the sequencing depth, or to screen multiple libraries to avoid low complexity samples. Just what I wanted!
In the preseq paper, complexity is defined as the number of distinct molecules that can be observed in a set of sequenced reads. The complexity depends both on the make up of molecules in a sample and on the sequencing depth. There is also the concept of "sequencing real estate": the throughput of a sequencer is limited, so if an abundant RNA molecule dominates the sample, everything else is under sampled because most of the sequencing real estate was spent sequencing the abundant molecule. Such a library has very low complexity, because few distinct molecules are observed among the reads, and this is exactly the type of library that I have. The aim of preseq is to estimate, based purely on the distribution of observed counts, how many previously unsequenced molecules would be found if we sequenced deeper.
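To make the "sequencing real estate" idea concrete, here is a toy simulation (entirely hypothetical numbers, not from my library): one molecule species makes up 90% of the sample, so most reads are spent re-sequencing it and few distinct molecules are observed.

```python
import random

random.seed(42)

# Hypothetical library: one dominant species (90% of molecules)
# plus 10,000 rare species
library = ["dominant"] * 90000 + [f"rare_{i}" for i in range(10000)]

# "Sequence" 5,000 reads; most of the real estate goes to the
# dominant molecule, so few distinct molecules are recovered
reads = [random.choice(library) for _ in range(5000)]
distinct = len(set(reads))
print(reads.count("dominant"), distinct)
```

With these numbers, roughly 90% of the 5,000 reads hit the dominant species and well under a thousand distinct molecules are seen, even though 10,001 species are present in the library.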
How does preseq work? It's a non-parametric method that relies on bootstrapping the observed counts to estimate the variance. The bootstrap process samples counts with replacement from the observed counts until the number of sampled reads equals the total number of observed reads. The technical explanation (from Timothy, the developer of preseq) for why we sample until we reach the observed total is that the distribution of the observed histogram, conditioned on the observed total number of reads, is multinomial with probabilities proportional to the expected counts (for which the observed counts are unbiased estimators). The way I understood it is that the sample has to be the same size as the observed total because the distribution of the observed counts is conditional on the total number of reads (though I'm not sure how accurate that interpretation is).
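Here is a toy sketch of that resampling step as I understand it (my own illustration, not preseq's actual code): counts are drawn with replacement from the observed counts until the resampled reads total the observed reads.

```python
import random

random.seed(1)

# Toy observed counts: one dominant molecule seen 1,000 times
# plus 200 singletons (a sparse, low complexity histogram)
observed_counts = [1000] + [1] * 200
total_reads = sum(observed_counts)  # 1,200 reads in total

def bootstrap_counts(counts, total):
    # Sample counts with replacement until the number of sampled
    # reads equals (or just exceeds) the total observed reads
    sample, sampled_reads = [], 0
    while sampled_reads < total:
        c = random.choice(counts)
        sample.append(c)
        sampled_reads += c
    return sample

boot = bootstrap_counts(observed_counts, total_reads)
print(len(boot), sum(boot))
```

Note that if the dominant count of 1,000 happens not to be drawn early, the bootstrap histogram consists almost entirely of singletons, which is exactly the sparsity problem I ran into with my library.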
Now a word about CAGE libraries and in particular the library I'm analysing. The distribution of CAGE reads usually follows a power law; see the linked post for a graph showing this relationship. The CAGE library I am analysing does not fit this distribution particularly well towards the tail, because 3 RNA molecules make up ~90% of the sequenced reads. My library is extremely sparse, so when bootstrapping, one or more of these large counts might be missing from the sample, which dramatically drives up the estimated number of distinct reads. In their recent update of the preseq program, they have addressed this issue.
Running preseq
Firstly, install CMake and BamTools (assuming you don't already have them):
#CMake
wget http://www.cmake.org/files/v2.8/cmake-2.8.12.1.tar.gz
tar -xzf cmake-2.8.12.1.tar.gz
cd cmake-2.8.12.1
./bootstrap --prefix=`pwd`
make; make install
cp bin/* ~/bin
cd ..

#BamTools
git clone git://github.com/pezmaster31/bamtools.git
cd bamtools
cmake --version
#cmake version 2.8.12.1
mkdir build
cd build
cmake ..
make
cd ..
cp bin/* ~/bin

#create a lib directory if you don't already have one
mkdir ~/lib
cp lib/libbamtools.so lib/libbamtools.so.2.3.0 ~/lib/
Add this line to your .bashrc if you use the Bourne Again SHell:
export LD_LIBRARY_PATH=/home/tan118/lib:$LD_LIBRARY_PATH
Now to get started with preseq, download the source and compile according to the instructions in the manual:
#install the GNU Scientific Library
wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.16.tar.gz
tar -xzf gsl-1.16.tar.gz
cd gsl-1.16
./configure
make; make install
cp .libs/libgsl.so* ~/lib
cp .libs/libgslcblas.so* ~/lib

wget -O preseq-0.0.4.tar.bz2 http://smithlab.usc.edu/plone/software/librarycomplexity-source-code
md5sum preseq-0.0.4.tar.bz2
fc280ccc6d0cc66a4adc48d27af56ef3  preseq-0.0.4.tar.bz2
tar -xjf preseq-0.0.4.tar.bz2
cd preseq-0.0.4
make all BAMTOOLS_ROOT=/home/davetang/src/bamtools
The complexity plot, also known as a discovery plot, can be generated by using c_curve:
c_curve -B -v -o blah.txt blah.bam 2> blah_verbose.txt

#extremely low complexity libraries
cat blah.txt
TOTAL_READS	DISTINCT_READS
0	0
1000000	63083
2000000	108190
3000000	147340
4000000	182888
5000000	214696
6000000	244278
7000000	272133
8000000	298105

head blah_verbose.txt
loading mapped locations
TOTAL READS     = 8459464
DISTINCT READS  = 309612
MAX COUNT       = 2920545
COUNTS OF 1     = 207315
OBSERVED COUNTS (2920546)
1	207315
2	50300
3	17102
4	8376
Let's plot this complexity curve:
data <- read.table("blah.txt", header=T, stringsAsFactors=F)
data
TOTAL_READS DISTINCT_READS
1 0 0
2 1000000 63083
3 2000000 108190
4 3000000 147340
5 4000000 182888
6 5000000 214696
7 6000000 244278
8 7000000 272133
9 8000000 298105
plot(data$TOTAL_READS, data$DISTINCT_READS, type="l")
Saturation is in a land far far away.
To estimate the future yield of a genomic library, use the lc_extrap program.
#default number of bootstraps
lc_extrap -B -o blah2.txt -v blah.bam 2> blah2_verbose.txt
head -15 blah2.txt
TOTAL_READS	EXPECTED_DISTINCT	LOGNORMAL_LOWER_95%CI	LOGNORMAL_UPPER_95%CI
0	0	0	0
1000000.0	62845.0	16449.7	240094.9
2000000.0	108363.0	28389.1	413628.8
3000000.0	147745.0	38742.5	563426.9
4000000.0	182733.0	47869.2	697553.8
5000000.0	214827.0	56295.6	819791.8
6000000.0	244833.0	64282.0	932503.6
7000000.0	272172.0	71342.2	1038342.6
8000000.0	298357.0	78287.2	1137055.2
9000000.0	322657.8	84613.2	1230399.1
10000000.0	345792.6	90679.4	1318629.0
11000000.0	367741.3	96434.3	1402339.8
12000000.0	388616.0	101907.3	1481958.3
13000000.0	408510.0	107122.9	1557840.8

#old results
#TOTAL_READS	EXPECTED_DISTINCT	LOGNORMAL_LOWER_95%CI	LOGNORMAL_UPPER_95%CI
#0	0	0	0
#1000000.0	129734.5	227553.6	73965.2
#2000000.0	223709.5	392356.6	127552.2
#3000000.0	304335.0	533908.6	173475.0
#4000000.0	376565.5	660613.1	214651.5
#5000000.0	442678.0	776371.0	252410.0
#6000000.0	503531.5	883310.7	287038.3
#7000000.0	560750.5	983555.3	319698.5
#8000000.0	614028.0	1077223.2	350002.1

#using 10 bootstraps
~/src/preseq/lc_extrap -b 10 -B -o blah3.txt -v blah.bam 2> blah3_verbose.txt
head -15 blah3.txt
TOTAL_READS	EXPECTED_DISTINCT	LOGNORMAL_LOWER_95%CI	LOGNORMAL_UPPER_95%CI
0	0	0	0
1000000.0	62734.0	12627.2	311671.7
2000000.0	108665.0	21989.7	536982.5
3000000.0	147589.0	29793.0	731129.2
4000000.0	183162.0	37052.4	905428.5
5000000.0	214899.0	43414.0	1063747.8
6000000.0	244733.0	49517.2	1209565.2
7000000.0	272240.0	54993.0	1347709.5
8000000.0	298140.0	60224.5	1475934.3
9000000.0	322657.8	65202.5	1596688.6
10000000.0	345792.6	69875.8	1711215.9
11000000.0	367741.3	74308.8	1819887.9
12000000.0	388616.0	78524.0	1923262.9
13000000.0	408510.0	82540.2	2021808.9

#old results
#TOTAL_READS	EXPECTED_DISTINCT	LOGNORMAL_LOWER_95%CI	LOGNORMAL_UPPER_95%CI
#0	0	0	0
#1000000.0	139784.5	417770.4	46771.4
#2000000.0	241243.0	720321.0	80794.8
#3000000.0	328504.5	980548.0	110056.0
#4000000.0	407053.0	1213359.2	136556.5
#5000000.0	477678.0	1426316.5	159975.9
#6000000.0	544632.5	1623012.4	182761.7
#7000000.0	605724.5	1806452.7	203106.4
#8000000.0	663932.5	1978728.6	222772.5
Based on the old results:
From the output of the c_curve program, at 8 million reads I had only 298,105 distinct reads. From the output of lc_extrap using 100 bootstraps, at 8 million reads it expected 614,028 distinct reads; that's more than twice what I had in the library! So here's the explanation from Tim:
What is happening in this example is that a lot of the sampled histograms are missing at least one, and possibly more, of the larger counts, when they should always be in the histogram (or at least some close value since we're pretty sure the expected value is close to the observed). If the histogram is full, this isn't a problem since there is a large probability we sample a count close to any particular count. This has a profound effect on the estimates and drives them upward (making the library seem more complex than it is).
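A back-of-the-envelope calculation (my own, and it ignores preseq's stopping rule based on total reads) gives a feel for how often a particular count goes missing: in a standard bootstrap of n draws with replacement from n observed counts, the probability that any one count, such as my dominant one, is never drawn is (1 - 1/n)^n ≈ e⁻¹ ≈ 37%.

```python
import math

# Number of distinct reads in my library (from the c_curve verbose output)
n = 309612

# Probability that one particular count is never drawn in a
# standard bootstrap of n draws with replacement
p_miss = (1 - 1 / n) ** n
print(round(p_miss, 4))  # 0.3679, i.e. about 37%
```

So under this simplified model, roughly a third of the bootstrap histograms would omit the dominant molecule entirely.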
Tim actually presented me with lc_extrap results estimated without performing the bootstrapping, and the expected counts were very similar to the observed counts (as expected). They will be making updates to the software to accommodate sparse datasets like mine, and I will update this post when the new version is available.
The updated results from running preseq-0.0.4 show that the expected number of distinct reads is close to the observed number at 8 million reads. Let's plot the extrapolated reads in R:
data <- read.table("blah2.txt",header=T,stringsAsFactors=F)
data[1:20,]
TOTAL_READS EXPECTED_DISTINCT LOGNORMAL_LOWER_95.CI LOGNORMAL_UPPER_95.CI
1 0.0e+00 0.0 0.0 0.0
2 1.0e+06 62845.0 16449.7 240094.9
3 2.0e+06 108363.0 28389.1 413628.8
4 3.0e+06 147745.0 38742.5 563426.9
5 4.0e+06 182733.0 47869.2 697553.8
6 5.0e+06 214827.0 56295.6 819791.8
7 6.0e+06 244833.0 64282.0 932503.6
8 7.0e+06 272172.0 71342.2 1038342.6
9 8.0e+06 298357.0 78287.2 1137055.2
10 9.0e+06 322657.8 84613.2 1230399.1
11 1.0e+07 345792.6 90679.4 1318629.0
12 1.1e+07 367741.3 96434.3 1402339.8
13 1.2e+07 388616.0 101907.3 1481958.3
14 1.3e+07 408510.0 107122.9 1557840.8
15 1.4e+07 427502.1 112101.9 1630285.5
16 1.5e+07 445657.8 116861.6 1699539.4
17 1.6e+07 463034.3 121417.0 1765821.6
18 1.7e+07 479751.6 125801.5 1829561.6
19 1.8e+07 495747.7 129978.6 1890817.4
20 1.9e+07 511232.0 134048.6 1949727.9
plot(data$TOTAL_READS[1:700], data$EXPECTED_DISTINCT[1:700], type="l")
Approaching saturation at 200 million reads.
So if I sequenced this library again, doubling the depth to 16 million reads, I should expect roughly 1.5 times as many distinct reads. If I tripled the sequencing depth, I should expect roughly 2 times as many distinct reads.
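As a sanity check of the doubling estimate, here's the ratio computed from the EXPECTED_DISTINCT values printed in the R output above (plain arithmetic, nothing preseq-specific):

```python
# EXPECTED_DISTINCT values copied from the lc_extrap output above
expected_at_8m = 298357.0   # at 8 million reads
expected_at_16m = 463034.3  # at 16 million reads

fold = expected_at_16m / expected_at_8m
print(round(fold, 2))  # 1.55
```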

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Very interesting post, but I don't understand the definition of "distinct reads". Does it mean that those reads map to a specific gene without duplicates? In other words, if I remove duplicate reads, will I find only a few reads on a few genes?
Thanks in advance.
Distinct reads are unique reads. For example if my reads were “AA”, “AC”, “AA”, “AA”, and “GG”, I would have three distinct reads (“AA”, “AC”, and “GG”).
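In code, that definition is just counting unique elements:

```python
# The example reads from my reply above
reads = ["AA", "AC", "AA", "AA", "GG"]
distinct_reads = set(reads)
print(len(distinct_reads))  # 3
```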
Hi, congrats on the post. I'm interested in plotting the number of different contigs on the Y axis.
I have sequenced a transcriptome and I am now wondering how to plot a rarefaction curve to check whether the number of reads sequenced is enough to cover the entire transcriptome, or alternatively whether sequencing more reads would let me assemble more transcripts.
Many thanks
Hi Daniel,
I think obtaining a rarefaction curve from your transcriptome data is straightforward using preseq. However, knowing whether a randomly sampled read can be used to assemble a transcript is not straightforward. I guess you could make simple assumptions, such as requiring 10x coverage (I just made this number up; I'm not sure how much coverage is actually necessary) over the entire transcript for assembly, and run some simulations based on that. Perhaps you will have better luck asking the experts on bioinformatics forums.
Cheers,
Dave
Thank you, Dave for a wonderful post. I am working with some low yield libraries and I was wondering if you had an example for preseq gc_extrap ?
Hi Jessica,
I just had a look and I had deleted the BAM file I used for this post. You can use SAMtools to sub-sample a BAM file to simulate a low yield library (https://github.com/davetang/learning_bam_file#random-subsampling-of-bam-file).
Cheers,
Dave