One of my Top 10 posts is on creating a coverage plot using R. For that post I used CAGE data, which is a transcriptomic data set containing transcription start sites, and I used R exclusively for building a "coverage plot." The main issue with that post was that the plots were density plots rather than a real coverage plot. In this post, I'll use BEDTools to calculate the per base coverage of a defined region and produce an actual coverage plot using R.
Firstly download the CAGE data set.
#the smallest CAGE BAM file from ENCODE wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHchCellPapAlnRep1.bam
Then install BEDTools.
git clone https://github.com/arq5x/bedtools2.git cd bedtools2 make clean; make all
In my old post, I created a density plot from the region chr22:31480000-39080000. For the sake of comparison, I'll focus on this region again.
# create a BED file with the region of interest # the spaces in the BED file are tabs cat my_region.bed chr22 31480000 39080000 my_region 0 + # calculate the coverage in this region bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed chr22 31480000 39080000 my_region 0 + 23964 46060 7600000 0.0060605 # calculate the coverage per base pair in this region # on the same strand bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed -d -s | gzip > wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz # on the opposite strand bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed -d -S | gzip > wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz # check out the files gunzip -c wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz | head -5 chr22 31480000 39080000 my_region 0 + 1 0 chr22 31480000 39080000 my_region 0 + 2 0 chr22 31480000 39080000 my_region 0 + 3 0 chr22 31480000 39080000 my_region 0 + 4 0 chr22 31480000 39080000 my_region 0 + 5 0 gunzip -c wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz | head -5 chr22 31480000 39080000 my_region 0 + 1 0 chr22 31480000 39080000 my_region 0 + 2 0 chr22 31480000 39080000 my_region 0 + 3 0 chr22 31480000 39080000 my_region 0 + 4 0 chr22 31480000 39080000 my_region 0 + 5 0
Now to load the data into R.
# to save memory use colClasses and NULLs
# this is useful because I'm just interested
# in a single column
pos <- read.table("wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz",
header=FALSE,
colClasses=c(rep('NULL',7),NA)
)
# check it out
head(pos)
V8
1 0
2 0
3 0
4 0
5 0
6 0
# some sanity checking
table(pos$V8>0)
FALSE TRUE
7576926 23074
neg <- read.table("wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz",
header=FALSE,
colClasses=c(rep('NULL',7),NA)
)
# check it out
head(neg)
V8
1 0
2 0
3 0
4 0
5 0
6 0
# sanity checking
table(neg$V8>0)
FALSE TRUE
7576354 23646
The data has loaded into R as simple data frames; now to produce some plots.
names(pos) <- 'coverage'
names(neg) <- 'coverage'
pos$position <- 31480001:39080000
neg$position <- 31480001:39080000
neg$coverage <- neg$coverage * -1
max(pos$coverage)
[1] 4232
min(neg$coverage)
[1] -1633
# I swear that one day I will learn to use ggplot2
# so that I don't have to create a plot in the manner below
plot(pos$coverage,
type='h',
ylim=c(-2000,4500),
col='blue',
ylab='Coverage',
xlab='Position',
xaxt='n',
main='CAGE coverage at chr22:31480000-39080000'
)
# plot the coverage on the negative strand
lines(neg$coverage,
type='h',
col='red'
)
# generate my x-axis
l <- length(pos$position)
axis(1,
at = seq(from = 0, to = l, by = l/4),
labels = seq(from = head(pos$position,n=1)-1,
to=tail(pos$position, n=1),
by = l/4
)
)
Here's the density plot from the old post:
Conclusion
If you examine the two plots (the density plot vs. the actual coverage plot) you can see that they roughly resemble each other, however, the coverage plot from this post shows the coverage per base for a region of interest. I had to predetermine the region (as a BED file) when I used BEDTools mainly because if I calculated the per base coverage for an entire chromosome and tried to load that into R, I would probably run out of memory (even when I'm working with the smallest chromosome). There should be much smarter ways of storing coverage data in R (compared to what I've done) and one day I'll find out. For this post, plotting roughly 7.5 million items twice (once for the positive and once for the negative strand) took a couple of minutes on my MacBook Pro (Retina, 13-inch, Early 2015), which has 8 gigs of RAM and a 2.9 GHz Intel Core i5 processor.
Extra: exome coverage
One key statistic that is often reported in whole exome sequencing studies is the percentage coverage at x fold; for example 80% of bases were covered at >20x. Unfortunately, I can only demonstrate how to calculate this with some local data as I'm not aware of publicly available exome data sets (if someone lets me know, I can update this post with the publicly available data set). The data I'm working with was generated using the Illumina TruSeq Exome pipeline, which I found out, no longer exists.
# some simple stats on the BAM file
samtools flagstat blah.bam
108229305 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
107498562 + 0 mapped (99.32%:-nan%)
108229305 + 0 paired in sequencing
54052752 + 0 read1
54176553 + 0 read2
106976470 + 0 properly paired (98.84%:-nan%)
107422932 + 0 with itself and mate mapped
75630 + 0 singletons (0.07%:-nan%)
292196 + 0 with mate mapped to a different chr
203728 + 0 with mate mapped to a different chr (mapQ>=5)
# running the coverage as per above
# uses A LOT of memory
# bedtools coverage -a TruSeq-Exome-Targeted-Regions.bed -b blah.bam -bed -d | gzip > blah.tsv.gz
# convert the BAM file into a BED file first
bedtools bamtobed -i blah.bam | gzip > blah.bed.gz
gunzip -c blah.bed.gz | wc -l
107498562
# calculate the coverage using the BED file
# which uses much less memory
bedtools coverage -a TruSeq-Exome-Targeted-Regions.bed -b blah.bed.gz -d | gzip > coverage.tsv.gz
# check out the coverage file
gunzip -c coverage.tsv.gz | head
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 1 1064
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 2 1056
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 3 1049
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 4 1040
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 5 1043
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 6 1036
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 7 1034
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 8 1034
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 9 1027
chr1 14362 14829 chr1:14363-14829:WASH5P 467 + 10 1017
# how many base pairs does TruSeq-Exome-Targeted-Regions.bed cover?
gunzip -c coverage.tsv.gz | wc -l
62085295
# how many bases have greater than 20x coverage?
gunzip -c coverage.tsv.gz | awk '$8>20 {print}' | wc -l
54231134
# calculate the percentage
bc -l<<<54231134*100/62085295
87.34940214103838920311
# how many bases have greater than 30x coverage?
gunzip -c AFS-0005.dedup.realign.base.txt.gz | awk '$8>30 {print}' | wc -l
50320381
bc -l<<<50320381*100/62085295
81.05040170945471065249

This work is licensed under a Creative Commons
Attribution 4.0 International License.


Hi, since you are doing a histogram type of plot, why import all the 0 coverage? I usually filter my data prior to importing. This does mean you need to import the index column. Comparisons are also a little more work as you need to check for existence of the row rather than just compare rows, but you get more in.
Yes there are definitely much smarter ways than what I propose in the post. The advantage is that it’s straightforward.
Error: unable to open file or unable to determine types for file my_region.bed
./bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed
I don’t think the script works…
Works for me. Did you make sure that each column is tab-delimited in the BED file?
Hi,
I tried your method but my all values show 0 coverage . Can you briefly describe how you have created a bed file of region of interest. Actually I have a bam file I have to extract chr 18 within 56517463 -5623469 range from bam file . I have tried extracting this region and save as myregion.bed file and I am trying to find out a coverage of this region with my genome. Please help me how can I do that.
Hi Sima,
for the BED file you can use any text editor you want; I used Vim. Make sure each column is tab-delimited.
Cheers,
Dave
Dear friend
I can use the same approach to calculate the coverage for a bacterial genome assembly
Best wish
Hi,
I really appreciate you writing this code. I tried to use the same code using bedtools2, however, I have a short viral reference genome (130kbp), and need the coverage for the whole region. Hence I don’t need to specify a region as in your my.region.bed file. I cannot figure out how to plot the whole region with the coverage. I know what the plot should look like from genious and the outcome from running the above code on my file does not look like it. Any suggestions? How do I specify the whole region?
Hi Anni,
did you set your my_region.bed file (the spaces are tabs) as:
chr 0 130000 my_region 0 +
Cheers,
Dave
Thanks for your reply.
I was unsure of how to specify this region since I wanted the whole dataset. I tried what you wrote above and get the following warning.
***** WARNING: File text.bed has inconsistent naming convention for record:
chr 0 134000 my_region 0 +
For “chr” use the name of your viral reference sequence/s that you mapped the reads to. For example, for my work, the reference sequence names are chr1, chr2, chr3, … for hg19.
I’m having problems with the -bed command on the line;
# calculate the coverage in this region
bedtools coverage -a my_region.bed -b gEncodeRikenCageHchCellPapAlnRep1.bam -bed
The .bed file reads correctly but when I run this line:
bedtools coverage -a ChrA1x.bed -b MM091AllConcat.sorted.rmdup.readsadded.1.bam -bed
I get: Error: Unrecognized parameters -bed
Any suggestion would be hugely appreciated as I’m a noob 🙂
We all get started somewhere 🙂 What version of bedtools are you using? Run:
bedtools --version
I just downloaded and compiled the latest version (bedtools v2.26.0-19-g6bf23c4) and at least the example in the post works.
Thanks for this post!
I would like to mention that one might have to take care of the pos and neg coverage fields, when the bed file used (in the bedtools step) is composed of a mixture of positive and negative stranded genes.
It was a very useful post. I created coverage plot for all my samples solely because of your post. Thank you.
I was wondering if you would like to plot it with ggplot. Now that I know, that you know ggplot. 😀
Hi Dave,
Very nice tutorial and thanks for sharing it. I would really appreciate if you could also show how to you calculate the total number of reads on both positive and negative strand separately for a given bed file?
Hi Dave,
Thanks for this tutorial! Indeed very useful. I am trying to incorporate a scale break on y-axis, since I have only 1 contig with a very high coverage and kind of messing up with the scale for the remaining genome. Do you have any suggestion for doing that?