Creating a coverage plot using BEDTools and R

One of my Top 10 posts is on creating a coverage plot using R. For that post I used CAGE data, which is a transcriptomic data set containing transcription start sites, and I used R exclusively for building a "coverage plot." The main issue with that post was that the plots were density plots rather than a real coverage plot. In this post, I'll use BEDTools to calculate the per base coverage of a defined region and produce an actual coverage plot using R.

Firstly download the CAGE data set.

#the smallest CAGE BAM file from ENCODE
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHchCellPapAlnRep1.bam

Then install BEDTools.

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all

In my old post, I created a density plot from the region chr22:31480000-39080000. For the sake of comparison, I'll focus on this region again.

# create a BED file with the region of interest
# the spaces in the BED file are tabs
cat my_region.bed
chr22   31480000        39080000        my_region       0       +

# calculate the coverage in this region
bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed
chr22   31480000        39080000        my_region       0       +       23964   46060   7600000 0.0060605

# calculate the coverage per base pair in this region
# on the same strand
bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed -d -s | gzip > wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz
# on the opposite strand
bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed -d -S | gzip > wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz

# check out the files
gunzip -c wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz | head -5
chr22   31480000        39080000        my_region       0       +       1       0
chr22   31480000        39080000        my_region       0       +       2       0
chr22   31480000        39080000        my_region       0       +       3       0
chr22   31480000        39080000        my_region       0       +       4       0
chr22   31480000        39080000        my_region       0       +       5       0

gunzip -c wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz | head -5
chr22   31480000        39080000        my_region       0       +       1       0
chr22   31480000        39080000        my_region       0       +       2       0
chr22   31480000        39080000        my_region       0       +       3       0
chr22   31480000        39080000        my_region       0       +       4       0
chr22   31480000        39080000        my_region       0       +       5       0

Now to load the data into R.

# to save memory use colClasses and NULLs
# this is useful because I'm just interested
# in a single column
pos <- read.table("wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_pos.tsv.gz",
                  header=FALSE,
                  colClasses=c(rep('NULL',7),NA)
                  )

# check it out
head(pos)
  V8
1  0
2  0
3  0
4  0
5  0
6  0

# some sanity checking
table(pos$V8>0)

  FALSE    TRUE 
7576926   23074

neg <- read.table("wgEncodeRikenCageHchCellPapAlnRep1_chr22_my_region_neg.tsv.gz",
                  header=FALSE,
                  colClasses=c(rep('NULL',7),NA)
                  )

# check it out
head(neg)
  V8
1  0
2  0
3  0
4  0
5  0
6  0

# sanity checking
table(neg$V8>0)

  FALSE    TRUE 
7576354   23646

The data has loaded into R as simple data frames; now to produce some plots.

names(pos) <- 'coverage'
names(neg) <- 'coverage'
pos$position <- 31480001:39080000
neg$position <- 31480001:39080000
neg$coverage <- neg$coverage * -1

max(pos$coverage)
[1] 4232

min(neg$coverage)
[1] -1633

# I swear that one day I will learn to use ggplot2
# so that I don't have to create a plot in the manner below
plot(pos$coverage,
     type='h',
     ylim=c(-2000,4500),
     col='blue',
     ylab='Coverage',
     xlab='Position',
     xaxt='n',
     main='CAGE coverage at chr22:31480000-39080000'
     )

# plot the coverage on the negative strand
lines(neg$coverage,
      type='h',
      col='red'
      )

# generate my x-axis
l <- length(pos$position)
axis(1,
     at = seq(from = 0, to = l, by = l/4),
     labels = seq(from = head(pos$position,n=1)-1,
                  to=tail(pos$position, n=1),
                  by = l/4
                  )
     )

cage_coverage

Here's the density plot from the old post:

coverage_on_region_of_interest

Conclusion

If you examine the two plots (the density plot vs. the actual coverage plot) you can see that they roughly resemble each other, however, the coverage plot from this post shows the coverage per base for a region of interest. I had to predetermine the region (as a BED file) when I used BEDTools mainly because if I calculated the per base coverage for an entire chromosome and tried to load that into R, I would probably run out of memory (even when I'm working with the smallest chromosome). There should be much smarter ways of storing coverage data in R (compared to what I've done) and one day I'll find out. For this post, plotting roughly 7.5 million items twice (once for the positive and once for the negative strand) took a couple of minutes on my MacBook Pro (Retina, 13-inch, Early 2015), which has 8 gigs of RAM and a 2.9 GHz Intel Core i5 processor.

Extra: exome coverage

One key statistic that is often reported in whole exome sequencing studies is the percentage coverage at x fold; for example 80% of bases were covered at >20x. Unfortunately, I can only demonstrate how to calculate this with some local data as I'm not aware of publicly available exome data sets (if someone lets me know, I can update this post with the publicly available data set). The data I'm working with was generated using the Illumina TruSeq Exome pipeline, which I found out, no longer exists.

# some simple stats on the BAM file
samtools flagstat blah.bam 
108229305 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
107498562 + 0 mapped (99.32%:-nan%)
108229305 + 0 paired in sequencing
54052752 + 0 read1
54176553 + 0 read2
106976470 + 0 properly paired (98.84%:-nan%)
107422932 + 0 with itself and mate mapped
75630 + 0 singletons (0.07%:-nan%)
292196 + 0 with mate mapped to a different chr
203728 + 0 with mate mapped to a different chr (mapQ>=5)

# running the coverage as per above
# uses A LOT of memory
# bedtools coverage -a TruSeq-Exome-Targeted-Regions.bed -b blah.bam -bed -d | gzip > blah.tsv.gz
# convert the BAM file into a BED file first
bedtools bamtobed -i blah.bam | gzip > blah.bed.gz
gunzip -c blah.bed.gz | wc -l
107498562

# calculate the coverage using the BED file
# which uses much less memory
bedtools coverage -a TruSeq-Exome-Targeted-Regions.bed -b blah.bed.gz -d | gzip > coverage.tsv.gz

# check out the coverage file
gunzip -c coverage.tsv.gz | head
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	1	1064
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	2	1056
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	3	1049
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	4	1040
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	5	1043
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	6	1036
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	7	1034
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	8	1034
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	9	1027
chr1	14362	14829	chr1:14363-14829:WASH5P	467	+	10	1017

# how many base pairs does TruSeq-Exome-Targeted-Regions.bed cover?
gunzip -c coverage.tsv.gz | wc -l
62085295

# how many bases have greater than 20x coverage?
gunzip -c coverage.tsv.gz | awk '$8>20 {print}' | wc -l
54231134

# calculate the percentage
bc -l<<<54231134*100/62085295
87.34940214103838920311

# how many bases have greater than 30x coverage?
gunzip -c AFS-0005.dedup.realign.base.txt.gz | awk '$8>30 {print}' | wc -l
50320381

bc -l<<<50320381*100/62085295
81.05040170945471065249
Print Friendly



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
15 comments Add yours
  1. Hi, since you are doing a histogram type of plot, why import all the 0 coverage? I usually filter my data prior to importing. This does mean you need to import the index column. Comparisons are also a little more work as you need to check for existence of the row rather than just compare rows, but you get more in.

    1. Yes there are definitely much smarter ways than what I propose in the post. The advantage is that it's straightforward.

  2. Error: unable to open file or unable to determine types for file my_region.bed

    ./bedtools coverage -a my_region.bed -b wgEncodeRikenCageHchCellPapAlnRep1.bam -bed

    I don't think the script works...

  3. Hi,

    I tried your method but my all values show 0 coverage . Can you briefly describe how you have created a bed file of region of interest. Actually I have a bam file I have to extract chr 18 within 56517463 -5623469 range from bam file . I have tried extracting this region and save as myregion.bed file and I am trying to find out a coverage of this region with my genome. Please help me how can I do that.

    1. Hi Sima,

      for the BED file you can use any text editor you want; I used Vim. Make sure each column is tab-delimited.

      Cheers,

      Dave

  4. Hi,

    I really appreciate you writing this code. I tried to use the same code using bedtools2, however, I have a short viral reference genome (130kbp), and need the coverage for the whole region. Hence I don't need to specify a region as in your my.region.bed file. I cannot figure out how to plot the whole region with the coverage. I know what the plot should look like from genious and the outcome from running the above code on my file does not look like it. Any suggestions? How do I specify the whole region?

    1. Thanks for your reply.

      I was unsure of how to specify this region since I wanted the whole dataset. I tried what you wrote above and get the following warning.

      ***** WARNING: File text.bed has inconsistent naming convention for record:
      chr 0 134000 my_region 0 +

      1. For "chr" use the name of your viral reference sequence/s that you mapped the reads to. For example, for my work, the reference sequence names are chr1, chr2, chr3, ... for hg19.

  5. I'm having problems with the -bed command on the line;

    # calculate the coverage in this region
    bedtools coverage -a my_region.bed -b gEncodeRikenCageHchCellPapAlnRep1.bam -bed

    The .bed file reads correctly but when I run this line:
    bedtools coverage -a ChrA1x.bed -b MM091AllConcat.sorted.rmdup.readsadded.1.bam -bed

    I get: Error: Unrecognized parameters -bed

    Any suggestion would be hugely appreciated as I'm a noob 🙂

    1. We all get started somewhere 🙂 What version of bedtools are you using? Run:

      bedtools --version

      I just downloaded and compiled the latest version (bedtools v2.26.0-19-g6bf23c4) and at least the example in the post works.

  6. Thanks for this post!

    I would like to mention that one might have to take care of the pos and neg coverage fields, when the bed file used (in the bedtools step) is composed of a mixture of positive and negative stranded genes.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.