Creating a coverage plot in R

Disclaimer (2015 August 5th): as pointed out in this comment thread below, this post created a density plot rather than a coverage plot. I have written a new post that uses BEDTools to calculate the coverage and R to produce an actual coverage plot.

I’ve recently discovered GitHub Gist, so for this post I’m going to use that to host my code (and all subsequent posts as I see fit). The code was not displaying properly due to some CSS property of the Twenty Ten theme, so I had to update my WordPress theme to Twenty Eleven, which also led me to changing my header image. The photo I used for the header was a shot I took at the summit of Mount Fuji around 06:00 on the 24th August 2013, when the clouds finally cleared a little; I hiked all night to make it to the top to see sunrise but unfortunately the weather was terrible. The photo looks nice, so I thought I’ll use it as the header.

Anyway back to the topic; I wanted to create a coverage plot of mapped reads starting from a BAM file. So far I’ve been using IGV’s coverage track to get a visual idea of the coverage. In the past, I’ve also used bedtools genomecov to generate bedGraph files and the subsequent wig and bigWig files that I would then visualise on the UCSC Genome Browser. How about creating a coverage plot in R (so that I can export it as a postscript file)? Yeah sure, why not. Let’s download a BAM file as an example:

#the smallest CAGE BAM file from ENCODE
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHchCellPapAlnRep1.bam

First let’s read in a BAM file using Bioconductor’s Rsamtools and convert it into a data frame (since I’m more familiar with data frames):

Now that we’ve stored our BAM file into a data.frame we can make our coverage plot.

cage_density_plotI’m not sure how informative this is, but here it be.

If you want a histogram looking image instead:

plot(chr22_pos_density,
     ylim = range(c(chr22_neg_density$y, chr22_pos_density$y)),
     main = "Coverage plot of mapped CAGE reads",
     xlab = "Chromosome 22",
     col = 'blue',
     type='h'
     )

lines(chr22_neg_density, type='h', col = 'red')

cage_density_plot_histo

If you want to focus on a specific region, follow the steps above until you create the two vectors chr22_neg and chr22_pos. Remember that these vectors store the starting position of all the reads mapped onto chromosome 22. So to focus on a specific region we just need to shrink or subset this vector:

#check our two vectors
length(chr22_neg)
#[1] 24413
length(chr22_pos)
#[1] 32013

#what is actually inside these two vectors?
head(chr22_pos)
[1] 16096446 16124065 16147449 16165745 16339913 16364755

#get a summary of the mapped positions
summary(chr22_pos)
#    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#16100000 31480000 35790000 35040000 39080000 51240000

#say I am interested in region 31480000 to 39080000
lower <- 31480000
upper <- 39080000

chr22_pos_interest <- chr22_pos[chr22_pos > lower & chr22_pos < upper]
#check how many entries we have
length(chr22_pos_interest)
#[1] 16220

#do the same for the negative strand
chr22_neg_interest <- chr22_neg[chr22_neg > lower & chr22_neg < upper]
length(chr22_neg_interest)
#[1] 7744

#now continue with the code above
#but with our two new vectors of interest
chr22_neg_density <- density(chr22_neg_interest)
chr22_pos_density <- density(chr22_pos_interest)
 
#display the negative strand with negative values
chr22_neg_density$y <- chr22_neg_density$y * -1
 
plot(chr22_pos_density,
     ylim = range(c(chr22_neg_density$y, chr22_pos_density$y)),
     main = "Coverage plot of mapped CAGE reads",
     xlab = "Chromosome 22",
     col = 'blue',
     lwd=2.5,
     type='h'
)
lines(chr22_neg_density, lwd=2.5, col = 'red', type='h')

coverage_on_region_of_interest

Conclusions

So there you have it, a simple coverage plot in R. Now that the BAM file is stored as a data frame, you can perform subsets to your liking to focus on a specific region, focus on reads with a certain mapping quality, etc. and produce all kinds of density/coverage plots. You can also display several densities together by following the same steps and adding more “lines” and by using different colours.

And GitHub Gist is super awesome.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
42 comments Add yours
  1. Hey Dave great tutorial!!!
    It helps a lot.
    However, this part “bam_df$flag == 16” will not work always since different combinations gives “being reverse complemented”.
    Did you found any solution for that?
    Thanks

    1. Hey David!

      Thanks for the comment and sorry about the not so robust code. I updated the code to convert the BAM flag into a binary, and checks whether a read is mapped on the negative or positive strand via two new functions.

      I checked the code on the test BAM file used for this post and there weren’t any issues.

      Cheers,

      Dave

    1. The two vectors, chr22_pos and chr22_neg, contain all the starting positions of the mapped reads on chromosome 22. You can subset those vectors to contain only the positions you’re interested in, i.e. focusing on a specific locus. For example:

      test < - 1:10; want <- test[test > 2 & test < 8]

      1. Thanks Davo. This is what I tried. I used

        test <- 1:10;
        chr9_neg_density <- density(chr9_neg[test])
        chr9_pos_density <- density(chr9_pos[test])

        It generates an error saying x contains missing values.

        Can you tell me what am I doing wrong?

        I am a novice in R.

  2. Hi Davo,

    This looks great! I’ve been looking for a way to plot coverage across an entire genome for a human WGS sequenced at 30X+ coverage. Before I bury myself in it too much do you think your script will be able to do that?

    I tried quickly the first step of converting a 98GB .bam file but R kicked me off after a long while, i’m thinking the bam might be too big?

    How big was your example bam and how long did the various steps take (roughly)?

    Cheers,

    Nick

    1. Hi Nick,

      the bam file in the example is 107M and the entire process from reading in the file to the plot didn’t take too long on my laptop; from memory I think it was roughly 5-15 minutes.

      I’m not so familiar with Rsamtools but perhaps there’s an option to read in only one chromosome of a BAM file. If not, it’s easy to subset a BAM file with SAMtools:


      samtools index wgEncodeRikenCageHchCellPapAlnRep1.bam
      samtools view -b wgEncodeRikenCageHchCellPapAlnRep1.bam chr1 > chr1.bam

      Then just read in chr1.bam into R. Hope that helps a bit.

      Cheers,

      Dave

  3. it seems that they assume you have already read in your bam file into R with the scanbam function in order to then parse by chromosome for example. However, my bam file is so large that R takes forever to even read it in.

    1. Right. Then use SAMtools to slice the bam file before reading it into R:


      samtools index wgEncodeRikenCageHchCellPapAlnRep1.bam
      samtools view -b wgEncodeRikenCageHchCellPapAlnRep1.bam 11:60000-5000000 > region_of_interest.bam

  4. Hi Davo,

    Thank you very much for this script.
    I am trying to use on my samples but when I want to use on my sample I get stuck on the function for checking positive and negative strands.
    I managed to get only one (negative) the positive I always get that alla my reads are FALSE.

    What am I doing wrong?

    Many thanks

    Andrea

  5. Hello,
    I am trying to do a coverage plot on BAM File – but for the entire genome. And my genome is not human, but viral.
    Can you give some example code as to how i can use the BAM dataframe for this ?

    Thanks
    Krithika

    1. The code would be exactly the same; you just need to adjust it to match your reference sequence name. If the viral genome is called hiv, then change chr22 to hiv.

      1. So you mean to say instead of “chr22”, use the viral BAM file, and use the same flag settings 16 – I should then be able to plot coverage for negative and positive strand of the entire genome.

      2. Tried it – Error ed out when I tried to calculate “chr22_neg” and “chr22_pos”. Error is shown below:
        Error in bam_df[bam_df & apply(as.data.frame(bam_df$flag), 1, check_neg), : error in evaluating the argument ‘i’ in selecting a method for function ‘[‘: Error in bam_df & apply(as.data.frame(bam_df$flag), 1, check_neg) : operations are possible only for numeric, logical or complex types

        1. It’s a bit hard to troubleshoot without knowing what your file looks like. If you can share your BAM file (over Dropbox or something else), then I can have a look.

          1. Hello,
            Thanks for your reply. I understand its hard to troubleshoot. Unfortunately I am not allowed to share the BAM file.

  6. Thanks Dave for the tutorial it was really helpful to get into drawing coverage plots with R!
    Works like a charm!

  7. Hi Dave,

    Thanks for another brilliant tutorial, I was wondering whether you knew how to plot actual coverage rather than density, I think in my case this would be more informative as in positions it’s dropping to close to 0.
    Also I have done this with HIV data, to answer the question above, it depends what your rname looks like but generally your data won’t be stranded (mine isn’t) so it’s as simple as making one file:
    name_pos <- bam_df[bam_df$rname == '(what you're reference is called in the .fast file!)', 'pos']
    name_density <- density(name_pos)
    plot(name_density,
    ylim = range(c(name_density$y)),
    main = "MY OWN TITLE",
    xlab = "MY OWN LABEL",
    col = 'blue',
    lwd=2.5)

    And then missing off the last line plotting the negative strand coverage.

    Many Thanks!

    Kat

      1. Hi Dave,

        Thank you for this post.
        My question is rather similar to Kat’s, so I’m responding in reply. I like the tutorial and am fond of using R for genomics data analysis and visualisation. However, using the suggested workflow you have outlined, it seems like a lot of positions in the genome are missing. I suspect these are base positions with zero counts?. Is there any way to get those zeroes in the data frame object, using e.g. an argument on one of the functions?

        Thanks,
        Koen

        1. Hi Koen,

          actually Kat and you raise a very important point that I should elaborate on this post; only the positions with greater than zero expression are stored in the vector. For eukaryotic transcriptome data, this approach roughly works.

          I guess if your chromosome is small, you could try filling the vector of positions with zeros and then gradually adding the expression at each position.

          Cheers,

          Dave

  8. Hello Dave, thank you for your script, I got a pretty good plot from my BAM file, but I would like to ask you some things…
    I couldn’t understand what is this “16” on “bam_df$flag == 16”
    I found your script because I am looking for some tool to get the coverage of a metagenomic dataset against a reference genome. I could build coverage plots, but I couldn’t get number, 2X, 5X, depth, breath… I am trying to use your suggestion above, genomecov, but now can’t figure out how to get a genome file…. It’s all so difficult….
    Thank you!

    1. Hi Marcele,

      the BAM flags are very confusing; so confusing that I have an entire post on the topic! Have a look at http://davetang.org/muse/2014/03/06/understanding-bam-flags/ and hopefully it makes more sense.

      The genome file can be easily obtained for species available on the UCSC Genome Browser page. But since you’re working with metagenomes, you’ll have to generate your own genome file, which is a simple two column file that looks like this:

      chrom size
      chr1 249250621
      chr2 243199373
      chr3 198022430
      chr4 191154276

      Hope that helps,

      Dave

  9. Dear David,
    Thanks for the tutorial, it’s really great.
    How to deal with complex flags like 65, 189?
    The neg_ function returns a value I can’t confirm is true, and the pos_ function always returns zero…
    Cheers,

  10. Hey Dave

    Very nice tutorial indeed
    Is it possible to combine many regions with the above script and plot the average density?
    I have a bam file from Chip-seq and want to plot the coverage over the TSS but separating positive from negative strand and as well taking into account gene directionality

    Cheers

  11. Dear Dave,
    Thank you for this post.
    I have started to work in order to map HTS reads againts viral genomes. I have prepared the bam file in Geneious and follow your script.
    Unfortunately when in the step:
    table(bam_df$rname == ‘dati.bam’ & bam_df$flag == 16)
    Only FALSE reads appeared.
    I do not know how to prepare bam files to clearly indicate what reads are positive and what negative.
    Could you give me your piece od advice?
    Best regards,

    1. Hi JM,

      can you take a look at your BAM file using SAMtools and check what the reference name is? For example in the post, the reference name is “chr22”. Perhaps your reference name is not “dati.bam”.

      Cheers,

      Dave

      1. Dear Dave,

        First of all, thank you so much for your quick response.

        I have been working with the script and finally I could find my problem (as you mentioned the name of the rows was the source of my error).

        I have another question in this way. The chart provided the density (kernel density I mean), however, I don not know how informative is it. In fact, I am not so sure about the meaning of this density. Could you help me to clarify this concept?

        I want to thank you for your detailed explanations. They make easier work in R for bioinformatics, even if the student is not so expert in programing. Thank a lot!

        Bets regards.

  12. Dear Dave,
    is there any way to plot smooth average profile over bed file in r using wig, bigwig or bam files?

    thanks much
    TA

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.