Using the Bioconductor GenomicRanges package

Updated: 2019 April 4th

From the introductory article:

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project.

To begin, install the package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges", version = "3.8")

The introduction article starts with creating a GRanges object:

The GRanges class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function.

library("GenomicRanges")

gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
              ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
              strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
              score = 1:10,
              GC = seq(1, 0, length=10)
             )

class(gr)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

gr

The GRanges() function was used to specify a list of seqnames, their ranges, strand, score, and GC content. For the seqnames, the Rle() function is used; Rle is an abbreviation for run-length encoding, which is a way of representing and compressing data. The function saves us from typing c(“chr1”, “chr2”, “chr2”, “chr2”, “chr1”, “chr1”, “chr3”, “chr3”, “chr3”, “chr3”), which would also work.

The IRanges() function was used to create a vector representation of sequence ranges; 10 ranges were created and named using the first ten letters of the alphabet.

Rle() was used to indicate the strandedness of the ranges.

Metadata can also be added to a GRanges object. In the example, a score and GC content column were added.

BED to GRanges

I have created a BED file that we will use for this example.

library("GenomicRanges")

data <- read.table("https://davetang.org/file/granges.bed", col.names = c('chr','start','end','id','score','strand'))
data
   chr    start      end           id score strand
1 chr1 66999824 67210768    NM_032291     0      +
2 chr1 33546713 33585995    NM_052998     0      +
3 chr1 48998526 50489626    NM_032785     0      -
4 chr1 16767166 16786584 NM_001145278     0      +
5 chr1 16767166 16786584 NM_001145277     0      +
6 chr1  8384389  8404227 NM_001080397     0      +

bed <- with(data, GRanges(chr, IRanges(start+1, end), strand, score, refseq=id))
bed
GRanges object with 6 ranges and 2 metadata columns:
      seqnames            ranges strand |     score       refseq
         <Rle>         <IRanges>  <Rle> | <integer>     <factor>
  [1]     chr1 66999825-67210768      + |         0    NM_032291
  [2]     chr1 33546714-33585995      + |         0    NM_052998
  [3]     chr1 48998527-50489626      - |         0    NM_032785
  [4]     chr1 16767167-16786584      + |         0 NM_001145278
  [5]     chr1 16767167-16786584      + |         0 NM_001145277
  [6]     chr1   8384390-8404227      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# fetch metadata
elementMetadata(bed)
DataFrame with 6 rows and 2 columns
      score       refseq
  <integer>     <factor>
1         0    NM_032291
2         0    NM_052998
3         0    NM_032785
4         0 NM_001145278
5         0 NM_001145277
6         0 NM_001080397

The with() function used to create the GRanges object is nice because it saves us some typing.

# using with()
bed <- with(data, GRanges(chr, IRanges(start+1, end), strand, score, refseq=id))

# without with()
bed2 <- GRanges(data$chr, IRanges(data$start+1, data$end), data$strand, score = data$score, refseq = data$id)

# comparing the two GRanges object
identical(bed, bed2)
[1] TRUE

I created another BED file to demonstrate how we can intersect ranges.

data2 <- read.table("https://davetang.org/file/granges2.bed", col.names = c('chr','start','end','id','score','strand'))
data2
   chr    start      end    id score strand
1 chr1 66999822 66999823  1_no     0      +
2 chr1 33585996 33585997  2_no     0      +
3 chr1 16786583 16786584 3_yes     0      +
4 chr1  8384389  8384390 4_yes     0      +

bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id))
bed2
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |     score   refseq
         <Rle> <IRanges>  <Rle> | <integer> <factor>
  [1]     chr1  66999823      + |         0     1_no
  [2]     chr1  33585997      + |         0     2_no
  [3]     chr1  16786584      + |         0    3_yes
  [4]     chr1   8384390      + |         0    4_yes
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

intersect(bed, bed2)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   8384390      +
  [2]     chr1  16786584      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However I lose the metadata from both BED files. Thank to a suggestion by Daniel (see comments), we can use the subsetByOverlaps() function.

subsetByOverlaps(bed, bed2)
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     score       refseq
         <Rle>         <IRanges>  <Rle> | <integer>     <factor>
  [1]     chr1 16767167-16786584      + |         0 NM_001145278
  [2]     chr1 16767167-16786584      + |         0 NM_001145277
  [3]     chr1   8384390-8404227      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The 3 RefSeqs that overlapped with features in bed2 are returned.

I have written a very simple package that can read in BED-like files into R as a GRanges object.

library(devtools)
install_github('davetang/bedr')
library(bedr)

# download the example bed file into /tmp
download.file(url = "https://davetang.org/file/granges2.bed", destfile = "/tmp/test.bed")

my_bed <- bed_to_granges("/tmp/test.bed", header = FALSE)

my_bed
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |          id     score
         <Rle>         <IRanges>  <Rle> | <character> <integer>
  [1]     chr1 66999822-66999823      + |        1_no         0
  [2]     chr1 33585996-33585997      + |        2_no         0
  [3]     chr1 16786583-16786584      + |       3_yes         0
  [4]     chr1   8384389-8384390      + |       4_yes         0
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Matching the overlaps

In the previous example, subsetByOverlaps() does not indicate the matching features in the second GRanges object. In this section, we will demonstrate the use of the GenomicRanges functions countOverlaps(), findOverlaps(), queryHits(), and subjectHits().

library(GenomicRanges)
# creating the same GRanges object but without a BED file
data_grange <- GRanges(seqnames = rep('chr1',6),
                       ranges = IRanges(start = c(66999824,33546713,48998526,16767166,16767166,8384389),
                                        end = c(67210768,33585995,50489626,16786584,16786584,8404227)),
                       strand = c('+','+','-','+','+','+'),
                       score = rep(0,6),
                       refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397')
)

data_grange
GRanges object with 6 ranges and 2 metadata columns:
      seqnames            ranges strand |     score       refseq
         <Rle>         <IRanges>  <Rle> | <numeric>  <character>
  [1]     chr1 66999824-67210768      + |         0    NM_032291
  [2]     chr1 33546713-33585995      + |         0    NM_052998
  [3]     chr1 48998526-50489626      - |         0    NM_032785
  [4]     chr1 16767166-16786584      + |         0 NM_001145278
  [5]     chr1 16767166-16786584      + |         0 NM_001145277
  [6]     chr1   8384389-8404227      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# 3 overlapping snps
snp_grange <- GRanges(seqnames = rep('chr1',3),
                      ranges = IRanges(start = c(66999900,33546793,8384389),
                                       end = c(66999901,33546794,8384390)),
                      strand = c('+','+','+'),
                      id = c('snp1','snp2','snp3'),
                      score = rep(0,3)
)

snp_grange
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |          id     score
         <Rle>         <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1 66999900-66999901      + |        snp1         0
  [2]     chr1 33546793-33546794      + |        snp2         0
  [3]     chr1   8384389-8384390      + |        snp3         0
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# count the overlaps
countOverlaps(data_grange, snp_grange)
[1] 1 1 0 0 0 1

# find the overlaps
my_overlaps <- findOverlaps(query = snp_grange, subject = data_grange)
my_overlaps
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         3           6
  -------
  queryLength: 3 / subjectLength: 6

# create the match
my_query <- queryHits(my_overlaps)
my_subject <- subjectHits(my_overlaps)

data.frame(subject = data_grange[my_subject]$refseq,
           query = snp_grange[my_query]$id)
       subject query
1    NM_032291  snp1
2    NM_052998  snp2
3 NM_001080397  snp3

The final data frame shows the SNPs that overlapped the RefSeq genes.

See also

A tutorial on GenomicRanges.

subsetByOverlaps to keep info from both GRanges objects

My older post on intersectBed.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
9 comments Add yours
  1. You may want to look at the subsetByOverlaps method, which returns a GRanges object and retains all of the metadata from the query object. You’ll just want to read the IRanges documentation to adjust your overlap method as required.

  2. Hi there,

    There is a bug in your code. BED is 0-based, whereas GRanges are 1-based.
    See: https://www.biostars.org/p/84686/

    Fortunately these days the GenomicRanges library has a nice data.frame wrapper, makeGRangesFromDataFrame(). You can keep the metadata by setting ‘keep.extra.columns = TRUE’.

    1. Hi Jason,

      thanks for the comment! I updated the code to convert the 0-based BED starts into 1-based starts for the GRanges object. This is by far the most common bioinformatics mistake and I should be more careful about it.

      Cheers,

      Dave

      1. Hi Dave,

        I also noticed this long time ago. say you have a bed file which is 0 based, if you use read.table to read in the data, you have to add 1 to start to make it 1 based. However, if you use import function from rtracklayer and specify format = “BED”, it will automatically add 1 to the start.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.