Updated: 2019 April 4th
From the introductory article:
The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project.
To begin, install the package.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GenomicRanges", version = "3.8")
The introduction article starts with creating a GRanges object:
The GRanges class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function.
library("GenomicRanges") gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)), strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length=10) ) class(gr) [1] "GRanges" attr(,"package") [1] "GenomicRanges" gr
The GRanges() function was used to specify a list of seqnames, their ranges, strand, score, and GC content. For the seqnames, the Rle() function is used; Rle is an abbreviation for run-length encoding, which is a way of representing and compressing data. The function saves us from typing c("chr1", "chr2", "chr2", "chr2", "chr1", "chr1", "chr3", "chr3", "chr3", "chr3"), which would also work.
The IRanges() function was used to create a vector representation of sequence ranges; 10 ranges were created and named using the first ten letters of the alphabet.
Rle() was used to indicate the strandedness of the ranges.
Metadata can also be added to a GRanges object. In the example, a score and GC content column were added.
BED to GRanges
I have created a BED file that we will use for this example.
library("GenomicRanges") data <- read.table("https://davetang.org/file/granges.bed", col.names = c('chr','start','end','id','score','strand')) data chr start end id score strand 1 chr1 66999824 67210768 NM_032291 0 + 2 chr1 33546713 33585995 NM_052998 0 + 3 chr1 48998526 50489626 NM_032785 0 - 4 chr1 16767166 16786584 NM_001145278 0 + 5 chr1 16767166 16786584 NM_001145277 0 + 6 chr1 8384389 8404227 NM_001080397 0 + bed <- with(data, GRanges(chr, IRanges(start+1, end), strand, score, refseq=id)) bed GRanges object with 6 ranges and 2 metadata columns: seqnames ranges strand | score refseq <Rle> <IRanges> <Rle> | <integer> <factor> [1] chr1 66999825-67210768 + | 0 NM_032291 [2] chr1 33546714-33585995 + | 0 NM_052998 [3] chr1 48998527-50489626 - | 0 NM_032785 [4] chr1 16767167-16786584 + | 0 NM_001145278 [5] chr1 16767167-16786584 + | 0 NM_001145277 [6] chr1 8384390-8404227 + | 0 NM_001080397 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths # fetch metadata elementMetadata(bed) DataFrame with 6 rows and 2 columns score refseq <integer> <factor> 1 0 NM_032291 2 0 NM_052998 3 0 NM_032785 4 0 NM_001145278 5 0 NM_001145277 6 0 NM_001080397
The with() function used to create the GRanges object is nice because it saves us some typing.
# using with() bed <- with(data, GRanges(chr, IRanges(start+1, end), strand, score, refseq=id)) # without with() bed2 <- GRanges(data$chr, IRanges(data$start+1, data$end), data$strand, score = data$score, refseq = data$id) # comparing the two GRanges object identical(bed, bed2) [1] TRUE
I created another BED file to demonstrate how we can intersect ranges.
data2 <- read.table("https://davetang.org/file/granges2.bed", col.names = c('chr','start','end','id','score','strand')) data2 chr start end id score strand 1 chr1 66999822 66999823 1_no 0 + 2 chr1 33585996 33585997 2_no 0 + 3 chr1 16786583 16786584 3_yes 0 + 4 chr1 8384389 8384390 4_yes 0 + bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id)) bed2 GRanges object with 4 ranges and 2 metadata columns: seqnames ranges strand | score refseq <Rle> <IRanges> <Rle> | <integer> <factor> [1] chr1 66999823 + | 0 1_no [2] chr1 33585997 + | 0 2_no [3] chr1 16786584 + | 0 3_yes [4] chr1 8384390 + | 0 4_yes ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths intersect(bed, bed2) GRanges object with 2 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 8384390 + [2] chr1 16786584 + ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
However I lose the metadata from both BED files. Thank to a suggestion by Daniel (see comments), we can use the subsetByOverlaps() function.
subsetByOverlaps(bed, bed2) GRanges object with 3 ranges and 2 metadata columns: seqnames ranges strand | score refseq <Rle> <IRanges> <Rle> | <integer> <factor> [1] chr1 16767167-16786584 + | 0 NM_001145278 [2] chr1 16767167-16786584 + | 0 NM_001145277 [3] chr1 8384390-8404227 + | 0 NM_001080397 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
The 3 RefSeqs that overlapped with features in bed2 are returned.
I have written a very simple package that can read in BED-like files into R as a GRanges object.
library(devtools) install_github('davetang/bedr') library(bedr) # download the example bed file into /tmp download.file(url = "https://davetang.org/file/granges2.bed", destfile = "/tmp/test.bed") my_bed <- bed_to_granges("/tmp/test.bed", header = FALSE) my_bed GRanges object with 4 ranges and 2 metadata columns: seqnames ranges strand | id score <Rle> <IRanges> <Rle> | <character> <integer> [1] chr1 66999822-66999823 + | 1_no 0 [2] chr1 33585996-33585997 + | 2_no 0 [3] chr1 16786583-16786584 + | 3_yes 0 [4] chr1 8384389-8384390 + | 4_yes 0 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
Matching the overlaps
In the previous example, subsetByOverlaps() does not indicate the matching features in the second GRanges object. In this section, we will demonstrate the use of the GenomicRanges functions countOverlaps(), findOverlaps(), queryHits(), and subjectHits().
library(GenomicRanges) # creating the same GRanges object but without a BED file data_grange <- GRanges(seqnames = rep('chr1',6), ranges = IRanges(start = c(66999824,33546713,48998526,16767166,16767166,8384389), end = c(67210768,33585995,50489626,16786584,16786584,8404227)), strand = c('+','+','-','+','+','+'), score = rep(0,6), refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397') ) data_grange GRanges object with 6 ranges and 2 metadata columns: seqnames ranges strand | score refseq <Rle> <IRanges> <Rle> | <numeric> <character> [1] chr1 66999824-67210768 + | 0 NM_032291 [2] chr1 33546713-33585995 + | 0 NM_052998 [3] chr1 48998526-50489626 - | 0 NM_032785 [4] chr1 16767166-16786584 + | 0 NM_001145278 [5] chr1 16767166-16786584 + | 0 NM_001145277 [6] chr1 8384389-8404227 + | 0 NM_001080397 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths # 3 overlapping snps snp_grange <- GRanges(seqnames = rep('chr1',3), ranges = IRanges(start = c(66999900,33546793,8384389), end = c(66999901,33546794,8384390)), strand = c('+','+','+'), id = c('snp1','snp2','snp3'), score = rep(0,3) ) snp_grange GRanges object with 3 ranges and 2 metadata columns: seqnames ranges strand | id score <Rle> <IRanges> <Rle> | <character> <numeric> [1] chr1 66999900-66999901 + | snp1 0 [2] chr1 33546793-33546794 + | snp2 0 [3] chr1 8384389-8384390 + | snp3 0 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths # count the overlaps countOverlaps(data_grange, snp_grange) [1] 1 1 0 0 0 1 # find the overlaps my_overlaps <- findOverlaps(query = snp_grange, subject = data_grange) my_overlaps Hits object with 3 hits and 0 metadata columns: queryHits subjectHits <integer> <integer> [1] 1 1 [2] 2 2 [3] 3 6 ------- queryLength: 3 / subjectLength: 6 # create the match my_query <- queryHits(my_overlaps) my_subject <- subjectHits(my_overlaps) data.frame(subject = data_grange[my_subject]$refseq, query = snp_grange[my_query]$id) subject query 1 NM_032291 snp1 2 NM_052998 snp2 3 NM_001080397 snp3
The final data frame shows the SNPs that overlapped the RefSeq genes.
See also
subsetByOverlaps to keep info from both GRanges objects
My older post on intersectBed.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
You may want to look at the subsetByOverlaps method, which returns a GRanges object and retains all of the metadata from the query object. You’ll just want to read the IRanges documentation to adjust your overlap method as required.
Hi Daniel,
Thanks for the tip! That was exactly what I wanted and I updated the post accordingly.
Cheers,
Dave
This one was very helpful. It saved a lot of my time. Cheers
Thanks for your post. I have to add this function, because it is an extremely useful add-on for the GRange package.
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rangeoverlapper.R
Hi there,
There is a bug in your code. BED is 0-based, whereas GRanges are 1-based.
See: https://www.biostars.org/p/84686/
Fortunately these days the GenomicRanges library has a nice data.frame wrapper, makeGRangesFromDataFrame(). You can keep the metadata by setting ‘keep.extra.columns = TRUE’.
Hi Jason,
thanks for the comment! I updated the code to convert the 0-based BED starts into 1-based starts for the GRanges object. This is by far the most common bioinformatics mistake and I should be more careful about it.
Cheers,
Dave
Hi Dave,
I also noticed this long time ago. say you have a bed file which is 0 based, if you use read.table to read in the data, you have to add 1 to start to make it 1 based. However, if you use import function from rtracklayer and specify format = “BED”, it will automatically add 1 to the start.
Very useful post! Thank you and to your useful commentators as well!