BED to GRanges

Updated 2015 April 6th to include the intersect_bed() function in the bedr package.

Last year I saw a post on Writing an R package from scratch and I always wanted to follow the tutorial. Yesterday while trying to make some plots using Gviz, I had some BED-like files (not supported by Gviz), which I wanted to convert into a GRanges object (supported by Gviz). I thought it would be easier if there was a function that could load BED-like files into GRanges objects. I couldn't find one so I thought I'll write a very primitive R package that contains this function. This post is on the creation of the BED to GRanges function and the R package that contains it.

Following the tutorial by Hilary, I installed and loaded the necessary packages. Then I created my package, which I named bedr:

install.packages("devtools")
library("devtools")
devtools::install_github("klutometis/roxygen")
library(roxygen2)

setwd("~")
create("bedr")

When the create() function is run, it produces some files, one of which is the DESCRIPTION file. In this file you can list basic information as well as all the dependencies your package requires; just use a comma to separate entries in "Depends." Furthermore, you can add an "Imports:" line to import required packages. Below is my minimal DESCRIPTION file.

cat DESCRIPTION 
Package: bedr
Title: BED to GRanges
Version: 0.0.0.9000
Authors@R: person("Dave", "Tang", , "me@davetang.org", role = c("aut", "cre"))
Description: Loads a BED-like file and stores it as a GRanges object.
Depends: R (>= 3.1.2)
License: Free
LazyData: true

I wrote a simple R function (with some roxygen2 formatted documentation), which simply loads a BED-like file and returns it as a GRanges object:

#' BED to GRanges
#'
#' This function loads a BED-like file and stores it as a GRanges object.
#' The tab-delimited file must be ordered as 'chr', 'start', 'end', 'id', 'score', 'strand'.
#' The minimal BED file must have the 'chr', 'start', 'end' columns.
#' Any columns after the strand column are ignored.
#' 
#' @param file Location of your file
#' @keywords BED GRanges
#' @export
#' @examples
#' bed_to_granges('my_bed_file.bed')

bed_to_granges <- function(file){
   df <- read.table(file,
                    header=F,
                    stringsAsFactors=F)

   if(length(df) > 6){
      df <- df[,-c(7:length(df))]
   }

   if(length(df)<3){
      stop("File has less than 3 columns")
   }

   header <- c('chr','start','end','id','score','strand')
   names(df) <- header[1:length(names(df))]

   if('strand' %in% colnames(df)){
      df$strand <- gsub(pattern="[^+-]+", replacement = '*', x = df$strand)
   }

   library("GenomicRanges")

   if(length(df)==3){
      gr <- with(df, GRanges(chr, IRanges(start, end)))
   } else if (length(df)==4){
      gr <- with(df, GRanges(chr, IRanges(start, end), id=id))
   } else if (length(df)==5){
      gr <- with(df, GRanges(chr, IRanges(start, end), id=id, score=score))
   } else if (length(df)==6){
      gr <- with(df, GRanges(chr, IRanges(start, end), id=id, score=score, strand=strand))
   }
   return(gr)
}

The above function is stored in a file called bed_to_granges.R, which I stored in the bedr/R directory. Next I'll create the documentation by running the document() function and I'm done!

setwd("bedr")
document()

Moving the package to GitHub

I made a new repository on GitHub and pushed all the files to it:

pwd
/Users/davetang/bedr
git init
git add -A
git commit -m 'Initial commit'
git remote add origin git@github.com:davetang/bedr.git
git push origin master

Testing the package

The reason I wrote this function was to make it easy to load broad peak files from ENCODE. It's a nine column file, where the first six columns contain the typical information of a BED file, and the last three columns contain information regarding the peaks. Let's download it to test the bedr package:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz

Now we can install the package directly from GitHub and test it by loading the broad peak file:

library(devtools)
install_github('davetang/bedr')
library(bedr)
my_file <- 'wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz'
my_granges <- bed_to_granges(my_file)
head(my_granges)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |          id     score
         <Rle>      <IRanges>  <Rle> | <character> <integer>
  [1]     chr1 [ 9932, 10575]      * |       peak1       251
  [2]     chr1 [10936, 26830]      * |       peak2       958
  [3]     chr1 [26931, 31558]      * |       peak3       958
  [4]     chr1 [31777, 33355]      * |       peak4       176
  [5]     chr1 [34204, 35247]      * |       peak5        95
  [6]     chr1 [35538, 39530]      * |       peak6       958
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

What can we do with GRanges objects?

For more information have a look at a post I wrote on the GenomicRanges package and on the Gviz package for those who are interested.

Adding more functionality to bedr

I want to add another function to the bedr package; firstly, I'll clone the repository:

git clone git@github.com:davetang/bedr.git

Now write another function:

#' Intersect BED
#'
#' This function performs an intersection between two GRanges objects
#' and returns the results in a data.frame
#' 
#' @param a First GRanges object
#' @param b Second GRanges object
#' @keywords BED GRanges
#' @export
#' @examples
#' intersect_bed(a, b)

intersect_bed <- function(a, b){
  library(GenomicRanges)
  my_hit <- findOverlaps(a, b)
  my_df  <- cbind(as.data.frame(a[queryHits(my_hit)]),
                  as.data.frame(b[subjectHits(my_hit)]))
}

Save the above function as a file called intersect_bed.R in the bedr/R directory. Next create the documentation by running the document() function.

library(roxygen2)
library(devtools)
setwd('~/github/bedr/')
document()

Now push to GitHub:

cd ~/github/bedr
git add -A
git commit -m 'intersect_bed.R'
git push origin master



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
6 comments Add yours
  1. I was just playing with your code and I noticed if fails if ‘strand’ is not specified in the input bed file. Easy enough to fix:

    if(‘strand’ %in% colnames(df)){
    df$strand <- gsub(pattern="[^+-]+", replacement = '*', x = df$strand)
    }

    1. Ah yes, I should only make the substitution when there’s strand information. I’ve updated the code, thanks!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.