Using the GenometriCorr package

I was reading through the bedtools jaccard documentation when I saw the reference “Exploring Massive, Genome Scale Datasets with the GenometriCorr Package”. Firstly for those wondering what the Jaccard index is, it’s a simple metric that is defined as so:

$$!J(A,B) = \frac{| A \cap B |}{| A \cup B |}$$

The numerator is the number of intersections between A and B, and the denominator is the union of A and B. If you wanted a metric to summarise the amount of intersection between two sets of features, check out bedtools jaccard. Back to the topic, I decided to check out the paper, as the title attracted me and I wanted to learn more about the GenometriCorr package. It turns out that it’s an R package.

The package performs four statistical tests based on two sets of genome features: a query set and a reference set. These statistical tests aim to identify spatial correlations between the two sets of genome features. The basic premise is that if two types of features are correlated, then the locations of the features with respect to each other, will follow a detectable pattern, such as being consistently nearby, far away, or overlapping. If on the other hand they are independent, their locations will be randomly positioned with respect to each other. Here’s a figure taken from the documentation:

ref_vs_queryThe orange oval on the left is the reference feature and the black bar on the right is the midway point. The package checks the distribution of query features between these two points. The first example shows that the query features are positioned at a fixed distance, indicated by the denser colour; the second example shows that the query features are far away from the reference; the third example shows a uniform distribution of features; and the last example shows that query features are near the reference.

The first statistical test checks the distribution of relative distances of query features to reference features. The second test checks the absolute distance between the two. The third is a projection test, which calculates the probability of a query overlapping a reference (the probability is based on the coverage of the reference features), and compares the actual intersection of the two features to what is expected based on the probability. The fourth test, is based on the Jaccard index and uses random permutations to check for significance in the amount of overlap. For more information, read the vignette or the publication or contact the author. Let’s get started by installing the package and its dependencies:

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite('rtracklayer')
biocLite("IRanges")
biocLite("GenomicRanges")
install.packages('gplots')
install.packages('GenometriCorr',repos='http://genometricorr.sourceforge.net/R/',type='source')

Now let’s follow the example in the vignette:

#load the package
library(GenometriCorr)

#load rtracklayer for importing data
library(rtracklayer)
refseq <- as(import(
  system.file("extdata","UCSCrefseqgenes_hg19.bed", package = "GenometriCorr")
  ),"RangedData")

head(refseq)
#RangedData with 6 rows and 1 value column across 47 spaces
#     space               ranges | strand
#  <factor>            <IRanges> |  <Rle>
#1     chr1 [66999825, 67210768] |      *
#2     chr1 [ 8384390,  8404227] |      *
#3     chr1 [16767167, 16786584] |      *
#4     chr1 [16767167, 16786584] |      *
#5     chr1 [16767167, 16786584] |      *
#6     chr1 [48998527, 50489626] |      *

cpgis <- as(import(
  system.file("extdata", "UCSCcpgis_hg19.bed", package = "GenometriCorr")
  ),"RangedData")

#where are these files located?
system.file(package = "GenometriCorr")
#[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr"

The files should just be BED files, judging from the file extension, but let’s check them out anyway:

cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr/extdata/UCSCrefseqgenes_hg19.bed | head
#chrom	txStart	txEnd
chr1	66999824	67210768
chr1	8384389	8404227
chr1	16767166	16786584
chr1	16767166	16786584
chr1	16767166	16786584
chr1	48998526	50489626
chr1	33546713	33585995
chr1	25071759	25170815
chr1	92145899	92351836

cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr/extdata/UCSCcpgis_hg19.bed | head
#chrom	chromStart	chromEnd
chr1	28735	29810
chr1	135124	135563
chr1	327790	328229
chr1	437151	438164
chr1	449273	450544
chr1	533219	534114
chr1	544738	546649
chr1	713984	714547
chr1	762416	763445

So they are simply BED files; back to the example:

#define hg19
human.chrom.length <- c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067,
                        159138663, 146364022, 141213431, 135534747,135006516, 133851895,
                        115169878, 107349540, 102531392, 90354753, 81195210, 78077248,
                        59128983, 63025520, 48129895, 51304566, 59373566, 155270560)

names(human.chrom.length) <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6",
                               "chr7", "chr8", "chr9", "chr10", "chr11", "chr12",
                               "chr13", "chr14", "chr15", "chr16", "chr17", "chr18",
                               "chr19", "chr20", "chr21", "chr22", "chrY", "chrX")

VisualiseTwoIRanges(cpgis["chr1"]$ranges,
                    refseq["chr1"]$ranges,
                    nameA = "CpG Islands",
                    nameB = "RefSeq Genes",
                    chrom_length = human.chrom.length[["chr1"]],
                    title = "CpGIslands and RefGenes on chr1 of Hg19")

cpg_refgeneNot sure if this plot is useful or not, given the large distance.

All four statistical tests are carried out using the GenometriCorrelation() function:

#number of permutations, these are the default numbers, anyway
pn.area <- 100
pn.dist <- 100
pn.jacc <- 100
cpgi_to_genes <- GenometriCorrelation(cpgis, refseq, chromosomes.length = human.chrom.length,
                                      chromosomes.to.proceed = c("chr1", "chr2", "chr3"),
                                      ecdf.area.permut.number = pn.area,
                                      mean.distance.permut.number = pn.dist,
                                      jaccard.measure.permut.number = pn.jacc,
                                      keep.distributions = TRUE, showProgressBar = TRUE)
cpgi_to_genes
                                               chr1         chr2         chr3         awhole     
query.population                               2462         1688         1163         5313       
reference.population                           3847         2338         2074         8259       
relative.distances.ks.p.value                  5.178358e-07 0            3.401398e-06 0          
relative.distances.ecdf.deviation.area         0.01850287   0.03115026   0.0202765    0.02289413 
relative.distances.ecdf.area.correlation       0.07414226   0.1244852    0.08119435   0.09157381 
query.reference.intersection                   1202288      811160       578920       2592368    
query.reference.union                          106528617    99650239     88014314     294193170  
jaccard.measure                                0.01128606   0.008140071  0.006577566  0.008811789
projection.test.p.value                        0            0            0            0          
projection.test.lower.tail                     FALSE        FALSE        FALSE        FALSE      
reference.middles                              Numeric,3847 Numeric,2338 Numeric,2074 NULL       
relative.distances.ecdf.deviation.area.p.value "<0.01"      "<0.01"      "<0.01"      "<0.01"    
scaled.absolute.min.distance.sum.p.value       "<0.01"      "<0.01"      "<0.01"      "<0.01"    
scaled.absolute.min.distance.sum.lower.tail    TRUE         TRUE         TRUE         TRUE       
jaccard.measure.p.value                        "<0.01"      "<0.01"      "<0.01"      "<0.01"    
jaccard.measure.lower.tail                     FALSE        FALSE        FALSE        FALSE

Line 13 and 14 show the number of query and reference features on chr1, chr2, chr3, and chr1 + chr2 + chr3, respectively. Line 15 shows the p-value of the Kolmogorov-Smirnov test between the distribution of relative distances of query features to reference features, to a uniform distribution of relative distances. The p-value indicates that the two features we are testing are not uniformly distributed. Line 16 and 17 indicate whether the two features are closer (positive value) or further away (negative value) than expected. Line 18 and 19 shows the total number of base pairs intersected and the union of genomic features, respectively. Line 20 is the Jaccard index, which is line 18 / line 19. Line 21 is the p-value for the projection test and line 22 indicates whether there is significantly more overlap (FALSE) or less (TRUE). Line 24 and 25 are the p-values for the permutation tests between the relative and absolute distances, respectively. Line 26 indicates whether query features are closer (TRUE) or further away (FALSE) to reference features than expected. Line 27 is the p-value for the permutation test on the Jaccard index and line 28 indicates whether whether the the overlap is less (TRUE) or more (FALSE) than expected.

The package produces some plots to visualise the results:

#create plot of results
graphical.report(cpgi_to_genes,
                 pdffile = "cpg_to_refseq_dist.pdf",
                 show.chromosomes = c("chr1"),
                 show.all = FALSE)

visualize(cpgi_to_genes,
          pdffile = "cpg_to_refseq_overlap.pdf",
          show.chromosomes = c("chr1"),
          show.all = FALSE)

I created a GitHub repository at https://github.com/davetang/GenometriCorr to test these scenarios:

  1. Two sets of features that are randomly generated
  2. Two sets of features that completely overlap each other
  3. Two sets of features that never overlap each other
  4. Two sets of features that never overlap and are a certain distance away from each other.

Check out the repository for more information on how I generated the tests. For those who are only interested in the results, here are the hyperlinks:

  1. a_bed_b_bed.out
  2. merged_subset_bed_merged_subset_bed.out
  3. merged_subset_bed_merged_comp_subset_bed.out
  4. 100000_window_10th_bed_100000_window_10th_3kb_bed.out

None of the p-values are significant when overlapping two randomly generated sets of features, as one would expect. As I mentioned on Twitter, I’m surprised by the number of citations:

At least for my work, where I compare sets of genome features almost on a daily basis, this package is extremely useful.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
8 comments Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.