I was reading through the bedtools jaccard documentation when I saw the reference “Exploring Massive, Genome Scale Datasets with the GenometriCorr Package”. Firstly for those wondering what the Jaccard index is, it’s a simple metric that is defined as so:

$$!J(A,B) = \frac{| A \cap B |}{| A \cup B |}$$

The numerator is the number of intersections between A and B, and the denominator is the union of A and B. If you wanted a metric to summarise the amount of intersection between two sets of features, check out bedtools jaccard. Back to the topic, I decided to check out the paper, as the title attracted me and I wanted to learn more about the GenometriCorr package. It turns out that it’s an R package.

The package performs four statistical tests based on two sets of genome features: a query set and a reference set. These statistical tests aim to identify spatial correlations between the two sets of genome features. The basic premise is that if two types of features are correlated, then the locations of the features with respect to each other, will follow a detectable pattern, such as being consistently nearby, far away, or overlapping. If on the other hand they are independent, their locations will be randomly positioned with respect to each other. Here’s a figure taken from the documentation:

*The orange oval on the left is the reference feature and the black bar on the right is the midway point. The package checks the distribution of query features between these two points. The first example shows that the query features are positioned at a fixed distance, indicated by the denser colour; the second example shows that the query features are far away from the reference; the third example shows a uniform distribution of features; and the last example shows that query features are near the reference*.

The first statistical test checks the distribution of relative distances of query features to reference features. The second test checks the absolute distance between the two. The third is a projection test, which calculates the probability of a query overlapping a reference (the probability is based on the coverage of the reference features), and compares the actual intersection of the two features to what is expected based on the probability. The fourth test, is based on the Jaccard index and uses random permutations to check for significance in the amount of overlap. For more information, read the vignette or the publication or contact the author. Let’s get started by installing the package and its dependencies:

source("http://bioconductor.org/biocLite.R") biocLite() biocLite('rtracklayer') biocLite("IRanges") biocLite("GenomicRanges") install.packages('gplots') install.packages('GenometriCorr',repos='http://genometricorr.sourceforge.net/R/',type='source')

Now let’s follow the example in the vignette:

#load the package library(GenometriCorr) #load rtracklayer for importing data library(rtracklayer) refseq <- as(import( system.file("extdata","UCSCrefseqgenes_hg19.bed", package = "GenometriCorr") ),"RangedData") head(refseq) #RangedData with 6 rows and 1 value column across 47 spaces # space ranges | strand # <factor> <IRanges> | <Rle> #1 chr1 [66999825, 67210768] | * #2 chr1 [ 8384390, 8404227] | * #3 chr1 [16767167, 16786584] | * #4 chr1 [16767167, 16786584] | * #5 chr1 [16767167, 16786584] | * #6 chr1 [48998527, 50489626] | * cpgis <- as(import( system.file("extdata", "UCSCcpgis_hg19.bed", package = "GenometriCorr") ),"RangedData") #where are these files located? system.file(package = "GenometriCorr") #[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr"

The files should just be BED files, judging from the file extension, but let’s check them out anyway:

cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr/extdata/UCSCrefseqgenes_hg19.bed | head #chrom txStart txEnd chr1 66999824 67210768 chr1 8384389 8404227 chr1 16767166 16786584 chr1 16767166 16786584 chr1 16767166 16786584 chr1 48998526 50489626 chr1 33546713 33585995 chr1 25071759 25170815 chr1 92145899 92351836 cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/GenometriCorr/extdata/UCSCcpgis_hg19.bed | head #chrom chromStart chromEnd chr1 28735 29810 chr1 135124 135563 chr1 327790 328229 chr1 437151 438164 chr1 449273 450544 chr1 533219 534114 chr1 544738 546649 chr1 713984 714547 chr1 762416 763445

So they are simply BED files; back to the example:

#define hg19 human.chrom.length <- c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747,135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566, 59373566, 155270560) names(human.chrom.length) <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrY", "chrX") VisualiseTwoIRanges(cpgis["chr1"]$ranges, refseq["chr1"]$ranges, nameA = "CpG Islands", nameB = "RefSeq Genes", chrom_length = human.chrom.length[["chr1"]], title = "CpGIslands and RefGenes on chr1 of Hg19")

*Not sure if this plot is useful or not, given the large distance*.

All four statistical tests are carried out using the GenometriCorrelation() function:

#number of permutations, these are the default numbers, anyway pn.area <- 100 pn.dist <- 100 pn.jacc <- 100 cpgi_to_genes <- GenometriCorrelation(cpgis, refseq, chromosomes.length = human.chrom.length, chromosomes.to.proceed = c("chr1", "chr2", "chr3"), ecdf.area.permut.number = pn.area, mean.distance.permut.number = pn.dist, jaccard.measure.permut.number = pn.jacc, keep.distributions = TRUE, showProgressBar = TRUE) cpgi_to_genes chr1 chr2 chr3 awhole query.population 2462 1688 1163 5313 reference.population 3847 2338 2074 8259 relative.distances.ks.p.value 5.178358e-07 0 3.401398e-06 0 relative.distances.ecdf.deviation.area 0.01850287 0.03115026 0.0202765 0.02289413 relative.distances.ecdf.area.correlation 0.07414226 0.1244852 0.08119435 0.09157381 query.reference.intersection 1202288 811160 578920 2592368 query.reference.union 106528617 99650239 88014314 294193170 jaccard.measure 0.01128606 0.008140071 0.006577566 0.008811789 projection.test.p.value 0 0 0 0 projection.test.lower.tail FALSE FALSE FALSE FALSE reference.middles Numeric,3847 Numeric,2338 Numeric,2074 NULL relative.distances.ecdf.deviation.area.p.value "<0.01" "<0.01" "<0.01" "<0.01" scaled.absolute.min.distance.sum.p.value "<0.01" "<0.01" "<0.01" "<0.01" scaled.absolute.min.distance.sum.lower.tail TRUE TRUE TRUE TRUE jaccard.measure.p.value "<0.01" "<0.01" "<0.01" "<0.01" jaccard.measure.lower.tail FALSE FALSE FALSE FALSE

Line 13 and 14 show the number of query and reference features on chr1, chr2, chr3, and chr1 + chr2 + chr3, respectively. Line 15 shows the p-value of the Kolmogorov-Smirnov test between the distribution of relative distances of query features to reference features, to a uniform distribution of relative distances. The p-value indicates that the two features we are testing are not uniformly distributed. Line 16 and 17 indicate whether the two features are closer (positive value) or further away (negative value) than expected. Line 18 and 19 shows the total number of base pairs intersected and the union of genomic features, respectively. Line 20 is the Jaccard index, which is line 18 / line 19. Line 21 is the p-value for the projection test and line 22 indicates whether there is significantly more overlap (FALSE) or less (TRUE). Line 24 and 25 are the p-values for the permutation tests between the relative and absolute distances, respectively. Line 26 indicates whether query features are closer (TRUE) or further away (FALSE) to reference features than expected. Line 27 is the p-value for the permutation test on the Jaccard index and line 28 indicates whether whether the the overlap is less (TRUE) or more (FALSE) than expected.

The package produces some plots to visualise the results:

#create plot of results graphical.report(cpgi_to_genes, pdffile = "cpg_to_refseq_dist.pdf", show.chromosomes = c("chr1"), show.all = FALSE) visualize(cpgi_to_genes, pdffile = "cpg_to_refseq_overlap.pdf", show.chromosomes = c("chr1"), show.all = FALSE)

I created a GitHub repository at https://github.com/davetang/GenometriCorr to test these scenarios:

- Two sets of features that are randomly generated
- Two sets of features that completely overlap each other
- Two sets of features that never overlap each other
- Two sets of features that never overlap and are a certain distance away from each other.

Check out the repository for more information on how I generated the tests. For those who are only interested in the results, here are the hyperlinks:

- a_bed_b_bed.out
- merged_subset_bed_merged_subset_bed.out
- merged_subset_bed_merged_comp_subset_bed.out
- 100000_window_10th_bed_100000_window_10th_3kb_bed.out

None of the p-values are significant when overlapping two randomly generated sets of features, as one would expect. As I mentioned on Twitter, I’m surprised by the number of citations:

```
```Only 20 citations!? Seems extremely useful "Exploring Massive, Genome Scale Datasets with the GenometriCorr Package" http://t.co/cLjiETExew

— Dave Tang (@davetang31) October 29, 2014

At least for my work, where I compare sets of genome features almost on a daily basis, this package is extremely useful.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Yes, GenometriCorr is an excellent tool for finding proximal relationships among genomic elements. Indeed, it is undercited, and definitely deserves better attention

Thanks for creating such an excellent test scenario, and the line-by-line explanation of the output. Genomation – next? https://github.com/BIMSBbioinfo/genomation

That looks like an excellent package! I will definitely write a blog post on it! Thanks!

I’ve used it for my paper ( will send it out sometime early next year). It is the best one in terms of easy to use, flexibility and figure quality.

I know deeptools https://github.com/fidelram/deepTools

ngsplot https://code.google.com/p/ngsplot/

can make similar figures, but Genomation is the best!

I hear that Genomation will be out in Bioconductor very soon. I will write a post when it comes out.

it is already on bioconductor http://bioconductor.org/packages/devel/bioc/html/genomation.html

Hi Dave, also see GAT for similar function http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3722528/

https://www.biostars.org/p/96170/#96175

Cool; I’ll update this post in the future to compare the two methods. Thanks!

If you are comparing sets of genome feaures on a daily basis I’d advice you to read this paper: http://dx.doi.org/10.1093/dnares/dsu044, especially part 3.1 and Table 1 to understand the possible limitations of GenometriCorr and HyperBrowser.