OMIM IDs to gene coordinates

A post on linking OMIM IDs to gene coordinates using biomaRt; this provides a way of representing OMIM IDs on the genome. For those unfamiliar with OMIM, here's the description from the OMIM FAQ:

Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression.

Firstly, register yourself for the downloads; you will get an email almost instantaneously, which provides a link where you can download these files:

ls -1
genemap
genemap.key
genemap2.txt
mim2gene.txt
morbidmap
omim.txt.Z

The mim2gene.txt file associates OMIM IDs to HUGO gene symbols, Entrez and Ensembl gene IDs. Let's take a look:

#these are the column headers
cat mim2gene.txt | head -1
# MIM Number    Type    Gene ID Approved Gene Symbol    Ensembl Gene ID

#how many OMIM IDs are in this file
cat mim2gene.txt | grep -v "^#" | cut -f1 | wc -l
23999
#how many unique OMIM IDs
cat mim2gene.txt | grep -v "^#" | cut -f1 | sort -u | wc -l
23999

#output lines where there is an Ensembl ID
cat mim2gene.txt | grep -v "^#" | awk '$5 !~ "-" {print}' | head -5
100640  gene    216     ALDH1A1 ENSG00000165092
100650  gene/phenotype  217     ALDH2   ENSG00000111275
100660  gene    218     ALDH3A1 ENSG00000108602
100670  gene    219     ALDH1B1 ENSG00000137124
100678  gene    39      ACAT2   ENSG00000120437

#how many lines have Ensembl IDs
cat mim2gene.txt | grep -v "^#" | awk '$5 !~ "-" {FS="\t"; print}' | wc -l
14552

#save
cat mim2gene.txt | grep -v "^#" | awk '$5 !~ "-" {FS="\t"; print}' | gzip > omim_to_ensembl.tsv.gz 

To obtain the gene coordinates of the Ensembl gene IDs, we will use biomaRt. Firstly we need to load the data into R:

#load data into R
data <- read.table("omim_to_ensembl.tsv.gz",
                   header=F,
                   stringsAsFactors=F,
                   sep="\t")

#check the dimensions of data
dim(data)
[1] 14552     5

#name the columns
# MIM Number    Type    Gene ID Approved Gene Symbol    Ensembl Gene ID
colnames(data) <- c('omim',
                    'type',
                    'entrez_gene_id',
                    'hgnc_symbol',
                    'ensembl_gene_id')

#check out data
head(data)
    omim           type entrez_gene_id hgnc_symbol ensembl_gene_id
1 100640           gene            216     ALDH1A1 ENSG00000165092
2 100650 gene/phenotype            217       ALDH2 ENSG00000111275
3 100660           gene            218     ALDH3A1 ENSG00000108602
4 100670           gene            219     ALDH1B1 ENSG00000137124
5 100678           gene             39       ACAT2 ENSG00000120437
6 100690           gene           1134      CHRNA1 ENSG00000138435

Now that the data is loaded into R, we can use biomaRt.

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

#load the package
library("biomaRt")
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
filters <- listFilters(ensembl)

#these are the filters we can use
#to limit our biomaRt query
head(filters)
             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

#search for the filters that contain
#the word "ensembl"
grep(pattern = 'ensembl',
     x = filters$name,
     ignore.case = T,
     value = T)
[1] "with_ox_clone_based_ensembl_gene"       "with_ox_clone_based_ensembl_transcript"
[3] "ensembl_gene_id"                        "ensembl_transcript_id"                 
[5] "ensembl_peptide_id"                     "ensembl_exon_id"                       
[7] "clone_based_ensembl_gene_name"          "clone_based_ensembl_transcript_name"

#the attributes are the attributes
#that we want retured
attributes <- listAttributes(ensembl)
head(attributes, 10)
                    name           description
1        ensembl_gene_id       Ensembl Gene ID
2  ensembl_transcript_id Ensembl Transcript ID
3     ensembl_peptide_id    Ensembl Protein ID
4        ensembl_exon_id       Ensembl Exon ID
5            description           Description
6        chromosome_name       Chromosome Name
7         start_position       Gene Start (bp)
8           end_position         Gene End (bp)
9                 strand                Strand
10                  band                  Band

#create a vector that contains the Ensembl IDs
my_value <- data$ensembl_gene_id
#how many entries?
length(my_value)
[1] 14552
#how many unique entries?
length(unique(my_value))
[1] 14419
#store only the unique entries
my_value <- unique(my_value)

#now we can perform the biomaRt query
gene_info <- getBM(attributes= c('chromosome_name',
                                 'start_position',
                                 'end_position',
                                 'ensembl_gene_id',
                                 'strand'),
                 filters = "ensembl_gene_id",
                 value = my_value,
                 mart = ensembl)

#here's the results from our query
head(gene_info)
  chromosome_name start_position end_position ensembl_gene_id strand
1               X      100627109    100639991 ENSG00000000003     -1
2               X      100584802    100599885 ENSG00000000005      1
3              20       50934867     50958555 ENSG00000000419     -1
4               1      169849631    169894267 ENSG00000000457     -1
5               1       27612064     27635277 ENSG00000000938     -1
6               1      196651878    196747504 ENSG00000000971      1

#check the dimension of our results
#it matched the length of the my_value vector
#meaning that it could find coordinates for
#all the Ensembl gene IDs
dim(gene_info)
[1] 14419     5

Lastly, we need to connect the gene coordinates to our the OMIM to gene table.

#use the merge function to connect the data frames
merged_data <- merge(x = data,
                     y = gene_info,
                     by = 'ensembl_gene_id')

#check it out
head(merged_data)
  ensembl_gene_id   omim type entrez_gene_id hgnc_symbol chromosome_name start_position end_position strand
1 ENSG00000000003 300191 gene           7105      TSPAN6               X      100627109    100639991     -1
2 ENSG00000000005 300459 gene          64102        TNMD               X      100584802    100599885      1
3 ENSG00000000419 603503 gene           8813        DPM1              20       50934867     50958555     -1
4 ENSG00000000457 608192 gene          57147       SCYL3               1      169849631    169894267     -1
5 ENSG00000000938 164940 gene           2268         FGR               1       27612064     27635277     -1
6 ENSG00000000971 134370 gene           3075         CFH               1      196651878    196747504      1

#create a table that resembles a BED file
my_bed <- data.frame(chr=merged_data$chromosome_name,
                     start=merged_data$start_position,
                     end=merged_data$end_position,
                     id=paste(merged_data$ensembl_gene_id, merged_data$omim,sep="/"),
                     score=rep(0,nrow(merged_data)),
                     strand=merged_data$strand)

head(my_bed)
  chr     start       end                     id score strand
1   X 100627109 100639991 ENSG00000000003/300191     0     -1
2   X 100584802 100599885 ENSG00000000005/300459     0      1
3  20  50934867  50958555 ENSG00000000419/603503     0     -1
4   1 169849631 169894267 ENSG00000000457/608192     0     -1
5   1  27612064  27635277 ENSG00000000938/164940     0     -1
6   1 196651878 196747504 ENSG00000000971/134370     0      1

#perform some cosmetics
my_bed$strand <- gsub(pattern = "-1",replacement = "-",x = my_bed$strand)
my_bed$strand <- gsub(pattern = "1",replacement = "+",x = my_bed$strand)
my_bed <- my_bed[order(my_bed[,'chr'],my_bed[,'start']),]
my_bed$chr <- paste('chr', my_bed$chr, sep = '')

head(my_bed)
       chr   start     end                     id score strand
12296 chr1  944204  959309 ENSG00000188976/610770     0      -
12156 chr1  975204  982093 ENSG00000187642/615921     0      -
12219 chr1  998962 1000172 ENSG00000188290/608060     0      -
12149 chr1 1001138 1014541 ENSG00000187608/147571     0      +
12206 chr1 1020123 1056118 ENSG00000188157/103320     0      +
13248 chr1 1167092 1167202 ENSG00000207730/612091     0      +

write.table(x = my_bed,
            file = 'omim_to_ensembl.bed',
            quote = F,
            sep = "\t",
            row.names = F,
            col.names = F)

Check out the BED file

head omim_to_ensembl.bed 
chr1    944204  959309  ENSG00000188976/610770  0       -
chr1    975204  982093  ENSG00000187642/615921  0       -
chr1    998962  1000172 ENSG00000188290/608060  0       -
chr1    1001138 1014541 ENSG00000187608/147571  0       +
chr1    1020123 1056118 ENSG00000188157/103320  0       +
chr1    1167092 1167202 ENSG00000207730/612091  0       +
chr1    1167846 1167956 ENSG00000207607/612090  0       +
chr1    1169005 1169087 ENSG00000198976/612094  0       +
chr1    1203508 1206691 ENSG00000186891/603905  0       -
chr1    1211326 1214138 ENSG00000186827/600315  0       -

gzip omim_to_ensembl.bed
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. Excellent post! A couple of questions:
    – Is there an automated way to connect OMIM IDs with disease names?
    – And, from there, output sets of disease-specific genes into disease-specific files, like ‘migraine.bed’?

    And, a couple of cosmetics:
    – Code chunk “Now that the data is loaded into R, we can use biomaRt.”, line 32. Adding ‘attributes <- listAttributes(ensembl)' will make the following code working
    – Code chunk "Lastly, we need to connect the gene coordinates to our the OMIM to gene table." – lines 24 and 25 are duplicated

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.