Learning to use biomaRt

In the past I've been manually downloading tables of data annoation and parsing them with Perl. I guess it's time to do things more elegantly. Below is code taken from the biomaRt vignette:

Note
If you are using Ubuntu and getting a "Cannot find xml2-config" problem while installing XML, a prequisite to biomaRt, try installing libxml2-dev:

sudo apt-get install libxml2-dev

To get started:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

library("biomaRt")
listMarts()
ensembl=useMart("ensembl")
listDatasets(ensembl)
listDatasets(ensembl)$dataset
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
#building a query, requires filters, attributes and values
#listFilters shows all filters
filters = listFilters(ensembl)

head(filters)
             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

#listAttributes shows all attributes
attributes = listAttributes(ensembl)

head(attributes)
                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name

#the getBM function is the main query function in biomaRt
#as an example, let's convert affy ids into Entrez gene names
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)

The vignette contains other cool examples, which you should look at if you are interested in using biomaRt.

I will try to build a query that converts a human RefSeq id into the Entrez id


library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
filters = listFilters(ensembl)
#look for filters with RefSeq
grep("refseq", filters$name, ignore.case=T, value=T)
# [1] "with_refseq_peptide_predicted"  "with_ox_refseq_genomic"         "with_ox_refseq_mrna"            #"with_ox_refseq_mrna_predicted" 
# [5] "with_ox_refseq_ncrna"           "with_ox_refseq_ncrna_predicted" "refseq_mrna"                    #"refseq_mrna_predicted"         
# [9] "refseq_ncrna"                   "refseq_ncrna_predicted"         "refseq_peptide"                 #"refseq_peptide_predicted"      
#[13] "refseq_genomic"
attributes = listAttributes(ensembl)

#RefSeq for beta actin
my_refseq <- 'NM_001101'
getBM(attributes='ensembl_gene_id', filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
#  ensembl_gene_id
#1 ENSG00000075624
getBM(attributes=c('ensembl_gene_id','description'), filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
#  ensembl_gene_id                              description
#1 ENSG00000075624 actin, beta [Source:HGNC Symbol;Acc:132]

GO terms

Here I try to get GO terms for an Ensembl gene id


library("biomaRt")
ensembl<- useMart("ensembl")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

#find the GO attribute name
grep("go",attributes$name, ignore.case=T, value=T)
 [1] "go_id" #the one I'm interested in
 [2] "go_linkage_type"                                    
 [3] "goslim_goa_accession"                               
 [4] "goslim_goa_description"                             
 [5] "gorilla_ensembl_gene"                               
 [6] "homolog_ggor__dm_stable_id_4016_r1"                 
 [7] "gorilla_homolog_ensembl_peptide"                    
 [8] "gorilla_chromosome"                                 
 [9] "gorilla_chrom_start"                                
[10] "gorilla_chrom_end"                                  
[11] "gorilla_homolog_type"                               
[12] "gorilla_homolog_subtype"                            
[13] "gorilla_homolog_perc_id"                            
[14] "gorilla_homolog_perc_id_r1"                         
[15] "gorilla_homolog_dn"                                 
[16] "gorilla_homolog_ds"                                 
[17] "inter_paralog_gorilla_ensembl_gene"                 
[18] "inter_paralog_ggor__dm_stable_id_4016_r1"           
[19] "inter_paralog_gorilla_inter_paralog_ensembl_peptide"
[20] "inter_paralog_gorilla_chromosome"                   
[21] "inter_paralog_gorilla_chrom_start"                  
[22] "inter_paralog_gorilla_chrom_end"                    
[23] "inter_paralog_gorilla_orthology_type"               
[24] "gorilla_inter_paralog_subtype"                      
[25] "gorilla_inter_paralog_perc_id"                      
[26] "gorilla_inter_paralog_perc_id_r1"                   
[27] "gorilla_inter_paralog_dn"                           
[28] "gorilla_inter_paralog_ds"

grep("ensembl", filters$name, ignore.case=T, value=T)
[1] "with_clone_based_ensembl_gene"       "with_clone_based_ensembl_transcript"
[3] "ensembl_gene_id"                     "ensembl_transcript_id"              
[5] "ensembl_peptide_id"                  "ensembl_exon_id"

test <- 'ENSG00000075624'
getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl)
 [1] "GO:0007411" "GO:0007409" "GO:0006457" "GO:0045214" "GO:0044267"
 [6] "GO:0006928" "GO:0007596" "GO:0051592" "GO:0045216" "GO:0034332"
[11] "GO:0051084" "GO:0034329" "GO:0005737" "GO:0005829" "GO:0030424"
[16] "GO:0005856" "GO:0043234" "GO:0030529" "GO:0005625" "GO:0030863"
[21] "GO:0031941" "GO:0035267" "GO:0030016" "GO:0070688" "GO:0005515"
[26] "GO:0005524" "GO:0019901" "GO:0019894" "GO:0005200" "GO:0050998"
[31] "GO:0030957"

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("GO.db")

library("GO.db")
goooo <- getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl)
Term(goooo)
                                   GO:0007411 
                              "axon guidance" 
                                   GO:0007409 
                               "axonogenesis" 
                                   GO:0006457 
                            "protein folding" 
                                   GO:0045214 
                     "sarcomere organization" 
                                   GO:0044267 
         "cellular protein metabolic process" 
                                   GO:0006928 
                "cellular component movement" 
                                   GO:0007596 
                          "blood coagulation" 
                                   GO:0051592 
                    "response to calcium ion" 
                                   GO:0045216 
            "cell-cell junction organization" 
                                   GO:0034332 
             "adherens junction organization" 
                                   GO:0051084 
"'de novo' posttranslational protein folding" 
                                   GO:0034329 
                     "cell junction assembly" 
                                   GO:0005737 
                                  "cytoplasm" 
                                   GO:0005829 
                                    "cytosol" 
                                   GO:0030424 
                                       "axon" 
                                   GO:0005856 
                               "cytoskeleton" 
                                   GO:0043234 
                            "protein complex" 
                                   GO:0030529 
                  "ribonucleoprotein complex" 
                                         <NA> 
                                           NA 
                                   GO:0030863 
                      "cortical cytoskeleton" 
                                   GO:0031941 
                          "filamentous actin" 
                                   GO:0035267 
     "NuA4 histone acetyltransferase complex" 
                                   GO:0030016 
                                  "myofibril" 
                                   GO:0070688 
                             "MLL5-L complex" 
                                   GO:0005515 
                            "protein binding" 
                                   GO:0005524 
                                "ATP binding" 
                                   GO:0019901 
                     "protein kinase binding" 
                                   GO:0019894 
                            "kinesin binding" 
                                   GO:0005200 
     "structural constituent of cytoskeleton" 
                                   GO:0050998 
              "nitric-oxide synthase binding" 
                                   GO:0030957 
                        "Tat protein binding"
as.data.frame(Term(goooo))
Error in data.frame(`Term(goooo)` = c("axon guidance", "axonogenesis",  : 
  row names contain missing values

#one of the GO terms is obsolete, use na.omit to circumvent
na.omit(as.data.frame(Term(goooo)))
                                           Term(goooo)
GO:0007411                               axon guidance
GO:0007409                                axonogenesis
GO:0006457                             protein folding
GO:0045214                      sarcomere organization
GO:0044267          cellular protein metabolic process
GO:0006928                 cellular component movement
GO:0007596                           blood coagulation
GO:0051592                     response to calcium ion
GO:0045216             cell-cell junction organization
GO:0034332              adherens junction organization
GO:0051084 'de novo' posttranslational protein folding
GO:0034329                      cell junction assembly
GO:0005737                                   cytoplasm
GO:0005829                                     cytosol
GO:0030424                                        axon
GO:0005856                                cytoskeleton
GO:0043234                             protein complex
GO:0030529                   ribonucleoprotein complex
GO:0030863                       cortical cytoskeleton
GO:0031941                           filamentous actin
GO:0035267      NuA4 histone acetyltransferase complex
GO:0030016                                   myofibril
GO:0070688                              MLL5-L complex
GO:0005515                             protein binding
GO:0005524                                 ATP binding
GO:0019901                      protein kinase binding
GO:0019894                             kinesin binding
GO:0005200      structural constituent of cytoskeleton
GO:0050998               nitric-oxide synthase binding
GO:0030957                         Tat protein binding

for(go in goooo[1:5]){
print(GOTERM[go])
   cat("--------------------------------------\n")
}

GOID: GO:0007411
Term: axon guidance
Ontology: BP
Definition: The chemotaxis process that directs the migration of an axon
    growth cone to a specific target site in response to a combination of
    attractive and repulsive cues.
Synonym: axon chemotaxis
Synonym: axon growth cone guidance
Synonym: axon pathfinding
Synonym: GO:0008040
Secondary: GO:0008040
--------------------------------------
GOID: GO:0007409
Term: axonogenesis
Ontology: BP
Definition: Generation of a long process of a neuron, that carries
    efferent (outgoing) action potentials from the cell body towards
    target cells.
Synonym: axon growth
Synonym: neuron long process generation
Synonym: GO:0007410
Secondary: GO:0007410
--------------------------------------
GOID: GO:0006457
Term: protein folding
Ontology: BP
Definition: The process of assisting in the covalent and noncovalent
    assembly of single chain polypeptides or multisubunit complexes into
    the correct tertiary structure.
Synonym: alpha-tubulin folding
Synonym: beta-tubulin folding
Synonym: chaperone activity
Synonym: chaperonin ATPase activity
Synonym: chaperonin-mediated tubulin folding
Synonym: co-chaperone activity
Synonym: co-chaperonin activity
Synonym: glycoprotein-specific chaperone activity
Synonym: non-chaperonin molecular chaperone ATPase activity
Synonym: protein complex assembly, multichaperone pathway
Synonym: GO:0007022
Synonym: GO:0007024
Synonym: GO:0007025
Secondary: GO:0007022
Secondary: GO:0007024
Secondary: GO:0007025
--------------------------------------
GOID: GO:0045214
Term: sarcomere organization
Ontology: BP
Definition: The myofibril assembly process that results in the
    organization of muscle actomyosin into sarcomeres. The sarcomere is
    the repeating unit of a myofibril in a muscle cell, composed of an
    array of overlapping thick and thin filaments between two adjacent Z
    discs.
Synonym: sarcomere alignment
Synonym: sarcomere organisation
Synonym: GO:0006938
Secondary: GO:0006938
--------------------------------------
GOID: GO:0044267
Term: cellular protein metabolic process
Ontology: BP
Definition: The chemical reactions and pathways involving a specific
    protein, rather than of proteins in general, occurring at the level
    of an individual cell. Includes cellular protein modification.
Synonym: cellular protein metabolism
--------------------------------------

#two Ensembl ids
test <- c('ENSG00000206172','ENSG00000075624')
getBM(attributes=c('ensembl_gene_id','go_id'), filters="ensembl_gene_id", values = test, mart = ensembl)
   ensembl_gene_id      go_id
1  ENSG00000075624 GO:0007411
2  ENSG00000075624 GO:0007409
3  ENSG00000075624 GO:0006457
4  ENSG00000075624 GO:0045214
5  ENSG00000075624 GO:0044267
6  ENSG00000075624 GO:0006928
7  ENSG00000075624 GO:0007596
8  ENSG00000075624 GO:0051592
9  ENSG00000075624 GO:0045216
10 ENSG00000075624 GO:0034332
11 ENSG00000075624 GO:0051084
12 ENSG00000075624 GO:0034329
13 ENSG00000075624 GO:0005737
14 ENSG00000075624 GO:0005829
15 ENSG00000075624 GO:0030424
16 ENSG00000075624 GO:0005856
17 ENSG00000075624 GO:0043234
18 ENSG00000075624 GO:0030529
19 ENSG00000075624 GO:0005625
20 ENSG00000075624 GO:0030863
21 ENSG00000075624 GO:0031941
22 ENSG00000075624 GO:0035267
23 ENSG00000075624 GO:0030016
24 ENSG00000075624 GO:0070688
25 ENSG00000075624 GO:0005515
26 ENSG00000075624 GO:0005524
27 ENSG00000075624 GO:0019901
28 ENSG00000075624 GO:0019894
29 ENSG00000075624 GO:0005200
30 ENSG00000075624 GO:0050998
31 ENSG00000075624 GO:0030957
32 ENSG00000206172 GO:0042744
33 ENSG00000206172 GO:0051291
34 ENSG00000206172 GO:0042542
35 ENSG00000206172 GO:0010942
36 ENSG00000206172 GO:0015671
37 ENSG00000206172 GO:0022627
38 ENSG00000206172 GO:0005833
39 ENSG00000206172 GO:0031838
40 ENSG00000206172 GO:0005515
41 ENSG00000206172 GO:0004601
42 ENSG00000206172 GO:0020037
43 ENSG00000206172 GO:0005506
44 ENSG00000206172 GO:0005344
45 ENSG00000206172 GO:0019825
46 ENSG00000206172 GO:0031720

Working with SNPs

And lastly an example taken from http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf:

library(biomaRt)
snp <- useMart("snp",dataset="hsapiens_snp")
start <- Sys.time()
out=getBM(attributes=c("refsnp_id","allele","chrom_start"),
          filters=c("chr_name","start","end"),
          values=list(8,148350, 158612),
          mart=snp)
end <- Sys.time()
print(end - start)
Time difference of 1.544547 secs

nrow(out)
[1] 52

head(out)
    refsnp_id allele chrom_start
1 rs547420070    A/C      148373
2  rs77274555    G/A      148391
3 rs567299969    T/A      148394
4 rs368076569    G/A      148407
5 rs190721891    C/G      148576
6 rs199747597    T/C      148679

tail(out)
     refsnp_id allele chrom_start
47 rs369534639    C/T      157748
48    rs434724    G/C      157764
49 rs200513547    C/A      157862
50 rs376761743    C/T      157970
51 rs369198444    C/T      157978
52 rs201289114    A/G      158290

# with SNP IDs
my_snp <- c('rs547420070', 'rs77274555')

start <- Sys.time()
my_snp_detail <- getBM(attributes=c("refsnp_id","allele","chrom_start"),
                       filters=c("snp_filter"),
                       values=my_snp,
                       mart=snp)
end <- Sys.time()
print(end - start)
Time difference of 1.551306 secs

my_snp_detail
    refsnp_id allele chrom_start
1 rs547420070    A/C      148373
2  rs77274555    G/A      148391

Gene IDs to gene symbols

Working with biologists, they will no doubt ask you for gene symbols. Here's how to convert Ensembl gene IDs to HUGO Gene Nomenclature Committee gene symbols.

library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
filters = listFilters(ensembl)

test <- 'ENSG00000118473'
getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl)
  ensembl_gene_id hgnc_symbol
1 ENSG00000118473       SGIP1

test <- c('ENSG00000118473', 'ENSG00000162426')
getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl)
  ensembl_gene_id hgnc_symbol
1 ENSG00000118473       SGIP1
2 ENSG00000162426     SLC45A1

getBM(attributes=c('ensembl_gene_id', "hgnc_symbol", "description"), filters = "ensembl_gene_id", values=test, mart=ensembl)
  ensembl_gene_id hgnc_symbol
1 ENSG00000118473       SGIP1
2 ENSG00000162426     SLC45A1
                                                                             description
1 SH3-domain GRB2-like (endophilin) interacting protein 1 [Source:HGNC Symbol;Acc:25412]
2                      solute carrier family 45, member 1 [Source:HGNC Symbol;Acc:17939]

Working with older reference assemblies

# hg38
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

# last patch of hg19
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                  host="grch37.ensembl.org",
                  path="/biomart/martservice")

# listMarts(grch37)
               biomart            version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes
2     ENSEMBL_MART_SNP  Ensembl Variation
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation
4    ENSEMBL_MART_VEGA               Vega

grep(pattern = "homo",
     x = listDatasets(grch37)[,2],
     ignore.case = TRUE)
[1] 31

listDatasets(grch37)[31,]
                 dataset                     description    version
31 hsapiens_gene_ensembl Homo sapiens genes (GRCh37.p13) GRCh37.p13

old_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                    host="grch37.ensembl.org",
                    path="/biomart/martservice",
                    dataset="hsapiens_gene_ensembl")

my_refseq <- 'NM_001101'
getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'),
      filters = 'refseq_mrna',
      values = my_refseq,
      mart = ensembl)
  ensembl_gene_id chromosome_name start_position end_position
1 ENSG00000075624               7        5527151      5563784

getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'),
      filters = 'refseq_mrna',
      values = my_refseq,
      mart = old_mart)
Error in value[[3L]](cond) : 
  Request to BioMart web service failed. Verify if you are still connected to the internet.  Alternatively the BioMart web service is temporarily down.
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
12 comments Add yours
  1. Hi,
    great post that I use as a reference. I am wondering since you have been likely using this for quite some time to convert identifiers if you still find this to be the most convenient way. The reason I am wondering is I am running into some issues going from ensembl to entrez for mouse! Namely, of the 9600 identifiers I am trying to convert, some 800 or so do not get converted. When I paste those into MGIs online translator, about 750 of the 800 get converted. What a pita to have to use several converters and it would be nice to stay within a single environment if possible. Writing intermediate files sure clutters things up quick.
    Thanks,
    Bob

    1. Hi Bob,

      I wrote about converting between different databases using biomaRt and noticed that Ensembl and Entrez are quite interchangeable. Was there anything common with these 800 identifiers that couldn’t be converted? I agree that it is definitely much nicer just to stay within a single environment.

      Cheers,

      Dave

  2. Nice instructions. Thank you. I have actually been using biomaRt in R to convert IDs. However I only know how to convert a single column of IDs. I’ve been looking for a way to convert one particular column of IDs to another kind of IDs in a big list with several other columns. Any suggestion will be appreciated.

      1. Thanks for the heads up. I know I could subset/select a column (e.g. refseq_mrna) from the big list and use that for “getBM(attributes)” to convert them to “ensembl_gene_id”.
        The issue is how I could then put the ensembl_gene_ids back to the original list especially one refseq_mrna could be mapped to several ensembl_gene_ids. Thanks.

        1. Something like this?

          df <- data.frame(x=1:2, RefSeq=c("a","b"))
          lookup <- data.frame(RefSeq=c("a","b","b"), Ensembl=c('z','y','y'))
          merge(df, lookup, by = 'RefSeq')
          

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.