In the past I've been manually downloading tables of data annoation and parsing them with Perl. I guess it's time to do things more elegantly. Below is code taken from the biomaRt vignette:
Note
If you are using Ubuntu and getting a "Cannot find xml2-config" problem while installing XML, a prequisite to biomaRt, try installing libxml2-dev:
sudo apt-get install libxml2-dev
To get started:
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
library("biomaRt")
listMarts()
ensembl=useMart("ensembl")
listDatasets(ensembl)
listDatasets(ensembl)$dataset
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
#building a query, requires filters, attributes and values
#listFilters shows all filters
filters = listFilters(ensembl)
head(filters)
name description
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
#listAttributes shows all attributes
attributes = listAttributes(ensembl)
head(attributes)
name description
1 ensembl_gene_id Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3 ensembl_peptide_id Ensembl Protein ID
4 ensembl_exon_id Ensembl Exon ID
5 description Description
6 chromosome_name Chromosome Name
#the getBM function is the main query function in biomaRt
#as an example, let's convert affy ids into Entrez gene names
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
The vignette contains other cool examples, which you should look at if you are interested in using biomaRt.
I will try to build a query that converts a human RefSeq id into the Entrez id
library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
filters = listFilters(ensembl)
#look for filters with RefSeq
grep("refseq", filters$name, ignore.case=T, value=T)
# [1] "with_refseq_peptide_predicted" "with_ox_refseq_genomic" "with_ox_refseq_mrna" #"with_ox_refseq_mrna_predicted"
# [5] "with_ox_refseq_ncrna" "with_ox_refseq_ncrna_predicted" "refseq_mrna" #"refseq_mrna_predicted"
# [9] "refseq_ncrna" "refseq_ncrna_predicted" "refseq_peptide" #"refseq_peptide_predicted"
#[13] "refseq_genomic"
attributes = listAttributes(ensembl)
#RefSeq for beta actin
my_refseq <- 'NM_001101'
getBM(attributes='ensembl_gene_id', filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
# ensembl_gene_id
#1 ENSG00000075624
getBM(attributes=c('ensembl_gene_id','description'), filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
# ensembl_gene_id description
#1 ENSG00000075624 actin, beta [Source:HGNC Symbol;Acc:132]
GO terms
Here I try to get GO terms for an Ensembl gene id
library("biomaRt")
ensembl<- useMart("ensembl")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
#find the GO attribute name
grep("go",attributes$name, ignore.case=T, value=T)
[1] "go_id" #the one I'm interested in
[2] "go_linkage_type"
[3] "goslim_goa_accession"
[4] "goslim_goa_description"
[5] "gorilla_ensembl_gene"
[6] "homolog_ggor__dm_stable_id_4016_r1"
[7] "gorilla_homolog_ensembl_peptide"
[8] "gorilla_chromosome"
[9] "gorilla_chrom_start"
[10] "gorilla_chrom_end"
[11] "gorilla_homolog_type"
[12] "gorilla_homolog_subtype"
[13] "gorilla_homolog_perc_id"
[14] "gorilla_homolog_perc_id_r1"
[15] "gorilla_homolog_dn"
[16] "gorilla_homolog_ds"
[17] "inter_paralog_gorilla_ensembl_gene"
[18] "inter_paralog_ggor__dm_stable_id_4016_r1"
[19] "inter_paralog_gorilla_inter_paralog_ensembl_peptide"
[20] "inter_paralog_gorilla_chromosome"
[21] "inter_paralog_gorilla_chrom_start"
[22] "inter_paralog_gorilla_chrom_end"
[23] "inter_paralog_gorilla_orthology_type"
[24] "gorilla_inter_paralog_subtype"
[25] "gorilla_inter_paralog_perc_id"
[26] "gorilla_inter_paralog_perc_id_r1"
[27] "gorilla_inter_paralog_dn"
[28] "gorilla_inter_paralog_ds"
grep("ensembl", filters$name, ignore.case=T, value=T)
[1] "with_clone_based_ensembl_gene" "with_clone_based_ensembl_transcript"
[3] "ensembl_gene_id" "ensembl_transcript_id"
[5] "ensembl_peptide_id" "ensembl_exon_id"
test <- 'ENSG00000075624'
getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl)
[1] "GO:0007411" "GO:0007409" "GO:0006457" "GO:0045214" "GO:0044267"
[6] "GO:0006928" "GO:0007596" "GO:0051592" "GO:0045216" "GO:0034332"
[11] "GO:0051084" "GO:0034329" "GO:0005737" "GO:0005829" "GO:0030424"
[16] "GO:0005856" "GO:0043234" "GO:0030529" "GO:0005625" "GO:0030863"
[21] "GO:0031941" "GO:0035267" "GO:0030016" "GO:0070688" "GO:0005515"
[26] "GO:0005524" "GO:0019901" "GO:0019894" "GO:0005200" "GO:0050998"
[31] "GO:0030957"
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("GO.db")
library("GO.db")
goooo <- getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl)
Term(goooo)
GO:0007411
"axon guidance"
GO:0007409
"axonogenesis"
GO:0006457
"protein folding"
GO:0045214
"sarcomere organization"
GO:0044267
"cellular protein metabolic process"
GO:0006928
"cellular component movement"
GO:0007596
"blood coagulation"
GO:0051592
"response to calcium ion"
GO:0045216
"cell-cell junction organization"
GO:0034332
"adherens junction organization"
GO:0051084
"'de novo' posttranslational protein folding"
GO:0034329
"cell junction assembly"
GO:0005737
"cytoplasm"
GO:0005829
"cytosol"
GO:0030424
"axon"
GO:0005856
"cytoskeleton"
GO:0043234
"protein complex"
GO:0030529
"ribonucleoprotein complex"
<NA>
NA
GO:0030863
"cortical cytoskeleton"
GO:0031941
"filamentous actin"
GO:0035267
"NuA4 histone acetyltransferase complex"
GO:0030016
"myofibril"
GO:0070688
"MLL5-L complex"
GO:0005515
"protein binding"
GO:0005524
"ATP binding"
GO:0019901
"protein kinase binding"
GO:0019894
"kinesin binding"
GO:0005200
"structural constituent of cytoskeleton"
GO:0050998
"nitric-oxide synthase binding"
GO:0030957
"Tat protein binding"
as.data.frame(Term(goooo))
Error in data.frame(`Term(goooo)` = c("axon guidance", "axonogenesis", :
row names contain missing values
#one of the GO terms is obsolete, use na.omit to circumvent
na.omit(as.data.frame(Term(goooo)))
Term(goooo)
GO:0007411 axon guidance
GO:0007409 axonogenesis
GO:0006457 protein folding
GO:0045214 sarcomere organization
GO:0044267 cellular protein metabolic process
GO:0006928 cellular component movement
GO:0007596 blood coagulation
GO:0051592 response to calcium ion
GO:0045216 cell-cell junction organization
GO:0034332 adherens junction organization
GO:0051084 'de novo' posttranslational protein folding
GO:0034329 cell junction assembly
GO:0005737 cytoplasm
GO:0005829 cytosol
GO:0030424 axon
GO:0005856 cytoskeleton
GO:0043234 protein complex
GO:0030529 ribonucleoprotein complex
GO:0030863 cortical cytoskeleton
GO:0031941 filamentous actin
GO:0035267 NuA4 histone acetyltransferase complex
GO:0030016 myofibril
GO:0070688 MLL5-L complex
GO:0005515 protein binding
GO:0005524 ATP binding
GO:0019901 protein kinase binding
GO:0019894 kinesin binding
GO:0005200 structural constituent of cytoskeleton
GO:0050998 nitric-oxide synthase binding
GO:0030957 Tat protein binding
for(go in goooo[1:5]){
print(GOTERM[[go]])
cat("--------------------------------------\n")
}
GOID: GO:0007411
Term: axon guidance
Ontology: BP
Definition: The chemotaxis process that directs the migration of an axon
growth cone to a specific target site in response to a combination of
attractive and repulsive cues.
Synonym: axon chemotaxis
Synonym: axon growth cone guidance
Synonym: axon pathfinding
Synonym: GO:0008040
Secondary: GO:0008040
--------------------------------------
GOID: GO:0007409
Term: axonogenesis
Ontology: BP
Definition: Generation of a long process of a neuron, that carries
efferent (outgoing) action potentials from the cell body towards
target cells.
Synonym: axon growth
Synonym: neuron long process generation
Synonym: GO:0007410
Secondary: GO:0007410
--------------------------------------
GOID: GO:0006457
Term: protein folding
Ontology: BP
Definition: The process of assisting in the covalent and noncovalent
assembly of single chain polypeptides or multisubunit complexes into
the correct tertiary structure.
Synonym: alpha-tubulin folding
Synonym: beta-tubulin folding
Synonym: chaperone activity
Synonym: chaperonin ATPase activity
Synonym: chaperonin-mediated tubulin folding
Synonym: co-chaperone activity
Synonym: co-chaperonin activity
Synonym: glycoprotein-specific chaperone activity
Synonym: non-chaperonin molecular chaperone ATPase activity
Synonym: protein complex assembly, multichaperone pathway
Synonym: GO:0007022
Synonym: GO:0007024
Synonym: GO:0007025
Secondary: GO:0007022
Secondary: GO:0007024
Secondary: GO:0007025
--------------------------------------
GOID: GO:0045214
Term: sarcomere organization
Ontology: BP
Definition: The myofibril assembly process that results in the
organization of muscle actomyosin into sarcomeres. The sarcomere is
the repeating unit of a myofibril in a muscle cell, composed of an
array of overlapping thick and thin filaments between two adjacent Z
discs.
Synonym: sarcomere alignment
Synonym: sarcomere organisation
Synonym: GO:0006938
Secondary: GO:0006938
--------------------------------------
GOID: GO:0044267
Term: cellular protein metabolic process
Ontology: BP
Definition: The chemical reactions and pathways involving a specific
protein, rather than of proteins in general, occurring at the level
of an individual cell. Includes cellular protein modification.
Synonym: cellular protein metabolism
--------------------------------------
#two Ensembl ids
test <- c('ENSG00000206172','ENSG00000075624')
getBM(attributes=c('ensembl_gene_id','go_id'), filters="ensembl_gene_id", values = test, mart = ensembl)
ensembl_gene_id go_id
1 ENSG00000075624 GO:0007411
2 ENSG00000075624 GO:0007409
3 ENSG00000075624 GO:0006457
4 ENSG00000075624 GO:0045214
5 ENSG00000075624 GO:0044267
6 ENSG00000075624 GO:0006928
7 ENSG00000075624 GO:0007596
8 ENSG00000075624 GO:0051592
9 ENSG00000075624 GO:0045216
10 ENSG00000075624 GO:0034332
11 ENSG00000075624 GO:0051084
12 ENSG00000075624 GO:0034329
13 ENSG00000075624 GO:0005737
14 ENSG00000075624 GO:0005829
15 ENSG00000075624 GO:0030424
16 ENSG00000075624 GO:0005856
17 ENSG00000075624 GO:0043234
18 ENSG00000075624 GO:0030529
19 ENSG00000075624 GO:0005625
20 ENSG00000075624 GO:0030863
21 ENSG00000075624 GO:0031941
22 ENSG00000075624 GO:0035267
23 ENSG00000075624 GO:0030016
24 ENSG00000075624 GO:0070688
25 ENSG00000075624 GO:0005515
26 ENSG00000075624 GO:0005524
27 ENSG00000075624 GO:0019901
28 ENSG00000075624 GO:0019894
29 ENSG00000075624 GO:0005200
30 ENSG00000075624 GO:0050998
31 ENSG00000075624 GO:0030957
32 ENSG00000206172 GO:0042744
33 ENSG00000206172 GO:0051291
34 ENSG00000206172 GO:0042542
35 ENSG00000206172 GO:0010942
36 ENSG00000206172 GO:0015671
37 ENSG00000206172 GO:0022627
38 ENSG00000206172 GO:0005833
39 ENSG00000206172 GO:0031838
40 ENSG00000206172 GO:0005515
41 ENSG00000206172 GO:0004601
42 ENSG00000206172 GO:0020037
43 ENSG00000206172 GO:0005506
44 ENSG00000206172 GO:0005344
45 ENSG00000206172 GO:0019825
46 ENSG00000206172 GO:0031720
Working with SNPs
And lastly an example taken from http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf:
library(biomaRt)
snp <- useMart("snp",dataset="hsapiens_snp")
start <- Sys.time()
out=getBM(attributes=c("refsnp_id","allele","chrom_start"),
filters=c("chr_name","start","end"),
values=list(8,148350, 158612),
mart=snp)
end <- Sys.time()
print(end - start)
Time difference of 1.544547 secs
nrow(out)
[1] 52
head(out)
refsnp_id allele chrom_start
1 rs547420070 A/C 148373
2 rs77274555 G/A 148391
3 rs567299969 T/A 148394
4 rs368076569 G/A 148407
5 rs190721891 C/G 148576
6 rs199747597 T/C 148679
tail(out)
refsnp_id allele chrom_start
47 rs369534639 C/T 157748
48 rs434724 G/C 157764
49 rs200513547 C/A 157862
50 rs376761743 C/T 157970
51 rs369198444 C/T 157978
52 rs201289114 A/G 158290
# with SNP IDs
my_snp <- c('rs547420070', 'rs77274555')
start <- Sys.time()
my_snp_detail <- getBM(attributes=c("refsnp_id","allele","chrom_start"),
filters=c("snp_filter"),
values=my_snp,
mart=snp)
end <- Sys.time()
print(end - start)
Time difference of 1.551306 secs
my_snp_detail
refsnp_id allele chrom_start
1 rs547420070 A/C 148373
2 rs77274555 G/A 148391
Gene IDs to gene symbols
Working with biologists, they will no doubt ask you for gene symbols. Here's how to convert Ensembl gene IDs to HUGO Gene Nomenclature Committee gene symbols.
library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
filters = listFilters(ensembl)
test <- 'ENSG00000118473'
getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl)
ensembl_gene_id hgnc_symbol
1 ENSG00000118473 SGIP1
test <- c('ENSG00000118473', 'ENSG00000162426')
getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl)
ensembl_gene_id hgnc_symbol
1 ENSG00000118473 SGIP1
2 ENSG00000162426 SLC45A1
getBM(attributes=c('ensembl_gene_id', "hgnc_symbol", "description"), filters = "ensembl_gene_id", values=test, mart=ensembl)
ensembl_gene_id hgnc_symbol
1 ENSG00000118473 SGIP1
2 ENSG00000162426 SLC45A1
description
1 SH3-domain GRB2-like (endophilin) interacting protein 1 [Source:HGNC Symbol;Acc:25412]
2 solute carrier family 45, member 1 [Source:HGNC Symbol;Acc:17939]
Working with older reference assemblies
# hg38
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
# last patch of hg19
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
host="grch37.ensembl.org",
path="/biomart/martservice")
# listMarts(grch37)
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes
2 ENSEMBL_MART_SNP Ensembl Variation
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation
4 ENSEMBL_MART_VEGA Vega
grep(pattern = "homo",
x = listDatasets(grch37)[,2],
ignore.case = TRUE)
[1] 31
listDatasets(grch37)[31,]
dataset description version
31 hsapiens_gene_ensembl Homo sapiens genes (GRCh37.p13) GRCh37.p13
old_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
host="grch37.ensembl.org",
path="/biomart/martservice",
dataset="hsapiens_gene_ensembl")
my_refseq <- 'NM_001101'
getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'),
filters = 'refseq_mrna',
values = my_refseq,
mart = ensembl)
ensembl_gene_id chromosome_name start_position end_position
1 ENSG00000075624 7 5527151 5563784
getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'),
filters = 'refseq_mrna',
values = my_refseq,
mart = old_mart)
Error in value[[3L]](cond) :
Request to BioMart web service failed. Verify if you are still connected to the internet. Alternatively the BioMart web service is temporarily down.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi,
great post that I use as a reference. I am wondering since you have been likely using this for quite some time to convert identifiers if you still find this to be the most convenient way. The reason I am wondering is I am running into some issues going from ensembl to entrez for mouse! Namely, of the 9600 identifiers I am trying to convert, some 800 or so do not get converted. When I paste those into MGIs online translator, about 750 of the 800 get converted. What a pita to have to use several converters and it would be nice to stay within a single environment if possible. Writing intermediate files sure clutters things up quick.
Thanks,
Bob
Hi Bob,
I wrote about converting between different databases using biomaRt and noticed that Ensembl and Entrez are quite interchangeable. Was there anything common with these 800 identifiers that couldn’t be converted? I agree that it is definitely much nicer just to stay within a single environment.
Cheers,
Dave
Thanks! This has been really helpful! I am bookmarking your website for future use!
Nice instructions. Thank you. I have actually been using biomaRt in R to convert IDs. However I only know how to convert a single column of IDs. I’ve been looking for a way to convert one particular column of IDs to another kind of IDs in a big list with several other columns. Any suggestion will be appreciated.
Are you familiar with subsetting in R? It’s actually fairly easy. Search the web for “subsetting in R”.
Thanks for the heads up. I know I could subset/select a column (e.g. refseq_mrna) from the big list and use that for “getBM(attributes)” to convert them to “ensembl_gene_id”.
The issue is how I could then put the ensembl_gene_ids back to the original list especially one refseq_mrna could be mapped to several ensembl_gene_ids. Thanks.
Something like this?
df <- data.frame(x=1:2, RefSeq=c("a","b")) lookup <- data.frame(RefSeq=c("a","b","b"), Ensembl=c('z','y','y')) merge(df, lookup, by = 'RefSeq')