In the past I've been manually downloading tables of data annoation and parsing them with Perl. I guess it's time to do things more elegantly. Below is code taken from the biomaRt vignette:
Note
If you are using Ubuntu and getting a "Cannot find xml2-config" problem while installing XML, a prequisite to biomaRt, try installing libxml2-dev:
sudo apt-get install libxml2-dev
To get started:
#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("biomaRt") library("biomaRt") listMarts() ensembl=useMart("ensembl") listDatasets(ensembl) listDatasets(ensembl)$dataset ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") #building a query, requires filters, attributes and values #listFilters shows all filters filters = listFilters(ensembl) head(filters) name description 1 chromosome_name Chromosome name 2 start Gene Start (bp) 3 end Gene End (bp) 4 band_start Band Start 5 band_end Band End 6 marker_start Marker Start #listAttributes shows all attributes attributes = listAttributes(ensembl) head(attributes) name description 1 ensembl_gene_id Ensembl Gene ID 2 ensembl_transcript_id Ensembl Transcript ID 3 ensembl_peptide_id Ensembl Protein ID 4 ensembl_exon_id Ensembl Exon ID 5 description Description 6 chromosome_name Chromosome Name #the getBM function is the main query function in biomaRt #as an example, let's convert affy ids into Entrez gene names affyids=c("202763_at","209310_s_at","207500_at") getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
The vignette contains other cool examples, which you should look at if you are interested in using biomaRt.
I will try to build a query that converts a human RefSeq id into the Entrez id
library("biomaRt") ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") filters = listFilters(ensembl) #look for filters with RefSeq grep("refseq", filters$name, ignore.case=T, value=T) # [1] "with_refseq_peptide_predicted" "with_ox_refseq_genomic" "with_ox_refseq_mrna" #"with_ox_refseq_mrna_predicted" # [5] "with_ox_refseq_ncrna" "with_ox_refseq_ncrna_predicted" "refseq_mrna" #"refseq_mrna_predicted" # [9] "refseq_ncrna" "refseq_ncrna_predicted" "refseq_peptide" #"refseq_peptide_predicted" #[13] "refseq_genomic" attributes = listAttributes(ensembl) #RefSeq for beta actin my_refseq <- 'NM_001101' getBM(attributes='ensembl_gene_id', filters = 'refseq_mrna', values = my_refseq , mart = ensembl) # ensembl_gene_id #1 ENSG00000075624 getBM(attributes=c('ensembl_gene_id','description'), filters = 'refseq_mrna', values = my_refseq , mart = ensembl) # ensembl_gene_id description #1 ENSG00000075624 actin, beta [Source:HGNC Symbol;Acc:132]
GO terms
Here I try to get GO terms for an Ensembl gene id
library("biomaRt") ensembl<- useMart("ensembl") ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") #find the GO attribute name grep("go",attributes$name, ignore.case=T, value=T) [1] "go_id" #the one I'm interested in [2] "go_linkage_type" [3] "goslim_goa_accession" [4] "goslim_goa_description" [5] "gorilla_ensembl_gene" [6] "homolog_ggor__dm_stable_id_4016_r1" [7] "gorilla_homolog_ensembl_peptide" [8] "gorilla_chromosome" [9] "gorilla_chrom_start" [10] "gorilla_chrom_end" [11] "gorilla_homolog_type" [12] "gorilla_homolog_subtype" [13] "gorilla_homolog_perc_id" [14] "gorilla_homolog_perc_id_r1" [15] "gorilla_homolog_dn" [16] "gorilla_homolog_ds" [17] "inter_paralog_gorilla_ensembl_gene" [18] "inter_paralog_ggor__dm_stable_id_4016_r1" [19] "inter_paralog_gorilla_inter_paralog_ensembl_peptide" [20] "inter_paralog_gorilla_chromosome" [21] "inter_paralog_gorilla_chrom_start" [22] "inter_paralog_gorilla_chrom_end" [23] "inter_paralog_gorilla_orthology_type" [24] "gorilla_inter_paralog_subtype" [25] "gorilla_inter_paralog_perc_id" [26] "gorilla_inter_paralog_perc_id_r1" [27] "gorilla_inter_paralog_dn" [28] "gorilla_inter_paralog_ds" grep("ensembl", filters$name, ignore.case=T, value=T) [1] "with_clone_based_ensembl_gene" "with_clone_based_ensembl_transcript" [3] "ensembl_gene_id" "ensembl_transcript_id" [5] "ensembl_peptide_id" "ensembl_exon_id" test <- 'ENSG00000075624' getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl) [1] "GO:0007411" "GO:0007409" "GO:0006457" "GO:0045214" "GO:0044267" [6] "GO:0006928" "GO:0007596" "GO:0051592" "GO:0045216" "GO:0034332" [11] "GO:0051084" "GO:0034329" "GO:0005737" "GO:0005829" "GO:0030424" [16] "GO:0005856" "GO:0043234" "GO:0030529" "GO:0005625" "GO:0030863" [21] "GO:0031941" "GO:0035267" "GO:0030016" "GO:0070688" "GO:0005515" [26] "GO:0005524" "GO:0019901" "GO:0019894" "GO:0005200" "GO:0050998" [31] "GO:0030957" #install if necessary source("http://bioconductor.org/biocLite.R") biocLite("GO.db") library("GO.db") goooo <- getBM(attributes="go_id", filters="ensembl_gene_id", values = test, mart = ensembl) Term(goooo) GO:0007411 "axon guidance" GO:0007409 "axonogenesis" GO:0006457 "protein folding" GO:0045214 "sarcomere organization" GO:0044267 "cellular protein metabolic process" GO:0006928 "cellular component movement" GO:0007596 "blood coagulation" GO:0051592 "response to calcium ion" GO:0045216 "cell-cell junction organization" GO:0034332 "adherens junction organization" GO:0051084 "'de novo' posttranslational protein folding" GO:0034329 "cell junction assembly" GO:0005737 "cytoplasm" GO:0005829 "cytosol" GO:0030424 "axon" GO:0005856 "cytoskeleton" GO:0043234 "protein complex" GO:0030529 "ribonucleoprotein complex" <NA> NA GO:0030863 "cortical cytoskeleton" GO:0031941 "filamentous actin" GO:0035267 "NuA4 histone acetyltransferase complex" GO:0030016 "myofibril" GO:0070688 "MLL5-L complex" GO:0005515 "protein binding" GO:0005524 "ATP binding" GO:0019901 "protein kinase binding" GO:0019894 "kinesin binding" GO:0005200 "structural constituent of cytoskeleton" GO:0050998 "nitric-oxide synthase binding" GO:0030957 "Tat protein binding" as.data.frame(Term(goooo)) Error in data.frame(`Term(goooo)` = c("axon guidance", "axonogenesis", : row names contain missing values #one of the GO terms is obsolete, use na.omit to circumvent na.omit(as.data.frame(Term(goooo))) Term(goooo) GO:0007411 axon guidance GO:0007409 axonogenesis GO:0006457 protein folding GO:0045214 sarcomere organization GO:0044267 cellular protein metabolic process GO:0006928 cellular component movement GO:0007596 blood coagulation GO:0051592 response to calcium ion GO:0045216 cell-cell junction organization GO:0034332 adherens junction organization GO:0051084 'de novo' posttranslational protein folding GO:0034329 cell junction assembly GO:0005737 cytoplasm GO:0005829 cytosol GO:0030424 axon GO:0005856 cytoskeleton GO:0043234 protein complex GO:0030529 ribonucleoprotein complex GO:0030863 cortical cytoskeleton GO:0031941 filamentous actin GO:0035267 NuA4 histone acetyltransferase complex GO:0030016 myofibril GO:0070688 MLL5-L complex GO:0005515 protein binding GO:0005524 ATP binding GO:0019901 protein kinase binding GO:0019894 kinesin binding GO:0005200 structural constituent of cytoskeleton GO:0050998 nitric-oxide synthase binding GO:0030957 Tat protein binding for(go in goooo[1:5]){ print(GOTERM[go]) cat("--------------------------------------\n") } GOID: GO:0007411 Term: axon guidance Ontology: BP Definition: The chemotaxis process that directs the migration of an axon growth cone to a specific target site in response to a combination of attractive and repulsive cues. Synonym: axon chemotaxis Synonym: axon growth cone guidance Synonym: axon pathfinding Synonym: GO:0008040 Secondary: GO:0008040 -------------------------------------- GOID: GO:0007409 Term: axonogenesis Ontology: BP Definition: Generation of a long process of a neuron, that carries efferent (outgoing) action potentials from the cell body towards target cells. Synonym: axon growth Synonym: neuron long process generation Synonym: GO:0007410 Secondary: GO:0007410 -------------------------------------- GOID: GO:0006457 Term: protein folding Ontology: BP Definition: The process of assisting in the covalent and noncovalent assembly of single chain polypeptides or multisubunit complexes into the correct tertiary structure. Synonym: alpha-tubulin folding Synonym: beta-tubulin folding Synonym: chaperone activity Synonym: chaperonin ATPase activity Synonym: chaperonin-mediated tubulin folding Synonym: co-chaperone activity Synonym: co-chaperonin activity Synonym: glycoprotein-specific chaperone activity Synonym: non-chaperonin molecular chaperone ATPase activity Synonym: protein complex assembly, multichaperone pathway Synonym: GO:0007022 Synonym: GO:0007024 Synonym: GO:0007025 Secondary: GO:0007022 Secondary: GO:0007024 Secondary: GO:0007025 -------------------------------------- GOID: GO:0045214 Term: sarcomere organization Ontology: BP Definition: The myofibril assembly process that results in the organization of muscle actomyosin into sarcomeres. The sarcomere is the repeating unit of a myofibril in a muscle cell, composed of an array of overlapping thick and thin filaments between two adjacent Z discs. Synonym: sarcomere alignment Synonym: sarcomere organisation Synonym: GO:0006938 Secondary: GO:0006938 -------------------------------------- GOID: GO:0044267 Term: cellular protein metabolic process Ontology: BP Definition: The chemical reactions and pathways involving a specific protein, rather than of proteins in general, occurring at the level of an individual cell. Includes cellular protein modification. Synonym: cellular protein metabolism -------------------------------------- #two Ensembl ids test <- c('ENSG00000206172','ENSG00000075624') getBM(attributes=c('ensembl_gene_id','go_id'), filters="ensembl_gene_id", values = test, mart = ensembl) ensembl_gene_id go_id 1 ENSG00000075624 GO:0007411 2 ENSG00000075624 GO:0007409 3 ENSG00000075624 GO:0006457 4 ENSG00000075624 GO:0045214 5 ENSG00000075624 GO:0044267 6 ENSG00000075624 GO:0006928 7 ENSG00000075624 GO:0007596 8 ENSG00000075624 GO:0051592 9 ENSG00000075624 GO:0045216 10 ENSG00000075624 GO:0034332 11 ENSG00000075624 GO:0051084 12 ENSG00000075624 GO:0034329 13 ENSG00000075624 GO:0005737 14 ENSG00000075624 GO:0005829 15 ENSG00000075624 GO:0030424 16 ENSG00000075624 GO:0005856 17 ENSG00000075624 GO:0043234 18 ENSG00000075624 GO:0030529 19 ENSG00000075624 GO:0005625 20 ENSG00000075624 GO:0030863 21 ENSG00000075624 GO:0031941 22 ENSG00000075624 GO:0035267 23 ENSG00000075624 GO:0030016 24 ENSG00000075624 GO:0070688 25 ENSG00000075624 GO:0005515 26 ENSG00000075624 GO:0005524 27 ENSG00000075624 GO:0019901 28 ENSG00000075624 GO:0019894 29 ENSG00000075624 GO:0005200 30 ENSG00000075624 GO:0050998 31 ENSG00000075624 GO:0030957 32 ENSG00000206172 GO:0042744 33 ENSG00000206172 GO:0051291 34 ENSG00000206172 GO:0042542 35 ENSG00000206172 GO:0010942 36 ENSG00000206172 GO:0015671 37 ENSG00000206172 GO:0022627 38 ENSG00000206172 GO:0005833 39 ENSG00000206172 GO:0031838 40 ENSG00000206172 GO:0005515 41 ENSG00000206172 GO:0004601 42 ENSG00000206172 GO:0020037 43 ENSG00000206172 GO:0005506 44 ENSG00000206172 GO:0005344 45 ENSG00000206172 GO:0019825 46 ENSG00000206172 GO:0031720
Working with SNPs
And lastly an example taken from http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf:
library(biomaRt) snp <- useMart("snp",dataset="hsapiens_snp") start <- Sys.time() out=getBM(attributes=c("refsnp_id","allele","chrom_start"), filters=c("chr_name","start","end"), values=list(8,148350, 158612), mart=snp) end <- Sys.time() print(end - start) Time difference of 1.544547 secs nrow(out) [1] 52 head(out) refsnp_id allele chrom_start 1 rs547420070 A/C 148373 2 rs77274555 G/A 148391 3 rs567299969 T/A 148394 4 rs368076569 G/A 148407 5 rs190721891 C/G 148576 6 rs199747597 T/C 148679 tail(out) refsnp_id allele chrom_start 47 rs369534639 C/T 157748 48 rs434724 G/C 157764 49 rs200513547 C/A 157862 50 rs376761743 C/T 157970 51 rs369198444 C/T 157978 52 rs201289114 A/G 158290 # with SNP IDs my_snp <- c('rs547420070', 'rs77274555') start <- Sys.time() my_snp_detail <- getBM(attributes=c("refsnp_id","allele","chrom_start"), filters=c("snp_filter"), values=my_snp, mart=snp) end <- Sys.time() print(end - start) Time difference of 1.551306 secs my_snp_detail refsnp_id allele chrom_start 1 rs547420070 A/C 148373 2 rs77274555 G/A 148391
Gene IDs to gene symbols
Working with biologists, they will no doubt ask you for gene symbols. Here's how to convert Ensembl gene IDs to HUGO Gene Nomenclature Committee gene symbols.
library("biomaRt") ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") filters = listFilters(ensembl) test <- 'ENSG00000118473' getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl) ensembl_gene_id hgnc_symbol 1 ENSG00000118473 SGIP1 test <- c('ENSG00000118473', 'ENSG00000162426') getBM(attributes=c('ensembl_gene_id', "hgnc_symbol"), filters = "ensembl_gene_id", values=test, mart=ensembl) ensembl_gene_id hgnc_symbol 1 ENSG00000118473 SGIP1 2 ENSG00000162426 SLC45A1 getBM(attributes=c('ensembl_gene_id', "hgnc_symbol", "description"), filters = "ensembl_gene_id", values=test, mart=ensembl) ensembl_gene_id hgnc_symbol 1 ENSG00000118473 SGIP1 2 ENSG00000162426 SLC45A1 description 1 SH3-domain GRB2-like (endophilin) interacting protein 1 [Source:HGNC Symbol;Acc:25412] 2 solute carrier family 45, member 1 [Source:HGNC Symbol;Acc:17939]
Working with older reference assemblies
# hg38 ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl") # last patch of hg19 grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") # listMarts(grch37) biomart version 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 2 ENSEMBL_MART_SNP Ensembl Variation 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 4 ENSEMBL_MART_VEGA Vega grep(pattern = "homo", x = listDatasets(grch37)[,2], ignore.case = TRUE) [1] 31 listDatasets(grch37)[31,] dataset description version 31 hsapiens_gene_ensembl Homo sapiens genes (GRCh37.p13) GRCh37.p13 old_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl") my_refseq <- 'NM_001101' getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'), filters = 'refseq_mrna', values = my_refseq, mart = ensembl) ensembl_gene_id chromosome_name start_position end_position 1 ENSG00000075624 7 5527151 5563784 getBM(attributes=c('ensembl_gene_id','chromosome_name','start_position','end_position'), filters = 'refseq_mrna', values = my_refseq, mart = old_mart) Error in value[[3L]](cond) : Request to BioMart web service failed. Verify if you are still connected to the internet. Alternatively the BioMart web service is temporarily down.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi,
great post that I use as a reference. I am wondering since you have been likely using this for quite some time to convert identifiers if you still find this to be the most convenient way. The reason I am wondering is I am running into some issues going from ensembl to entrez for mouse! Namely, of the 9600 identifiers I am trying to convert, some 800 or so do not get converted. When I paste those into MGIs online translator, about 750 of the 800 get converted. What a pita to have to use several converters and it would be nice to stay within a single environment if possible. Writing intermediate files sure clutters things up quick.
Thanks,
Bob
Hi Bob,
I wrote about converting between different databases using biomaRt and noticed that Ensembl and Entrez are quite interchangeable. Was there anything common with these 800 identifiers that couldn’t be converted? I agree that it is definitely much nicer just to stay within a single environment.
Cheers,
Dave
Thanks! This has been really helpful! I am bookmarking your website for future use!
Nice instructions. Thank you. I have actually been using biomaRt in R to convert IDs. However I only know how to convert a single column of IDs. I’ve been looking for a way to convert one particular column of IDs to another kind of IDs in a big list with several other columns. Any suggestion will be appreciated.
Are you familiar with subsetting in R? It’s actually fairly easy. Search the web for “subsetting in R”.
Thanks for the heads up. I know I could subset/select a column (e.g. refseq_mrna) from the big list and use that for “getBM(attributes)” to convert them to “ensembl_gene_id”.
The issue is how I could then put the ensembl_gene_ids back to the original list especially one refseq_mrna could be mapped to several ensembl_gene_ids. Thanks.
Something like this?