If you’ve worked in the genomics field, then you’ve most likely spent some time converting gene identifiers to other gene identifiers or to gene symbols. Some common gene annotation databases include RefSeq, UCSC known genes, Entrez gene and Ensembl genes. In the past, I’ve relied on large tables that provide the lookup between different identifiers, such as the gene2refseq.gz file that provides lookups between Entrez gene IDs to RefSeq IDs.
However, more elegant and robust solutions exist such as BioMart, which is a freely available online service that allows you to download genomic annotations from several databases. The usefulness of the service is that you can easily perform lookups between several different databases, i.e. converting gene identifiers. There’s a Perl package for working with BioMart and a R Bioconductor package. For this post, I will be using the R Bioconductor biomaRt package; I’ve previously written a short guide on learning to use biomaRt for those interested in a short introduction. The crux of biomaRt is summarised in the biomaRt vignette:
The getBM function is the main query function in biomaRt. It has four main arguments:
- attributes: is a vector of attributes that one wants to retrieve (= the output of the query)
- filters: is a vector of filters that one will use as input to the query
- values: a vector of values for the filters. In case multiple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument
- mart: is and object of class Mart, which is created by the useMart function.
So if I’m interested in all the genes located on chromosome one of the human genome, then I’ll filter on chromosome 1. The listFilters() function shows you all available filters in the selected dataset. To get started, I want to retrieve all Ensembl gene IDs, UCSC known gene IDS, RefSeq IDs and Entrez Gene IDs on the assembled human chromosomes:
#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("biomaRt") library("biomaRt") #use the ensembl mart and the human dataset ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl") #create a filter for all assembled human chromosomes my_chr <- c(1:22, 'M', 'X', 'Y') #listAttributes shows all attributes attributes <- listAttributes(ensembl) #find entrez attribute name grep(pattern="entrez", x=attributes$description, ignore.case=T) #[1] 45 attributes[45,] #name description #45 entrezgene EntrezGene ID #find refseq attribute name grep(pattern="refseq", x=attributes$description, ignore.case=T) #[1] 65 66 67 68 69 70 attributes[65:70,] #name description #65 refseq_mrna RefSeq mRNA [e.g. NM_001195597] #66 refseq_mrna_predicted RefSeq mRNA predicted [e.g. XM_001125684] #67 refseq_ncrna RefSeq ncRNA [e.g. NR_002834] #68 refseq_ncrna_predicted RefSeq ncRNA predicted [e.g. XR_108264] #69 refseq_peptide RefSeq Protein ID [e.g. NP_001005353] #70 refseq_peptide_predicted RefSeq Predicted Protein ID [e.g. XP_001720922] #find ucsc attribute name #ucsc grep(pattern="ucsc", x=attributes$description, ignore.case=T) attributes[73,] # name description #73 ucsc UCSC ID #find Ensembl gene name head(attributes[grep(pattern="ensembl gene", x=attributes$description, ignore.case=T ), ] ) # name description #1 ensembl_gene_id Ensembl Gene ID #32 clone_based_ensembl_gene_name Clone based Ensembl gene name #134 ensembl_gene_id Ensembl Gene ID #165 ensembl_gene_id Ensembl Gene ID #175 ensembl_gene_id Ensembl Gene ID #188 vpacos_homolog_ensembl_gene Alpaca Ensembl Gene ID
So now I’ve created my filter (my_chr) and I’ve found all the attribute names for the 4 different gene database (entrezgene, refseq_mrna, ucsc and ensembl_gene_id), we can perform a query using getBM():
my_refseq_mrna <- getBM(attributes = 'refseq_mrna', filters = 'chromosome_name', values = my_chr, mart = ensembl) my_entrez_gene <- getBM(attributes = 'entrezgene', filters = 'chromosome_name', values = my_chr, mart = ensembl ) my_ucsc <- getBM(attributes = 'ucsc', filters = 'chromosome_name', values = my_chr, mart = ensembl ) my_ensembl_gene_id <- getBM(attributes = 'ensembl_gene_id', filters = 'chromosome_name', values = my_chr, mart = ensembl ) #how many identifiers from each database length(my_ensembl_gene_id[,1]) [1] 52322 length(my_entrez_gene[,1]) [1] 24231 length(my_refseq_mrna[,1]) [1] 35423 length(my_ucsc[,1]) [1] 53598
Here I just demonstrate how we could go about converting a RefSeq ID to the UCSC ID:
getBM(attributes=c('refseq_mrna', 'ucsc'), filters = 'refseq_mrna', values = 'NM_001195597', mart = ensembl) refseq_mrna ucsc 1 NM_001195597 uc011ebx.2
However we can create a full lookup table as such:
my_annotation <- getBM(attributes = c('ucsc', 'ensembl_gene_id', 'refseq_mrna', 'entrezgene'), filters = 'chromosome_name', values = my_chr, mart = ensembl ) head(my_annotation) #ucsc ensembl_gene_id refseq_mrna entrezgene #1 uc021wif.1 ENSG00000227342 266919 #2 ENSG00000225267 NA #3 uc002zyz.4 ENSG00000099977 NM_001084392 1652 #4 uc002zza.4 ENSG00000099977 NM_001355 1652 #5 ENSG00000099977 1652 #6 ENSG00000250973 NA #substitute the blank lines with NAs my_annotation <- sapply(my_annotation,gsub,pattern="^$",replacement=NA) #label the rows as sequential numbers row.names(my_annotation) <- 1:length(my_annotation[,1]) head(my_annotation) # ucsc ensembl_gene_id refseq_mrna entrezgene #1 "uc021wif.1" "ENSG00000227342" NA "266919" #2 NA "ENSG00000225267" NA NA #3 "uc002zyz.4" "ENSG00000099977" "NM_001084392" "1652" #4 "uc002zza.4" "ENSG00000099977" "NM_001355" "1652" #5 NA "ENSG00000099977" NA "1652" #6 NA "ENSG00000250973" NA NA
On rows 3 to 5, the Ensembl gene id and the Entrez gene id are the same but corresponding to these gene IDs are different UCSC known gene IDs and RefSeq IDs. This is because there are different transcript models under the same Ensembl and Entrez gene IDs but the UCSC known gene ID and RefSeq IDs have separate identifiers for these different transcripts.
Also note that on row 6 there is an Ensembl gene ID with no equivalent in the other databases. I will comment on this later. Let’s create a Venn diagram to show the overlap between the different databases:
#function to take in a row name and #list of values and #checks for NAs in the list of values and #returns the row name or NA depending on the list of values my_function <- function(x, ...){ to_return <- sapply(list(...), is.na) to_return <- gsub(pattern=FALSE, replacement=x, x=to_return) to_return <- gsub(pattern=TRUE, replacement=NA, x=to_return) } my_venn <- my_annotation #loop through checking each row #and calling my_function for (i in 1:length(my_annotation[,1])){ my_venn[i,] <- my_function(row.names(my_annotation)[i], my_annotation[i,]) } #create vector of row names for each database my_venn_ucsc <- my_venn[,1] my_venn_ucsc <- as.vector(na.omit(my_venn_ucsc)) my_venn_ensembl <- my_venn[,2] my_venn_ensembl <- as.vector(na.omit(my_venn_ensembl)) my_venn_refseq <- my_venn[,3] my_venn_refseq <- as.vector(na.omit(my_venn_refseq)) my_venn_entrez <- my_venn[,4] my_venn_entrez <- as.vector(na.omit(my_venn_entrez)) #plotting the Venn diagram library(gplots) VennList<-list(UCSC = my_venn_ucsc, Ensembl = my_venn_ensembl, RefSeq = my_venn_refseq, Entrez = my_venn_entrez ) venn(VennList)
A lot of Ensembl gene IDs are not present in the other databases.
Firstly, why are there so many Ensembl gene IDs without any equivalent in the other databases?
dim(my_annotation[is.na(my_annotation[,1]) & is.na(my_annotation[,3]) & is.na(my_annotation[,4]),]) #[1] 31967 4 #store these unique Ensembl gene IDs my_unique_ensembl <- my_annotation[is.na(my_annotation[,1]) & is.na(my_annotation[,3]) & is.na(my_annotation[,4]),][,2] head(my_unique_ensembl) 2 6 8 25 "ENSG00000225267" "ENSG00000250973" "ENSG00000233640" "ENSG00000237841" 28 29 "ENSG00000235312" "ENSG00000218125" #which 20 of the attributes can give us an idea #of what these genes are? #ENSG00000225267 getBM(attributes=attributes$name[1:20], filters = 'ensembl_gene_id', values = my_unique_ensembl[1], mart = ensembl) # ensembl_gene_id ensembl_transcript_id ensembl_peptide_id ensembl_exon_id #1 ENSG00000225267 ENST00000430538 NA ENSE00001714293 # description chromosome_name #1 ribosomal protein L8 pseudogene 2 [Source:HGNC Symbol;Acc:17220] 21 # start_position end_position strand band transcript_start transcript_end #1 31635698 31636199 -1 q22.11 31635698 31636199 # external_gene_id external_transcript_id external_gene_db transcript_db_name #1 RPL8P2 RPL8P2-001 HGNC Symbol HGNC transcript name # transcript_count percentage_gc_content gene_biotype transcript_biotype #1 1 55.18 pseudogene processed_pseudogene #gene_biotype my_unique_ensembl_biotype <- getBM(attributes = c('ensembl_gene_id', 'gene_biotype'), filters = 'ensembl_gene_id', values = my_unique_ensembl, mart = ensembl ) head(my_unique_ensembl_biotype) ensembl_gene_id gene_biotype 1 ENSG00000005206 processed_transcript 2 ENSG00000006062 processed_transcript 3 ENSG00000018607 pseudogene 4 ENSG00000020219 pseudogene 5 ENSG00000026036 protein_coding 6 ENSG00000031544 processed_transcript 7 ENSG00000049319 processed_transcript 8 ENSG00000060303 pseudogene 9 ENSG00000072444 protein_coding 10 ENSG00000083622 antisense 11 ENSG00000088340 pseudogene 12 ENSG00000088970 processed_transcript 13 ENSG00000093134 protein_coding 14 ENSG00000100101 processed_transcript 15 ENSG00000101278 pseudogene table(my_unique_ensembl_biotype$gene_biotype) 3prime_overlapping_ncrna antisense IG_C_gene 42 4123 11 IG_C_pseudogene IG_D_gene IG_J_gene 9 37 18 IG_J_pseudogene IG_V_gene IG_V_pseudogene 3 135 186 lincRNA miRNA misc_RNA 5484 1675 1733 polymorphic_pseudogene processed_transcript protein_coding 10 502 1039 pseudogene rRNA sense_intronic 13080 386 638 sense_overlapping snoRNA snRNA 169 977 1579 TR_C_gene TR_D_gene TR_J_gene 1 3 73 TR_J_pseudogene TR_V_gene TR_V_pseudogene 3 25 26
So the majority of Ensembl gene IDs without an equivalent in the other databases are mainly pseudogenes or non-coding RNAs. Now what about the 17,940 annotations only present among Ensembl and Entrez?
head(my_annotation[is.na(my_annotation[,1]) & is.na(my_annotation[,3]) & !is.na(my_annotation[,4]) & !is.na(my_annotation[,2]), ] ) # ucsc ensembl_gene_id refseq_mrna entrezgene #5 NA "ENSG00000099977" NA "1652" #10 NA "ENSG00000152213" NA "115761" #12 NA "ENSG00000101574" NA "64863" #20 NA "ENSG00000125877" NA "3704" #23 NA "ENSG00000154065" NA "147463" #26 NA "ENSG00000132952" NA "10208" dim(my_annotation[is.na(my_annotation[,1]) & is.na(my_annotation[,3]) & !is.na(my_annotation[,4]) & !is.na(my_annotation[,2]), ] ) [1] 17940 4 my_ensembl_entrez <- my_annotation[is.na(my_annotation[,1]) & is.na(my_annotation[,3]) & !is.na(my_annotation[,4]) & !is.na(my_annotation[,2]), ][,2] #gene_biotype my_ensembl_entrez_biotype <- getBM(attributes = c('ensembl_gene_id', 'gene_biotype'), filters = 'ensembl_gene_id', values = my_ensembl_entrez, mart = ensembl ) table(my_ensembl_entrez_biotype$gene_biotype) 3prime_overlapping_ncrna antisense IG_V_gene 2 506 2 lincRNA miRNA misc_RNA 748 9 5 polymorphic_pseudogene processed_transcript protein_coding 19 184 14543 pseudogene rRNA sense_intronic 544 17 33 sense_overlapping snoRNA snRNA 8 3 7
Conclusions
The main observations from the Venn diagram:
- 17,940 annotations are common between Ensembl and Entrez but missing in RefSeq and UCSC.
- 33,750 annotations are common between Ensembl, Entrez and UCSC but missing from RefSeq.
- 38,115 annotations are common to all databases.
Most of the RefSeq annotations are covered in the other databases, since this collection is curated very well. For this same reason, there are a large number of annotations present in the other 3 databases but missing in the RefSeq collection; these genes did not meet the strict criteria of being included into the RefSeq collection.
Regarding the overlap between Entrez and Ensembl that are missing in UCSC and RefSeq, the majority of these genes are protein coding. I manually looked up a few of these genes and found that for these genes are a lot of corresponding transcript models. Thus a likely reason for this exclusive set is due to a missing transcript model in the RefSeq and UCSC known gene databases. I should point out that for each Ensembl gene ID are a set of Ensembl transcript IDs; so one ID can have several transcript IDs. This is another problem of working on the “gene” level as opposed to the “transcript” level. So when converting between a RefSeq gene ID to an Ensembl gene ID, it may not be the same transcript.
One of the motivations for this post was that I wanted to use the R Bioconductor GOstats package for performing a gene ontology enrichment analysis, which required annotations to Entrez gene IDs. I’ve used in the past another R Bioconductor package called goseq, which requires annotation by Ensembl gene IDs. So this led me to the comparison of the different gene databases.
Final word: when choosing which gene database to use for annotation, it’s best to understand how the database is constructed. RefSeqs are a conservative choice, while on the other spectrum are Ensembl annotations, which also include a large number of pseudogenes not present in the other databases. When converting gene identifiers between Ensembl and Entrez genes, the majority of models have equivalents, apart from a large portion of pseudogenes and noncoding RNAs present only in the Ensembl gene annotation database.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi! Thanks for your helpful posts such as this 🙂
I’m attempting to extract all Y chromosome genes from a mus musculus dataset (“mc57bl6nj_gene_ensembl”) using biomaRt. However, the only filters I see are for chromosomes 1-19 and X, but no Y! Do you know why that is, and if there is a work around? I find it surprising that the Y chromosome wouldn’t be included….
(cannot find a straightforward answer by googling…! maybe it is something obvious I am not seeing; I am new in general to genomics work)
Thanks for any insight that you have.
Eva