Thoughts on converting gene identifiers

If you’ve worked in the genomics field, then you’ve most likely spent some time converting gene identifiers to other gene identifiers or to gene symbols. Some common gene annotation databases include RefSeq, UCSC known genes, Entrez gene and Ensembl genes. In the past, I’ve relied on large tables that provide the lookup between different identifiers, such as the gene2refseq.gz file that provides lookups between Entrez gene IDs to RefSeq IDs.

However, more elegant and robust solutions exist such as BioMart, which is a freely available online service that allows you to download genomic annotations from several databases. The usefulness of the service is that you can easily perform lookups between several different databases, i.e. converting gene identifiers. There’s a Perl package for working with BioMart and a R Bioconductor package. For this post, I will be using the R Bioconductor biomaRt package; I’ve previously written a short guide on learning to use biomaRt for those interested in a short introduction. The crux of biomaRt is summarised in the biomaRt vignette:

The getBM function is the main query function in biomaRt. It has four main arguments:

  • attributes: is a vector of attributes that one wants to retrieve (= the output of the query)
  • filters: is a vector of filters that one will use as input to the query
  • values: a vector of values for the filters. In case multiple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument
  • mart: is and object of class Mart, which is created by the useMart function.

So if I’m interested in all the genes located on chromosome one of the human genome, then I’ll filter on chromosome 1. The listFilters() function shows you all available filters in the selected dataset. To get started, I want to retrieve all Ensembl gene IDs, UCSC known gene IDS, RefSeq IDs and Entrez Gene IDs on the assembled human chromosomes:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
 
library("biomaRt")
#use the ensembl mart and the human dataset
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

#create a filter for all assembled human chromosomes
my_chr <- c(1:22, 'M', 'X', 'Y')

#listAttributes shows all attributes
attributes <- listAttributes(ensembl)

#find entrez attribute name
grep(pattern="entrez", x=attributes$description, ignore.case=T)
#[1] 45
attributes[45,]
#name   description
#45 entrezgene EntrezGene ID

#find refseq attribute name
grep(pattern="refseq", x=attributes$description, ignore.case=T)
#[1] 65 66 67 68 69 70

attributes[65:70,]
#name                                     description
#65              refseq_mrna                 RefSeq mRNA [e.g. NM_001195597]
#66    refseq_mrna_predicted       RefSeq mRNA predicted [e.g. XM_001125684]
#67             refseq_ncrna                   RefSeq ncRNA [e.g. NR_002834]
#68   refseq_ncrna_predicted         RefSeq ncRNA predicted [e.g. XR_108264]
#69           refseq_peptide           RefSeq Protein ID [e.g. NP_001005353]
#70 refseq_peptide_predicted RefSeq Predicted Protein ID [e.g. XP_001720922]

#find ucsc attribute name
#ucsc
grep(pattern="ucsc", x=attributes$description, ignore.case=T)
attributes[73,]
#   name description
#73 ucsc     UCSC ID

#find Ensembl gene name
head(attributes[grep(pattern="ensembl gene",
                     x=attributes$description,
                     ignore.case=T
                     ),
                ]
     )
#                             name                   description
#1                 ensembl_gene_id               Ensembl Gene ID
#32  clone_based_ensembl_gene_name Clone based Ensembl gene name
#134               ensembl_gene_id               Ensembl Gene ID
#165               ensembl_gene_id               Ensembl Gene ID
#175               ensembl_gene_id               Ensembl Gene ID
#188   vpacos_homolog_ensembl_gene        Alpaca Ensembl Gene ID

So now I’ve created my filter (my_chr) and I’ve found all the attribute names for the 4 different gene database (entrezgene, refseq_mrna, ucsc and ensembl_gene_id), we can perform a query using getBM():

my_refseq_mrna <- getBM(attributes = 'refseq_mrna',
                       filters = 'chromosome_name',
                       values = my_chr,
                       mart = ensembl)

my_entrez_gene <- getBM(attributes = 'entrezgene',
                        filters = 'chromosome_name',
                        values = my_chr,
                        mart = ensembl
                        )

my_ucsc <- getBM(attributes = 'ucsc',
                 filters = 'chromosome_name',
                 values = my_chr,
                 mart = ensembl
                 )

my_ensembl_gene_id <- getBM(attributes = 'ensembl_gene_id',
                            filters = 'chromosome_name',
                            values = my_chr,
                            mart = ensembl
                            )

#how many identifiers from each database
length(my_ensembl_gene_id[,1])
[1] 52322
length(my_entrez_gene[,1])
[1] 24231
length(my_refseq_mrna[,1])
[1] 35423
length(my_ucsc[,1])
[1] 53598

Here I just demonstrate how we could go about converting a RefSeq ID to the UCSC ID:

getBM(attributes=c('refseq_mrna', 'ucsc'),
      filters = 'refseq_mrna',
      values = 'NM_001195597',
      mart = ensembl)

   refseq_mrna       ucsc
1 NM_001195597 uc011ebx.2

However we can create a full lookup table as such:

my_annotation <- getBM(attributes = c('ucsc', 'ensembl_gene_id', 'refseq_mrna', 'entrezgene'),
                            filters = 'chromosome_name',
                            values = my_chr,
                            mart = ensembl
                            )

head(my_annotation)
#ucsc ensembl_gene_id  refseq_mrna entrezgene
#1 uc021wif.1 ENSG00000227342                  266919
#2            ENSG00000225267                      NA
#3 uc002zyz.4 ENSG00000099977 NM_001084392       1652
#4 uc002zza.4 ENSG00000099977    NM_001355       1652
#5            ENSG00000099977                    1652
#6            ENSG00000250973                      NA

#substitute the blank lines with NAs
my_annotation <- sapply(my_annotation,gsub,pattern="^$",replacement=NA)
#label the rows as sequential numbers
row.names(my_annotation) <- 1:length(my_annotation[,1])

head(my_annotation)
#  ucsc         ensembl_gene_id   refseq_mrna    entrezgene
#1 "uc021wif.1" "ENSG00000227342" NA             "266919"  
#2 NA           "ENSG00000225267" NA             NA        
#3 "uc002zyz.4" "ENSG00000099977" "NM_001084392" "1652"    
#4 "uc002zza.4" "ENSG00000099977" "NM_001355"    "1652"    
#5 NA           "ENSG00000099977" NA             "1652"    
#6 NA           "ENSG00000250973" NA             NA

On rows 3 to 5, the Ensembl gene id and the Entrez gene id are the same but corresponding to these gene IDs are different UCSC known gene IDs and RefSeq IDs. This is because there are different transcript models under the same Ensembl and Entrez gene IDs but the UCSC known gene ID and RefSeq IDs have separate identifiers for these different transcripts.

Also note that on row 6 there is an Ensembl gene ID with no equivalent in the other databases. I will comment on this later. Let’s create a Venn diagram to show the overlap between the different databases:

#function to take in a row name and
#list of values and
#checks for NAs in the list of values and
#returns the row name or NA depending on the list of values
my_function <- function(x, ...){
  to_return <- sapply(list(...), is.na)
  to_return <- gsub(pattern=FALSE, replacement=x, x=to_return)
  to_return <- gsub(pattern=TRUE, replacement=NA, x=to_return)
}

my_venn <- my_annotation

#loop through checking each row
#and calling my_function
for (i in 1:length(my_annotation[,1])){
  my_venn[i,] <- my_function(row.names(my_annotation)[i], my_annotation[i,])
}

#create vector of row names for each database
my_venn_ucsc <- my_venn[,1]
my_venn_ucsc <- as.vector(na.omit(my_venn_ucsc))
my_venn_ensembl <- my_venn[,2]
my_venn_ensembl <- as.vector(na.omit(my_venn_ensembl))
my_venn_refseq <- my_venn[,3]
my_venn_refseq <- as.vector(na.omit(my_venn_refseq))
my_venn_entrez <- my_venn[,4]
my_venn_entrez <- as.vector(na.omit(my_venn_entrez))

#plotting the Venn diagram
library(gplots)
VennList<-list(UCSC = my_venn_ucsc,
               Ensembl = my_venn_ensembl,
               RefSeq = my_venn_refseq,
               Entrez = my_venn_entrez
               )
venn(VennList)

vennA lot of Ensembl gene IDs are not present in the other databases.

Firstly, why are there so many Ensembl gene IDs without any equivalent in the other databases?

dim(my_annotation[is.na(my_annotation[,1]) &
                  is.na(my_annotation[,3]) &
                  is.na(my_annotation[,4]),])
#[1] 31967     4

#store these unique Ensembl gene IDs
my_unique_ensembl <- my_annotation[is.na(my_annotation[,1]) &
                                   is.na(my_annotation[,3]) &
                                   is.na(my_annotation[,4]),][,2]

head(my_unique_ensembl)
                2                 6                 8                25 
"ENSG00000225267" "ENSG00000250973" "ENSG00000233640" "ENSG00000237841" 
               28                29 
"ENSG00000235312" "ENSG00000218125"

#which 20 of the attributes can give us an idea
#of what these genes are?

#ENSG00000225267
getBM(attributes=attributes$name[1:20],
      filters = 'ensembl_gene_id',
      values = my_unique_ensembl[1],
      mart = ensembl)

#  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id ensembl_exon_id
#1 ENSG00000225267       ENST00000430538                 NA ENSE00001714293
#                                                       description chromosome_name
#1 ribosomal protein L8 pseudogene 2 [Source:HGNC Symbol;Acc:17220]              21
#  start_position end_position strand   band transcript_start transcript_end
#1       31635698     31636199     -1 q22.11         31635698       31636199
#  external_gene_id external_transcript_id external_gene_db   transcript_db_name
#1           RPL8P2             RPL8P2-001      HGNC Symbol HGNC transcript name
#  transcript_count percentage_gc_content gene_biotype   transcript_biotype
#1                1                 55.18   pseudogene processed_pseudogene

#gene_biotype
my_unique_ensembl_biotype <- getBM(attributes = c('ensembl_gene_id', 'gene_biotype'),
                                   filters = 'ensembl_gene_id',
                                   values = my_unique_ensembl,
                                   mart = ensembl
                                   )
head(my_unique_ensembl_biotype)
   ensembl_gene_id         gene_biotype
1  ENSG00000005206 processed_transcript
2  ENSG00000006062 processed_transcript
3  ENSG00000018607           pseudogene
4  ENSG00000020219           pseudogene
5  ENSG00000026036       protein_coding
6  ENSG00000031544 processed_transcript
7  ENSG00000049319 processed_transcript
8  ENSG00000060303           pseudogene
9  ENSG00000072444       protein_coding
10 ENSG00000083622            antisense
11 ENSG00000088340           pseudogene
12 ENSG00000088970 processed_transcript
13 ENSG00000093134       protein_coding
14 ENSG00000100101 processed_transcript
15 ENSG00000101278           pseudogene

table(my_unique_ensembl_biotype$gene_biotype)

3prime_overlapping_ncrna                antisense                IG_C_gene 
                      42                     4123                       11 
         IG_C_pseudogene                IG_D_gene                IG_J_gene 
                       9                       37                       18 
         IG_J_pseudogene                IG_V_gene          IG_V_pseudogene 
                       3                      135                      186 
                 lincRNA                    miRNA                 misc_RNA 
                    5484                     1675                     1733 
  polymorphic_pseudogene     processed_transcript           protein_coding 
                      10                      502                     1039 
              pseudogene                     rRNA           sense_intronic 
                   13080                      386                      638 
       sense_overlapping                   snoRNA                    snRNA 
                     169                      977                     1579 
               TR_C_gene                TR_D_gene                TR_J_gene 
                       1                        3                       73 
         TR_J_pseudogene                TR_V_gene          TR_V_pseudogene 
                       3                       25                       26

So the majority of Ensembl gene IDs without an equivalent in the other databases are mainly pseudogenes or non-coding RNAs. Now what about the 17,940 annotations only present among Ensembl and Entrez?

head(my_annotation[is.na(my_annotation[,1]) &
                   is.na(my_annotation[,3]) &
                   !is.na(my_annotation[,4]) &
                   !is.na(my_annotation[,2]),
                  ]
     )
#   ucsc ensembl_gene_id   refseq_mrna entrezgene
#5  NA   "ENSG00000099977" NA          "1652"    
#10 NA   "ENSG00000152213" NA          "115761"  
#12 NA   "ENSG00000101574" NA          "64863"   
#20 NA   "ENSG00000125877" NA          "3704"    
#23 NA   "ENSG00000154065" NA          "147463"  
#26 NA   "ENSG00000132952" NA          "10208"

dim(my_annotation[is.na(my_annotation[,1]) &
                  is.na(my_annotation[,3]) &
                  !is.na(my_annotation[,4]) &
                  !is.na(my_annotation[,2]),
                 ]
   )
[1] 17940     4

my_ensembl_entrez <- my_annotation[is.na(my_annotation[,1]) &
                                   is.na(my_annotation[,3]) &
                                   !is.na(my_annotation[,4]) &
                                   !is.na(my_annotation[,2]),
                                  ][,2]

#gene_biotype
my_ensembl_entrez_biotype <- getBM(attributes = c('ensembl_gene_id', 'gene_biotype'),
                                   filters = 'ensembl_gene_id',
                                   values = my_ensembl_entrez,
                                   mart = ensembl
                                   )

table(my_ensembl_entrez_biotype$gene_biotype)

3prime_overlapping_ncrna                antisense                IG_V_gene 
                       2                      506                        2 
                 lincRNA                    miRNA                 misc_RNA 
                     748                        9                        5 
  polymorphic_pseudogene     processed_transcript           protein_coding 
                      19                      184                    14543 
              pseudogene                     rRNA           sense_intronic 
                     544                       17                       33 
       sense_overlapping                   snoRNA                    snRNA 
                       8                        3                        7

Conclusions

The main observations from the Venn diagram:

  1. 17,940 annotations are common between Ensembl and Entrez but missing in RefSeq and UCSC.
  2. 33,750 annotations are common between Ensembl, Entrez and UCSC but missing from RefSeq.
  3. 38,115 annotations are common to all databases.

Most of the RefSeq annotations are covered in the other databases, since this collection is curated very well. For this same reason, there are a large number of annotations present in the other 3 databases but missing in the RefSeq collection; these genes did not meet the strict criteria of being included into the RefSeq collection.

Regarding the overlap between Entrez and Ensembl that are missing in UCSC and RefSeq, the majority of these genes are protein coding. I manually looked up a few of these genes and found that for these genes are a lot of corresponding transcript models. Thus a likely reason for this exclusive set is due to a missing transcript model in the RefSeq and UCSC known gene databases. I should point out that for each Ensembl gene ID are a set of Ensembl transcript IDs; so one ID can have several transcript IDs. This is another problem of working on the “gene” level as opposed to the “transcript” level. So when converting between a RefSeq gene ID to an Ensembl gene ID, it may not be the same transcript.

One of the motivations for this post was that I wanted to use the R Bioconductor GOstats package for performing a gene ontology enrichment analysis, which required annotations to Entrez gene IDs. I’ve used in the past another R Bioconductor package called goseq, which requires annotation by Ensembl gene IDs. So this led me to the comparison of the different gene databases.

Final word: when choosing which gene database to use for annotation, it’s best to understand how the database is constructed. RefSeqs are a conservative choice, while on the other spectrum are Ensembl annotations, which also include a large number of pseudogenes not present in the other databases. When converting gene identifiers between Ensembl and Entrez genes, the majority of models have equivalents, apart from a large portion of pseudogenes and noncoding RNAs present only in the Ensembl gene annotation database.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
One comment Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.