How do I fetch lincRNAs from Ensembl?

Here’s a very short post on how to fetch lincRNAs from Ensembl using R and the biomaRt package. For those who are not familiar with biomaRt, you can check out my older post on biomaRt. Firstly, start R and install the biomaRt package from Bioconductor by copying and pasting the code below:

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

Now let’s setup biomaRt to use the Ensembl genes catalog for Homo sapiens; the coordinates are based on the latest reference genome:

library("biomaRt")
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

The getBM() function is the main query function in biomaRt and consists of three parts: filters, attributes, and values. Using the filter we can obtain genes that belong to the RNA biotype we want, which in this case are the lincRNAs. I will also save various attributes, such as the transcript ID, transcript length, and information about the exons.

lincrna <- getBM(attributes= c('ensembl_gene_id',
                               'ensembl_transcript_id',
                               'transcript_length',
                               'chromosome_name',
                               'transcript_start',
                               'transcript_end',
                               'ensembl_exon_id',
                               'exon_chrom_start',
                               'exon_chrom_end',
                               'strand'),
                 filters = "biotype",
                 value = 'lincRNA',
                 mart = ensembl)

#sort by chromosome, then by start
lincrna <- lincrna[order(lincrna[,'chromosome_name'],lincrna[,'transcript_start']),]

#how many lincRNA transcripts are there?
length(unique(lincrna$ensembl_transcript_id))
#[1] 12211

#I only want lincRNAs on assembled chromosomes
my_chr <- c(1:22,'X','Y')
lincrna_assembled <- subset(lincrna, lincrna$chromosome_name %in% my_chr)
lincrna_assembled$chromosome_name <- factor(lincrna_assembled$chromosome_name,
                                            levels=my_chr)

#how many lincRNAs?
length(unique(lincrna_assembled$ensembl_transcript_id))
#[1] 12157

#create a temp object for plotting purposes
tmp <- lincrna_assembled[,c('ensembl_transcript_id','chromosome_name')]
#remove duplicated rows
tmp <- tmp[!duplicated(tmp),]

#plot the distribution of lincRNAs on the chromosomes
plot(table(tmp$chromosome_name),
     ylab='Number of lincRNA',
     xlab='Chromosome',
     main='Distribution of Ensembl lincRNAs on assembled hg38 chromosomes',
     ylim=c(0,1200))

#the table of lincRNAs will be saved
#inside a file called my_lincrna.tsv
#to find out the location of this file
#type getwd() into the R console
write.table(lincrna_assembled,
            file = 'my_lincrna.tsv',
            quote = F,
            sep = "\t",
            row.names = F)

#save image
save.image(file='lincrna.RData')

lincrna_dist

For information on the lincRNA biotype in Emsembl, see this page.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. Excellent example of out-of-the box use o biomaRt for retrieving genomic coordinates! Especially, the use of “biotype” filter.

    In case someone wants to have the coordinates in hg19 genome assembly, this mart contains the last GRCh37 patch 13 build of Ensembl DB:

    mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="feb2014.archive.ensembl.org",path="/biomart/martservice",archive=FALSE, verbose=TRUE) # Last mart containing HG19 genome annotation

    1. Thanks Mikhail for the code regarding hg19! I know a lot of people who still haven’t moved over to hg38, so this is very useful.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.