Here’s a very short post on how to fetch lincRNAs from Ensembl using R and the biomaRt package. For those who are not familiar with biomaRt, you can check out my older post on biomaRt. Firstly, start R and install the biomaRt package from Bioconductor by copying and pasting the code below:
source("http://bioconductor.org/biocLite.R") biocLite("biomaRt")
Now let’s setup biomaRt to use the Ensembl genes catalog for Homo sapiens; the coordinates are based on the latest reference genome:
library("biomaRt") ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
The getBM() function is the main query function in biomaRt and consists of three parts: filters, attributes, and values. Using the filter we can obtain genes that belong to the RNA biotype we want, which in this case are the lincRNAs. I will also save various attributes, such as the transcript ID, transcript length, and information about the exons.
lincrna <- getBM(attributes= c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_length', 'chromosome_name', 'transcript_start', 'transcript_end', 'ensembl_exon_id', 'exon_chrom_start', 'exon_chrom_end', 'strand'), filters = "biotype", value = 'lincRNA', mart = ensembl) #sort by chromosome, then by start lincrna <- lincrna[order(lincrna[,'chromosome_name'],lincrna[,'transcript_start']),] #how many lincRNA transcripts are there? length(unique(lincrna$ensembl_transcript_id)) #[1] 12211 #I only want lincRNAs on assembled chromosomes my_chr <- c(1:22,'X','Y') lincrna_assembled <- subset(lincrna, lincrna$chromosome_name %in% my_chr) lincrna_assembled$chromosome_name <- factor(lincrna_assembled$chromosome_name, levels=my_chr) #how many lincRNAs? length(unique(lincrna_assembled$ensembl_transcript_id)) #[1] 12157 #create a temp object for plotting purposes tmp <- lincrna_assembled[,c('ensembl_transcript_id','chromosome_name')] #remove duplicated rows tmp <- tmp[!duplicated(tmp),] #plot the distribution of lincRNAs on the chromosomes plot(table(tmp$chromosome_name), ylab='Number of lincRNA', xlab='Chromosome', main='Distribution of Ensembl lincRNAs on assembled hg38 chromosomes', ylim=c(0,1200)) #the table of lincRNAs will be saved #inside a file called my_lincrna.tsv #to find out the location of this file #type getwd() into the R console write.table(lincrna_assembled, file = 'my_lincrna.tsv', quote = F, sep = "\t", row.names = F) #save image save.image(file='lincrna.RData')
For information on the lincRNA biotype in Emsembl, see this page.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Excellent example of out-of-the box use o biomaRt for retrieving genomic coordinates! Especially, the use of “biotype” filter.
In case someone wants to have the coordinates in hg19 genome assembly, this mart contains the last GRCh37 patch 13 build of Ensembl DB:
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="feb2014.archive.ensembl.org",path="/biomart/martservice",archive=FALSE, verbose=TRUE) # Last mart containing HG19 genome annotation
Thanks Mikhail for the code regarding hg19! I know a lot of people who still haven’t moved over to hg38, so this is very useful.
in the line
lincrna <- getBM(attributes= c('ensembl_gene_id', , ‘strand’), filters = “biotype”, value = ‘lincRNA’, mart = ensembl)
‘lincRNA’ value doesn’t seem to work anymore in 2019. I tried ‘lncRNA’ instead and it worked