Non-coding RNA
A non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. The classical ncRNAs include transfer RNAs, ribosomal RNAs, and small nuclear RNAs. The sequencing of full-length complementary DNAs (cDNAs) by the FANTOM consortium revealed that many transcripts have no coding potential, i.e. they are non-coding. The FANTOM3 full-length cDNA database contains 102,801 cDNAs and can be accessed at ftp://fantom.gsc.riken.jp/fantomdb/3.0/. Here is part of the FANTOM3 press release (see http://fantom.gsc.riken.jp/3/doc/PressRelease.pdf), which states the significance of regulatory ncRNAs:
Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs that until recently was not suspected to exist or be relevant to our biology.
Assessing the coding potential of these cDNAs revealed that 38.6% of them had very little coding potential (http://dx.doi.org/10.6084/m9.figshare.1046601). Several long non-coding RNAs that were characterised later had already been cloned by the FANTOM consortium; these include Hotair (http://www.ncbi.nlm.nih.gov/pubmed/17604720), which corresponds to the cDNA sequence http://www.ncbi.nlm.nih.gov/nuccore/26084774, and Braveheart (http://www.ncbi.nlm.nih.gov/pubmed/23352431), which corresponds to the cDNA sequence http://www.ncbi.nlm.nih.gov/nucleotide/AK143260.
ncRNAs are emerging as key regulators of embryogenesis. They control embryonic gene expression by several means, ranging from microRNA induced degradation of mRNAs to long ncRNA-mediated modification of chromatin. Many aspects of embryogenesis seem to be controlled by ncRNAs, including the maternal–zygotic transition, the maintenance of pluripotency, the patterning of the body axes, the specification and differentiation of cell types and the morphogenesis of organs. Drawing from several animal model systems, we describe two emerging themes for ncRNA function: promoting developmental transitions and maintaining developmental states. These examples also highlight the roles of ncRNAs in ensuring a robust commitment to one of two possible cell fates.
The ratio of non-coding to protein-coding DNA rises as a function of developmental complexity: http://www.nature.com/nrg/journal/v5/n4/fig_tab/nrg1321_F1.html and http://arxiv.org/abs/q-bio.GN/0401020 but see Junk DNA
Pervasive transcription constitutes a new level of eukaryotic genome regulation: http://www.nature.com/embor/journal/v10/n9/full/embor2009181.html
Most of these transcripts may simply be fragments of pre-mRNAs: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1000371 but this was dismissed in http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1000625
Critique of ENCODE: http://gbe.oxfordjournals.org/content/5/3/578 and http://www.pnas.org/content/110/14/5294.full
Databases
NONCODE -> http://www.noncode.org/index.php
#!/bin/sh
wget http://www.noncode.org/datadownload/NONCODEv4_human.lncRNA.exp.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human.gene.exp.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human.func.gz
wget http://www.noncode.org/datadownload/NONCODEv4_hg19.lncAndGene.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human_cc.tgz
wget http://www.noncode.org/datadownload/NONCODEv4_human.fa.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_ncRNA.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA_Gene.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA.gtf.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_ncRNA.fa.gz
Some numbers
# count the FASTA headers in the v4u1 update
zcat NONCODEv4u1_human_ncRNA.fa.gz | grep "^>" | wc -l
145331
# count the FASTA headers in the original v4 release
zcat NONCODEv4_human.fa.gz | grep "^>" | wc -l
148172
#As noted in their latest update
#2841 transcripts have been deleted.
expr 148172 - 145331
2841
I'm interested in the expression files:
zcat NONCODEv4_human.gene.exp.gz NONCODEv4_human.lncRNA.exp.gz | wc -l
151115
However, some rows of the lncRNA expression file have a different number of columns.
zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print $n' | sort | uniq -c
    112 23
  95024 25
zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print if $n==23' | less
I will exclude the 23-column rows as they seem to have no expression values.
zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print unless $n==23' | gzip > NONCODE_v4_human_lncRNA_no_zero.exp.gz
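One thing to keep in mind with the column counts above: Perl's split drops trailing empty fields by default, so a row that ends in empty cells is reported as shorter than it is. A minimal sketch of an alternative count using awk, which keeps trailing empty fields; the helper name field_counts and the toy input are made up for illustration:

```shell
#!/bin/sh
# field_counts (hypothetical helper): tabulate the number of tab-separated
# fields per row read from stdin, printed as "<fields><TAB><rows>".
# Unlike Perl's default split, awk keeps trailing empty fields,
# so "d<TAB>e<TAB>" counts as 3 fields rather than 2.
field_counts() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k "\t" n[k] }' | sort -n
}

# Toy input: two rows with 3 fields (one ends in an empty field), one with 2.
printf 'a\tb\tc\nd\te\t\nf\tg\n' | field_counts
```

On the real data this would be run as zcat NONCODEv4_human.lncRNA.exp.gz | field_counts. If the 23-column rows then report the same count as the rest, they simply end in empty cells.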
Some very quick statistics using R
data <- read.table("NONCODE_v4_human_lncRNA_no_zero.exp.gz", header=T, stringsAsFactors=F, sep="\t")
summary(data)
  NONCODE.ID           adipose              adrenal               brain
 Length:95023       Min.   :     0.00   Min.   :     0.00   Min.   :    0.00
 Class :character   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00
 Mode  :character   Median :     0.00   Median :     0.05   Median :    0.05
                    Mean   :    11.34   Mean   :    10.14   Mean   :    6.51
                    3rd Qu.:     0.50   3rd Qu.:     0.92   3rd Qu.:    0.71
                    Max.   :191100.00   Max.   :102810.00   Max.   :69899.50
    brain_R              breast               colon               foreskin
 Min.   :    0.00   Min.   :     0.00   Min.   :     0.00   Min.   :    0.000
 1st Qu.:    0.00   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.000
 Median :    0.00   Median :     0.02   Median :     0.00   Median :    0.000
 Mean   :    6.48   Mean   :    11.19   Mean   :    11.70   Mean   :    4.370
 3rd Qu.:    0.42   3rd Qu.:     0.67   3rd Qu.:     0.47   3rd Qu.:    0.306
 Max.   :58651.40   Max.   :157946.00   Max.   :175879.00   Max.   :31165.600
     heart              hela_R              HLF_1               HLF_2
 Min.   :    0.00   Min.   :     0.0   Min.   :     0.0   Min.   :    0.00
 1st Qu.:    0.00   1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:    0.00
 Median :    0.00   Median :     0.0   Median :     0.0   Median :    0.00
 Mean   :    3.79   Mean   :     9.7   Mean   :    27.9   Mean   :    5.41
 3rd Qu.:    0.32   3rd Qu.:     0.2   3rd Qu.:     0.2   3rd Qu.:    0.41
 Max.   :56278.60   Max.   :482248.0   Max.   :626288.0   Max.   :40215.70
     kidney             liver              liver_R               lung
 Min.   :    0.00   Min.   :    0.00   Min.   :     0.0   Min.   :    0.000
 1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:     0.0   1st Qu.:    0.000
 Median :    0.00   Median :    0.00   Median :     0.0   Median :    0.000
 Mean   :    5.77   Mean   :    4.38   Mean   :    22.6   Mean   :    4.929
 3rd Qu.:    0.63   3rd Qu.:    0.20   3rd Qu.:     0.0   3rd Qu.:    0.523
 Max.   :75171.40   Max.   :35716.30   Max.   :499541.0   Max.   :12131.100
   lymphNode             ovary            placenta_R           prostate
 Min.   :    0.000   Min.   :    0.00   Min.   :    0.00   Min.   :    0.000
 1st Qu.:    0.000   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.000
 Median :    0.000   Median :    0.05   Median :    0.00   Median :    0.000
 Mean   :    5.160   Mean   :    8.93   Mean   :    6.33   Mean   :    4.976
 3rd Qu.:    0.711   3rd Qu.:    0.86   3rd Qu.:    0.57   3rd Qu.:    0.652
 Max.   :11890.000   Max.   :60103.10   Max.   :75331.20   Max.   :12193.400
 skeltalMuscle          testes             testes_R            thyroid
 Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00
 1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00
 Median :    0.00   Median :    0.20   Median :    0.28   Median :    0.04
 Mean   :    4.68   Mean   :    8.46   Mean   :    5.11   Mean   :    7.12
 3rd Qu.:    0.17   3rd Qu.:    1.43   3rd Qu.:    1.31   3rd Qu.:    0.88
 Max.   :49610.30   Max.   :63606.50   Max.   :38256.60   Max.   :74919.30
 whiteBloodCell
 Min.   :    0.000
 1st Qu.:    0.000
 Median :    0.000
 Mean   :    5.017
 3rd Qu.:    0.326
 Max.   :10675.900
data[data$adipose==191100,]
         NONCODE.ID adipose adrenal   brain brain_R breast  colon foreskin
86398 NONHSAT135828  191100  102810 52456.1 14910.2 157946 175879  3267.11
        heart  hela_R  HLF_1   HLF_2  kidney   liver liver_R    lung lymphNode
86398 56278.6 808.206 271.58 14484.9 75171.4 35716.3 500.349 11699.1     11890
        ovary placenta_R prostate skeltalMuscle  testes testes_R thyroid
86398 60103.1    3073.83  12193.4       49610.3 63606.5  3159.86 74919.3
      whiteBloodCell
86398        6783.71
NONHSAT135828 is very highly expressed in most samples; however, I'm a bit worried about the large difference between HLF_1 (271.58) and HLF_2 (14484.9).
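To put a number on that difference, the two HLF values for NONHSAT135828 from the row above can be compared as a simple ratio. A minimal sketch; fold_change is a hypothetical helper name and has nothing to do with NONCODE itself:

```shell
#!/bin/sh
# fold_change (hypothetical helper): print the ratio a/b to one decimal place,
# or NA when the denominator is zero.
fold_change() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (b == 0) { print "NA" } else { printf "%.1f\n", a / b }
    }'
}

# HLF_2 versus HLF_1 for NONHSAT135828 (values copied from the row above).
fold_change 14484.9 271.58
```

This comes out at roughly 53-fold, which is far larger than the spread between most of the other samples for this transcript.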
Different classes
- rRNA - http://www.ncbi.nlm.nih.gov/pubmed/14381428
- tRNA - http://www.ncbi.nlm.nih.gov/pubmed/13538965
- PROMPTS - http://en.wikipedia.org/wiki/Cryptic_unstable_transcript and http://www.ncbi.nlm.nih.gov/pubmed/19056938
- lncRNA - http://www.ncbi.nlm.nih.gov/pubmed/12466851
- vRNA - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2114306/
- Antisense RNA - http://www.ncbi.nlm.nih.gov/pubmed/16141073
- lincRNA - http://www.ncbi.nlm.nih.gov/pubmed/21874018, http://www.ncbi.nlm.nih.gov/pubmed/22196729, http://www.ncbi.nlm.nih.gov/pubmed/21890647
- miRNA - http://www.ncbi.nlm.nih.gov/pubmed/8252621
- rasiRNA or piRNA - http://www.ncbi.nlm.nih.gov/pubmed/17322028
- TSSa-RNA - http://www.ncbi.nlm.nih.gov/pubmed/21822281
- eRNA - http://www.ncbi.nlm.nih.gov/pubmed/20393465
- PASR (promoter-associated short RNA) and TASR (termini-associated short RNAs) - http://www.ncbi.nlm.nih.gov/pubmed/17510325, http://www.ncbi.nlm.nih.gov/pubmed/20542000, http://www.pnas.org/content/104/30/12422
- diRNA or DDRNA - http://www.ncbi.nlm.nih.gov/pubmed/22445173, http://www.ncbi.nlm.nih.gov/pubmed/22722852
- ceRNA - http://www.ncbi.nlm.nih.gov/pubmed/24429633
- tiRNA - http://www.ncbi.nlm.nih.gov/pubmed/19377478
- CUT - http://www.ncbi.nlm.nih.gov/pubmed/17074811
- SUT - http://www.ncbi.nlm.nih.gov/pubmed/21826286
- moR - http://www.ncbi.nlm.nih.gov/pubmed/19151725
Reviews
http://www.ncbi.nlm.nih.gov/pubmed/20628352, which discusses methods for annotating the non-coding regions of the genome.
http://www.ncbi.nlm.nih.gov/pubmed/23463798, which concludes that doing experiments is the best way to test whether a non-coding RNA is functional: "In the end, it may not be possible or meaningful to try to apply criteria such as stability, conservation, and expression level to find order in this chaos." and "Ultimately, the true test for function lies in the detailed, mechanistic dissection of the genetic pathways and cellular activities for each individual putative lncRNA."
http://www.ncbi.nlm.nih.gov/pubmed/15851066 - Non-coding RNAs: hope or hype?
Arguments against ncRNA
Lack of conservation - http://www.nature.com/nature/journal/v431/n7010/full/nature03016.html (argument against http://www.nature.com/nature/journal/v420/n6915/full/nature01266.html)
Well known ncRNAs are not conserved - http://www.nature.com/nature/journal/v431/n7010/full/nature03017.html
Transcriptional noise - http://www.nature.com/nsmb/journal/v14/n2/full/nsmb0207-103.html
Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity and Dynamics of Mammalian Proteomes - http://www.ncbi.nlm.nih.gov/pubmed/22056041
Long non-coding RNAs
Long non-coding RNAs (lncRNAs) are arbitrarily defined as non-coding RNAs longer than ~200 nucleotides, on the basis of a convenient practical cut-off in RNA purification protocols that excludes small RNAs.
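The length cut-off is easy to apply to a FASTA file such as the NONCODE downloads above. A minimal sketch, assuming a threshold of at least 200 nt (the definition only says roughly 200) and a hypothetical helper name filter_fasta_min_len; the demo uses toy sequences and a threshold of 8 so that it fits on a few lines:

```shell
#!/bin/sh
# filter_fasta_min_len (hypothetical helper): read FASTA from stdin and print
# only the records whose sequence is at least MIN nucleotides long.
# Multi-line sequences are joined onto a single line in the output.
filter_fasta_min_len() {
    awk -v min="$1" '
        function emit() {
            if (name != "" && length(seq) >= min) print name "\n" seq
        }
        /^>/ { emit(); name = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                             # accumulate sequence lines
        END { emit() }                               # flush the last record
    '
}

# Toy input: record "a" is 4 nt, record "b" is 10 nt split over two lines.
printf '>a\nACGT\n>b\nAAAAA\nCCCCC\n' | filter_fasta_min_len 8
```

On real data the call would look like zcat NONCODEv4_human.fa.gz | filter_fasta_min_len 200.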
See http://www.labome.com/method/LncRNA-Research-Resources.html
Examples
- Kcnq1ot1 - expression of the unspliced lncRNA Kcnq1 overlapping transcript 1 (Kcnq1ot1) silences Kcnq1 on the paternal allele
- Airn - expression of the paternal-specific non-coding transcript Airn ("antisense Igf2r RNA non-coding") is required for the silencing of three genes on the paternal allele
- Xist - expression of the lncRNA X-inactive specific transcript (Xist) from the designated inactive X-chromosome is essential for the silencing of the inactive X-chromosome
- HOTAIR - expression of HOTAIR, which resides in the HOXC cluster, silences some genes in the Homeobox D (HOXD) cluster
- KLHL1AS - http://www.ncbi.nlm.nih.gov/pubmed/11919683
- H19 - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC360709/
- Xist - http://www.ncbi.nlm.nih.gov/pubmed/1423610
Papers
- Mercer et al., 2009 Long non-coding RNAs: insights into functions http://www.ncbi.nlm.nih.gov/pubmed/19188922
- Taft et al., 2010 Non-coding RNAs: regulators of disease http://www.ncbi.nlm.nih.gov/pubmed/19882673
- Wang et al., 2011 The long arm of long noncoding RNAs: roles as sensors regulating gene transcriptional programs http://www.ncbi.nlm.nih.gov/pubmed/20573714
- Ponting et al., 2009 Evolution and functions of long noncoding RNAs http://www.ncbi.nlm.nih.gov/pubmed/19239885
- Guttman et al., 2009 Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals http://www.ncbi.nlm.nih.gov/pubmed/19182780
- Kapranov et al., 2007 RNA maps reveal new RNA classes and a possible function for pervasive transcription http://www.ncbi.nlm.nih.gov/pubmed/17510325
- Louro et al., 2009 Long intronic noncoding RNA transcription: expression noise or expression choice? http://www.ncbi.nlm.nih.gov/pubmed/19071207
- Gupta et al., 2010 Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis http://www.ncbi.nlm.nih.gov/pubmed/20393566
LincRNAs
Long intergenic non-coding RNAs (lincRNAs) are long non-coding RNAs transcribed from intergenic regions, i.e. from loci that do not overlap annotated protein-coding genes.
How many lincRNAs?
GENCODE version 19 has 11,324 lincRNA transcripts (transcript entries whose gene_type is lincRNA):
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | awk '$3=="transcript" {print}' | perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
 145641 protein_coding
  17149 pseudogene
  11324 lincRNA
   9213 antisense
   3055 miRNA
   2206 processed_transcript
   2034 misc_RNA
   1916 snRNA
   1457 snoRNA
    814 sense_intronic
    527 rRNA
    316 sense_overlapping
    196 IG_V_pseudogene
    183 polymorphic_pseudogene
    144 IG_V_gene
     97 TR_V_gene
     74 TR_J_gene
     37 IG_D_gene
     27 TR_V_pseudogene
     25 3prime_overlapping_ncrna
     22 Mt_tRNA
     18 IG_C_gene
     18 IG_J_gene
     10 IG_C_pseudogene
      5 TR_C_gene
      4 TR_J_pseudogene
      3 IG_J_pseudogene
      3 TR_D_gene
      2 Mt_rRNA
The Human lincRNA Catalog (http://www.broadinstitute.org/genome_bio/human_lincrnas/?q=lincRNA_catalog) has 14,353 lincRNA transcripts
wget http://www.broadinstitute.org/genome_bio/human_lincrnas/sites/default/files/lincRNA_catalog/lincRNAs_transcripts.bed
head lincRNAs_transcripts.bed
chr1  139789  140339  TCONS_00000437  0.0  -  139789  140339  0,0,0  2  58,265,       0,285,
chr1  141473  149707  TCONS_00000438  0.0  -  141473  149707  0,0,0  2  1538,3322,    0,4912,
chr1  142807  146831  TCONS_00000439  0.0  -  142807  146831  0,0,0  3  204,124,190,  0,3578,3834,
chr1  160445  161525  TCONS_00000119  0.0  +  160445  161525  0,0,0  2  245,212,      0,868,
chr1  320161  321056  TCONS_00000120  0.0  +  320161  321056  0,0,0  3  492,58,25,    0,719,870,
chr1  459655  461954  TCONS_00000121  0.0  +  459655  461954  0,0,0  3  337,135,204,  0,1498,2095,
chr1  521368  523833  TCONS_00000442  0.0  -  521368  523833  0,0,0  3  370,135,337,  0,832,2128,
chr1  523008  530148  TCONS_00000122  0.0  +  523008  530148  0,0,0  2  73,534,       0,6606,
chr1  523047  529954  TCONS_00000123  0.0  +  523047  529954  0,0,0  2  62,340,       0,6567,
chr1  529832  530595  TCONS_00000124  0.0  +  529832  530595  0,0,0  2  304,133,      0,630,
cat lincRNAs_transcripts.bed | cut -f4 | sort -u | wc -l
14353
#how many spliced
cat lincRNAs_transcripts.bed | cut -f10 | sort | uniq -c | sort -k1rn
   8261 2
   3657 3
   1461 4
    557 5
    203 6
     81 7
     47 8
     31 1
     29 9
     11 10
      7 11
      3 12
      1 13
      1 14
      1 15
      1 20
      1 24
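Column 10 of a BED12 file is the block (exon) count, so transcripts with more than one block are spliced. A minimal sketch that tallies spliced versus single-exon transcripts in one pass; count_spliced is a hypothetical helper name and the two-line input is toy data:

```shell
#!/bin/sh
# count_spliced (hypothetical helper): read BED12 from stdin and report how
# many entries have more than one block (spliced) versus exactly one block.
count_spliced() {
    awk '
        $10 > 1  { multi++ }
        $10 == 1 { single++ }
        END { printf "spliced=%d unspliced=%d\n", multi, single }
    '
}

# Toy input: one two-exon transcript and one single-exon transcript.
printf 'chr1\t0\t10\ta\t0\t+\t0\t10\t0\t2\t3,3,\t0,7,\nchr1\t0\t10\tb\t0\t+\t0\t10\t0\t1\t10,\t0,\n' | count_spliced
```

Run as count_spliced < lincRNAs_transcripts.bed, this should agree with the tabulation above, where only 31 of the 14,353 transcripts have a single block.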
RefSeq (the UCSC hg19 refGene table) has 789 lincRNAs, counted as unique gene symbols containing LINC:
date
Thu Oct 23 13:48:07 JST 2014
echo 'SELECT * FROM refGene' | mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz
zcat refgene.tsv.gz | cut -f2 | grep -v "^name" | sort -u | wc -l
48418
zcat refgene.tsv.gz | cut -f2 | grep -v "^name" | cut -f1 -d'_' | sort | uniq -c
  40781 NM
  11311 NR
zcat refgene.tsv.gz | cut -f13 | grep LINC | sort -u | wc -l
789
Publications
- lincRNAs act in the circuitry controlling pluripotency and differentiation - http://www.ncbi.nlm.nih.gov/pubmed/21874018
  - Performed loss-of-function studies on most lincRNAs expressed in mouse embryonic stem (ES) cells and characterized the effects on gene expression.
- Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution - http://www.ncbi.nlm.nih.gov/pubmed/22196729
  - To better understand the evolution and functions of these enigmatic RNAs, we used chromatin marks, poly(A)-site mapping and RNA-Seq data to identify more than 550 distinct lincRNAs in zebrafish.
- Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses - http://www.ncbi.nlm.nih.gov/pubmed/21890647
  - Here, we present an integrative approach to define a reference catalog of >8000 human lincRNAs.
- lincRNAs: genomics, evolution, and mechanisms - http://www.ncbi.nlm.nih.gov/pubmed/23827673
  - This Review outlines the emerging understanding of lincRNAs in vertebrate animals, with emphases on how they are being identified and current conclusions and questions regarding their genomics, evolution and mechanisms of action.