Non-coding RNA

From Dave's wiki

A non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. The classical ncRNAs include transfer RNAs, ribosomal RNAs, and small nuclear RNAs. The sequencing of full-length complementary DNAs (cDNAs) by the FANTOM consortium revealed that many transcripts have no coding potential, i.e. they are non-coding. The FANTOM3 full-length cDNA database contains 102,801 cDNAs and can be accessed at ftp://fantom.gsc.riken.jp/fantomdb/3.0/. Here's part of the FANTOM3 press release (see http://fantom.gsc.riken.jp/3/doc/PressRelease.pdf), which states the significance of regulatory ncRNAs:

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs that until recently was not suspected to exist or be relevant to our biology.

Assessing the coding potential of these cDNAs revealed that 38.6% had very little coding potential (http://dx.doi.org/10.6084/m9.figshare.1046601). Several long non-coding RNAs that were characterised later had already been cloned by the FANTOM consortium; these include Hotair (http://www.ncbi.nlm.nih.gov/pubmed/17604720), which corresponds to the cDNA sequence http://www.ncbi.nlm.nih.gov/nuccore/26084774, and Braveheart (http://www.ncbi.nlm.nih.gov/pubmed/23352431), which corresponds to the cDNA sequence http://www.ncbi.nlm.nih.gov/nucleotide/AK143260.

ncRNAs are emerging as key regulators of embryogenesis. They control embryonic gene expression by several means, ranging from microRNA induced degradation of mRNAs to long ncRNA-mediated modification of chromatin. Many aspects of embryogenesis seem to be controlled by ncRNAs, including the maternal–zygotic transition, the maintenance of pluripotency, the patterning of the body axes, the specification and differentiation of cell types and the morphogenesis of organs. Drawing from several animal model systems, we describe two emerging themes for ncRNA function: promoting developmental transitions and maintaining developmental states. These examples also highlight the roles of ncRNAs in ensuring a robust commitment to one of two possible cell fates.

The ratio of non-coding to protein-coding DNA rises as a function of developmental complexity: http://www.nature.com/nrg/journal/v5/n4/fig_tab/nrg1321_F1.html and http://arxiv.org/abs/q-bio.GN/0401020 but see Junk DNA

Pervasive transcription constitutes a new level of eukaryotic genome regulation: http://www.nature.com/embor/journal/v10/n9/full/embor2009181.html

Simply fragments of pre-mRNAs: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1000371

but dismissed in http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1000625

Critique of ENCODE: http://gbe.oxfordjournals.org/content/5/3/578 and http://www.pnas.org/content/110/14/5294.full

Databases

NONCODE -> http://www.noncode.org/index.php

#!/bin/sh

wget http://www.noncode.org/datadownload/NONCODEv4_human.lncRNA.exp.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human.gene.exp.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human.func.gz
wget http://www.noncode.org/datadownload/NONCODEv4_hg19.lncAndGene.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4_human_cc.tgz
wget http://www.noncode.org/datadownload/NONCODEv4_human.fa.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_ncRNA.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA_Gene.bed.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_lncRNA.gtf.gz
wget http://www.noncode.org/datadownload/NONCODEv4u1_human_ncRNA.fa.gz

Some numbers

zcat NONCODEv4u1_human_ncRNA.fa.gz | grep "^>" | wc -l
145331
zcat NONCODEv4_human.fa.gz | grep "^>" | wc -l
148172
#As noted in their latest update
#2841 transcripts have been deleted.
expr 148172 - 145331
2841

I'm interested in the expression files:

zcat NONCODEv4_human.gene.exp.gz NONCODEv4_human.lncRNA.exp.gz | wc -l
151115

But the lncRNA expression file has a different number of columns for some rows.

zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print $n' | sort | uniq -c
    112 23
  95024 25
zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print if $n==23' | less

I will exclude them as they seem to have no expression value.

zcat "NONCODEv4_human.lncRNA.exp.gz" | perl -nle '@a=split(/\t/); $n=@a; print unless $n==23' | gzip > NONCODE_v4_human_lncRNA_no_zero.exp.gz
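
One caveat with the Perl one-liners above: split(/\t/) drops trailing empty fields, so a row that ends in empty columns is reported with fewer fields than it actually has. The sketch below is an awk alternative that counts every tab-separated field, including trailing empty ones. The function names count_fields and keep_rows_with_fields are my own and are not part of any NONCODE tooling.

```shell
#!/bin/sh
# Tabulate the number of tab-separated fields per line read from stdin.
# Output: "<number of lines> <number of fields>", sorted by field count.
# With -F'\t', awk counts trailing empty fields as fields.
count_fields() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print n[k], k }' | sort -k2,2n
}

# Keep only the lines that have exactly the requested number of fields.
# $1 = expected field count.
keep_rows_with_fields() {
    awk -F'\t' -v want="$1" 'NF == want'
}
```

Usage would look like `zcat NONCODEv4_human.lncRNA.exp.gz | count_fields`, followed by `zcat NONCODEv4_human.lncRNA.exp.gz | keep_rows_with_fields 25 | gzip > NONCODE_v4_human_lncRNA_no_zero.exp.gz`. The tally may differ from the Perl output if the short rows simply end in empty fields.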

Some very quick statistics using R

data <- read.table("NONCODE_v4_human_lncRNA_no_zero.exp.gz", header=T, stringsAsFactors=F, sep="\t")
summary(data)
NONCODE.ID           adipose             adrenal              brain
Length:95023       Min.   :     0.00   Min.   :     0.00   Min.   :    0.00
Class :character   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00
Mode  :character   Median :     0.00   Median :     0.05   Median :    0.05
                   Mean   :    11.34   Mean   :    10.14   Mean   :    6.51
                   3rd Qu.:     0.50   3rd Qu.:     0.92   3rd Qu.:    0.71
                   Max.   :191100.00   Max.   :102810.00   Max.   :69899.50
   brain_R             breast              colon              foreskin
Min.   :    0.00   Min.   :     0.00   Min.   :     0.00   Min.   :    0.000
1st Qu.:    0.00   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.000
Median :    0.00   Median :     0.02   Median :     0.00   Median :    0.000
Mean   :    6.48   Mean   :    11.19   Mean   :    11.70   Mean   :    4.370
3rd Qu.:    0.42   3rd Qu.:     0.67   3rd Qu.:     0.47   3rd Qu.:    0.306
Max.   :58651.40   Max.   :157946.00   Max.   :175879.00   Max.   :31165.600
    heart              hela_R             HLF_1              HLF_2
Min.   :    0.00   Min.   :     0.0   Min.   :     0.0   Min.   :    0.00
1st Qu.:    0.00   1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:    0.00
Median :    0.00   Median :     0.0   Median :     0.0   Median :    0.00
Mean   :    3.79   Mean   :     9.7   Mean   :    27.9   Mean   :    5.41
3rd Qu.:    0.32   3rd Qu.:     0.2   3rd Qu.:     0.2   3rd Qu.:    0.41
Max.   :56278.60   Max.   :482248.0   Max.   :626288.0   Max.   :40215.70
    kidney             liver             liver_R              lung
Min.   :    0.00   Min.   :    0.00   Min.   :     0.0   Min.   :    0.000
1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:     0.0   1st Qu.:    0.000
Median :    0.00   Median :    0.00   Median :     0.0   Median :    0.000
Mean   :    5.77   Mean   :    4.38   Mean   :    22.6   Mean   :    4.929
3rd Qu.:    0.63   3rd Qu.:    0.20   3rd Qu.:     0.0   3rd Qu.:    0.523
Max.   :75171.40   Max.   :35716.30   Max.   :499541.0   Max.   :12131.100
  lymphNode             ovary            placenta_R          prostate
Min.   :    0.000   Min.   :    0.00   Min.   :    0.00   Min.   :    0.000
1st Qu.:    0.000   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.000
Median :    0.000   Median :    0.05   Median :    0.00   Median :    0.000
Mean   :    5.160   Mean   :    8.93   Mean   :    6.33   Mean   :    4.976
3rd Qu.:    0.711   3rd Qu.:    0.86   3rd Qu.:    0.57   3rd Qu.:    0.652
Max.   :11890.000   Max.   :60103.10   Max.   :75331.20   Max.   :12193.400
skeltalMuscle          testes            testes_R           thyroid
Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00
1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00
Median :    0.00   Median :    0.20   Median :    0.28   Median :    0.04
Mean   :    4.68   Mean   :    8.46   Mean   :    5.11   Mean   :    7.12
3rd Qu.:    0.17   3rd Qu.:    1.43   3rd Qu.:    1.31   3rd Qu.:    0.88
Max.   :49610.30   Max.   :63606.50   Max.   :38256.60   Max.   :74919.30
whiteBloodCell
Min.   :    0.000
1st Qu.:    0.000
Median :    0.000
Mean   :    5.017
3rd Qu.:    0.326
Max.   :10675.900

data[data$adipose==191100,]
        NONCODE.ID adipose adrenal   brain brain_R breast  colon foreskin
86398 NONHSAT135828  191100  102810 52456.1 14910.2 157946 175879  3267.11
       heart  hela_R  HLF_1   HLF_2  kidney   liver liver_R    lung lymphNode
86398 56278.6 808.206 271.58 14484.9 75171.4 35716.3 500.349 11699.1     11890
       ovary placenta_R prostate skeltalMuscle  testes testes_R thyroid
86398 60103.1    3073.83  12193.4       49610.3 63606.5  3159.86 74919.3
     whiteBloodCell
86398        6783.71

NONHSAT135828 is very highly expressed in most samples; however, I'm a bit worried about the roughly 50-fold difference between HLF_1 (271.58) and HLF_2 (14484.9).
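
To check whether the gap between the two HLF samples is specific to this transcript or is seen more widely, the per-transcript ratio can be tabulated. The following is a minimal awk sketch, assuming the header names in the file match those shown by R (HLF_1, HLF_2) and that the first column holds the NONCODE ID. replicate_ratio is a hypothetical helper name.

```shell
#!/bin/sh
# Read a tab-separated table with a header line from stdin and print, for
# every row where both values are above zero:
#   ID <tab> value in column A <tab> value in column B <tab> larger/smaller
# $1 and $2 are the header names of the two columns to compare.
# Exits with status 1 if either column name is not found in the header.
replicate_ratio() {
    awk -F'\t' -v a="$1" -v b="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == a) ca = i
                if ($i == b) cb = i
            }
            if (!ca || !cb) exit 1
            next
        }
        $ca > 0 && $cb > 0 {
            r = ($ca > $cb) ? $ca / $cb : $cb / $ca
            printf "%s\t%s\t%s\t%.1f\n", $1, $ca, $cb, r
        }'
}
```

For example, `zcat NONCODE_v4_human_lncRNA_no_zero.exp.gz | replicate_ratio HLF_1 HLF_2 | sort -t"$(printf '\t')" -k4,4nr | head` would list the transcripts with the largest discrepancy. Rows with a zero in either sample are skipped to avoid dividing by zero, so they need to be inspected separately.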

Different classes

Reviews

http://www.ncbi.nlm.nih.gov/pubmed/20628352, which discusses methods for annotating the non-coding regions of the genome.

http://www.ncbi.nlm.nih.gov/pubmed/23463798, which concludes that doing experiments is the best way to test whether a non-coding RNA is functional: "In the end, it may not be possible or meaningful to try to apply criteria such as stability, conservation, and expression level to find order in this chaos." and "Ultimately, the true test for function lies in the detailed, mechanistic dissection of the genetic pathways and cellular activities for each individual putative lncRNA."

http://www.ncbi.nlm.nih.gov/pubmed/15851066 - Non-coding RNAs: hope or hype?

Arguments against ncRNA

Lack of conservation - http://www.nature.com/nature/journal/v431/n7010/full/nature03016.html (argument against http://www.nature.com/nature/journal/v420/n6915/full/nature01266.html)

Well known ncRNAs are not conserved - http://www.nature.com/nature/journal/v431/n7010/full/nature03017.html

Transcriptional noise - http://www.nature.com/nsmb/journal/v14/n2/full/nsmb0207-103.html

Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity and Dynamics of Mammalian Proteomes - http://www.ncbi.nlm.nih.gov/pubmed/22056041

Long non-coding RNAs

Long non-coding RNAs (lncRNAs) are arbitrarily defined as non-coding transcripts longer than ~200 nucleotides, on the basis of a convenient practical cut-off in RNA purification protocols that excludes small RNAs.
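
As an illustration of applying this cut-off, here is a small awk sketch that keeps only FASTA records of at least a given length (200 by default). The function name filter_fasta_by_length is hypothetical, and the output writes each sequence on a single line.

```shell
#!/bin/sh
# Read FASTA from stdin and print only the records whose sequence is at
# least min_len characters long. Multi-line sequences are joined first.
# $1 = min_len (optional, defaults to 200).
filter_fasta_by_length() {
    awk -v min="${1:-200}" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }'
}
```

For example, `zcat NONCODEv4u1_human_ncRNA.fa.gz | filter_fasta_by_length 200 | grep -c "^>"` would count the records that pass the length cut-off.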

See http://www.labome.com/method/LncRNA-Research-Resources.html

Examples

  • Kcnq1ot1 - expression of the un-spliced lncRNA Kcnq1 overlapping transcript 1 (Kcnq1ot1) silences Kcnq1 on the paternal allele
  • Airn - expression of the paternal-specific non-coding transcript "antisense Igf2r RNA non-coding" is required for the silencing of three genes on the paternal allele.
  • Xist - expression of lncRNA X-inactive specific transcript (Xist) from the designated inactive X-chromosome is essential for the silencing of the inactive X-chromosome
  • HOTAIR - expression of HOTAIR, which resides in the HOXC cluster, silences some genes on the Homeobox D (HOXD) cluster
  • KLHL1AS - http://www.ncbi.nlm.nih.gov/pubmed/11919683
  • H19 - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC360709/
  • Xist - http://www.ncbi.nlm.nih.gov/pubmed/1423610

Papers

LincRNAs

Long intergenic non-coding RNAs are long non-coding RNAs within the intergenic regions.

How many lincRNAs?

GENCODE version 19 has 11,324 lincRNA transcripts (transcript records whose gene_type is lincRNA):

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | awk '$3=="transcript" {print}' |
perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
145641 protein_coding
 17149 pseudogene
 11324 lincRNA
  9213 antisense
  3055 miRNA
  2206 processed_transcript
  2034 misc_RNA
  1916 snRNA
  1457 snoRNA
   814 sense_intronic
   527 rRNA
   316 sense_overlapping
   196 IG_V_pseudogene
   183 polymorphic_pseudogene
   144 IG_V_gene
    97 TR_V_gene
    74 TR_J_gene
    37 IG_D_gene
    27 TR_V_pseudogene
    25 3prime_overlapping_ncrna
    22 Mt_tRNA
    18 IG_C_gene
    18 IG_J_gene
    10 IG_C_pseudogene
     5 TR_C_gene
     4 TR_J_pseudogene
     3 IG_J_pseudogene
     3 TR_D_gene
     2 Mt_rRNA
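
The tally above counts transcript records, so a gene with several isoforms is counted once per isoform. The awk sketch below counts any feature type per gene_type, so that genes and transcripts can be tallied separately. It assumes the GENCODE GTF layout (feature type in column 3, attributes in column 9), and count_by_gene_type is a hypothetical name.

```shell
#!/bin/sh
# Count GTF records of one feature type (e.g. "gene" or "transcript") per
# gene_type attribute. Reads an uncompressed GTF from stdin and skips
# comment lines. $1 = feature type to count.
count_by_gene_type() {
    awk -F'\t' -v feature="$1" '
        /^#/ { next }
        $3 == feature && match($9, /gene_type "[^"]*"/) {
            # the attribute key, the space and the opening quote take 11
            # characters; the closing quote takes one more
            type = substr($9, RSTART + 11, RLENGTH - 12)
            n[type]++
        }
        END { for (t in n) print n[t], t }' | sort -k1,1nr -k2,2
}
```

For example, `zcat gencode.v19.annotation.gtf.gz | count_by_gene_type gene` would give the number of loci per gene_type, and passing `transcript` instead should reproduce the table above.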

The Human lincRNA Catalog (http://www.broadinstitute.org/genome_bio/human_lincrnas/?q=lincRNA_catalog) has 14,353 lincRNA transcripts:

wget http://www.broadinstitute.org/genome_bio/human_lincrnas/sites/default/files/lincRNA_catalog/lincRNAs_transcripts.bed
head lincRNAs_transcripts.bed 
chr1    139789  140339  TCONS_00000437  0.0     -       139789  140339  0,0,0   2       58,265, 0,285,
chr1    141473  149707  TCONS_00000438  0.0     -       141473  149707  0,0,0   2       1538,3322,      0,4912,
chr1    142807  146831  TCONS_00000439  0.0     -       142807  146831  0,0,0   3       204,124,190,    0,3578,3834,
chr1    160445  161525  TCONS_00000119  0.0     +       160445  161525  0,0,0   2       245,212,        0,868,
chr1    320161  321056  TCONS_00000120  0.0     +       320161  321056  0,0,0   3       492,58,25,      0,719,870,
chr1    459655  461954  TCONS_00000121  0.0     +       459655  461954  0,0,0   3       337,135,204,    0,1498,2095,
chr1    521368  523833  TCONS_00000442  0.0     -       521368  523833  0,0,0   3       370,135,337,    0,832,2128,
chr1    523008  530148  TCONS_00000122  0.0     +       523008  530148  0,0,0   2       73,534, 0,6606,
chr1    523047  529954  TCONS_00000123  0.0     +       523047  529954  0,0,0   2       62,340, 0,6567,
chr1    529832  530595  TCONS_00000124  0.0     +       529832  530595  0,0,0   2       304,133,        0,630,
cat lincRNAs_transcripts.bed | cut -f4 | sort -u | wc -l
14353
#number of exons per transcript (BED blockCount, column 10); a count of 1 means unspliced
cat lincRNAs_transcripts.bed | cut -f10 | sort | uniq -c | sort -k1rn
  8261 2
  3657 3
  1461 4
   557 5
   203 6
    81 7
    47 8
    31 1
    29 9
    11 10
     7 11
     3 12
     1 13
     1 14
     1 15
     1 20
     1 24
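
The same question (spliced or not) can be reduced to two numbers. This is a minimal awk sketch that treats column 10 of a BED12 file as the exon count; spliced_summary is a hypothetical name, and fields are split on any whitespace.

```shell
#!/bin/sh
# Summarise BED12 records read from stdin: how many transcripts have a
# single block (unspliced) and how many have more than one (spliced).
spliced_summary() {
    awk '
        $10 == 1 { single++ }
        $10 > 1  { multi++ }
        END { printf "single_exon\t%d\nmulti_exon\t%d\n", single, multi }'
}
```

Run as `spliced_summary < lincRNAs_transcripts.bed`. Going by the tally above, this should report 31 single-exon and 14,322 multi-exon transcripts (14,353 - 31).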

RefSeq has 789 lincRNAs, counted as unique gene symbols containing "LINC":

date
Thu Oct 23 13:48:07 JST 2014
echo 'SELECT * FROM refGene' | mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz
zcat refgene.tsv.gz | cut -f2 | grep -v "^name" | sort -u | wc -l
48418
zcat refgene.tsv.gz | cut -f2 | grep -v "^name" | cut -f1 -d'_' | sort | uniq -c
 40781 NM
 11311 NR
zcat refgene.tsv.gz | cut -f13 | grep LINC | sort -u | wc -l
789
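
Note that the NM/NR tally above counts rows (40,781 + 11,311 = 52,092), which is more than the 48,418 unique accessions, presumably because some accessions are listed at more than one genomic location. The sketch below counts unique accessions per prefix instead; unique_by_prefix is a hypothetical name.

```shell
#!/bin/sh
# Read one accession per line from stdin (e.g. NM_000014, NR_046018) and
# print the number of unique accessions for each prefix before the "_".
unique_by_prefix() {
    sort -u | cut -f1 -d'_' | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example: `zcat refgene.tsv.gz | cut -f2 | grep -v "^name" | unique_by_prefix`.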

Publications

  • lincRNAs act in the circuitry controlling pluripotency and differentiation - http://www.ncbi.nlm.nih.gov/pubmed/21874018
    • Performed loss-of-function studies on most lincRNAs expressed in mouse embryonic stem (ES) cells and characterized the effects on gene expression.
  • Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution - http://www.ncbi.nlm.nih.gov/pubmed/22196729
    • To better understand the evolution and functions of these enigmatic RNAs, we used chromatin marks, poly(A)-site mapping and RNA-Seq data to identify more than 550 distinct lincRNAs in zebrafish.
  • Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses - http://www.ncbi.nlm.nih.gov/pubmed/21890647
    • Here, we present an integrative approach to define a reference catalog of >8000 human lincRNAs.
  • lincRNAs: genomics, evolution, and mechanisms - http://www.ncbi.nlm.nih.gov/pubmed/23827673
    • This Review outlines the emerging understanding of lincRNAs in vertebrate animals, with emphases on how they are being identified and current conclusions and questions regarding their genomics, evolution and mechanisms of action.

See also