Coding potential of non-coding RNA

The Coding Potential Assessment Tool (CPAT) was developed to assess the coding potential of RNA sequences without the need of sequence alignment. More information on the tool can be found at their website and corresponding publication.

Briefly CPAT uses four sequence features to determine coding potential:

ORF length
ORF coverage
Fickett TESTCODE statistic
Hexamer usage bias

ORF length is one of the most fundamental features used to distinguish non-protein-coding RNA from protein-coding RNA because a long putative ORF is unlikely to be observed by random chance in non-coding sequences. ORF coverage was defined as the ratio of ORF length to transcript length. While some non-coding sequences may contain ORFs by chance, they usually have a lower coverage. The Fickett TESTCODE statistic distinguishes non-protein-coding RNA from protein-coding RNA according to the combinational effect of nucleotide composition and codon usage bias. The fourth and last feature is hexamer usage bias and is useful because of the dependence between adjacent amino acids in proteins.

In this post, I test CPAT by inputting a set of known protein coding RNAs (RefSeqs) and a set of putative non-coding RNAs taken from GENCODE. I use the logit model file and the hexamer frequency table files that were provided with CPAT.

wget ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.lncRNA_transcripts.fa.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz
python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g gencode.v16.lncRNA_transcripts.fa -o lnc_rna_cpat.out
python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g human.rna.fna.gz -o refseq
#stats is available here https://github.com/arq5x/filo
#coding potential of lncRNA
cat lnc_rna_cpat.out | cut -f6 | stats
Total lines:            22444
Sum of lines:           1773.01404739716
Ari. Mean:              0.0789972396808571
Geo. Mean:              0.0157839580319542
Median:                 0.0157583998480091
Mode:                   1 (N=22)
Anti-Mode:              2.22144653496105e-13 (N=1)
Minimum:                2.22144653496105e-13
Maximum:                1
Variance:               0.0332286952048533
StdDev:                 0.182287397273792
#coding potential of mRNA
cat refseq | cut -f6 | stats
Total lines:            47251
Sum of lines:           36677.7025936008
Ari. Mean:              0.776231245764128
Geo. Mean:              0.405030720938695
Median:                 0.999573383225558
Mode:                   1 (N=3820)
Anti-Mode:              2.26544376749313e-13 (N=1)
Minimum:                2.26544376749313e-13
Maximum:                1
Variance:               0.141651908600363
StdDev:                 0.376366720899129

Input into R

refseq <- read.table("refseq",header=T)
lncrna <- read.table("lnc_rna_cpat.out",header=T)
head(refseq)
                              mRNA_size ORF_size Fickett_score Hexamer_score
GI|239744030|REF|XR_017086.3|      3159     1521        1.1446    0.34745117
GI|239745142|REF|XR_040656.2|       747      747        0.9367    0.25149694
GI|239746044|REF|XR_078542.1|      1338      681        0.8234    0.05853078
GI|239746948|REF|XR_037268.2|      1057      429        0.7305    0.00207399
GI|239749771|REF|XR_038087.2|      3159     1521        1.1446    0.34918451
GI|239751533|REF|XR_079389.1|      1338      681        0.8234    0.05091209
                              coding_prob
GI|239744030|REF|XR_017086.3|   0.9999992
GI|239745142|REF|XR_040656.2|   0.9894146
GI|239746044|REF|XR_078542.1|   0.8964281
GI|239746948|REF|XR_037268.2|   0.2289991
GI|239749771|REF|XR_038087.2|   0.9999992
GI|239751533|REF|XR_079389.1|   0.8916638
head(lncrna)
                                                                                                                  mRNA_size
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712|            712
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535|            535
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748|      2748
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491|        491
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629|        629
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336|        336
                                                                                                                  ORF_size
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712|           228
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535|           111
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748|      240
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491|       258
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629|       258
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336|       114
                                                                                                                  Fickett_score
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712|             0.8192
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535|             0.6819
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748|        1.1129
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491|         0.9340
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629|         0.6732
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336|         1.1303
                                                                                                                  Hexamer_score
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712|        0.089700116
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535|        0.004253738
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748|  -0.036719025
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491|    0.066306017
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629|   -0.262129058
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336|    0.248847902
                                                                                                                  coding_prob
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712|      0.075234173
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535|      0.008503811
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 0.082058427
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491|  0.122934737
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629|  0.006955770
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336|  0.155721804
boxplot(refseq$coding_prob, lncrna$coding_prob, main="Coding Probability", names=c("RefSeq","lncRNA"))

The coding probability of RefSeq RNAs are very high compared to lncRNAs, as expected.

install.packages("rgl")
library(rgl)
dim(refseq)
#[1] 47251     5
refseq$colour <- rep(2,47251)
dim(lncrna)
#[1] 22444     5
lncrna$colour <- rep(3,22444)
combined <- rbind(refseq, lncrna)
plot3d(combined[,c(2:4)],col=combined$colour)
rgl.postscript("cpat.pdf","pdf")

#install if necessary
sudo apt-get install imagemagick
convert cpat.pdf cpat.png

In red are RefSeq mRNAs and in green are lncRNAs. The several long ORFs in the RefSeq mRNAs spreads out the plot.

Remove the long ORFs

summary(combined$ORF_size)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0    192.0    510.0    994.6   1356.0 108000.0
table(combined$ORF_size>40000)

FALSE  TRUE 
69688     7
new <- combined[combined$ORF_size<40000,]
dim(new)
[1] 69688     6
dim(combined)
[1] 69695     6
plot3d(new[,c(2:4)],col=new$colour)
rgl.postscript("new_cpat.pdf","pdf")

Most of the lncRNA have very short or no ORFs.

FANTOM3 clones

One of the main findings of FANTOM3 was the appreciation that many transcribed RNAs were not protein coding or had very low coding potential. As part of the FANTOM project, a collection of mouse full-length cDNAs were made; there are a total of 102,801 full-length cDNAs. I wonder how many of them have a very low coding potential. As per above:

#download the FANTOM3 full length cDNAs
wget ftp://fantom.gsc.riken.jp/fantomdb/3.0/fantom3.00.seq.gz
gunzip fantom3.00.seq.gz
mv fantom3.00.seq fantom3_cdna.fa
time python ~/src/CPAT-1.2.1/usr/local/bin/cpat.py -d /home/davetang/src/CPAT-1.2.1/dat/Mouse_train.RData -x /home/davetang/src/CPAT-1.2.1/dat/Mouse_Hexamer.tab -g fantom3_cdna.fa -o fantom3.out
102801 genes finished
real    5m46.575s
user    5m44.558s
sys     0m0.674s
head fantom3.out
mRNA_size       ORF_size        Fickett_score   Hexamer_score   coding_prob
RI|0610005A07|R000001A15|1277   1277    657     1.2052  0.470737800343  0.970083894611433
RI|0610005A19|R000001G01|959    959     315     1.1752  0.370696945252  0.635522146410408
RI|0610005A21|R000001G07|546    546     213     1.0958  0.426770923929  0.466524310212968
RI|0610005B05|R000001A09|630    630     444     1.3071  0.440894832259  0.883242064704715
RI|0610005B13|R000001C15|150    150     51      0.7433  0.356266243426  0.0939523506409684
RI|0610005B19|R000001G03|628    628     444     1.3071  0.440894832259  0.883245792471949
RI|0610005C11|R000001C05|1333   1333    1017    1.2668  0.423820060645  0.997675905453406
RI|0610005C13|R000001C17|897    897     240     0.7017  -0.0668194585448        0.073811043700264
RI|0610005C16|R000001E09|505    505     210     1.2186  0.302973942291  0.409297553064121
gzip fantom3.out

Visualise using R:

f3 <- read.table("fantom3.out.gz",
                 header=T)

summary(f3$coding_prob)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000001 0.0644300 0.6948000 0.5521000 0.9964000 1.0000000

table(f3$coding_prob<0.2)

FALSE  TRUE 
63093 39708 

hist(f3$coding_prob,
     breaks=100,
     xlab="Coding probability",
     main="Coding probability of FANTOM3 mouse full-length cDNAs",
     col=1)

Please cite this DOI if you wish to use this figure: http://dx.doi.org/10.6084/m9.figshare.1046601.

Conclusion

The Coding Potential Assessment Tool (CPAT) is a very fast tool for assessing the coding potential of many RNA transcripts. Inputting a set of known protein coding RNAs versus a set of non-coding RNAs returned results as expected, which was that the former had a much higher coding potential than the latter class of RNAs. Lastly, using CPAT on FANTOM3 full-length cDNAs seemed to confirm that a relatively large portion of transcribed RNAs are non-coding.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

7 comments Add yours

Alex P says:

July 3, 2013 at 03:33

Hi Dave,

When running:

python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c CDS.fa -n 5-3prime.fa -o agam -r genome.fa

I get this error:

Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so

What do you think I’m doing wrong? I’m on OS X.

Alex P says:

July 3, 2013 at 14:17

Hi Dave,

I tried running:

python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c agamCDS.fa -n agambiae.BASEFEATURES_PEST-AgamP3.6.gff3.5-3prime.fa -o agam -r agambiae.CHROMOSOMES-PEST.AgamP3.fa

and got this error message:
____________________________________
Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
_____________________________________

What do you think could be happening here?

1. Alex P says:
  
  July 3, 2013 at 14:19
  
  By the way, I am on OS X.
  
2. Alex P says:
  
  July 3, 2013 at 20:57
  
  Hi Dave, I was able to run cpat successfully.
  
  I have a new question now, however. When I plot my histogram of coding potential, the values in column 5 do not appear to be a probability. My x-axis is ranging from -3 to 2 and not between 0 and 1.0…what am I really graphing here?
  
  1. Davo says:
    
    July 4, 2013 at 02:52
    
    Hi Alex,
    
    You’re right, it should be between 0 and 1. I am not really sure and it may be hard to pinpoint without looking at your data, how you ran the analysis and your output.
    
    I guess the easiest thing to check is whether you’re plotting the right column. Sorry I couldn’t be of more help.
    
    Cheers,
    
    Dave
    
Alex P says:

July 4, 2013 at 03:05

I only have 5 columns instead of 6. I contacted the developer who said it’s an issue with R. He said my training dataset may be so different that the algorithm cannot converge during the make_logit step and therefore cpat.py fails to generate a probability of coding potential.

This is unfortunate…

1. Davo says:
  
  July 4, 2013 at 13:57
  
  Ah that is unfortunate. Well I hope you can find an alternative. Good luck!

FANTOM3 clones

Conclusion

Leave a Reply Cancel reply