The Coding Potential Assessment Tool (CPAT) was developed to assess the coding potential of RNA sequences without the need of sequence alignment. More information on the tool can be found at their website and corresponding publication.
Briefly CPAT uses four sequence features to determine coding potential:
- ORF length
- ORF coverage
- Fickett TESTCODE statistic
- Hexamer usage bias
ORF length is one of the most fundamental features used to distinguish non-protein-coding RNA from protein-coding RNA because a long putative ORF is unlikely to be observed by random chance in non-coding sequences. ORF coverage was defined as the ratio of ORF length to transcript length. While some non-coding sequences may contain ORFs by chance, they usually have a lower coverage. The Fickett TESTCODE statistic distinguishes non-protein-coding RNA from protein-coding RNA according to the combinational effect of nucleotide composition and codon usage bias. The fourth and last feature is hexamer usage bias and is useful because of the dependence between adjacent amino acids in proteins.
In this post, I test CPAT by inputting a set of known protein coding RNAs (RefSeqs) and a set of putative non-coding RNAs taken from GENCODE. I use the logit model file and the hexamer frequency table files that were provided with CPAT.
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.lncRNA_transcripts.fa.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g gencode.v16.lncRNA_transcripts.fa -o lnc_rna_cpat.out python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g human.rna.fna.gz -o refseq #stats is available here https://github.com/arq5x/filo #coding potential of lncRNA cat lnc_rna_cpat.out | cut -f6 | stats Total lines: 22444 Sum of lines: 1773.01404739716 Ari. Mean: 0.0789972396808571 Geo. Mean: 0.0157839580319542 Median: 0.0157583998480091 Mode: 1 (N=22) Anti-Mode: 2.22144653496105e-13 (N=1) Minimum: 2.22144653496105e-13 Maximum: 1 Variance: 0.0332286952048533 StdDev: 0.182287397273792 #coding potential of mRNA cat refseq | cut -f6 | stats Total lines: 47251 Sum of lines: 36677.7025936008 Ari. Mean: 0.776231245764128 Geo. Mean: 0.405030720938695 Median: 0.999573383225558 Mode: 1 (N=3820) Anti-Mode: 2.26544376749313e-13 (N=1) Minimum: 2.26544376749313e-13 Maximum: 1 Variance: 0.141651908600363 StdDev: 0.376366720899129
Input into R
refseq <- read.table("refseq",header=T)
lncrna <- read.table("lnc_rna_cpat.out",header=T)
head(refseq)
mRNA_size ORF_size Fickett_score Hexamer_score
GI|239744030|REF|XR_017086.3| 3159 1521 1.1446 0.34745117
GI|239745142|REF|XR_040656.2| 747 747 0.9367 0.25149694
GI|239746044|REF|XR_078542.1| 1338 681 0.8234 0.05853078
GI|239746948|REF|XR_037268.2| 1057 429 0.7305 0.00207399
GI|239749771|REF|XR_038087.2| 3159 1521 1.1446 0.34918451
GI|239751533|REF|XR_079389.1| 1338 681 0.8234 0.05091209
coding_prob
GI|239744030|REF|XR_017086.3| 0.9999992
GI|239745142|REF|XR_040656.2| 0.9894146
GI|239746044|REF|XR_078542.1| 0.8964281
GI|239746948|REF|XR_037268.2| 0.2289991
GI|239749771|REF|XR_038087.2| 0.9999992
GI|239751533|REF|XR_079389.1| 0.8916638
head(lncrna)
mRNA_size
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 712
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 535
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 2748
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 491
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 629
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 336
ORF_size
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 228
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 111
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 240
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 258
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 258
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 114
Fickett_score
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.8192
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.6819
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 1.1129
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.9340
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 0.6732
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 1.1303
Hexamer_score
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.089700116
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.004253738
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| -0.036719025
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.066306017
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| -0.262129058
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 0.248847902
coding_prob
ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.075234173
ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.008503811
ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 0.082058427
ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.122934737
ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 0.006955770
ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 0.155721804
boxplot(refseq$coding_prob, lncrna$coding_prob, main="Coding Probability", names=c("RefSeq","lncRNA"))
The coding probability of RefSeq RNAs are very high compared to lncRNAs, as expected.
install.packages("rgl")
library(rgl)
dim(refseq)
#[1] 47251 5
refseq$colour <- rep(2,47251)
dim(lncrna)
#[1] 22444 5
lncrna$colour <- rep(3,22444)
combined <- rbind(refseq, lncrna)
plot3d(combined[,c(2:4)],col=combined$colour)
rgl.postscript("cpat.pdf","pdf")
#install if necessary sudo apt-get install imagemagick convert cpat.pdf cpat.png
In red are RefSeq mRNAs and in green are lncRNAs. The several long ORFs in the RefSeq mRNAs spreads out the plot.
Remove the long ORFs
summary(combined$ORF_size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 192.0 510.0 994.6 1356.0 108000.0
table(combined$ORF_size>40000)
FALSE TRUE
69688 7
new <- combined[combined$ORF_size<40000,]
dim(new)
[1] 69688 6
dim(combined)
[1] 69695 6
plot3d(new[,c(2:4)],col=new$colour)
rgl.postscript("new_cpat.pdf","pdf")
Most of the lncRNA have very short or no ORFs.
FANTOM3 clones
One of the main findings of FANTOM3 was the appreciation that many transcribed RNAs were not protein coding or had very low coding potential. As part of the FANTOM project, a collection of mouse full-length cDNAs were made; there are a total of 102,801 full-length cDNAs. I wonder how many of them have a very low coding potential. As per above:
#download the FANTOM3 full length cDNAs wget ftp://fantom.gsc.riken.jp/fantomdb/3.0/fantom3.00.seq.gz gunzip fantom3.00.seq.gz mv fantom3.00.seq fantom3_cdna.fa time python ~/src/CPAT-1.2.1/usr/local/bin/cpat.py -d /home/davetang/src/CPAT-1.2.1/dat/Mouse_train.RData -x /home/davetang/src/CPAT-1.2.1/dat/Mouse_Hexamer.tab -g fantom3_cdna.fa -o fantom3.out 102801 genes finished real 5m46.575s user 5m44.558s sys 0m0.674s head fantom3.out mRNA_size ORF_size Fickett_score Hexamer_score coding_prob RI|0610005A07|R000001A15|1277 1277 657 1.2052 0.470737800343 0.970083894611433 RI|0610005A19|R000001G01|959 959 315 1.1752 0.370696945252 0.635522146410408 RI|0610005A21|R000001G07|546 546 213 1.0958 0.426770923929 0.466524310212968 RI|0610005B05|R000001A09|630 630 444 1.3071 0.440894832259 0.883242064704715 RI|0610005B13|R000001C15|150 150 51 0.7433 0.356266243426 0.0939523506409684 RI|0610005B19|R000001G03|628 628 444 1.3071 0.440894832259 0.883245792471949 RI|0610005C11|R000001C05|1333 1333 1017 1.2668 0.423820060645 0.997675905453406 RI|0610005C13|R000001C17|897 897 240 0.7017 -0.0668194585448 0.073811043700264 RI|0610005C16|R000001E09|505 505 210 1.2186 0.302973942291 0.409297553064121 gzip fantom3.out
Visualise using R:
f3 <- read.table("fantom3.out.gz",
header=T)
summary(f3$coding_prob)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000001 0.0644300 0.6948000 0.5521000 0.9964000 1.0000000
table(f3$coding_prob<0.2)
FALSE TRUE
63093 39708
hist(f3$coding_prob,
breaks=100,
xlab="Coding probability",
main="Coding probability of FANTOM3 mouse full-length cDNAs",
col=1)
Please cite this DOI if you wish to use this figure: http://dx.doi.org/10.6084/m9.figshare.1046601.
Conclusion
The Coding Potential Assessment Tool (CPAT) is a very fast tool for assessing the coding potential of many RNA transcripts. Inputting a set of known protein coding RNAs versus a set of non-coding RNAs returned results as expected, which was that the former had a much higher coding potential than the latter class of RNAs. Lastly, using CPAT on FANTOM3 full-length cDNAs seemed to confirm that a relatively large portion of transcribed RNAs are non-coding.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
When running:
python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c CDS.fa -n 5-3prime.fa -o agam -r genome.fa
I get this error:
Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
What do you think I’m doing wrong? I’m on OS X.
Hi Dave,
I tried running:
python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c agamCDS.fa -n agambiae.BASEFEATURES_PEST-AgamP3.6.gff3.5-3prime.fa -o agam -r agambiae.CHROMOSOMES-PEST.AgamP3.fa
and got this error message:
____________________________________
Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
_____________________________________
What do you think could be happening here?
By the way, I am on OS X.
Hi Dave, I was able to run cpat successfully.
I have a new question now, however. When I plot my histogram of coding potential, the values in column 5 do not appear to be a probability. My x-axis is ranging from -3 to 2 and not between 0 and 1.0…what am I really graphing here?
Hi Alex,
You’re right, it should be between 0 and 1. I am not really sure and it may be hard to pinpoint without looking at your data, how you ran the analysis and your output.
I guess the easiest thing to check is whether you’re plotting the right column. Sorry I couldn’t be of more help.
Cheers,
Dave
I only have 5 columns instead of 6. I contacted the developer who said it’s an issue with R. He said my training dataset may be so different that the algorithm cannot converge during the make_logit step and therefore cpat.py fails to generate a probability of coding potential.
This is unfortunate…
Ah that is unfortunate. Well I hope you can find an alternative. Good luck!