The Coding Potential Assessment Tool (CPAT) was developed to assess the coding potential of RNA sequences without the need of sequence alignment. More information on the tool can be found at their website and corresponding publication.
Briefly CPAT uses four sequence features to determine coding potential:
- ORF length
- ORF coverage
- Fickett TESTCODE statistic
- Hexamer usage bias
ORF length is one of the most fundamental features used to distinguish non-protein-coding RNA from protein-coding RNA because a long putative ORF is unlikely to be observed by random chance in non-coding sequences. ORF coverage was defined as the ratio of ORF length to transcript length. While some non-coding sequences may contain ORFs by chance, they usually have a lower coverage. The Fickett TESTCODE statistic distinguishes non-protein-coding RNA from protein-coding RNA according to the combinational effect of nucleotide composition and codon usage bias. The fourth and last feature is hexamer usage bias and is useful because of the dependence between adjacent amino acids in proteins.
In this post, I test CPAT by inputting a set of known protein coding RNAs (RefSeqs) and a set of putative non-coding RNAs taken from GENCODE. I use the logit model file and the hexamer frequency table files that were provided with CPAT.
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.lncRNA_transcripts.fa.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g gencode.v16.lncRNA_transcripts.fa -o lnc_rna_cpat.out python ~/bin/CPAT/usr/local/bin/cpat.py -d ~/src/CPAT-1.2.1/dat/Human_train.RData -x ~/src/CPAT-1.2.1/dat/Human_Hexamer.tab -g human.rna.fna.gz -o refseq #stats is available here https://github.com/arq5x/filo #coding potential of lncRNA cat lnc_rna_cpat.out | cut -f6 | stats Total lines: 22444 Sum of lines: 1773.01404739716 Ari. Mean: 0.0789972396808571 Geo. Mean: 0.0157839580319542 Median: 0.0157583998480091 Mode: 1 (N=22) Anti-Mode: 2.22144653496105e-13 (N=1) Minimum: 2.22144653496105e-13 Maximum: 1 Variance: 0.0332286952048533 StdDev: 0.182287397273792 #coding potential of mRNA cat refseq | cut -f6 | stats Total lines: 47251 Sum of lines: 36677.7025936008 Ari. Mean: 0.776231245764128 Geo. Mean: 0.405030720938695 Median: 0.999573383225558 Mode: 1 (N=3820) Anti-Mode: 2.26544376749313e-13 (N=1) Minimum: 2.26544376749313e-13 Maximum: 1 Variance: 0.141651908600363 StdDev: 0.376366720899129
Input into R
refseq <- read.table("refseq",header=T) lncrna <- read.table("lnc_rna_cpat.out",header=T) head(refseq) mRNA_size ORF_size Fickett_score Hexamer_score GI|239744030|REF|XR_017086.3| 3159 1521 1.1446 0.34745117 GI|239745142|REF|XR_040656.2| 747 747 0.9367 0.25149694 GI|239746044|REF|XR_078542.1| 1338 681 0.8234 0.05853078 GI|239746948|REF|XR_037268.2| 1057 429 0.7305 0.00207399 GI|239749771|REF|XR_038087.2| 3159 1521 1.1446 0.34918451 GI|239751533|REF|XR_079389.1| 1338 681 0.8234 0.05091209 coding_prob GI|239744030|REF|XR_017086.3| 0.9999992 GI|239745142|REF|XR_040656.2| 0.9894146 GI|239746044|REF|XR_078542.1| 0.8964281 GI|239746948|REF|XR_037268.2| 0.2289991 GI|239749771|REF|XR_038087.2| 0.9999992 GI|239751533|REF|XR_079389.1| 0.8916638 head(lncrna) mRNA_size ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 712 ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 535 ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 2748 ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 491 ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 629 ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 336 ORF_size ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 228 ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 111 ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 240 ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 258 ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 258 ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 114 Fickett_score ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.8192 ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.6819 ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 1.1129 ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.9340 ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 0.6732 ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 1.1303 Hexamer_score ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.089700116 ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.004253738 ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| -0.036719025 ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.066306017 ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| -0.262129058 ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 0.248847902 coding_prob ENST00000473358.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| 0.075234173 ENST00000469289.1|ENSG00000243485.1|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| 0.008503811 ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| 0.082058427 ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| 0.122934737 ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| 0.006955770 ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| 0.155721804 boxplot(refseq$coding_prob, lncrna$coding_prob, main="Coding Probability", names=c("RefSeq","lncRNA"))
The coding probability of RefSeq RNAs are very high compared to lncRNAs, as expected.
install.packages("rgl") library(rgl) dim(refseq) #[1] 47251 5 refseq$colour <- rep(2,47251) dim(lncrna) #[1] 22444 5 lncrna$colour <- rep(3,22444) combined <- rbind(refseq, lncrna) plot3d(combined[,c(2:4)],col=combined$colour) rgl.postscript("cpat.pdf","pdf")
#install if necessary sudo apt-get install imagemagick convert cpat.pdf cpat.png
In red are RefSeq mRNAs and in green are lncRNAs. The several long ORFs in the RefSeq mRNAs spreads out the plot.
Remove the long ORFs
summary(combined$ORF_size) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0 192.0 510.0 994.6 1356.0 108000.0 table(combined$ORF_size>40000) FALSE TRUE 69688 7 new <- combined[combined$ORF_size<40000,] dim(new) [1] 69688 6 dim(combined) [1] 69695 6 plot3d(new[,c(2:4)],col=new$colour) rgl.postscript("new_cpat.pdf","pdf")
Most of the lncRNA have very short or no ORFs.
FANTOM3 clones
One of the main findings of FANTOM3 was the appreciation that many transcribed RNAs were not protein coding or had very low coding potential. As part of the FANTOM project, a collection of mouse full-length cDNAs were made; there are a total of 102,801 full-length cDNAs. I wonder how many of them have a very low coding potential. As per above:
#download the FANTOM3 full length cDNAs wget ftp://fantom.gsc.riken.jp/fantomdb/3.0/fantom3.00.seq.gz gunzip fantom3.00.seq.gz mv fantom3.00.seq fantom3_cdna.fa time python ~/src/CPAT-1.2.1/usr/local/bin/cpat.py -d /home/davetang/src/CPAT-1.2.1/dat/Mouse_train.RData -x /home/davetang/src/CPAT-1.2.1/dat/Mouse_Hexamer.tab -g fantom3_cdna.fa -o fantom3.out 102801 genes finished real 5m46.575s user 5m44.558s sys 0m0.674s head fantom3.out mRNA_size ORF_size Fickett_score Hexamer_score coding_prob RI|0610005A07|R000001A15|1277 1277 657 1.2052 0.470737800343 0.970083894611433 RI|0610005A19|R000001G01|959 959 315 1.1752 0.370696945252 0.635522146410408 RI|0610005A21|R000001G07|546 546 213 1.0958 0.426770923929 0.466524310212968 RI|0610005B05|R000001A09|630 630 444 1.3071 0.440894832259 0.883242064704715 RI|0610005B13|R000001C15|150 150 51 0.7433 0.356266243426 0.0939523506409684 RI|0610005B19|R000001G03|628 628 444 1.3071 0.440894832259 0.883245792471949 RI|0610005C11|R000001C05|1333 1333 1017 1.2668 0.423820060645 0.997675905453406 RI|0610005C13|R000001C17|897 897 240 0.7017 -0.0668194585448 0.073811043700264 RI|0610005C16|R000001E09|505 505 210 1.2186 0.302973942291 0.409297553064121 gzip fantom3.out
Visualise using R:
f3 <- read.table("fantom3.out.gz", header=T) summary(f3$coding_prob) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000001 0.0644300 0.6948000 0.5521000 0.9964000 1.0000000 table(f3$coding_prob<0.2) FALSE TRUE 63093 39708 hist(f3$coding_prob, breaks=100, xlab="Coding probability", main="Coding probability of FANTOM3 mouse full-length cDNAs", col=1)
Please cite this DOI if you wish to use this figure: http://dx.doi.org/10.6084/m9.figshare.1046601.
Conclusion
The Coding Potential Assessment Tool (CPAT) is a very fast tool for assessing the coding potential of many RNA transcripts. Inputting a set of known protein coding RNAs versus a set of non-coding RNAs returned results as expected, which was that the former had a much higher coding potential than the latter class of RNAs. Lastly, using CPAT on FANTOM3 full-length cDNAs seemed to confirm that a relatively large portion of transcribed RNAs are non-coding.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
When running:
python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c CDS.fa -n 5-3prime.fa -o agam -r genome.fa
I get this error:
Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
What do you think I’m doing wrong? I’m on OS X.
Hi Dave,
I tried running:
python ~/Downloads/CPAT-1.2.1/bin/make_logitModel.py -c agamCDS.fa -n agambiae.BASEFEATURES_PEST-AgamP3.6.gff3.5-3prime.fa -o agam -r agambiae.CHROMOSOMES-PEST.AgamP3.fa
and got this error message:
____________________________________
Traceback (most recent call last):
File “/Users/ganglion/Downloads/CPAT-1.2.1/bin/make_logitModel.py”, line 28, in
import pysam
File “/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/pysam/__init__.py”, line 1, in
from csamtools import *
ImportError: dlopen(/Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so, 2): Symbol not found: ___ks_insertsort_heap
Referenced from: /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/CPAT-1.2.1-py2.7-macosx-10.8-intel.egg/csamtools.so
_____________________________________
What do you think could be happening here?
By the way, I am on OS X.
Hi Dave, I was able to run cpat successfully.
I have a new question now, however. When I plot my histogram of coding potential, the values in column 5 do not appear to be a probability. My x-axis is ranging from -3 to 2 and not between 0 and 1.0…what am I really graphing here?
Hi Alex,
You’re right, it should be between 0 and 1. I am not really sure and it may be hard to pinpoint without looking at your data, how you ran the analysis and your output.
I guess the easiest thing to check is whether you’re plotting the right column. Sorry I couldn’t be of more help.
Cheers,
Dave
I only have 5 columns instead of 6. I contacted the developer who said it’s an issue with R. He said my training dataset may be so different that the algorithm cannot converge during the make_logit step and therefore cpat.py fails to generate a probability of coding potential.
This is unfortunate…
Ah that is unfortunate. Well I hope you can find an alternative. Good luck!