Tometools
tomeTools is a collection of programs used to analyse CAGE data as part of the ENCODE and FANTOM projects.
Check out http://fantom.gsc.riken.jp/5/datafiles/latest/extra/TSS_classifier/TSSpredictionREADME.pdf
Installing
Download tomeTools from http://sourceforge.net/projects/tometools/ and compile as follows
wget http://downloads.sourceforge.net/project/tometools/tometools-1.02.tar.gz tar -xzf tometools-1.02.tar.gz cd tometools-1.02/ ./configure --prefix=`pwd` #for Ubuntu do these steps sudo apt-get install libsqlite3-dev sudo apt-get install libgsl0-dev #add these two lines to the Makefile #FLAGS = -L /lib64 #LIBS = -lusb-1.0 -l pthread make make check make install
Creating a tome database
tomedb create hg19 hg19.fa
Running the TSS classifier
See http://sourceforge.net/p/tometools/wiki/TssClassifier/
#usage tssclass [options] <tomedb> <knownPeaks.bed> <novelPeaks.bed> #where known peaks are peaks we believe are real TSSs #novelPeaks are unannotated peaks #A training set comprised of both positive and negative sequences is extracted from the data. tssclass hg19 refgene_tss.bed.gz random.bed
Note that all novel TSS clusters are counted as false positives.
Plotting a ROC curve
library(pROC) mat <- read.table('num_4_roc.txt') plot.roc(mat[,4], mat[,1], print.thres='best', auc.polygon=T, print.auc=T, grid=T)
Obtaining the prediction scores
mv \(null\)_num_4_roc.txt num_4_roc.txt mv \(null\)_predictions.bed predictions.bed cat predictions.bed | grep -v null | cut -f4 | cut -f2 -d',' | stats
Creating a BED file from FANTOM5 CAGE peaks
wget http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_ann.txt.gz zcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"} {if(NR > 8){x = split($4,a,"bp"); y = split($1,b,":|,");z = split(b[2],c,"."); if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_pos.bed zcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"} {if(NR > 8){x = split($4,a,"bp"); y = split($1,b,":|,");z = split(b[2],c,"."); if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) > 100 ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_neg.bed zcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"} {if(NR > 8){x = split($4,a,"bp"); y = split($1,b,":|,");z = split(b[2],c,"."); if(a[1] == "NA" ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' >> permissive_100bp_neg.bed
OK what on earth is that monstrosity above? Let's dissect it, for those of us who can't read an entire awk script on one line. Firstly, the file looks like this:
#check out the file zcat hg19.cage_peak_ann.txt.gz | head ##ColumnVariables[00Annotation]=CAGE peak id ##ColumnVariables[short_description]=short form of the description below. Common descriptions in the long descriptions has been omited ##ColumnVariables[description]=description of the CAGE peak ##ColumnVariables[association_with_transcript]=transcript which 5end is the nearest to the the CAGE peak ##ColumnVariables[entrezgene_id]=entrezgene (genes) id associated with the transcript ##ColumnVariables[hgnc_id]=hgnc (gene symbol) id associated with the transcript ##ColumnVariables[uniprot_id]=uniprot (protein) id associated with the transcript 00Annotation short_description description association_with_transcript entrezgene_id hgnc_id uniprot_id chr10:100000569..100000577,+ p@chr10:100000569..100000577,+ CAGE_peak_at_chr10:100000569..100000577,+ NA NA NA NA chr10:100007474..100007500,- p@chr10:100007474..100007500,- CAGE_peak_at_chr10:100007474..100007500,- NA NA NA NA
The first two lines are straight forward; we are simply reading the file and setting the delimiter as tabs for the output:
zcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"}
Then it's probably easier to read like this:
{if(NR > 8){ x = split($4,a,"bp"); y = split($1,b,":|,"); z = split(b[2],c,"."); if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){ print b[1],c[1],c[3],$2,a[1],b[3] } }}' > permissive_100bp_pos.bed
Firstly it uses the awk build-in variable NR, to skip the first 8 lines. I would have skipped lines starting with a hash and containing "00Annotation" instead. The next bit of code uses the split() function to split a string into an array. Here's a nice example of how split() works:
now=`date +%T` #print the hour; $0 represents the whole input record echo $now | awk '{split($0,a,":"); print a[1]}' 23
The x, y, and z stores the number of items in the array but since it's not being used anywhere we can remove it. The fourth column is split using "bp" as a delimiter and stored in a. The first column is split on either a ":" or "|" and stored in b. Then the start and end range are split and stored in c. So now we have:
zcat hg19.cage_peak_ann.txt.gz | grep -v "^#" | grep -v "00Annotation" | awk 'BEGIN{OFS="\t"}{ split($4,a,"bp"); split($1,b,":|,"); split(b[2],c,"."); if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){ print b[1],c[1],c[3],$2,a[1],b[3] } }' > permissive_100bp_pos.bed
Now comes the conditional and to understand properly we need to look at the file again:
zcat hg19.cage_peak_ann.txt.gz | tail -6 | head -1 chrY:9928241..9928245,+ p1@ENST00000416843 CAGE_peak_1_at_ENST00000416843_5end -165bp_to_ENST00000416843_5end NA NA NA
a[1] is -165 in for the line above. Then comes the ternary operator, that turns a negative number into a positive number and leaves a positive number, positive. Here's an example:
echo 10 | awk 'a=$0<0?-$0:$0 {print a}' 10 echo -10 | awk 'a=$0<0?-$0:$0 {print a}' 10
So the awk line, will print out CAGE promoters that are within 100 bp of a transcript. The negative set are the CAGE promoters that are not within 100 bp of a transcript AND the CAGE promoters that are not associated with any transcripts.
Now run tssclass
tssclass hg19 permissive_100bp_pos.bed permissive_100bp_neg.bed
Case studies
Let's come up with a silly example to test the software. I'm going to create a 200 kb reference sequence, where the first half of the reference is made up of 50,000 occurrences of a C followed by a G and the second half is made up of 50,000 occurrences of a T followed by a A.
echo '>chr1' > ref.fa perl -e 'for(1..50000){ print "CG" }' >> ref.fa perl -e 'for(1..50000){ print "TA" } END {print "\n"}' >> ref.fa #create tomedb tomedb create ref ref.fa
The known peaks will be 50 bp regions within the CG region and the novel peaks will be 50 bp regions within the TA region
perl -le '$start =10000; for (1..1000){ $end = $start + 49; print join("\t", "chr1", $start, $end, $_, 0, "+"); $start += 50}' > knownPeaks.bed perl -le '$start = 110000; for (1001..2000){ $end = $start + 49; print join("\t", "chr1", $start, $end, $_, 0, "+"); $start += 50}' > novelPeaks.bed tssclass ref knownPeaks.bed novelPeaks.bed mv \(null\)_predictions.bed predictions.bed mv \(null\)_num_4_roc.txt num_4_roc.txt head predictions.bed chr1 10000 10049 1,0.5352 0.000000 + chr1 10050 10099 2,0.5358 0.000000 + chr1 10100 10149 3,0.5354 0.000000 + chr1 10150 10199 4,0.5354 0.000000 + chr1 10200 10249 5,0.5336 0.000000 + chr1 10250 10299 6,0.5318 0.000000 + chr1 10300 10349 7,0.5316 0.000000 + chr1 10350 10399 8,0.5356 0.000000 + chr1 10400 10449 9,0.5338 0.000000 + chr1 10450 10499 10,0.5382 0.000000 + tail predictions.bed chr1 159500 159549 1991,0.4658 0.000000 + chr1 159550 159599 1992,0.4676 0.000000 + chr1 159600 159649 1993,0.4696 0.000000 + chr1 159650 159699 1994,0.4668 0.000000 + chr1 159700 159749 1995,0.4696 0.000000 + chr1 159750 159799 1996,0.4706 0.000000 + chr1 159800 159849 1997,0.4624 0.000000 + chr1 159850 159899 1998,0.4634 0.000000 + chr1 159900 159949 1999,0.4640 0.000000 + chr1 159950 159999 2000,0.4670 0.000000 +
Obtaining RefSeq TSSs from the UCSC Genome Browser
echo 'SELECT * FROM refGene' | mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz zcat refgene.tsv.gz | grep -v "^bin" | awk 'OFS="\t" {if ($4=="-") print $3,$6-100,$6+300,$2,0,$4; else print $3,$5-300,$5+100,$2,0,$4}' | grep -v NR_ | awk '$1 !~ "_" {print}' | gzip > refgene_tss.bed.gz zcat refgene_tss.bed.gz | head chr11 89223809 89224209 NM_001291929 0 - chr14 52455927 52456327 NM_016039 0 + chr9 131084490 131084890 NM_016035 0 + chr22 35795815 35796215 NM_006739 0 + chr5 154230113 154230513 NM_032385 0 - chr11 320814 321214 NM_021034 0 - chrX 75647745 75648145 NM_020932 0 + chr11 41481086 41481486 NM_020929 0 - chrX 100645577 100645977 NM_021029 0 + chr8 144679745 144680145 NM_032378 0 -
Obtaining the hg19 genome coordinates
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" | grep -v _ > hg19.genome
Creating a random bed file
randomBed -g hg19.genome -l 20 -n 1000 > random.bed
Long non-coding RNAs (lncRNAs)
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz