Tometools


tomeTools is a collection of programs used to analyse CAGE data as part of the ENCODE and FANTOM projects.

Check out http://fantom.gsc.riken.jp/5/datafiles/latest/extra/TSS_classifier/TSSpredictionREADME.pdf

Installing

Download tomeTools from http://sourceforge.net/projects/tometools/ and compile as follows

wget http://downloads.sourceforge.net/project/tometools/tometools-1.02.tar.gz
tar -xzf tometools-1.02.tar.gz
cd tometools-1.02/
./configure --prefix=`pwd`

# on Ubuntu, install these dependencies first
sudo apt-get install libsqlite3-dev
sudo apt-get install libgsl0-dev
# and add these two lines to the Makefile
#FLAGS    = -L /lib64
#LIBS     = -lusb-1.0 -l pthread

make
make check
make install
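
Because the prefix is the source directory, the binaries should end up under bin/ inside tometools-1.02/ (the standard autotools layout is assumed here); add that directory to your PATH so the commands below can be found:

export PATH=$PATH:`pwd`/bin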

Creating a tome database

tomedb create hg19 hg19.fa
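
Here hg19.fa is the genome FASTA. If you don't already have it, UCSC's bigZips directory is one place to get it (a sketch; the exact file name in that directory may differ, so check first):

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz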

Running the TSS classifier

See http://sourceforge.net/p/tometools/wiki/TssClassifier/

#usage: tssclass [options] <tomedb> <knownPeaks.bed> <novelPeaks.bed>
#knownPeaks.bed contains peaks we believe are real TSSs
#novelPeaks.bed contains unannotated peaks
#a training set comprising both positive and negative sequences is extracted from the data
tssclass hg19 refgene_tss.bed.gz random.bed

Note that all novel TSS clusters are counted as false positives.

Plotting a ROC curve

library(pROC)
# num_4_roc.txt is produced by tssclass (see the next section for renaming the output files);
# mat[,4] is passed as the response and mat[,1] as the predictor
mat <- read.table('num_4_roc.txt')
plot.roc(mat[,4], mat[,1], print.thres='best', auc.polygon=T, print.auc=T, grid=T)

Obtaining the prediction scores

# tssclass prefixes its output files with "(null)"; rename them
mv \(null\)_num_4_roc.txt num_4_roc.txt
mv \(null\)_predictions.bed predictions.bed
# the prediction score is the second value in the name column (column 4)
cat predictions.bed | grep -v null | cut -f4 | cut -f2 -d',' | stats
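
Note that stats is not a standard Unix tool; presumably it is a small helper script that summarises a stream of numbers. If you don't have one, a quick awk substitute (a sketch that reports only the count, mean, min and max) is:

cat predictions.bed | grep -v null | cut -f4 | cut -f2 -d',' |
awk 'NR==1 {min=max=$1} {s+=$1; if ($1<min) min=$1; if ($1>max) max=$1}
     END {print "n="NR, "mean="s/NR, "min="min, "max="max}'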

Creating a BED file from FANTOM5 CAGE peaks

wget http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_ann.txt.gz

zcat hg19.cage_peak_ann.txt.gz |
awk 'BEGIN{OFS="\t"}
{if(NR > 8){x = split($4,a,"bp");
y = split($1,b,":|,");z = split(b[2],c,".");
if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){
print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_pos.bed

zcat hg19.cage_peak_ann.txt.gz |
awk 'BEGIN{OFS="\t"}
{if(NR > 8){x = split($4,a,"bp");
y = split($1,b,":|,");z = split(b[2],c,".");
if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) > 100 ){
print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_neg.bed

zcat hg19.cage_peak_ann.txt.gz |
awk 'BEGIN{OFS="\t"}
{if(NR > 8){x = split($4,a,"bp");
y = split($1,b,":|,");z = split(b[2],c,".");
if(a[1] == "NA" ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' >> permissive_100bp_neg.bed

OK what on earth is that monstrosity above? Let's dissect it, for those of us who can't read an entire awk script on one line. Firstly, the file looks like this:

#check out the file
zcat hg19.cage_peak_ann.txt.gz | head
##ColumnVariables[00Annotation]=CAGE peak id
##ColumnVariables[short_description]=short form of the description below. Common descriptions in the long descriptions has been omited
##ColumnVariables[description]=description of the CAGE peak
##ColumnVariables[association_with_transcript]=transcript which 5end is the nearest to the the CAGE peak
##ColumnVariables[entrezgene_id]=entrezgene (genes) id associated with the transcript
##ColumnVariables[hgnc_id]=hgnc (gene symbol) id associated with the transcript
##ColumnVariables[uniprot_id]=uniprot (protein) id associated with the transcript
00Annotation    short_description       description     association_with_transcript     entrezgene_id   hgnc_id uniprot_id
chr10:100000569..100000577,+    p@chr10:100000569..100000577,+  CAGE_peak_at_chr10:100000569..100000577,+       NA      NA      NA      NA
chr10:100007474..100007500,-    p@chr10:100007474..100007500,-  CAGE_peak_at_chr10:100007474..100007500,-       NA      NA      NA      NA

The first two lines are straightforward; we are simply reading the file and setting the output field separator to a tab:

zcat hg19.cage_peak_ann.txt.gz |
awk 'BEGIN{OFS="\t"}

Then it's probably easier to read like this:

{if(NR > 8){
   x = split($4,a,"bp");
   y = split($1,b,":|,");
   z = split(b[2],c,".");
   if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){
      print b[1],c[1],c[3],$2,a[1],b[3]
   }
}}' > permissive_100bp_pos.bed

Firstly, it uses the awk built-in variable NR to skip the first 8 lines. I would instead have skipped lines starting with a hash and the header line containing "00Annotation". The next bit of code uses the split() function to split a string into an array. Here's a nice example of how split() works:

now=`date +%T`
#print the hour; $0 represents the whole input record
echo $now | awk '{split($0,a,":"); print a[1]}'
23

The variables x, y, and z store the number of items in each array but, since they are not used anywhere, we can remove them. The fourth column is split using "bp" as the delimiter and stored in a. The first column is split on either a ":" or a "," and stored in b. Then the start and end of the range are split and stored in c. So now we have:

zcat hg19.cage_peak_ann.txt.gz | grep -v "^#" | grep -v "00Annotation" |
awk 'BEGIN{OFS="\t"}{
   split($4,a,"bp");
   split($1,b,":|,");
   split(b[2],c,".");
   if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){
      print b[1],c[1],c[3],$2,a[1],b[3]
   }
}' > permissive_100bp_pos.bed

Now comes the conditional, and to understand it properly we need to look at the file again:

zcat hg19.cage_peak_ann.txt.gz | tail -6 | head -1
chrY:9928241..9928245,+ p1@ENST00000416843      CAGE_peak_1_at_ENST00000416843_5end     -165bp_to_ENST00000416843_5end  NA      NA      NA

a[1] is -165 for the line above. Then comes the ternary operator, which takes the absolute value: a negative number becomes positive and a positive number is left unchanged. Here's an example:

echo 10 | awk '{a = $0 < 0 ? -$0 : $0; print a}'
10
echo -10 | awk '{a = $0 < 0 ? -$0 : $0; print a}'
10

So the awk line will print out the CAGE peaks that are within 100 bp of a transcript 5' end. The negative set comprises the CAGE peaks that are more than 100 bp from a transcript 5' end, plus the CAGE peaks that are not associated with any transcript.
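
As a quick sanity check, count how many peaks ended up in each set (the exact numbers will depend on the FANTOM5 release downloaded):

wc -l permissive_100bp_pos.bed permissive_100bp_neg.bed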

Now run tssclass

tssclass hg19 permissive_100bp_pos.bed permissive_100bp_neg.bed

Case studies

Let's come up with a silly example to test the software. I'm going to create a 200 kb reference sequence, where the first half of the reference is made up of 50,000 occurrences of a C followed by a G and the second half is made up of 50,000 occurrences of a T followed by an A.

echo '>chr1' > ref.fa
perl -e 'for(1..50000){ print "CG" }' >> ref.fa
perl -e 'for(1..50000){ print "TA" } END {print "\n"}' >> ref.fa
#create tomedb
tomedb create ref ref.fa

The known peaks will be 50 bp regions within the CG half and the novel peaks will be 50 bp regions within the TA half:

perl -le '$start =10000; for (1..1000){ $end = $start + 49; print join("\t", "chr1", $start, $end, $_, 0, "+"); $start += 50}' > knownPeaks.bed
perl -le '$start = 110000; for (1001..2000){ $end = $start + 49; print join("\t", "chr1", $start, $end, $_, 0, "+"); $start += 50}' > novelPeaks.bed
tssclass ref knownPeaks.bed novelPeaks.bed 
mv \(null\)_predictions.bed predictions.bed
mv \(null\)_num_4_roc.txt num_4_roc.txt
head predictions.bed 
chr1    10000   10049   1,0.5352        0.000000        +
chr1    10050   10099   2,0.5358        0.000000        +
chr1    10100   10149   3,0.5354        0.000000        +
chr1    10150   10199   4,0.5354        0.000000        +
chr1    10200   10249   5,0.5336        0.000000        +
chr1    10250   10299   6,0.5318        0.000000        +
chr1    10300   10349   7,0.5316        0.000000        +
chr1    10350   10399   8,0.5356        0.000000        +
chr1    10400   10449   9,0.5338        0.000000        +
chr1    10450   10499   10,0.5382       0.000000        +
tail predictions.bed 
chr1    159500  159549  1991,0.4658     0.000000        +
chr1    159550  159599  1992,0.4676     0.000000        +
chr1    159600  159649  1993,0.4696     0.000000        +
chr1    159650  159699  1994,0.4668     0.000000        +
chr1    159700  159749  1995,0.4696     0.000000        +
chr1    159750  159799  1996,0.4706     0.000000        +
chr1    159800  159849  1997,0.4624     0.000000        +
chr1    159850  159899  1998,0.4634     0.000000        +
chr1    159900  159949  1999,0.4640     0.000000        +
chr1    159950  159999  2000,0.4670     0.000000        +
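
The known (CG-derived) peaks score around 0.53 and the novel (TA-derived) peaks around 0.47. To summarise this per group, here's a sketch that assumes peak IDs 1 to 1000 are the known peaks and 1001 to 2000 the novel ones, as created above:

awk -F'\t' '{ split($4, a, ","); if (a[1] <= 1000) { cg += a[2]; m++ } else { ta += a[2]; n++ } }
END { printf "CG peaks mean score: %.4f\nTA peaks mean score: %.4f\n", cg/m, ta/n }' predictions.bed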

Obtaining RefSeq TSSs from the UCSC Genome Browser

echo 'SELECT * FROM refGene' |
mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz

# take a window from 300 bp upstream to 100 bp downstream of each TSS
# (txEnd for minus-strand genes, txStart otherwise), then drop
# non-coding (NR_) entries and entries on alternative/random scaffolds
zcat refgene.tsv.gz |
grep -v "^bin" |
awk 'OFS="\t" {if ($4=="-") print $3,$6-100,$6+300,$2,0,$4; else print $3,$5-300,$5+100,$2,0,$4}' |
grep -v NR_ | awk '$1 !~ "_" {print}' | gzip > refgene_tss.bed.gz

zcat refgene_tss.bed.gz | head
chr11   89223809        89224209        NM_001291929    0       -
chr14   52455927        52456327        NM_016039       0       +
chr9    131084490       131084890       NM_016035       0       +
chr22   35795815        35796215        NM_006739       0       +
chr5    154230113       154230513       NM_032385       0       -
chr11   320814  321214  NM_021034       0       -
chrX    75647745        75648145        NM_020932       0       +
chr11   41481086        41481486        NM_020929       0       -
chrX    100645577       100645977       NM_021029       0       +
chr8    144679745       144680145       NM_032378       0       -

Obtaining the hg19 genome coordinates

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo"  | grep -v _ > hg19.genome

Creating a random bed file

randomBed -g hg19.genome -l 20 -n 1000 > random.bed

Long non-coding RNAs (lncRNAs)

wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz
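
Presumably the next step is to build a TSS BED file for these lncRNAs, analogous to the RefSeq one above. Here's a sketch, assuming the same 300 bp upstream / 100 bp downstream window around each transcript start (GTF coordinates are 1-based, hence the -1 on the plus strand; the output name lncrna_tss.bed.gz is just a suggestion):

zcat gencode.v19.long_noncoding_RNAs.gtf.gz |
awk 'BEGIN{OFS="\t"} !/^#/ && $3=="transcript" {
   # plus strand: TSS is the transcript start ($4); minus strand: TSS is the transcript end ($5)
   if ($7=="+") print $1, $4-1-300, $4-1+100, ".", 0, $7;
   else         print $1, $5-100,   $5+300,   ".", 0, $7
}' | gzip > lncrna_tss.bed.gz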