Visualising FANTOM3 human CAGE data

First download the relevant tables from here. Download tss_library_expr_summary.tsv.bz2, tss_summary.tsv.bz2 and rna_lib_summary.tsv.bz2 and bunzip2 the files.

According to information derived from rna_lib_summary.tsv there are 13,897 CAGE tags from the prostate gland (rna_lib_id = HBA).

Double check this:

cat tss_library_expr_summary.tsv | awk '$3=="HBA" {print}' | cut -f5 | perl -e 'while(<>){ chomp; $i+=$_; } print "$i\n";'
13897

cat tss_library_expr_summary.tsv | awk '$3=="HBA" {print}' | cut -f2 | sort -u | wc -l
7029

Genomic coordinates of TSSs can be obtained from tss_summary.tsv. Here's a script which generates bedgraph files of all the CAGE tags for each tissue. Bugs may be present in the code; use at your own risk. Note I commented out strictures because I wanted to use scalars as filehandles (which is ambiguous code).

#!/usr/bin/perl

#use strict;
use warnings;

my $mapping = 'tss_summary.tsv';
my $tss = 'tss_library_expr_summary.tsv';
my $library = 'rna_lib_summary.tsv';

my %library = ();
open(IN,'<',$library) || die "Could not open $library: $!\n";
while(){
   chomp;
   my ($lib,$junk,$junk2,$junk3,$desc,@junk) = split(/\t/);
   next unless $lib =~ /^H/;
   $library{$lib} = $desc;
}
close(IN);

my %mapping = ();

open(IN,'<',$mapping) || die "Could not open $mapping: $!\n";
while(){
   chomp;
   my ($tss_id,$junk,$junk2,$chr,$junk3,$strand,$start,$end,@junk) = split(/\t/);
   if ($strand eq 'F'){
      $strand = '+';
   } else {
      $strand = '-';
   }
   $mapping{$tss_id}->{'CHR'} = $chr;
   $mapping{$tss_id}->{'STRAND'} = $strand;
   $mapping{$tss_id}->{'START'} = $start;
   $mapping{$tss_id}->{'END'} = $end;
}
close(IN);

foreach my $lib (keys %library){

   my $library = $lib;
   open($lib,'>',"${lib}.bed") || die "Could not open $lib.bed for writing: $!\n";

print $lib <track type=bedGraph name="$library{$library}" description="FANTOM3 CAGE $library{$library}" visibility=full color=0,0,255 altColor=255,0,0 priority=20
EOF

}

open(IN,'<',$tss) || die "Could not open $tss: $!\n";
while(){
   chomp;
   my ($tss_rna_lib_id,$tss_id,$rna_lib_id,$priming,$tss_tags,$tpm) = split(/\t/);
   next unless $rna_lib_id =~ /^H/;
   if (exists $mapping{$tss_id}){
      my $chr = $mapping{$tss_id}->{'CHR'};
      my $strand = $mapping{$tss_id}->{'STRAND'};
      my $start = $mapping{$tss_id}->{'START'};
      my $end = $mapping{$tss_id}->{'END'};
      if ($strand eq '-'){
         $tss_tags = '-' . $tss_tags;
      }
      print $rna_lib_id join("\t",$chr,$start,$end,$tss_tags),"\n";
   } else {
      warn "Missing information for $tss_id\n";
   }
}
close(IN);

exit(0);

This will generate 42 bedgraph files.

Double check:

cat HBA.bed | cut -f4 | sed 's/-//' | perl -e 'while(<>){ chomp; $i+=$_; } print "$i\n";'
13897

Loaded two of the bedgraphs into the UCSC Genome Browser.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Leave a Reply Cancel reply