The GROUP BY statement allows you to perform operations in a group wise manner. I first learned of the Useful FILe and stream Operations (filo) repository a long long time ago and keep coming back to it over and over again. The filo toolkit comes with three tools: groupBy, stats, and shuffle. The groupBy tool allows you to perform group by operations on the command line, which is incredibly handy for quickly checking some calculations. (I am aware that groupBy has been incorporated into BEDTools but people may only be interested in groupBy.)
Installing filo simply requires git and make (and some common libraries, which most Linux distributions come with).
git clone https://github.com/arq5x/filo.git cd filo make cd bin ./groupBy Program: groupBy (v1.1.0) Authors: Aaron Quinlan (aaronquinlan@gmail.com) Assaf Gordon Summary: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command. Usage: groupBy -i [FILE] -g [group_column(s)] -c [op_column(s)] -o [ops] cat [FILE] | groupBy -g [group_column(s)] -c [op_column(s)] -o [ops] ...
I wanted to quickly calculate transcript sizes on the command line, which made me download filo again for the 168th time.
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.annotation.gtf.gz gunzip -c gencode.v37.annotation.gtf.gz | grep -v "^#" | perl -ane 'next unless $F[2] eq "exon"; $id = $1 if $_ =~ /transcript_id\s\"(.*?)\";/; printf("%s\t%i\n", $id, $F[4]-$F[3]+1);' | groupBy -g 1 -c 2 -o sum | head ENST00000456328.2 1657 ENST00000450305.2 632 ENST00000488147.1 1351 ENST00000619216.1 68 ENST00000473358.1 712 ENST00000469289.1 535 ENST00000607096.1 138 ENST00000417324.1 1187 ENST00000461467.1 590 ENST00000606857.1 840
The one liner decompresses the gzip file but to the standard output stream. Next, lines starting with a hash (#) are skipped by "grep -v" and then piped to Perl.
Perl can be run with various switches and here I use "-ane":
* -a turns on autosplit mode, so that each line is automatically split and stored inside the @F array
* -n causes Perl to iterate over each line
* -e allows you to use Perl on one line
The one-liner outputs lines that are exons and obtains the transcript ID from the "transcript_id" attribute by using a regex. Finally, the transcript_id ID and exon length is printed out.
Finally groupBy is used to group each gene and sum together the exon lengths, calculating the gene length.
I also like to use the stats tool from the filo toolkit, which is another reason I keep coming back to it. We can use it to calculate the average length of transcripts (and other statistics).
gunzip -c gencode.v37.annotation.gtf.gz | grep -v "^#" | perl -ane 'next unless $F[2] eq "exon"; $id = $1 if $_ =~ /transcript_id\s\"(.*?)\";/; printf("%s\t%i\n", $id, $F[4]-$F[3]+1);' | groupBy -g 1 -c 2 -o sum | cut -f2 | stats Total lines: 234485 Sum of lines: 384229184 Ari. Mean: 1638.60879800414 Geo. Mean: 1054.23935954379 Median: 914 Mode: 582 (N=752) Anti-Mode: 8 (N=1) Minimum: 8 Maximum: 347561 Variance: 4215421.29852931 StdDev: 2053.14911746062
Did you expect that the median transcript length to be less than 1kb?

This work is licensed under a Creative Commons
Attribution 4.0 International License.