SQL group by statement on the command line

The GROUP BY statement allows you to perform operations in a group wise manner. I first learned of the Useful FILe and stream Operations (filo) repository a long long time ago and keep coming back to it over and over again. The filo toolkit comes with three tools: groupBy, stats, and shuffle. The groupBy tool allows you to perform group by operations on the command line, which is incredibly handy for quickly checking some calculations. (I am aware that groupBy has been incorporated into BEDTools but people may only be interested in groupBy.)

Installing filo simply requires git and make (and some common libraries, which most Linux distributions come with).

git clone https://github.com/arq5x/filo.git
cd filo
cd bin


Program: groupBy (v1.1.0)
Authors: Aaron Quinlan (aaronquinlan@gmail.com)
         Assaf Gordon
Summary: Summarizes a dataset column based upon
	 common column groupings. Akin to the SQL "group by" command.

Usage:	 groupBy -i [FILE] -g [group_column(s)] -c [op_column(s)] -o [ops] 
	 cat [FILE] | groupBy -g [group_column(s)] -c [op_column(s)] -o [ops]

I wanted to quickly calculate transcript sizes on the command line, which made me download filo again for the 168th time.

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.annotation.gtf.gz

gunzip -c gencode.v37.annotation.gtf.gz | grep -v "^#" | perl -ane 'next unless $F[2] eq "exon"; $id = $1 if $_ =~ /transcript_id\s\"(.*?)\";/; printf("%s\t%i\n", $id, $F[4]-$F[3]+1);' | groupBy -g 1 -c 2 -o sum | head
ENST00000456328.2	1657
ENST00000450305.2	632
ENST00000488147.1	1351
ENST00000619216.1	68
ENST00000473358.1	712
ENST00000469289.1	535
ENST00000607096.1	138
ENST00000417324.1	1187
ENST00000461467.1	590
ENST00000606857.1	840

The one liner decompresses the gzip file but to the standard output stream. Next, lines starting with a hash (#) are skipped by "grep -v" and then piped to Perl.

Perl can be run with various switches and here I use "-ane":

* -a turns on autosplit mode, so that each line is automatically split and stored inside the @F array
* -n causes Perl to iterate over each line
* -e allows you to use Perl on one line

The one-liner outputs lines that are exons and obtains the transcript ID from the "transcript_id" attribute by using a regex. Finally, the transcript_id ID and exon length is printed out.

Finally groupBy is used to group each gene and sum together the exon lengths, calculating the gene length.

I also like to use the stats tool from the filo toolkit, which is another reason I keep coming back to it. We can use it to calculate the average length of transcripts (and other statistics).

gunzip -c gencode.v37.annotation.gtf.gz | grep -v "^#" | perl -ane 'next unless $F[2] eq "exon"; $id = $1 if $_ =~ /transcript_id\s\"(.*?)\";/; printf("%s\t%i\n", $id, $F[4]-$F[3]+1);' | groupBy -g 1 -c 2 -o sum | cut -f2 | stats
Total lines:		234485
Sum of lines:		384229184
Ari. Mean:		1638.60879800414
Geo. Mean:		1054.23935954379
Median:			914
Mode:			582 (N=752)
Anti-Mode:		8 (N=1)
Minimum:		8
Maximum:		347561
Variance:		4215421.29852931
StdDev:			2053.14911746062

Did you expect that the median transcript length to be less than 1kb?

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.