Updated 2014 September 19th to compare different transpose tools
I wrote a simple transpose tool in Perl that takes in tabular data and outputs a transposed version of it. The main motivation was that when viewing files with many columns on the command line, it becomes hard to match the values to their column headers. Most of the time I just want to see a couple of typical values for each column, so that I can figure out how to parse the file. Here’s the script I wrote:
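In outline, the script reads tab-delimited rows from STDIN into an array reference of array references, then rebuilds the table column by column with push(). A minimal sketch of that approach (a reconstruction, so details may differ from the exact original listing):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Turn row $i, column $j of the input into row $j, column $i of the output.
sub transpose {
    my ($data) = @_;
    my $t_data = [];
    for my $row (@$data) {
        for my $i (0 .. $#$row) {
            push @{ $t_data->[$i] }, $row->[$i];
        }
    }
    return $t_data;
}

# Read tab-delimited rows from STDIN and print the transposed table.
my $data = [ map { chomp; [ split /\t/ ] } <STDIN> ];
print join("\t", @$_), "\n" for @{ transpose($data) };
```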
Let’s run the code using a simple example file I made up:
```
cat test.tsv
one two three four
five six seven eight
nine ten eleven twelve
cat test.tsv | transpose
one five nine
two six ten
three seven eleven
four eight twelve
```
Now to explain the code; if you are unfamiliar with Perl references, the perlref documentation is a good place to start. The outer loop goes through each row of the array reference called $data. Each row of $data contains the columns of data for that row, so $data->[0][2] would be “three” (first row, third column).
The inner loop goes through each column of the current row, and the push() call performs the transpose: the array reference $t_data ends up with as many rows as $data has columns. Each element of $t_data is another array reference, and push() keeps building the rows of $t_data from the columns of $data. Once the loops have finished, $t_data->[0][2] would be “nine” (first row, third column of the transposed data).
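To make the indexing concrete, here is a small self-contained sketch using the example table from above (with the values hard-coded rather than read from a file):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# $data is a reference to an array of rows; each row is itself
# a reference to an array of column values.
my $data = [
    [qw(one  two  three  four)],
    [qw(five six  seven  eight)],
    [qw(nine ten  eleven twelve)],
];

print $data->[0][2], "\n";    # first row, third column: "three"

# Build the transpose: column $i of each row is pushed onto row $i of $t_data.
my $t_data = [];
for my $row (@$data) {
    for my $i (0 .. $#$row) {
        push @{ $t_data->[$i] }, $row->[$i];
    }
}

print $t_data->[0][2], "\n";  # first row, third column of the transpose: "nine"
```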
How this could be useful
This file has a lot of columns:
```
zcat file_with_lots_of_columns.tsv.gz | head -3
ID Interaction Cluster Location Cluster Head Location PET Count (Intra) PET Count (Inter) Head Peak Abs. Location Head Peak Rel. Location Head Peak Value Cluster Tail Location PET Count (Intra) PET Count (Inter) Tail Peak Abs. Location Tail Peak Rel. Location Tail Peak Value PET Count between Head/Tail P Value Q Value
4185618 Intra-Chromosome chr3:129203107-129349461 chr3:129203107-129344086 4056 1705 129160065 -113531 32 chr3:129253659-129349461 3815 1602 129160065 -141495 32 1719 0.00000e-1 0.00000e-1
1496645 Intra-Chromosome chr11:128541730-128721111 chr11:128541730-128692124 4299 2173 129765371 1148444 19 chr11:128557261-128721111 4620 2301 129765371 1126185 19 1666 0.00000e-1 0.00000e-1
```
It’s not easy to match the headers to the values. But now that we have this transpose tool:
```
zcat file_with_lots_of_columns.tsv.gz | head -3 | transpose
ID                           4185618                   1496645
Interaction                  Intra-Chromosome          Intra-Chromosome
Cluster Location             chr3:129203107-129349461  chr11:128541730-128721111
Cluster Head Location        chr3:129203107-129344086  chr11:128541730-128692124
PET Count (Intra)            4056                      4299
PET Count (Inter)            1705                      2173
Head Peak Abs. Location      129160065                 129765371
Head Peak Rel. Location      -113531                   1148444
Head Peak Value              32                        19
Cluster Tail Location        chr3:129253659-129349461  chr11:128557261-128721111
PET Count (Intra)            3815                      4620
PET Count (Inter)            1602                      2301
Tail Peak Abs. Location      129160065                 129765371
Tail Peak Rel. Location      -141495                   1126185
Tail Peak Value              32                        19
PET Count between Head/Tail  1719                      1666
P Value                      0.00000e-1                0.00000e-1
Q Value                      0.00000e-1                0.00000e-1
```
Now we can clearly see typical values matched to the column headers.
Comparing different transpose tools
From the comments I learned about datamash and received some Python code for transposing a tab-delimited file. As Mikhail suggested, let’s compare their speeds.
For the Python code, I modified it so that it would read from STDIN:
```
cat transpose.py
#!/usr/bin/env python
import fileinput

m = []
for line in fileinput.input():
    m.append(line.strip().split("\t"))

for row in zip(*m):
    print "\t".join(row)
```
Let’s set up datamash:
```
wget http://ftp.gnu.org/gnu/datamash/datamash-1.0.6.tar.gz
tar -xzf datamash-1.0.6.tar.gz
cd datamash-1.0.6
./configure --prefix=$HOME/somewhere
make && make install
```
Let’s test that everything is working with a simple file:
```
cat test.txt
1 2 3 4
5 6 7 8
9 10 11 12
cat test.txt | transpose.pl
1 5 9
2 6 10
3 7 11
4 8 12
cat test.txt | transpose.py
1 5 9
2 6 10
3 7 11
4 8 12
cat test.txt | datamash transpose
1 5 9
2 6 10
3 7 11
4 8 12
```
OK, so far so good. I’ll generate a 1,000 by 1,000 matrix in R:
```
data <- matrix(data=rnorm(mean=10, n=1000000, sd=1), ncol=1000)
write.table(x=data, file="1000.tsv", sep="\t", row.names=F, col.names=F)
```
Before running the test, I predicted that the order from slowest to fastest would be: my Perl code, Cory’s Python code, then datamash. I’m testing this on my MacBook Air:
```
time cat 1000.tsv | datamash transpose > 1000_t_datamash.tsv

real    0m7.099s
user    0m7.039s
sys     0m0.058s

time cat 1000.tsv | transpose.py > 1000_t_py.tsv

real    0m0.464s
user    0m0.387s
sys     0m0.075s

time cat 1000.tsv | transpose.pl > 1000_t_pl.tsv

real    0m1.133s
user    0m1.029s
sys     0m0.098s

for file in `ls *.tsv`; do md5 $file; done
MD5 (1000.tsv) = 16f845dfb6b8b5b86ce99628fdbdf779
MD5 (1000_t_datamash.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_pl.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_py.tsv) = 59e6462d94e4216d2e09548b7d33ac81
```
Well, I’m glad I didn’t bet on that, because I got the order entirely wrong.
This work is licensed under a Creative Commons Attribution 4.0 International License.