Updated 2014 September 19th to compare different transpose tools
I wrote a simple transpose tool in Perl that takes in tabular data and outputs a transposed version of it. The primary motivation was that when viewing files with a lot of columns on the command line, it becomes hard to match the columns to the column headers. Most of the time, I just want to see a couple of typical values for each column so that I can figure out how to parse it. Here's the script I wrote:
#!/usr/bin/env perl

use strict;
use warnings;

my $data = [];
my $t_data = [];

while(<>){
   chomp;
   # skip comments
   next if /^#/;
   # skip empty lines
   next if /^$/;
   # split lines on tabs
   my @s = split(/\t/);
   # store each line, which has been split on tabs,
   # in the array reference as an array reference
   push(@{$data}, \@s);
}

# loop through the array reference;
# each array element is one row of the data
for my $row (@{$data}){
   # go through each column of the current row
   for my $col (0 .. $#{$row}){
      # each row of $t_data is an array reference
      # that is being populated with the $data columns
      push(@{$t_data->[$col]}, $row->[$col]);
   }
}

for my $row (@{$t_data}){
   my $line_to_print = '';
   for my $col (@{$row}){
      $line_to_print .= "$col\t";
   }
   $line_to_print =~ s/\t$//;
   print "$line_to_print\n";
}

exit(0);
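A quick note on setup, since the examples below invoke the script as both transpose and transpose.pl (the names and the ~/bin location are my assumptions; adjust to your own setup): save the script, make it executable, and put it somewhere on your PATH:

chmod +x transpose.pl
cp transpose.pl ~/bin/transpose   # assuming ~/bin is on your PATH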
Let's run the code using a simple example file I made up:
cat test.tsv
one   two   three   four
five   six   seven   eight
nine   ten   eleven   twelve

cat test.tsv | transpose
one   five   nine
two   six   ten
three   seven   eleven
four   eight   twelve
Now to explain the code; read this document if you are unfamiliar with Perl references. The first for loop goes through each row of the array reference called $data. Each row of $data contains the columns of data for that row, so $data->[0]->[2] would be "three" (first row, third column).
The inner for loop goes through each column of the current row, and the push() inside it performs the transpose: $t_data ends up with as many rows as $data has columns, and each element inside $t_data is another array reference. Each push() appends one column value of $data to the corresponding row of $t_data. Once the loops have finished, $t_data->[0]->[2] would be "nine" (first row, third column of the transposed data).
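To make the indexing concrete, here is a minimal, self-contained sketch (my own toy example, reusing the values from test.tsv above) that builds both structures and checks the two lookups just described:

#!/usr/bin/env perl
use strict;
use warnings;

# each element of @$data is an array reference holding one row
my $data = [ [qw(one two three four)],
             [qw(five six seven eight)],
             [qw(nine ten eleven twelve)] ];
print "$data->[0]->[2]\n";   # three (first row, third column)

# row $col of $t_data collects column $col from every row of $data
my $t_data = [];
for my $row (@{$data}){
   for my $col (0 .. $#{$row}){
      push(@{$t_data->[$col]}, $row->[$col]);
   }
}
print "$t_data->[0]->[2]\n"; # nine (first row, third column of the transpose)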
How this could be useful
This file has a lot of columns:
zcat file_with_lots_of_columns.tsv.gz | head -3
ID   Interaction   Cluster Location   Cluster Head Location   PET Count (Intra)   PET Count (Inter)   Head Peak Abs. Location   Head Peak Rel. Location   Head Peak Value   Cluster Tail Location   PET Count (Intra)   PET Count (Inter)   Tail Peak Abs. Location   Tail Peak Rel. Location   Tail Peak Value   PET Count between Head/Tail   P Value   Q Value
4185618   Intra-Chromosome   chr3:129203107-129349461   chr3:129203107-129344086   4056   1705   129160065   -113531   32   chr3:129253659-129349461   3815   1602   129160065   -141495   32   1719   0.00000e-1   0.00000e-1
1496645   Intra-Chromosome   chr11:128541730-128721111   chr11:128541730-128692124   4299   2173   129765371   1148444   19   chr11:128557261-128721111   4620   2301   129765371   1126185   19   1666   0.00000e-1   0.00000e-1
It's not easy to match the headers to the values. But now that we have this transpose tool:
zcat file_with_lots_of_columns.tsv.gz | head -3 | transpose
ID   4185618   1496645
Interaction   Intra-Chromosome   Intra-Chromosome
Cluster Location   chr3:129203107-129349461   chr11:128541730-128721111
Cluster Head Location   chr3:129203107-129344086   chr11:128541730-128692124
PET Count (Intra)   4056   4299
PET Count (Inter)   1705   2173
Head Peak Abs. Location   129160065   129765371
Head Peak Rel. Location   -113531   1148444
Head Peak Value   32   19
Cluster Tail Location   chr3:129253659-129349461   chr11:128557261-128721111
PET Count (Intra)   3815   4620
PET Count (Inter)   1602   2301
Tail Peak Abs. Location   129160065   129765371
Tail Peak Rel. Location   -141495   1126185
Tail Peak Value   32   19
PET Count between Head/Tail   1719   1666
P Value   0.00000e-1   0.00000e-1
Q Value   0.00000e-1   0.00000e-1
Now we can clearly see typical values matched to the column headers.
Comparing different transpose tools
From the comments I learned about datamash and was given some Python code for transposing a tab-delimited file. As suggested by Mikhail, let's compare their speeds.
For the Python code, I modified it so that it would read from STDIN:
cat transpose.py
#!/usr/bin/env python

import fileinput

m = []
for line in fileinput.input():
    m.append(line.strip().split("\t"))
for row in zip(*m):
    print "\t".join(row)
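One detail worth noting: fileinput.input() reads the files named on the command line and only falls back to STDIN when there are none, so both of the invocations below should work (assuming transpose.py is executable and on your PATH):

cat test.tsv | transpose.py
transpose.py test.tsv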
Let's set up datamash:
wget http://ftp.gnu.org/gnu/datamash/datamash-1.0.6.tar.gz
tar -xzf datamash-1.0.6.tar.gz
cd datamash-1.0.6
./configure --prefix=$HOME/somewhere
make && make install
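With that --prefix, the binary ends up in $HOME/somewhere/bin (my assumption based on the configure line above), so that directory needs to be on your PATH before the tests below will work:

export PATH=$HOME/somewhere/bin:$PATH
datamash --version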
Test if everything is working with a simple file:
cat test.txt
1   2   3   4
5   6   7   8
9   10   11   12

cat test.txt | transpose.pl
1   5   9
2   6   10
3   7   11
4   8   12

cat test.txt | transpose.py
1   5   9
2   6   10
3   7   11
4   8   12

cat test.txt | datamash transpose
1   5   9
2   6   10
3   7   11
4   8   12
OK, so far so good. Now I'll generate a 1,000 by 1,000 matrix in R:
data <- matrix(data=rnorm(mean=10, n=1000000, sd=1), ncol=1000)
write.table(x=data, file="1000.tsv", sep="\t", row.names=F, col.names=F)
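As a quick sanity check before timing anything, the generated file should have 1,000 rows and 1,000 tab-separated columns:

wc -l 1000.tsv
head -1 1000.tsv | awk -F'\t' '{print NF}'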
Before running the test, I predicted that the order from slowest to fastest would be: my code, Cory's code, then datamash. I'm testing this on my MacBook Air:
time cat 1000.tsv | datamash transpose > 1000_t_datamash.tsv

real	0m7.099s
user	0m7.039s
sys	0m0.058s

time cat 1000.tsv | transpose.py > 1000_t_py.tsv

real	0m0.464s
user	0m0.387s
sys	0m0.075s

time cat 1000.tsv | transpose.pl > 1000_t_pl.tsv

real	0m1.133s
user	0m1.029s
sys	0m0.098s

for file in `ls *.tsv`; do md5 $file; done
MD5 (1000.tsv) = 16f845dfb6b8b5b86ce99628fdbdf779
MD5 (1000_t_datamash.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_pl.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_py.tsv) = 59e6462d94e4216d2e09548b7d33ac81
Well, I'm glad I didn't bet on that, because I got the order entirely wrong.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Hey Dave, nice post and thanks for following me on Twitter. Whenever I see tools like these developed in Python/Perl I think: isn't there a base function that already exists in R that does the same thing?
Maybe your script works better with your workflow, but the transpose function you wrote is essentially t() from R, right?
Hi Scott,
the equivalent in R would be the t() function. I just wrote it as a quick tool to match the column headers, as I illustrated in the post.
I could use the tool as such:
zcat file_with_lots_of_columns.tsv.gz | transpose | less -S
which makes it much easier to read. I work mostly on the command line, so this is useful for quickly viewing a file.
I would use R in the end for the data analysis though.
Cheers,
Dave
Hi Dave,
Today I got to know this tool called datamash: http://www.gnu.org/software/datamash/examples/
It has a transpose function that does the same thing, very handy.
Just wanted to share it with you.
Hi Ming,
thank you for sharing; I saw datamash before but didn’t realise it had a transpose tool. I thought it was for doing simple arithmetic, which I use the filo package for.
Cheers,
Dave
Here's the Python version, implemented by Cory Giles, http://corygil.es/. It has served me well for years. Speed comparison, including datamash, anyone?
import sys

m = []
with open(sys.argv[1]) as file:
    for line in file:
        m.append(line.strip().split("\t"))
for row in zip(*m):
    print "\t".join(row)
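(Note that, unlike the STDIN-reading version shown earlier in the post, this one takes the input file name as its first argument, so a typical invocation, with a hypothetical file name, would be: python transpose.py input.tsv)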
Thanks for sharing. I’ll update the post with the speed comparisons one day. I’m guessing from slowest to fastest, it’ll be: my version, Cory’s version, and then the datamash version.
Disregard what I said about testing it one day; I couldn’t resist and tested it just now. Post updated.
Visited this in October 2021 to find an efficient way to transpose a 200k x 500k matrix. The solution by Mikhail plus the tweak by Dave to read from STDIN works flawlessly (a lot faster than datamash itself). However, I had to make a small modification for Python 3.
#!/usr/bin/env python
import fileinput

m = []
for line in fileinput.input():
    m.append(line.strip().split("\t"))
for row in zip(*m):
    print("\t".join(row))
Many thanks again !
Hi Tuan,
thanks for sharing the Python 3 code!
Cheers,
Dave
Thanks, Tuan and Dave! Good update, just re-read the post, it remains timeless!
Mikhail