A transpose tool

Updated 19th September 2014 to compare different transpose tools

I wrote a simple transpose tool in Perl that takes tabular data and outputs a transposed version of it. The primary motivation was that when viewing files with many columns on the command line, it becomes hard to match the values to the column headers. Most of the time, I just want to see a couple of typical values for each column so that I can figure out how to parse it. Here's the script I wrote (transpose.pl):


#!/usr/bin/env perl
use strict;
use warnings;

my $data   = [];
my $t_data = [];

while(<>){
   chomp;
   # skip comments
   next if /^#/;
   # skip empty lines
   next if /^$/;
   # split the line on tabs
   my @s = split(/\t/);
   # store each tab-split line in $data as an array reference
   push(@{$data}, \@s);
}

# loop through $data; each element is an array
# reference holding one row of the input
for my $row (@{$data}){
   # go through each column of the current row
   for my $col (0 .. $#{$row}){
      # each row of $t_data is an array reference that is
      # being populated with the columns of $data
      push(@{$t_data->[$col]}, $row->[$col]);
   }
}

# print each transposed row, with values separated by tabs
for my $row (@{$t_data}){
   my $line_to_print = '';
   for my $col (@{$row}){
      $line_to_print .= "$col\t";
   }
   # remove the trailing tab
   $line_to_print =~ s/\t$//;
   print "$line_to_print\n";
}

exit(0);
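As an aside, the final loop, which appends a tab after every value and then strips the trailing one, could be written more compactly with Perl's built-in join(); here's a minimal, equivalent sketch that produces identical output:

for my $row (@{$t_data}){
   # join() places tabs between values, so there is no trailing tab to strip
   print join("\t", @{$row}), "\n";
}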


Let's run the code using a simple example file I made up:

cat test.tsv 
one     two     three   four
five    six     seven   eight
nine    ten     eleven  twelve

cat test.tsv | transpose
one     five    nine
two     six     ten
three   seven   eleven
four    eight   twelve

Now to explain the code; read this document if you are unfamiliar with Perl references. After the while loop has read the input, the first for loop goes through each row of the array reference called $data. Each row of $data contains the columns of data for that row, so $data->[0]->[2] would be "three" (first row, third column).
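To make the structure concrete, here's a minimal standalone sketch with the rows of test.tsv hard-coded in place of the while loop:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# $data as it would look after the while loop has read test.tsv
my $data = [
   [qw(one  two  three  four)],
   [qw(five six  seven  eight)],
   [qw(nine ten  eleven twelve)],
];

print $data->[0]->[2], "\n"; # prints "three"
print Dumper($data->[0]);    # dumps the entire first row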

The nested loop inside it goes through each column of the current row, and the push() call there does the transpose: the array reference $t_data ends up with as many rows as $data has columns. Each element inside $t_data is another array reference, and push() keeps building the rows of $t_data out of the columns of $data. Once the loop has finished, $t_data->[0]->[2] would be "nine" (first row, third column of the transposed data).
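In other words, the value at row $i, column $j of the input ends up at row $j, column $i of the output. Here's a minimal sketch of that invariant that could be dropped in after the nested loop, reusing the $data and $t_data variables built by the script:

# sanity check: $t_data->[$j]->[$i] should equal $data->[$i]->[$j]
for my $i (0 .. $#{$data}){
   for my $j (0 .. $#{$data->[$i]}){
      die "transpose failed at ($i, $j)"
         unless $t_data->[$j]->[$i] eq $data->[$i]->[$j];
   }
}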

How this could be useful

This file has a lot of columns:

zcat file_with_lots_of_columns.tsv.gz | head -3
ID      Interaction     Cluster Location        Cluster Head Location   PET Count (Intra)       PET Count (Inter)       Head Peak Abs. Location Head Peak Rel. Location Head Peak Value Cluster Tail Location   PET Count (Intra)       PET Count (Inter)     Tail Peak Abs. Location Tail Peak Rel. Location Tail Peak Value PET Count between Head/Tail     P Value Q Value
4185618 Intra-Chromosome        chr3:129203107-129349461        chr3:129203107-129344086        4056    1705    129160065       -113531 32      chr3:129253659-129349461        3815    1602    129160065       -141495 32      1719    0.00000e-1    0.00000e-1
1496645 Intra-Chromosome        chr11:128541730-128721111       chr11:128541730-128692124       4299    2173    129765371       1148444 19      chr11:128557261-128721111       4620    2301    129765371       1126185 19      1666    0.00000e-1    0.00000e-1

It's not easy to match the headers to the values. But now that we have this transpose tool:

zcat file_with_lots_of_columns.tsv.gz | head -3 | transpose
ID      4185618 1496645
Interaction     Intra-Chromosome        Intra-Chromosome
Cluster Location        chr3:129203107-129349461        chr11:128541730-128721111
Cluster Head Location   chr3:129203107-129344086        chr11:128541730-128692124
PET Count (Intra)       4056    4299
PET Count (Inter)       1705    2173
Head Peak Abs. Location 129160065       129765371
Head Peak Rel. Location -113531 1148444
Head Peak Value 32      19
Cluster Tail Location   chr3:129253659-129349461        chr11:128557261-128721111
PET Count (Intra)       3815    4620
PET Count (Inter)       1602    2301
Tail Peak Abs. Location 129160065       129765371
Tail Peak Rel. Location -141495 1126185
Tail Peak Value 32      19
PET Count between Head/Tail     1719    1666
P Value 0.00000e-1      0.00000e-1
Q Value 0.00000e-1      0.00000e-1

Now we can clearly see typical values matched to the column headers.

Comparing different transpose tools

From the comments I learned about datamash and got some Python code for transposing a tab-delimited file. As suggested by Mikhail, let's compare the speeds.

For the Python code, I modified it so that it would read from STDIN:

cat transpose.py
#!/usr/bin/env python

import fileinput

# read all input lines (from STDIN or file arguments) into a list of lists
m = []
for line in fileinput.input():
    m.append(line.strip().split("\t"))

# zip(*m) regroups the rows as columns, i.e. the transpose
for row in zip(*m):
    print "\t".join(row)

Let's set up datamash:

wget http://ftp.gnu.org/gnu/datamash/datamash-1.0.6.tar.gz
tar -xzf datamash-1.0.6.tar.gz
cd datamash-1.0.6
./configure --prefix=$HOME/somewhere
make && make install

Test if everything is working with a simple file:

cat test.txt 
1	2	3	4
5	6	7	8
9	10	11	12

cat test.txt | transpose.pl
1	5	9	
2	6	10	
3	7	11	
4	8	12	

cat test.txt | transpose.py
1	5	9
2	6	10
3	7	11
4	8	12

cat test.txt | datamash transpose
1	5	9
2	6	10
3	7	11
4	8	12

OK, so far so good. Now I'll generate a 1,000 by 1,000 matrix in R:

data <- matrix(data=rnorm(mean=10, n=1000000, sd=1), ncol=1000)
write.table(x=data, file="1000.tsv", sep="\t", row.names=F, col.names=F)

Before running the tests, I predicted that the order from slowest to fastest would be: my code, Cory's code, then datamash. I'm testing this on my MacBook Air:

time cat 1000.tsv | datamash transpose > 1000_t_datamash.tsv

real	0m7.099s
user	0m7.039s
sys	0m0.058s

time cat 1000.tsv | transpose.py > 1000_t_py.tsv

real	0m0.464s
user	0m0.387s
sys	0m0.075s

time cat 1000.tsv | transpose.pl > 1000_t_pl.tsv

real	0m1.133s
user	0m1.029s
sys	0m0.098s

for file in `ls *.tsv`; do md5 $file; done
MD5 (1000.tsv) = 16f845dfb6b8b5b86ce99628fdbdf779
MD5 (1000_t_datamash.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_pl.tsv) = 59e6462d94e4216d2e09548b7d33ac81
MD5 (1000_t_py.tsv) = 59e6462d94e4216d2e09548b7d33ac81

Well, I'm glad I didn't bet on that, because I got the order entirely wrong: datamash was the slowest and the Python version was the fastest. The identical MD5 checksums at least confirm that all three tools produced the same output.




Comments
  1. Hey Dave, nice post and thanks for following me on twitter. Whenever I see tools like these developed in python/perl I think – isn’t there a base function that already exists in R that does the same thing?

    Maybe your script works better with your workflow, but the transpose function you wrote is essentially t( ) from R right?

    1. Hi Scott,

      the equivalent in R would be the t() function. I just wrote it as a quick tool to match the column headers, as I illustrated in the post.

      I could use the tool as such:

      zcat file_with_lots_of_columns.tsv.gz | transpose | less -S

      which makes it much easier to read. I work mostly on the command-line, so this is useful for quickly viewing the file.

      I would use R in the end for the data analysis though.

      Cheers,

      Dave

        1. Hi Ming,

          thank you for sharing; I saw datamash before but didn’t realise it had a transpose tool. I thought it was for doing simple arithmetic, which I use the filo package for.

          Cheers,

          Dave

  2. Here’s the Python version, implemented by Cory Giles, http://corygil.es/. Served me well for years. Speed comparison, including datamash, anyone?

    import sys
    m = []
    with open(sys.argv[1]) as file:
        for line in file:
            m.append(line.strip().split("\t"))
    for row in zip(*m):
        print "\t".join(row)

    1. Thanks for sharing. I’ll update the post with the speed comparisons one day. I’m guessing from slowest to fastest, it’ll be: my version, Cory’s version, and then the datamash version.

  3. Visited this in Oct. 2021 to find an efficient way to transpose a 200k x 500k matrix. The solution by Mikhail, plus Dave’s tweak to read from STDIN, works flawlessly (a lot faster than datamash itself). However, I had to make a small modification for Python 3.

    #!/usr/bin/env python
    import fileinput
    m = []
    for line in fileinput.input():
        m.append(line.strip().split(" "))
    for row in zip(*m):
        print("\t".join(row), end='\n')

    Many thanks again!
