Handling big data in R

All credit goes to this post, so be sure to check it out! I'm simply following some of the tips from that post on handling big data in R.

For this post, I will use a file that has 17,868,784 rows and 158 columns, which is quite big. Here's the size of this file:

#gzipped size
ls -sh file.gz
575M file.gz

#raw size
ls -sh file
5.7G file

I was interested in the difference between the loading times of the gzipped and the raw file using read.table():

system.time(data <- read.table("file", header=T, sep="\t", stringsAsFactors=F))
    user   system  elapsed
1895.294   68.936 1964.674
#in minutes
1964.674 / 60
[1] 32.74457

#determining memory usage
#http://stackoverflow.com/questions/1395270/determining-memory-usage
sort( sapply(ls(),function(x){object.size(get(x))}))
data
11436043704
#in gigabytes
11436043704/1000000000
[1] 11.43604
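
As an aside, object.size() can also print its result in human-readable units, which saves the manual division (a small sketch):

#print the size of 'data' in gigabytes
print(object.size(data), units="Gb")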

#now the zipped file
system.time(data <- read.table(gzfile("file.gz"),header=T, sep="\t", stringsAsFactors=F))
    user   system  elapsed
2312.589   66.685 2379.782
2379.782/60
[1] 39.66303
#same memory usage
sort( sapply(ls(),function(x){object.size(get(x))}))
       data
11436043704

It took slightly longer to load the gzipped file (~7 minutes longer), but the gzipped file takes up roughly 10 times less space on disk. Slightly related is this short post I wrote a while ago on processing gzipped files when using Perl.

Now let’s try reading in the data using the fread() function, which is part of the data.table package:

#install if necessary
install.packages("data.table")
library(data.table)
#there's a progress bar during loading!
system.time(data2 <- fread("file", header=T, sep="\t", stringsAsFactors=F))
'header' changed by user from 'auto' to TRUE
   user  system elapsed
 86.277   6.072  92.373
#that took less than 2 minutes!
92.373/60
[1] 1.53955

#slightly different memory usage
sort( sapply(ls(),function(x){object.size(get(x))}))
      data2        data
11364581312 11436043704
#but identical dimensions
dim(data2)
[1] 17868784      158
dim(data)
[1] 17868784      158

I tried to read the gzipped file directly; however, it seems that fread() cannot read gzipped files. That's not a big problem for me, since I can always gzip and gunzip files.
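
One workaround is to decompress to a temporary file first and then fread() that; here's a rough sketch:

#decompress the gzipped file to a temporary file via the shell
tmp <- tempfile()
system(paste("gunzip -c file.gz >", tmp))
#read the temporary file with fread() and then remove it
data2 <- fread(tmp, header=T, sep="\t")
unlink(tmp)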

Difference between read.table() and fread()

When you create an object using read.table(), the object is a data.frame; when using fread(), the object is a data.table. What's the difference? See this quick introduction to data.table and also this FAQ on data.table.

Here I just demonstrate some differences between a data.frame (my data object) and a data.table (my data2 object, sorry for the uninformative name!):

#we can get the names of the first six columns in the same manner
head(names(data2))
[1] "chr"    "start"  "end"    "id"     "score"  "strand"

#subsetting a data.frame
data[1:3,1:3]
   chr start   end
1 chr1 10069 10176
2 chr1 10071 10176
3 chr1 10071 10104

#subsetting a data.table using the same method as above
data2[1:3,1:3]
[1] 1 2 3 #what's this!?

#turns out that the 2nd argument in the square brackets expects an expression
#so the expression 1:3 was simply evaluated and returned as the vector 1 2 3
#to mimic the data.frame behaviour, use "with=F"
data2[1:3,1:3,with=F]
    chr start   end
1: chr1 10069 10176
2: chr1 10071 10176
3: chr1 10071 10104

So why did we need this “with=F” parameter? Let’s take a look at the man page for data.table by typing:

?data.table

Skipping the rest of the manual and focusing on the “with” parameter:

with: By default ‘with=TRUE’ and ‘j’ is evaluated within the frame of ‘x’. The column names can be used as variables. When ‘with=FALSE’, ‘j’ works as it does in ‘[.data.frame’.
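
In other words, with the default with=TRUE the column names can be used directly as variables inside the square brackets (a quick sketch using the columns shown earlier):

#select a single column by name; a vector is returned
data2[1:3, chr]
#wrap several column names in list() to get a data.table back
data2[1:3, list(chr, start, end)]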

As the documentation states, the second argument (j) of a data.table is an expression, so you can do things like:

#calculate the median of the scores for the first 10 rows
data2[1:10,median(score)]
[1] 325.5
#how many entries have a score greater than 100
table(data2[,score>100])
   FALSE     TRUE
14865499  3003285
#create subset of data2 by storing entries with a score > 100
system.time(data2_score_100 <- subset(data2, score>100))
   user  system elapsed
  9.857   0.297  10.156
dim(data2_score_100)
[1] 3003285     158
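
Since the first argument (i) also accepts an expression, the same subset can be written in data.table's native style (a sketch that should give the same result):

#subset rows where score > 100 using an expression in i
data2_score_100 <- data2[score > 100]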

I've only looked at the data.table package very briefly, but from what I've read it seems much more optimised than working with data frames, and I will definitely be exploring it in much more detail.

Other ideas when handling big data in R

To test code on your data, you can take a random subset of the larger file:

#function for random sampling from https://gist.github.com/statshero/6122484
#returns 'rep' randomly chosen rows of 'dta', sampled without replacement
row.sample <- function(dta, rep) {
  dta[sample(1:nrow(dta), rep, replace=FALSE), ]
}
data_random_subset <- row.sample(data2,1000)
dim(data_random_subset)
[1] 1000  158
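
If you want the same random subset every time (handy while testing code), set a seed before sampling (a small sketch; the seed value is arbitrary):

#fix the random number generator seed so the sample is reproducible
set.seed(31)
data_random_subset <- row.sample(data2, 1000)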

You can also read in only a portion of your file to get a feel for the dataset (or just read the entire file using the fread() function):

data_first_100 <- read.table("file", header=T, sep="\t", stringsAsFactors=F, nrows=100)
dim(data_first_100)
[1] 100 158
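
fread() also has an nrows argument, so the same preview can be done with the faster reader (a sketch):

#read only the first 100 rows with fread()
data2_first_100 <- fread("file", header=T, sep="\t", nrows=100)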

Conclusions

Discovering the fread() function was absolute gold. Reading in a file with 17,868,784 rows and 158 columns took less than two minutes, and I could calculate simple statistics within seconds. I skimmed through the documentation for the data.table package and there were other useful features, so I will surely talk about this package more in the future.

Other tips

See this question on “best practices for storing and using data frames too large for memory”.

Use colClasses when reading in a file using read.table().
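
For example, you can infer the column classes from a small sample of the file and then pass them to the full read, which lets read.table() skip its type guessing (a sketch reusing the same file as above):

#read a small sample and record the class of each column
sample_rows <- read.table("file", header=T, sep="\t", stringsAsFactors=F, nrows=100)
classes <- sapply(sample_rows, class)
#reuse those classes when reading the full file
data <- read.table("file", header=T, sep="\t", stringsAsFactors=F, colClasses=classes)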

This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. I, too, have relatively recently discovered the wonder of data.table. I do wish it could read gzip'd files, but I use system() and zcat to produce an unzipped copy in a temporary file on a ramdisk and then fread that.

    One little glitch I have not figured out is that data.table does not appear to play nicely with the bigmemory package: when I tried to move data into a big.matrix, the values got messed up in some way. It's possible bigmemory does not know how to deal with integer64 types and I have not chased that down. My interim solution, believe it or not, is to write out the data.table once I've done the processing I need, and then read it back with read.big.matrix, again using a ramdisk.
