All credit goes to this post, so be sure to check it out! I'm simply following some of its tips on handling big data in R.
For this post, I will use a tab-delimited file that has 17,868,785 lines (one header line plus 17,868,784 data rows) and 158 columns, which is quite big. Here's the size of this file:
#gzipped size
ls -sh file.gz
575M file.gz
#raw size
ls -sh file
5.7G file
I was interested in the difference between the loading times of the gzipped and the raw file using read.table():
system.time(data <- read.table("file", header=T, sep="\t", stringsAsFactors=F))
user system elapsed
1895.294 68.936 1964.674
#in minutes
1964.674 / 60
[1] 32.74457
#determining memory usage
#http://stackoverflow.com/questions/1395270/determining-memory-usage
sort( sapply(ls(),function(x){object.size(get(x))}))
data
11436043704
#in gigabytes
11436043704/1000000000
[1] 11.43604
#now the zipped file
system.time(data <- read.table(gzfile("file.gz"),header=T, sep="\t", stringsAsFactors=F))
user system elapsed
2312.589 66.685 2379.782
2379.782/60
[1] 39.66303
#same memory usage
sort( sapply(ls(),function(x){object.size(get(x))}))
data
11436043704
Loading the gzipped file took about 7 minutes longer (39.7 versus 32.7 minutes), but the gzipped file is about 10 times smaller on disk (575M versus 5.7G). Slightly related is this short post I wrote a while ago on processing gzipped files when using Perl.
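To make the comparison above easy to reproduce on a small scale, here is a minimal sketch that writes the same toy table as plain text and through a gzip connection, then reads both back with read.table(). The toy data and the temporary file names are made up for illustration; only base R functions are used.

```r
# Build a small tab-separated table (hypothetical toy data, not the file from the post).
toy <- data.frame(
  chr   = rep("chr1", 5),
  start = c(10069L, 10071L, 10071L, 10080L, 10090L),
  end   = c(10176L, 10176L, 10104L, 10200L, 10150L),
  stringsAsFactors = FALSE
)

raw_path <- tempfile(fileext = ".tsv")
gz_path  <- tempfile(fileext = ".tsv.gz")

# Write the same content twice: once as plain text, once through a gzip connection.
write.table(toy, raw_path, sep = "\t", quote = FALSE, row.names = FALSE)
gz_con <- gzfile(gz_path, open = "wt")
write.table(toy, gz_con, sep = "\t", quote = FALSE, row.names = FALSE)
close(gz_con)

# Read both files back the same way as in the post.
from_raw <- read.table(raw_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
from_gz  <- read.table(gzfile(gz_path), header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Both routes should give the same data frame, and therefore the same memory footprint.
same_content <- identical(from_raw, from_gz)
print(same_content)
print(format(object.size(from_raw), units = "Kb"))
```

Compression only changes how the bytes sit on disk, so the in-memory object is the same either way; the extra loading time is the cost of decompressing while parsing.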
Now let's try reading in the data using the fread() function, which is part of the data.table package:
#install if necessary
install.packages("data.table")
library(data.table)
#there's a progress bar during loading!
system.time(data2 <- fread("file", header=T, sep="\t", stringsAsFactors=F))
'header' changed by user from 'auto' to TRUE
user system elapsed
86.277 6.072 92.373
#that took less than 2 minutes!
92.373/60
[1] 1.53955
#slightly different memory usage
sort( sapply(ls(),function(x){object.size(get(x))}))
data2 data
11364581312 11436043704
#but identical dimensions
dim(data2)
[1] 17868784 158
dim(data)
[1] 17868784 158
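I tried to read the gzipped file directly, but it seems that fread(), at least in the version I used, cannot read gzipped files. That's not a big problem for me; I can always gunzip the file first. (Newer data.table releases document direct support for .gz input when the R.utils package is installed, so check ?fread for your version.)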
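For an fread() that cannot open compressed input, one possible workaround is to decompress to a temporary file first and point fread() at that. The sketch below does the decompression in base R; gunzip_to_temp() and its chunk_lines argument are hypothetical names introduced here, and the fread() call is only attempted if data.table is installed.

```r
# Hypothetical helper: stream-decompress a .gz text file to a temporary plain-text file.
gunzip_to_temp <- function(gz_path, chunk_lines = 100000L) {
  out_path <- tempfile(fileext = ".txt")
  in_con   <- gzfile(gz_path, open = "rt")
  out_con  <- file(out_path, open = "wt")
  # Close both connections when the function exits, even on error.
  on.exit({
    close(in_con)
    close(out_con)
  })
  # Copy in chunks so the whole file never has to sit in memory as text.
  repeat {
    lines <- readLines(in_con, n = chunk_lines)
    if (length(lines) == 0L) break
    writeLines(lines, out_con)
  }
  out_path
}

# Toy gzipped input (made-up content) to exercise the helper.
original_lines <- c("chr\tstart\tend", "chr1\t10069\t10176", "chr1\t10071\t10104")
gz_path <- tempfile(fileext = ".tsv.gz")
gz_con <- gzfile(gz_path, open = "wt")
writeLines(original_lines, gz_con)
close(gz_con)

plain_path <- gunzip_to_temp(gz_path, chunk_lines = 2L)
roundtrip_ok <- identical(readLines(plain_path), original_lines)
print(roundtrip_ok)

# Only try fread() if data.table is available on this machine.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(plain_path, sep = "\t", header = TRUE)
  print(dim(dt))
}

# Remove the temporary copy once it is no longer needed.
unlink(plain_path)
```

The temporary copy costs disk space equal to the raw file, so for a file like the one in this post you would want the temporary directory to have several gigabytes free.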
Difference between read.table() and fread()
When you create an object using read.table() the object is a data.frame. However when using fread(), the object is a data.table. What's the difference? See this quick introduction to data.table and also this FAQ on data.table.
Here I just demonstrate some differences between a data.frame (my data object) and a data.table (my data2 object, sorry for the uninformative name!):
#we can get the names of the first six columns in the same manner
head(names(data2))
[1] "chr" "start" "end" "id" "score" "strand"
#subsetting a data.frame
data[1:3,1:3]
chr start end
1 chr1 10069 10176
2 chr1 10071 10176
3 chr1 10071 10104
#subsetting a data.table using the same method as above
data2[1:3,1:3]
[1] 1 2 3 #what's this!?
#it turns out that the 2nd argument in the square brackets (j) is evaluated as an expression
#so 1:3 was evaluated and its value, 1 2 3, was returned
#to mimic the data.frame behaviour use "with=F"
data2[1:3,1:3,with=F]
chr start end
1: chr1 10069 10176
2: chr1 10071 10176
3: chr1 10071 10104
So why did we need this "with=F" parameter? Let's take a look at the man page for data.table by typing:
?data.table
Skipping the rest of the manual and focusing on the "with" parameter:
with: By default 'with=TRUE' and 'j' is evaluated within the frame of 'x'. The column names can be used as variables. When 'with=FALSE', 'j' works as it does in '[.data.frame'.
As stated, the 2nd argument (j) of a data.table is an expression, so you can do things like:
#calculate the median of the scores for the first 10 rows
data2[1:10,median(score)]
[1] 325.5
#how many entries have a score greater than 100
table(data2[,score>100])
FALSE TRUE
14865499 3003285
#create subset of data2 by storing entries with a score > 100
system.time(data2_score_100 <- subset(data2, score>100))
user system elapsed
9.857 0.297 10.156
dim(data2_score_100)
[1] 3003285 158
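The same ideas can be tried on a table small enough to check by eye. This is a minimal sketch with made-up values that reuses some of the column names from my file; it assumes the data.table package is installed. It also shows filtering directly in the first argument and grouping with by, which I did not use above.

```r
library(data.table)

# Toy data (hypothetical values) reusing column names from the post's file.
dt <- data.table(
  chr    = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start  = c(10069L, 10071L, 10071L, 500L, 650L, 700L),
  end    = c(10176L, 10176L, 10104L, 600L, 720L, 910L),
  score  = c(50, 150, 325, 90, 400, 101),
  strand = c("+", "+", "-", "-", "+", "-")
)

# j is evaluated inside the table, so column names act as variables.
median_first4 <- dt[1:4, median(score)]

# Filtering in i: keep rows with a score greater than 100.
high <- dt[score > 100]

# Grouping with by: row count and mean score per strand.
per_strand <- dt[, .(n = .N, mean_score = mean(score)), by = strand]

# Selecting columns by position; with = FALSE makes j behave as in a data.frame.
first_three <- dt[1:3, 1:3, with = FALSE]

print(median_first4)
print(high)
print(per_strand)
print(first_three)
```

Writing dt[score > 100] is the data.table-style equivalent of the subset() call used above.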
I've only looked at the data.table package very briefly, but from what I've read it seems much better optimised than data frames, and I will definitely be reading about it in much more detail.
Other ideas when handling big data in R
To test code on your data, you can take a random subset of the larger file:
#function for random sampling from https://gist.github.com/statshero/6122484
row.sample <- function(dta, rep) {
dta[sample(1:nrow(dta), rep, replace=FALSE), ]
}
data_random_subset <- row.sample(data2,1000)
dim(data_random_subset)
[1] 1000 158
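If you want the same random subset every time you rerun a test, a seed can be added. The variant below is a sketch, not the function from the gist: row_sample() and its seed argument are names introduced here, and the demo data frame is made up.

```r
# Hypothetical variant of row.sample() with an optional seed for reproducibility.
row_sample <- function(dta, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Never ask for more rows than exist; seq_len() also behaves sensibly for 0 rows.
  n <- min(n, nrow(dta))
  dta[sample(seq_len(nrow(dta)), n, replace = FALSE), ]
}

# Made-up data frame to demonstrate the helper.
demo <- data.frame(id = 1:100, score = (1:100) * 2)

subset_a <- row_sample(demo, 10, seed = 42)
subset_b <- row_sample(demo, 10, seed = 42)

print(dim(subset_a))
print(identical(subset_a, subset_b))
```

Because the row indexing uses the dta[i, ] form, the same helper should also work when dta is a data.table.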
Also you can read in only a portion of your file to get a feel for the dataset (or you could just read the entire file using the fread() function):
data_first_100 <- read.table("file", header=T, sep="\t", stringsAsFactors=F, nrows=100)
dim(data_first_100)
[1] 100 158
Conclusions
Discovering the fread() function was absolute gold. Reading in a file with 17,868,784 rows and 158 columns took less than 2 minutes and I could calculate simple statistics within seconds. I skimmed through the documentation for the data.table package and there were other useful features, so I will surely talk about this package more in the future.
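To put the three measurements side by side, the arithmetic can be done in one go. The elapsed times are the ones reported above; the vector names are labels chosen here for readability.

```r
# Elapsed times in seconds, copied from the system.time() output in this post.
elapsed <- c(
  read_table_raw = 1964.674,
  read_table_gz  = 2379.782,
  fread_raw      = 92.373
)

# Convert to minutes and express each time relative to fread().
minutes <- elapsed / 60
speedup <- elapsed / elapsed[["fread_raw"]]

print(round(minutes, 1))
print(round(speedup, 1))
```

On this machine and this file, fread() was roughly 21 times faster than read.table() on the raw file; the ratio will differ on other hardware and data.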
Other tips
See this question on "best practices for storing and using data frames too large for memory".
Use colClasses when reading in a file using read.table().
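One way to apply the colClasses tip is to read a few rows first, record the classes that read.table() guessed, and pass them back in for the full read. This is a minimal sketch on a made-up temporary file; it assumes the first rows are representative of the whole file (for example, a column that looks like integers in the sample but contains decimals later would cause an error).

```r
# Made-up tab-separated file standing in for a large input.
toy <- data.frame(
  chr   = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start = c(10069L, 10071L, 10071L, 500L, 650L),
  end   = c(10176L, 10176L, 10104L, 600L, 720L),
  score = c(50.5, 150, 325.25, 90, 400),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Step 1: read a small sample and record the class of each column.
head_rows <- read.table(path, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, nrows = 3)
col_classes <- sapply(head_rows, class)
print(col_classes)

# Step 2: read the whole file with the classes fixed up front,
# so read.table() does not have to guess them from all the data.
full <- read.table(path, header = TRUE, sep = "\t", colClasses = col_classes)
print(dim(full))
```

With 158 columns this saves typing the classes by hand, and supplying them lets read.table() skip its type-guessing step.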

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hello,
very good post!
Could you tell us the RAM and CPU of the computer you used for this post?
Thanks! I think I used an Intel(R) Xeon(R) CPU X7560 @ 2.27GHz with a terabyte of RAM.
¿1 TB of RAM? lucky man. Or do you mean 1TB of hard disk?
1 TB of RAM; it’s the cluster at my workplace.
I, too, have relatively recently discovered the wonder of data.table. I do wish it could read gzip'd files, but I use system() and zcat to write an unzipped copy to a temporary file on a ramdisk and then fread() that.
One little glitch I have not figured out is that data.table does not appear to play nicely with the bigmemory package: when I tried to move data into a big.matrix, the values were messed up in some way. It's possible that bigmemory does not know how to deal with integer64 types, and I have not chased that down. My interim solution, believe it or not, is to write out the data.table once I've done the processing I need with it, and then get it back with read.big.matrix, again using a ramdisk.