Reading a list of files into a single R data frame

I had been using map_dfr from the purrr package to load multiple files into a single data frame, but this function has been superseded. The purrr documentation gives the following explanation:

The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.

I'll generate some random files to illustrate how map_dfr is used. I use several packages from the Tidyverse, so if you want to follow along, you can install them all at once by installing the tidyverse package.

# install if necessary
install.packages("tidyverse")

random_df <- function(num_row = 100, num_col = 100, seed = 1984){
  set.seed(seed)
  matrix(
    data = 
      runif(
        n = num_row * num_col,
        min = 0,
        max = 1
      ),
    nrow = num_row
  ) |> as.data.frame()
}

random_files <- function(nfiles, prefix = 'x', outdir = 'random', leading_zero = 6){
  if(!dir.exists(outdir)){
    dir.create(outdir)
  }
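  # walk() is called for its side effect (writing files), so no dummy assignment is needed;
  # seq_len() also behaves correctly when nfiles is 0
  purrr::walk(seq_len(nfiles), function(x){
    write.csv(
      x = random_df(seed = x),
      file = file.path(outdir, paste0(prefix, stringr::str_pad(x, leading_zero, pad = "0"), ".csv")),
      row.names = FALSE
    )
  })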
}

random_files(10)

list.files("random")
 [1] "x000001.csv" "x000002.csv" "x000003.csv" "x000004.csv" "x000005.csv" "x000006.csv" "x000007.csv"
 [8] "x000008.csv" "x000009.csv" "x000010.csv"
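
A side note on the zero-padded file names: list.files() returns names in alphabetical order, so unpadded numbers would not come back in numeric order. The sketch below uses made-up file names and base R's sprintf() as a stand-in for str_pad to show the difference; it does not touch the files created above.

```r
# Hypothetical file names, used only to compare sort orders.
unpadded <- paste0("x", 1:10, ".csv")
padded   <- sprintf("x%06d.csv", 1:10)  # base R alternative to stringr::str_pad()

sorted_unpadded <- sort(unpadded)

# Alphabetically, "x10.csv" comes before "x2.csv".
which(sorted_unpadded == "x10.csv") < which(sorted_unpadded == "x2.csv")

# With zero-padding, alphabetical order matches numeric order.
identical(sort(padded), padded)
```

Both comparisons should evaluate to TRUE, which is why padding keeps the rows of the combined data frame in file-number order.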

We can easily load all the files into a single data frame using map_dfr.

my_df <- purrr::map_dfr(list.files("random", full.names = TRUE), readr::read_csv, show_col_types = FALSE)
dim(my_df)
[1] 1000  100

Here's how to do the same thing using pmap and bind_rows. (pmap, like map, accepts a .progress argument that displays a basic progress bar, which is nice.) Note that I am using the base R pipe (|>), which requires R 4.1.0 or higher.

purrr::pmap(
  list(list.files("random", full.names = TRUE)),
  readr::read_csv, show_col_types = FALSE, .progress = TRUE
) |>
  dplyr::bind_rows() -> my_df2

all.equal(my_df, my_df2)
[1] TRUE
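
For completeness, the purrr documentation recommends map() followed by list_rbind() as the replacement for map_dfr. The following is a minimal sketch of that pattern, assuming purrr 1.0.0 or higher is installed; the temporary directory, file names, and column names are invented for the example and are unrelated to the random files above.

```r
# Assumes purrr >= 1.0.0; the directory and file contents are made up.
tmp_dir <- file.path(tempdir(), "list_rbind_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# Write three small CSV files, each with 5 rows and 2 columns.
for (i in 1:3) {
  write.csv(
    data.frame(id = i, value = seq_len(5)),
    file = file.path(tmp_dir, sprintf("demo%02d.csv", i)),
    row.names = FALSE
  )
}

demo_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# map() returns a list of data frames; list_rbind() stacks them row-wise.
demo_df <- purrr::map(demo_files, read.csv) |>
  purrr::list_rbind()

dim(demo_df)  # 15 rows, 2 columns
```

Unlike map_dfr, this combination does not need dplyr to be installed.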

One of the reasons map_dfr was superseded is that it requires dplyr::bind_rows, which adds a package dependency. We can use the base R functions do.call() and rbind() instead. In addition, my code above uses read_csv from the readr package, which we can replace with the base R read.csv() function. Note that the _ placeholder used with the pipe below requires R 4.2.0 or higher.

purrr::pmap(
  list(list.files("random", full.names = TRUE)),
  read.csv, .progress = TRUE
) |>
  do.call("rbind", args = _) -> my_df3

all.equal(my_df2, my_df3)
[1] "Attributes: < Names: 1 string mismatch >"                                            
[2] "Attributes: < Length mismatch: comparison on first 2 components >"                   
[3] "Attributes: < Component “class”: Lengths (4, 1) differ (string compare on first 1) >"
[4] "Attributes: < Component “class”: 1 string mismatch >"                                
[5] "Attributes: < Component 2: target is externalptr, current is numeric >"  

The messages above from all.equal say that the object attributes differ: read_csv returns a tibble that carries extra attributes, whereas read.csv returns a plain data frame. We can use the attributes() function to see the differences.

names(attributes(my_df2))
[1] "row.names" "names"     "spec"      "problems"  "class"
names(attributes(my_df3))
[1] "names"     "row.names" "class"

Apart from the object attributes, the values in the two data frames are equal.
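
One way to check the values while ignoring the attributes is the check.attributes argument of all.equal. The sketch below is self-contained and uses two small made-up data frames (the "spec" attribute is just a placeholder string) rather than the objects above.

```r
# Two data frames with identical values but different attributes.
a <- data.frame(x = c(0.1, 0.2), y = c(1, 2))
b <- a
attr(b, "spec") <- "extra metadata"  # hypothetical extra attribute

# The default comparison reports the attribute mismatch.
isTRUE(all.equal(a, b))

# Ignoring attributes compares only the values.
isTRUE(all.equal(a, b, check.attributes = FALSE))
```

The first call should return FALSE and the second TRUE.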

We can go one step further and remove the purrr dependency by using lapply instead. The code below uses only base R functions to load a list of files.

lapply(
  list.files("random", full.names = TRUE),
  read.csv
) |>
  do.call("rbind", args = _) -> my_df4

all.equal(my_df3, my_df4)
[1] TRUE
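
If you are on a version of R older than 4.2.0, where the _ placeholder is not available, the same base R approach can be written as a nested call. This is a minimal sketch on made-up temporary files; the directory, file names, and columns are assumptions for illustration only.

```r
# Create a few small CSV files in a temporary directory (made-up data).
tmp_dir <- file.path(tempdir(), "base_rbind_demo")
dir.create(tmp_dir, showWarnings = FALSE)

for (i in 1:3) {
  write.csv(
    data.frame(id = i, value = seq_len(4)),
    file = file.path(tmp_dir, sprintf("part%02d.csv", i)),
    row.names = FALSE
  )
}

part_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Nested form: lapply() builds the list, do.call() passes it to rbind().
# No pipe and no placeholder are needed.
nested_df <- do.call(rbind, lapply(part_files, read.csv))

dim(nested_df)  # 12 rows, 2 columns
```

Keep in mind that rbind() expects every data frame to have the same column names, whereas dplyr::bind_rows() fills missing columns with NA.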

At this point, you may be wondering whether we needed the Tidyverse packages in the first place. There has already been a lot of discussion on the topic of base R versus Tidyverse, so look it up if you are interested. The point of this post was to illustrate how to read a list of files into a single data frame.

This work is licensed under a Creative Commons Attribution 4.0 International License.