Map, join, and pivot in R

In this post, I will describe a series of data processing steps in R that I often perform that involves the map_df, inner_join, and pivot_longer functions from the purrr, dplyr, and tidyr packages, respectively. They are all part of the tidyverse, so to follow this post, please install the tidyverse package.

install.packages("tidyverse")
library(tidyverse)

The typical scenario is that I have used two (or more) tools to process a list of samples and the tools generate an output file for each sample. To mimic this, I have written a simple function that will generate 10 output files for "tool x" and another 10 output files for "tool y".

gen_ran_files <- function(nfiles, suffix, seed){
  if(nfiles > 26){
    stop("Please input a number between 1 and 27")
  }
  set.seed(seed)
  for(i in 1:nfiles){
    write.csv(
      x = data.frame(sample = letters[i], purity = rnorm(1)),
      file = paste0(letters[i], "_", suffix, ".csv"),
      row.names = FALSE
    )
  }
}

gen_ran_files(10, "x", 1948)
gen_ran_files(10, "y", 1984)

We now have 20 output CSV files, 10 from each tool.

list.files(pattern = ".csv$")
 [1] "a_x.csv" "a_y.csv" "b_x.csv" "b_y.csv" "c_x.csv" "c_y.csv" "d_x.csv" "d_y.csv" "e_x.csv" "e_y.csv"
[11] "f_x.csv" "f_y.csv" "g_x.csv" "g_y.csv" "h_x.csv" "h_y.csv" "i_x.csv" "i_y.csv" "j_x.csv" "j_y.csv"

We want to compare the outputs of each tool but the results are spread across 20 different files. We can use the map_dfr function to create two data frames containing the results of each tool. In the code below, the map_dfr function is used to "apply" the read_csv function to a list of CSV files. This returns a single data frame that was created by row-binding, i.e. appending the rows returned from read_csv. In addition, the data type for the columns (character and double) are explicitly provided for read_csv.

x_output <- map_dfr(list.files(pattern = "x.csv"), read_csv, col_types = cols('c', 'd'))
y_output <- map_dfr(list.files(pattern = "y.csv"), read_csv, col_types = cols('c', 'd'))

x_output
# A tibble: 10 × 2
   sample  purity
   <chr>    <dbl>
 1 a       0.112 
 2 b       2.16  
 3 c      -0.648 
 4 d       0.149 
 5 e      -1.90  
 6 f      -0.204 
 7 g      -0.356 
 8 h      -0.0173
 9 i      -0.378 
10 j      -0.467 

Since the results are still split into two, we will use the inner_join function to combine them. The function needs to know how to join the results and we specify that we want to join by using the sample column, which is shared between the two data frames. If your tools generate different column names, e.g. sample and sample_id, you can use a named vector as input to the by argument: by = c("sample" = "sample_id"). An inner join returns all rows that intersect between two sets and is usually what we want when combining results. But if there are unique entries in any set, they will not be returned by inner_join. Since we generated the output in the same manner, there won't be any unique entries and because of this, inner_join, left_join, right_join, and full_join will all return the same result. By default, columns with the same name will have an .x and .y suffix added to them. You can change this behaviour by specifying your own suffix vector.

inner_join(
  x = x_output,
  y = y_output,
  by = "sample",
  suffix = c(".tool_x", ".tool_y")
) -> xy

head(xy)
# A tibble: 6 × 3
  sample purity.tool_x purity.tool_y
  <chr>          <dbl>         <dbl>
1 a              0.112         0.409
2 b              2.16         -0.323
3 c             -0.648         0.636
4 d              0.149        -1.85 
5 e             -1.90          0.954
6 f             -0.204         1.19 

We now have all the results from the 20 output files conveniently stored in one data frame. However, we should tidy up the data frame by making sure that:

  1. Every column is a variable.
  2. Every row is an observation.
  3. Every cell is a single value.

Currently, the second and third columns contain two variables: purity and tool. We will use pivot_longer to split columns 2 and 3. Using cols = !sample is a shortcut for saying we want to pivot all columns except the sample column, i.e. columns 2 and 3. The .value is a special name that tells the function that part of the old column name specifies the new column name. The names_sep controls how the column is split and we want to split using the period that separates purity and tool name. Note that we have to escape the period (.) by using a double escape to escape the escape because R does not recognise \..

pivot_longer(
  data = xy,
  cols =  !sample,
  names_to = c(".value", "tool"),
  names_sep = "\\."
) -> xyl

head(xyl)
# A tibble: 6 × 3
  sample tool   purity
  <chr>  <chr>   <dbl>
1 a      tool_x  0.112
2 a      tool_y  0.409
3 b      tool_x  2.16 
4 b      tool_y -0.323
5 c      tool_x -0.648
6 c      tool_y  0.636

To finish off, we can plot the (random) results to compare the tools.

ggplot(xyl, aes(sample, purity, fill = tool)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  theme_bw()

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.