In this post, I will describe a series of data processing steps in R that I often perform that involves the map_df
, inner_join
, and pivot_longer
functions from the purrr
, dplyr
, and tidyr
packages, respectively. They are all part of the tidyverse, so to follow this post, please install the tidyverse
package.
install.packages("tidyverse")
library(tidyverse)
The typical scenario is that I have used two (or more) tools to process a list of samples and the tools generate an output file for each sample. To mimic this, I have written a simple function that will generate 10 output files for "tool x" and another 10 output files for "tool y".
gen_ran_files <- function(nfiles, suffix, seed){
if(nfiles > 26){
stop("Please input a number between 1 and 27")
}
set.seed(seed)
for(i in 1:nfiles){
write.csv(
x = data.frame(sample = letters[i], purity = rnorm(1)),
file = paste0(letters[i], "_", suffix, ".csv"),
row.names = FALSE
)
}
}
gen_ran_files(10, "x", 1948)
gen_ran_files(10, "y", 1984)
We now have 20 output CSV files, 10 from each tool.
list.files(pattern = ".csv$")
[1] "a_x.csv" "a_y.csv" "b_x.csv" "b_y.csv" "c_x.csv" "c_y.csv" "d_x.csv" "d_y.csv" "e_x.csv" "e_y.csv"
[11] "f_x.csv" "f_y.csv" "g_x.csv" "g_y.csv" "h_x.csv" "h_y.csv" "i_x.csv" "i_y.csv" "j_x.csv" "j_y.csv"
We want to compare the outputs of each tool but the results are spread across 20 different files. We can use the map_dfr
function to create two data frames containing the results of each tool. In the code below, the map_dfr
function is used to "apply" the read_csv
function to a list of CSV files. This returns a single data frame that was created by row-binding, i.e. appending the rows returned from read_csv
. In addition, the data type for the columns (character and double) are explicitly provided for read_csv
.
x_output <- map_dfr(list.files(pattern = "x.csv"), read_csv, col_types = cols('c', 'd'))
y_output <- map_dfr(list.files(pattern = "y.csv"), read_csv, col_types = cols('c', 'd'))
x_output
# A tibble: 10 × 2
sample purity
<chr> <dbl>
1 a 0.112
2 b 2.16
3 c -0.648
4 d 0.149
5 e -1.90
6 f -0.204
7 g -0.356
8 h -0.0173
9 i -0.378
10 j -0.467
Since the results are still split into two, we will use the inner_join
function to combine them. The function needs to know how to join the results and we specify that we want to join by using the sample
column, which is shared between the two data frames. If your tools generate different column names, e.g. sample
and sample_id
, you can use a named vector as input to the by
argument: by = c("sample" = "sample_id")
. An inner join returns all rows that intersect between two sets and is usually what we want when combining results. But if there are unique entries in any set, they will not be returned by inner_join
. Since we generated the output in the same manner, there won't be any unique entries and because of this, inner_join
, left_join
, right_join
, and full_join
will all return the same result. By default, columns with the same name will have an .x
and .y
suffix added to them. You can change this behaviour by specifying your own suffix vector.
inner_join(
x = x_output,
y = y_output,
by = "sample",
suffix = c(".tool_x", ".tool_y")
) -> xy
head(xy)
# A tibble: 6 × 3
sample purity.tool_x purity.tool_y
<chr> <dbl> <dbl>
1 a 0.112 0.409
2 b 2.16 -0.323
3 c -0.648 0.636
4 d 0.149 -1.85
5 e -1.90 0.954
6 f -0.204 1.19
We now have all the results from the 20 output files conveniently stored in one data frame. However, we should tidy up the data frame by making sure that:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Currently, the second and third columns contain two variables: purity and tool. We will use pivot_longer
to split columns 2 and 3. Using cols = !sample
is a shortcut for saying we want to pivot all columns except the sample
column, i.e. columns 2 and 3. The .value
is a special name that tells the function that part of the old column name specifies the new column name. The names_sep
controls how the column is split and we want to split using the period that separates purity and tool name. Note that we have to escape the period (.
) by using a double escape to escape the escape because R does not recognise \.
.
pivot_longer(
data = xy,
cols = !sample,
names_to = c(".value", "tool"),
names_sep = "\\."
) -> xyl
head(xyl)
# A tibble: 6 × 3
sample tool purity
<chr> <chr> <dbl>
1 a tool_x 0.112
2 a tool_y 0.409
3 b tool_x 2.16
4 b tool_y -0.323
5 c tool_x -0.648
6 c tool_y 0.636
To finish off, we can plot the (random) results to compare the tools.
ggplot(xyl, aes(sample, purity, fill = tool)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
theme_bw()
This work is licensed under a Creative Commons
Attribution 4.0 International License.