I use tidyverse packages a lot and most of the time I prefer them over base R functions, especially when it comes to plotting. However, sometimes I want to write an R script with no dependencies. This is typically referred to as using base R, i.e. using only functions that come with R. In theory, this means that anyone with R installed can run the script. (This is not guaranteed, though, because people use different versions of R: if my script uses functionality introduced in a later version of R, like the base R pipe, users with older versions of R will not be able to run it.)
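One way to make a version requirement explicit is to check it at the top of the script. The following is a minimal sketch rather than code from one of my scripts: require_r_version is a hypothetical helper name, and 4.1.0 is the release that introduced the base R pipe.

```r
# Hypothetical helper (the name is made up for this example): stop early with
# a readable message if the running R is older than the script needs.
require_r_version <- function(min_version) {
  if (getRversion() < min_version) {
    stop(
      "This script needs R >= ", min_version,
      " but is running on R ", as.character(getRversion()),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# The base R pipe |> was introduced in R 4.1.0.
require_r_version("4.1.0")
```

Note that source() parses the whole file before evaluating anything, so a guard like this cannot rescue a script that contains newer syntax such as the pipe; it is more useful for functions that were added in later releases.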
In one of my scripts, I need to convert data from long format to wide. The pivot_longer and pivot_wider functions in the tidyr package can be used to convert data into long and wide format, respectively. You may already be familiar with data in wide format; one example of wide data is a gene expression dataset, where the expression of each gene is measured in different tissues and each measurement gets its own column.
gene_exp <- read.delim(
file = "https://davetang.org/file/TagSeqExample.tab",
header = TRUE
)
head(gene_exp)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
We can convert the wide gene expression data to long format using pivot_longer.
tidyr::pivot_longer(
data = gene_exp,
cols = -gene,
names_to = "sample",
values_to = "count"
) -> gene_exp_long
head(gene_exp_long)
# A tibble: 6 × 3
gene sample count
<chr> <chr> <int>
1 Gene_00001 T1a 0
2 Gene_00001 T1b 0
3 Gene_00001 T2 2
4 Gene_00001 T3 0
5 Gene_00001 N1 0
6 Gene_00001 N2 1
Both wide and long formats have their advantages, but I typically convert my wide data to long format for use with ggplot2.
library(ggplot2)
ggplot(gene_exp_long[1:(6*20), ], aes(gene, count, fill = sample)) +
geom_col(position = position_dodge()) +
coord_flip() +
theme_minimal() +
theme(axis.title.y = element_blank()) +
NULL
Converting long data back to wide data can be done using pivot_wider.
tidyr::pivot_wider(
data = gene_exp_long,
id_cols = gene,
names_from = sample,
values_from = count
)
# A tibble: 18,760 × 7
gene T1a T1b T2 T3 N1 N2
<chr> <int> <int> <int> <int> <int> <int>
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
7 Gene_00007 13 4 21 19 31 4
8 Gene_00008 0 3 0 0 0 0
9 Gene_00009 202 122 256 43 287 357
10 Gene_00010 10 8 56 145 14 15
# 18,750 more rows
Now, how do we do this using base R?
Reshape
If you look online for how to mimic the pivot_longer and pivot_wider functions in base R, you will be introduced to the reshape() function. The documentation for reshape() describes the function as:
This function reshapes a data frame between "wide" format (with repeated measurements in separate columns of the same row) and "long" format (with the repeated measurements in separate rows).
The documentation also shows how reshape() is typically used:
- Typical usage for converting from long to wide format:
# reshape(data, direction = "wide",
# idvar = "___", timevar = "___", # mandatory
# v.names = c(___), # time-varying variables
# varying = list(___)) # auto-generated if missing
- Typical usage for converting from wide to long format:
# reshape(data, direction = "long",
# varying = c(___), # vector
# sep) # to help guess 'v.names' and 'times'
Here we convert the wide gene expression data to long format using reshape().
reshape(
data = gene_exp,
direction = "long",
varying = colnames(gene_exp)[-1],
v.names = "count",
times = colnames(gene_exp)[-1],
timevar = "sample"
) -> out
# order by gene like pivot_longer
out <- out[order(out$gene), ]
# reset row names to 1..n
row.names(out) <- NULL
# remove id column
out$id <- NULL
head(out)
gene sample count
1 Gene_00001 T1a 0
2 Gene_00001 T1b 0
3 Gene_00001 T2 2
4 Gene_00001 T3 0
5 Gene_00001 N1 0
6 Gene_00001 N2 1
table(out$count == gene_exp_long$count)
TRUE
112560
We achieved the same result using reshape() but with a bit more typing. (I simply compared the count values above instead of using identical() or all.equal() because reshape() adds attributes to the object that make it different from the pivot_longer object.)
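To illustrate the point about attributes, here is a small sketch using a made-up two-gene data frame (wide and expected are toy objects, not the TagSeq data). reshape() stores the information it needs to reverse the operation in an attribute called reshapeLong; once that attribute is removed and the row names are reset, all.equal() agrees with a long data frame built by hand. For the real comparison, the pivot_longer result is a tibble, so I assume it would also need to go through as.data.frame() first.

```r
# Toy wide data (made up for this example).
wide <- data.frame(gene = c("g1", "g2"), A = c(1, 2), B = c(3, 4))

long <- reshape(
  data = wide,
  direction = "long",
  varying = c("A", "B"),
  v.names = "count",
  times = c("A", "B"),
  timevar = "sample",
  idvar = "gene"
)

# reshape() attaches its "undo" information as an extra attribute.
had_attr <- "reshapeLong" %in% names(attributes(long))

# Drop the extra attribute and the generated row names before comparing.
attr(long, "reshapeLong") <- NULL
row.names(long) <- NULL

# The long data written out by hand; reshape() stacks column A, then column B.
expected <- data.frame(
  gene = c("g1", "g2", "g1", "g2"),
  sample = c("A", "A", "B", "B"),
  count = c(1, 2, 3, 4)
)

same <- isTRUE(all.equal(long, expected))
```

Here had_attr records that the attribute was present and same records whether the cleaned-up result matches the hand-built data frame.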
Here, both varying and times are set to the column names of the data frame minus the column to keep constant (gene): varying names the columns to stack and times supplies the values that end up in the new sample column. v.names corresponds to values_to and timevar corresponds to names_to in pivot_longer.
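The mapping between the two sets of arguments can be sketched as a small wrapper. This is only an illustration under a few assumptions: to_long is a hypothetical name (it is not part of base R or tidyr), it handles a single identifier column whose values are unique, and it treats every other column as a measurement.

```r
# Hypothetical wrapper (not part of base R or tidyr): calls reshape() but
# takes argument names that follow pivot_longer().
to_long <- function(data, id_col, names_to = "name", values_to = "value") {
  # Every column except the identifier is treated as a measurement.
  measure_cols <- setdiff(colnames(data), id_col)
  long <- reshape(
    data = data,
    direction = "long",
    varying = measure_cols,  # columns to stack
    v.names = values_to,     # name of the stacked value column
    times = measure_cols,    # values that go into the new name column
    timevar = names_to,      # name of the new name column
    idvar = id_col           # existing identifier, so no extra id column is added
  )
  # reshape() stacks one measured column after another; reorder so that all
  # rows for an identifier sit together, in the original row order.
  long <- long[order(match(long[[id_col]], data[[id_col]])), ]
  row.names(long) <- NULL
  long
}

# Toy data (made up for this example).
toy <- data.frame(gene = c("g1", "g2"), A = c(1, 2), B = c(3, 4))
toy_long <- to_long(toy, id_col = "gene", names_to = "sample", values_to = "count")
```

Passing idvar explicitly means reshape() uses the existing gene column as the identifier, so there is no generated id column to delete afterwards.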
The reshape() function can also be used to convert long format back to wide.
reshape(
data = out,
direction = "wide",
idvar = "gene",
timevar = "sample",
v.names = "count"
) -> out2
colnames(out2) <- sub("^count\\.", "", colnames(out2))
head(gene_exp)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
head(out2)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
7 Gene_00002 20 8 12 5 19 26
13 Gene_00003 3 0 2 0 0 0
19 Gene_00004 75 84 241 149 271 257
25 Gene_00005 10 16 4 0 4 10
31 Gene_00006 129 126 451 223 243 149
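It wasn't obvious to me how I could control the names of the columns (reshape() prepends "count." to each column name), so I simply added one more line of code to remove the prefix. Note too that the row names of out2 (1, 7, 13, ...) are carried over from the long data; they can be reset with row.names(out2) <- NULL.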
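The clean-up steps for the wide direction can be bundled in the same way. Again, this is a sketch under stated assumptions rather than a definitive implementation: to_wide is a hypothetical name, it expects a single identifier column, and it strips the prefix with startsWith() and substring() instead of a regular expression so that a value column name containing special characters does not need escaping.

```r
# Hypothetical wrapper (not part of base R or tidyr): calls reshape() but
# takes argument names that follow pivot_wider().
to_wide <- function(data, id_col, names_from, values_from) {
  wide <- reshape(
    data = data,
    direction = "wide",
    idvar = id_col,
    timevar = names_from,
    v.names = values_from
  )
  # reshape() names the new columns "<values_from>.<name>"; drop that prefix.
  prefix <- paste0(values_from, ".")
  strip <- startsWith(colnames(wide), prefix) & colnames(wide) != id_col
  colnames(wide)[strip] <- substring(colnames(wide)[strip], nchar(prefix) + 1)
  # Reset the row names carried over from the long data and drop the
  # "undo" attribute that reshape() attaches.
  row.names(wide) <- NULL
  attr(wide, "reshapeWide") <- NULL
  wide
}

# Toy long data (made up for this example).
toy_long <- data.frame(
  gene = c("g1", "g1", "g2", "g2"),
  sample = c("A", "B", "A", "B"),
  count = c(1, 3, 2, 4)
)
toy_wide <- to_wide(toy_long, id_col = "gene", names_from = "sample", values_from = "count")
```

With the toy data, toy_wide has one row per gene and one column per sample, named A and B rather than count.A and count.B.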
Conclusions
R is a statistical language and the design and implementation of its functions, their arguments, and their documentation reflect this. I'm not a statistician, and a lot of the time when I'm reading the documentation for base R functions, it is not immediately obvious to me how I should use them. Personally, I find tidyverse packages more intuitive and easier to use, which is probably the main reason why I prefer them.
However, as I mentioned in the introduction, there are times when I want an R script to have little to no dependencies. In one of my scripts, I needed to convert data back to wide format and used pivot_wider. But now I can use the base R function reshape() without having to install the tidyr package.
This work is licensed under a Creative Commons Attribution 4.0 International License.