Using dplyr to aggregate in R

I recently realised that dplyr can aggregate and summarise data in the same way that aggregate() does. I wrote a post on the aggregate() function in R back in 2013; in this post I’ll compare dplyr with aggregate().

I’ll use the same ChickWeight data set as per my previous post.

?ChickWeight

# The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.
# ...

# dplyr provides group_by(), summarise(), filter(), arrange() and the %>% pipe
library(dplyr)

data <- ChickWeight

Finding the mean weight depending on diet:

aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627

# alternatively using a formula
# the weight is dependent on the diet
# diet explains the weight response
aggregate(weight ~ Diet, data = data, mean)
  Diet   weight
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627

# dplyr approach
group_by(data, Diet) %>% summarise(mean = mean(weight))
# A tibble: 4 x 2
    Diet     mean
  <fctr>    <dbl>
1      1 102.6455
2      2 122.6167
3      3 142.9500
4      4 135.2627
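
As a quick sanity check, the base R and dplyr results can be compared programmatically rather than by eye. This is only a minimal sketch; the object names by_diet_base and by_diet_dplyr are my own and do not come from the original analysis.

```r
library(dplyr)

# Base R: formula interface, returns a data frame with columns Diet and weight
by_diet_base <- aggregate(weight ~ Diet, data = ChickWeight, mean)

# dplyr: returns a tibble with columns Diet and mean
by_diet_dplyr <- ChickWeight %>%
  group_by(Diet) %>%
  summarise(mean = mean(weight))

# Both results are ordered by the levels of Diet, so the means line up
all.equal(by_diet_base$weight, by_diet_dplyr$mean)
```

The main practical differences are the return type (data frame versus tibble) and the name of the summary column, which you choose yourself in summarise().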

Aggregating on time.

aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889

group_by(data, Time) %>% summarise(mean = mean(weight))
# A tibble: 12 x 2
    Time      mean
   <dbl>     <dbl>
1      0  41.06000
2      2  49.22000
3      4  59.95918
4      6  74.30612
5      8  91.24490
6     10 107.83673
7     12 129.24490
8     14 143.81250
9     16 168.08511
10    18 190.19149
11    20 209.71739
12    21 218.68889

Aggregating on two variables.

head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263

# alternatively
head(aggregate(weight ~ Time + Diet, data = data, mean))
  Time Diet   weight
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263

group_by(data, Diet, Time) %>% summarise(mean = mean(weight))
Source: local data frame [48 x 3]
Groups: Diet [?]

     Diet  Time      mean
   <fctr> <dbl>     <dbl>
1       1     0  41.40000
2       1     2  47.25000
3       1     4  56.47368
4       1     6  66.78947
5       1     8  79.68421
6       1    10  93.05263
7       1    12 108.52632
8       1    14 123.38889
9       1    16 144.64706
10      1    18 158.94118
# ... with 38 more rows
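
Note the "Groups: Diet" line in the dplyr output above: after summarising over two grouping variables, the result is still grouped by the first one. If you would rather get an ungrouped result, one option is the .groups argument of summarise(). The sketch below assumes dplyr 1.0.0 or later, where that argument is available; the name by_diet_time is my own.

```r
library(dplyr)

by_diet_time <- ChickWeight %>%
  group_by(Diet, Time) %>%
  # .groups = "drop" removes all grouping from the result (dplyr >= 1.0.0)
  summarise(mean = mean(weight), .groups = "drop")

# No grouping variables remain on the result
group_vars(by_diet_time)

# One row per Diet/Time combination
nrow(by_diet_time)
```

With older versions of dplyr, adding ungroup() as a final step in the pipeline achieves the same thing.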

Aggregating and calculating two summaries.

aggregate(weight ~ Diet, data = data, FUN = function(x) c(mean = mean(x), n = length(x)))
  Diet weight.mean weight.n
1    1    102.6455 220.0000
2    2    122.6167 120.0000
3    3    142.9500 120.0000
4    4    135.2627 118.0000

group_by(data, Diet) %>% summarise(mean = mean(weight), n = length(weight))
# A tibble: 4 x 3
    Diet     mean     n
  <fctr>    <dbl> <int>
1      1 102.6455   220
2      2 122.6167   120
3      3 142.9500   120
4      4 135.2627   118
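
One difference between the two results above is worth knowing about: when FUN returns a vector, aggregate() stores the summaries in a single matrix column, even though it prints like two separate columns. The sketch below (the object names agg, flat and by_diet are my own) shows this, one way to flatten the result, and the dplyr helper n(), which counts the rows in each group and can be used instead of length(weight).

```r
library(dplyr)

agg <- aggregate(weight ~ Diet, data = ChickWeight,
                 FUN = function(x) c(mean = mean(x), n = length(x)))

ncol(agg)              # counts Diet plus one matrix column named weight
is.matrix(agg$weight)  # the matrix holds the mean and n values

# Flatten the matrix column into ordinary data frame columns
flat <- do.call(data.frame, agg)
names(flat)

# dplyr equivalent; n() counts the rows in each group
by_diet <- ChickWeight %>%
  group_by(Diet) %>%
  summarise(mean = mean(weight), n = n())
```

The dplyr version also keeps n as an integer, whereas aggregate() coerces it to double because both summaries share one numeric matrix.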

Aggregating on a data subset.

aggregate(weight ~ Diet, data = subset(data, Diet!=1), mean)
  Diet   weight
1    2 122.6167
2    3 142.9500
3    4 135.2627

data %>%
  filter(Diet != 1) %>%
  group_by(Diet) %>%
  summarise(mean = mean(weight))
# A tibble: 3 x 2
    Diet     mean
  <fctr>    <dbl>
1      2 122.6167
2      3 142.9500
3      4 135.2627

Summary

I prefer the dplyr approach because it lets you “pipe” or “chain” functions together with %>%. Once you learn the dplyr functions (a.k.a. verbs), you can easily string together a readable pipeline.

data %>%
  filter(Diet != 1) %>%
  group_by(Diet) %>%
  summarise(mean = mean(weight)) %>%
  arrange(mean)
# A tibble: 3 x 2
    Diet     mean
  <fctr>    <dbl>
1      2 122.6167
2      4 135.2627
3      3 142.9500
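
Because each verb takes a data frame and returns one, changing the pipeline is usually a one-line edit. As a small illustrative sketch (the name by_diet_desc is my own), wrapping the column in desc() sorts the same summary from the highest mean to the lowest:

```r
library(dplyr)

by_diet_desc <- ChickWeight %>%
  filter(Diet != 1) %>%
  group_by(Diet) %>%
  summarise(mean = mean(weight)) %>%
  # desc() reverses the sort order, so the largest mean comes first
  arrange(desc(mean))

by_diet_desc
```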

This work is licensed under a Creative Commons Attribution 4.0 International License.

Comments
  1. What if I want to aggregate a whole dataset?
    In your case you only aggregated weight.
    Suppose I had other variables like height, BMI, etc.
    How would I aggregate them in dplyr?

    1. There may be an easier way if you want to summarise a lot of variables, but you can simply add them to the summarise() expression, separated by commas. In that case it makes sense to name each mean after its variable: after “weight = mean(weight)” you could add “, height = mean(height)”.
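
For many columns, one shorter option is across(), which applies the same function to several columns at once. The sketch below assumes dplyr 1.0.0 or later. ChickWeight has no height or BMI column, so Time stands in here purely as a second numeric variable; the name multi is my own.

```r
library(dplyr)

multi <- ChickWeight %>%
  group_by(Diet) %>%
  # Apply mean() to each listed column; output columns keep the input names
  summarise(across(c(weight, Time), mean))

multi
```

With real data you would list your own columns, for example c(weight, height, BMI), in place of c(weight, Time).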
