Using aggregate and apply in R

2016 October 13th: I wrote a post on using dplyr to perform the same aggregating functions as in this post; personally I prefer dplyr.

I recently came across a course on data analysis and visualisation, and I'm gradually working through each lecture. I just finished the second lecture, whose section "Working with dataframes and vectors efficiently" introduced me to the aggregate function, which I can see being extremely useful. In this post, I will write about aggregate as well as apply, lapply and sapply, which were also introduced in the lecture.

Let's get started with the ChickWeight dataset (see ?ChickWeight) available in the R datasets:

#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

#dimension of the data
dim(data)
[1] 578   4

#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4

#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21
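
The calls above list the distinct values rather than count them. To get the counts directly, something like the following should work. This is a minimal sketch on the same ChickWeight dataset; the names n_chicks, n_diets and n_times are only for illustration.

```r
#count distinct values instead of listing them
data <- ChickWeight

#Chick and Diet are factors, so nlevels() counts their levels
n_chicks <- nlevels(data$Chick)
n_diets <- nlevels(data$Diet)

#Time is numeric, so count its unique values
n_times <- length(unique(data$Time))

c(chicks = n_chicks, diets = n_diets, time_points = n_times)
```

The counts agree with the listings above: 50 chicks, 4 diets and 12 time points.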

library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
       geom_line() +
       geom_point()

[Figure: weight over time, one line per chick]

Over time the chickens got heavier.

Now to use aggregate; the usage as defined by ?aggregate:

## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE)

#find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627

#aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889

#use a different function
aggregate(data$weight, list(time=data$Time), sd)
   time         x
1     0  1.132272
2     2  3.688316
3     4  4.495179
4     6  9.012038
5     8 16.239780
6    10 23.987277
7    12 34.119600
8    14 38.300412
9    16 46.904079
10   18 57.394757
11   20 66.511708
12   21 71.510273
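
The mean and the standard deviation can also be computed in a single call by passing a function that returns both. This is a sketch rather than something from the lecture; the anonymous function and the name weight_summary are my own choices for illustration.

```r
data <- ChickWeight

#FUN returns a named vector of length two, so the result column x
#is a matrix with one column per statistic
weight_summary <- aggregate(data$weight,
                            list(time = data$Time),
                            function(w) c(mean = mean(w), sd = sd(w)))

#the summaries live in the columns of the matrix weight_summary$x
head(weight_summary$x)
```

Note that weight_summary has only two columns (time and x); the statistics are the columns of the matrix x, so they are accessed as weight_summary$x[, "mean"] and weight_summary$x[, "sd"].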

#we could also aggregate on time and diet
head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263

tail(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
   time diet        x
43   12    4 151.4000
44   14    4 161.8000
45   16    4 182.0000
46   18    4 202.9000
47   20    4 233.8889
48   21    4 238.5556
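
aggregate also has a formula method, which some people find easier to read than building the by list. The sketch below repeats two of the aggregations above using that interface; the object names are only for illustration.

```r
data <- ChickWeight

#list-based call, as used in this post
by_diet_list <- aggregate(data$weight, list(diet = data$Diet), mean)

#formula-based call: weight summarised by Diet
by_diet_formula <- aggregate(weight ~ Diet, data = data, FUN = mean)

#two grouping variables go on the right-hand side of the formula
by_time_diet <- aggregate(weight ~ Time + Diet, data = data, FUN = mean)
head(by_time_diet)
```

One practical difference is that the formula method keeps the original column names (Diet, weight) instead of naming the summarised column x.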

#to see the weights over time across different diets
ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
             facet_wrap(~Diet) +
             guides(col=guide_legend(ncol=3))

[Figure: weight over time for each chick, faceted by diet]

There is a very informative post on using apply in R; however, I tend to forget which specific apply function to use. Lecture 2 of the course introduced apply, and to reinforce my own understanding I'll reproduce the examples here.

?apply
#apply functions over array margins
#apply(X, MARGIN, FUN, ...)
#make up some dataframe
df <- data.frame(first = c(1:10), second = c(11:20))
df
   first second
1      1     11
2      2     12
3      3     13
4      4     14
5      5     15
6      6     16
7      7     17
8      8     18
9      9     19
10    10     20
#2 is for columns
apply(df, 2, mean)
 first second 
   5.5   15.5
#1 is for rows
apply(df, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
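
apply converts a data frame to a matrix before doing its work, which is fine here because both columns are numeric. For the specific case of means, base R also provides colMeans and rowMeans, and sapply can loop over the columns of a data frame directly. A small sketch on the same made-up data frame, with illustrative object names:

```r
df <- data.frame(first = c(1:10), second = c(11:20))

#column means without apply
col_means <- colMeans(df)

#a data frame is a list of columns, so sapply works column-wise
col_means_sapply <- sapply(df, mean)

#row means without apply
row_means <- rowMeans(df)
```

These give the same values as apply(df, 2, mean) and apply(df, 1, mean) above.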

#write function to sample 10 numbers
#from a Poisson distribution according to lambda
f <- function(l){
   rpois(10, l)
}
f(10)
 [1] 10  3 14 13 13 12  8  8 13 12

#lapply = apply a function over a list
#for reproducibility
set.seed(123)
#save into draws
draws <- lapply(1:5,f)
draws
[[1]]
 [1] 0 2 1 2 3 0 1 2 1 1

[[2]]
 [1] 5 2 3 2 0 4 1 0 1 5

[[3]]
 [1] 5 4 3 8 4 4 3 3 2 1

[[4]]
 [1] 8 7 5 6 1 4 5 2 3 2

[[5]]
 [1] 3 4 4 4 3 3 3 5 4 7

sapply(draws, mean)
[1] 1.3 2.3 3.7 4.3 4.0
#difference with lapply?
#lapply always returns a list. sapply (if it can) simplifies the results
lapply(draws,mean)
[[1]]
[1] 1.3

[[2]]
[1] 2.3

[[3]]
[1] 3.7

[[4]]
[1] 4.3

[[5]]
[1] 4
#get same result as sapply
unlist(lapply(draws,mean))
[1] 1.3 2.3 3.7 4.3 4.0
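
Two related points that were not part of the lecture examples, shown as a hedged sketch: vapply behaves like sapply but makes you state the expected type and length of each result, and sapply returns a matrix when every result has the same length greater than one. The names means and draw_matrix are only for illustration.

```r
f <- function(l){
   rpois(10, l)
}
set.seed(123)
draws <- lapply(1:5, f)

#vapply is like sapply, but the expected result type is stated up front
#and an error is raised if a result does not match it
means <- vapply(draws, mean, numeric(1))

#every call to f returns 10 numbers, so sapply simplifies to a matrix
#with 10 rows (draws) and 5 columns (one per lambda)
draw_matrix <- sapply(1:5, f)
dim(draw_matrix)
```

Because sapply's return type depends on what the function happens to return, vapply can be the safer choice inside functions and scripts.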

I'm only up to the third lecture and have already picked up some cool tricks. Here is the link to the course again if you want to follow it.




This work is licensed under a Creative Commons Attribution 4.0 International License.

14 thoughts on “Using aggregate and apply in R”

  1. I'm a bit stuck on how I would use the apply or aggregate function on a correlation. How do I specify which column is x and which one is y? I tried to specify the x and y in the function (i.e. cor(x = df$A, y = df$B)), but that didn't work.
    I get the following error:
    Error in FUN(X[[1L]], ...) : supply both 'x' and 'y' or a matrix-like 'x'

    I am looking to get a single vector (sapply? or mapply?).

    • You can use the cor() function on a data frame or matrix without needing to use apply or aggregate. For example:

      cor(iris[,1:4])

  2. Hi, thank you for your very useful post.

    I keep receiving an error:

    Error in aggregate.data.frame(as.data.frame(x), ...) :
    no rows to aggregate

    I wonder if you could help with a solution.

    Many thanks,
    Jan

    • Hi Jan,

      It's a bit hard to troubleshoot since I'm not sure what you did and I don't have access to your data frame.

      But here are some things to check: is there anything inside your data frame object? What do you get when you use the nrow() function on your data frame?

      Cheers,

      Dave

  3. Super clear and helpful. Exactly what I needed. I had some n/a values, so adding na.rm=TRUE after mean was all that was needed. Thanks a million!
    aggdata<-aggregate(avgGDP$Ranking, list(Ranking=avgGDP$Income.Group),mean, na.rm=TRUE)
