Using aggregate and apply in R

2016 October 13th: I wrote a post on using dplyr to perform the same aggregating functions as in this post; personally I prefer dplyr.

I recently came across a course on data analysis and visualisation and am gradually working through each lecture. I just finished the second lecture, whose section "Working with dataframes and vectors efficiently" introduced me to the aggregate function, which I can see being extremely useful. In this post, I will write about aggregate as well as apply, lapply and sapply, which were also introduced in the lecture.

Let's get started with the ChickWeight dataset (see ?ChickWeight) available in the R datasets:

#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

#dimension of the data
dim(data)
[1] 578   4

#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4

#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21
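
The calls above list the distinct values; to get the counts directly, something like the following should work. This is a small sketch using only base R; the variable names (n_chicks, n_diets, n_times) are my own and not from the lecture.

```r
# Count distinct values instead of listing them
n_chicks <- nlevels(ChickWeight$Chick)        # Chick is an ordered factor
n_diets  <- nlevels(ChickWeight$Diet)         # Diet is a factor
n_times  <- length(unique(ChickWeight$Time))  # Time is numeric, so count unique values

c(chicks = n_chicks, diets = n_diets, times = n_times)
# chicks  diets  times
#     50      4     12

# number of rows (measurements) recorded for each diet
table(ChickWeight$Diet)
```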

library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
       geom_line() +
       geom_point()

Over time the chickens got heavier.

Now to use aggregate; the usage as defined by ?aggregate:

S3 method for class 'data.frame'

aggregate(x, by, FUN, ..., simplify = TRUE)
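
To make the three main arguments concrete, here is a tiny made-up example (the toy data frame and its columns are hypothetical, purely for illustration): x is the values to summarise, by is a list of grouping vectors (the names in the list become column names in the result), and FUN is the summary function.

```r
# hypothetical toy data: three values in two groups
toy <- data.frame(group = c("a", "a", "b"),
                  value = c(1, 2, 10))

# x = values to summarise, by = list of grouping vectors, FUN = summary function
res <- aggregate(toy$value, by = list(group = toy$group), FUN = sum)
res
#   group  x
# 1     a  3
# 2     b 10
```

Note that the summarised column is called x unless you pass a data frame with named columns as the first argument.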

#find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627

#aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889

#use a different function
aggregate(data$weight, list(time=data$Time), sd)
   time         x
1     0  1.132272
2     2  3.688316
3     4  4.495179
4     6  9.012038
5     8 16.239780
6    10 23.987277
7    12 34.119600
8    14 38.300412
9    16 46.904079
10   18 57.394757
11   20 66.511708
12   21 71.510273
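
If you want the mean and the standard deviation together, one option (a sketch, not something covered in the lecture) is to pass a function that returns a named vector. In that case aggregate stores the results as a matrix inside the x column, so I flatten it afterwards with do.call; the names stats and flat are my own.

```r
# FUN returns two named values per group
stats <- aggregate(ChickWeight$weight,
                   list(time = ChickWeight$Time),
                   function(w) c(mean = mean(w), sd = sd(w)))

# the x column is a 12 x 2 matrix with columns "mean" and "sd"
dim(stats$x)

# flatten into ordinary columns: time, x.mean, x.sd
flat <- do.call(data.frame, stats)
head(flat, 3)
```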

#we could also aggregate on time and diet
head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263

tail(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
   time diet        x
43   12    4 151.4000
44   14    4 161.8000
45   16    4 182.0000
46   18    4 202.9000
47   20    4 233.8889
48   21    4 238.5556
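
As an aside, aggregate also has a formula interface, which I find easier to read and which keeps the original column names instead of x. The sketch below should reproduce the results above; note that, as far as I know, the formula method drops rows with missing values by default (na.action = na.omit), which makes no difference here because ChickWeight has no NAs.

```r
# list interface, as used above
by_diet_list <- aggregate(ChickWeight$weight,
                          list(diet = ChickWeight$Diet),
                          mean)

# formula interface: summarise weight for each level of Diet
by_diet_formula <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)

# two grouping variables: one row per Time/Diet combination
by_time_diet <- aggregate(weight ~ Time + Diet, data = ChickWeight, FUN = mean)
head(by_time_diet)
```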

#to see the weights over time across different diets
ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
             facet_wrap(~Diet) +
             guides(col=guide_legend(ncol=3))


There is a very informative post on using apply in R; however, I tend to forget which specific apply function to use. Apply was introduced in lecture 2 of the course, and to reinforce my own understanding I'll reproduce the examples here.

?apply
#apply functions over array margins
#apply(X, MARGIN, FUN, ...)
#make up some dataframe
df <- data.frame(first = c(1:10), second = c(11:20))
df
   first second
1      1     11
2      2     12
3      3     13
4      4     14
5      5     15
6      6     16
7      7     17
8      8     18
9      9     19
10    10     20
#2 is for columns
apply(df, 2, mean)
 first second 
   5.5   15.5
#1 is for rows
apply(df, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
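
Two things worth knowing here, shown in the sketch below (the mixed data frame is a made-up example of mine). First, for plain means base R has the dedicated functions colMeans and rowMeans, which give the same numbers. Second, apply converts a data frame to a matrix before doing anything, so if the columns have different types everything is coerced to character.

```r
df <- data.frame(first = 1:10, second = 11:20)

# dedicated helpers for column and row means
col_means <- colMeans(df)
row_means <- rowMeans(df)

# hypothetical data frame with one numeric and one character column
mixed <- data.frame(n = 1:2, s = c("a", "b"))

# apply works on as.matrix(mixed), so both columns are seen as character
apply(mixed, 2, class)

# sapply works column by column and keeps the original types
sapply(mixed, class)
```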

#write function to sample 10 numbers
#from a Poisson distribution according to lambda
f <- function(l){
   rpois(10, l)
}
f(10)
 [1] 10  3 14 13 13 12  8  8 13 12

#lapply = apply a function over a list
#for reproducibility
set.seed(123)
#save into draws
draws <- lapply(1:5,f)
draws
[[1]]
 [1] 0 2 1 2 3 0 1 2 1 1

[[2]]
 [1] 5 2 3 2 0 4 1 0 1 5

[[3]]
 [1] 5 4 3 8 4 4 3 3 2 1

[[4]]
 [1] 8 7 5 6 1 4 5 2 3 2

[[5]]
 [1] 3 4 4 4 3 3 3 5 4 7

sapply(draws, mean)
[1] 1.3 2.3 3.7 4.3 4.0
#difference with lapply?
#lapply always returns a list. sapply (if it can) simplifies the results
lapply(draws,mean)
[[1]]
[1] 1.3

[[2]]
[1] 2.3

[[3]]
[1] 3.7

[[4]]
[1] 4.3

[[5]]
[1] 4
#get same result as sapply
unlist(lapply(draws,mean))
[1] 1.3 2.3 3.7 4.3 4.0
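
Because sapply decides on the return type for you, the result can change shape depending on what the function returns. The sketch below (not from the lecture) re-enters the draws shown above by hand so it does not depend on the random number generator, and also shows vapply, a stricter base R relative where you declare the expected type and length of each result.

```r
# the five draws printed above, typed in by hand
draws <- list(c(0, 2, 1, 2, 3, 0, 1, 2, 1, 1),
              c(5, 2, 3, 2, 0, 4, 1, 0, 1, 5),
              c(5, 4, 3, 8, 4, 4, 3, 3, 2, 1),
              c(8, 7, 5, 6, 1, 4, 5, 2, 3, 2),
              c(3, 4, 4, 4, 3, 3, 3, 5, 4, 7))

# vapply: each result must be a single numeric value, otherwise it errors
means <- vapply(draws, mean, FUN.VALUE = numeric(1))

# range() returns two values, so sapply simplifies to a 2 x 5 matrix
ranges <- sapply(draws, range)

# on empty input sapply falls back to an empty list, vapply keeps the type
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))
```

I would reach for vapply inside functions and scripts, where a surprise change of return type is harder to spot than at the console.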

I'm only up to the third lecture and have already picked up some cool tricks. The course is worth following if you want to learn more.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Comments
  1. I'm a bit stuck on how I would use the apply or aggregate function on a correlation. How do I specify which column is x and which one is y? I tried to specify x and y in the function (i.e. cor(x = df$A, y = df$B)), but that didn't work.
    I get the following error:
    Error in FUN(X[[1L]], ...) : supply both 'x' and 'y' or a matrix-like 'x'

    I am looking to get a single vector (sapply? or mapply?).

    1. You can use the cor() function on a data frame or matrix without needing to use apply or aggregate. For example:

      cor(iris[,1:4])

  2. Hi, thank you for your very useful post.

    I keep receiving an error:

    Error in aggregate.data.frame(as.data.frame(x), ...) :
    no rows to aggregate

    I wonder if you could help with a solution.

    Many thanks,
    Jan

    1. Hi Jan,

      it's a bit hard to troubleshoot since I'm not sure what you did and I don't have a hold of your data frame.

      But here are some things to check: is there anything inside your data frame object? What do you get when you use the nrow() function on your data frame?

      Cheers,

      Dave

  3. Super clear and helpful. Exactly what I needed. I had some NA values, so adding na.rm=TRUE after mean was all that was needed. Thanks a million!
    aggdata<-aggregate(avgGDP$Ranking, list(Ranking=avgGDP$Income.Group),mean, na.rm=TRUE)
