Using aggregate and apply in R

Update (13 October 2016): I wrote a post on using dplyr to perform the same aggregations as in this post; personally, I prefer dplyr.

I recently came across a course on data analysis and visualisation, and I'm gradually working through each lecture. I just finished the second lecture, where the section "Working with dataframes and vectors efficiently" introduced me to the aggregate function, which I can see being extremely useful. In this post, I will write about aggregate, apply, lapply and sapply, all of which were introduced in the lecture.

Let's get started with the ChickWeight dataset (see ?ChickWeight) available in the R datasets:

#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

#dimension of the data
dim(data)
[1] 578   4

#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4

#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21
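
The calls above list the unique values; to get the counts directly, something like the following sketch should work. The variable names (n_chicks, n_diets, n_times) are my own and not from the lecture.

```r
# Count distinct values instead of listing them
n_chicks <- length(unique(ChickWeight$Chick))  # number of distinct chicks
n_diets  <- nlevels(ChickWeight$Diet)          # Diet is a factor, so count its levels
n_times  <- length(unique(ChickWeight$Time))   # number of distinct time points
c(chicks = n_chicks, diets = n_diets, times = n_times)

# Number of observations recorded under each diet
table(ChickWeight$Diet)
```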

library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
       geom_line() +
       geom_point()

Over time the chickens got heavier.

Now to use aggregate; the usage as defined by ?aggregate:

S3 method for class 'data.frame'

aggregate(x, by, FUN, ..., simplify = TRUE)

#find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
  diet        x
1    1 102.6455
2    2 122.6167
3    3 142.9500
4    4 135.2627
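
aggregate also has a formula method (see ?aggregate), which keeps the original column names instead of calling the result x. A minimal sketch of the same mean-by-diet calculation, where by_diet is just a name I picked:

```r
# Formula interface: weight on the left, grouping variable on the right
by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
by_diet

# The columns are named after the variables in the formula
names(by_diet)
```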

#aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
   time         x
1     0  41.06000
2     2  49.22000
3     4  59.95918
4     6  74.30612
5     8  91.24490
6    10 107.83673
7    12 129.24490
8    14 143.81250
9    16 168.08511
10   18 190.19149
11   20 209.71739
12   21 218.68889

#use a different function
aggregate(data$weight, list(time=data$Time), sd)
   time         x
1     0  1.132272
2     2  3.688316
3     4  4.495179
4     6  9.012038
5     8 16.239780
6    10 23.987277
7    12 34.119600
8    14 38.300412
9    16 46.904079
10   18 57.394757
11   20 66.511708
12   21 71.510273
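
The ... in the usage is passed on to FUN, which is handy when the data contain missing values. ChickWeight has none, so this sketch plants one NA in a copy (cw, a name I made up) purely for illustration:

```r
# Work on a copy so the original dataset is untouched
cw <- as.data.frame(ChickWeight)
cw$weight[1] <- NA  # hypothetical missing value, for illustration only

# Without na.rm, the group containing the NA has an NA mean
with_na <- aggregate(cw$weight, list(time = cw$Time), mean)
head(with_na, 3)

# na.rm = TRUE is forwarded to mean() through aggregate's ... argument
no_na <- aggregate(cw$weight, list(time = cw$Time), mean, na.rm = TRUE)
head(no_na, 3)
```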

#we could also aggregate on time and diet
head(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
  time diet        x
1    0    1 41.40000
2    2    1 47.25000
3    4    1 56.47368
4    6    1 66.78947
5    8    1 79.68421
6   10    1 93.05263
tail(aggregate(data$weight,
               list(time = data$Time, diet = data$Diet),
               mean))
   time diet        x
43   12    4 151.4000
44   14    4 161.8000
45   16    4 182.0000
46   18    4 202.9000
47   20    4 233.8889
48   21    4 238.5556
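
Rather than running the same aggregate call twice, the result can be stored once and inspected afterwards. This sketch uses the formula method with two grouping variables; by_time_diet is an assumed name.

```r
# Aggregate once on time and diet, then reuse the stored result
by_time_diet <- aggregate(weight ~ Time + Diet, data = ChickWeight, FUN = mean)

# One row per time/diet combination
nrow(by_time_diet)

head(by_time_diet)
tail(by_time_diet)
```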

#to see the weights over time across different diets
ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
             facet_wrap(~Diet) +
             guides(col=guide_legend(ncol=3))

There's a very informative post on using apply in R, but I tend to forget which specific apply function to use. Lecture 2 of the course also introduced apply, so to reinforce my own understanding I'll go through the examples here.

?apply
#apply functions over array margins
#apply(X, MARGIN, FUN, ...)
#make up some dataframe
df <- data.frame(first = 1:10, second = 11:20)
df
   first second
1      1     11
2      2     12
3      3     13
4      4     14
5      5     15
6      6     16
7      7     17
8      8     18
9      9     19
10    10     20
#2 is for columns
apply(df, 2, mean)
 first second 
   5.5   15.5
#1 is for rows
apply(df, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
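
For the specific case of means, base R also provides colMeans and rowMeans, which should give the same numbers as the apply calls above. A small sketch to check that (col_means and row_means are my own names):

```r
df <- data.frame(first = 1:10, second = 11:20)

# Dedicated helpers for column and row means
col_means <- colMeans(df)
row_means <- rowMeans(df)

# Compare against apply; unname() so only the values are compared
all.equal(unname(col_means), unname(apply(df, 2, mean)))
all.equal(unname(row_means), unname(apply(df, 1, mean)))
```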

#write function to sample 10 numbers
#from a Poisson distribution according to lambda
f <- function(l){
   rpois(10, l)
}
f(10)
 [1] 10  3 14 13 13 12  8  8 13 12

#lapply = apply a function over a list or vector; always returns a list
#for reproducibility
set.seed(123)
#save into draws
draws <- lapply(1:5,f)
draws
[[1]]
 [1] 0 2 1 2 3 0 1 2 1 1

[[2]]
 [1] 5 2 3 2 0 4 1 0 1 5

[[3]]
 [1] 5 4 3 8 4 4 3 3 2 1

[[4]]
 [1] 8 7 5 6 1 4 5 2 3 2

[[5]]
 [1] 3 4 4 4 3 3 3 5 4 7

sapply(draws, mean)
[1] 1.3 2.3 3.7 4.3 4.0
#difference with lapply?
#lapply always returns a list. sapply (if it can) simplifies the results
lapply(draws,mean)
[[1]]
[1] 1.3

[[2]]
[1] 2.3

[[3]]
[1] 3.7

[[4]]
[1] 4.3

[[5]]
[1] 4
#get same result as sapply
unlist(lapply(draws,mean))
[1] 1.3 2.3 3.7 4.3 4.0
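
Two related things that were not in the lecture examples, shown here as a hedged sketch: vapply works like sapply but makes you declare the expected return type, and sapply simplifies to a matrix when the function returns more than one value per element. The names means and ranges are my own.

```r
set.seed(123)
# Same kind of list as above: 10 Poisson draws for each lambda from 1 to 5
draws <- lapply(1:5, function(l) rpois(10, l))

# vapply: like sapply, but errors if a result is not a single numeric value
means <- vapply(draws, mean, FUN.VALUE = numeric(1))
identical(means, sapply(draws, mean))

# range() returns two values per element, so sapply builds a matrix
ranges <- sapply(draws, range)
dim(ranges)  # one row each for min and max, one column per list element
```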

I'm only up to the third lecture and have already picked up some cool tricks. Here is the link to the course again if you want to follow it.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

14 comments

Alexander says:

February 28, 2014 at 11:29

quite helpful. Thanks!

Rachel says:

March 29, 2014 at 19:44

This was just what I needed. Thank You!

Mark says:

May 22, 2014 at 11:35

Thank you! This was really well written and easy to follow. Exactly what I needed 🙂

nanobi says:

August 18, 2014 at 10:01

I’m a bit stuck on how I would use the apply or aggregate function on a correlation. How do I specify which column is x and which one is y? I tried to specify the x and y in the function (i.e. cor(x = df$A, y = df$B)), but that didn’t work.
I get the following error:
Error in FUN(X[[1L]], …) : supply both ‘x’ and ‘y’ or a matrix-like ‘x’

I am looking to get a single vector (sapply? or mapply?).

1. Davo says:
  
  August 18, 2014 at 13:18
  
  You can use the cor() function on a data frame or matrix without needing to use apply or aggregate. For example:
  
  cor(iris[,1:4])
  
Jan says:

August 26, 2014 at 07:18

Hi, thank you for your very useful post.

I keep receiving an error:

Error in aggregate.data.frame(as.data.frame(x), …) :
no rows to aggregate

I wonder if you could help with a solution.

Many thanks,
Jan

1. Davo says:
  
  August 26, 2014 at 07:32
  
  Hi Jan,
  
  it’s a bit hard to troubleshoot since I’m not sure what you did and I don’t have a hold of your data frame.
  
  But here are some things to check: is there anything inside your data frame object? What do you get when you use the nrow() function on your data frame?
  
  Cheers,
  
  Dave
  
Pingback: aggregate() function in R | My Blog

Grace Xu says:

April 6, 2015 at 21:40

Thanks! This is a very easy-to-follow example.

petter says:

September 23, 2015 at 01:42

quite helpful ! thx bro !

Valerock says:

October 26, 2015 at 21:06

This is splendid. I will follow the course immediately. Grateful. Thanks

Mark says:

November 17, 2015 at 06:30

This function description was really helpful

Anita Owens says:

December 25, 2015 at 23:57

Super clear and helpful. Exactly what I needed. I had some n/a values so adding na.rm=TRUE after mean was all that was needed. Thanks a million!
aggdata<-aggregate(avgGDP$Ranking, list(Ranking=avgGDP$Income.Group),mean, na.rm=TRUE)

Pattareeya says:

April 18, 2017 at 20:05

Thank you for the example of aggregate function and ggplot2. It is easy to follow and understand.
