2016 October 13th: I wrote a post on using dplyr to perform the same aggregating functions as in this post; personally I prefer dplyr.

I recently came across a course on data analysis and visualisation and I'm gradually working through each lecture. I just finished the second lecture, where the section "Working with dataframes and vectors efficiently" introduced me to the aggregate function, which I can see being extremely useful. In this post, I will write about aggregate, apply, lapply and sapply, which were also introduced in the lecture.

Let's get started with the ChickWeight dataset (see ?ChickWeight) available in the R datasets:

    #load data
    data <- ChickWeight
    head(data)
      weight Time Chick Diet
    1     42    0     1    1
    2     51    2     1    1
    3     59    4     1    1
    4     64    6     1    1
    5     76    8     1    1
    6     93   10     1    1

    #dimension of the data
    dim(data)
    [1] 578   4

    #how many chickens
    unique(data$Chick)
     [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
    50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

    #how many diets
    unique(data$Diet)
    [1] 1 2 3 4
    Levels: 1 2 3 4

    #how many time points
    unique(data$Time)
     [1]  0  2  4  6  8 10 12 14 16 18 20 21

    library(ggplot2)
    ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
      geom_line() +
      geom_point()

*Over time the chickens got heavier*.

Now to use aggregate; the usage as defined by ?aggregate:

    ## S3 method for class 'data.frame'
    aggregate(x, by, FUN, ..., simplify = TRUE)

    #find the mean weight depending on diet
    aggregate(data$weight, list(diet = data$Diet), mean)
      diet        x
    1    1 102.6455
    2    2 122.6167
    3    3 142.9500
    4    4 135.2627

    #aggregate on time
    aggregate(data$weight, list(time=data$Time), mean)
       time         x
    1     0  41.06000
    2     2  49.22000
    3     4  59.95918
    4     6  74.30612
    5     8  91.24490
    6    10 107.83673
    7    12 129.24490
    8    14 143.81250
    9    16 168.08511
    10   18 190.19149
    11   20 209.71739
    12   21 218.68889

    #use a different function
    aggregate(data$weight, list(time=data$Time), sd)
       time         x
    1     0  1.132272
    2     2  3.688316
    3     4  4.495179
    4     6  9.012038
    5     8 16.239780
    6    10 23.987277
    7    12 34.119600
    8    14 38.300412
    9    16 46.904079
    10   18 57.394757
    11   20 66.511708
    12   21 71.510273

    #we could also aggregate on time and diet
    head(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                   )
    )
      time diet        x
    1    0    1 41.40000
    2    2    1 47.25000
    3    4    1 56.47368
    4    6    1 66.78947
    5    8    1 79.68421
    6   10    1 93.05263

    tail(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                   )
    )
       time diet        x
    43   12    4 151.4000
    44   14    4 161.8000
    45   16    4 182.0000
    46   18    4 202.9000
    47   20    4 233.8889
    48   21    4 238.5556

    #to see the weights over time across different diets
    ggplot(data) +
      geom_line(aes(x=Time, y=weight, colour=Chick)) +
      facet_wrap(~Diet) +
      guides(col=guide_legend(ncol=3))
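As an aside, aggregate also has a formula interface, which keeps the original column names in the result instead of the generic x. Here is a minimal sketch on the same ChickWeight data; the object names by_diet and by_time_diet are just ones I made up for illustration:

```r
# formula interface: mean weight for each diet
by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
by_diet

# group on two variables at once: mean weight for each time point within each diet
by_time_diet <- aggregate(weight ~ Time + Diet, data = ChickWeight, FUN = mean)
head(by_time_diet)
```

The grouping columns come out named Diet and Time, and the aggregated column is named weight, which saves a renaming step when passing the result on to ggplot2.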

Now there's this very informative post on using apply in R. However, I tend to forget which specific apply function to use. In lecture 2 of the course, apply was introduced, and to reinforce my own understanding I'll provide the examples here.

    ?apply
    #apply functions over array margins
    #apply(X, MARGIN, FUN, ...)

    #make up some dataframe
    df <- data.frame(first = c(1:10), second = c(11:20))
    df
       first second
    1      1     11
    2      2     12
    3      3     13
    4      4     14
    5      5     15
    6      6     16
    7      7     17
    8      8     18
    9      9     19
    10    10     20

    #2 is for columns
    apply(df, 2, mean)
     first second
       5.5   15.5

    #1 is for rows
    apply(df, 1, mean)
     [1]  6  7  8  9 10 11 12 13 14 15

    #write function to sample 10 numbers
    #from a Poisson distribution according to lambda
    f <- function(l){
      rpois(10, l)
    }
    f(10)
     [1] 10  3 14 13 13 12  8  8 13 12

    #lapply = apply a function over a list
    #for reproducibility
    set.seed(123)
    #save into draws
    draws <- lapply(1:5,f)
    draws
    [[1]]
     [1] 0 2 1 2 3 0 1 2 1 1

    [[2]]
     [1] 5 2 3 2 0 4 1 0 1 5

    [[3]]
     [1] 5 4 3 8 4 4 3 3 2 1

    [[4]]
     [1] 8 7 5 6 1 4 5 2 3 2

    [[5]]
     [1] 3 4 4 4 3 3 3 5 4 7

    sapply(draws, mean)
    [1] 1.3 2.3 3.7 4.3 4.0

    #difference with lapply?
    #lapply always returns a list. sapply (if it can) simplifies the results
    lapply(draws,mean)
    [[1]]
    [1] 1.3

    [[2]]
    [1] 2.3

    [[3]]
    [1] 3.7

    [[4]]
    [1] 4.3

    [[5]]
    [1] 4

    #get same result as sapply
    unlist(lapply(draws,mean))
    [1] 1.3 2.3 3.7 4.3 4.0
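Two related things that were not part of the lecture examples but may help when deciding between these functions: vapply works like sapply except that you declare what each result should look like, and sapply returns a matrix when every result has the same length greater than one. A small sketch, rebuilding draws so the block runs on its own (the names means and ranges are just for illustration):

```r
set.seed(123)
draws <- lapply(1:5, function(l) rpois(10, l))

# vapply: like sapply, but each result must be a single numeric value,
# otherwise an error is raised instead of silently changing the return type
means <- vapply(draws, mean, FUN.VALUE = numeric(1))
means

# range() returns two values (min and max) per list element,
# so sapply simplifies the results to a matrix with one column per element
ranges <- sapply(draws, range)
dim(ranges)
```

The stricter vapply is handy inside functions, where a surprise list coming back from sapply can be hard to track down.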

I'm only up to the third lecture and have already picked up some cool tricks. Here is the link to the course again if you want to follow it.

This work is licensed under a Creative Commons Attribution 4.0 International License.

quite helpful. Thanks!

This was just what I needed. Thank You!

Thank you! This was really well written and easy to follow. Exactly what I needed 🙂

I'm a bit stuck on how I would use the apply or aggregate function on a correlation. How do I specify which column is x and which one is y? I tried to specify the x and y in the function (i.e. cor(x = df$A, y = df$B)), but that didn't work.

I get the following error:

Error in FUN(X[[1L]], ...) : supply both 'x' and 'y' or a matrix-like 'x'

I am looking to get a single vector (sapply? or mapply?).

You can use the cor() function on a data frame or matrix without needing to use apply or aggregate. For example:

`cor(iris[,1:4])`
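If what you are after is one correlation per group, one possible approach is to split the data frame by the grouping column and apply cor to each piece. This is only a sketch, using iris and its Species column as a stand-in for your own data; by_species and cors are made-up names:

```r
# split the data frame into one piece per species
by_species <- split(iris, iris$Species)

# compute a single correlation for each piece; sapply returns a named vector
cors <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))
cors
```

The error you saw comes from aggregate and apply handing FUN one column at a time, so cor never receives both x and y; splitting first gives the function the whole sub data frame to work with.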

Hi, thank you for your very useful post.

I keep receiving an error:

Error in aggregate.data.frame(as.data.frame(x), ...) :

no rows to aggregate

I wonder if you could help with a solution.

Many thanks,

Jan

Hi Jan,

it's a bit hard to troubleshoot since I'm not sure what you did and I don't have a hold of your data frame.

But here are some things to check: is there anything inside your data frame object? What do you get when you use the nrow() function on your data frame?

Cheers,

Dave

Thanks! This is a very easy-to-follow example.

quite helpful ! thx bro !

This is splendid. I will follow the course immediately. Grateful. Thanks

This function description was really helpful

Super clear and helpful. Exactly what I needed. I had some n/a values so adding na.rm=TRUE after mean was all that was needed. Thanks a million!

aggdata<-aggregate(avgGDP$Ranking, list(Ranking=avgGDP$Income.Group),mean, na.rm=TRUE)

Thank you for the example of aggregate function and ggplot2. It is easy to follow and understand.