Say it in R with "by", "apply" and friends
28 Jan 2012
14:43
aggregate, apply, by, data.table, doBy, language, plyr, R, sqldf, Tutorials
14 comments
Iris versicolor. By Danielle Langlois. License: CC-BY-SA
Languages are full of surprises, in particular for non-native speakers. The other day I learned that there is courtesy and there is curtsey. Both words sounded very similar to me, but mixing them up in an email of course caused some laughter.
With languages you can get into the habit of using certain words and phrases, but sometimes you see or hear something that shakes you up again. The following two lines of R did that to me:
f <- function(x) x^2
sapply(1:10, f)
[1] 1 4 9 16 25 36 49 64 81 100
It reminded me of the phrase that everything is a list in R. It showed me again how easily a for loop can be turned into a statement using the apply family of functions and how little I know about all the subtleties of R.
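To make the comparison with a for loop concrete, here is a minimal sketch that computes the same squares three ways. The names res_loop, res_sapply and res_vec are only illustrative and not part of the original example.

```r
f <- function(x) x^2

# Explicit loop: pre-allocate the result, then fill it element by element
res_loop <- numeric(10)
for (i in 1:10) {
  res_loop[i] <- f(i)
}

# sapply calls f once per element and simplifies the results to a vector
res_sapply <- sapply(1:10, f)

# Because ^ is vectorised, f can also be called on the whole vector at once
res_vec <- f(1:10)

# All three approaches give the same values
identical(res_loop, res_sapply)
identical(res_sapply, res_vec)
```

For a function like this one the vectorised call is the shortest form; sapply becomes useful when the function only works on one element at a time.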
I remember how happy I felt when I finally understood the by function in R. I started to use it all the time, closing my eyes to aggregate and the apply family of functions. Here is an example where I calculate the means of the various measurements by species of the famous iris data set using by.
by
do.call("rbind", as.list(
  by(iris, list(Species=iris$Species), function(x){
    y <- subset(x, select= -Species)
    apply(y, 2, mean)
  }
)))
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
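The by call above is wrapped in do.call("rbind", ...), which can look puzzling at first. The following sketch, with illustrative variable names and colMeans swapped in for the apply step, shows what by returns and why do.call is used to stack the pieces.

```r
# Column means of the four measurements within each species
means_by <- by(iris[, 1:4], iris$Species, colMeans)

class(means_by)   # an object of class "by"
names(means_by)   # one element per species

# do.call() hands the elements over as separate arguments, i.e. it builds
# the call rbind(setosa = ..., versicolor = ..., virginica = ...)
means_mat <- do.call("rbind", as.list(means_by))
dim(means_mat)

# Calling rbind() directly passes the whole list as a single argument,
# so the three vectors are not stacked into a 3 x 4 matrix
direct <- rbind(as.list(means_by))
dim(direct)
```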
Now let's find alternative ways of expressing ourselves, using other words/functions of the R language, such as aggregate, apply, sapply, tapply, data.table, ddply, sqldf, and summaryBy.
aggregate
The aggregate function splits the data into subsets and computes summary statistics for each of them. The output of aggregate is a data.frame, including a column for species.
iris.x <- subset(iris, select= -Species)
iris.s <- subset(iris, select= Species)
aggregate(iris.x, iris.s, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Addition: As John Christie points out in the comments, aggregate also has a formula interface, which simplifies the call to:
aggregate( . ~ Species, iris, mean)
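The dot in the formula stands for all remaining columns. As a small sketch of how the formula interface can be narrowed down, cbind on the left-hand side selects only some of the measurements; agg_all and agg_sepal are illustrative names.

```r
# All measurement columns, grouped by species
agg_all <- aggregate(. ~ Species, data = iris, FUN = mean)

# Only the two sepal measurements, grouped by species
agg_sepal <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                       data = iris, FUN = mean)

agg_sepal
```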
apply and tapply
The combination of tapply and apply achieves a similar result, but this time the output is a matrix and hence I lose the column with species. The species are now the row names.
apply(iris.x, 2, function(x) tapply(x, iris.s, mean))
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
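It may help to look at the inner step on its own. In this sketch (variable names are illustrative) tapply computes the group means for a single column, and apply then repeats that step for each of the four measurement columns.

```r
# Inner step: mean of one column within each species
sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
sl_means

# Outer step: apply repeats the tapply call for every measurement column
all_means <- apply(iris[, 1:4], 2, function(x) tapply(x, iris$Species, mean))

# The first column of the matrix matches the single-column result
all.equal(as.vector(all_means[, "Sepal.Length"]), as.vector(sl_means))
```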
split and apply
Here I first split the data into subsets for each of the species and then calculate the mean for each column in the subset. The output is a matrix again, but transposed.
sapply(split(iris.x, iris.s), function(x) apply(x, 2, mean))
             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026
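A short sketch of the two steps, using illustrative names and colMeans in place of the apply call: split returns a named list with one data frame per species, and t() turns the transposed result back into the species-by-measurement layout of the earlier examples.

```r
# Step 1: a named list with one data frame per species
groups <- split(iris[, 1:4], iris$Species)
names(groups)
sapply(groups, nrow)   # number of rows in each subset

# Step 2: column means per subset; sapply binds them as columns (4 x 3)
means_t <- sapply(groups, colMeans)

# Transpose to get species as rows again (3 x 4)
means <- t(means_t)
means
```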
ddply
Hadley Wickham's plyr package provides tools for splitting, applying and combining data. The function ddply is similar to the by function, but it returns a data.frame instead of a by list and maintains the column for the species.
library(plyr)
ddply(iris, "Species", function(x){
  y <- subset(x, select= -Species)
  apply(y, 2, mean)
})
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Addition: Sean mentions in the comments an alternative, using the colMeans function, while Andrew reminds us of the reshape package with its functions melt and cast.
ddply(iris, "Species", function(x) colMeans(subset(x, select= -Species)))
## or
ddply(iris, "Species", colwise(mean))
## same output as above
library(reshape)
cast(melt(iris, id.vars='Species'),formula=Species ~ variable,mean)
## same output as above
summaryBy
The summaryBy function of the doBy package by Søren Højsgaard and Ulrich Halekoh has a very intuitive interface, using formulas.
library(doBy)
summaryBy(Sepal.Length + Sepal.Width + Petal.Length + Petal.Width ~ Species, data=iris, FUN=mean)
     Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026
sqldf
If you are fluent in SQL, then the sqldf package by Gabor Grothendieck might be the one for you.
library(sqldf)
sqldf("select Species, avg(Sepal_Length), avg(Sepal_Width),
avg(Petal_Length), avg(Petal_Width) from iris
group by Species")
     Species avg(Sepal_Length) avg(Sepal_Width) avg(Petal_Length) avg(Petal_Width)
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026
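The query above relies on sqldf translating the dots in the column names to underscores. As far as I am aware, more recent versions of sqldf and RSQLite no longer do this, in which case the original column names have to be quoted. The sketch below assumes such a recent version; the column aliases and the name res_sql are illustrative, and the code is skipped if the package is not installed.

```r
# Assumes the sqldf package is installed; otherwise nothing is computed
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  # Column names containing dots are wrapped in double quotes (assumed to be
  # needed with recent sqldf/RSQLite versions); "as" gives readable names
  res_sql <- sqldf('select Species,
                      avg("Sepal.Length") as Sepal_Length,
                      avg("Sepal.Width")  as Sepal_Width,
                      avg("Petal.Length") as Petal_Length,
                      avg("Petal.Width")  as Petal_Width
                    from iris
                    group by Species')
} else {
  res_sql <- NULL
}
res_sql
```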
data.table
The data.table package by M Dowle, T Short and S Lianoglou is the real rock star to me. It provides an elegant and fast way to complete the task. The statement reads in plain English from right to left: take columns 1 to 4, split them by the factor in column "Species" and calculate the means on each subset of the data (.SD).
library(data.table)
iris.dt <- data.table(iris)
iris.dt[,lapply(.SD,mean),by="Species",.SDcols=1:4]
        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]     setosa        5.006       3.428        1.462       0.246
[2,] versicolor        5.936       2.770        4.260       1.326
[3,]  virginica        6.588       2.974        5.552       2.026
apply
I should mention that R also provides the iris data set in array form. The third dimension of the iris3 array holds the species information. Therefore I can use the apply function again: I go down the third and then the second dimension to calculate the means.
apply(iris3, c(3,2), mean)
           Sepal L. Sepal W. Petal L. Petal W.
Setosa        5.006    3.428    1.462    0.246
Versicolor    5.936    2.770    4.260    1.326
Virginica     6.588    2.974    5.552    2.026
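A quick look at the structure of iris3 shows why the margins c(3,2) are used. In this sketch (variable names are illustrative) the order of the margins determines the layout of the result, so swapping them gives the transposed matrix.

```r
# 50 observations x 4 measurements x 3 species
dim(iris3)
dimnames(iris3)[[3]]   # species names along the third dimension

# Species (dimension 3) as rows, measurements (dimension 2) as columns
means3 <- apply(iris3, c(3, 2), mean)

# Swapping the margins gives measurements as rows and species as columns
means3_t <- apply(iris3, c(2, 3), mean)

# Transposing the second result recovers the first
all.equal(unname(means3), unname(t(means3_t)))
```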
14 comments :
I've been trying to collect posts like these for my greater understanding... can you tell me why, in the `by' example, that you need to do `do.call("rbind", ...)' instead of just `rbind(...)'? I tried the `rbind(...)' way, and I don't get the same result as your way, and your way has it packaged better...
do.call allows you to pass the arguments from the outside into the function, so that the function (in this case rbind) is only called once. See also Marc Schwartz comments on R-help: https://stat.ethz.ch/pipermail/r-help/2007-April/129251.html
Thanks!
You certainly show the subtleties of R and when you learn a language (I am not a native English speaker either), it is great to find different ways of passing on your thoughts, or your questions till you feel happy with your choice.
Thank you!
Questio
Thanks for this very clear account. V. helpful to this beginner.
Taking seriously your comment re English, here are a few pointers:
- Another confusing pair is loose/lose. Loose is adjective = 'not tight'; 'lose' is the verb for misplacing something
- Also of/off "a few I know of"
- unknownR (not unkown)
- "forvigable" is a charming invention, but the correct word is "forgiving"
Having said that, I'd far rather read an article like this with a smattering of English errors than one written by a native English speaker who has no idea how to explain things! Unfortunately, many programmers have no intuitive grasp of what a novice needs to know.
Many thanks for your feedback and corrections!
Great overview. I would suggest one minor simplification of the ddply example:
library(plyr)
ddply(iris, "Species", function(x) colMeans(subset(x, select= -Species)))
Fantastic! Please do so more often, it was very useful. Also, thanks for pointing out to the unknownR package.
Hi - great post. I would probably use the reshape package for this job - you end up with the following oneliner:
cast(melt(iris, id.vars='Species'),formula=Species ~ variable,mean)
cast is a great function - it allows for very complex pivot tables to be formed
Definitely a "keeper". I had given up on the "apply, by and company" in favor of the abstraction allowed by ddply. But this is very good study material to get a grasp of these "traditional" function. sqldf is also a very nice discovery. Many thanks for your post!
The interface to aggregate is quite a bit more versatile than indicated here. This is an alternative use.
aggregate( . ~ Species, iris, mean)
I think that's really the simplest way to write this... and one of the fastest executing. In fact, given that aggregate has a fully formula interface it now completely obviates the summaryBy command.
How could I forget this?! It's even part of the examples in the help file to aggregate.
Many thanks for the reminder John!
Admittedly not quite as neat as the aggregate example, here's another ddply version:
ddply(iris, "Species", colwise(mean))
Thanks for the great tutorials! Can I just ask - maybe it's purely stylistic - but why do you use subset() instead of [ ] indexing? For some reason I am convinced that the bracket notation is the cleaner/safer/purer option, but I can see it must just be a matter of preference?
Doing the same thing several ways in R was great; you showed what it takes to get a grasp on the language rather than just knowing one way of doing something. But for a beginner, if you could add a small section that explains what is going on, it would have helped in some places. To not distract the rest of the people, perhaps these sections could be mouse over popups, or expandable rows underneath the examples.