The reshape function
The other day I wrote about the R functions by, apply and friends, which allow me to operate on subsets of data. All those functions work nicely if the data is given in the right format. More often than not it isn't, and I have to reshape the data beforehand. Thus, time to discuss the reshape function. I will focus on the reshape function in base R, and not the package of the same name.
I use Fisher's iris data set again, as it is readily available after starting R. The iris data set has 150 observations and the first 6 rows look like this:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I would like to create a box-and-whisker plot showing the measurements of the observations for each of the species.
I know that if I had all measurements in one column and the dimension in another column, I could produce a graph like this in one line with lattice.
library(lattice)
bwplot(Measurement ~ Species | Dimension, data=reshaped.iris)
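To make the target layout concrete, here is a minimal sketch that builds it by hand for just the first two flowers and two of the four dimensions. The name target is made up for this illustration, and doing this manually for all 150 rows and four columns would quickly become tedious.

```r
# Hand-built long layout for the first two flowers and two dimensions only.
# 'target' is a hypothetical name used for this illustration.
target <- data.frame(
  Species     = rep(iris$Species[1:2], times = 2),
  Dimension   = rep(c("Sepal.Length", "Sepal.Width"), each = 2),
  Measurement = c(iris$Sepal.Length[1:2], iris$Sepal.Width[1:2])
)
print(target)
```

Each row holds one measurement, and the Dimension column says which of the original columns it came from.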
Hence the reshape function is what I need. From the help file I learn that I want to transform my data from a wide format into a long format (direction="long"). In the long format I would like a variable with the measurements (v.names="Measurement"), which I get by running through the first four columns (varying=1:4). I know which measurement I am reading by looking at the column names (times=names(iris)[1:4]), and I capture the dimension names in a new variable (timevar="Dimension"). The idvar argument names the column that identifies each flower; as iris has no such column, reshape creates one, here called "Measure ID", from the row numbers. This gives me the following statement:
reshaped.iris <- reshape(iris, varying=1:4, v.names="Measurement",
                         timevar="Dimension", times=names(iris)[1:4],
                         idvar="Measure ID", direction="long")
head(reshaped.iris)
Species Dimension Measurement Measure ID
1.Sepal.Length setosa Sepal.Length 5.1 1
2.Sepal.Length setosa Sepal.Length 4.9 2
3.Sepal.Length setosa Sepal.Length 4.7 3
4.Sepal.Length setosa Sepal.Length 4.6 4
5.Sepal.Length setosa Sepal.Length 5.0 5
6.Sepal.Length setosa Sepal.Length 5.4 6
That's it, I can now create the lattice box-whisker plot.
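As a quick sanity check on the long format, the sketch below repeats the same reshape call under a made-up name (long.check) and looks at the size of the result. With 150 flowers and four measurement columns stacked on top of each other, I would expect 600 rows, 150 for each dimension.

```r
# Same call as above, stored under a hypothetical name for this check.
long.check <- reshape(iris, varying = 1:4, v.names = "Measurement",
                      timevar = "Dimension", times = names(iris)[1:4],
                      idvar = "Measure ID", direction = "long")

dim(long.check)               # rows and columns of the long data
table(long.check$Dimension)   # number of rows per dimension
```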
In my next example I would like the measurements of length and width in separate columns and capture the flower part in a new variable, so I can create scatterplots of length against width. Tweaking the reshape statement slightly gives me:
reshaped.iris.sp <- reshape(iris, varying=list(c(1,3),c(2,4)),
v.names=c("Length", "Width"),
timevar="Part", times=c("Sepal", "Petal"),
idvar="Measure ID", direction="long")
head(reshaped.iris.sp)
Species Part Length Width Measure ID
1.Sepal setosa Sepal 5.1 3.5 1
2.Sepal setosa Sepal 4.9 3.0 2
3.Sepal setosa Sepal 4.7 3.2 3
4.Sepal setosa Sepal 4.6 3.1 4
5.Sepal setosa Sepal 5.0 3.6 5
6.Sepal setosa Sepal 5.4 3.9 6
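Because varying is now a list of column pairs, it is easy to mix up which column ends up where. The following sketch, using made-up names (sp.check, sepal, petal), compares the reshaped values with the original iris columns; stopifnot raises an error if any of them disagree.

```r
# Same call as above, stored under a hypothetical name for this check.
sp.check <- reshape(iris, varying = list(c(1, 3), c(2, 4)),
                    v.names = c("Length", "Width"),
                    timevar = "Part", times = c("Sepal", "Petal"),
                    idvar = "Measure ID", direction = "long")

# Split the long data by flower part.
sepal <- sp.check[sp.check$Part == "Sepal", ]
petal <- sp.check[sp.check$Part == "Petal", ]

# Each part should reproduce the matching original columns.
stopifnot(all(sepal$Length == iris$Sepal.Length),
          all(sepal$Width  == iris$Sepal.Width),
          all(petal$Length == iris$Petal.Length),
          all(petal$Width  == iris$Petal.Width))
```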
xyplot(Length ~ Width | Species, groups=Part,
data=reshaped.iris.sp, auto.key=list(space="right"))
xyplot(Length ~ Width | Part, groups=Species,
data=reshaped.iris.sp, auto.key=list(space="right"))
I think the charts illustrate quite nicely why the iris data set has become a typical test case for many classification techniques in machine learning.
1 comment:
Cool Post. This is helpful. Your blog is great. In keeping with R's "there are many ways of doing something" approach I approached the first problem by using the melt command as follows. (It is an add-on of course).
library(reshape)  # add-on package; its melt() takes the variable_name argument
my.melt = melt(iris, id.var="Species", variable_name="Dimension")
bwplot(value ~ Species | Dimension, data=my.melt, layout=c(4,1))
In the second case I did it using basic data frame manipulation because that is the frame of mind I've been in recently. Using reshape or melt is probably more elegant and general though I also like to point out that knowing how to "sling around" data frames can be a very useful skill. This could be consolidated even more but probably at the expense of readability.
df1 = cbind(iris[c(1:2,5)], Part = unlist(strsplit(names(iris)[1], ".", fixed=TRUE))[1])
df2 = cbind(iris[c(3:4,5)], Part = unlist(strsplit(names(iris)[3], ".", fixed=TRUE))[1])
names(df1)[1:2] = c("Length","Width"); names(df2)[1:2] = c("Length","Width")
xyplot(Length ~ Width | Species, groups=Part, data=rbind(df1,df2), auto.key=list(space="right"), layout=c(3,1))