Minimal examples help

24 Feb 2015 07:24 data.table , join , R 9 comments

The other day I got stuck working with a huge data set using data.table in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help, Stack Overflow or indeed the package authors. Of course, more often than not, by following this advise, the problem becomes clear and with that the solution obvious.

Ok, here is the problem. Well, easy to write down now, after I understood it.

Suppose, I have some data that describes my sales targets by product and quarter:

library(data.table)
Plan <- data.table(
  Product=c(rep("Apple",3),rep("Kiwi",3),rep("Coconut",3)),
  Quarter=rep(c(1,2,3), 3),
  Target=1:9)
Plan
##    Product Quarter Target
## 1:   Apple       1      1
## 2:   Apple       2      2
## 3:   Apple       3      3
## 4:    Kiwi       1      4
## 5:    Kiwi       2      5
## 6:    Kiwi       3      6
## 7: Coconut       1      7
## 8: Coconut       2      8
## 9: Coconut       3      9

Further, I have some actual data, which is also broken down by region, but has no data for coconut:

Actual <- data.table(
 Region=rep(c("North", "South"), each=4),
 Product=rep(c("Apple", "Kiwi"), times=4),
 Quarter=rep(c(1,1,2,2), 2), Sales=1:8)
Actual
##    Region Product Quarter Sales
## 1:  North   Apple       1     1
## 2:  North    Kiwi       1     2
## 3:  North   Apple       2     3
## 4:  North    Kiwi       2     4
## 5:  South   Apple       1     5
## 6:  South    Kiwi       1     6
## 7:  South   Apple       2     7
## 8:  South    Kiwi       2     8

What I would like to do is to join both data sets together, so that I can compare my sales figures with my targets. In particular, I would like to see also my targets for future quarters. However, I would like to filter out the target data for those products that are not available in a region, coconut in my example.

First I have to set keys for my data sets on which I would like to join them:

setkey(Actual, Product, Quarter)
setkey(Plan, Product, Quarter)

Because I want to see also future targets I am not using Plan[Actual]. Instead I join the Plan data for each region; but then I get also the target data for coconut:

Actual[, .SD[Plan], by=list(Region)]
##     Region Product Quarter Sales Target
##  1:  North   Apple       1     1      1
##  2:  North   Apple       2     3      2
##  3:  North   Apple       3    NA      3
##  4:  North Coconut       1    NA      7
##  5:  North Coconut       2    NA      8
##  6:  North Coconut       3    NA      9
##  7:  North    Kiwi       1     2      4
##  8:  North    Kiwi       2     4      5
##  9:  North    Kiwi       3    NA      6
## 10:  South   Apple       1     5      1
## 11:  South   Apple       2     7      2
## 12:  South   Apple       3    NA      3
## 13:  South Coconut       1    NA      7
## 14:  South Coconut       2    NA      8
## 15:  South Coconut       3    NA      9
## 16:  South    Kiwi       1     6      4
## 17:  South    Kiwi       2     8      5
## 18:  South    Kiwi       3    NA      6

Ok, that means I have to filter for the products in my actual data to match the relevant planning data:

Actual[, .SD[
  Plan[
    Product %in% unique(.SD[, Product])
    ]
  ], by=list(Region)]
##     Region Product Quarter Sales Target
##  1:  North   Apple       1     1      1
##  2:  North   Apple       2     3      2
##  3:  North   Apple       3    NA      3
##  4:  North    Kiwi       1     2      4
##  5:  North    Kiwi       2     4      5
##  6:  North    Kiwi       3    NA      6
##  7:  South   Apple       1     5      1
##  8:  South   Apple       2     7      2
##  9:  South   Apple       3    NA      3
## 10:  South    Kiwi       1     6      4
## 11:  South    Kiwi       2     8      5
## 12:  South    Kiwi       3    NA      6

That's it. Now I can get back to my original huge and complex data set and move on.

Please let me know if there is a better way of achieving the above.

Session Info

R version 3.1.2 Patched (2015-01-20 r67564)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] data.table_1.9.4

loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.4 reshape2_1.4.1 stringr_0.6.2

9 comments :

Arun Srinivasan said...: Very nice post. My only (minor) suggestion would be that `unique()` is not really necessary, i.e., `Actual[, .SD[Plan[Product %in% .SD$Product]], by=Region]`; 24 February 2015 at 09:24
Markus Gesmann said...: Well, without unique I do get more rows back.; 24 February 2015 at 12:26
Arun Srinivasan said...: Markus, I'm confused. `a %in% b` will always return in a logical vector = `length(a)`. 1:5 %in% c(1,1,1) vs 1:5 %in% c(1). So I'm not sure how `Product %in% .SD$Product` would return a different result compared to `Product %in% unique(.SD$Product)`. What am I missing?; 24 February 2015 at 19:38
Markus Gesmann said...: Here is the example as a gist, including your version at the end: https://gist.github.com/mages/c70c8dba25ae73587fea; 24 February 2015 at 20:49
Arun Srinivasan said...: Thanks. Just checked with 1.9.4. It seems to be due to auto indexing bug, which has since then been fixed in 1.9.5 (which I am on). You can verify this by running the same after disabling auto indexing with options(datatable.auto.index=FALSE) (or upgrading to 1.9.5).; 24 February 2015 at 20:55
Markus Gesmann said...: Makes sense and works now. Thanks!; 24 February 2015 at 21:15
venoush said...: This might be faster I guess...
Plan[Actual[CJ(unique(Product), unique(Plan$Quarter), unique(Region))]]; 25 February 2015 at 08:50
venoush said...: Oh, I forgot to add that
1. you need to add Region to setkey(Actual, Product, Quarter)
2. you can simply add [order(Region)] to get the same ordering in the output

Plan[Actual[CJ(unique(Product), unique(Plan$Quarter), unique(Region))]][order(Region)]; 25 February 2015 at 09:32
Gabor Grothendieck said...: This can be understood as a triple join. In terms of SQL:

sqldf("select * from Plan
left join (select distinct Product, Region from Actual) using (Product)
left join Actual using (Region, Product, Quarter)
where Region is not null
order by Region, Product, Quarter"); 25 February 2015 at 18:43

Minimal examples help

Session Info

9 comments :

Post a Comment

Popular Posts

Blog Archive

My Blog List

ContactForm