Data.table rocks! Data manipulation the fast way in R

27 Nov 2012 07:42 data.table, Insurance, R, Tutorials

I really should make a habit of using data.table. The speed and simplicity of this R package are astonishing.

Here is a simple example: I have a data frame showing incremental claims development by line of business and origin year. Now I would like to add a column with the cumulative claims position for each line of business and each origin year along the development years.

It's one line with data.table! Here it is:

myData[order(dev), cvalue:=cumsum(value), by=list(origin, lob)]

It is even easy to read! Notice also that I don't have to copy the data. The operator ':=' works by reference and is one of the reasons why data.table is so fast.
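
To see the one-liner at work, here is a minimal sketch on a small made-up data set. The column names follow the call above; the table contents and the line-of-business labels are assumptions for illustration only, not the data used in this post.

library(data.table)

# Hypothetical toy data: incremental claims for two lines of business,
# two origin years and three development years (all values made up).
myData <- data.table(
  lob    = rep(c("Motor", "Property"), each = 6),
  origin = rep(rep(c(2010, 2011), each = 3), times = 2),
  dev    = rep(1:3, times = 4),
  value  = c(100, 50, 25, 120, 60, 30, 200, 80, 40, 220, 90, 45)
)

# Add the cumulative position by reference; order(dev) makes the
# running sum follow the development years within each group.
myData[order(dev), cvalue := cumsum(value), by = list(origin, lob)]

# Inspect one group: cvalue is the running total of value.
print(myData[lob == "Motor" & origin == 2010])

No new object is created here: myData itself gains the cvalue column.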

And it is getting even better. Suppose you want to get the latest claims development position for each line of business and origin year. Again, it is only one line:

latestData <- myData[, .SD[which.max(dev)], by=list(origin, lob)]
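
One point worth checking: a single number inside .SD[ ] selects a row by position, so the latest row is most safely picked with which.max(dev), which returns the position of the largest development period within each group. A minimal sketch on a made-up triangle, where dev is assumed to be in months and the object names tri and latest are hypothetical:

library(data.table)

# Hypothetical claims triangle for one line of business (made-up values);
# dev is in months here, so it is not a row position.
tri <- data.table(
  lob    = "Motor",
  origin = c(2010, 2010, 2010, 2011, 2011, 2012),
  dev    = c(12, 24, 36, 12, 24, 12),
  cvalue = c(100, 150, 175, 120, 180, 130)
)

# which.max(dev) gives the row position of the latest development
# period within each (origin, lob) group.
latest <- tri[, .SD[which.max(dev)], by = list(origin, lob)]
print(latest)

Each origin year keeps exactly one row, the one with the largest dev.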

Oh boy, I should update my ChainLadder package and utilise the power and elegance of data.table. Many thanks to Matt Dowle and his collaborators for all their fantastic work.

Here is the R code of the examples above:

Session Info

R Under development (unstable) (2012-10-19 r60974)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lattice_0.20-10  data.table_1.8.4

loaded via a namespace (and not attached):
[1] grid_2.16.0

10 comments :

Lyudmil Antonov said...: The command:

myData[order(dev), cvalue:=cumsum(value), by=list(origin, lob)]

gives

Error in `[.data.table`(myData, order(dev), `:=`(cvalue, cumsum(value)), :
  Combining := in j with by is not yet implemented. Please let maintainer('data.table') know if you are interested in this.; 27 November 2012 at 11:27
Matthew Dowle said...: Do you have an old version? Markus used 1.8.4 - see the output of his sessionInfo(). Type "update.packages()" to upgrade.; 27 November 2012 at 11:53
Vijay Barve said...: This is a nice practical example. Thanks for that.

I needed two changes to the first statement to get it to run

1. "paste" in place of "paste0"

2. sep="" at the end to avoid a space in the url

So it looks like this

url <- paste("http://www.google.com/fusiontables/api/query?",
"sql=SELECT+*+FROM+1SL7c4TwyI1YxuQELc0R3PjsYC3TwhP3o7k_NZzc",sep=""); 27 November 2012 at 20:34
Markus Gesmann said...: Hi Vijay,

I guess you use an older version of R than I, as paste0 was added to R with version 2.15.0. See the NEWS section in the R Journal for more details: http://journal.r-project.org/archive/2012-1/RJournal_2012-1.pdf

Cheers

Markus; 27 November 2012 at 20:44
Harald R said...: Hi Vijay,

if you don't want to upgrade to a newer R version for some reason, try adding this to your .Rprofile:

## function paste0
if (!exists("paste0", where = "package:base")) {
  paste0 <- function(...) paste(..., sep = "")
}

Cheers
harald; 28 November 2012 at 03:47
Alok said...: Thanks for the post Markus. Besides the speed, would you say that the data.table package is a good replacement for the plyr and reshape packages in the fullest sense?; 28 November 2012 at 09:06
Markus Gesmann said...: I am afraid that I know too little about all those packages to form an opinion.; 28 November 2012 at 09:32
Alok said...: Thanks Markus. Guess, I will fiddle with it myself and check it out.; 28 November 2012 at 13:14
Arun said...: Nice question. Yes. v1.8.11 implements fast 'melt' and 'dcast' functions (in C). Have a look at benchmarks here: https://gist.github.com/arunsrinivasan/7839891; 15 December 2013 at 08:49
