Saturday, 31 December 2011

Is R turning into an operating system?

Over the years I convinced my colleagues and IT guys that LaTeX/XeLaTeX is the way forward to produce lots of customer reports with individual data, charts, analysis and text. Success! But of course the operating system in the office is still MS Windows.

With my background in Solaris/Linux/Mac OSX I am still a little bit lost in the Windows world, when I have to do such simple tasks as finding and replacing a string in lots of files. Apparently the acronym is FART (find and replace text).

So, what do you do without your beloved command line tools and admin rights? Eventually you start using R instead. So here is my little work-around for replacing "Merry Christmas" with "Happy New Year" in lots of files:

for( f in filenames ){

  x <- readLines(f)
  y <- gsub( "Merry Christmas", "Happy New Year", x )
  cat(y, file=f, sep="\n")

}
You can find a complete self-contained example on github.
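
As a rough, self-contained sketch of the same idea (not the example from the GitHub repository): the temporary directory, file names and file contents below are made up for illustration, and fixed = TRUE is my own addition so that the search string is treated as plain text rather than as a regular expression.

```r
# Hypothetical setup: two small text files in a temporary directory
tmp <- file.path(tempdir(), "fart_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("Dear customer,", "Merry Christmas!"), file.path(tmp, "a.txt"))
writeLines("Merry Christmas and Merry Christmas again", file.path(tmp, "b.txt"))

# Collect the files to process
filenames <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

for (f in filenames) {
  x <- readLines(f)                 # read the file line by line
  y <- gsub("Merry Christmas", "Happy New Year", x, fixed = TRUE)
  writeLines(y, f)                  # overwrite the file with the new text
}

# Read everything back to check the replacement
result <- unlist(lapply(filenames, readLines))
print(result)
```

Note that the loop overwrites the files in place, so on real data it is worth working on a copy first.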

Of course R is not an operating system, yet it can complement one well if your other resources are limited.

Last but not least: Happy New Year!

Friday, 23 December 2011

googleVis 0.2.13: new stepped area chart and improved geo charts

On 7th December Google published a new version of their Visualisation API. The new version adds a new chart type: Stepped Area Chart and provides improvements to Geo Chart. Now Geo Chart has similar functionality to Geo Map, but while Geo Map requires Flash, Geo Chart doesn't, as it renders SVG/VML graphics. So it also works on your iOS devices.

These new features have been added to the googleVis R package in version 0.2.13, which went live on CRAN a few days ago.

The function gvisSteppedAreaChart works very much in the same way as gvisAreaChart. Here is a little example:

library(googleVis)
df <- data.frame(country=c("US", "GB", "BR"), 
                 val1=c(1,3,4), val2=c(23,12,32))
SteppedArea <- gvisSteppedAreaChart(df, xvar="country", 
                                        yvar=c("val1", "val2"),
                           options=list(isStacked=TRUE, 
                                        width=400, height=150))
plot(SteppedArea)

The interface to gvisGeoChart changed slightly to take into account the new version of Geo Chart by Google. The argument numvar has been renamed to colorvar and a new argument sizevar has been added. This allows you to set the size and colour of the bubbles in displayMode='markers' depending on columns in your data frame. Further, you can set far more options than you could before; in particular, you can set not only the region, but also the resolution of your map. More granular maps are not available for all countries, though; for more details see the Google documentation.

Here are two examples, plotting the test data CityPopularity with a Geo Chart. The first plot shows the popularity of New York, Boston, Miami, Chicago, Los Angeles and Houston on the US map, with the resolution set to 'metros' and region set to 'US'. The Google Map API makes the correct assumption about which cities we mean.

library(googleVis) ## requires googleVis version >= 0.2.13
gcus <- gvisGeoChart(CityPopularity, 
        locationvar="City", colorvar="Popularity", 
          options=list(displayMode="markers", 
                       region="US", resolution="metros"), 
         chartid="GeoChart_US")
plot(gcus)

In the second example we set the region to 'US-TX', therefore Google will look for cities with the same names in Texas. And what a surprise, there are cities/towns named Chicago, Los Angeles, Miami, Boston and of course Houston in Texas.

gctx <- gvisGeoChart(CityPopularity, 
        locationvar="City", colorvar="Popularity", 
          options=list(displayMode="markers", 
                       region="US-TX", resolution="metros"), 
         chartid="GeoChart_TX")
plot(gctx)
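
Neither example above uses the new sizevar argument. Here is a minimal sketch of how it might be combined with colorvar. The column Size is purely hypothetical and only there to drive the bubble size; the chart id is my own choice, and the code is skipped if googleVis is not installed.

```r
if (requireNamespace("googleVis", quietly = TRUE)) {
  # Load the test data shipped with the package
  data("CityPopularity", package = "googleVis", envir = environment())
  df <- CityPopularity

  # Hypothetical second measure, made up for illustration only
  df$Size <- seq_len(nrow(df))

  # Colour of the markers follows Popularity, their size follows Size
  gcsize <- googleVis::gvisGeoChart(df,
              locationvar = "City", colorvar = "Popularity",
              sizevar = "Size",
              options = list(displayMode = "markers",
                             region = "US", resolution = "metros"),
              chartid = "GeoChart_US_size")
  # plot(gcsize)  # uncomment to open the chart in a browser
}
```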

With the new version of the Visualisation API Google also introduced the concept of DataTable Roles. This is an interesting idea, as it allows you to add context to the data, similar to the approach used with annotated time lines. Google still classifies the DataTable Roles as experimental, but it is a space to watch, and ideas on how this could be translated into R will be much appreciated.

And now the news of the googleVis package since version 0.2.10:

Version 0.2.13 [2011-12-19]
==========================

Changes

    o The list of arguments for gvisGeoChart changed:
      - the argument 'numvar' has been renamed to 'colorvar' to
        reflect the updated Google API. Additionally gvisGeoChart
        gained a new argument 'sizevar'.
    o Updated googleVis vignette with a section on using googleVis 
      output in presentations  
    o Renamed demo EventListner to EventListener

NEW FEATURES

    o Google published a new version of their Visualisation API on 7
      December 2011. Some of the new features have been implemented
      into googleVis already:
      - New stepped area chart function gvisSteppedAreaChart
      - gvisGeoChart has a new marker mode, similar to the mode in
        gvisGeoMap. See example(gvisGeoChart) for the new
        functionalities.

Version 0.2.12 [2011-12-07]
==========================

Bug Fixes

    o gvisMotionChart didn't display data with special characters,
      e.g. spaces, &, %, in column names correctly. 
      Thanks to Alexander Holcroft for reporting this issue.

Version 0.2.11 [2011-11-16]
==========================

Changes

   o Updated vignette and documentation with instructions on changing
     the Flash security settings to display Flash charts locally. 
     Thanks to Tony Breyal.
   o New example to plot weekly data with gvisMotionChart
   o Removed local copies of gadget files to reduce package file
     size. A local copy of the R script to generate the original gadget
     files is still included in inst/gadgets 

Version 0.2.10 [2011-09-24]
==========================

Changes

   o Updated section 'Using googleVis output with Google Sites,
     Blogger, etc.' vignette

   o Updated example for gvisMotionChart, showing how the initial
     chart setting can be changed, e.g. to display a line chart.

   o New example for gvisAnnotatedTimeLine, showing how to shade
     areas. Thanks to Mike Silberbauer for providing the initial code.    
   
NEW FEATURES
 
    o New demo WorldBank. It demonstrates how country level data can
      be accessed from the World Bank via their API and displayed with a
      Motion Chart. Inspired by Google's Public Data Explorer, see
      http://www.google.com/publicdata/home

Tuesday, 13 December 2011

Data is the new gold

We need more data journalism. How else will we find the nuggets of data and information worth reading?

Life should become easier for data journalists, as the Guardian, one of the data journalism pioneers, points out in this article about the new open data initiative of the European Union (EU). The aims of the EU's open data strategy are bold. Data is seen as the new gold of the digital age. The EU estimates that public data already generates economic value of €32bn each year, with growth potential to €70bn if more data is made available. Here is the link to the press statement, which I highly recommend reading:

EUROPA - Press Releases - Neelie Kroes Vice-President of the European Commission responsible for the Digital Agenda, Data is the new gold, Opening Remarks, Press Conference on Open Data Strategy Brussels, 12th December 2011



I am particularly impressed that the EU even aims to harmonise the way data will be published by the various bodies. We know that working with data, open or proprietary, often means spending a lot of time on cleaning, reshaping and transforming it, in order to join it with other sources and to make sense out of it.

Data standards would really help in this respect. And the EU is pushing this as well. I can observe this in the insurance industry already, where new European regulatory requirements (Solvency II) force companies to increase their data management capabilities. This is often a huge investment and has to be seen as a long term project.

Although the press statement doesn't mention anything about open source software projects, I think that they are essential for unfolding the full potential of open data.

Open source projects like R provide a platform to share new ideas. I'd say that R, but equally other languages as well, provide interfaces between minds and hands. Packages, libraries, etc. make it possible to spread ideas and knowledge. Having access to scientific papers is great but being able to test the ideas in practice accelerates the time it takes to embed new developments from academia into the business world.

Saturday, 10 December 2011

LondonR, 6 December 2011

The London R user group met again last Wednesday at the Shooting Star pub. And it was busy. More than 80 people had turned up. Was it the free beer and food, sponsored by Mango, which attracted the folks, or the speakers? Or the venue? James Long, who organises the Chicago R user group meetings and who gave the first talk that night, noted that to his knowledge only the London and Chicago R users would meet in a pub.


However, it was the speakers and their talks which attracted me:
You will notice that this London R meeting had a theme around risk pricing. James talked about reinsurance pricing using R in the cloud, while Chibisi focused more on personal lines insurance with generalised linear models and Richard came from the angle of investment management and portfolio optimisation.

Thursday, 1 December 2011

Fitting distributions with R

Fitting distributions with R is something I have to do once in a while, but where do I start?

A good starting point to learn more about distribution fitting with R is Vito Ricci's tutorial on CRAN. I also find the vignettes of the actuar and fitdistrplus packages a good read. I haven't looked into the recently published Handbook of fitting statistical distributions with R, by Z. Karian and E.J. Dudewicz, but it might be worthwhile in certain cases, see Xi'An's review. A more comprehensive overview of the various R packages is given by the CRAN Task View: Probability Distributions, maintained by Christophe Dutang.

How do I decide which distribution might be a good starting point?

I came across the paper Probabilistic approaches to risk by Aswath Damodaran. In Appendix 6.1 Aswath discusses the key characteristics of the most common distributions and in Figure 6A.15 he provides a decision tree diagram for choosing a distribution:


JD Long points to the Clickable diagram of distribution relationships by John Cook in his blog entry about Fitting distribution X to data from distribution Y. With those two charts I no longer find it too difficult to find a reasonable starting point.

Once I have decided which distribution might be a good fit I usually start with the fitdistr function of the MASS package. However, since I discovered the fitdistrplus package I have become very fond of the fitdist function, as it comes with a wonderful plot method. It plots an empirical histogram with a theoretical density curve, a QQ and PP-plot and the empirical cumulative distribution with the theoretical distribution. Further, the package also provides goodness of fit tests via gofstat.
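
To make the two functions a little more concrete, here is a minimal sketch of both on the same sample. The sample, its parameters and the seed are my own illustrative choices, and the fitdistrplus part only runs if that package is installed.

```r
library(MASS)  # ships with R as a recommended package

set.seed(1)                               # arbitrary seed, for reproducibility
x <- rlnorm(50, meanlog = 0, sdlog = 1)   # illustrative log-normal sample

# Maximum likelihood fit with fitdistr from MASS
fit_mass <- fitdistr(x, "lognormal")
print(fit_mass)  # estimates of meanlog and sdlog with standard errors

# The same fit with fitdist from fitdistrplus, if available
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  fit_plus <- fitdistrplus::fitdist(x, "lnorm")
  plot(fit_plus)                          # density, QQ, CDF and PP plots
  print(fitdistrplus::gofstat(fit_plus))  # goodness of fit statistics
}
```

Note the different naming conventions: fitdistr expects "lognormal", while fitdist takes the suffix of the d/p/q functions, here "lnorm".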

Suppose I have only 50 data points, which I believe follow a log-normal distribution. How much variance can I expect? Well, let's experiment. I draw 50 random numbers from a log-normal distribution, fit the distribution to the sample data, repeat the exercise 50 times and plot the results using the plot function of the fitdistrplus package.
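
A minimal sketch of such an experiment could look like the following. It is not the original script: the true parameters (meanlog = 0, sdlog = 1) and the seed are assumptions, and instead of the fitdistrplus plots it uses fitdistr from MASS and summarises the fitted parameters numerically.

```r
library(MASS)  # ships with R as a recommended package

set.seed(2011)   # arbitrary seed, for reproducibility
n_obs  <- 50     # data points per sample
n_sims <- 50     # number of repetitions

# Each column holds the fitted meanlog and sdlog of one simulated sample
fits <- replicate(n_sims, {
  x <- rlnorm(n_obs, meanlog = 0, sdlog = 1)  # assumed true parameters
  fitdistr(x, "lognormal")$estimate
})

# Spread of the 50 estimates around the assumed true values 0 and 1
print(summary(t(fits)))
print(apply(fits, 1, sd))
```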


I notice quite a big variance in the results. For some samples other distributions, e.g. logistic, could provide a better fit. You might argue that 50 data points is not a lot of data, but in real life it often is, and hence this little example already shows me that fitting a distribution to data is not just about applying an algorithm, but requires a sound understanding of the process which generated the data as well.