Next Kölner R User Meeting: Friday, 4 December 2015
The 16th Cologne R user group meeting is scheduled for this Friday, 4 December 2015 and we have great line up with three talks followed by networking drinks.
- Monitoring process change using Bayesian methods (Mick Cooney)
A common business problem is to evaluate the effect of a change of process, and this talk will discuss a straightforward approach to this using conjugate priors.
- Editing R files with DataJoy (Dietmar Janetzko)
Brief introduction to the online collaborative data analysis platform DataJoy.
- Emoticons decoder for social media sentiment analysis in R (Jessica Peterka-Bonetta)
While R seems to be able to display some emoticons properly, many other times it doesn’t, making any further analysis impossible unless you get rid of them. With a little hack, I decoded these emoticons and put them all in a dictionary for further use.
Drinks and Networking
The event will be followed by drinks (Kölsch!) and networking opportunities.For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.
The organisers, Kirill Pomogajko and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.
Notes from Warsaw R meetup
I had the great pleasure time to attend the Warsaw R meetup last Thursday. The organisers Olga Mierzwa and Przemyslaw Biecek had put together an event with a focus on R in Insurance (btw, there is a conference with the same name), discussing examples of pricing and reserving in general and life insurance.Experience vs. Data
I kicked off with some observations of the challenges in insurance pricing. Accidents are thankfully rare events, that's why we buy insurance. Hence, there is often not a lot of claims data available for pricing. Combining the information from historical data and experts with domain knowledge can provide a rich basis for the assessment of risk. I presented some examples using Bayesian analysis to understand the probability of an event occurring. Regular readers of my blog will recognise the examples from earlier posts. You find my slides on GitHub.Download slides |
Non-life insurance in R
Emilia Kalarus from Triple A shared some of her experience of using R in non-life insurance companies. She focused on the challenges in working across teams, with different systems, data sets and mentalities.As an example, Emilia talked about the claims reserving process, which in her view should be embedded in the full life cycle of insurance, namely product development, claims, risk and performance management. Following this thought, she presented an idea for claims reserving that models the life of a claim from not incurred and not reported (NINR), to incurred but not reported (IBNR), reported but not settled (RBNS) and finally paid.
Stochastic mortality modelling
The final talk was given by Adam Wróbel from the life insurer Nationale Nederlanden, discussing stochastic mortality modelling. Adam's talk on analysing mortality made me realise that life and non-life insurance companies may be much closer to each other than I thought.Although life and non-life companies are usually separated for regulatory reasons, they both share the fundamental challenge of predicting future cash flows. An example where the two industries meet is product liability.
Over the last century, technology has changed our environment fundamentally, more so than ever before. Yet, we still don't know which long-term impact some of the new technologies and products will have on our life expectancy. Some will prolong our lives, others may make us ill.
A classic example is asbestos, initial regarded as a miracle mineral, as it was impossible to set on fire, abundant, cheap to mine, and easy to manufacture. Not surprisingly it was widely used until it was linked to causing cancer. Over the last 35 years, the non-life insurance industry has paid well in excess of hundred billion dollars in compensations.
Slides and Code
The slides and R code of the presentations are hosted on the Warsaw R GitHub page.Hierarchical Loss Reserving with Stan
I continue with the growth curve model for loss reserving from last week's post. Today, following the ideas of James Guszcza [2] I will add an hierarchical component to the model, by treating the ultimate loss cost of an accident year as a random effect. Initially, I will use thenlme
R package, just as James did in his paper, and then move on to Stan/RStan [6], which will allow me to estimate the full distribution of future claims payments.Last week's model assumed that cumulative claims payment could be described by a growth curve. I used the Weibull curve and will do so here again, but others should be considered as well, e.g. the log-logistic cumulative distribution function for long tail business, see [1].
The growth curve describes the proportion of claims paid up to a given development period compared to the ultimate claims cost at the end of time, hence often called development pattern. Cumulative distribution functions are often considered, as they increase monotonously from 0 to 100%. Multiplying the development pattern with the expected ultimate loss cost gives me then the expected cumulative paid to date value.
However, what I'd like to do is the opposite, I know the cumulative claims position to date and wish to estimate the ultimate claims cost instead. If the claims process is fairly stable over the years and say, once a claim has been notified the payment process is quite similar from year to year and claim to claim, then a growth curve model is not unreasonable. Yet, the number and the size of the yearly claims will be random, e.g. if a windstorm, fire, etc occurs or not. Hence, a random effect for the ultimate loss cost across accident years sounds very convincing to me.
Here is James' model as described in [2]:
\[
\begin{align}
CL_{AY, dev} & \sim \mathcal{N}(\mu_{AY, dev}, \sigma^2_{dev}) \\
\mu_{AY,dev} & = Ult_{AY} \cdot G(dev|\omega, \theta)\\
\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}\\
Ult_{AY} & \sim \mathcal{N}(\mu_{ult}, \sigma^2_{ult})\\
G(dev|\omega, \theta) & = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)
\end{align}
\]The cumulative losses \(CL_{AY, dev}\) for a given accident year \(AY\) and development period \(dev\) follow a Normal distribution with parameters \(\mu_{AY, dev}\) and \(\sigma_{dev}\).
The mean itself is modelled as the product of an accident year specific ultimate loss cost \(Ult_{AY}\) and a development period specific parametric growth curve \(G(dev | \omega, \theta)\). The variance is believed to increase in proportion with the mean. Finally, the ultimate loss cost is modelled with a Normal distribution as well.
Assuming a Gaussian distribution of losses doesn't sound quite intuitive to me, as loss are often skewed to the right, but I shall continue with this assumption here to make a comparison with [2] possible.
Using the example data set given in the paper I can reproduce the result in R with
nlme
:The fit looks pretty good, with only 5 parameters. See James' paper for a more detailed discussion.
Let's move this model into Stan. Here is my attempt, which builds on last week's pooled model. With the generated quantities code block I go beyond the scope of the original paper, as I try to estimate the full posterior predictive distribution as well.
The 'trick' is the line
mu[i] <- ult[origin[i]] * weibull_cdf(dev[i], omega, theta);
where I have an accident year (here labelled origin) specific ultimate loss. The notation
ult[origin[i]]
illustrates the hierarchical nature in Stan's language nicely.Let's run the model:
The estimated parameters look very similar to the
nlme
output above.Let's take a look at the parameter traceplot and the densities of the estimated ultimate loss costs by origin year.
This looks all not too bad. The trace plots don't show any particular patterns, apart from \(\sigma_{ult}\), which shows a little skewness.
The
generated quantities
code block in Stan allows me to get also the predictive distribution beyond the current data range. Here I forecast claims up to development year 12 and plot the predictions, including the 95% credibility interval of the posterior predictive distribution with the observations. The model seems to work rather well, even with the Gaussian distribution assumptions. Yet, it has still only 5 parameters. Note, this model doesn't need an additional artificial tail factor either.
Conclusions
The Bayesian approach sounds to me a lot more natural than many classical techniques around the chain-ladder methods. Thanks to Stan, I can get the full posterior distributions on both, the parameters and predictive distribution. I find communicating credibility intervals much easier than talking about the parameter, process and mean squared error.James Guszcza contributed to a follow-up paper with Y. Zhank and V. Dukic [3] that extends the model described in [2]. It deals with skewness in loss data sets and the autoregressive nature of the errors in a cumulative time series.
Frank Schmid offers a more complex Bayesian analysis of claims reserving in [4], while Jake Morris highlights the similarities between a compartmental model used in drug research and loss reserving [5].
Finally, Glenn Meyers published a monograph on Stochastic Loss Reserving Using Bayesian MCMC Models earlier this year [7] that is worth taking a look at.
References
[1] David R. Clark. LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach. Casualty Actuarial Society, 2003. CAS Fall Forum.[2] James Guszcza. Hierarchical Growth Curve Models for Loss Reserving, 2008, CAS Fall Forum, pp. 146–173.
[3] Y. Zhang, V. Dukic, and James Guszcza. A Bayesian non-linear model for forecasting insurance loss payments. 2012. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175: 637–656. doi: 10.1111/j.1467-985X.2011.01002.x
[4] Frank A. Schmid. Robust Loss Development Using MCMC. Available at SSRN. See also http://lossdev.r-forge.r-project.org/
[5] Jake Morris. Compartmental reserving in R. 2015. R in Insurance Conference.
[6] Stan Development Team. Stan: A C++ Library for Probability and Sampling, Version 2.8.0. 2015. http://mc-stan.org/.
[7] Glenn Meyers. Stochastic Loss Reserving Using Bayesian MCMC Models. Issue 1 of CAS Monograph Series. 2015.
Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ChainLadder_0.2.3 rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1
[5] lattice_0.20-33
loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 tools_3.2.2
[4] digest_0.6.8 lme4_1.1-10 statmod_1.4.21
[7] gtable_0.1.2 nlme_3.1-122 mgcv_1.8-8
[10] Matrix_1.2-2 parallel_3.2.2 biglm_0.9-1
[13] SparseM_1.7 proto_0.3-10 coda_0.18-1
[16] gridExtra_2.0.0 stringr_1.0.0 MatrixModels_0.4-1
[19] lmtest_0.9-34 stats4_3.2.2 grid_3.2.2
[22] nnet_7.3-11 tweedie_2.2.1 inline_0.3.14
[25] cplm_0.7-4 minqa_1.2.4 actuar_1.1-10
[28] reshape2_1.4.1 car_2.1-0 magrittr_1.5
[31] scales_0.3.0 codetools_0.2-14 MASS_7.3-44
[34] splines_3.2.2 rsconnect_0.3.79 systemfit_1.1-18
[37] pbkrtest_0.4-2 colorspace_1.2-6 quantreg_5.19
[40] labeling_0.3 sandwich_2.3-4 stringi_1.0-1
[43] munsell_0.4.2 zoo_1.7-12
Loss Developments via Growth Curves and Stan
Last week I posted a biological example of fitting a non-linear growth curve with Stan/RStan. Today, I want to apply a similar approach to insurance data using ideas by David Clark [1] and James Guszcza [2].Instead of predicting the growth of dugongs (sea cows), I would like to predict the growth of cumulative insurance loss payments over time, originated from different origin years. Loss payments of younger accident years are just like a new generation of dugongs, they will be small in size initially, grow as they get older, until the losses are fully settled.
Here is my example data set:
Following the articles cited above I will assume that the growth can be explained by a Weibull curve, with two parameters \(\theta\) (scale) and \(\omega\) (shape):
\[
G(dev|\omega, \theta) = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)
\]Inspired by the classical over-dispersed Poisson GLM in actuarial science, Guszcza [2] assumes a power variance structure for the process error as well. For the purpose of this post I will assume that claims cost follow a Normal distribution, to make a comparison with a the classical least square regression easier. With a prior estimate of the ultimate claims cost the cumulative loss can be modelled as:
\[
\begin{align}
CL_{AY, dev} & \sim \mathcal{N}(\mu_{dev}, \sigma^2_{dev}) \\
\mu_{dev} & = Ult \cdot G(dev|\omega, \theta)\\
\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}
\end{align}
\]Perhaps, the above formula suggests a hierarchical modelling approach, with different ultimate loss costs by accident year. Indeed, this is the focus of [2] and I will endeavour to reproduce the results with Stan in my next post, but today I will stick to a pooled model that assumes a constant ultimate loss across all accident years, i.e. \(Ult_{AY} = Ult \;\forall\; AY\).
To prepare my analysis I read in the data as a long data frame instead of the matrix structure above. Additionally, I compose another data frame that I will use later to predict payments two years beyond the first 10 years. Furthermore, to align the output with [2] I relabelled the development periods from years to months, so that year 1 becomes month 6, year 2 becomes month 18, etc. The accident years run from 1991 to 2000, while the variable
origin
maps those years from 1 to 10.To get a reference point for Stan I start with a non-linear least square regression:
The output above doesn't look unreasonable, apart from the accident years 1991, 1992 and 1998. The output of
gnls
gives me an opportunity to provide my prior distributions with good starting points. I will use an inverse Gamma as a prior for \(\sigma\), constrained Normals for the parameters \(\theta, \omega\) and \(Ult\) as well. If you have better ideas, then please get in touch.The Stan code below is very similar to last week. Again, I am interested here in the posterior distributions, hence I add a block to generate quantities from those. Note, Stan comes with a build-in function for the cumulative Weibull distribution
weibull_cdf
.I store the Stan code in a separate text file, which I read into R to compile the model. The compilation takes a little time. The sampling itself is done in a few seconds.
Let's take a look at the output:
The parameters are a little different to the output of
gnls
, but well within the standard error. From the plots I notice that the samples for the ultimate loss as well as for \(\theta\) are a little skewed to the right. Well, assuming cumulative losses to follow a Normal distribution was a bit a of stretch to start with. Still, the samples seem to converge.Finally, I can plot the 90% credible intervals of the posterior distributions.
The 90% prediction credible interval captures most of the data and although this model might not be suitable for reserving individual accident years, it could provide an initial starting point for further investigations. Additionally, thanks to the Bayesian model I have access to the full distribution, not just point estimators and standard errors.
My next post will continue with this data set and the ideas of James Guszcza by allowing the ultimate loss cost to vary by accident year, treating it as a random effect. Here is a teaser of what the output will look like:
References
[1] David R. Clark. LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach. Casualty Actuarial Society, 2003. CAS Fall Forum.[2] Guszcza, J. Hierarchical Growth Curve Models for Loss Reserving, 2008, CAS Fall Forum,
pp. 146–173.
Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1
[4] nlme_3.1-122 lattice_0.20-33 ChainLadder_0.2.3
loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 tools_3.2.2
[4] digest_0.6.8 lme4_1.1-10 statmod_1.4.21
[7] gtable_0.1.2 mgcv_1.8-8 Matrix_1.2-2
[10] parallel_3.2.2 biglm_0.9-1 SparseM_1.7
[13] proto_0.3-10 gridExtra_2.0.0 coda_0.18-1
[16] stringr_1.0.0 MatrixModels_0.4-1 stats4_3.2.2
[19] lmtest_0.9-34 grid_3.2.2 nnet_7.3-11
[22] inline_0.3.14 tweedie_2.2.1 cplm_0.7-4
[25] minqa_1.2.4 reshape2_1.4.1 car_2.1-0
[28] actuar_1.1-10 magrittr_1.5 codetools_0.2-14
[31] scales_0.3.0 MASS_7.3-44 splines_3.2.2
[34] systemfit_1.1-18 rsconnect_0.3.79 pbkrtest_0.4-2
[37] colorspace_1.2-6 labeling_0.3 quantreg_5.19
[40] sandwich_2.3-4 stringi_1.0-1 munsell_0.4.2
[43] zoo_1.7-12
Non-linear growth curves with Stan
I suppose the go to tool for fitting non-linear models in R isnls
of the stats
package. In this post I will show an alternative approach with Stan/RStan, as illustrated in the example, Dugongs: "nonlinear growth curve", that is part of Stan's documentation. The original example itself is taken from OpenBUGS. The data describes the length and age measurements for 27 captured dugongs (sea cows). Carlin and Gelfand (1991) model the data using a nonlinear growth curve with no inflection point and an asymptote as \(x_i\) tends to infinity:
\[
Y_i \sim \mathcal{N}(\mu_i, \sigma^2),\; i = 1,\dots,27\\
\mu_i = \alpha - \beta \lambda^{x_i},\; \alpha,\, \beta > 0;\, 0 < \lambda < 1 \] Fitting the curve with
nls
gives the following results:Writing the model in Stan requires a few more lines, but gives me also the opportunity to generate output from the posterior distributions.
The predicted parameters and errors are very much the same as in the least square output of
nls
, but with the Stan output I can also review the 90% credible intervals.Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] rstan_2.8.0 ggplot2_1.0.1.9003 Rcpp_0.12.1
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 scales_0.3.0 plyr_1.8.3
[4] parallel_3.2.2 tools_3.2.2 inline_0.3.14
[7] gtable_0.1.2 gridExtra_2.0.0 codetools_0.2-14
[10] grid_3.2.2 stats4_3.2.2 munsell_0.4.2
R in Insurance 2016
Following the successful 3rd R in Insurance conference in Amsterdam this year, we will return to London next year. We will be back at Cass Business School, 11 July 2016.
The event will focus again on the use of R in insurance, bringing together experts from industry and academia with a diverse background of disciplines, such as actuarial science, catastrophe modelling, finance, statistics and computer science.
We are delighted to announce or keynote speakers already: Dan Murphy and Mario V. Wüthrich.
More details about the event will be published in due course here. For now, save the date in your diary.
PS: The picture above is a cluster analysis of the Lloyd's building in R.
ChainLadder 0.2.2 is out with improved glmReserve function
We released version 0.2.2 ofChainLadder
a few weeks ago. This version adds back the functionality to estimate the index parameter for the compound Poisson model in glmReserve
using the cplm
package by Wayne Zhang. Ok, what does this all mean? I will run through a couple of examples and look behind the scene of
glmReserve
. However, the clue is in the title, glmReserve
is a function that uses a generalised linear model to estimate future claims, assuming claims follow a Tweedie distribution. I should actually talk about a family of distributions that is known as Tweedie, named by Bent Jørgensen after Maurice Tweedie. Joe Rickert published a nice post about the Tweedie distribution last year.Like most other functions in ChainLadder,
glmReserve
purpose is to predict future insurance claims based on historical data.The data at hand is often presented in form of a claims triangle, such as the following example data set from the ChainLadder package:
library(ChainLadder)
cum2incr(UKMotor)
## dev
## origin 1 2 3 4 5 6 7
## 2007 3511 3215 2266 1712 1059 587 340
## 2008 4001 3702 2278 1180 956 629 NA
## 2009 4355 3932 1946 1522 1238 NA NA
## 2010 4295 3455 2023 1320 NA NA NA
## 2011 4150 3747 2320 NA NA NA NA
## 2012 5102 4548 NA NA NA NA NA
## 2013 6283 NA NA NA NA NA NA
The rows present different origin periods in which accidents occurred and the columns along each row show the incremental reported claims over the years (the data itself is stored in a cumulative form). Suppose all claims will be reported within 7 years, then I'd like to know how much money should be set aside for the origin years 2008 to 2013 for claims that have incurred but not been reported (IBNR) yet. Or, to put it differently, I have to predict the
NA
fields in the bottom right hand triangle.First, let's reformat the data as it would be stored in a database, that is in a long format of incremental claims over the years (I add the years also as factors, which I will use later):
claims <- as.data.frame(cum2incr(UKMotor)) # convert into long format
library(data.table)
claims <- data.table(claims)
claims <- claims[ , ':='(cal=origin+dev-1, # calendar period
originf=factor(origin),
devf=factor(dev))]
claims <- claims[order(dev), cum.value:=cumsum(value), by=origin]
setnames(claims, "value", "inc.value")
head(claims)
## origin dev inc.value cal originf devf cum.value
## 1: 2007 1 3511 2007 2007 1 3511
## 2: 2008 1 4001 2008 2008 1 4001
## 3: 2009 1 4355 2009 2009 1 4355
## 4: 2010 1 4295 2010 2010 1 4295
## 5: 2011 1 4150 2011 2011 1 4150
## 6: 2012 1 5102 2012 2012 1 5102
Let's visualise the data:library(lattice)
xyplot(cum.value/1000 + log(inc.value) ~ dev , groups=origin, data=claims,
t="b", par.settings = simpleTheme(pch = 16),
auto.key = list(space="right",
title="Origin\nperiod", cex.title=1,
points=FALSE, lines=TRUE, type="b"),
xlab="Development period", ylab="Amount",
main="Claims development by origin period",
scales="free")
The left plot of the chart above shows the cumulative claims payment over time, while the right plot shows the log-transformed incremental claims development for each origin/accident year.One of the oldest methods to predict future claims development is called chain-ladder, which can be regarded as a weighted linear regression through the origin of cumulative claims over the development periods. Multiplying those development factors to the latest available cumulative position allows me to predict future claims in an iterative fashion.
It is well know in actuarial science that a Poisson GLM produces the same forecasts as the chain-ladder model.
Let's check:
# Poisson model
mdl.pois <- glm(inc.value ~ originf + devf, data=na.omit(claims),
family=poisson(link = "log"))
# predict claims
claims <- claims[, ':='(
pred.inc.value=predict(mdl.pois,
.SD[, list(originf, devf, inc.value)],
type="response")), by=list(originf, devf)]
# sum of future payments
claims[cal>max(origin)][, sum(pred.inc.value)]
## [1] 28655.77
# Chain-ladder forecast
summary(MackChainLadder(UKMotor, est.sigma = "Mack"))$Totals[4,]
## [1] 28655.77
Ok, this worked. However, both of these models make actually fairly strong assumptions. The Poisson model by its very nature will only produce whole numbers, and although payments could be regarded as whole numbers, say in pence or cents, it does feel a little odd to me. Similarly, modelling the year on year developments via a weighted linear regression through the origin, as in the case of the chain-ladder model, sounds not intuitive either.There is another aspect to highlight with the Poisson model; its variance is equal to the mean. Yet, in real data I often observe that the variance increases in proportion to the mean. Well, this can be remedied by using an over-dispersed quasi-Poisson model.
I think a more natural approach would be to assume a compound distribution that models the frequency and severity of claims separately, e.g. Poisson frequency and Gamma severity.
Here the Tweedie distributions comes into play.
Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter \(\Phi\).
Furthermore, the variance is a power function of the mean, i.e. \(\mbox{Var}(X)=\Phi \, E[X]^p\).
The canonical link function for a Tweedie distribution in a GLM is the power link \(\mu^q\) with \(q=1-p\). Note, \(q=0\) is interpreted as \(\log(\mu)\).
Thus, let \(\mu_i = E(y_i)\) be the expectation of the ith response. Then I have the following model.
\[
y \sim \mbox{Tweedie}(q, p)\\
E(y) = \mu^q = Xb \\
\mbox{Var}(y) = \Phi \mu^p
\]The variance power \(p\) characterises the distribution of the responses \(y\). The following are some special cases:
- Normal distribution, p = 0
- Poisson distribution, p = 1
- Compound Poisson-Gamma distribution, 1 < p < 2
- Gamma distribution, p = 2
- Inverse-Gaussian, p = 3
- Stable, with support on the positive real numbers, p > 2
Finally, I get back to the
glmReserve
function, which Wayne Zhang, the other author of the cplm
package, contributed to the ChainLadder
package.With
glmReserve
I can model a claims triangle using the Tweedie distribution family.In my first example I set use the parameters \(p=1, q=0\), which should return the results of the Poisson model.
(m1 <- glmReserve(UKMotor, var.power = 1, link.power = 0))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 125.8106 0.35843464
## 2009 12993 0.9260210 14031 1038 205.0826 0.19757473
## 2010 11093 0.8443446 13138 2045 278.8519 0.13635790
## 2011 10217 0.7360951 13880 3663 386.7919 0.10559429
## 2012 9650 0.5739948 16812 7162 605.2741 0.08451188
## 2013 6283 0.3038201 20680 14397 1158.1250 0.08044210
## total 62982 0.6872913 91638 28656 1708.1963 0.05961042
Perfect, I get the same results, plus further information about the model.Setting the argument
var.power=NULL
will estimate \(p\) in the interval \((1,2)\) using the cplm
package.(m2 <- glmReserve(UKMotor, var.power=NULL))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 110.0539 0.31354394
## 2009 12993 0.9260870 14030 1037 176.9361 0.17062307
## 2010 11093 0.8444089 13137 2044 238.5318 0.11669851
## 2011 10217 0.7360951 13880 3663 335.6824 0.09164138
## 2012 9650 0.5739948 16812 7162 543.6472 0.07590718
## 2013 6283 0.3038201 20680 14397 1098.7988 0.07632138
## total 62982 0.6873063 91636 28654 1622.4616 0.05662252
m2$model
##
## Call:
## cpglm(formula = value ~ factor(origin) + factor(dev), link = link.power,
## data = ldaFit, offset = offset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.7901 -1.6969 0.0346 1.6087 8.4465
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25763 0.04954 166.680 < 2e-16 ***
## factor(origin)2008 0.03098 0.05874 0.527 0.605588
## factor(origin)2009 0.09999 0.05886 1.699 0.110018
## factor(origin)2010 0.03413 0.06172 0.553 0.588369
## factor(origin)2011 0.08933 0.06365 1.403 0.180876
## factor(origin)2012 0.28091 0.06564 4.279 0.000659 ***
## factor(origin)2013 0.48797 0.07702 6.336 1.34e-05 ***
## factor(dev)2 -0.11740 0.04264 -2.753 0.014790 *
## factor(dev)3 -0.62829 0.05446 -11.538 7.38e-09 ***
## factor(dev)4 -1.03168 0.06957 -14.830 2.28e-10 ***
## factor(dev)5 -1.31346 0.08857 -14.829 2.28e-10 ***
## factor(dev)6 -1.86307 0.13826 -13.475 8.73e-10 ***
## factor(dev)7 -2.42868 0.25468 -9.536 9.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated dispersion parameter: 10.788
## Estimated index parameter: 1.01
##
## Residual deviance: 299.27 on 15 degrees of freedom
## AIC: 389.18
##
## Number of Fisher Scoring iterations: 4
From the model I note that the dispersion parameter \(\phi\) was estimated as 10.788 and the index parameter \(p\) as 1.01.Not surprisingly the estimated reserves are similar to the Poisson model, but with a smaller predicted standard error.
Intuitively the modelling approach makes a lot more sense, but I end up with one parameter for each origin and development period, hence there is a danger of over-parametrisation.
Looking at the plots above again I note that many origin periods have a very similar development. Perhaps a hierarchical model would be more appropriate?
For more details on
glmReserve
see the help file and package vignette.Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 ChainLadder_0.2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 nloptr_1.0.4 plyr_1.8.3
[4] tools_3.2.2 digest_0.6.8 lme4_1.1-9
[7] statmod_1.4.21 gtable_0.1.2 nlme_3.1-121
[10] lattice_0.20-33 mgcv_1.8-7 Matrix_1.2-2
[13] parallel_3.2.2 biglm_0.9-1 SparseM_1.7
[16] proto_0.3-10 coda_0.17-1 stringr_1.0.0
[19] MatrixModels_0.4-1 stats4_3.2.2 lmtest_0.9-34
[22] grid_3.2.2 nnet_7.3-10 tweedie_2.2.1
[25] cplm_0.7-4 minqa_1.2.4 ggplot2_1.0.1
[28] reshape2_1.4.1 car_2.1-0 actuar_1.1-10
[31] magrittr_1.5 scales_0.3.0 MASS_7.3-44
[34] splines_3.2.2 systemfit_1.1-18 pbkrtest_0.4-2
[37] colorspace_1.2-6 quantreg_5.19 sandwich_2.3-4
[40] stringi_0.5-5 munsell_0.4.2 chron_2.3-47
[43] zoo_1.7-12
Notes from the Kölner R meeting, 18 September 2015
Last Friday the Cologne R user group came together for the 15th time. Since its inception over three years ago the group evolved from a small gathering in a pub into an active data science community, covering wider topics than just R. Still, R is the link and clue between the different interests. Last Friday's agenda was a good example of this, with three talks touching on workflow management, web development and risk analysis.R in a big data pipeline
Download slides |
luigi
into a heterogeneous workflow of different applications. This is especially useful when R needs to be integrated with hadoop/hdfs based technologies, such as Spark and Hive. Luigi is not unlike Make, which Kirill presented at our last meeting in June. In a configuration file Yuki specified the various workflow steps and dependencies between the jobs.Kicking off the luigi script starts the workflow, and
luigid
server allows Yuki to monitor the various parts of the dependency graph visually. Thus, he can see the progress of his workflow in real time and identify quickly, when and where a sub process fails. As Yuki pointed out, this becomes critical in production systems, where failures need to be known and fixed quickly, unlike when ones carries out an explorative analysis in a development/research environment. See also Yuki's blog post for further details.Shiny + Shinyjs
Download presentation files |
shinyjs
, a package written by Dean Attali. The name suggests already that the package provides additional JavaScript functionality. Indeed, it does, but without the need to learn JavaScript, as those functions are wrapped into R. Paul showed us an example of a shinyapp that depending on the user plotted a different graph. Behind the scene his script would either hide or shows those plots, conditioned on the user. With only a few lines in R it allowed him to develop a user specific application. To achieve this he created a login screen that checks for user name and password. In his example he had hard coded the login credentials, instead of using a secure connection via a professional shiny server instance. However this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.
Experience vs. Data
Download slides |
I presented some Bayesian ideas to analyse risks with little data. I used the wonderful "Hit and run accident" example from Daniel Kahneman's book Thinking, fast and slow to explain Bayes' formula, introduced Bayesian belief networks for a claims analysis and discussed the challenge of predicting events when they haven't happened yet (also in Stan). Along the way I mentioned a few ideas on communicating risk, which I learned from David Spiegelhalter earlier this year.
Next Kölner R meeting
The next meeting will be scheduled in December. Details will be published on our Meetup site. Thanks again to Revolution Analytics/Microsoft for their sponsorship.Please get in touch, if you would like to present at the next meeting.
Next Kölner R User Meeting: Friday, 18 September 2015
The 15th Cologne R user group meeting is scheduled for this Friday, 18 September 2015 and we have a full agenda with three talks followed by networking drinks.
- R in big data pipeline with luigi (Yuki Katoh)
R in big data pipeline: Put your awesome R codes into production. Learn how to build solid big data pipeline around it.
- shinyjs (Paul Viefers)
Using JavaScript in shiny, without knowing JavaScript
- Experience vs Data (Markus Gesmann)
How to asses risks with small data sets. Bayesian ideas, belief networks and a small simulation example in Stan/rstan.
Drinks and Networking
The event will be followed by drinks (Kölsch!) and networking opportunities.For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.
The organisers, Kirill Pomogajko and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.
Bayesian regression models using Stan in R
It seems the summer is coming to end in London, so I shall take a final look at my ice cream data that I have been playing around with to predict sales statistics based on temperature for the last couple of weeks [1], [2], [3].Here I will use the new
brms
(GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested to predict how much ice cream I should hold in stock for a hot day at 35ºC, such that I only run out of ice cream with a probability of 2.5%.Stan models with brms
Like in my previous post about the log-transformed linear model with Stan, I will use Bayesian regression models to estimate the 95% prediction credible interval from the posterior predictive distribution.Thanks to
brms
this will take less than a minute of coding, because brm
allows me to specify my models in the usual formula syntax and I can leave it to the package functions to create and execute the Stan files.Let's start. Here is the data again:
My models are written down in very much the same way as with
glm
. Only the binomial model requires a slightly different syntax. Here I use the default priors and link functions:Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of
brm
. The estimated parameters are quite similar, apart from \(\sigma\):I access the underlying Stan model via
log.lin.mod$model
and note that the prior of \(\sigma\) is modelled via a Cauchy distribution, unlike the inverse Gamma I used last week. I believe that explains my small difference in \(\sigma\).To review the model I start by plotting the trace and density plots for the MCMC samples.
Prediction credible interval
Thepredict
function gives me access to the posterior predictive statistics, including the 95% prediction credible interval. Combining the outputs of all four models into one data frame gives me then the opportunity to compare the prediction credible intervals of the four models in one chart.
There are no big news in respect of the four models, but for the fact that here I can look at the posterior prediction credible intervals, rather then the theoretical distributions two weeks ago. The over-prediction of the log-transformed linear model is apparent again.
How much stock should I hold on a hot day?
Running out of stock on a hot summer's day would be unfortunate, because those are days when sells will be highest. But how much stock should I hold?Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5% percentile of the posterior predictive samples.
Ok, I have four models and four answers ranging from 761 to 2494. The highest number is more than 3 times the lowest number!
I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, so that explains the high prediction of 2494. The Poisson model, like the log-transform linear model, has the implicit exponential growth assumption. Its mean forecast is well over 1000 as I pointed out earlier and hence the 97.5% prediction of 1510 is to be expected.
Conclusions
How much ice cream should I hold in stock? Well, if I believe strongly in my assumption of a market size of 800, then I should stick with the output of the binomial model, or perhaps hold 800 just in case.Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.
Session Info
Note, I used the developer version ofbrms
from GitHub.R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lattice_0.20-33 brms_0.4.1.9000 ggplot2_1.0.1
[4] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] codetools_0.2-14 digest_0.6.8 MASS_7.3-43
[4] grid_3.2.2 plyr_1.8.3 gtable_0.1.2
[7] stats4_3.2.2 magrittr_1.5 scales_0.2.5
[10] stringi_0.5-5 reshape2_1.4.1 proto_0.3-10
[13] tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[16] parallel_3.2.2 colorspace_1.2-6
Visualising the predictive distribution of a log-transformed linear model
Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post.Theoretical distributions |
The posterior predictive distribution is what I am most interested in. From the simulations I can get the 95% prediction interval, which will be slightly wider than the theoretical 95% interval, as it takes into account the parameter uncertainty as well.
Ok, first I take my log-transformed linear model of my earlier post and turn it into a Stan model, including a section to generate output from the posterior predictive distribution.
After I have complied and run the model, I can extract the simulations and calculate various summary statistics. Furthermore, I use my parameters also to predict the median and mean, so that I can compare them against the sample statistics. Note again, that for the mean calculation of the log-normal distribution I have to take into account the variance as well.
Ok, that looks pretty reasonable, and also quite similar to my earlier output with
glm
. Using my plotting function of last week I can also create a nice 3D plot again.Posterior predictive distributions |
Just as expected, I note a slightly wider 95% interval range in the posterior predictive distributions compared to the theoretical distributions at the top.
Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] tools_3.2.2 codetools_0.2-14 stats4_3.2.2
Visualising theoretical distributions of GLMs
Two weeks ago I discussed various linear and generalised linear models in R using ice cream sales statistics. The data showed not surprisingly that more ice cream was sold at higher temperatures.icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
I used a linear model, a log-transformed linear model, a Poisson and Binomial generalised linear model to predict sales within and outside the range of data available.I came to the conclusion that I preferred the binomial model for two reasons: the model simulates only whole numbers, just like the observational data and it has natural boundaries defined by zero sales and a market saturation level. Hence, it will neither predict negative sales, nor infinite sales.
However, I didn't review any residual plots or anything else but the mean predictions. So, let's do this in this post.
Arthur Charpentier reminded me via Twitter of an earlier post of his, where he looked at GLMs as well. Arthur used a nice 3D plot to visualise the models. I took his code and wrapped it into a function (it is not pretty, but it will do for now):
glmModelPlot <- function(x, y, xlim,ylim, meanPred, LwPred, UpPred,
plotData, main=NULL){
## Based on code by Arthur Charpentier:
## http://freakonometrics.hypotheses.org/9593
par(mfrow=c(1,1))
n <- 2
N <- length(meanPred)
zMax <- max(unlist(sapply(plotData, "[[", "z")))*1.5
mat <- persp(xlim, ylim, matrix(0, n, n), main=main,
zlim=c(0, zMax), theta=-30,
ticktype="detailed",box=FALSE)
C <- trans3d(x, UpPred, rep(0, N),mat)
lines(C, lty=2)
C <- trans3d(x, LwPred, rep(0, N), mat)
lines(C, lty=2)
C <- trans3d(c(x, rev(x)), c(UpPred, rev(LwPred)),
rep(0, 2*N), mat)
polygon(C, border=NA, col=adjustcolor("yellow", alpha.f = 0.5))
C <- trans3d(x, meanPred, rep(0, N), mat)
lines(C, lwd=2, col="grey")
C <- trans3d(x, y, rep(0,N), mat)
points(C, lwd=2, col="#00526D")
for(j in N:1){
xp <- plotData[[j]]$x
yp <- plotData[[j]]$y
z0 <- plotData[[j]]$z0
zp <- plotData[[j]]$z
C <- trans3d(c(xp, xp), c(yp, rev(yp)), c(zp, z0), mat)
polygon(C, border=NA, col="light blue", density=40)
C <- trans3d(xp, yp, z0, mat)
lines(C, lty=2)
C <- trans3d(xp, yp, zp, mat)
lines(C, col=adjustcolor("blue", alpha.f = 0.5))
}
}
Linear model
I start by plotting the output of the linear model, a GLM assuming a Normal distribution with an identity link function.The 3D plot shows the observations as circles on the xy-plane, together with the mean predictions as a solid line. In yellow I show the theoretical residual interval between the 5th and 95th quantile of the Normal distributions parametrised by the model output. Additionally I present the density distributions of the model at each observation.
This is conceptually similar to a Q-Q, which compares the standardised residuals with the theoretical quantiles. In the case of a linear model the prediction interval would shrink to the theoretical interval as the number of data points increases.
The 3D plot looks pretty, but I think it is helpful as well. Let me put this in context to the traditional residual plots, see below.
The residual plots identify observations 2, 5 and 10 as extreme points. Indeed, those points are either on the edge or outside (point 10) of the yellow area in the plot above.
xlim <- c(min(icecream$temp)*0.95, max(icecream$temp)*1.05)
ylim <- c(floor(min(icecream$units)*0.95),
ceiling(max(icecream$units)*1.05))
lin.mod <- glm(units ~ temp, data=icecream,
family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(lin.mod)
title(outer=TRUE, line = -1,
main = list("Linear regression",
cex=1.25,col="black", font=2))
meanPred <- predict(lin.mod, type="response")
sdgig <- sqrt(summary(lin.mod)$dispersion)
UpPred <- qnorm(.95, meanPred, sdgig)
LwPred <- qnorm(.05, meanPred, sdgig)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
stp <- 251
x = rep(icecream$temp[i], stp)
y = seq(ylim[1], ylim[2], length=stp)
z0 = rep(0, stp)
z = dnorm(y, meanPred[i], sdgig)
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Linear regression")
Log-transformed linear model
The log-transformed model shows an increasing variance around the mean as the temperatures increase. Yet, this shouldn't come as a surprise, as this a property of the log-normal distribution. The variance is given as \(\text{Var}(X) = \exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1) = \text{E}[X]^2 (\exp(\sigma^2) - 1)\) .Again, the residual plots highlight points 2, 5 and 10 as extremes. I note also from the first residual plot that the log-transformed model over-predicts for colder temperatures.
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(log.lin.mod)
title(outer=TRUE, line = -1,
main = list("Log-transformed LM",
cex=1.25,col="black", font=2))
meanLogPred <- predict(log.lin.mod, type="response")
sdgig <- sqrt(summary(log.lin.mod)$dispersion)
meanPred <- exp(meanLogPred + 0.5 * sdgig^2)
UpPred <- qlnorm(.95, meanlog = meanLogPred, sdlog = sdgig)
LwPred <- qlnorm(.05, meanlog = meanLogPred, sdlog = sdgig)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
stp <- 251
x = rep(icecream$temp[i], stp)
y = seq(ylim[1], ylim[2], length=stp)
z0 = rep(0, stp)
z = dlnorm(y, meanlog = meanLogPred[i], sdlog = sdgig)
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Log-transformed LM")
Poisson (log) GLM
The Poisson model shows a narrower range between the 5th and 95th quantile then the previous models. My Poisson model didn't assume over-dispersion, and hence the yellow range increases in line with the mean.The same points 2, 5 and 10 are highlighted again as extreme, but now they are well outside the yellow area.
pois.mod <- glm(units ~ temp, data=icecream,
family=poisson(link="log"))
par(mfrow=c(2,2))
plot(pois.mod)
title(outer=TRUE, line = -1,
main = list("Poisson (log) GLM",
cex=1.25,col="black", font=2))
meanPred <- predict(pois.mod, type="response")
UpPred <- qpois(.95, meanPred)
LwPred <- qpois(.05, meanPred)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
y = seq(ylim[1], ylim[2])
x = rep(icecream$temp[i], length(y))
z0 = rep(0, length(y))
z = dpois(y, meanPred[i])
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Poisson (log) GLM")
Binomial (logit) GLM
Here is my favourite model from my previous post again.The theoretical range between the 5th and 95th quantile has narrowed further. Data item 10 looks now really suspicious. Observations 2 and 5 are once again highlighted. Now I wonder, is someone eating my ice cream?
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
family=binomial(link = "logit"))
par(mfrow=c(2,2))
plot(bin.glm)
title(outer=TRUE, line = -1,
main = list("Binomial (logit) GLM",
cex=1.25,col="black", font=2))
meanProb <- predict(bin.glm, type="response")
meanPred <- meanProb*market.size
UpPred <- qbinom(.95, market.size, meanProb)
LwPred <- qbinom(.05, market.size, meanProb)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
y = ylim[1]:ylim[2]
x = rep(icecream$temp[i], length(y))
z0 = rep(0, length(y))
z = dbinom(y, market.size, meanProb[i])
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Binomial (logit) GLM")
Conclusions
I think Arthur's 3D plots help to visualise what GLMs are conceptually about. They illustrate the theoretical distribution around the predictions. Let me say this again, the yellow areas in the charts above show not the 90% prediction interval, but the theoretical residual distribution if the models where true.Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.0
Generalised Linear Models in R
Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points.Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.
I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well. That’s the purpose of this post.
Using data on ice cream sales statistics I will set out to illustrate different models, starting with traditional linear least square regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and Binomial (logistic) GLM. Additionally, I will run a simulation with each model.
Along the way I aim to reveal the intuition behind a GLM using Ramus’ distribution centric description. I hope to clarify why one would want to use different distributions and link functions.
The data
Here is the example data set I will be using. It shows the number of ice creams sold at different temperatures.icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
I will be comparing the results from various models and hence, I write a helper function to provide consistent graphical outputs:basicPlot <- function(...){
plot(units ~ temp, data=icecream, bty="n", lwd=2,
main="Number of ice creams sold", col="#00526D",
xlab="Temperature (Celsius)",
ylab="Units sold", ...)
axis(side = 1, col="grey")
axis(side = 2, col="grey")
}
basicPlot()
As expected more ice cream is sold at higher temperature.The challenge
I would like to create a model that predicts the number of ice creams sold for different temperatures, even outside the range of the available data.I am particularly interested in how my models will behave in the more extreme cases when it is freezing outside, say when the temperature drops to 0ºC and also what it would say for a very hot summer's day at 35ºC.
Linear least square
My first approach is to take a ruler and draw a straight line through the points, such that it minimises the average distance between the points and the line. That's basically a linear least square regression line:lsq.mod <- lsfit(icecream$temp, icecream$units)
basicPlot()
abline(lsq.mod, col="orange", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "linear least square"),
col=c("#00526D","orange"), pch=c(1,NA))
That's easy and looks not unreasonable. Linear regression
For the least square method I didn't think about distributions at all. It is just common sense, or is it?Let's start with a probability distribution centric description of the data.
I believe the observation \(y_i\) was drawn from a Normal (aka Gaussian) distribution with a mean \(\mu_i\), depending on the temperature \(x_i\) and a constant variance \(\sigma^2\) across all temperatures.
On another day with the same temperature I might have sold a different quantity of ice cream, but over many days with the same temperature the number of ice creams sold would start to fall into a range defined by \(\mu_i\pm\sigma\).
Thus, using a distribution centric notation my model looks like this:
\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2), \\
\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i \; \text{for all}\; i
\]Note, going forward I will drop the \(\text{for all}\; i\) statement for brevity.
Or, alternatively I might argue that the residuals, the difference between the observations and predicted values, follow a Gaussian distribution with a mean of 0 and variance \(\sigma^2\):
\[
\varepsilon_i = y_i - \mu_i \sim \mathcal{N}(0, \sigma^2)
\]Furthermore, the equation
\[
\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i
\]states my belief that the expected value of \(y_i\) is identical to the parameter \(\mu_i\) of the underlying distribution, while the variance is constant.
The same model written in the classic error terms convention would be:
\[
y_i = \alpha + \beta x_i + \varepsilon_i, \text{ with }
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]I think the probability distribution centric convention makes it clearer that my observation is just one realisation from a distribution. It also emphasises that the parameter of the distribution is modelled linearly.
To model this in R explicitly I use the
glm
function, in which I specify the "response distribution" (namely the number of ice creams) as Gaussian and the link function from the expected value of the distribution to its parameter (i.e. temperature) as identity. Indeed, this really is the trick with a GLM, it describes how the distribution of the observations and its expected value, often after a smooth transformation, relate to the actual measurements (predictors) in a linear way. The 'link' is the inverse function of the original transformation of the data. That's what its says on the GLM tin and that's all there is to it!
(Actually, there is also the bit, which estimates the parameters from the data via maximum likelihood, but I will skip this here.)
lin.mod <- glm(units ~ temp, data=icecream,
family=gaussian(link="identity"))
library(arm) # for 'display' function only
display(lin.mod)
## glm(formula = units ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) -159.47 54.64
## temp 30.09 2.87
## ---
## n = 12, k = 2
## residual deviance = 14536.3, null deviance = 174754.9 (difference = 160218.6)
## overdispersion parameter = 1453.6
## residual sd is sqrt(overdispersion) = 38.13
Thus, to mimic my data I could generate random numbers from the following Normal distribution:\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{ with }
\mu_i = -159.5 + 30.1 \; x_i \text{ and }
\sigma = 38.1
\]Although the linear model looks fine in the range of temperatures observed, it doesn't make much sense at 0ºC. The intercept is at -159, which would mean that customers return on average 159 units of ice cream on a freezing day. Well, I don't think so.
Log-transformed linear regression
Ok, perhaps I can transform the data first. Ideally I would like ensure that the transformed data has only positive values. The first transformation that comes to my mind in those cases is the logarithm.So, let's model the ice cream sales on a logarithmic scale. Thus my model changes to:
\[
\log(y_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\
\mathbb{E}[\log(y_i)]=\mu_i = \alpha + \beta x_i
\]This model implies that I believe the sales follow a log-normal distribution:\[y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\]Note, the log-normal distribution is skewed to the right, which means that I regard higher sales figures more likely than lower sales figures.
Although the model is still linear on a log-scale, I have to remember to transform the predictions back to the original scale because \(\mathbb{E}[\log(y_i)] \neq \log(\mathbb{E}[y_i])\). This is shown below. For a discussion of the transformation of the lognormal distribution see for example the help page of the R function
rlnorm
.\[
y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\\
\mathbb{E}[y_i] = \exp(\mu_i + \sigma^2/2) = \exp(\alpha + \beta x_i + \sigma^2/2)
\]
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
family=gaussian(link="identity"))
display(log.lin.mod)
## glm(formula = log(units) ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) 4.40 0.20
## temp 0.08 0.01
## ---
## n = 12, k = 2
## residual deviance = 0.2, null deviance = 1.4 (difference = 1.2)
## overdispersion parameter = 0.0
## residual sd is sqrt(overdispersion) = 0.14
log.lin.sig <- summary(log.lin.mod)$dispersion
log.lin.pred <- exp(predict(log.lin.mod) + 0.5*log.lin.sig)
basicPlot()
lines(icecream$temp, log.lin.pred, col="red", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "log-transformed LM"),
col=c("#00526D","red"), pch=c(1,NA))
This plot looks a little better than the previous linear model and it predicts that I would sell, on average, 82 ice creams when the temperature is 0ºC:
exp(coef(log.lin.mod)[1] + 0.5 * log.lin.sig)
## (Intercept)
## 82.3945
Although this model makes a little more sense, it appears that is predicts too many sales at the low and high-end of the observed temperature range. Furthermore, there is another problem with this model and the previous linear model as well. The assumed model distributions generate real numbers, but ice cream sales only occur in whole numbers. As a result, any draw from the model distribution should be a whole number.Poisson regression
The classic approach for count data is the Poisson distribution.The Poisson distribution has only one parameter, here \(\mu_i\), which is also its expected value. The canonical link function for \(\mu_i\) is the logarithm, i.e. the logarithm of the expected value is regarded a linear function of the predictors. This means I have to apply the exponential function to the linear model to get back to the original scale. This is distinctively different to the log-transformed linear model above, where the original data was transformed, not the expected value of the data. The log function here is the link function, as it links the transformed expected value to the linear model.
Here is my model:
\[
y_i \sim \text{Poisson}(\mu_i)\\
\mathbb{E}[y_i]=\mu_i=\exp(\alpha + \beta x_i) = \exp(\alpha) \exp(\beta x_i)\\
\log(\mu_i) = \alpha + \beta x_i\]Let me say this again, although the expected value of my observation is a real number, the Poisson distribution will generate only whole numbers, in line with the actual sales.
pois.mod <- glm(units ~ temp, data=icecream,
family=poisson(link="log"))
display(pois.mod)
## glm(formula = units ~ temp, family = poisson(link = "log"), data = icecream)
## coef.est coef.se
## (Intercept) 4.54 0.08
## temp 0.08 0.00
## ---
## n = 12, k = 2
## residual deviance = 60.0, null deviance = 460.1 (difference = 400.1)
pois.pred <- predict(pois.mod, type="response")
basicPlot()
lines(icecream$temp, pois.pred, col="blue", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Poisson (log) GLM"),
col=c("#00526D","blue"), pch=c(1,NA))
This looks pretty good. The interpretation of the coefficients should be clear now. I have to use the
exp
function to make a prediction of sales for a given temperature. However, R will do this for me automatically, if I set in the predict
statement above type="response"
.From the coefficients I can read off that a 0ºC I expect to sell \(\exp(4.45)=94\) ice creams and that with each one degree increase in temperature the sales are predicted to increase by \(\exp(0.076) - 1 = 7.9\%\).
Note, the exponential function turns the additive scale into a multiplicative one.
So far, so good. My model is in line with my observations. Additionally, it will not predict negative sales and if I would simulate from a Poisson distribution with a mean given by the above model I will always only get whole numbers back.
However, my model will also predict that I should expect to sell over 1000 ice creams if the temperature reaches 32ºC:
predict(pois.mod, newdata=data.frame(temp=32), type="response")
## 1
## 1056.651
Perhaps the exponential growth in my model looks a little too good to be true. Indeed, I am pretty sure that my market will be saturated at around 800. Warning: this is a modelling assumption from my side!Binomial regression
Ok, let's me think about the problem this way: If I have 800 potential sales then I'd like to understand the proportion sold at a given temperature.This suggests a binomial distribution for the number of successful sales out of 800. The key parameter for the binomial distribution is the probability of success, the probability that someone will buy ice cream as a function of temperature.
Dividing my sales statistics by 800 would give me a first proxy for the probability of selling all ice cream.
Therefore I need an S-shape curve that maps the sales statistics into probabilities between 0 and 100%.
A canonical choice is the logistic function:
\[
\text{logit}(u) = \frac{e^u}{e^u + 1} = \frac{1}{1 + e^{-u}}
\]
With that my model can be described as:
\[
y_i \sim \text{Binom}(n, \mu_i)\\
\mathbb{E}[y_i]=\mu_i=\text{logit}^{-1}(\alpha + \beta x_i)\\
\text{logit}(\mu_i) = \alpha + \beta x_i\]
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
family=binomial(link = "logit"))
display(bin.glm)
## glm(formula = cbind(units, opportunity) ~ temp, family = binomial(link = "logit"),
## data = icecream)
## coef.est coef.se
## (Intercept) -2.97 0.11
## temp 0.16 0.01
## ---
## n = 12, k = 2
## residual deviance = 84.4, null deviance = 909.4 (difference = 825.0)
bin.pred <- predict(bin.glm, type="response")*market.size
basicPlot()
lines(icecream$temp, bin.pred, col="purple", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Binomial (logit) GLM"),
col=c("#00526D","purple"), pch=c(1,NA))
The chart doesn't look too bad at all!
As the temperature increases higher and higher this model will predict that sales will reach market saturation, while all the other models so far would predict higher and higher sales.
I can predict sales at 0ºC and 35ºC using the inverse of the logistic function, which is given in R as
plogis
:# Sales at 0 Celsius
plogis(coef(bin.glm)[1])*market.size
## (Intercept)
## 39.09618
# Sales at 35 Celsius
plogis(coef(bin.glm)[1] + coef(bin.glm)[2]*35)*market.size
## (Intercept)
## 745.7449
So, that is 39 ice creams at 0ºC and 746 ice creams at 35ºC. Yet, these results will of course change if I change my assumptions on the market size. A market size of 1,000 would suggest that I can sell 55 units at 0ºC and 846 at 35ºC. Summary
Let's bring all the models together into one graph, with temperatures ranging from 0 to 35ºC.temp <- 0:35
p.lm <- predict(lin.mod, data.frame(temp=temp), type="response")
p.log.lm <- exp(predict(log.lin.mod, data.frame(temp=0:35), type="response") +
0.5 * summary(log.lin.mod)$dispersion)
p.pois <- predict(pois.mod, data.frame(temp=temp), type="response")
p.bin <- predict(bin.glm, data.frame(temp=temp), type="response")*market.size
basicPlot(xlim=range(temp), ylim=c(-20,market.size))
lines(temp, p.lm, type="l", col="orange", lwd=2)
lines(temp, p.log.lm, type="l", col="red", lwd=2)
lines(temp, p.pois, type="l", col="blue", lwd=2)
lines(temp, p.bin, type="l", col="purple", lwd=2)
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D","orange", "red",
"blue", "purple"),
bty="n", lwd=rep(2,5),
lty=c(NA,rep(1,4)),
pch=c(1,rep(NA,4)))
The chart shows the predictions of my four models over a temperature range from 0 to 35ºC. Although the linear model looks OK between 10 and perhaps 30ºC, it shows clearly its limitations. The log-transformed linear and Poisson models appear to give similar predictions, but will predict an ever accelerating increase in sales as temperature increases. I don't believe this makes sense as even the most ice cream loving person can only eat so much ice cream on a really hot day. And that's why I would go with the Binomial model to predict my ice cream sales.
Simulations
Having used the distribution centric view to describe my models leads naturally to simulations. If the models are good, then I shouldn't be able to identify the simulation from the real data.In all my models the linear structure was
\[
g(\mu_i) = \alpha + \beta x_i\]or in matrix notation
\[
g(\mu) = A v\] with \(A_{i,\cdot} = [1, x_i]\) and \(v=[\alpha, \beta]\), whereby \(A\) is the model matrix, \(v\) the vector of coefficients and \(g\) the link function.
With that being said, let's simulate data from each distribution for the temperatures measured in the original data and compare against the actual sales units.
n <- nrow(icecream)
A <- model.matrix(units ~ temp, data=icecream)
set.seed(1234)
(rand.normal <- rnorm(n,
mean=A %*% coef(lin.mod),
sd=sqrt(summary(lin.mod)$dispersion)))
## [1] 152.5502 278.3509 339.2073 244.5335 374.3981 404.4103 375.2385
## [8] 403.3892 483.9470 486.5775 526.3881 557.6662
(rand.logtrans <- rlnorm(n,
meanlog=A %*% coef(log.lin.mod),
sdlog=sqrt(summary(log.lin.mod)$dispersion)))
## [1] 196.0646 266.1002 326.8102 311.5621 315.0249 321.1920 335.3733
## [8] 564.6961 515.9348 493.4822 530.8356 691.2354
(rand.pois <- rpois(n,
lambda=exp(A %*% coef(pois.mod))))
## [1] 220 266 279 337 364 434 349 432 517 520 516 663
(rand.bin <- rbinom(n,
size=market.size,
prob=plogis(A %*% coef(bin.glm))))
## [1] 197 249 302 313 337 386 387 414 516 538 541 594
basicPlot(ylim=c(100,700))
cols <- adjustcolor(c("orange", "red", "blue", "purple"),
alpha.f = 0.75)
points(icecream$temp, rand.normal, pch=19, col=cols[1])
points(icecream$temp, rand.logtrans, pch=19, col=cols[2])
points(icecream$temp, rand.pois, pch=19, col=cols[3])
points(icecream$temp, rand.bin, pch=19, col=cols[4])
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D",cols), lty=NA,
bty="n", lwd=rep(2,5),
pch=c(1,rep(19,4)))
The chart displays one simulation per observation from each model, but I believe it shows already some interesting aspects. Not only do I see that the Poisson and Binomial model generate whole numbers, while the Normal and log-transformed Normal predict real numbers, I notice the skewness of the log-normal distribution in the red point at 19.4ºC.
Furthermore, the linear model predicts equally likely figures above and below the mean and at 16.4ºC the prediction appears a little low - perhaps as a result.
Additionally the high sales prediction at 25.1ºC of the log-transformed Normal and Poisson model shouldn't come as a surprise either.
Again, the simulation of the binomial model seems the most realistic to me.
Conclusions
I hope this little article illustrated the intuition behind generalised linear models. Fitting a model to data requires more than just applying an algorithm. In particular it is worth to think about:- the range of expected values: are they bounded or range from \(-\infty\) to \(\infty\)?
- the type of observations: do I expect real numbers, whole numbers or proportions?
- how to link the distribution parameters to the observations
There are many aspects of GLMs which I haven't touched on here, such as:
- all the above models incorporate a fixed level of volatility. However, in practice, the variability of making a sale at low temperatures might be significantly different than at high temperatures. Check the residual plots and consider an over-dispersed model.
- I used so called canonical link functions in my models, which have nice theoretical properties, but other choices are possible as well.
- the distribution has to be chosen from the exponential family, e.g. Normal, Gamma, Poisson, binomial, Tweedie, etc.
- GLMs use maximum likelihood as the criteria for fitting the models. I sketched the difference between least square and maximum likelihood in an earlier post.
RMarkdown
Buy the RMarkdown file of this post
Session Info
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arm_1.8-6 lme4_1.1-8 Matrix_1.2-1 MASS_7.3-42
loaded via a namespace (and not attached):
[1] minqa_1.2.4 tools_3.2.1 coda_0.17-1 abind_1.4-3
[5] Rcpp_0.11.6 splines_3.2.1 nlme_3.1-121 grid_3.2.1
[9] nloptr_1.0.4 lattice_0.20-31
No comments :
Post a Comment