Sunday, 22 July 2012

London Olympics and a prediction for the 100m final

It is less than a week before the 2012 Olympic games will start in London. No surprise therefore that the papers are all over it, including a lot of data and statistis around the games.

The Economist investigated the potential financial impact on sponsors (some benefits), tax payers (no benefits) and the athletes (if they are lucky) in its recent issue and video.

The Guardian has a whole series around the Olympics, including the data of all Summer Olympic Medallists since 1896.

100m men final

The biggest event of the Olympics will be one of the shortest: the 100 metres men final. It will be all over in less than 10 seconds. In 1968 Jim Hines was the first gold medal winner, who achieved a sub-ten-seconds time and since 1984 all gold medal winners have run faster than 10 seconds. The historical run times of the past Olympics going back to 1896 are available from databasesport.com.

Looking at the data it appears that a simple log-linear model will give a reasonable forecast for the 2012 Olympic's result (ignoring the 1896 time). Of course such a model doesn't make sense forever, as it would suggest that future run-times will continue to shrink. Hence, some kind of logistics model might be a better approach, but I have no idea what would be a sensible floor for it. Others have used ideas from extreme value theory to investigate the 100m sprint, see the paper by Einmahl and Smeets, which would suggest a floor greater than 9 seconds.


Historical winning times for the 100m mean final.
Red line: log-linear regression, black line: logistic regression.

My simple log-linear model forecasts a winning time of 9.68 seconds, which is 1/100 of a second faster than Usain Bolt's winning time in Beijing in 2008, but still 1/10 of a second slower than his 2009 World Record (9.58s) in Berlin.

Never-mind, I shall stick to my forecast. The 100m final will be held on 5 August 2012. Now even I get excited about the Olympics, and be it for less than 10 seconds.

R code

Here is the R code used in this the post:

library(XML)
library(drc) 
url <- "http://www.databaseolympics.com/sport/sportevent.htm?enum=110&sp=ATH"
data <- readHTMLTable(readLines(url), which=2, header=TRUE)
golddata <- subset(data, Medal %in% "GOLD")
golddata$Year <- as.numeric(as.character(golddata$Year))
golddata$Result <- as.numeric(as.character(golddata$Result))
tail(golddata,10)
logistic <- drm(Result~Year, data=subset(golddata, Year>=1900), fct = L.4())
log.linear <- lm(log(Result)~Year, data=subset(golddata, Year>=1900))  
years <- seq(1896,2012, 4)
predictions <- exp(predict(log.linear, newdata=data.frame(Year=years)))
plot(logistic,  xlim=c(1896,2012),
     ylim=c(9.5,12),
     xlab="Year", main="Olympic 100 metre",
     ylab="Winning time for the 100m men's final (s)") 
points(golddata$Year, golddata$Result)
lines(years, predictions, col="red")
points(2012, predictions[length(years)], pch=19, col="red")
text(2012, 9.55, round(predictions[length(years)],2))

Update 5 August 2012

You find a comparison of my forecast to the final outcome of Usain Bolt's winning time of 9.63s on my follow-up post.

Wednesday, 11 July 2012

Bridget Riley exhibition in London

The other day I saw a fantastic exhibition of work by Bridget Riley. Karsten Schubert, who is Riley's main agent, has a some of her most famous and influential artwork from 1960 - 1966 on display, including the seminal Moving Squares from 1961.

Photo of Moving Squares by Bridget Riley, 1961
Emulsion on board, 123.2 x 121.3cm

In the 1960s Bridget Riley created some great black and white artwork, which at a first glance may look simple and deterministic or sometimes random, but has fascinated me since I saw some of her work for the first time about 9 years ago at the Tate Modern.

Her work prompted a very simple question to me: When does a pattern appear random? As human beings most of our life is focused on pattern recognition. It is about making sense of the world around us, being able to understand what people are saying; seeing lots of different things and yet knowing when something is a table and when it is not. No surprise, I suppose, that pattern recognition is such a big topic in statistics and machine learning.

Neil Dodgson published some interesting research papers related to the work by Bridget Riley. In a paper from 2008 he concluded that even limited randomness can produce a great deal of complexity, such as in Riley's Fragment 6/9, and hence might be so appealing to the pattern detection systems of our brains. I wonder, if it may also help to explain why some people can spend hours looking at stock market charts.

Of course I couldn't resist trying to reproduce the Moving Squares in R. Here it is:

## Inspired by Birdget Riley's Moving Squares
x <- c(0, 70, 140, 208, 268, 324, 370, 404, 430, 450, 468,
        482, 496, 506,516, 523, 528, 533, 536, 542, 549, 558, 
        568, 581, 595, 613, 633, 659, 688, 722, 764, 810)
y <- seq(from=0, to=840, by=70)
m <- length(y)
n <- length(x)
z <- t(matrix(rep(c(0,1), m*n/2), nrow=m))
image(x[-n], y[-m], z[-n,-m], col=c("black", "white"), 
      axes=FALSE, xlab="", ylab="")

However, what may look similar on screen is quite different when you see the actual painting. Thus, if you are in London and have time, make your way to the gallery in Soho. I recommend it!

Tuesday, 10 July 2012

Review: Kölner R Meeting 6 July 2012

The second Cologne R user meeting took place last Friday, 6 July 2012, at the Institute of Sociology. Thanks to Bernd Weiß, who provided the meeting room, we didn't have to worry about the infrastructure, like we did at our first gathering.

Again, we had an interesting mix of people turning up, with a very diverse background from chemistry to geo-science, energy, finance, sociology, pharma, physics, psychology, mathematics, statistics, computer science, telco, etc. Yet, the gender mix was still somewhat biased, with only one female attendee.

We had two fantastic talks by Bernd Weiß and Stephan Sprenger. Bernd talked about Emacs' Org-Mode and R. He highlighted the differences between literate programming and reproducible research and if you follow his slides, you get the impression that Emacs with Org-mode is the all-singing-all-dancing editor, or for those who speak German: eine eierlegende Wollmilchsau.

Tuesday, 3 July 2012

Applying a function successively in R

At the R in Finance conference Paul Teetor gave a fantastic talk about Fast(er) R Code. Paul mentioned the common higher-order function Reduce, which I hadn't used before.

Reduce allows me to apply a function successively over a vector.

What does that mean? Well, if I would like to add up the figures 1 to 5, I could say:

add <- function(x,y) x+y
add(add(add(add(1,2),3),4),5)
or
Reduce(add, 1:5)