
It is winter, which makes it time for one of the Netherlands' beloved sports: speed skating. Speed skating is done over various distances, but for me the most beautiful is the 10 km. The top men finish it in about 13 minutes. In this post I try to use a simple PLS model to predict the finish time as a function of lap times. It turns out that this is not so easy: the later laps are most strongly related to the finish time.

The races are held on a 400 m skating rink. To quote Wikipedia: 'All races are run counter-clockwise. In all individual competition forms, only two skaters are allowed to race at once, and must remain in their respective lanes. Skaters must change lanes every lap. The skater changing from the outside lane to the inside has right-of-way.' It follows that a 10 km race consists of 25 laps, and hence there are 25 lap times.
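As a trivial check of that arithmetic in R (a self-contained sketch, not part of the original script):

```r
# 10 km on a 400 m rink: number of laps and the passing distances
laps <- 10000 / 400
distances <- seq(400, 10000, by = 400)
```

These passing distances are the same `seq(400, 10000, 400)` that recurs throughout the code below.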

#### Source data

Data for the 10 km have been downloaded from schaatsstatistieken (a PDF file; the site has Dutch and English language versions) at the beginning of December. It contains lap times for all races in which the final time was under 13'30 at that point in time. Since then some additional results have been added, and the Dutch championships this weekend will surely add more. The PDF was converted to a .txt file in Calibre. Unfortunately, the result has rather randomly placed line breaks and no delimiters other than the space, so it took me some lines of code to read the data. It was not possible to simply split on counts of spaces, since these vary with skaters' names and event locations. I have put this script in the appendix, since it would break up the text too much.

#### Plotting data

The plot shows all lap times in the data. The pattern is fairly simple: the first lap is slow, as speed is still being built up. The intermediate laps are skated at a fairly constant speed. In the last laps skaters either speed up or slow down.
# plot data
long <- reshape(scating4, varying=list(colnames(times), colnames(rtimes)),
    v.names=c('total', 'round'), direction='long',
    times=seq(400, 10000, 400), timevar='Distance')
xyplot(round ~ Distance, groups=nums, data=long, type='l')

The second plot shows a selection of the data: all results from the year 2000. It shows the typical patterns. Frank Dittrich skated ever-increasing speeds. Gianni Romme, in Heerenveen, slowed slightly over the laps. Bob de Jong had nicely flat lap times.
long$year <- substr(long$date, 1, 4)
xyplot(round ~ Distance, groups=paste(name, ' (', location, ')',
    sep=''), data=long[long$year=='2000',],
    type='l', ylab='lap time', main='Year 2000',
    auto.key=list(space='bottom', lines=TRUE, points=FALSE))

#### Predictive model

To model these correlated data, PLS seems an obvious choice. The question is then: from which distance onward is the final time predicted well? At 2 km it is not so good. A Root Mean Squared Error of Prediction (RMSEP) of about 7.5 seconds at 4 components means there is a fair chance of an error of more than 7 seconds; see also the observed-versus-predicted plot.
pls1 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000,
    data=scating4, validation='CV', ncomp=5)
RMSEP(pls1)
(Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           9.879    7.766    7.569    7.492    7.469    7.469
adjCV        9.879    7.764    7.566    7.490    7.466    7.465
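Rather than eyeballing such RMSEP tables, the number of components can also be chosen automatically with `selectNcomp()` from the pls package. A self-contained sketch on the package's built-in `yarn` data (not the skating data, so it runs standalone):

```r
library(pls)
data(yarn)   # NIR spectra with a density response, shipped with pls
set.seed(1)  # cross-validation segments are random
fit <- plsr(density ~ NIR, ncomp = 6, data = yarn, validation = "CV")
# 'onesigma' picks the smallest model within one standard
# error of the cross-validated RMSEP optimum
nc <- selectNcomp(fit, method = "onesigma")
```

The same call on `pls1` would formalize the observation that the RMSEP flattens out after a few components.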

formattl <- function(x)
    paste(x %/% 60, "'",
        sprintf('%02i', round(x %% 60)),
        sep='')

obspredplot <- function(y, main) {
    plot(x=scating4$m10000, y=y, axes=FALSE,
        xlab='Observed time', ylab='Predicted time',
        main=main,
        xlim=c(759, 811), ylim=c(759, 821))
    axis(1, at=seq(760, 810, 10), labels=formattl(seq(760, 810, 10)))
    axis(2, at=seq(760, 820, 10), labels=formattl(seq(760, 820, 10)))
    abline(a=0, b=1, col='blue')
    return(invisible(0))
}

obspredplot(y=as.numeric(predict(pls1, ncomp=2, type='response')),
    main='Data up to 2 km')

As more distances are added, the predictions get better. With data up to 8 km the RMSEP is around 2.4 seconds, so a good prediction can be made.

pls2 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000, data=scating4,
    validation='CV', ncomp=9)
RMSEP(pls2)
(Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    6.808    6.484    6.354    6.329    6.320    6.315
adjCV        9.879    6.807    6.482    6.351    6.323    6.312    6.307
7 comps  8 comps  9 comps
CV       6.319    6.319    6.319
pls3 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000
    + rm4400 + rm4800 + rm5200 + rm5600 + rm6000, data=scating4,
    validation='CV', ncomp=14)
RMSEP(pls3)
(Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    5.219    4.645    4.507    4.474    4.492    4.525
adjCV        9.879    5.217    4.643    4.505    4.469    4.485    4.514
7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       4.536    4.538    4.539     4.540     4.540     4.540     4.540
adjCV    4.525    4.526    4.527     4.528     4.528     4.528     4.528
14 comps
CV        4.540
pls4 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000
    + rm4400 + rm4800 + rm5200 + rm5600 + rm6000
    + rm6400 + rm6800 + rm7200 + rm7600 + rm8000, data=scating4,
    validation='CV', ncomp=14)
RMSEP(pls4)
(Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    2.806    2.558    2.411    2.366    2.351    2.372
adjCV        9.879    2.805    2.557    2.409    2.362    2.347    2.366
7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       2.385    2.386    2.387     2.389     2.390     2.390     2.391
adjCV    2.377    2.378    2.380     2.381     2.382     2.383     2.383
14 comps
CV        2.391
obspredplot(as.numeric(predict(pls4, ncomp=2, type='response')),
    main='Data up to 8 km')
Further calculation shows that it is not the amount of data that determines the prediction quality: just using the passing times at 2, 4, 6 and 8 km gives good predictions.
pls4b <- plsr(m10000 ~ rm2000 + rm4000 + rm6000 + rm8000, data=scating4,
    validation='CV', ncomp=4)
RMSEP(pls4b)
(Intercept)  1 comps  2 comps  3 comps  4 comps
CV           9.879    2.979    2.903    2.891    2.892
adjCV        9.879    2.978    2.902    2.889    2.889
In hindsight I could have found this much more simply. Just looking at the correlations shows that the lap times just before 6 km and 8 km correlate most strongly with the final time.
plot(y=cor(cbind(rtimes, times[,25]), use='complete.obs')[26, 1:25],
    x=seq(400, 10000, 400),
    ylab='Correlation with 10 km time', xlab='Lap (distance)',
    main='Correlation between lap time and 10 km time')

#### Appendix: reading and transforming data

library(lattice)
library(pls)
library(randomForest)
r1 <- readLines("Rondetijden 10000 meter mannen onder 13.30,00 – SchaatsStatistieken.nl.txt",
    encoding='UTF-8')
# removing unneeded input
nc <- sapply(1:length(r1), function(x) nchar(r1[x]))
r2 <- r1[nc != 0]
r2 <- r2[r2 != "datum"]
r2 <- r2[r2 != "# naam"]
r2 <- r2[-grep("pagina", r2)]
r2 <- r2[-grep("SchaatsStatistieken", r2)]
r2 <- r2[-grep("^400m", r2)]
r2 <- r2[-grep("^nat plaats", r2)]
r4 <- sapply(1:length(r2), function(x)
    gsub("800m 1200m 1600m 2000m 2400m 2800m 3200m 3600m 4000m 4400m 4800m 5200m 5600m 6000m 6400m 6800m 7200m 7600m 8000m 8400m 8800m 9200m 9600m 10000m ",
        "", r2[x]))
lengths <- function(x) {
nc <- sapply(1:length(x),function(xx) nchar(x[xx]))
nc
}
# lines not starting with a number continue the previous record
while (length(grep('^[0-9]{1,3} ', r4, invert=TRUE)) > 0) {
    le <- grep('^[0-9]{1,3} ', r4, invert=TRUE)
    r4[le-1] <- paste(r4[le-1], r4[le])
    r4 <- r4[-le]
}

r5 <- r4[-grep('- - - -', r4)]

# extracting columns
num <- regexpr(" ", r5)
scating1 <- data.frame(nums = as.numeric(substr(r5, 1, num-1)),
    strings = substring(r5, num+1))
# country is three uppercase letters, the name comes before that
num <- regexpr("[A-Z]{3}", scating1$strings)
scating1$name <- substr(scating1$strings, 1, num-1)
scating1$country <- substr(scating1$strings, num, num+2)
scating1$strings <- substring(scating1$strings, num+4)
# date starts with 4 digits for the year
# location comes before the date
num <- regexpr("[0-9]{4}", scating1$strings)
#table(num)
scating1$location <- substr(scating1$strings, 1, num-1)
scating1$date <- substr(scating1$strings, num, num+10)
scating1$strings <- substring(scating1$strings, num+11)
# notation: replace '.' by a '#' placeholder, then ',' becomes the decimal '.'
scating1$strings <- gsub('.', '#', scating1$strings, fixed=TRUE)
scating1$strings <- gsub(',', '.', scating1$strings, fixed=TRUE)

distance <- paste('m', seq(400, 10000, 400), sep='')
scating1$strings <- paste(scating1$strings, ' ')
times <- matrix(0.0, nrow=nrow(scating1),
    ncol=25, dimnames=list(1:nrow(scating1), distance))

for (i in 1:25) {
    num <- regexpr(" ", scating1$strings)
    time <- substr(scating1$strings, 1, num-1)
    time[time == '-'] <- '0#0'
    ntime <- 60*as.numeric(substr(time, 1, regexpr("#", time)-1)) +
        as.numeric(substr(time, regexpr("#", time)+1, lengths(time)))
    ntime[time == '0#0'] <- NA
    times[,i] <- ntime
    scating1$strings <- substr(scating1$strings, num+1,
        lengths(scating1$strings))
}
scating2 <- cbind(scating1[, !(names(scating1) == 'strings')], times)
rtimes <- times
colnames(rtimes) <- paste('r', colnames(times), sep='')
for (i in 2:25) rtimes[,i] <- times[,i] - times[,i-1]
scating3 <- cbind(scating2, rtimes)

scating4 <- scating3[complete.cases(scating3),]
# next two corrections have since been applied in the source data (website)
scating4 <- scating4[scating4$m400 > 20,]     # wrong data in 1st lap
scating4 <- scating4[scating4$nums != 336,]   # wrong data at 6400 m
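The lap-time loop above simply takes successive differences of the cumulative passing times, which R's `diff()` expresses directly. A self-contained sketch with made-up passing times (the values are hypothetical):

```r
# made-up cumulative passing times in seconds
cumtimes <- c(35.2, 66.0, 96.5, 127.3)
# per-lap times: the first passing time, then successive differences
laptimes <- c(cumtimes[1], diff(cumtimes))
# the laps add back up to the cumulative times
stopifnot(isTRUE(all.equal(cumsum(laptimes), cumtimes)))
```

The explicit loop is kept in the script because it has to handle the per-column `NA` markers along the way.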