**Environmental Science and Data Analytics**, and kindly contributed to R-bloggers)

In an earlier post of mine, I carried out an analysis on ski jumping data for Zakopane, Poland and attempted to predict which athletes would end up on the podium. I also created a classification tree and tested it on the 2017 competition data with good results. For this side project of mine, I hope to predict podium finishes for three major competitions of the ski jumping World Cup. The next competition in the World Cup calendar is Willingen, Germany. Therefore, the second post in this series relates to the prediction of podium finishes in this competition.

The approach this time is similar to that for the Zakopane analysis, with the exception of an extra predictor variable in the classification tree. The Zakopane tree used podium finish (a logical result of Yes or No) as the dependent variable and the first round jump and second round jump as predictor variables. The additional predictor variable used here is the total style points achieved. These points are awarded for sound landing, in-air technique and linearity of motion.

The Willingen dataset includes data for all competitions from 2010 to 2016. Note that the event in 2013 was cancelled. The raw data were scraped from the FIS website using the RCurl and XML packages. The functions of interest are getURL(), which downloads the page, and readHTMLTable(), which converts the HTML tables on it into data frames.
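To illustrate the table-extraction step, here is a minimal sketch in Python rather than the R code used for this post. It uses only the standard library's html.parser and reads an inline, made-up HTML snippet instead of the real FIS page, so the column names, athlete names and values are all hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up markup standing in for a downloaded results page (not real FIS data).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Round 1</th><th>Round 2</th></tr>
  <tr><td>1</td><td>Jumper A</td><td>145.0</td><td>143.5</td></tr>
  <tr><td>2</td><td>Jumper B</td><td>141.5</td><td>140.0</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# Treat the first row as the header and turn the rest into records.
header, *records = parser.rows
results = [dict(zip(header, record)) for record in records]
print(results)
```

In a real run the string passed to feed() would be the downloaded page source, and the numeric columns would still need converting from text.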

## The Analysis

The distribution of all ski jumps (first and second rounds inclusive) is given below. All jumps at Willingen are on the left while only those jumps which contributed to a podium place are on the right. A visual estimation puts the overall mean jump at around 130.0 m while the mean podium jump looks to be around 140.0 m. A look at the summary statistics shows that the overall mean jump is actually 131.5 m while the mean podium jump is 142.5 m.

All of the summary statistics are given in the following table. These include measures of central tendency such as the mean and median as well as measures of variation such as the standard deviation, range and interquartile range.
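As a rough illustration of how these statistics are computed, the following Python sketch uses the standard statistics module. The jump lengths are invented for the example and are not the Willingen data, so the printed values will not match the table.

```python
import statistics

# Hypothetical jump lengths in metres, for illustration only.
jumps = [121.5, 126.0, 128.5, 130.0, 131.5, 133.0, 135.5, 138.0, 142.5, 146.0]

# Quartile cut points; the interquartile range is Q3 - Q1.
q1, q2, q3 = statistics.quantiles(jumps, n=4)

summary = {
    "mean": statistics.mean(jumps),
    "median": statistics.median(jumps),
    "sd": statistics.stdev(jumps),        # sample standard deviation
    "range": max(jumps) - min(jumps),
    "iqr": q3 - q1,
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Quartile conventions differ between tools, so an interquartile range computed this way may differ slightly from one produced elsewhere on the same data.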

The next plot shows the same distributions, this time as frequency density plots. Also included are lines indicating the median for all jumps and the mean for the podium jumps. The density curves are superimposed to show the shape of these data. Shapiro-Wilk tests for normality give p-values of 0.0001876 for all jumps and 0.358 for podium jumps. The podium jumps are therefore consistent with a normal distribution (normality is not rejected at conventional significance levels), whereas the distribution of all jumps departs significantly from normality.

Which jumpers have performed well at Willingen between 2010 and 2016? Severin Freund is one of them but is out with an ACL injury. Therefore, he is omitted from the analysis. The table below gives the top jumpers in terms of podium finishes. The probability of a podium finish is calculated by: total podiums / appearances.
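The calculation is a simple ratio. A minimal Python sketch, using the figures this post reports for Kamil Stoch (3 podiums from 5 appearances), looks like this. The function name and the input checks are my own additions for illustration.

```python
def podium_probability(podiums: int, appearances: int) -> float:
    """Share of appearances that ended on the podium."""
    if appearances <= 0:
        raise ValueError("appearances must be positive")
    if not 0 <= podiums <= appearances:
        raise ValueError("podiums must be between 0 and appearances")
    return podiums / appearances


stoch = podium_probability(podiums=3, appearances=5)
print(f"{stoch:.0%}")  # prints 60%
```

With so few appearances per athlete, this is an empirical frequency rather than a stable estimate, so it is best read alongside the jump statistics.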

There are some familiar names in there. As in Zakopane, Kamil Stoch is once again one of the top jumpers. He has 3 podium finishes from 5 competitions; a probability of a podium place of 60%. We can extract the top jumpers from this table and create an additional dataframe which includes extra information. Here it is:

The additional variables include the podium place (1st, 2nd or 3rd), probabilities of winning the event and getting a podium finish, the mean podium jump, the overall mean jump, the overall median jump and the overall maximum jump. This closer look at the numbers reveals that Kamil Stoch’s three podium finishes were all 1st places. Peter Prevc has one 1st place, one 2nd and one 3rd from his three podium finishes. The other athletes did not win the event between 2010 and 2016.

The 2017 competition favourite, then, is surely Kamil Stoch. Kenneth Gangnes, Anders Jacobsen and Martin Koch appear to be out of the World Cup this year as I can find no record of them on the official website. I found an article stating Gangnes had damaged his ACL in June 2016 so perhaps he is out due to injury. In any case, the only robust predictions I could make from the dataset would be a win for Kamil Stoch and a podium finish for Peter Prevc. My wildcard options are: Daniel Andre Tande, Andreas Wellinger and Richard Freitag. Stefan Kraft would be in there but he is out due to sickness as far as I know.

## The Classification Tree

Intuition is all well and good. However, a classification tree can give us conditions which can assist in predicting whether an athlete’s performance may be good enough to get him a podium place.

The partykit package was used to build the classification tree for Willingen. Whether a podium place was achieved is the dependent variable. The lengths of the first and second round jumps as well as total style points are the predictor variables. I used a 90% training / 10% test split as the dataset is not particularly large (*n* = 347). Here is the plot output:
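For readers who want to see what a 90/10 split of 347 rows looks like, here is a minimal Python sketch. The shuffling approach, the seed and the function name are assumptions for illustration; the exact procedure used for this post is not shown here.

```python
import random


def train_test_split(rows, test_fraction=0.10, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 347 observations in the dataset.
row_ids = range(347)
train, test = train_test_split(row_ids)
print(len(train), len(test))  # prints 312 35
```

Because podium finishes are rare (three per event), a purely random 10% sample may contain very few of them. A stratified split is one alternative worth considering.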

We see that the first partitioning is determined by the length of the first round jump. A first round jump of over 141.0 m combined with a total style points score greater than 253.9 all but guarantees a podium place. There is, however, a small chance of a podium with a total style points score of less than 253.9 given a first round jump greater than 141.0 m.

A first round jump of less than 141.0 m, a second round jump greater than 140.5 m and a total style points score over 257 is also associated with a high probability of a podium finish. A run of this model on the test set gives a model accuracy of 97% (34 of 35 cases), with the model correctly predicting one podium finish and 33 non-podium finishes. One podium finish prediction was false.
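The branches described above can be written out as plain conditions. The Python sketch below encodes only the three branches mentioned in the text; it is not the fitted partykit model. How values exactly on a threshold are handled is an assumption (they fall into the lower branch here), and any combination not described above is reported as such rather than guessed.

```python
def describe_branch(first_jump: float, second_jump: float, style_points: float) -> str:
    """Map a performance to the tree branch described in the text.

    Thresholds are taken from the post; handling of values that sit exactly
    on a threshold is an assumption.
    """
    if first_jump > 141.0:
        if style_points > 253.9:
            return "podium very likely"
        return "small chance of podium"
    if second_jump > 140.5 and style_points > 257.0:
        return "podium likely"
    return "branch not described in the post"


# Hypothetical performances: (first jump in m, second jump in m, total style points)
print(describe_branch(142.0, 138.0, 255.0))
print(describe_branch(139.0, 141.0, 258.0))
```

Writing the rules this way makes it easy to check a given result against the targets, but the leaf probabilities themselves come from the tree plot.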

## Conclusion

The decision tree seems to have given us some solid numbers to aim for both in terms of jump length and style points. The overall mean podium jump is 142.5 m and this would be a good target to set. Of course, style points also matter, as shown by the classification tree.

My predictions are:

- Kamil Stoch
- Peter Prevc
- Wildcard (A. Wellinger, R. Freitag, D. Prevc, D. A. Tande or S. Kraft)

I’ll be back with the results and an update on how the model performs on the 2017 results and whether the conditions set by it were accurate.
