While it is important to understand specific techniques, spend time improving them, and strengthen individual muscles, occasionally you need a few rounds of actual sparring to see your flow and spot weaknesses. This exercise set forces you to use everything you have practiced: scraping links, downloading data, using regular expressions, merging data, and then analyzing it.
We will download data from the website football-data.co.uk, which publishes results for a number of football/soccer leagues together with the odds quoted by bookmakers for betting on those results.
Answers are available here.
Use R to scan the German section on football-data.co.uk for any links and save them in a character vector called all_links. There are many ways to accomplish this.
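One of those many ways can be sketched in base R with a regular expression over the raw HTML. The `html` vector below is a made-up stand-in for the page source so that the example runs offline; in practice something like `readLines()` on the page URL would supply it, and the link targets shown are assumptions about what the page contains.

```r
# Hypothetical stand-in for the page source; in practice readLines() on the
# page URL would supply these lines.
html <- c(
  '<a href="mmz4281/1314/D1.csv">Bundesliga 1</a>',
  '<a href="mmz4281/1314/D2.csv">Bundesliga 2</a> <a href="notes.txt">Notes</a>',
  '<p>No link on this line</p>'
)

# Match every href="..." attribute, then strip the wrapper around the target.
hits <- regmatches(html, gregexpr('href="[^"]*"', html))
all_links <- gsub('^href="|"$', "", unlist(hits))

all_links
```

A dedicated HTML parser is usually more robust than a regular expression on real pages; the regex version is shown only because it needs no extra packages.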
Among the links you found there should be a number pointing to comma-separated values (csv) files with data on Bundesliga 1 and 2, separated by season. Now update the all_links vector so that only links to csv files remain. Use regular expressions.
Update all_links again so that only links to csv tables from Bundesliga 1, from season 1993/1994 to 2013/2014 inclusive, remain.
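The two filtering steps might look roughly like the sketch below. It assumes the links follow a `mmz4281/<season code>/<division>.csv` layout, with season codes such as `9394` and `D1` standing for Bundesliga 1; both the layout and the sample links are assumptions, so check them against the links you actually scraped.

```r
# Hypothetical links; the "mmz4281/<season>/<division>.csv" layout is an assumption.
all_links <- c("mmz4281/1415/D1.csv", "mmz4281/1314/D1.csv", "mmz4281/1314/D2.csv",
               "mmz4281/9394/D1.csv", "notes.txt", "germanym.php")

# Step 1: keep only links that end in ".csv".
all_links <- all_links[grepl("\\.csv$", all_links)]

# Step 2: build the season codes 9394, 9495, ..., 1314 and keep Bundesliga 1 only.
start_years <- 1993:2013
seasons <- sprintf("%02d%02d", start_years %% 100L, (start_years + 1L) %% 100L)
wanted <- paste0("/(", paste(seasons, collapse = "|"), ")/D1\\.csv$")
all_links <- all_links[grepl(wanted, all_links)]

all_links
```

Building the season codes programmatically avoids typing 21 alternatives by hand and makes the intended range explicit.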
Import all 21 remaining csv files in all_links into a list in your workspace, each one as a data.frame, using read.csv with the url and na.strings = c("", "NA"). Note that you might need to add a prefix to the links so that they are complete.
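A minimal sketch of the import step follows. The prefix `base_url` is an assumption, and the actual download is left commented out so that the example runs offline; the inline `csv_text` is invented data that only illustrates what `na.strings` does.

```r
# Assumed prefix; check it against the links you actually scraped.
base_url <- "http://www.football-data.co.uk/"
all_links <- c("mmz4281/1213/D1.csv", "mmz4281/1314/D1.csv")  # hypothetical links
full_links <- paste0(base_url, all_links)

# With network access, the import would look something like this:
# bundesl_list <- lapply(full_links, read.csv, na.strings = c("", "NA"))

# Offline illustration of na.strings, using inline text in place of a URL.
csv_text <- "Div,Date,HomeTeam,FTR\nD1,09/08/13,Bayern Munich,H\nD1,,,\n"
one_season <- read.csv(text = csv_text, na.strings = c("", "NA"))

one_season
```

With `na.strings = c("", "NA")`, empty cells become `NA` instead of empty strings, which makes incomplete rows and columns easier to detect later.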
Take the list and generate one big data.frame from all the data.frames previously imported. One way to do this is to use the rbind.fill function from a well-known package. Name the new data.frame bundesl.
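The `rbind.fill` function lives in the plyr package. The sketch below imitates its padding behaviour in base R so that it runs without extra packages; `bind_fill` is a made-up helper name, and the two small tables are invented.

```r
# Two hypothetical season tables whose columns only partly overlap.
s1 <- data.frame(HomeTeam = "Dortmund", FTR = "H", stringsAsFactors = FALSE)
s2 <- data.frame(HomeTeam = "Schalke 04", FTR = "A", B365H = 2.1,
                 stringsAsFactors = FALSE)
bundesl_list <- list(s1, s2)

# bind_fill() is a made-up helper: pad each table with NA columns, then rbind.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]
  })
  do.call(rbind, padded)
}

bundesl <- bind_fill(bundesl_list)
bundesl
```

With plyr installed, `plyr::rbind.fill(bundesl_list)` should give a comparable result with less code.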
Take a good look at the new dataset. Our read.csv did not work perfectly on this data: it turns out that there are some empty rows and empty columns. Identify and count them. Update bundesl so that it no longer has empty rows or columns.
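One possible way to find empty rows and columns is to count the `NA` cells per row and per column, as sketched below on an invented miniature of the merged table. The column name `X` is only a placeholder for an empty column.

```r
# Hypothetical miniature of the merged table with one empty row and one empty column.
bundesl <- data.frame(
  HomeTeam = c("Bayern Munich", NA, "Werder Bremen"),
  FTR      = c("H", NA, "D"),
  X        = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# A row or column counts as empty when every cell in it is NA.
empty_rows <- rowSums(is.na(bundesl)) == ncol(bundesl)
empty_cols <- colSums(is.na(bundesl)) == nrow(bundesl)

n_empty_rows <- sum(empty_rows)  # number of empty rows
n_empty_cols <- sum(empty_cols)  # number of empty columns

# Drop both in one subsetting step; drop = FALSE keeps the result a data.frame.
bundesl <- bundesl[!empty_rows, !empty_cols, drop = FALSE]
```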
Format the Date column so that R recognizes it as a date.
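As an illustration, base R's `as.Date` can do this conversion. The sketch assumes the dates are stored as day/month/two-digit-year text; that format and the sample values are assumptions, so inspect the Date column before settling on a format string.

```r
# Assumed input: day/month/two-digit-year strings; adjust the format if the data differ.
raw_dates <- c("07/08/93", "10/05/14")
parsed <- as.Date(raw_dates, format = "%d/%m/%y")

class(parsed)
format(parsed)
```

Values that do not match the format come back as `NA`, so counting `NA`s after the conversion is a quick sanity check.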
Remove all columns which are not 100% complete, and the variable Div as well.
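A short sketch of this step, on invented data: a column is complete when it contains no `NA`, and `Div` is excluded by name in the same subsetting call. The column `B365H` is only an example of an incomplete column.

```r
# Hypothetical table: B365H has a missing value, the other columns are complete.
bundesl <- data.frame(
  Div      = c("D1", "D1", "D1"),
  HomeTeam = c("Hamburg", "Stuttgart", "Freiburg"),
  FTR      = c("A", "H", "D"),
  B365H    = c(NA, 1.8, 2.4),
  stringsAsFactors = FALSE
)

# Keep columns without any NA, and drop Div regardless of completeness.
complete_cols <- colSums(is.na(bundesl)) == 0
bundesl <- bundesl[, complete_cols & names(bundesl) != "Div", drop = FALSE]

names(bundesl)
```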
Which are the top 3 teams in terms of number of wins in Bundesliga 1 for our period? You are free to use base-R functions or any package. Be warned that this task is not as simple as it seems, due to the nature of the data and a small inconsistency in it.
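The counting logic could be sketched as follows. It assumes columns named `HomeTeam`, `AwayTeam`, and `FTR` (with `"H"`, `"D"`, `"A"` for home win, draw, and away win); these names and all the data are assumptions for illustration. The sketch does not deal with the inconsistency mentioned above, which you still need to find and fix in the real data.

```r
# Hypothetical results; column names and result codes are assumptions.
bundesl <- data.frame(
  HomeTeam = c("Bayern Munich", "Dortmund", "Schalke 04", "Dortmund"),
  AwayTeam = c("Dortmund", "Schalke 04", "Bayern Munich", "Bayern Munich"),
  FTR      = c("H", "H", "A", "D"),
  stringsAsFactors = FALSE
)

# The winner is the home side on "H", the away side on "A", and nobody on a draw.
winner <- ifelse(bundesl$FTR == "H", bundesl$HomeTeam,
          ifelse(bundesl$FTR == "A", bundesl$AwayTeam, NA))

# table() drops NA by default, so draws are not counted.
wins <- sort(table(winner), decreasing = TRUE)
head(wins, 3)
```

Home and away wins have to be combined because each team appears in two different columns, which is part of what makes the task less simple than it looks.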
Which team has held the longest winning streak in our data?
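One way to approach streaks is run-length encoding with `rle`. The sketch below works on an invented win/no-win sequence for a single team; for the real data you would first build such a sequence per team, ordered by date, and then compare the results across teams.

```r
# Hypothetical chronological results for one team: TRUE = win, FALSE = draw or loss.
won <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

# rle() collapses the vector into runs of identical values and their lengths.
runs <- rle(won)

# The longest winning streak is the longest run whose value is TRUE.
longest_streak <- max(runs$lengths[runs$values])
longest_streak
```

If a team never won, `runs$lengths[runs$values]` is empty and `max` warns and returns `-Inf`, so a real solution may want to guard against that case.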