**PirateGrunt » R**, and kindly contributed to R-bloggers)

So I tidied the code up a bit from last time; no more for loop. Actually, I tidied it up a lot. My goal had been to arrange the data in such a way that I could get a simple moving average of the score difference for each team. That wound up being a semi-lengthy process. So, now, I have a function which will return the game results for a single season:

GetSeasonResults = function(year) { games.URL.stem = "http://www.pro-football-reference.com/years/" URL = paste(games.URL.stem, year, "/games.htm", sep="") games = readHTMLTable(URL) dfSeason = games[[1]] # Clean up the df dfSeason = subset(dfSeason, Week!="Week") dfSeason = subset(dfSeason, Week!="") dfSeason$Date = as.character(dfSeason$Date) dfSeason$GameDate = mdy(paste(dfSeason$Date, year), quiet=T) year(dfSeason$GameDate) = with(dfSeason, ifelse(month(GameDate) <=6, year(GameDate)+1, year(GameDate))) dfSeason = dfSeason[,c(14, 1, 5, 7, 8, 9)] colnames(dfSeason) = c("GameDate", "Week", "Winner", "Loser", "WinnerPoints", "LoserPoints") dfSeason$Winner = as.character(dfSeason$Winner) dfSeason$Loser = as.character(dfSeason$Loser) dfSeason$WinnerPoints = as.integer(as.character(dfSeason$WinnerPoints)) dfSeason$LoserPoints = as.integer(as.character(dfSeason$LoserPoints)) dfSeason$ScoreDifference = dfSeason$WinnerPoints - dfSeason$LoserPoints dfSeason = subset(dfSeason, !is.na(ScoreDifference)) return(dfSeason) }

This means that I may now lapply to get a set of results for many seasons. However, this still isn’t quite what I want. What I’m after is a set of results for an individual team. To do this, I created a function which will zip through the games data to pull out results for a single team. The games data is structured such that a team could appear in either the winner or loser column. I’ve handled this by making two passes through the dataframe. The first pass picks up all the times that the team won, the second picks up all the times the team lost. I then just bind the winning and losing dataframes together.

BuildTeamData = function(Team, dfGamesData) { dfWinner = dfGamesData[which(dfGamesData$Winner == Team), c("GameDate", "Week", "Loser", "ScoreDifference")] dfLoser = dfGamesData[which(dfGamesData$Loser == Team), c("GameDate", "Week", "Winner", "ScoreDifference")] dfLoser$ScoreDifference = -dfLoser$ScoreDifference colnames(dfWinner) = c("GameDate", "Week", "OpposingTeam", "ScoreDifference") colnames(dfLoser) = c("GameDate", "Week", "OpposingTeam", "ScoreDifference") dfTeam = rbind(dfWinner, dfLoser) dfTeam$ThisTeam = Team dfTeam = dfTeam[order(dfTeam$GameDate),] return (dfTeam) }

The result is a dataframe with twice as many rows as the original set of game data (each winner and loser get their own rows). However, it winds up being an intuitive way to view the results for an individual team. I can now take a moving average of the point differential to use as a predictor of whether or not they’re likely to win. The algorithm is (at the moment) incredibly simple. If a teams average point differential is larger than their opponent, I presume that they will win.

Does something this straightforward work? Well, sort of. I used a period of between 9 and 16 prior games, also monkeyed around with capping the absolute value of point differential and applied an exponential smoother so that more recent games would get more weight. The result? If you take the average point differential for the past 10 games and compare, you’ll forecast the winner about 62% of the time. This is hardly something to get excited about, but it’s reliably better than a coin toss. I’m struck by the fact that smoothing and point capping don’t improve the model all that much (results posted at some later time). My intuition is that recent performance takes personnel changes and team morale into account and that a blowout of a weak opponent ought to count for little in predicting the result of a different game. Perhaps these things do matter when other elements of data are introduced, but at present, it doesn’t influence the algorithm all that much.

I’ve used this method for two weeks and I’ve predicted the winner about 60% of the time (again actual results to appear in a later post). Given how little opportunity I have to keep up with players and what’s going on, this is high enough to keep things interesting for me. I’ve downloaded the NFL mobile app and look forward to exploring what other information can improve this method. Next step is to see whether offensive and defensive stats are at all useful in predicting a result.

I can’t imagine anyone is reading this, but there’s loads more code to generate the moving averages and test the results of the model. It’s available to anyone who’d care to see it.

**leave a comment**for the author, please follow the link and comment on his blog:

**PirateGrunt » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...