Build a Trump vs Biden Prediction Model With R From Scratch
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let’s predict the 2020 presidential election!
Get the full code on my GitHub page
Follow me on Twitter for model updates
Building a basic prediction model for the 2020 general election between Trump and Biden is fairly simple. All we need is an estimate of each candidate’s state-by-state polling average and polling standard deviation, which we then feed into a basic Monte Carlo simulation.
This simulation is currently tracking very closely with Nate Silver’s 538 model. As of this writing, my model forecasts a 70% chance that Biden wins the election, which is in lockstep with the 538 forecast model.
To build this model, first we focus on writing a dynamic scraping program that pulls in all of the current state polling data from RealClearPolitics. Then we use the state polling data to simulate 10,000 elections. The winner of each trial is determined by which candidate has the most electoral votes.
This is an advanced beginner data science project. All code is written in the R programming language and I make heavy use of the Tidyverse set of packages. Check out my Basic Data Manipulation in R video for a quick refresher on the main Tidyverse packages I use.
Get the States and Electoral Votes
First we need a list of every state and its electoral votes. At the end of this section we’ll have the following two-column dataframe:
Note: There will be more than 50 rows in the State column. Maine and Nebraska do not necessarily allocate all of their votes to a single candidate. Maine, for example, allocates two electoral votes to the winner of the state, one vote for the winner of Maine Congressional District 1 (Maine CD1), and one vote for the winner of Maine Congressional District 2.
So for simplicity, when I reference ‘state’ or ‘states’, it refers to the 50 actual states along with Maine CD1, Maine CD2, Nebraska CD1, and Washington DC.
The only place RealClearPolitics lists all of the states and their electoral votes is on their 2020 Electoral College Map.
There is a section at the top that lists states that are toss ups or are leaning in one direction. For simplicity, I’ll call these the toss up states.
At the bottom of the page there are two tables that list states that are solid Trump or solid Biden. I’ll call these solid states.
The html behind this page is somewhat complicated, so we’ll have to scrape the toss up states and the solid states separately.
To start, we’ll load the raw html from the page into our R session in a variable called votes_html using the read_html() function from the rvest package:
All the toss up states are in between the html tag and :
To extract everything between these tags we’ll use a regex statement in the str_extract_all() function from the stringr package.
Toss_ups is now a vector of states and their corresponding electoral votes.
To get the Solid_States we can use the html_nodes() and html_table() functions from the rvest package.
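A minimal sketch of this table-scraping pattern, assuming a tiny hypothetical page with two tables in place of the real RealClearPolitics html (the markup and the all_tables name are illustrative only):

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: two small one-column tables.
page <- read_html('
  <html><body>
    <table>
      <tr><th>Solid Trump</th></tr>
      <tr><td>Alabama (9)</td></tr>
      <tr><td>Wyoming (3)</td></tr>
    </table>
    <table>
      <tr><th>Solid Biden</th></tr>
      <tr><td>California (55)</td></tr>
      <tr><td>Vermont (3)</td></tr>
    </table>
  </body></html>')

# html_nodes() selects every <table>; html_table() parses each into a data frame.
all_tables <- page %>% html_nodes("table") %>% html_table()

# The result is a list, so individual tables and columns are picked by index.
first_column_of_second_table <- all_tables[[2]][[1]]
first_column_of_second_table
```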
This function grabs all of the html tables from the web page, but the relevant tables for the solid states are in the following indices:
For example, Solid_States[][] corresponds to the first column in the solid states tables:
Now combine the Toss_ups and the Solid_States into one vector called Electoral_Votes.
There are still a few blanks in the Electoral_Votes vector:
To get rid of them we’ll use str_detect():
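A minimal sketch of this filtering step, assuming the blanks are empty or whitespace-only strings (the sample vector is made up for illustration):

```r
library(stringr)

# Illustrative vector with the kind of blank entries described above.
Electoral_Votes <- c("Florida (29)", "", "Texas (38)", " ", "Ohio (18)")

# Keep only the elements that contain at least one letter or digit.
Electoral_Votes <- Electoral_Votes[str_detect(Electoral_Votes, "[[:alnum:]]")]
Electoral_Votes
```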
Now we convert the Electoral_Votes vector into a two column dataframe. Column one will contain the states and column two will contain the electoral votes.
To do this, we will split each element of Electoral_Votes by removing the close parenthesis and splitting by the space and open parenthesis string.
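One way to sketch this split, assuming each element looks like "State (votes)"; the State_Votes name and the sample values are illustrative, not necessarily what the original code uses:

```r
library(stringr)

# Illustrative entries in the "State (votes)" shape described above.
Electoral_Votes <- c("Florida (29)", "Texas (38)", "Maine CD2 (1)")

# Drop the trailing ")" and then split on the " (" separator into two columns.
cleaned <- str_remove(Electoral_Votes, "\\)$")
parts <- str_split_fixed(cleaned, " \\(", 2)

State_Votes <- data.frame(
  State = parts[, 1],
  Electoral_Votes = as.integer(parts[, 2]),
  stringsAsFactors = FALSE
)
State_Votes
```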
Get the 2020 State Data
For each state we need to calculate the average polling spread between Trump and Biden and the standard deviation of that spread. In this section we’ll build a dataframe that captures this data:
Note: The spread is the model’s way of capturing which way the polls are leaning with only one number. For each state I use the Trump polling average minus the Biden polling average. Therefore, a state with a positive spread indicates Trump is winning and a state with a negative spread indicates Biden is winning.
For example, to indicate that Biden is winning Pennsylvania by 5.5% we would represent that in the spread column with -5.5.
Each state has its own page and unique link with polling data from that state.
To get all the 2020 state links, we’ll need to scrape them from the 2020 Electoral College Map page. We’ll store the raw html data in a variable called Summary_2020.
Select the data between the html link tags (href=” and >”) and select only the links that contain the string “trump_vs_biden”:
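A minimal sketch of this link extraction, run on a short hypothetical html string rather than the real page (the paths and numeric ids below are placeholders):

```r
library(stringr)

# Hypothetical snippet of page html; only the href pattern matters here.
page_text <- paste0(
  '<a href="/epolls/2020/president/pa/pennsylvania_trump_vs_biden-1111.html">PA</a>',
  '<a href="/about.html">About</a>',
  '<a href="/epolls/2020/president/fl/florida_trump_vs_biden-2222.html">FL</a>'
)

# Grab everything between href=" and the closing quote.
links <- str_extract_all(page_text, '(?<=href=")[^"]+(?=")')[[1]]

# Keep only the state polling links.
links <- links[str_detect(links, "trump_vs_biden")]
links
```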
We now have a list of links for all the states currently available. The actual name of the state is embedded in the link. We need to break the links apart to extract the state name. It’s important we have a column with a clean set of state names because we are building a number of datasets that will need to be combined before we can build the model. The state name is the key to combining all the datasets.
Note: Some states might not have polling yet so we’ll have to supplement with national polling numbers and 2016 actual numbers which we’ll tackle in the next sections.
First let’s clean up the links:
We’ll split Summary_2020 by “/” using str_split_fixed(). This will give us three columns, which we can rename to Abbrev, state_id, and id.
Next, we’ll use the mutate() function to create two new columns, State and Link, in the Summary_2020 dataframe.
To generate the state names we need to remove all of the text that is not the state name from state_id.
Note: Currently the polls are only including Trump and Biden, but it’s likely they’ll include more third party candidates in the near future. We’ll be proactive and replace Jo Jorgensen (Libertarian) and Howie Hawkins (Green) in the state_id.
*** If anyone else is added to any of the state polls, their name will need to be removed from state_id or the code will break.
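A sketch of the state-name cleanup, assuming state_id values shaped like "pennsylvania_trump_vs_biden" with optional "_vs_jorgensen" or "_vs_hawkins" suffixes (the exact shape of the real ids is an assumption):

```r
library(stringr)

# Illustrative state_id values, including third-party variants.
state_id <- c(
  "pennsylvania_trump_vs_biden",
  "north_carolina_trump_vs_biden_vs_jorgensen_vs_hawkins",
  "new_hampshire_trump_vs_biden_vs_jorgensen"
)

State <- str_remove_all(state_id, "_vs_(biden|jorgensen|hawkins)")  # drop opponents
State <- str_remove(State, "_trump$")                                # drop Trump
State <- str_replace_all(State, "_", " ")                            # underscores to spaces
State <- str_to_title(State)                                         # capitalize each word
State
```

Any candidate not listed in the pattern would be left in the name, which is why a new entrant has to be added to the pattern by hand.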
Using Pennsylvania as an example, we can now scrape the polling data using the rvest package and the links we’ve generated. The following code pulls all the tables on the page for the respective link. The last table always has the polling data we need (in this case it’s the 4th).
Note: All of the state polls start with “RCP Average”. We’ll need to remove these entries in the next section because we’ll calculate our own spread. The catch is that there is an encoding issue with all of these entries, which makes them difficult to remove from the dataset.
For example, if we subset State_Polls_2020_PA[][1,1] we can see that the value is “RCP Average”, however we can see in the code below that State_Polls_2020_PA[][1,1] does not equal “RCP Average”.
This is going to cause issues when we try to subset. To deal with this we are going to store the correct value in a variable called rcp_average.
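The source does not say what the encoding issue is. As an illustration only, the sketch below assumes the scraped label contains a non-breaking space, which prints like a normal space but does not compare equal to one:

```r
# Assumption: the scraped label uses a non-breaking space (U+00A0).
scraped_label <- "RCP\u00a0Average"

# Looks the same when printed, but the comparison fails.
looks_equal <- scraped_label == "RCP Average"

# Storing the scraped value itself gives us something reliable to filter on.
rcp_average <- scraped_label

polls <- data.frame(
  Poll = c(scraped_label, "Example Poll A"),  # hypothetical poll names
  Spread = c(-5.5, -6.0),
  stringsAsFactors = FALSE
)
polls_without_rcp <- polls[polls$Poll != rcp_average, ]
polls_without_rcp
```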
Next, we’ll loop through the Link column and save the polling data into a dataframe called State_Polls_2020.
First we need to create the State_Polls_2020 dataframe. Along with the six columns in the RealClearPolitics polling table (which are fairly self-explanatory), we’ll add another column called Rank, which orders the polls from most recent to least recent. Some of the polls are really old, so we’ll use the Rank column to make sure all the polls going into our state averages are relevant.
Let’s split the “RCP Average” rows out from the State_Polls_2020 dataframe and into a new dataframe called RCP_Averages_2020 using the filter() function from the dplyr package.
The last step is to calculate the summary data based on the state polling:
To wrap up the 2020 state polling data we need to calculate the average and standard deviation for each state. To do that we’ll use the group_by() and summarize() functions from the dplyr package.
For the average we’ll only take the five most recent polls for each state, and for the standard deviation we’ll use all polls.
Finally, we’ll combine the average and standard deviation datasets into State_Summary_2020 using the left_join() function from the dplyr package.
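A minimal sketch of this summary step on made-up polling numbers; the column names Spread, Avg_2020 and Sd_2020 and the intermediate table names are assumptions:

```r
library(dplyr)

# Made-up polls: Rank 1 is the most recent poll in each state.
State_Polls_2020 <- tibble(
  State  = c(rep("Pennsylvania", 6), rep("Florida", 3)),
  Rank   = c(1:6, 1:3),
  Spread = c(-6, -5, -7, -4, -8, 2, -3, 1, -4)
)

# Average of the five most recent polls per state.
State_Avg_2020 <- State_Polls_2020 %>%
  filter(Rank <= 5) %>%
  group_by(State) %>%
  summarize(Avg_2020 = mean(Spread), .groups = "drop")

# Standard deviation across all polls per state.
State_Sd_2020 <- State_Polls_2020 %>%
  group_by(State) %>%
  summarize(Sd_2020 = sd(Spread), .groups = "drop")

State_Summary_2020 <- left_join(State_Avg_2020, State_Sd_2020, by = "State")
State_Summary_2020
```

Note that a state with a single poll gets an NA standard deviation from sd(), which is the gap the national standard deviation fills later on.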
Get the 2020 National Data
The national data follows the same format as the state polling data. We’ll use the national data to supplement the state data since the national data is updated more frequently.
Get the 2016 Data
We’ll also supplement the 2020 data with the actual 2016 results, and we’ll use the 2016 national data to calculate the polling bias in 2016.
To get the data we’ll copy the 2020 code and make a few changes, which are commented in the code and outlined below:
- Rename every variable containing 2020 to 2016
- Update the Electoral College Map link and the General Election link
- Replace Biden with Clinton
- Replace the third-party candidates with Johnson, Stein, and McMullin
- Use the actual spread for each state by filtering rows where the poll is “Final Results”
Build the Forecast Model
First we need to create a master dataset called forecast_data where we combine the dataframes that we just built using left_join() from the dplyr package.
Many of the states do not have polling yet. To deal with this we’re going to supplement the 2020 polls with the 2016 actual results. To try to get a better estimate of where these states may actually be polling we’ll apply an adjustment to the 2016 actual results based on the 2020 national polling.
For example, as of this writing we don’t have 2020 polling data for Illinois. However, we know that in 2016 the spread was Clinton +16.0%. We also know that the national spread is currently Biden +6.9%, and we know that the final polling spread in 2016 was Clinton +3.2%. So we’ll estimate that Biden is outperforming Clinton by +3.7% and we’ll add that back to the 2016 Illinois spread. This brings our estimated spread for Biden up to +19.7%.
To make the spread adjustment in forecast_data, create a column called National_Adj which contains the national spread between Biden and Clinton. We’ll also add a National_SD column, which we’ll use when there is no standard deviation data available from 2016 or 2020 (this may happen if there was only one poll taken during the course of the 2016 or 2020 election).
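The Illinois arithmetic can be checked in a few lines, using the Trump-minus-Democrat sign convention from earlier (so a Biden or Clinton lead is negative). The variable names other than National_Adj are illustrative:

```r
# Spreads are Trump minus the Democratic candidate, so Democratic leads are negative.
national_2020 <- -6.9   # Biden +6.9 nationally
national_2016 <- -3.2   # Clinton +3.2 in the final 2016 national polling
illinois_2016 <- -16.0  # Clinton +16.0 in Illinois

# How much better Biden is polling than Clinton did, nationally.
National_Adj <- national_2020 - national_2016

# Shift the 2016 Illinois result by that national swing.
illinois_adjusted <- illinois_2016 + National_Adj

round(National_Adj, 1)       # Biden outperforming Clinton by 3.7 points
round(illinois_adjusted, 1)  # estimated Illinois spread of Biden +19.7
```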
The Spread and Sd columns will be used to run the actual simulation:
- If 2020 polls are not available, Spread will be the adjusted 2016 spread.
- If 2020 polls are available, Spread will be half the 2020 spread and half the adjusted 2016 spread.
- If no standard deviation is available from 2016 or 2020, Sd will be National_SD.
- If the 2020 standard deviation is available, Sd will be half the 2020 standard deviation and half the 2016 standard deviation.
- If only the 2016 standard deviation is available, Sd will simply be the 2016 standard deviation.
We’ll use the mutate() and case_when() functions from the dplyr package to create the columns (case_when() is similar to a switch statement in other programming languages).
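A sketch of these rules with mutate() and case_when(), on made-up numbers. The input column names (Spread_2020, Sd_2020, Spread_2016, Sd_2016), the Adj_2016 helper column, and the fallback for a state with a 2020 but no 2016 standard deviation are assumptions not spelled out in the text:

```r
library(dplyr)

National_Adj <- -3.7  # from the national polling comparison
National_SD  <- 3     # hypothetical national standard deviation

# Made-up inputs: one state with 2020 polls, two without.
forecast_data <- tibble(
  State       = c("Pennsylvania", "Illinois", "Wyoming"),
  Spread_2020 = c(-6, NA, NA),
  Sd_2020     = c(3, NA, NA),
  Spread_2016 = c(1, -16, 40),
  Sd_2016     = c(4, 5, NA)
)

forecast_data <- forecast_data %>%
  mutate(
    Adj_2016 = Spread_2016 + National_Adj,
    Spread = case_when(
      is.na(Spread_2020) ~ Adj_2016,
      TRUE               ~ 0.5 * Spread_2020 + 0.5 * Adj_2016
    ),
    Sd = case_when(
      !is.na(Sd_2020) & !is.na(Sd_2016) ~ 0.5 * Sd_2020 + 0.5 * Sd_2016,
      !is.na(Sd_2020)                   ~ Sd_2020,  # assumption: case not covered in the text
      !is.na(Sd_2016)                   ~ Sd_2016,
      TRUE                              ~ National_SD
    )
  )
forecast_data
```

case_when() checks the conditions top to bottom and uses the first one that matches, so the order of the rules matters.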
Now we need to set up our simulation.
The variable n will hold the number of trials we want to run. I’ve set it to 10,000 which should be more than enough.
And we’ll create a matrix called results_matrix which has our trials in the columns and our states in the rows.
Instead of simulating the election results with a normal distribution, we’ll use a Johnson distribution with a delta of 0.5, which puts more probability in the tails. Using a distribution with fatter tails is something 538 recommends, because there tends to be more uncertainty in elections where politicization is high and because Donald Trump is a highly unpredictable politician. I suspect that as we move closer to the election, when there is less time for an unpredictable event, 538 moves closer to a normal distribution (delta = 1).
Dist describes the shape of our distribution. The delta of 0.5 is what gives the distribution fatter tails compared to the normal distribution.
We’ll use the rJohnson() function from the SuppDists package to randomly generate 10,000 (or whatever value you’ve chosen for n) trials from this distribution and save the results into a vector called dist_multiplier. These values represent how many standard deviations each trial will be from the mean.
Next, for every state and every trial we’ll take the Spread and add dist_multiplier * Sd, and this will give us our simulated results.
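A rough sketch of the simulation step in base R. Instead of rJohnson(), it draws from a Johnson SU distribution by transforming standard normal draws; the SU family and the gamma, xi and lambda values are assumptions, since only delta = 0.5 is given above. The state numbers other than New Jersey are made up:

```r
set.seed(2020)
n <- 10000

# Made-up forecast inputs (New Jersey values match the worked example in the text).
forecast_data <- data.frame(
  State  = c("New Jersey", "Florida", "Texas"),
  Spread = c(-17.61, -1.5, 4.0),
  Sd     = c(5.029, 4.0, 4.5)
)

# Johnson SU: if Z is standard normal, xi + lambda * sinh((Z - gamma) / delta)
# follows a Johnson SU distribution. Only delta comes from the article.
gamma <- 0; delta <- 0.5; xi <- 0; lambda <- 1
dist_multiplier <- xi + lambda * sinh((rnorm(n) - gamma) / delta)

# Rows are states, columns are trials: Spread[i] + dist_multiplier[j] * Sd[i].
results_matrix <- forecast_data$Spread + outer(forecast_data$Sd, dist_multiplier)
rownames(results_matrix) <- forecast_data$State
dim(results_matrix)
```

Each trial applies the same multiplier to every state, so a polling miss in one state shows up in all of them at once.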
Let’s recap what is happening here.
We started with the results_matrix:
Basically we’re putting all of the Spread data into each column:
We randomly generate 10,000 multipliers from the distribution we created. Each one says how many standard deviations that trial moves away from the mean.
In the above screenshot, the first trial has a multiplier of 1.305 standard deviations. Since the number is positive, this means that Trump outperformed in this trial. For each state, we’ll add a 1.305 standard deviation move in Trump’s direction to the spread.
In practice, New Jersey has a -17.61% spread (negative means it favors Biden). We can look to the forecast_data dataframe and we can see that the Sd for New Jersey is 5.029. To get the winner for New Jersey for the first trial, we would add 1.305 * 5.029 = 6.56 to the Spread. This gives us a final result for the first trial of -17.61 + 6.56 = -11.05%. Since the spread is negative, Biden would still win New Jersey in the first trial despite the fact that Trump outperforms.
This is exactly what we get if we look at the results_matrix.
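The New Jersey arithmetic for the first trial, reproduced in R with the numbers quoted above (the variable names are illustrative):

```r
spread_nj  <- -17.61  # New Jersey Spread (negative favors Biden)
sd_nj      <- 5.029   # New Jersey Sd
multiplier <- 1.305   # first trial's draw, in standard deviations

shift  <- multiplier * sd_nj  # move toward Trump, in points
result <- spread_nj + shift   # simulated spread for trial 1

round(shift, 2)   # 6.56
round(result, 2)  # -11.05, still negative, so Biden wins the state
```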
Next, we’ll create two new matrices trump_wins and biden_wins. If Trump wins a state in one of the trials the spread will be positive. In trump_wins we’ll replace the positives with a 1 and the negatives with a 0. We’ll do the opposite for biden_wins.
For each of these matrices we’ll sum across rows using the apply() function and divide the total by n. This will give us the win probability for each candidate by state. We’ll then add these win probabilities back to the main forecast_data dataframe.
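A small sketch of the win matrices and per-state win probabilities, using a made-up results_matrix with two states and four trials (the trump_prob and biden_prob names are illustrative):

```r
# Made-up simulated spreads: rows are states, columns are trials.
results_matrix <- matrix(
  c(-11.05, -20.0, -5.2, -14.3,
      1.2,   -0.8,  2.5,   0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("New Jersey", "Florida"), NULL)
)
n <- ncol(results_matrix)

# 1 where the candidate wins the state in that trial, 0 otherwise.
trump_wins <- ifelse(results_matrix > 0, 1, 0)
biden_wins <- ifelse(results_matrix < 0, 1, 0)

# Share of trials each candidate wins, state by state.
trump_prob <- apply(trump_wins, 1, sum) / n
biden_prob <- apply(biden_wins, 1, sum) / n
trump_prob
biden_prob
```

A spread of exactly zero counts for neither candidate here, which is vanishingly rare with continuous draws.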
We now have a 0/1 matrix for Trump and one for Biden, where 1 represents the states they won and 0 represents the states they lost in a given trial.
We can multiply each state’s row by its electoral votes and sum down each column, and that will give us the total electoral votes won by each candidate during each trial.
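A minimal sketch of the electoral vote totals, assuming three states and three trials with made-up outcomes (electoral_votes is an illustrative name for the vote vector, in the same order as the matrix rows):

```r
# Made-up outcomes: rows are states, columns are trials, 1 means Trump won.
trump_wins <- matrix(
  c(0, 0, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("New Jersey", "Florida", "Texas"), NULL)
)
biden_wins <- 1 - trump_wins

# Electoral votes in the same row order as the matrices.
electoral_votes <- c(14, 29, 38)

# Multiplying recycles the vote vector down each column, so every state's row
# is scaled by its votes; colSums() then totals the votes for each trial.
trump_votes <- colSums(trump_wins * electoral_votes)
biden_votes <- colSums(biden_wins * electoral_votes)
trump_votes
biden_votes
```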
We can now create two new vectors, trump_votes and biden_votes, which will contain each candidate’s electoral votes for each trial. We’ll create the results dataframe to combine both vectors. Then we’ll use the mutate() function to create a winner column to store the winner of each trial.
The final step is to count how many wins each candidate has by using the group_by() and tally() functions. We’ll then use one last mutate() function to calculate the win percentage for each candidate.
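A sketch of the final tally on five made-up trials. The "Tie" label for a 269-269 split and the win_pct and pct names are assumptions; the text only describes picking the candidate with the most electoral votes:

```r
library(dplyr)

# Made-up electoral vote totals for five trials (538 votes in total).
trump_votes <- c(232, 306, 269, 250, 180)
biden_votes <- 538 - trump_votes

results <- tibble(trump_votes, biden_votes) %>%
  mutate(winner = case_when(
    trump_votes > biden_votes ~ "Trump",
    biden_votes > trump_votes ~ "Biden",
    TRUE                      ~ "Tie"   # assumption: ties are reported separately
  ))

# Count wins per candidate and convert to a share of all trials.
win_pct <- results %>%
  group_by(winner) %>%
  tally() %>%
  mutate(pct = n / sum(n))
win_pct
```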
To display the results in the console we can run the following code:
At the time of this writing, our model is forecasting a 70% chance that Biden will win the 2020 election, which tracks very closely with the 538 forecast model.
Build a Trump vs Biden Prediction Model With R From Scratch was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.