# Build a Trump vs Biden Prediction Model With R From Scratch

**Stories by Matt C on Medium**, kindly contributed to R-bloggers.


Let’s predict the 2020 presidential election!

___

Get the full code on my GitHub page

Follow me on Twitter for model updates

___

Creating a simple prediction model for the 2020 general election between Trump and Biden is fairly straightforward. All we need is each candidate’s state-by-state **average polling performance** and **polling standard deviation** to drive a basic **Monte Carlo** simulation.

**This simulation is currently tracking very closely with Nate Silver’s 538 model.** As of this writing, my model forecasts a 70% chance Biden wins the election, which is in lockstep with the 538 forecast model.

To build this model, first we focus on writing a dynamic scraping program that pulls in all of the current state polling data from **RealClearPolitics**. Then we use the state polling data to simulate 10,000 elections. The winner of each trial is determined by which candidate wins the most electoral votes.

This is an advanced beginner data science project. All code is written in the **R programming language** and I make heavy use of the Tidyverse set of packages. Check out my Basic Data Manipulation in R video for a quick refresher on the main Tidyverse packages I use.

**Required Packages**
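The original package list isn’t shown here, but based on the packages named throughout the post, a minimal setup would look like this:

```r
# tidyverse bundles dplyr, stringr, tibble, and friends;
# rvest handles the scraping; SuppDists provides the Johnson
# distribution used later in the simulation.
library(tidyverse)
library(rvest)
library(SuppDists)
```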

**Get the States and Electoral Votes**

First we need a list of each state and their electoral votes. At the end of this section we’ll have the following two-column dataframe:

___

**Note:** There will be more than 50 rows in the **State** column. Maine and Nebraska do not necessarily allocate all of their votes to a single candidate. Maine, for example, allocates two electoral votes to the winner of the state, one vote for the winner of Maine Congressional District 1 (Maine CD1), and one vote for the winner of Maine Congressional District 2.

So for simplicity, when I reference ‘state’ or ‘states’, it refers to the 50 actual states along with Maine CD1, Maine CD2, Nebraska CD1, and Washington DC.

___

The only place RealClearPolitics lists all of the states and their electoral votes is on their **2020 Electoral College Map**.

There is a section at the top that lists states that are toss ups or are leaning in one direction. For simplicity, I’ll call these the **toss up states**.

At the bottom of the page there are two tables that list states that are solid Trump or solid Biden. I’ll call these **solid states**.

The html behind this page is somewhat complicated, so we’ll have to scrape the **toss up states** and the **solid states** separately.

To start we’ll load the raw html from the page into our R session in a variable called **votes_html** using the **read_html()** function from the **rvest** package:
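A sketch of that step (the URL is illustrative; use whatever address the RCP Electoral College Map page currently has):

```r
# Illustrative URL -- substitute the current RCP Electoral College Map address
map_url <- paste0("https://www.realclearpolitics.com/epolls/2020/",
                  "president/2020_elections_electoral_college_map.html")

# Read the page and flatten it to one big string so we can run
# regular expressions over it (needs the rvest package when called)
get_page_source <- function(url) toString(rvest::read_html(url))

# votes_html <- get_page_source(map_url)   # requires network access
```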

All the toss up states sit between a fixed pair of html tags in the page source:

To extract everything between these tags we’ll use a regex statement in the **str_extract_all()** function from the **stringr** package.
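Since the exact tags aren’t reproduced here, this sketch uses stand-in `<begin>`/`<end>` markers; swap in the real tags you find in the page source:

```r
library(stringr)

# Stand-in for the relevant slice of the page source
votes_html <- "<begin>Florida (29)Pennsylvania (20)Maine CD2 (1)<end>"

# Grab everything between the two markers...
inner <- str_extract(votes_html, "(?<=<begin>).*(?=<end>)")

# ...then pull out each "State (votes)" chunk
Toss_ups <- str_extract_all(inner, "[A-Za-z][A-Za-z0-9 .]*? \\(\\d+\\)")[[1]]
```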

**Toss_ups** is now a vector of states and their corresponding electoral votes.

To get the **Solid_States** we can use the **html_nodes()** and **html_table()** functions from the **rvest** package.

This function grabs *all* of the html tables from the web page, but the relevant tables for the solid states are in the following indices:

For example, **Solid_States[[16]][[1]]** corresponds to the first column in the solid states tables:
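Here’s the pattern on a miniature stand-in table (on the live page you’d call `read_html()` on the map URL instead, and the `[[16]]` index is specific to RCP’s layout):

```r
library(rvest)

# Miniature stand-in for one of RCP's "solid states" tables
page <- minimal_html(
  "<table><tr><td>Alabama (9)</td></tr><tr><td>Arkansas (6)</td></tr></table>")

# html_table() converts every scraped <table> into a dataframe
Solid_States <- page %>% html_nodes("table") %>% html_table(header = FALSE)

Solid_States[[1]][[1]]   # first column of the first table
```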

Now combine **Toss_ups** and **Solid_States** into one vector called **Electoral_Votes**.

There are still a few blanks in the **Electoral_Votes** vector:

To get rid of them we’ll use **str_detect()**:
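A minimal sketch of that cleanup (toy data in place of the scraped vector):

```r
library(stringr)

Electoral_Votes <- c("Florida (29)", "", "Pennsylvania (20)", "")

# Keep only entries that actually contain a digit (the vote count)
Electoral_Votes <- Electoral_Votes[str_detect(Electoral_Votes, "\\d")]
```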

Now we convert the **Electoral_Votes** vector into a two column dataframe. Column one will contain the states and column two the electoral votes.

To do this, we will split each element of **Electoral_Votes** by removing the close parenthesis and splitting by the space and open parenthesis string.
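Sketch of that split on a few example entries (column names are assumptions):

```r
library(stringr)
library(tibble)

Electoral_Votes <- c("Florida (29)", "Pennsylvania (20)", "Maine CD2 (1)")

# Drop the ")" then split on " (" -- str_split_fixed returns a matrix
parts <- str_split_fixed(str_remove(Electoral_Votes, "\\)"), " \\(", 2)

Electoral_Votes_df <- tibble(State = parts[, 1],
                             Votes = as.numeric(parts[, 2]))
```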

**Get the 2020 State Data**

For each state we need to calculate average polling spread between Trump and Biden and the standard deviation of that spread. In this section we’ll build a dataframe that captures this data:

___

**Note:** The spread is the model’s way of capturing which way the polls are leaning with a single number. For each state I use the Trump polling average minus the Biden polling average. Therefore, a state with a **positive spread indicates Trump** is winning and a state with a **negative spread indicates Biden** is winning.

For example, to indicate that Biden is winning Pennsylvania by 5.5% we would represent that in the spread column with -5.5.

___

Each state has its own page and unique link with polling data for that state.

To get all the 2020 state links, we’ll need to scrape them from the **2020 Electoral College Map** page. We’ll store the raw html data in a variable called **Summary_2020**.

Select the data between the html link tags (**href="** and **">**) and keep only the links that contain the string “trump_vs_biden”:
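A sketch using a tiny stand-in for the page source (the poll id `6861` in the link is hypothetical):

```r
library(stringr)

# Stand-in page source with one state link and one unrelated link
Summary_2020 <- paste0(
  '<a href="/epolls/2020/president/pa/pennsylvania_trump_vs_biden-6861.html">',
  'PA</a> <a href="/about.html">About</a>')

# Everything between href=" and the closing quote...
links <- str_extract_all(Summary_2020, '(?<=href=")[^"]+')[[1]]

# ...keeping only the state polling pages
state_links <- str_subset(links, "trump_vs_biden")
```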

We now have a list of links for all the states currently available. The actual name of the state is embedded in the link. We need to break the links apart to extract the state name. It’s important we have a column with a clean set of state names because we are building a number of datasets that will need to be combined before we can build the model. The state name is the key to combining all the datasets.

___

**Note:** Some states might not have polling yet, so we’ll have to supplement with national polling numbers and 2016 actual numbers, which we’ll tackle in the next sections.

___

First let’s clean up the links:

We’ll split **Summary_2020** by “/” using **str_split_fixed()**. This will give us three columns, which we’ll rename **Abbrev**, **state_id**, and **id**.

Next, we’ll use the **mutate()** function to create two new columns, **State** and **Link**, in the **Summary_2020** dataframe.

To generate the state names we need to remove all of the text that is not the state name from **state_id**.
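One way to sketch those two steps (the link and poll id are illustrative, and the split is simplified to two pieces):

```r
library(dplyr)
library(stringr)
library(tibble)

links <- "pa/pennsylvania_trump_vs_biden-6861.html"

# Split the path into its abbreviation and page-name pieces
parts <- str_split_fixed(links, "/", 2)

Summary_2020 <- tibble(Abbrev = parts[, 1], state_id = parts[, 2]) %>%
  mutate(
    # Strip the candidate names and trailing poll id, leaving the state name
    State = str_to_title(str_replace_all(
      str_remove(state_id, "_trump_vs_biden.*$"), "_", " ")),
    Link = str_c("https://www.realclearpolitics.com/epolls/2020/president/",
                 Abbrev, "/", state_id)
  )
```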

___

**Note:** Currently the polls only include Trump and Biden, but it’s likely they’ll include more third party candidates in the near future. We’ll be proactive and remove Jo Jorgensen (Libertarian) and Howie Hawkins (Green) from **state_id**.

***** If anyone else is added to any of the state polls, their name will need to be removed from **state_id** or the code will break.**

___

Using Pennsylvania as an example, we can now scrape the polling data by using the **rvest** package with the links we’ve generated. The following code pulls *all* the tables on the page for the respective link; the last table always holds the polling data we need (in this case it’s the 4th).

___

**Note:** All of the state polls start with “RCP Average”. We’ll need to remove these entries in the next section because we’ll calculate our own spread. The catch is that there is an encoding issue in all of these entries which makes them difficult to remove from the dataset.

For example, if we subset **State_Polls_2020_PA[[4]][1,1]** the printed value looks like “RCP Average”; however, the code below shows that **State_Polls_2020_PA[[4]][1,1]** does not equal the string “RCP Average”.

This is going to cause issues when we try to subset. To deal with this we are going to store the correct value in a variable called **rcp_average**.
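My guess (an assumption, not stated in the post) is that the culprit is a non-breaking space; either way, the fix is to capture the exact value once and compare against that:

```r
# Assumption: the label uses a non-breaking space (\u00a0), which prints
# like "RCP Average" but fails a plain string comparison
rcp_average <- "RCP\u00a0Average"

rcp_average == "RCP Average"   # FALSE -- the "spaces" are different characters
```

In the real code you’d assign `rcp_average <- State_Polls_2020_PA[[4]][1,1]` so the stored value matches whatever encoding the page actually uses.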

___

Next, we’ll loop through the **Link** column and save the polling data into a dataframe called **State_Polls_2020**.

First we need to create the **State_Polls_2020** dataframe. Along with the six columns in the RealClearPolitics polling table (which are fairly self-explanatory), we’ll add another column called **Rank**, which orders the polls from most recent to least recent. Some of the polls are really old, so we’ll use the **Rank** column to make sure all the polls going into our state averages are relevant.
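A sketch of the loop (column handling simplified; `Rank` follows the row order on RCP’s pages, most recent first):

```r
# For each state link, grab the last table on the page (the polling table)
# and stack it with State and Rank columns.
# Needs the rvest and dplyr packages when actually called.
scrape_state_polls <- function(links, states) {
  all_polls <- list()
  for (i in seq_along(links)) {
    page   <- rvest::read_html(links[i])
    tables <- rvest::html_table(rvest::html_nodes(page, "table"))
    polls  <- tables[[length(tables)]]     # last table holds the polls
    polls$State <- states[i]
    polls$Rank  <- seq_len(nrow(polls))    # 1 = most recent poll
    all_polls[[i]] <- polls
  }
  dplyr::bind_rows(all_polls)
}
```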

Let’s split the “RCP Average” rows out of the **State_Polls_2020** dataframe and into a new dataframe called **RCP_Averages_2020** using the **filter()** function from the **dplyr** package.
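On a toy version of the dataframe:

```r
library(dplyr)
library(tibble)

State_Polls_2020 <- tibble(
  State  = c("Pennsylvania", "Pennsylvania"),
  Poll   = c("RCP Average", "Quinnipiac"),
  Spread = c(-5.5, -6.0))

# rcp_average holds the exact (encoding-safe) label captured earlier
rcp_average <- "RCP Average"

RCP_Averages_2020 <- filter(State_Polls_2020, Poll == rcp_average)
State_Polls_2020  <- filter(State_Polls_2020, Poll != rcp_average)
```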

The last step is to calculate the summary data based on the state polling:

To wrap up the 2020 state polling data we need to calculate the average and standard deviation for each state. To do that we’ll use the **group_by()** and **summarize()** functions from the **dplyr** package.

For the average we’ll only take the five most recent polls for each state, and for the standard deviation we’ll use all polls.

Finally, we’ll combine the average and standard deviation datasets into **State_Summary_2020** using the **left_join()** function from the **dplyr** package.
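Putting those three steps together on toy data (column names are assumptions):

```r
library(dplyr)
library(tibble)

State_Polls_2020 <- tibble(
  State  = rep("Pennsylvania", 6),
  Rank   = 1:6,                      # 1 = most recent poll
  Spread = c(-5, -6, -4, -7, -5.5, -9))

# Average over the five most recent polls only
Avg_2020 <- State_Polls_2020 %>%
  filter(Rank <= 5) %>%
  group_by(State) %>%
  summarize(Spread_2020 = mean(Spread))

# Standard deviation over all polls
SD_2020 <- State_Polls_2020 %>%
  group_by(State) %>%
  summarize(SD_2020 = sd(Spread))

State_Summary_2020 <- left_join(Avg_2020, SD_2020, by = "State")
```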

### Get the 2020 National Data

The national data follows the same format as the state polling data. We’ll use the national data to supplement the state data since the national data is updated more frequently.

### Get the 2016 Data

We’ll supplement the 2020 data with the actual 2016 results, and we’ll also use the 2016 national data to calculate the polling bias in 2016.

To get the data we’ll copy the 2020 code and make a few changes which are commented in the code but I’ll outline below:

- Replace **2020** with **2016** in all the variable names
- Update the Electoral College Map link and the General Election link
- Replace **Biden** with **Clinton**
- Replace the 3rd party candidates with **Johnson**, **Stein**, and **McMullin**
- Use the *actual* spread for each state by filtering rows where the poll is “Final Results”

### Build the Forecast Model

First we need to create a master dataset called **forecast_data**, where we combine the dataframes that we just built using **left_join()** from the **dplyr** package.

Many of the states do not have polling yet. To deal with this we’re going to supplement the 2020 polls with the 2016 actual results. To try to get a better estimate of where these states may actually be polling we’ll apply an adjustment to the 2016 actual results based on the 2020 national polling.

For example, as of this writing we don’t have 2020 polling data for Illinois. However, we know that in 2016 the spread was **Clinton +16.0%**. We also know that the national spread is currently **Biden +6.9%**, and that the final national polling spread in 2016 was **Clinton +3.2%**. So we’ll estimate that Biden is outperforming Clinton by **+3.7%** and add that back to the 2016 Illinois spread. This brings our estimated spread for Biden up to **+19.7%**.

To make the spread adjustment in **forecast_data**, create a column called **National_Adj** which contains the national spread between Biden and Clinton. We’ll also add a **National_SD** column, which we’ll use when there is no standard deviation data available from 2016 or 2020 (this may happen if only one poll was taken during the course of the 2016 or 2020 election).
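The Illinois arithmetic from the text, in the model’s sign convention (negative = Biden):

```r
il_2016_actual  <- -16.0   # Clinton +16.0 in Illinois, 2016
natl_2020       <- -6.9    # current national spread, Biden +6.9
natl_2016_polls <- -3.2    # final 2016 national polling, Clinton +3.2

National_Adj <- natl_2020 - natl_2016_polls   # -3.7: Biden outperforming
il_estimate  <- il_2016_actual + National_Adj # -19.7, i.e. Biden +19.7
```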

The **Spread** and **Sd** columns will be used to run the actual simulation:

- If 2020 polls are not available, **Spread** is the adjusted 2016 spread.
- If 2020 polls are available, **Spread** is half the 2020 spread and half the adjusted 2016 spread.
- If no standard deviation is available from 2016 or 2020, **Sd** is **National_SD**.
- If the 2020 standard deviation is available, **Sd** is half the 2020 standard deviation and half the 2016 standard deviation.
- If only the 2016 standard deviation is available, **Sd** is simply the 2016 standard deviation.

We’ll use the **mutate()** and **case_when()** functions from the **dplyr** package to create the columns (**case_when()** is similar to a switch statement in other programming languages).
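A sketch of those rules with `case_when()` (the column names and toy values are assumptions based on the text):

```r
library(dplyr)
library(tibble)

forecast_data <- tibble(
  State        = c("Illinois", "Pennsylvania"),
  Spread_2020  = c(NA, -5.5),   # NA = no 2020 polls yet
  Spread_2016  = c(-16.0, 0.7),
  SD_2020      = c(NA, 2.1),
  SD_2016      = c(NA, 3.0),
  National_Adj = -3.7,
  National_SD  = 3.5)

forecast_data <- forecast_data %>%
  mutate(
    Spread = case_when(
      is.na(Spread_2020) ~ Spread_2016 + National_Adj,
      TRUE ~ 0.5 * Spread_2020 + 0.5 * (Spread_2016 + National_Adj)),
    Sd = case_when(
      is.na(SD_2020) & is.na(SD_2016) ~ National_SD,
      is.na(SD_2020)                  ~ SD_2016,
      TRUE                            ~ 0.5 * SD_2020 + 0.5 * SD_2016))
```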

Now we need to set up our simulation.

The variable **n** will hold the number of trials we want to run. I’ve set it to 10,000, which should be more than enough.

And we’ll create a matrix called **results_matrix**, which has our trials in the columns and our states in the rows.
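Sketch of the setup (only a few states shown; the real model has a row for every state):

```r
n <- 10000   # number of simulated elections

states <- c("Illinois", "Pennsylvania", "Florida")

# One row per state, one column per trial
results_matrix <- matrix(0, nrow = length(states), ncol = n,
                         dimnames = list(states, NULL))
```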

Instead of simulating the election results with a normal distribution, we’ll use a Johnson distribution with a delta of 0.5. This distribution puts more probability in the tails. Using a distribution with fatter tails is something 538 recommends because there tends to be more uncertainty in elections where polarization is high, and because Donald Trump is a highly unpredictable politician. I suspect that as we move closer to the election, when there is less time for an unpredictable event, 538 will move closer to a normal distribution (delta = 1).

**Dist** describes the shape of our distribution. The delta of 0.5 is what gives the distribution fatter tails compared to the normal distribution.

We’ll use the **rJohnson()** function from the **SuppDists** package to randomly generate 10,000 (or whatever value you’ve chosen for **n**) trials from this distribution and save the results into a vector called **dist_multiplier**. These values represent how many standard deviations each trial will be from the mean.
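A sketch with `rJohnson()`. The exact Johnson parameters the author used aren’t shown; this assumes an SU-type distribution centred at 0 with delta = 0.5:

```r
library(SuppDists)

n <- 10000
set.seed(538)   # reproducible draws

# Johnson SU shape: delta = 0.5 fattens the tails relative to a normal
Dist <- list(gamma = 0, delta = 0.5, xi = 0, lambda = 1, type = "SU")

dist_multiplier <- rJohnson(n, Dist)
```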

Next, we’ll add **Spread** plus **dist_multiplier** * **Sd** across each row, which gives us our simulated results.
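R’s recycling rules make this a one-liner; here it is on a two-state, two-trial toy example:

```r
Spread <- c(`New Jersey` = -17.61, Florida = 1.4)
Sd     <- c(`New Jersey` = 5.029,  Florida = 3.0)

dist_multiplier <- c(1.305, -0.4)   # two example draws

# outer() gives a states-by-trials matrix of Sd * draw; adding Spread
# recycles it down each column, one spread per state
results_matrix <- Spread + outer(Sd, dist_multiplier)

results_matrix["New Jersey", 1]   # -17.61 + 1.305 * 5.029, roughly -11.05
```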

___

Let’s recap what is happening here.

We started with the **results_matrix:**

Basically we’re putting all of the **Spread** data into each column:

We randomly generate 10,000 standard deviations from the distribution we created.

In the above screenshot, the first trial has a draw of 1.305 standard deviations. Since the number is positive, Trump outperformed his polls in this trial. For each state, we’ll add a 1.305 standard deviation move in Trump’s direction to the spread.

In practice, New Jersey has a -17.61% spread (negative means it favors Biden). Looking at the **forecast_data** dataframe, the **Sd** for New Jersey is 5.029. To get the winner of New Jersey for the first trial, we add **1.305 * 5.029 = 6.56** to the **Spread**. This gives a final result for the first trial of **-17.61 + 6.56 = -11.05%**. Since the spread is still negative, Biden wins New Jersey in the first trial despite Trump outperforming his polls.

This is exactly what we get if we look at the **results_matrix**.

___

Next, we’ll create two new matrices, **trump_wins** and **biden_wins**. If Trump wins a state in one of the trials the spread will be positive. In **trump_wins** we’ll replace the positives with a 1 and the negatives with a 0. We’ll do the opposite for **biden_wins**.
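On a toy results matrix (two states, two trials):

```r
results_matrix <- matrix(c(-11.05, 2.3, 4.1, -0.2), nrow = 2,
                         dimnames = list(c("New Jersey", "Florida"), NULL))

# Positive spread = Trump carried the state in that trial
trump_wins <- ifelse(results_matrix > 0, 1, 0)
biden_wins <- ifelse(results_matrix < 0, 1, 0)
```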

For each of these datasets we’ll sum across rows using the **apply()** function and divide the total by **n**. This gives us the win probability for each candidate by state. We’ll then add these win probabilities back to the main **forecast_data** dataframe.
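A minimal sketch of the per-state probability step:

```r
trump_wins <- matrix(c(0, 1, 1, 0, 1, 1), nrow = 2,
                     dimnames = list(c("New Jersey", "Florida"), NULL))
n <- ncol(trump_wins)   # number of trials

# Row sums / number of trials = per-state win probability
Trump_Prob <- apply(trump_wins, 1, sum) / n
```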

We now have a matrix of 1s and 0s for Trump and Biden, where 1 represents the states they won and 0 the states they lost in a given trial.

We can multiply the rows by a vector of electoral votes and that will give us the total electoral votes won by each candidate during each trial.
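Sketch of that multiplication (electoral vote counts shown are for New Jersey and Florida):

```r
trump_wins <- matrix(c(0, 1, 1, 0, 1, 1), nrow = 2,
                     dimnames = list(c("New Jersey", "Florida"), NULL))

Votes <- c(14, 29)   # electoral votes: NJ, FL

# Multiply each row by its state's electoral votes, then total each
# column (trial) to get Trump's electoral count per trial
trump_votes <- colSums(trump_wins * Votes)
```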

We can now create two new vectors, **trump_votes** and **biden_votes**, which will contain each candidate’s electoral votes for each trial. We’ll create the **results** dataframe to combine both vectors. Then we’ll use the **mutate()** function to create a **winner** column to store the winner of each trial.
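A sketch with illustrative vote totals for three trials:

```r
library(dplyr)
library(tibble)

trump_votes <- c(280, 250, 190)   # illustrative totals for three trials
biden_votes <- 538 - trump_votes  # 538 electoral votes in total

results <- tibble(trump_votes, biden_votes) %>%
  mutate(winner = if_else(trump_votes > biden_votes, "Trump", "Biden"))
```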

The final step is to count how many wins each candidate has by using the **group_by()** and **tally()** functions. We’ll then use one last **mutate()** to calculate the win percentage for each candidate.
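On a toy set of trial winners:

```r
library(dplyr)
library(tibble)

results <- tibble(winner = c("Trump", "Biden", "Biden", "Biden"))

final_tally <- results %>%
  group_by(winner) %>%
  tally() %>%                               # adds a count column n
  mutate(win_pct = n / sum(n) * 100)
```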

To display the results in the console we can run the following code:

At the time of this writing our model forecasts a 70% chance Biden will win the 2020 election, which tracks very closely with the 538 forecast model.

Build a Trump vs Biden Prediction Model With R From Scratch was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.
