The Forecast: The Methodology Behind Our 2020 Election Model - OZY | A Modern Media Company

By Daniel Malloy



Because this is the nitty-gritty behind the numbers.

The OZY 2020 forecasting model, developed with 0ptimus, estimates the probability of Republican and Democratic victories in individual presidential and congressional elections; additionally, the model simulates the number of aggregate electoral college votes and congressional seats expected to be won by each party. The congressional model is built upon our 2018 congressional forecasting model, while the presidential model is a new endeavor. 

Summary of the Methodology

We start with a dataset of 200-plus base features that are refreshed on a rolling basis and span economic indicators, political environment measures (both national and local), candidate traits, campaign finance reports and engineered variables designed to draw context-specific information into the model. Not every feature makes it into every estimator: We select different feature sets to complement our different estimation approaches.

We use classical logistic regression and classification tree algorithms, which prove to be more accurate when ensembled together (Montgomery et al. 2011). We then complete the ensemble by incorporating weighted polls, when available, to produce our final prediction. Finally, we run simulations on our predictions to derive the range of possible outcomes.

Our quantitative estimates for each race are converted into qualitative ratings as described in the chart below.

Dataset and Variables

Our presidential model uses a dataset of 113 variables spanning each presidential contest from 1992 to 2020. The congressional model uses 265 variables in the House and 201 in the Senate, spanning each non-special congressional election from 2006 to 2020.

In addition to primary-source data, we engineer several important features. For example, beyond the Federal Election Commission data reported in campaign filings, we incorporate a formula to compare GOP and Democratic campaign finance numbers in each district or state, as well as an indicator for whether a House race has surpassed $3 million in contributions or a Senate race has surpassed $20 million. Polling data for each race is consolidated using a weighted average that accounts for recency, sample size and pollster quality, summarized below.
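As a rough illustration of the engineered campaign finance features described above, the sketch below computes a party comparison and the high-dollar indicator. The log-ratio formula, the name `finance_features` and the choice to compare the threshold against the combined receipts of both candidates are assumptions made for illustration; the actual formula is not specified here.

```python
import math


def finance_features(gop_receipts, dem_receipts, chamber):
    """Return illustrative campaign finance features for one race.

    gop_receipts and dem_receipts are dollar totals; chamber is "house" or "senate".
    """
    if chamber not in ("house", "senate"):
        raise ValueError("chamber must be 'house' or 'senate'")
    # Hypothetical comparison formula: log ratio of GOP to Democratic receipts.
    # Adding 1 avoids taking the log of zero when nothing has been reported.
    log_ratio = math.log((gop_receipts + 1.0) / (dem_receipts + 1.0))
    # Thresholds from the text: $3 million (House) and $20 million (Senate).
    threshold = 3_000_000 if chamber == "house" else 20_000_000
    # Assumption: the threshold applies to the race's combined receipts.
    high_dollar = (gop_receipts + dem_receipts) > threshold
    return {"finance_log_ratio": log_ratio, "high_dollar_race": high_dollar}
```

A positive `finance_log_ratio` would indicate a GOP fundraising advantage under this sketch, and a negative one a Democratic advantage.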

We refresh the dataset on a rolling basis to ensure that any and all changes to individual races are accounted for quickly. This includes adding any new individual race polling, changes in the national environment and special election environment variables, quarterly and 48-hour FEC reports, new economic indicators, primary election outcomes and candidate status changes. 

Prediction Framework

We use two different dependent variables to model the probability of party victory in a race. In our Party-Oriented Model (both presidential and congressional), the dependent variable is a binary for a GOP win or loss in a race. In our Incumbency Model, the dependent variable is a binary for the incumbent party’s win or loss in a race. We transform the incumbent party’s probability of winning a race into a Republican’s probability of winning before adding it to the ensemble. In open-seat elections, we treat the party that previously held the seat as the incumbent party. For our final House predictions, we combine the predictions of both the Party-Oriented and Incumbency models, as well as polling averages. In Senate and presidential races, we use only the ensemble of Party-Oriented Models to make our model-based predictions, which are later merged with polls to make our final prediction.
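The transformation from an incumbent-party probability to a Republican probability can be sketched as below. The function name and the party codes are hypothetical, and the sketch assumes a two-party race, so that a loss for a Democratic incumbent party is counted as a GOP win.

```python
def to_gop_probability(p_incumbent_party_win, incumbent_party):
    """Convert P(incumbent party wins) into P(GOP wins) for one race.

    incumbent_party is "R" or "D"; in an open seat it is the party that
    previously held the seat.
    """
    if not 0.0 <= p_incumbent_party_win <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if incumbent_party == "R":
        # The incumbent party is the GOP, so the probability carries over.
        return p_incumbent_party_win
    if incumbent_party == "D":
        # Two-party assumption: the GOP wins whenever the Democrats lose.
        return 1.0 - p_incumbent_party_win
    raise ValueError("incumbent_party must be 'R' or 'D'")
```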

Modeling Method

Our modeling process utilizes an ensemble technique, incorporating various algorithms and variable subsets, including logistic regression, random forest, XGBoost and elastic net to forecast elections. Including a variety of algorithms and variable subsets in our ensemble reduces error in two ways. First, ensembles have proved to be more accurate on average than their constituent models alone. Second, they are less prone to making substantial errors: If they miss, they miss by smaller margins on average (Montgomery et al. 2011). 

In the House model, we combine two separate ensemble models to leverage both party-oriented and incumbency-based variables, and then add any recent polling information. In the Senate, a single party-oriented ensemble model is sufficient to produce accurate results, and is later combined with polls to make a final prediction. In the presidential model, we use a stacking ensemble that takes a range of estimators and optimizes over them.
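To make the combination step concrete, here is a minimal sketch that averages model-based GOP probabilities and then blends in a polling probability when one exists. The simple average, the name `blend_predictions` and the `poll_weight` value are assumptions; the actual combination rule and weights are not given here.

```python
def blend_predictions(model_probs, poll_prob=None, poll_weight=0.5):
    """Blend model-based GOP win probabilities with an optional poll probability.

    model_probs: list of GOP win probabilities, e.g. one from the party-oriented
    ensemble and one from the incumbency ensemble for a House race.
    poll_prob: aggregated polling probability, or None if the race has no polls.
    poll_weight: hypothetical share of the final prediction given to polls.
    """
    if not model_probs:
        raise ValueError("at least one model probability is required")
    model_avg = sum(model_probs) / len(model_probs)
    if poll_prob is None:
        # Without polls, the prediction rests on the models alone.
        return model_avg
    return (1.0 - poll_weight) * model_avg + poll_weight * poll_prob
```

For a Senate race under this sketch, `model_probs` would hold a single party-oriented probability.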


Polling

We gather polling data for competitive House, Senate and presidential races. We use polls posted by RealClearPolitics, HuffPollster, Polling Report and FiveThirtyEight in our analysis.

Each individual poll in a race is converted to a probability of a GOP win in that state or district. The probability is generated by sampling from a normal distribution centered at the GOP candidate’s point total in the poll. Each draw represents a simulated point total for the GOP candidate. The difference between the draw and the GOP candidate’s polled point total is subtracted from the other candidate’s point total to create a simulated point total for the other candidate. If the GOP candidate’s simulated point total is greater than the other candidate’s simulated point total, a win is recorded; otherwise, a loss is recorded. The number of wins divided by the total number of draws is the simulated probability of a GOP win given the poll’s margin.
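The sampling procedure described above can be sketched as follows. The function name, the `error_sd` parameter (the spread of the sampling distribution, in percentage points), the number of draws and the seed are all assumptions chosen for illustration.

```python
import random


def poll_win_probability(gop_pct, other_pct, error_sd, n_draws=10_000, seed=0):
    """Estimate P(GOP win) implied by one poll via normal sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Simulated point total for the GOP candidate.
        gop_draw = rng.gauss(gop_pct, error_sd)
        # Whatever the GOP candidate gains (or loses) relative to the poll
        # is subtracted from (or added to) the other candidate's total.
        other_draw = other_pct - (gop_draw - gop_pct)
        if gop_draw > other_draw:
            wins += 1
    return wins / n_draws
```

For example, `poll_win_probability(48.0, 46.0, error_sd=4.0)` gives the simulated chance of a GOP win for a poll showing a two-point GOP lead under an assumed four-point spread.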

The spread of the sampling distribution is based on the estimated total survey error of the poll. A poll’s reported margin of error often does not adequately capture its uncertainty (Shirani-Mehr et al. 2018), so an adjustment is necessary to reflect the true uncertainty of a GOP win. We use an empirical distribution of polling errors gathered from House and Senate races dating back to 2002 as a baseline for estimating total survey error. The individual probabilities are then ensembled.
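One simple way to turn a set of historical polling errors into a spread is the root-mean-square error, sketched below. This is an assumption for illustration: how the empirical distribution is actually converted into a spread is not specified here, and `total_survey_error_sd` is a hypothetical name.

```python
import math


def total_survey_error_sd(historical_errors):
    """Root-mean-square of historical polling errors, in percentage points.

    Each error is a poll's GOP share minus the actual GOP share in that race.
    """
    if not historical_errors:
        raise ValueError("at least one historical error is required")
    mean_square = sum(e * e for e in historical_errors) / len(historical_errors)
    return math.sqrt(mean_square)
```

Using the root-mean-square rather than the standard deviation around the mean keeps any systematic bias in past polls inside the uncertainty estimate.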

Each poll is weighted by its date, measured as the number of days until the election, and by its pollster’s rating, which is publicly available from FiveThirtyEight. A linear decay function is applied to both the poll’s date and the pollster’s rating, so polls from higher-rated pollsters that are closer to the election are weighted more heavily. For races where a primary has not occurred yet, we include polls for all potential general election head-to-head scenarios, unless it makes sense in a specific race to assume the primary winners (e.g., Florida Senate, Arizona Senate). The final, aggregated probability is then added to our ensemble model.
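A minimal sketch of this weighting scheme follows. The 180-day window, the mapping of pollster ratings onto a 0-to-1 score and the choice to multiply the two decay terms are assumptions; only the use of linear decay on date and rating comes from the text.

```python
def poll_weight(days_to_election, rating_score, max_days=180):
    """Weight for one poll from linear decay on recency and pollster rating.

    rating_score is a hypothetical numeric rating in [0, 1], where 1 is best.
    """
    # Linear decay: full weight on Election Day, zero at max_days or earlier.
    recency = max(0.0, 1.0 - days_to_election / max_days)
    # Linear in the rating, clamped to [0, 1].
    quality = max(0.0, min(1.0, rating_score))
    return recency * quality


def weighted_poll_probability(polls):
    """Weighted average of per-poll GOP win probabilities.

    polls: list of (gop_win_probability, days_to_election, rating_score).
    Returns None when every poll receives zero weight.
    """
    weights = [poll_weight(days, rating) for _, days, rating in polls]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * p for w, (p, _, _) in zip(weights, polls)) / total
```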

Election Simulations: House and Senate

After calculating probabilities for each individual House and Senate race, we turn our attention to predicting the aggregate number of seats we expect the GOP to win and the probability of the GOP winning control of the House and Senate. We use each seat’s predicted probabilities to run simulations of the 2020 congressional elections.

The mechanism of a wave is simulated by treating our predicted probabilities as beta random variables. Each race is assigned a beta distribution centered on the predicted probability, with shape parameters chosen to reflect the volatility of toss-up races in wave elections and, conversely, the relative resilience of noncompetitive races.
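One way to build a beta distribution centered on a predicted probability is to fix a concentration value and split it between the two shape parameters, as in this sketch. The concentration of 20 and the function names are assumptions; the actual shape parameters are not given here.

```python
import random


def beta_shape_params(p, concentration=20.0):
    """Shape parameters (alpha, beta) of a beta distribution with mean p."""
    # Keep p strictly inside (0, 1) so both shape parameters stay positive.
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return p * concentration, (1.0 - p) * concentration


def beta_variance(p, concentration=20.0):
    """Variance of the beta distribution built by beta_shape_params."""
    a, b = beta_shape_params(p, concentration)
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


def sample_race_probability(p, rng, concentration=20.0):
    """Draw one win probability for a race from its beta distribution."""
    a, b = beta_shape_params(p, concentration)
    return rng.betavariate(a, b)
```

With a fixed concentration, the variance is largest at a predicted probability of 0.5 and shrinks toward 0 or 1, which matches the idea that toss-ups are more volatile than noncompetitive races.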

For each iteration of our simulation, the order of races is randomized to mitigate systematic bias in our prediction. Starting with the first race, a value is sampled randomly from the beta distribution and a weighted coin flip is conducted with the probability of heads being equal to the sampled value. If the result is heads, it is counted as a GOP win. We then calculate GOP wins above expectation for simulated races, which is the total number of GOP wins less the sum of the predicted probabilities of all simulated races. A shift is applied to the probability distribution of the next race in sequence, weighted according to the total GOP wins above (or below) expectation. The sequence is continued until every race in the chamber (435 in the House) is simulated. The final number of GOP seats won is recorded, then we start over from the top.
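A single iteration of this procedure might look like the sketch below. The `shift_scale` and `concentration` values, the additive form of the shift and the function name are assumptions; only the sequence of steps follows the description above.

```python
import random


def simulate_chamber_once(probs, rng, concentration=20.0, shift_scale=0.01):
    """Run one iteration over all races and return the number of GOP wins.

    probs: predicted GOP win probability for each race.
    """
    order = list(range(len(probs)))
    rng.shuffle(order)  # randomize the order of races
    gop_wins = 0
    expected_wins = 0.0  # sum of predicted probabilities of races simulated so far
    for i in order:
        # Shift this race's center by GOP wins above (or below) expectation.
        center = probs[i] + shift_scale * (gop_wins - expected_wins)
        center = min(max(center, 1e-6), 1.0 - 1e-6)
        # Sample a win probability from the beta distribution, then flip
        # a weighted coin with that probability of heads.
        sampled = rng.betavariate(center * concentration,
                                  (1.0 - center) * concentration)
        if rng.random() < sampled:
            gop_wins += 1
        expected_wins += probs[i]
    return gop_wins
```

Because early over-performance nudges later races in the same direction, repeated runs of this sketch produce a wider spread of seat totals than independent coin flips would.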

We repeat this process 14,000,605 times to create a distribution of potential outcomes. This allows us to know what scenarios are realistic and calculate the probability of any particular result (e.g., GOP wins 220 seats; percentage chance of GOP taking majority with 218-plus seats).
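Once the seat totals from every iteration are recorded, the probabilities mentioned above reduce to simple counting. This sketch assumes a list of simulated GOP seat totals; `summarize_simulations` is a hypothetical name, and the default of 218 is the House majority threshold cited in the text.

```python
from collections import Counter


def summarize_simulations(seat_counts, majority=218):
    """Summarize simulated GOP seat totals.

    Returns the share of simulations with at least `majority` seats and the
    share of simulations landing on each exact seat total.
    """
    n = len(seat_counts)
    if n == 0:
        raise ValueError("no simulations to summarize")
    tally = Counter(seat_counts)
    return {
        "p_majority": sum(c for seats, c in tally.items() if seats >= majority) / n,
        "p_exact": {seats: c / n for seats, c in tally.items()},
    }
```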

Election Simulations: Electoral College

Using the probabilities obtained from the model for each state, we run simulations to predict the aggregate number of electoral votes President Donald Trump is expected to win and the probability of his getting a majority. In every simulation, we consider the effect of an outcome in one state on the outcomes in other states (plausible correlations can arise from any number of variables, including education, racial demographics and geography). We start by flipping a coin on each state, arranged in increasing order of distance from toss-up. If the outcome of the coin flip is 1, it is considered a win for the GOP; otherwise, it is a loss.

The core idea underlying our simulations is that competitive states should inform other competitive states. We derive the distribution of outcome relationships by performing an XGBoost regression. The training data reflects the relationship between states and the change in their conditional probabilities based on the voting outcome of one state. It also implicitly takes into account relationships demonstrated by variables such as the Partisan Voting Index (PVI). For example, if Trump wins Wisconsin, it is likely that he will win Florida as well; hence, the probability of a GOP win in Florida is increased.

However, when the coin is flipped on a state that is not competitive, the probabilities of all the other states remain unchanged. Note that, in each simulation, we fix the outcome of safe states. Although there certainly is some signal in the probabilities of safe states, in simulations such as these, including them can cause enormous cascades in which a state like Kansas goes blue and so does the rest of America.
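The flow of one Electoral College simulation can be sketched as follows. The XGBoost-derived relationships between states are replaced here by a single constant `shift`, a deliberate simplification; the `safe_cutoff` and `band` thresholds and all function names are likewise assumptions.

```python
import random


def is_safe(p, safe_cutoff):
    """A state is safe when its GOP probability is extreme in either direction."""
    return p >= safe_cutoff or p <= 1.0 - safe_cutoff


def is_competitive(p, band):
    """A state is competitive when its GOP probability is close to 0.5."""
    return abs(p - 0.5) <= band


def simulate_electoral_college(states, rng, safe_cutoff=0.95, band=0.25, shift=0.05):
    """Run one simulation and return the GOP electoral vote total.

    states: dict mapping state name to (gop_win_probability, electoral_votes).
    """
    current = {name: p for name, (p, _) in states.items()}
    # Flip coins in increasing order of distance from toss-up.
    order = sorted(states, key=lambda name: abs(states[name][0] - 0.5))
    gop_votes = 0
    for idx, name in enumerate(order):
        base_p, votes = states[name]
        if is_safe(base_p, safe_cutoff):
            gop_win = base_p > 0.5  # safe states have a fixed outcome
        else:
            gop_win = rng.random() < current[name]
            if is_competitive(base_p, band):
                # A competitive result nudges the remaining competitive states.
                delta = shift if gop_win else -shift
                for other in order[idx + 1:]:
                    if is_competitive(states[other][0], band):
                        current[other] = min(max(current[other] + delta, 0.0), 1.0)
        if gop_win:
            gop_votes += votes
    return gop_votes
```

In this sketch, states that are neither safe nor competitive still get a coin flip, but their results do not change any other state's probability.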

This simulation is repeated 140,605 times to get the probability and distribution of states won by each candidate. 
