The Methodology Behind Our Election Forecast

By Daniel Malloy

Here's where those fancy prediction numbers come from.

This year, OZY is teaming up with technology and data firm 0ptimus and its sister company Decision Desk HQ, which reports and analyzes election results, to present “The Forecast.”

An exclusive prediction model developed by 0ptimus, The Forecast uses historical trends in election data to estimate each Democratic primary candidate’s vote share in each state on this year’s primary and caucus calendar, as well as each candidate’s probability of winning the most primary delegates. Our projections identify how each candidate is expected to perform at the polls or in the caucus room on election night.

After collecting a wide range of data on presidential primary and caucus contests, we estimate state-level models to examine the historical relationships between our predictors and past primary and caucus performances. We then optimize model performance and build an ensemble of the best-performing models to generate predictions for the current Democratic field’s nomination campaign. To keep our predictions current, our dataset is continuously refreshed as the presidential field changes.

Dataset and Variables

The dataset for our project incorporates a large number of variables historically associated with the presidential primary process in the modern era. Broadly speaking, these include polling at the national and state levels, a variety of campaign finance metrics (such as individual contributions, total disbursements and candidate loans), endorsements from elected party officials, media coverage and candidates’ biographical details (such as home state).

Our dataset covers all competitive presidential primary elections from 1992 to the present, capturing the breadth of primaries and caucuses in the contemporary political landscape. Our historical data also include the results of individual state-level contests in each of these competitive nomination campaigns. Nomination campaigns featuring an incumbent president (e.g., Barack Obama’s unopposed campaign for the 2012 Democratic nomination) are omitted from our analysis, since they don’t provide useful data for predictive analysis. The 2020 data is updated on a rolling basis to reflect the most up-to-date state of the race; these updates are especially important for polling data, FEC reports and media coverage.

Our campaign finance data comes from the FEC; media coverage from LexisNexis; endorsements from FiveThirtyEight, Cohen et al. Replication Data and Democracy in Action; and polls from HuffPost Pollster, FiveThirtyEight, Wikipedia, LexisNexis, RealClearPolitics and PollingReport.

Developing the Model

Our prediction framework generates forecasts for each Democratic presidential candidate on a state-by-state basis, with predictions in later contests being influenced by earlier contests both directly (by contest wins) and indirectly (by changes in polling, media coverage, etc.). We forecast vote shares rather than win probability because the Democratic primary awards pledged delegates proportionally rather than on a winner-take-all basis. In our national simulation (described below), we then translate these state vote projections into delegate counts to identify each candidate’s overall probability of becoming the Democratic Party’s nominee.
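The translation from vote shares to delegate counts can be sketched as a proportional allocation. Under DNC rules, candidates generally need about 15 percent support to win delegates; the function below is a simplified illustration of that idea (not the model’s actual code — real allocation also happens separately at the district and statewide levels), using largest-remainder rounding so the allocated delegates sum to the state total:

```python
def allocate_delegates(vote_shares, total_delegates, threshold=0.15):
    """Allocate a state's pledged delegates proportionally among
    candidates at or above the viability threshold (simplified)."""
    viable = {c: s for c, s in vote_shares.items() if s >= threshold}
    if not viable:
        return {c: 0 for c in vote_shares}
    pool = sum(viable.values())
    # Proportional shares of the delegate total among viable candidates.
    raw = {c: total_delegates * s / pool for c, s in viable.items()}
    alloc = {c: int(r) for c, r in raw.items()}
    # Largest-remainder rounding: hand leftover delegates to the
    # candidates whose fractional remainders are biggest.
    leftover = total_delegates - sum(alloc.values())
    for c in sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)[:leftover]:
        alloc[c] += 1
    return {c: alloc.get(c, 0) for c in vote_shares}
```

For example, with 41 delegates at stake and shares of 40, 35, 13 and 12 percent, the two sub-threshold candidates get nothing and the leaders split the full 41 in proportion to their shares.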

Using the dataset introduced above, we ran iterative tests to ascertain the most predictive combination of features to include in each of our state-by-state models. This process allows us to explore how the relationship between a predictor and the election result may differ across states. For example, the proportion of campaign finance receipts from individuals might be more predictive in larger states but less predictive in smaller states.

Using our dataset, we ran our predictive models for each combination of predictors and optimized our overall forecasting model based on how accurately each combination of variables predicted the historical training data. In this process, we estimated millions of models to assess the relationship between our predictors and the historical election results for each state. To avoid overfitting, we ensemble the models that performed best on the historical data to create our predicted values.
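This select-then-ensemble step can be illustrated with a toy sketch in which each candidate model is a callable scored on past contests, the best few are kept, and their outputs are averaged. The function names and the plain averaging rule are illustrative assumptions, not the production pipeline:

```python
from statistics import mean

def historical_mse(predict, history):
    """Mean squared error of one candidate model over past
    (features, observed_result) pairs."""
    return mean((predict(x) - y) ** 2 for x, y in history)

def ensemble_top_models(models, history, k=3):
    """Rank candidate models by error on historical contests, keep the
    k best, and return a predictor that averages their outputs."""
    ranked = sorted(models, key=lambda m: historical_mse(m, history))
    best = ranked[:k]
    return lambda x: mean(m(x) for m in best)
```

Here `history` stands in for past state results, and each model maps a predictor value (say, a poll average) to a vote share; averaging only the top performers is what keeps any single overfit specification from dominating.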

The modeling process for predicting vote share in a nomination contest differs from models common to general elections in the United States because multiple co-partisan candidates must be considered simultaneously.

Our model utilizes two estimation techniques for uncovering the relationships between our predictors and election results: linear regression and XGBoost. That is, for each iteration of our model in each state, we estimate a linear regression model and an XGBoost model. After exploring the metrics for each model, we weight predictions from each model to uncover each iteration’s prediction. By averaging across these two algorithms, we can leverage the strengths of each without the overfitting risk of relying on a single model.
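The article does not publish the weighting rule used to combine the two algorithms. One common choice, shown here purely as an assumed sketch, is to weight each model’s prediction by the inverse of its validation error, so the historically more accurate model counts more:

```python
def blend(pred_linear, pred_tree, err_linear, err_tree):
    """Inverse-error weighted average of two models' predictions.
    Smaller validation error -> larger weight."""
    w_lin = 1.0 / err_linear
    w_tree = 1.0 / err_tree
    return (w_lin * pred_linear + w_tree * pred_tree) / (w_lin + w_tree)
```

If the linear model predicts a 30 percent share with half the error of the tree model’s 36 percent, the blend lands at 32 percent — twice as close to the more reliable model.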


Polling data in presidential primary nomination contests is perhaps the most important variable considered by our models: Polling helps reflect attitude shifts among the voting public, it captures the effects of major endorsements and campaign spending, and it demonstrates where candidates may rank at both the national and state levels. Therefore, we incorporate a wide array of public polls as they are released. In terms of historical data, we’ve collected as many polls as we could find, including those in popular polling databases as well as those in historical media coverage of primary races.

In terms of data for the 2020 Democratic primary, we continue to include polls that are made available by popular aggregators such as FiveThirtyEight, RealClearPolitics, PollingReport and other polls found in the news. These polls include observations for candidates who are declared and are still actively campaigning. We update our polling information daily to keep up with the dynamic nature of the campaign. To represent each candidate’s poll standing, we calculate averages of both national and state-level polls. These averages are weighted by recency: all polls from the past month are factored in, but polls from the most recent two weeks are given substantially more weight than those conducted earlier.
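A recency-weighted average along these lines can be sketched as follows; the exact window lengths and weight values are assumptions, since the article does not publish them:

```python
from datetime import date

def poll_average(polls, today, recent_days=14, window_days=30, recent_weight=3.0):
    """Weighted average of (poll_date, result) pairs: polls older than
    window_days are dropped, and polls within recent_days get
    recent_weight times the weight of older polls (weights assumed)."""
    num = den = 0.0
    for poll_date, value in polls:
        age = (today - poll_date).days
        if age > window_days:
            continue  # outside the one-month window
        w = recent_weight if age <= recent_days else 1.0
        num += w * value
        den += w
    return num / den if den else None
```

A poll from yesterday thus pulls the average roughly three times as hard as one from three weeks ago, while anything older than a month is ignored entirely.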

National Simulations

In addition to our state-by-state primary and caucus predictions, we’ve also put together a simulator that identifies each candidate’s probability of capturing a plurality of pledged delegates entering the Democratic National Convention this July. Using the distributions in our state-by-state projections, we simulate 10,000 potential outcomes while accounting for the relationships among states and across the election calendar. Further, we consider the types of situations that prompt candidates to drop out of the nomination contest to better account for their forecasted allotment of pledged delegates.
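A stripped-down version of such a simulator might look like the sketch below, which assumes independent normal draws around each state projection and simple proportional delegate allocation. The real model also accounts for correlations among states and for candidate dropouts, both of which this sketch omits:

```python
import random

def simulate_plurality(states, n_sims=10_000, seed=0):
    """states: list of (delegates, {candidate: (mean_share, std)}).
    Returns each candidate's share of simulated runs in which they
    finish with the most pledged delegates."""
    rng = random.Random(seed)
    wins = {}
    for _ in range(n_sims):
        totals = {}
        for delegates, forecast in states:
            # Draw a vote share for each candidate around our projection.
            draws = {c: max(rng.gauss(mu, sd), 0.0)
                     for c, (mu, sd) in forecast.items()}
            pool = sum(draws.values()) or 1.0
            for c, share in draws.items():
                # Simplified proportional allocation, no viability threshold.
                totals[c] = totals.get(c, 0.0) + delegates * share / pool
        leader = max(totals, key=totals.get)
        wins[leader] = wins.get(leader, 0) + 1
    return {c: wins.get(c, 0) / n_sims for c in totals}
```

The fraction of runs each candidate leads becomes their plurality probability, which is why these numbers shift as refreshed polling moves the per-state means.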

Our national simulations will continue to be updated as new data (e.g., polling or endorsements) influence our state projections and as state primary and caucus results begin to roll in.
