In the era of 24-hour news cycles and surgically targeted political campaigns, the ability to accurately gauge and constantly update public opinion measurements in targeted subgroups is an advantage for political campaign managers and public policy advocates.
Traditional opinion polling, successful in generating cross-sectional and population-wide estimates, is poorly equipped to produce speedy and granular public-opinion measurements because of its slow turnaround (typically 2-3 days) and relatively small sample size (typically around 1,000 respondents). We believe big data analytics (that is, analysis of large data sets typically based on non-random samples) holds major promise for achieving improved spatial and temporal resolution. But inference from such non-traditional surveys presents a challenge because the available data generally do not come from any formal sampling procedure and are not a representative sample of the general population.
In a forthcoming academic paper in the International Journal of Forecasting, which we co-authored with David Rothschild and Sharad Goel at Microsoft Research, we demonstrate the use of a large but highly biased opt-in survey placed on the Xbox gaming platform to generate national and state-level public opinion estimates during the 2012 U.S. presidential election campaign. Our estimates are comparable to those obtained from poll aggregators that collect and synthesize hundreds of traditional polls.
Our survey is similar to typical Internet opt-in surveys. However, the Xbox platform offers the important benefit that the survey was continuously available (with the limit of at most one response per day per user), and thus we have repeated measurements in many households, which allows us to track changes in opinion much more effectively than with traditional repeated cross-section polls.
Big Data, Big Bias, Big Model
Our data include around 750,000 responses from more than 350,000 Xbox users, averaging about 15,000 responses per day. More than 30,000 of our respondents took the survey more than five times during the month and a half that our survey was open. In contrast, a typical random-sample telephone survey has a sample size of around 1,000 that spans 2-3 days. Moreover, compared with the high cost of traditional polls, where careful design and training of the surveyors/data collectors are required, the data collection cost of the Xbox survey is effectively pennies per response, as we take advantage of existing technological infrastructure.
As expected, our data set has far more young and male respondents than the general electorate. Figure 1 compares the demographic composition of the Xbox data set with that of the 2012 U.S. presidential election exit poll.
Without any statistical adjustment, the raw data would suggest a victory for Mitt Romney over President Barack Obama (Figure 2). But, of course, we would know to adjust.
The demographic and political attributes we collected from the respondents are central to the statistical adjustment. We divided the whole data set into small cells formed by the combinations of these attributes. For example, one such cell might be composed of young Asian females living in Ohio with a college degree and a conservative political leaning. It is reasonable to assume that within each cell our sample is representative. In fact, the more attributes we control for, the more plausible this assumption will be. Obama’s vote share was estimated for each of these cells. Analysis on such a granular scale is almost impossible with traditional polls, but by applying a statistical technique called Bayesian hierarchical modeling, the Xbox survey can give accurate and stable estimates for these small demographic groups. National- and state-level estimates can then be generated by aggregating the cell estimates, weighting each cell in reference to a typical electorate.
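The aggregation step described above can be illustrated with a minimal sketch. It assumes only two attributes (age group and sex), and every number in it is invented for illustration; none comes from the Xbox data or the exit polls. The function names `raw_estimate` and `poststratify` are hypothetical, and the cell-level support values are simply given here, whereas in the analysis described above they would come from the fitted model.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (age group, sex): hypothetical attributes for this sketch


def raw_estimate(sample_n: Dict[Cell, int], cell_support: Dict[Cell, float]) -> float:
    """Unadjusted estimate: each cell is weighted by its share of the SAMPLE."""
    total = sum(sample_n.values())
    return sum(sample_n[c] * cell_support[c] for c in sample_n) / total


def poststratify(cell_support: Dict[Cell, float], electorate_n: Dict[Cell, int]) -> float:
    """Adjusted estimate: each cell is weighted by its share of the ELECTORATE."""
    total = sum(electorate_n.values())
    return sum(electorate_n[c] * cell_support[c] for c in electorate_n) / total


# Invented numbers: a sample dominated by young men, as in an opt-in gaming survey.
sample_n = {
    ("young", "male"): 700,
    ("young", "female"): 100,
    ("old", "male"): 150,
    ("old", "female"): 50,
}
# Invented per-cell support for a candidate (assumed already estimated).
cell_support = {
    ("young", "male"): 0.45,
    ("young", "female"): 0.60,
    ("old", "male"): 0.40,
    ("old", "female"): 0.55,
}
# Invented composition of the reference electorate.
electorate_n = {
    ("young", "male"): 200,
    ("young", "female"): 200,
    ("old", "male"): 300,
    ("old", "female"): 300,
}

raw = raw_estimate(sample_n, cell_support)
adjusted = poststratify(cell_support, electorate_n)
print(f"raw: {raw:.4f}, adjusted: {adjusted:.4f}")
```

The two functions differ only in the weights: the raw estimate inherits the skew of the sample, while the adjusted estimate replaces it with the composition of the reference electorate.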
The adjusted national daily snapshots provide a reasonable timeline of the two-party vote share movement over the 45 days leading up to the election (Figure 3). The estimates in the last few days are arguably closer to the actual outcome than those of the aggregators of traditional polls. The state-level predictions are also solid, missing only North Carolina and Florida, both by small margins.
The Xbox example might be extreme in terms of bias, but it is indicative of a common problem in big data analytics: the data are often a convenience sample, and thus have a huge bias. That is why a carefully constructed model, like the one we used for analyzing the Xbox survey, is necessary for extracting information from big data. Big data needs big models.
The Case for Non-representative Polls
Internet polls, or in general, non-representative polls, are by no means shiny new inventions. In fact, non-representative polls were common before the advent of modern probabilistic polling, which aims to sample each individual in the targeted population with equal probability. In an infamous incident in the 1936 U.S. presidential election, a then-popular magazine, Literary Digest, mailed out 10 million questionnaires to its subscribers and automobile owners and received 2 million responses. Based on these responses, they predicted a victory for Republican candidate Alf Landon over incumbent Democratic President Franklin Roosevelt. During the same election campaign, pollsters such as George Gallup and Elmo Roper used much smaller but representative samples to predict the election with reasonable accuracy. Since then, non-representative polling has fallen out of favor among pollsters.
However representative of the general population a random sample of the phone book might have been in the past, it is increasingly difficult today to collect a random sample using the standard random-digit dialing method. Traditional pollsters devote a large amount of resources to cleaning up “tainted” samples. With big data and modern computing power, we believe it is time to rethink the relevance and importance of non-representative polling. With adequate statistical modeling, non-representative polls can generate much finer and timelier public opinion measurements than are possible with traditional polls.
Naturally, there are suspicions about how far non-representative polls can be pushed. For example, women over 65 who play Xbox are probably not typical. Can we do a good job predicting this subgroup? It turns out that the Xbox prediction for this subgroup was within 1 percentage point of the actual vote share. We attribute this success to two factors. The first is the sheer size of our data. Even though the proportion of women over 65 is low in our sample, their absolute number is still comparable to, if not larger than, that in traditional polls. Second, the statistical model borrows information from women in general and from people over 65 in general to strengthen the estimate for the more specific subgroup.
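The second factor, borrowing strength from broader groups, can be illustrated with a deliberately simplified stand-in for a hierarchical model: a beta-binomial posterior mean that pulls a small cell's observed rate toward the rate of a broader group. This is a sketch under stated assumptions, not the model used for the Xbox analysis; the function name `partial_pool`, the `prior_strength` parameter, and all numbers are hypothetical.

```python
def partial_pool(cell_yes: int, cell_n: int, group_rate: float,
                 prior_strength: float = 50.0) -> float:
    """Posterior mean for a cell's support under a Beta prior.

    The prior is centered on `group_rate` (the rate in a broader group) and
    carries the weight of `prior_strength` pseudo-respondents. Both the single
    broader group and the default strength are assumptions of this sketch.
    """
    return (cell_yes + prior_strength * group_rate) / (cell_n + prior_strength)


# Invented numbers: both cells show 60% raw support; the broader group shows 50%.
small_cell = partial_pool(cell_yes=6, cell_n=10, group_rate=0.5)
large_cell = partial_pool(cell_yes=600, cell_n=1000, group_rate=0.5)

# The sparse cell is pulled strongly toward the group rate;
# the well-populated cell stays close to its own data.
print(f"small cell: {small_cell:.3f}, large cell: {large_cell:.3f}")
```

A cell with few respondents leans heavily on the broader group, while a cell with many respondents is estimated almost entirely from its own data. This is the sense in which a large sample and a pooling model complement each other.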
The dichotomy of representative and non-representative polls is not as clear as it seems; which paradigm to use is ultimately a cost-benefit analysis. It is well understood that representative or random-sample surveys are imperfect, and so pollsters perform statistical adjustment using responses to demographic and political questions. And, from the other direction, it is desirable for non-representative polls to have as low a bias as possible. Thus, it still makes sense to set up opt-in surveys to have as broad a reach as possible and to minimize the aspects of self-selection of respondents.
The emergence of big data analytics has made non-representative polls increasingly attractive. Even so, we don’t consider non-representative polls a replacement for traditional representative polls, but a flexible supplement. The areas that could see the biggest impact from non-representative polls are most likely smaller, local elections, where representative polls are scarce because of time and cost constraints.
Wei Wang (firstname.lastname@example.org) is a Ph.D. candidate in statistics at Columbia University. Andrew Gelman (email@example.com) is a professor of statistics and political science at Columbia University.