The big data paradox and opinion polls

The first scene in Annie Hall is Woody Allen’s character telling the following joke: “There’s an old joke – um… two elderly women are at a Catskill mountain resort, and one of ’em says, ‘Boy, the food at this place is really terrible.’ The other one says, ‘Yeah, I know; and such small portions.’”

I believe this joke is useful in thinking about how we can get opinion polls wrong – i.e., how we think about the relationship between quality (representative data) and quantity (sample size). Specifically, when I see opinion polls, and in particular unscientific polls, it is easy to say that the data is useless – and the sample size is too small. The issue is that the two are separate problems, but we can easily forget that.

Meng (2018) describes what he calls the Big Data Paradox: “The bigger the data, the surer we fool ourselves.” I believe this is, for the most part, the case with opinion polls. That is, when opinion polls have sample sizes much greater than 1,000, we should ask specific questions about why the sample size is greater than 1,000 – and in particular how people justify their sample size. We should not simply be impressed by the large sample size. Similarly, the solution to achieving better polls is not necessarily more data.

To discuss some of these issues, Meng (2018) shows that there are three (and only three!) sources of potential bias to consider when evaluating the difference between a sample average ($ \bar{Y}_{n} $) and a population average ($ \bar{Y}_{N} $): data quality defect, data quantity, and inherent problem difficulty. The relationship can be written as (from Bradley et al. 2021):

$$ \bar{Y}_{n}-\bar{Y}_{N} = \hat{\rho}_{\small{Y, R}} \times \sqrt{\frac{N-n}{n}} \times \sigma_{Y} $$

Let us begin at the end of the equation: $ \sigma_{Y} $. This is the standard deviation of the outcome we are interested in. The greater the standard deviation, the greater the inherent problem difficulty. If this standard deviation is zero, we will not have any bias (i.e., the difference between the sample average and the population average will be zero). In other words, it will be no challenge to reach the true estimate as it does not matter who we poll (and even n = 1 will provide an unbiased estimate). As this standard deviation increases, it will become more difficult to poll the outcome of interest. This is why we call it the inherent problem difficulty. In this post, I will not go into greater detail with problem difficulty but just assume that it is non-zero (which is a very realistic assumption). In most cases, this is outside the control of the researcher.

Next, and of greater interest here, is $ \sqrt{\frac{N-n}{n}} $. This is where the potential of big data comes into play. Specifically, the more people we poll, all else equal, the smaller the bias. In the most extreme example, when $ N = n $, our bias will be zero, and we no longer have to worry about $ \sigma_{Y} $ and $ \hat{\rho}_{\small{Y, R}} $. However, in the polls we are interested in, $ N > n $. The simple point underlying the big data paradox is that all things are not equal, and we often cannot increase the sample size in isolation from other potential biases.

The first part of the equation is $ \hat{\rho}_{\small{Y, R}} $. This is related to how representative our poll is of the population of interest, i.e., the data defect correlation. This is the correlation between the outcome ($ Y $) and whether the respondent is participating in the poll or not ($ R $). If there is no correlation between values on the outcome and the likelihood of people participating in a poll, there will not be any bias. This is what we achieve with true random sampling (when we have no problems with compliance and the like). In other words, the greater the correlation between the outcome of interest and participation in the poll, the greater the bias. If, for example, we are interested in vote intention ($ Y $) and Conservative voters are more likely to participate in a poll ($ R $), there will be a non-zero correlation (i.e., a bias). Clinton et al. (2022), for example, found that Democrats were more likely to cooperate with telephone interviewers in the 2020 presidential pre-election telephone polls.

The interesting aspect of the equation is that the product of the three terms can explain all bias in any poll (we are not missing out on anything). There are no other hidden assumptions or dynamics that can also lead to a systematic bias (i.e., a difference between a sample estimate and a population estimate). It all boils down to the data quality defect, data quantity, and inherent problem difficulty. When we multiply the three terms we will get the full bias, and if just one of the terms is 0, we will not have a problem with our poll (i.e., a representative poll will not have any bias even when $ \sigma_{Y} $ and $ \sqrt{\frac{N-n}{n}} $ are greater than zero).
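To see that the three terms really do account for the whole difference, here is a minimal simulation sketch. The population, the response probabilities (0.06 and 0.05) and the variable names are all made up for illustration; they are not from Meng (2018) or Bradley et al. (2021). The only thing the sketch checks is that the product of the three terms equals the gap between the sample average and the population average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of N people with a binary outcome
# (e.g., 1 = intends to vote for party A).
N = 100_000
Y = rng.binomial(1, 0.5, size=N)

# Response indicator R: assume people with Y = 1 are slightly more
# likely to take part in the poll (made-up probabilities).
p_respond = np.where(Y == 1, 0.06, 0.05)
R = rng.binomial(1, p_respond)

n = R.sum()                              # realised sample size
sample_mean = Y[R == 1].mean()           # what the poll reports
population_mean = Y.mean()               # what we want to know

rho = np.corrcoef(Y, R)[0, 1]            # data defect correlation
quantity = np.sqrt((N - n) / n)          # data quantity term
sigma = Y.std()                          # population SD (divides by N)

error = sample_mean - population_mean
product = rho * quantity * sigma

print(f"n = {n}, rho = {rho:.4f}")
print(f"sample mean - population mean = {error:.6f}")
print(f"rho * sqrt((N-n)/n) * sigma   = {product:.6f}")
```

The two printed numbers should agree up to floating-point rounding. Note that the standard deviation has to be the population version (dividing by N, NumPy's default) for the identity to hold exactly.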

More data (i.e., increasing n) will often come at a cost of $ \hat{\rho}_{\small{Y, R}} $. That is, when we work with big data (or just more data), we will often see that to achieve extra data, we will end up with a non-representative sample of the population (and new problems). When you see a poll with an impressive sample size, be extra cautious about $ \hat{\rho}_{\small{Y, R}} $.

More importantly, even a small correlation between Y and R can introduce far more bias than an increase in the sample size can remove. In other words, you cannot easily address problems with a bias simply by increasing n. On the contrary, you run the risk of making the problem worse. For a good post on this in relation to Meng’s paper, see this 2018 post by Jerzy Wieczorek (see also this Twitter thread).
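A back-of-the-envelope sketch can make this concrete. All numbers below are hypothetical (a population of 200 million, a poll of 2 million respondents, a data defect correlation of 0.005, and a binary outcome with a standard deviation of 0.5), and the comparison with a simple random sample ignores the finite population correction, so treat it as a rough illustration rather than an exact result.

```python
import math

def meng_error(rho, n, N, sigma):
    """Error implied by the identity: rho * sqrt((N - n) / n) * sigma."""
    return rho * math.sqrt((N - n) / n) * sigma

# Hypothetical inputs, chosen only for illustration.
N = 200_000_000      # population size
big_n = 2_000_000    # a very large poll (1% of the population)
rho = 0.005          # small data defect correlation
sigma = 0.5          # SD of a binary outcome split 50/50

# Error of the large but slightly non-representative poll.
bias_big = meng_error(rho, big_n, N, sigma)

# Typical error (standard error) of a simple random sample of 1,000.
srs_se_1000 = sigma / math.sqrt(1000)

# Size of a simple random sample with the same typical error
# as the large poll (approximate).
n_eff = (sigma / bias_big) ** 2

print(f"error of the n = 2,000,000 poll: {bias_big:.4f}")
print(f"standard error of an SRS of 1,000: {srs_se_1000:.4f}")
print(f"approximate equivalent SRS size: {n_eff:.0f}")
```

Under these assumptions, the poll with two million respondents is off by about 2.5 percentage points, which is worse than the roughly 1.6 percentage points we would typically expect from a simple random sample of 1,000. It is worth about as much as a random sample of some 400 people.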

We know that the quality of an opinion poll is understood by its ability to provide a sample average that is similar to that of a population average. However, often, we look at the sample size of an opinion poll and we are, all else equal, more impressed by a greater sample size. The problem here is the big data paradox. Again: “The bigger the data, the surer we fool ourselves.” And unless we have data on all cases (n = N), which we in most cases do not have, we need to make sure we do not get impressed by the size of data if we are interested in a population average.

That being said, how do we know what sample size to use? There are various justifications for a specific sample size. Lakens (2022), for example, outlines six different possible justifications (from Table 1 in the paper):

Measure entire population. A researcher can specify the entire population, it is finite, and it is possible to measure (almost) every entity in the population.
Resource constraints. Limited resources are the primary reason for the choice of the sample size a researcher can collect.
Accuracy. The research question focusses on the size of a parameter, and a researcher collects sufficient data to have an estimate with a desired level of accuracy.
A-priori power analysis. The research question has the aim to test whether certain effect sizes can be statistically rejected with a desired statistical power.
Heuristics. A researcher decides upon the sample size based on a heuristic, general rule or norm that is described in the literature, or communicated orally.
No justification. A researcher has no reason to choose a specific sample size, or does not have a clearly specified inferential goal and wants to communicate this honestly.

The first justification is having n ≈ N. In opinion polls, this is rarely an option but, on the contrary, one of the reasons we do opinion polls in the first place. Instead, we often work with a combination of the other justifications, from accuracy (to get a small margin of error) to heuristics (e.g., most representative polls are around 1,000 respondents). In most cases we do face resource constraints, and we want to collect as much data as possible within the constraints (time, money, etc.) we are working with.
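The accuracy justification and the 1,000-respondent heuristic are closely related, which a short sketch can show. It uses the textbook margin of error for a proportion and assumes simple random sampling (i.e., no data defect correlation), a 95% confidence level (z = 1.96) and the worst case of a 50/50 split. The function names are my own and only for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

def required_n(moe, p=0.5, z=1.96):
    """Smallest n that gives a margin of error of at most `moe`."""
    return math.ceil(z**2 * p * (1 - p) / moe**2)

for n in (500, 1000, 2000, 10000):
    print(f"n = {n:>6}: margin of error = {margin_of_error(n):.3f}")

print("n needed for +/- 3 points:", required_n(0.03))
```

With 1,000 respondents the margin of error is about 3 percentage points, and quadrupling the sample size only halves it. None of this says anything about $ \hat{\rho}_{\small{Y, R}} $: the margin of error only describes sampling variability in a poll that is representative to begin with.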

When we talk about big data, we often say that more data is not always better, and hence big data is not a selling point in and of itself. That is, increasing n might even correlate positively with $ \hat{\rho}_{\small{Y, R}} $, hence the big data paradox. Accordingly, a lot of data is a necessary but not sufficient condition for a good opinion poll. The paper by Meng (2018) helps us understand some of these challenges in the domain of polling (and I can highly recommend reading the paper).