Experiments and societal challenges

The randomised controlled trial is seen as the gold standard within the social sciences. I do not disagree. I love well-executed experiments with strong causal identification. If two studies, one experimental and one non-experimental, differ in their conclusions, there is no doubt which study I, all else equal, will side with.

Most importantly, I often find myself defending experiments when discussing the pros and cons of different research designs. I do not find qualitative approaches better than experiments (how exactly are such approaches better suited to answer questions of a causal nature?). I do not have specific ethical concerns that are unique to experiments (I have concerns related to the collection of most data, observational and experimental). I do not have concerns related to the external validity of experiments (people should be a lot more concerned about the internal and external validity of observational research).

The one concern I do have and share with critics is that experimental research is very conservative in terms of what societal changes are discussed and, at the end of the day, recommended. Experimental research will often test rather limited deviations from the status quo and thereby limit the realm of possibility in politics. This critique is not isolated to experimental research, but it is something I have been thinking a lot about recently in relation to experiments.

In the policy evaluation toolbox, the main strength of the experimental research design is its simplicity. The Neyman–Rubin causal model is easy to understand, and with a single treatment and an outcome, we can consider potential outcomes and map them to counterfactual scenarios in an experiment. In doing this, we are able to estimate average treatment effects and demonstrate how specific treatments matter (or not).
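To make that simplicity concrete, here is a minimal simulation sketch of the potential outcomes logic in Python. All numbers are hypothetical, and the constant treatment effect is an assumption for illustration, not a claim about any real intervention:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical potential outcomes: Y(0) under control, Y(1) under treatment.
# In reality we only ever observe one of the two for each unit.
y0 = rng.normal(loc=50, scale=10, size=n)
y1 = y0 + 2.0  # assume a constant true treatment effect of 2 for illustration

# Random assignment makes treatment independent of the potential outcomes.
treated = rng.random(n) < 0.5
y_observed = np.where(treated, y1, y0)

# The difference in means is an unbiased estimator of the average treatment effect.
ate_hat = y_observed[treated].mean() - y_observed[~treated].mean()
print(f"Estimated ATE: {ate_hat:.2f} (true ATE: 2.00)")
```

The appeal is obvious: a dozen lines capture the whole inferential machinery. That is exactly the simplicity I am getting at.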

However, as we know, the evaluation methods we use can shape how we think about topics and the magnitude of the interventions we study in the first place. Specifically, when we consider conducting experiments, we look at society through a microscope and become less likely to pay attention to macroscopic changes.

My concern is that too much attention to experiments will make us more likely to focus on quick and easy solutions to societal challenges. For example, I read the job market paper by Berk Can Deniz on newspaper websites and their adoption of A/B testing. In his study, he finds that when newspapers adopt an experimental approach, they focus more on incrementalism than on bigger changes, that is, on minuscule yet reliable improvements. In other words, adopting an experimental research design can lead to less innovation.

I have nothing against favouring incremental and reliable change over radical and uncertain change. However, I do think it is worth considering how the methods we use – especially experiments – shape the way in which we think about problems – and their solutions.

When I talk about quick and easy solutions, I am also concerned about the timeframe within which we discuss solutions to big problems. Specifically, we pay too much attention to outcomes that can be measured relatively early in the causal chain in order to increase internal validity (i.e., causal identification). Of course, one might argue that incremental changes are the only way to address big problems. I do not agree with this point.

Consider the example Bill Gates provides in his book, How to Avoid a Climate Disaster, on how rich countries should reach net-zero emissions by 2050. He has a very good point on the difference between working towards “reduce by 2030” and “get to zero by 2050”:

If we set out to reduce emissions only somewhat by 2030, we’ll focus on the efforts that will get us to that goal—even if those efforts make it harder, or impossible, to reach the ultimate goal of getting to zero. For example, if “reduce by 2030” is the only measure of success, then it would be tempting to replace coal-fired power plants with gas-fired ones; after all, that would reduce our emissions of carbon dioxide. But any gas plants built between now and 2030 will still be in operation come 2050 — they have to run for decades in order to recoup the cost of building them — and natural gas plants still produce greenhouse gases. We would meet our “reduce by 2030” goal but have little hope of ever getting to zero.

That is, even if the outcome is identical over time, the solution (“treatment”) needed might differ depending on when we assess the relevance and importance of the solution. In the example above, we can consider multiple ways to decarbonize, but we need to settle on the time frame before we consider the solutions.

As for the easy fixes, I see it as no coincidence that nudging has attracted growing attention as a paradigm for solving problems. Studies on nudging rely heavily on experimental research designs. As always, I am not the only one to point this out. In a recent comment, Gal and Rucker (2022) discuss a so-called experimental validation bias in relation to nudging:

We suggest that the dominance of the nudge approach in applied behavioural science is largely due to experimental validation bias, that is, the tendency to overvalue interventions that can be ‘validated’ by experiments. This bias results in interventions of limited ambition and scope, leading to an impoverished view of the relevance of behavioural science to the real world.

What we know now is that a lot of the small nudges are not well-equipped to solve societal challenges. In other words, large problems require large solutions.

It is a feature of experiments that they mostly yield small effects (and when they do not, it might be because they are not pre-registered, cf. Kaplan and Irvin 2015). In practice, one consequence of experiments providing unbiased causal estimates is that such estimates are often close to zero.

Experiments test small and realistic interventions, and we have no reason to believe that such interventions have huge effects. On the contrary, we should be skeptical towards experiments with impressive effect sizes. This is even a problem from a statistical perspective: when a high fraction of the effects being tested are truly null, many statistically significant results will be false discoveries (see, for example, this study on how high false discovery rates stem from a high fraction of null effects).
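A quick back-of-the-envelope illustration of that point. The alpha, power, and null fraction below are my own assumptions for the sake of the arithmetic, not figures from the study linked above:

```python
# Hypothetical numbers: how a high fraction of true nulls inflates the
# false discovery rate, even at a conventional alpha and with decent power.
alpha = 0.05          # significance threshold
power = 0.80          # probability of detecting a true effect
null_fraction = 0.90  # assume 90% of tested interventions truly do nothing

false_positives = alpha * null_fraction        # share of tests: null and significant
true_positives = power * (1 - null_fraction)   # share of tests: real and significant
fdr = false_positives / (false_positives + true_positives)
print(f"False discovery rate among significant results: {fdr:.0%}")  # ~36%
```

Under these assumptions, more than a third of the “significant” findings would be false discoveries, which is why an impressive effect size should raise, not lower, our guard.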

The focus on average treatment effects (or variations hereof) also matters for how we think about interventions and how we should deal with societal challenges. I am not talking about how the focus on average effects hides potential individual-level variability in the effects, but about the contextual-level complexity in understanding treatment-context heterogeneity.
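A toy simulation of what such heterogeneity can look like (the contexts and effect sizes are entirely hypothetical): the same treatment works in opposite directions in two contexts, while the pooled average suggests a modest, uniform effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical: units sit in one of two contexts where the same treatment
# works in opposite directions (+4 in context 0, -3 in context 1).
context = rng.integers(0, 2, size=n)
true_effect = np.where(context == 0, 4.0, -3.0)

treated = rng.random(n) < 0.5
y = rng.normal(0, 1, size=n) + treated * true_effect

ate = y[treated].mean() - y[~treated].mean()
print(f"Pooled ATE: {ate:.2f}")  # ~0.5, masking the +4 and -3 context effects

for c in (0, 1):
    mask = context == c
    cate = y[mask & treated].mean() - y[mask & ~treated].mean()
    print(f"Effect in context {c}: {cate:.2f}")
```

A policymaker reading only the pooled estimate would roll out an intervention that actively harms half the population.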

What about the cases where we do find large effects with societal relevance? In those cases, we should care about interactions between units in the experiment. If a social policy is likely to have strong effects, then the stable unit treatment value assumption (SUTVA) is – all else equal – less likely to hold: the stronger the effect, the more likely it is to spill over to untreated units through social interaction. There are good attempts in the literature to correctly identify and estimate such spillover effects in experiments (e.g., Vazquez-Bare 2022), but it is something I rarely see researchers discuss when they talk about the assumptions underlying their research design.
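To illustrate the stakes, here is a minimal sketch of how spillovers to the control group can bias the naive difference in means. The spillover structure below is made up for illustration, not taken from any of the designs discussed above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

treated = rng.random(n) < 0.5

# Made-up SUTVA violation: control units are partly 'treated' through
# spillovers that scale with the overall share of treated units.
direct_effect = 3.0
spillover_to_controls = 1.5 * treated.mean()  # ~0.75 with half the units treated

y = rng.normal(0, 1, size=n)
y[treated] += direct_effect
y[~treated] += spillover_to_controls  # contaminated control group

naive_ate = y[treated].mean() - y[~treated].mean()
print(f"Naive difference in means: {naive_ate:.2f} (direct effect: {direct_effect:.2f})")
# The spillover pushes the control mean up, so the naive estimate
# understates the direct effect of the policy.
```

The point is not the specific numbers but the direction of the problem: exactly when an intervention is powerful enough to matter for society, the clean experimental comparison is least likely to be clean.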

My point here is that experimental research makes us less likely, on average, to “think in systems”, and more likely to think in simple counterfactual scenarios. We are more likely to look for solutions in specific interventions that can provide quick and easy fixes to large problems rather than to consider a broader opportunity space with a larger set of influences and societal challenges to take into account. Put simply, I don’t think experimental studies on specific nudges can do a lot to solve global problems such as the obesity epidemic or climate change.

In sum, as Elizabeth Popp Berman describes in her book Thinking Like an Economist: How Efficiency Replaced Equality in U.S. Public Policy, our norms can shape how we think: “once a particular intellectual framework is institutionalized, it can take on a life of its own, defining the boundaries of what is seen as politically reasonable.”

Again, I love experiments, and the last thing I want to see is a world where social scientists do not care about strong causal identification. However, we need to be aware of and discuss the potential limitations of such approaches, even if other methods and research designs suffer from similar limitations.