How effective is nudging? #2

In a new study, Mertens et al. (2022) examine the effectiveness of nudging. Specifically, they conduct a meta-analysis of 455 effect sizes from 214 publications. Here is the key finding presented in the abstract: “Our results show that choice architecture interventions overall promote behavior change with a small to medium effect size of Cohen’s d = 0.45 (95% CI [0.39, 0.52]).”

The interpretation of a Cohen’s d effect size of 0.45 as small to medium is somewhat of a humblebrag. Psychological research has demonstrated that effect sizes are often smaller than original guidelines would suggest (see e.g., Gignac and Szodorai 2016). In brief, a Cohen’s d of 0.45 is not a “small to medium” effect.

Unsurprisingly, there has been a lot of discussion about the magnitude of these effects and the key finding, especially related to the selection of specific studies and whether we should believe the findings at all (e.g., here, here, here, here, here and here).

Yet, I believe there is another important issue to be addressed if we are to fully understand – and even talk about – how effective nudging is. In a previous post, I identified a source of bias in reviews of the effectiveness of nudging that is rooted in how nudging is defined. Alas, the problem I identified in that post is also relevant here. Here is how Mertens et al. (2022) selected the studies of interest:

“We searched the electronic databases PsycINFO, PubMed, PubPsych, and ScienceDirect using a combination of keywords associated with choice architecture (nudge OR “choice architecture”) and empirical research (method* OR empiric* OR procedure OR design).”

I am not surprised to see that such a selection strategy returns studies with large effect sizes. Accordingly, in contrast to other discussions of the meta-analysis, I am less interested in the studies in the meta-analysis that find large effect sizes (Wansink and whatnot). Instead, I am genuinely interested in the nudges in the literature with effect sizes that are indistinguishable from zero. In other words, why would we expect anything other than large effects from studies with low statistical power?

As I pointed out in my previous post, the main problem with any attempt to conduct a meta-analysis (or any type of review of the evidence for nudging) is the definition of a nudge as proposed by Richard H. Thaler and Cass R. Sunstein in Nudge:

“A nudge, as we will use the term, is any aspect of the choice architecture that alters people’s behavior in a predictable way without forbidding any options or significantly changing their economic incentives.”

The problem is that a nudge will, by definition, always have a non-zero effect (i.e., alter people’s behaviour). This becomes a problem when a study is about a nudge if, and only if, the intervention manages to alter people’s behaviour. If you introduce a nudge, and this nudge does not alter people’s behaviour, it is – by definition – not a nudge.

I decided to consult the replication material (available on OSF) and arrange the effect sizes from the smallest to the greatest effect. My aim was to examine whether the studies included in the meta-analysis are willing to talk about nudges that are not … nudges™.

The first effect size showing up in the replication data is from Goswami and Urminsky (2016). This effect size is zero. 0. However, when I compared the mean for the control group with the mean for the intervention group, I could see that the effect size should be greater than zero. I informed the corresponding authors of the issue and decided to move on to the next effect. They did not get back to me so I cannot confirm whether there is an error or not.
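For readers who want to run the same kind of sanity check, here is a minimal sketch of how a Cohen’s d can be computed from group means, standard deviations and sample sizes using the pooled standard deviation. The numbers are made up for illustration and are not taken from Goswami and Urminsky (2016), and the meta-analysis may well have derived its effect sizes in a different way (e.g., from proportions or test statistics).

```python
import math


def cohens_d(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2
    ) / (n_treat + n_ctrl - 2)
    return (mean_treat - mean_ctrl) / math.sqrt(pooled_var)


# Hypothetical summary statistics (NOT from any study discussed here).
d_example = cohens_d(
    mean_treat=5.5, mean_ctrl=5.0,
    sd_treat=2.0, sd_ctrl=2.0,
    n_treat=50, n_ctrl=50,
)
print(round(d_example, 2))  # 0.25
```

If the two group means differ at all, a d computed this way cannot be exactly zero, which is why a reported 0 next to unequal means looks odd.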

The second effect size is from Zikmund-Fisher (2011). This effect size is also 0 and I can confirm, upon reading the paper, that the effect of interest is indeed 0. This is an interesting study. First, it does not introduce a nudge or talk about choice architecture. Instead, it mentions “nudge” once in the discussion:

“To use the terminology of Thaler and Sunstein [21], it is a type of ‘‘nudge’’ that can help induce particular behaviors without restricting individual choice. Its use, therefore, may be inappropriate in patient decision aids or other situations where one’s goal is to inform patients’ decision making without influencing their decisions.”

This study is not talking about a specific nudge, but only mentions “nudge” in order to discuss nudges in general terms (with the terminology introduced by Thaler and Sunstein). In other words, this study is not about a nudge.

Let us move on to the third effect size. The study in question is Loibl et al. (2018). The study mentions nudges one time in the manuscript: “Reminders are well-known nudges designed to reduce inattention and forgetfulness by alerting individuals about a certain task and thus affecting their decisions”. However, the small effect size being reported here is not for a reminder but a default (Cohen’s d = 0.0002). Interestingly, there is another effect size in the meta-analysis from the study, and that effect size is indeed for an intervention that is a reminder (i.e., a well-known nudge). This effect is statistically significant (Cohen’s d = 0.0954). In other words, nudging is not being discussed in the study in relation to the intervention with a small effect, only in relation to the intervention that turns out to have an effect.

In a meta-analysis, we are often looking at multiple effect sizes from the same publication. In this meta-analysis, more than half of the publications report more than one effect size. For example, eight of the studies report six effect sizes. When we have more effect sizes, we can also allow some of them to show smaller effects and, while talking about nudges, also show the interventions that are not nudges. Unsurprisingly, there is a strong negative correlation between the number of reported effect sizes and the smallest reported effect size (r = -.21). In other words, if a study only reports a single effect size and talks about a nudge, it is less likely that this effect is small.
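The correlation above can be computed along the following lines. This is only a sketch: the rows are invented toy data rather than the replication material, and the names (`rows`, `per_publication`, `pearson_r`) are my own, so the resulting coefficient says nothing about the real dataset.

```python
import math
from collections import defaultdict

# Hypothetical (publication id, Cohen's d) pairs, NOT the real data.
rows = [
    ("A", 0.80),
    ("B", 0.50), ("B", 0.30),
    ("C", 0.40), ("C", 0.10), ("C", 0.05),
    ("D", 0.60), ("D", 0.20), ("D", 0.02), ("D", 0.00),
]

# Group the effect sizes by publication.
per_publication = defaultdict(list)
for pub_id, d in rows:
    per_publication[pub_id].append(d)

n_effects = [len(ds) for ds in per_publication.values()]
smallest_effect = [min(ds) for ds in per_publication.values()]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(n_effects, smallest_effect)
print(f"r = {r:.2f}")
```

The unit of analysis here is the publication, not the individual effect size: each publication contributes one pair (number of effect sizes, smallest effect size).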

Let us move on to the next effect sizes. The next two effect sizes are from a report, Using Behavioural Insights to Encourage Charitable Donations. There are six effect sizes in the meta-analysis from this report, and most of them are small, but nowhere is there any mention of “nudge” (or “nudging”) or “choice architecture”. To understand why this report is included in the meta-analysis, we need to look at the full description of the study selection:

“To compensate for the potential bias this temporal restriction might introduce to the results of our meta-analysis, we identified additional studies, including studies published before 2008, through the reference lists of relevant review articles and a search for research reports by governmental and nongovernmental behavioral science units.”

This is one of the studies identified via “a search for research reports by governmental and nongovernmental behavioral science units”. I find it interesting that this report presents weak effect sizes and does not mention “nudge” even once. Had there been a large effect in the report, would the authors have been more likely to talk about nudges?

I have additional issues with the attempt to compensate for a potential bias, and I am concerned that the authors only introduce new problems. For example, 10 out of 56 effect sizes from studies prior to 2008 are from studies by Brian Wansink. Furthermore, one of the effect sizes, from Diliberti et al. (2004), has a Cohen’s d of 3.0784! My guess is that the study would not have survived all the way to a meta-analysis in 2022 if the Cohen’s d had been 0. In other words, my concern is that there is a survivorship bias in the selective “sampling” of studies published before 2008. The effect sizes are also significantly greater in the studies in the meta-analysis that are published prior to 2008.

Let us take one final effect size. The next effect size is from Bronchetti et al. (2013). (The replication material says the publication is from 2011 but from what I can see it is from 2013. I assume the authors looked at the working paper from 2011.) This is actually a good example of a nudge with no effect, and it is the only effect of interest. The study is even called “When a nudge isn’t enough: Defaults and saving among low-income tax filers”. What I find interesting about this study is that the proposed intervention is discussed in the following way: “In short, while intentions to save (spend) appear to be stronger among late (early) filers, our default manipulation does not “nudge” late filers to save more often.”

So is the default manipulation a nudge? Based upon the definition of nudging by Thaler and Sunstein, the intervention is not a nudge because it did not nudge. Do notice how nudge is being used as a verb here. My hypothesis is that when studies do not find a statistically significant effect of an intervention, but still talk about nudging, they are less likely to mention a nudge (noun) and are more likely to nudge (verb). You can try to nudge people, but it might not be a nudge.

There are other issues that are worth mentioning here. For example, what about negative effect sizes? Are we less likely to call a nudge a nudge if it has the opposite effect on the intended behaviour? Would we call a medical drug a drug if it made the patient worse off? In other words, are we more likely to call an intervention a nudge if the effect size is non-negative? If this is the case, extreme effect sizes from underpowered studies are more likely to get the label “nudge” when they show effects in the “theoretically expected” direction.

In sum, I believe the definition of nudging is a subtle but significant problem similar to that of selecting on the dependent variable. That is, nudges are the successful cases and the sampling of effect sizes in reviews (including meta-analyses) interested in estimating the effectiveness of nudging will often overestimate the impact of such interventions.
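To make the argument concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: a small true effect (d = 0.1), 20 participants per group, and a rule that only studies with a significant effect in the expected direction get labelled as being about a “nudge”. It is not a model of the Mertens et al. (2022) data.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_D = 0.1       # assumed small true effect (hypothetical)
N_PER_GROUP = 20   # assumed small sample per arm (hypothetical)
N_STUDIES = 2000   # number of simulated studies


def simulate_study():
    """Return the observed Cohen's d from one simulated two-group study."""
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_D, 1.0) for _ in range(N_PER_GROUP)]
    pooled_sd = math.sqrt(
        (statistics.variance(control) + statistics.variance(treated)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


all_d = [simulate_study() for _ in range(N_STUDIES)]

# With equal group sizes, t = d * sqrt(n / 2). The 1.96 cutoff is a rough
# normal approximation, used here only to mimic "significant and positive".
labelled_nudge = [d for d in all_d if d * math.sqrt(N_PER_GROUP / 2) > 1.96]

mean_all = statistics.mean(all_d)
mean_labelled = statistics.mean(labelled_nudge)
print(f"all studies: {mean_all:.2f}, labelled as nudges: {mean_labelled:.2f}")
```

With these settings, an observed d has to exceed roughly 0.62 to pass the cutoff, so the average among the “nudges” is far above the true effect of 0.1, even though nothing about the interventions differs. That is what selecting on the dependent variable looks like.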

Accordingly, my recommendation is to not pay too much attention to such reviews and their exaggerated, consultancy-friendly conclusions. Instead, I believe there is much more value in megastudies that examine multiple different interventions in highly powered settings (e.g., Milkman et al. 2021, 2022). Unsurprisingly, these studies also find smaller effect sizes (and do remember the study by DellaVigna and Linos 2022 showing that academic journal articles suffer from selective publication and low statistical power). Eventually, when we have enough of such megastudies, we might conduct a meta-analysis of these studies to provide more reliable estimates of the effectiveness of nudging (prediction/spoiler alert: the average effect size will be much smaller than a Cohen’s d near 0.45).

Last, why are so many people eager to believe that nudges have such large effects? I believe one answer is that we are biased in our expectations about effect sizes. Or as Andrew Little formulates it: “Ironically, this is probably driven by the fact that cute/surprising/”behavioral” results are more likely to get publicized, and we don’t properly adjust for this when forming beliefs. Behavioral biases lead us to overestimate the magnitude of behavioral biases.” I believe a similar bias might have been at play when Thaler and Sunstein wrote their definition of nudging (e.g., the choice architecture might have nudged them to focus on a definition putting emphasis on positive findings). That is indeed ironic.