How effective is nudging?

Since the publication of Nudge: Improving Decisions about Health, Wealth, and Happiness, thousands of studies have examined how different nudges can change attitudes and behaviour. The core proposition of the book was that nudging is an effective way to change behaviour, and for that reason, the book is packed with examples of how nudges can lead to better outcomes than traditional, paternalistic policies.

However, not all nudges are effective, and we should not treat the evidence presented in Nudge as representative of the broader population of nudge findings. As with much social science these days, we know that many findings do not replicate and that the expected effect in most studies is close to zero (in terms of both statistical and practical significance).

In a recent paper, “How effective is nudging? A quantitative review on the effect sizes and limits of empirical nudging studies”, the authors conclude that “only 62% of nudging treatments are statistically significant”. Is this a small number? Compared to other findings, especially within the domain of social psychology, finding that 62% of the treatments are statistically significant is quite impressive.

My primary issue with this study – and similar studies interested in whether nudges work – is that there is no agreement in the literature on when a treatment should be categorized as a nudge. For that reason, I am not convinced that a quantitative review such as the one above can provide a meaningful estimate of the proportion of nudges that are statistically significant (even if we only look at the published literature). Specifically, look at the criteria used in the literature review to select relevant studies: “we did not include studies that did not mention, ‘nudge’ or ‘nudging’, that did not quote the original work (Thaler and Sunstein, 2008), or that had no other link to the nudge concept.”

My guess is that statistically significant nudging studies are more likely to cite the original work on nudging by Thaler and Sunstein. Or, more importantly, interventions that do yield statistically significant effects are more likely to be called nudges in the first place. To understand why, let us take a look at the definition of a nudge offered in Nudge:

“A nudge, as we will use the term, is any aspect of the choice architecture that alters people’s behavior in a predictable way without forbidding any options or significantly changing their economic incentives.”

In the definition of “a nudge”, it is essential that an aspect of the choice architecture alters people’s behavior. In other words, if an aspect of the choice architecture has no effect, it is not a nudge. This is a definition that is set up to succeed: if an intervention works, it is a nudge; if it does not, it is not a nudge. Conceptually, it makes no sense to say that x% of nudges work, as 100% of nudges will have an impact – if they do not, they are not nudges. Heads you win, tails I lose.

With this definition in mind, I am surprised that 38% of the effects are not statistically significant. However, I think I know why that number is so high: many of the insignificant effects are not the nudges themselves. In other words, in the studies examined in the review, more than 62% of the nudges will “work”. The review covers 100 studies with 317 effect sizes. Not all effect sizes are created equal, and most studies have one key effect that is more important than the rest. If a study reports two effects, one significant and one insignificant, the effect singled out as the most important one (in this case the nudge) will more often be the statistically significant one. The authors do not provide a lot of information on this, but they do basically confirm the pattern: “Occasionally, statistically insignificant effects are reported to be insignificant in the discussion section of the primary publications.”

For those reasons, and if we set aside my conceptual concerns for a second, I fully agree with the authors when they conclude “that the findings of this study have to be interpreted with great care and are rather represent an upper bound of the effectiveness of nudging.”

The authors are aware of the obvious publication bias: “Moreover, we might be victim of a possible publication bias as many studies with insignificant results are often not published.” I am surprised, however, that they do not discuss the simple fact that they only look at 100 studies. This might seem impressive compared to other reviews, but if you have followed the field since 2008, you know that far more than 100 studies were conducted between 2008 and 2018. For example, the Behavioural Insights Team in the UK (i.e., a single research team) conducted more than 300 experiments over a six-year period (see Maynard and Munafó 2018).

To better understand this problem, consider the paper “RCTs to Scale: Comprehensive Evidence from Two Nudge Units”, which compares all effects from two Nudge Units in the United States to those of published nudging studies. The finding is as unsurprising as it is depressing: there is a huge discrepancy, with published effects being much larger and the published studies having low statistical power. This discrepancy can primarily be explained by publication bias.
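To see how publication bias alone can produce this kind of discrepancy, here is a toy simulation of my own (a sketch, not the paper's analysis, assuming numpy is available): many underpowered studies of a small true effect are run, but only the statistically significant ones get "published".

```python
# Toy simulation of publication bias: a small true effect, many underpowered
# studies, and only statistically significant results get "published".
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.10          # assumed true standardized effect (Cohen's d)
n_per_arm = 50              # small, underpowered two-arm studies
n_studies = 5_000

se = np.sqrt(2 / n_per_arm)                       # approx. standard error of d
estimates = rng.normal(true_effect, se, n_studies)
significant = np.abs(estimates / se) > 1.96       # two-sided test at p < .05

print(f"true effect:               {true_effect:.2f}")
print(f"mean of all estimates:     {estimates.mean():.2f}")                 # ~0.10
print(f"mean of significant ones:  {estimates[significant].mean():.2f}")    # ~0.40
print(f"share significant (power): {significant.mean():.1%}")               # ~8%
```

In this setup the average "published" estimate is roughly four times the true effect, even though no individual study does anything wrong.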

The point of this post is not to say that nudging is not effective. However, we should be much more aware of the conceptual and methodological challenges in providing reliable answers to how effective nudges are, and especially the relative effectiveness of nudging compared to other solutions (see Benartzi et al. 2017 for a similar point).

Do school closures cost votes?

In connection with the publication of the book KV17: Analyser af kommunalvalget 2017, Kommunen.dk has produced a podcast with Ulrik Kjær. The episode is based on the chapter on the electoral consequences of school closures.

Here is a short introduction:

In the 2010–2013 election period, 147 schools were closed across 55 municipalities, and in the 2014–2017 period, 56 schools were closed across 30 municipalities. In total, 203 closed schools in 85 municipalities are included in the study.

But does the assumption that school closures cost votes hold up? The short answer is no.

It is a timely and relevant topic, as several municipalities will be facing school closures in the near future.

I can warmly recommend the podcast, which can be found here.

Observations related to COVID-19 #3

See also: #1, #2

Traditions. Before the pandemic, it was a law of nature that major football tournaments and the Olympics took place every four years. Something nothing could shake. Christmas and New Year were likewise things that would always take place, true to tradition. Traditions are important, and it did not surprise me when a poll in December showed that a large share of Danes did not intend to follow the authorities' recommendations. Christmas did not turn out as expected, though, and the pandemic has taught us that no tradition is sacred. With a bit of good will, we can change these things. The world is far more changeable than we think. Nothing is sacred. In many ways, that is liberating.

Breakfast routine. I am a creature of habit when it comes to breakfast. It is always oatmeal. Without sugar. With nuts and soy milk (I stopped buying cow's milk in 2018). And orange juice and French press coffee. It is a routine that has only been consolidated during the pandemic. One of the new activities I have taken up during the pandemic is homemade juice. My favourite is orange with carrot and ginger, but apple with ginger and celery will also do. When you mostly work from home, it is also easier to make a whole pot of decaffeinated French press coffee in the morning, which works as a perfect transition from early to mid-morning. Conversely, I also miss breaks from this routine, especially the continental hotel breakfast buffet (the only type of buffet I can see the logic of).

Hindsight. Already in January, my feed was full of tweets and news stories about how it was now one year since the WHO or others had said this or that. It is easy to find statements from people who were wrong. There is not always anything noteworthy about being wrong. Most of us are wrong about many things on a near-daily basis, including things we are, or ought to be, experts in. The problem arises when you cannot acknowledge these mistakes – and therefore never learn from them. Cass R. Sunstein is thus best not taken seriously. There are also good examples of contributions that, early on and with great precision, explained what we were likely to face (see, for example, here).

Alcohol. I did not drink so much as a single drop of alcohol in 2019, which was an interesting experiment. In hindsight, I am glad I ran that experiment in 2019 and not in 2020. First, because one of the reasons for not drinking was to examine the social aspects of not drinking alcohol, which would have been difficult (or too easy?) in 2020. Second, because alcohol is, after all, fantastic (in moderate amounts) – especially in the middle of a global pandemic. As with many other things, alcohol has been a good tool for distinguishing between weekdays and weekends, which is why I abstain from drinking on weekdays. If you are interested in learning more about what alcohol means for your body and for society, I can recommend the book ‘Drink? The New Science of Alcohol and Your Health‘ by David J. Nutt.

Vaccines. It has been fantastic to follow the development of an effective vaccine. It was by no means a given that a vaccine would arrive so quickly – especially not considering previous pandemics. At the beginning of the pandemic, many experts expected that we would have to wait years for an effective vaccine. Whereas the data coverage in 2020 was mostly about the number of new infections, the data coverage in 2021 is largely about the number of people vaccinated. There has thus been a shift from “negative” to “positive” numbers. I know absolutely nothing about vaccines, however, and it has been interesting to see how many DJØF graduates (lawyers, economists and the like) apparently possess expert knowledge about how and when vaccines should be approved. I should note that I have absolutely nothing against people commenting on matters not directly related to the field they do research in, but it is good to be relatively modest (do read this fantastic paper). The only thing I find categorically hard to come to terms with is how bad we are at relating rationally to the (low) risks associated with taking vaccines (especially compared with the benefits to society as a whole).

Carlsbergfondet. In 2016, when I had submitted my PhD thesis, I applied for postdoc funding from Carlsbergfondet (the Carlsberg Foundation), so I am obviously biased (I bear absolutely no grudge over a rejected research application – that is the name of the game). It is my clear impression, if the name had not already given it away, that there is nothing sober about the enterprise. It has long been obvious that it is primarily friends of the house who receive research funding from Carlsbergfondet. It would never occur to me to apply for funding there again, or to recommend that others do so. I am not aware of any researcher who has publicly criticised research funded by Carlsbergfondet and subsequently succeeded in obtaining funding there. And I have even less reason to believe that we will see so much as a single example of that in the future. In that sense, Carlsbergfondet is closer to being a marketing agency than a research foundation (one of the defining characteristics of the latter, moreover, being that applications undergo peer review). For a good example of what I find problematic, see this example of how Carlsbergfondet views research. Given what we know about the absence of peer review at Carlsbergfondet and their generally dubious practices, it should come as no surprise that I am highly critical of much of the “research” they throw money at.

Meal kits. Before the pandemic, I had not so much as considered jumping on the meal-kit bandwagon. In particular, the facts that it is now relatively easy to time when you will be at home and that you have more flexible time for cooking have made it more attractive to receive a weekly meal kit. I can only recommend it, if you have the opportunity. There is also a stochastic element, where you try new and varied food, that works really well with meal kits (and this scientific study can explain why).

Illness. Everything is about not getting sick. Many things have therefore been turned upside down. Whereas before the pandemic it was atypical to think about getting sick, during a pandemic it is far more important to actively do what you can to construct an everyday life where you stay healthy and do not think about getting sick all the time, while still taking your precautions. Normally, when I have a bad cold, I think a lot about how I will appreciate being healthy more, or how I should have appreciated it even more when I was not sick. It is much the same situation now with the time before COVID-19.

Travel literature. While in lockdown, I read Amor Towles’ A Gentleman in Moscow. It is an incredibly well-written and interesting book, and although I lack a counterfactual reading experience, my assessment is that the book reads better while you are in lockdown. When no travel is possible, it is the closest you get to the “normal” experience of reading books set in the place you are travelling to.

Brexit. It is impressive how little the world has changed when it comes to Brexit. That is not to say that Brexit has not had a major impact – or will not have one for decades to come. The news in the run-up to New Year spoke of an “extraordinarily uncertain moment“, as a no-deal Brexit was still a possibility. This coincided with a new mutation of the COVID-19 virus, which left many Danes stranded in London. I can only note that the pandemic has made it very hard for me to feel Brexit first-hand. Yet.

Walks. I have never had a problem with walking far (relatively speaking, that is). Quite the opposite. When it comes to finding new walks in London, I have tried to draw inspiration from the London Greenground map and various walking routes related to the Capital Ring and the Green Chain. It is limited, though, how many new routes I have actually explored, as I am apparently a creature of habit who enjoys the same route with minor variations. One reason is probably that it is less attractive to explore new areas when the infrastructure is not normal, including the limited options for buying food or finding something as basic as an open public toilet.

Instagram. Twitter has, as always, been a pleasure during the pandemic, but Instagram has, if anything, only become even more irrelevant and superfluous. One reason is probably that there is a limit to how much new visual content people can share when they are just at home. Another reason, of course, is simply that it is a Facebook product (in that connection, I can warmly recommend the book No Filter: The Inside Story of Instagram). Instagram will probably be the next social medium I say goodbye to.

Minimalism. I have previously toyed with the idea of how nice it would be if the world came to a standstill for a year or so, so that you could catch up on everything and get a lot finished (“be careful what you wish for“). I have tried to live more minimalistically over the past 5-10 years – not bordering on the spartan, but clearly in a way where I enjoy getting rid of things more than I enjoy acquiring them. The pandemic has been a good occasion to tidy up digitally as well. There are projects that, I have to admit, will never materialise. I have deleted a double-digit number of folders with ideas, analyses, thoughts and so on, as they simply no longer appealed to me and would therefore be more work than pleasure. Good projects could probably have come out of them, but the value of a minimalist setup, focused on what matters to you, is worth far more. Likewise, I have deleted drafts of a long list of blog posts that would take quite some time to finish (more time than I felt ready to spend on them).

Confounding. It is becoming harder and harder to isolate the effects and significance of COVID-19. What effect does COVID-19 have on my preferences? What effect does one year have on my preferences? Does it even make sense any longer to look at anything as isolated from COVID-19 in one way or another? What would life look like now in the absence of COVID-19? I do not know. For that reason, I think this will, for now, have to be the last post with assorted observations more or less related to COVID-19. We will be living with the pandemic and its consequences for many years to come, but it seems fair to note, or conclude, that we are now out of the period in which COVID-19 was new and unique. Everything is entangled with COVID-19 now. The pandemic may well have brought some good with it too, but I am thoroughly tired of it as I write this. Even writing about the pandemic makes me tired now.

KV17: Analyser af kommunalvalget 2017

The book KV17: Analyser af kommunalvalget 2017 is out now. You can throw your café money at it here. It is the authoritative work for understanding the 2017 local elections, for those who want to be better prepared for the local elections in November.

I contribute two chapters to the book myself. The first concerns split-ticket voting (written with Ulrik Kjær). In that chapter, we examine how many voters vote for the same party in the general election and the local election. It will be interesting to see how this plays out in the upcoming local elections – especially in the wake of COVID-19.

The second concerns the electoral consequences of school closures (written with Ulrik Kjær and Johan Ries Møller). In that chapter, we examine whether the mayor's party loses votes when schools are closed. Among other things, we use data on which public schools have been closed in the different municipalities (discussed in an earlier post here). The conventional wisdom is that the mayor's party is punished by the voters when schools are closed, but we show that reality is not quite that simple.

In addition to the above, there are chapters on local election turnout, the national political tide, local election manifestos, local media, trust in local politicians and much more.

If you simply cannot get enough of the Danish local election literature, there is of course also the book on the 2013 local elections, which likewise presented a long list of interesting findings of relevance to local elections in general.

Happy reading!

A few studies you should read before you do a mediation analysis

I am tired of reading and reviewing academic studies using mediation analysis, especially when researchers rely on cross-sectional, observational data. None mentioned, none forgotten. In an ideal world, people would read more Pearl and understand how difficult it is to provide empirical evidence for causal pathways (and maybe reconsider whether they want to do a mediation analysis at all).
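To illustrate the core problem, here is a minimal sketch of my own (assuming numpy and statsmodels; not taken from any of the papers below): an unobserved confounder of the mediator and the outcome produces a sizeable "indirect effect" even though the mediator has no effect on the outcome at all.

```python
# Minimal sketch: an unobserved confounder U of mediator M and outcome Y makes
# the naive product-of-coefficients "indirect effect" look substantial even
# though M has no causal effect on Y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

x = rng.normal(size=n)                        # treatment / exposure
u = rng.normal(size=n)                        # unobserved confounder
m = 0.5 * x + u + rng.normal(size=n)          # mediator: caused by X and U
y = u + rng.normal(size=n)                    # outcome: caused by U only

a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                        # X -> M
b = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params[2]  # M -> Y | X

print(f"naive indirect effect a*b: {a * b:.2f}   (true indirect effect: 0)")
```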

However, I am very much aware that researchers will need to do mediation analyses. You gotta do what you gotta do to satisfy Reviewer 2. In this post, I recommend a few studies that I hope people will read before they conduct a mediation analysis. Without further ado, here are my recommendations:

– Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects (Acharya et al. 2016)
– Yes, but what’s the mechanism? (don’t expect an easy answer) (Bullock et al. 2010)
– Enough Already about “Black Box” Experiments: Studying Mediation Is More Difficult than Most Scholars Suppose (Green et al. 2010)
– The Mediation Myth (Kline 2015)
– Unwarranted inferences from statistical mediation tests – An analysis of articles published in 2015 (Fiedler et al. 2018)
– Causal Mediation Analysis: Warning! Assumptions Ahead (Keele 2015)
– Power Anomalies in Testing Mediation (Kenny and Judd 2014)
– The “Goldilocks Zone”: (Too) many confidence intervals in tests of mediation just exclude zero (Götz et al. 2021)

A problem with survey data when studying social media

We cannot understand modern politics without studying social media. Politicians as well as ordinary citizens rely on social media to discuss and consume political content. One of the data sources researchers rely on to study behaviour on social media is survey data. However, there are specific challenges involved in using survey data to study social media. Here, I will illustrate one such challenge: even if you rely on a representative sample to study social media behaviour, there is no guarantee that you can use this sample to make meaningful inferences about social media users.

To understand this, we need to understand that there is a difference between the sample size you have and the sample size you end up using in your statistical models. If you have interviewed 1,000 citizens, but only 100 of these actually use social media, how much can we then actually say based on this data?
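As a rough back-of-the-envelope illustration (my own numbers, using the standard margin-of-error formula for a proportion), the uncertainty roughly triples when the usable subsample shrinks from 1,000 to 100:

```python
# Rough illustration: the margin of error for a proportion roughly triples
# when the usable subsample shrinks from 1,000 to 100 respondents.
import math

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(f"n = 1000: +/- {margin_of_error(1000):.1%}")  # ~ +/- 3.1 points
print(f"n =  100: +/- {margin_of_error(100):.1%}")   # ~ +/- 9.8 points
```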

Research from the UK shows that users of Twitter and Facebook are not representative of the general population (see also this paper). However, there are even more potential problems with using survey data to study behaviour on social media. Specifically, we know that the “effective sample” is not necessarily similar to the actual sample. That is, just because you have a particular sample, you cannot expect that estimates obtained from a regression will apply to the population that this sample is representative of (see this great paper for more information).

I was thinking about this issue when I read a paper titled “Ties, Likes, and Tweets: Using Strong and Weak Ties to Explain Differences in Protest Participation Across Facebook and Twitter Use”. You can find the paper here. There are several issues with the paper, but I will focus on one in particular, namely the small sample we end up looking at in the manuscript and its implications.

The paper examines whether people have strong and weak ties on Facebook and Twitter and how that matters for their participation in protest activities. Specifically, the paper argues that different types of social ties matter on Facebook and Twitter. The paper expects, in two hypotheses, that strong ties matter more for Facebook use in relation to protest behaviour whereas weak ties matter more for Twitter use in relation to protest behaviour. This is also what the paper (supposedly) finds empirical support for. Here is the main result presented in Table 1 in the paper:

That’s a lot of numbers. Very impressive. And look at that sample size… 995! But here is the problem: while the paper relies on a representative survey with 1,000 respondents, only 164 of these respondents use both Facebook and Twitter. You could have a sample size of 100,000, but if only 164 of those used both Facebook and Twitter, how much should we believe that the findings generalise to the full sample?

Out of the 164 respondents using Facebook and Twitter, only 125 have weak or strong ties. And only 66 of the respondents have variation in the ties within the respective social media platform (i.e. not the same weak or strong ties on Facebook or Twitter). Only 18 respondents in the sample have different ties across the respective social media platforms (i.e. not the same weak or strong ties on Facebook and Twitter). Here is a figure showing how we end up with only having variation on the relevant variables for 2% of the sample:

This means that when we enter a regression framework where we begin to control for all of the aforementioned variables, we will be putting a lot of emphasis on very few cases.

Why do we care about this? Because the results are weak (and definitely not strong). Even minor adjustments to the analysis will make these results throw in the towel and beg for mercy. However, this is not the impression you get when you read the paper, and in particular how confident the authors are that the results are representative: “To make the results more representative of the population, all analyses were conducted using a post-stratification weight (although the results are virtually the same when using unweighted data).”

I informed the authors that their findings are not virtually the same when using unweighted data, and that the coefficient for ‘Strong-tie Twitter use’ is actually for ‘Weak-tie Facebook use’ and vice versa. Based on this, the authors issued a corrigendum to the article, stating that: “On Table 1, the study reports regression coefficients for variables in the study. Due to a clerical error, the coefficients for two variables, strong-tie Twitter use and weak-tie Facebook use, are flipped. In Figure 1, however, the same coefficients are correctly displayed. A corrected table appears below. The authors apologize for the confusion this error may have caused.”

Notice how there is nothing about the fact that the results do not hold up when looking at the unweighted data. Interestingly, while not addressing the issue in the text in the corrigendum, the new Table 1 looks nothing like the old Table 1 (i.e. the table presented above). Here is the new table:

You will see that this table is corrected in a peculiar manner and looks nothing like the old Table 1. What happened to Model 1? In the new table, we only see two different versions of Model 2. Notice that, for the unweighted data, neither strong nor weak ties on Twitter have a statistically significant effect. Only the two coefficients for ties on Facebook are statistically significant. The same results? No. Virtually the same results? Also no.

Why do the authors say that the results are virtually the same? Because they conduct statistical tests to see whether the coefficients are different across the two models and find no statistically significant differences. This is a very conservative threshold and the coefficients would need to change a lot before they would no longer be “virtually” the same.
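To see why, consider a hypothetical example (the numbers are invented, not taken from the paper) using the usual z-test for the difference between two coefficients: a coefficient can go from clearly significant to clearly insignificant without the two estimates differing significantly from each other.

```python
# Hypothetical numbers (not from the paper): the same standard error, but one
# estimate is clearly significant and the other clearly is not. A z-test for
# the difference between them (ignoring their dependence) is far from significant.
import math

b_weighted, se_weighted = 0.20, 0.08      # z = 2.5, p < .05
b_unweighted, se_unweighted = 0.08, 0.08  # z = 1.0, not significant

z_diff = (b_weighted - b_unweighted) / math.sqrt(se_weighted**2 + se_unweighted**2)
print(f"z for the difference: {z_diff:.2f}")  # ~1.06, nowhere near 1.96
```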

However, take a look at the results in the model and see whether they are in line with the key “finding” in the paper. The relevant heuristic here is the following question: Would the authors still have made the same interpretation, i.e. that weak ties matter more than strong ties on Twitter, if only presented with Model 2 using unweighted data? I find that unlikely, especially as the coefficient for weak ties on Facebook is statistically significant in this model.

While there is something predictable about the response from the authors, I do find it interesting that they acknowledge the relevance of reporting the results using unweighted data. Kudos for the transparency, I guess.

What can we learn from this? There might be some methodological recommendations for other researchers who actually care about these issues. First, improve the (effective) sample size. Remember that 1,000 observations might not be 1,000 observations once you are done clicking on fancy buttons in SPSS. This is even more relevant when you might have a lot of measurement error. One study, for example, showed that self-reported Facebook usage is at best correlated .4 with Facebook logs of user activity.
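To get a sense of what that level of measurement error does, here is a small simulation of my own (assuming numpy; the numbers are illustrative, not taken from the study): when the self-report only correlates about .4 with true usage, the estimated slope on the self-report is attenuated to a fraction of the true slope.

```python
# Attenuation bias: the self-report is true usage plus noise, scaled so that it
# correlates about 0.4 with true usage; the estimated slope shrinks accordingly.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

true_usage = rng.normal(size=n)
outcome = 0.5 * true_usage + rng.normal(size=n)      # true slope = 0.5
self_report = true_usage + rng.normal(scale=2.3, size=n)

print(f"corr(self-report, true usage): {np.corrcoef(self_report, true_usage)[0, 1]:.2f}")  # ~0.40
print(f"slope using true usage:  {np.polyfit(true_usage, outcome, 1)[0]:.2f}")   # ~0.50
print(f"slope using self-report: {np.polyfit(self_report, outcome, 1)[0]:.2f}")  # ~0.08
```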

Second, we should care about better sampling (ideally sampling social media users directly). There is little value in a representative sample if the findings barely apply to that representative sample (or the population of interest). I doubt we have learned anything about the relevance of social media from looking at this observational data from a non-representative survey in Chile with limited variation on the key variables of interest.

Third, while we know a lot about social media, there is still a lot to be understood, and I would like to see researchers deal with “simpler” hypotheses before turning to complex ideas about how strong and weak ties work across different social media platforms. Sure, it is an interesting idea, and I am convinced the authors will spend more time celebrating and promoting their h-index than taking my concerns into account. However, I am – again – not convinced that we have learned a lot about how social media matter upon reading this paper.

There are many challenges with survey data when studying social media, and I am not against using such data at all. Most of my research relies on survey data, and I believe we can use such data to say a lot about social behaviour, including on social media. However, there are particular problems that we should be aware of, including what sample we are actually looking at and how that shapes the results we get out of our statistical models.

How low is the support for Venstre? #2

A lot has happened in the polls over the past couple of months. In January, my forecast had Venstre at around 17 percent of the vote. Since then, the party has lost even more voters in the polls. In my latest forecast, the party stands at around 10 percent of the vote. For comparison, the Conservatives are at around 15 percent.

There have been two big events for Venstre in 2021 in particular. One relates to Lars Løkke Rasmussen's departure, the other to Inger Støjberg's. My guess is that only one of them is related to Venstre's support in the polls, although one can of course debate whether the two events are related to each other and whether we are talking about causal effects.

In the figure below, I show all the polls from 2020 and 2021 available at the time of writing. I have also highlighted three key events. The first is the corona lockdown on 11 March 2020. The second is Lars Løkke Rasmussen leaving Venstre on 1 January 2021. The third is Inger Støjberg leaving Venstre on 4 February 2021.

The figure shows several interesting things. First, support for Venstre declines steadily over the course of 2020, but not from one poll to the next. In other words, it was a gradual decline over 2020, during which the Social Democrats consolidated their power and position, including in the polls.

Second, we see an immediate and large drop in the polls after Lars Løkke Rasmussen left the party at the turn of the year. Before that, Venstre was above 15% in the vast majority of polls. Since then, Venstre has been below 15% in every single poll. This is a clear discontinuity in Venstre's support in the polls.

Third, we see that there was no similar drop in the polls after Inger Støjberg left Venstre. In other words, the polls give Venstre the same (low) level of support before and after her exit.

The purpose of this post is solely to describe how Venstre has fared, not how the party will fare in the coming polls and months. Time – and probably a later post – will tell.

Polls and the 2020 Presidential Election

In 2016, opinion polls – and in particular poll-based prediction models – suffered a major hit with the inability to predict the election of Donald J. Trump as the president of the United States. If you want a quick reminder, take a look at this forecast from the 2016 Presidential Election:

The 2020 Presidential Election polling was not great, but not a disaster. This is the simple point I want to emphasise in this blog post. I saw a lot of takes in my feed in the wake of the election calling the election everything from a “total and unmitigated disaster for the US polling industry” to the death of “quantitative political science“. I know, you gotta do what you gotta do to earn the sweet retweets, but I find such interpretations hyperbolic.

I will not provide all the answers (if any at all) to what happened with the polls in the 2020 election. My aim is much more humble: to offer some reflections on what might have happened with the polls, and to link to the material I have stumbled upon so far that provides some of the most nuanced views on how well the polls performed.

When you hear people calling the election an “unmitigated disaster” for the polling industry, it is good to take a step back and remember that other elections have experienced significant polling failures in the past. It takes a lot for opinion polls to be an unmitigated disaster. Or as W. Joseph Campbell describes it in the great book Lost in a Gallup: Polling Failure in U.S. Presidential Elections: “In a way, polling failure in presidential elections is not especially surprising. Indeed, it is almost extraordinary that election polls do not flop more often than they do, given the many and intangible ways that error can creep into surveys. And these variables may be difficult or impossible to measure or quantify.”

Accordingly, it is not the norm that opinion polls enable an exact and reliable prediction of who will be the next president. If anything, when only looking at the most recent elections, our myopic view might bias our understanding of how accurate opinion polls have been in a historical perspective.

It is interesting to see what W. Joseph Campbell wrote in Lost in a Gallup, prior to the election, about what to expect in 2020: “Expect surprise, especially in light of the Covid-19 coronavirus pandemic that deepened the uncertainties of the election year. And whatever happens, whatever polling controversy arises, it may not be a rerun of 2016. Voters in 2020 are well advised to regard election polls and poll-based prediction models with skepticism, to treat them as if they might be wrong and not ignore the cliché that polling can be more art than science. Downplaying polls, but not altogether ignoring them, seems useful guidance, given that polls are not always in error. But when they fail, they can fail in surprising ways.”

Taking the actual outcome of the election into account, this is a good description of what we should expect: surprise in the polls, but no reason to ignore them. The polls turned out to be quite useful for understanding what happened, but they also held some surprises. Generals always fight the last war, and pollsters always fight the last polling failure. I believe this is the key lesson for the next election: do not ignore the polls, but be open to the possibility that there might be surprises.

What frustrated me a lot in the wake of the 2020 election was the frame that the opinion polls got it wrong. This view simply lacks the nuance needed if we want to actually understand how well the polls performed. Take, for example, this post by Tim Harford titled “Why the polls got it wrong”. There is no evaluation of how precise the opinion polls were, only the conclusion that the polls got it wrong. Admittedly, Tim Harford acknowledges that at “this early stage one can only guess at what went wrong”, but it is still disappointing to see such unnuanced takes. Ironically, the article provides less evidence on “why the polls got it wrong” than the opinion polls provided evidence on who would become the next president.

The discrepancy between what the opinion polls showed and what the media reported is interesting. Our public memory of the 2016 election is that the opinion polls got it wrong and nobody, especially the media, saw it coming. There was a polling failure, but we tend to ignore all the information available to us during the 2016 campaign that warned us that the polls might be wrong. An article by Nate Silver in 2016, titled “Trump Is Just A Normal Polling Error Behind Clinton”, stated that: “Clinton’s lead is small enough that it wouldn’t take more than a normal amount of polling error to wipe the lead out and leave Trump the winner of the national popular vote.” And we got a fair amount of polling error, although Trump was not the winner of the national popular vote.

More importantly, in 2016, the opinion polls did not all proclaim that Hillary Clinton would be the next president of the United States. In fact, that is not the job of any single opinion poll. If the job were simply to estimate the popular vote, that could be a job for a single poll. The bias was not in the individual polls but rather in the aggregation methods (see Wright and Wright 2018 for more on this point). What went wrong was that state-level polling underestimated Trump in battleground states, in particular the Rust Belt states of Michigan, Pennsylvania and Wisconsin (one reason for this was that polls did not appropriately adjust for nonresponse, cf. Gelman and Azari 2017). I will not rule out that we face similar issues with the 2020 election.
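A stylized sketch (my own toy numbers, not an actual forecast) shows why this matters for aggregation: if polling errors are treated as independent across three similar battleground states, a simultaneous miss in all of them looks like a freak event; if most of the error is shared across states, it is an entirely plausible outcome.

```python
# Toy example: the leader is up 3 points in three similar states, and the total
# polling error per state is 4 points. How likely is it that the trailing
# candidate sweeps all three states?
import numpy as np

rng = np.random.default_rng(7)
sims = 200_000
lead, sigma = 3.0, 4.0

def sweep_prob(shared_share):
    """Probability that the trailing candidate wins all three states, when
    shared_share of the error variance is common to the three states."""
    shared = rng.normal(scale=sigma * np.sqrt(shared_share), size=sims)
    state = rng.normal(scale=sigma * np.sqrt(1 - shared_share), size=(sims, 3))
    margins = lead + shared[:, None] + state
    return np.mean((margins < 0).all(axis=1))

print(f"independent errors: {sweep_prob(0.0):.1%}")  # ~1%
print(f"80% shared error:   {sweep_prob(0.8):.1%}")  # roughly an order of magnitude higher
```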

Despite the problems in 2016, the 2018 midterm elections went a lot better for the polls, and Nate Silver concluded that the polls are all right. There was a sense that the problems we faced in 2016 had been corrected (for more information on what changed between 2016 and 2020, see this article). However, we might have overestimated how much we could conclude based on the performance in 2018.

That being said, I do not see the polls as being completely off in 2020. Sure, there were certain issues, but I find the narrative of a universal failure of polls in 2020 inaccurate and unfair. I think a key reason this narrative took off is that people started evaluating the quality of the polls on election night and did not wait for all votes to be counted. The chronology of how the results were called in the different states might have played a role here. James Poniewozik made a great point about this: “There’s a Black Lodge backwards-talk version of the election where the same results happen but PA counts its votes first and Miami-Dade comes in last, and people say, ‘Closer than I thought, but pretty much on target.'” It is not only about what the numbers in the polls show, but also how we interpret them – and in what order.

This is not to say that opinion polls could not do better, but part of the problem is how we consume polls. Generally, based on the lesson from 2016, most stories about opinion polls came with caveats and reminders that it could be close. A good example is the article ‘I’m Here To Remind You That Trump Can Still Win‘. I did also notice an increased certainty among some pundits, who pointed out that Biden’s lead was bigger than what the polls showed for Clinton in 2016, that there were fewer undecided voters than in 2016, that state polls had improved, that many people had already voted, etc. However, while I saw a lot of people bashing the polls, the prediction models and the coverage of polls in the wake of the election, I found the coverage overall sober and much better than in 2016.

It is also important to keep in mind that, when we are looking at presidential elections and in particular the composition of the Electoral College, a shift of a few percentage points of the vote from the Democrats to the Republicans (or vice versa) can have significant implications for who wins. For that reason, we should be cautious when evaluating the overall result and, when trying to predict the result, perhaps not be 95% certain that a certain candidate will win.

The conclusion reached by Nate Silver is that “while polling accuracy was mediocre in 2020, it also wasn’t any sort of historical outlier.” (see also this post by Nate Silver with additional data). In other words, it was not a disaster, but there was also nothing to celebrate.

What went wrong? Most likely, several polls overestimated the Democrats. We do not know for sure yet, but Matt Singh outlines four categories of explanations for what might have gone wrong: 1) sample bias, 2) differential turnout, 3) misreporting and 4) late swing (see also this post by Pew Research Center on some of the potential issues and solutions).

The four explanations are all plausible, but I find the third one the least likely, i.e. that people simply lied when asked about their vote choice (the “shy Trump voters” hypothesis). There is no evidence that people lie about voting for Trump, and I doubt we will see any convincing evidence for this in relation to the 2020 election.

Out of the four categories, I find it most likely that the polls had difficulties reaching certain voters. The polls seem to have underestimated a shift towards non-college and Hispanic voters in specific states. In addition, it might be difficult to measure who wants to answer polls now, especially if Trump supporters are more likely to distrust polls (and the media) in general (David Shor made a similar point here and here and here). These issues can be very difficult to address with traditional weighting methods. However, again, when we look at the polling error in specific battleground states in a historical context, the results do not point towards a historical disaster.
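Here is a stylized illustration of my own (invented numbers) of why traditional weighting struggles with this: if low-trust voters both respond less often and vote differently from demographically identical high-trust voters, weighting on demographics cannot remove the bias.

```python
# Invented numbers: 40% of the electorate distrusts polls, votes 60% Republican
# and responds at 2%; the remaining 60% votes 45% Republican and responds at 5%.
# The two groups are demographically identical by construction, so weighting on
# demographics cannot close the gap.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

low_trust = rng.random(n) < 0.40
votes_rep = np.where(low_trust, rng.random(n) < 0.60, rng.random(n) < 0.45)
responds = rng.random(n) < np.where(low_trust, 0.02, 0.05)

print(f"true Republican share:   {votes_rep.mean():.1%}")            # ~51%
print(f"share among respondents: {votes_rep[responds].mean():.1%}")  # ~48%
```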

I am less convinced of the usefulness of election-forecasting models that aggregate all available information. The issue is that we reduce all the complexities of all the different polls to a single prediction. Maybe the coverage would be much better if it simply focused on the state-level polls in the battleground states, and in particular the variation in these polls. The Economist’s model did a good job of making all its material publicly available (something FiveThirtyEight did not do), and the researchers were explicit about the limitations (see, for example, here and here). That being said, I believe that the 95% probability of a Biden win provided by The Economist team was a scientific failure (something that can most likely be explained by common sense, our experience as consumers of forecasts, statistical analysis, statistical design and the sociology of science). There were some differences between the FiveThirtyEight model and The Economist’s model (see this thread), and I believe the communication of the numbers and uncertainties was done much better by FiveThirtyEight (see also this thread for a lot of the reflections on how to report the numbers).

We really do not know yet what went wrong with a lot of the polls, but we know that it was not a total and unmitigated disaster. The American Association for Public Opinion Research released its evaluation of the polling errors in the 2016 election some time after that election, and it will be interesting to see what the detailed evaluation of the 2020 election will show. However, I do not expect any smoking guns. Instead, I expect a combination of some of the different categories mentioned above.

Last, the most recent research suggests that non-probability sampling performed better than probability polls in the 2020 election. This provides some optimism for the future. While probability polls will become more difficult to conduct, advances in survey sampling and in the conduct of non-probability polls should provide more valid estimates of who will win.

While I like criticising polls as much as the next guy, I am not convinced we should conclude that the polls experienced a total and unmitigated disaster. What I hope we will see in the next election is less focus on poll-based prediction models and more focus on high-quality state-level polling in key states.