How not to measure conspiracy beliefs #2

This is a brief update to a previous post on how to measure conspiracy beliefs. My point in the previous post was that a study published in Psychological Medicine used weird measures to capture conspiracy beliefs.

In a letter to the editor, Sally McManus, Joanna D’Ardenne and Simon Wessely note that the response options provided in the paper are problematic: “When framing response options in attitudinal research, a balance of agree and disagree response options is standard practice, e.g. strongly and slightly disagree options, one in the middle, and two in agreement. Some respondents avoid the ‘extreme’ responses either end of a scale. But here, there was just one option for ‘do not agree’, and four for agreement (agree a little, agree moderately, agree a lot, agree completely).”

The authors of the study replied to the letter and, in brief, doubled down on their conclusions: “Just because the results are surprising to some – but certainly not to many others – does not make them inaccurate. We need further work on the topic and there is clearly enough from the survey estimates to warrant that.”

Interestingly, we now have further work on the topic. In a new study, Agreeing to disagree: Reports of the popularity of Covid-19 conspiracy theories are greatly exaggerated, Robbie M. Sutton and Karen M. Douglas (both from the University of Kent) show that the measures in the study mentioned above (and in my previous post) are indeed problematic.

The figure below shows the key result, i.e. the sum agreement with specific conspiracy beliefs using the different scales.

The shaded areas are the ‘agree’ options in the scale (with more agree options provided in the original study). What we can see is that the sum agreement is substantially greater when using the problematic scale. For a conspiracy belief such as ‘Coronavirus is a bioweapon developed by China to destroy the West’, the ‘Strongly disagree-Strongly agree’ scale results in a sum agreement of 8.8%, whereas the scale used by the authors of the original study resulted in a sum agreement of 31.9%.

In sum, this is an interesting case of how (not) to measure conspiracy beliefs and how researchers from the University of Oxford themselves can contribute to the spread of such conspiracy beliefs. Or, as Robbie M. Sutton and Karen M. Douglas conclude: “As happens often (Lee, Sutton, & Hartley, 2016), the striking descriptive statistics of Freeman et al.’s (2020a) study were highlighted in a press release that stripped them of nuance and caveats, and led to some sensational and misleading media reporting that may have complicated the very problems that we all, as researchers, are trying to help solve.”

Observations related to COVID-19

What follows are a few personal observations, written primarily for myself, so that at some point in the future I can remind myself of what, among other things, occupied my attention during (what feels like) a historic time.

Politics. I can see the irony in politicians talking about 2020 plans years ago. Reforms were presented and, looking back on them with the current situation in mind, it is hard to take the political (long-term) planning seriously. That is not to say that long-term planning makes no sense, but rather that, if anything, we should be aware that we operate with considerable uncertainty and potential exogenous factors that need to be taken into account.

Data. For someone interested in data, it has been a pleasure to follow how much new data has come out in a short period of time. A lot of these data have, however, been of extremely poor quality. Most interesting has been the data on how governments have responded politically to the pandemic. I spoke with a journalist the other day about the challenges of measuring comparative differences in how countries have responded to the crisis, and my point was that this is far more difficult than we tend to assume. If you want a good overview of the various datasets related to this, I can strongly recommend A tracker of trackers: COVID-19 policy responses and data.

Age. The younger you are, the larger the period (and the more important part) of your life that will be affected by the pandemic. If you are 14, a year is ~7% of your life. If you are 31, it is ~3% of your life. Even more importantly, there are experiences for young people that are hard to recreate or have later in life. When you are 31, relatively little has to happen right now that cannot simply be postponed a year. In the same way, I feel for the older people who have much more reason to fear the consequences of being infected. I therefore also have a feeling of being the perfect age to experience this. I am young enough not to fear for my life, and old enough not to be too worried if a year passes without major events. The days fly by, life goes on, but many of my days during this period look remarkably like my life before corona.

Personality. It is my impression that people have reacted very differently to the prospect of living a less (physically) social life, working from home and so on. I have a relatively clear idea of which Big Five combinations are well equipped for a pandemic. All personality traits can be relevant to look at, but high Conscientiousness combined with low Extraversion in particular is a fantastic combination in these months. I have read a number of studies related to the pandemic (specifically, I have found a double-digit number of studies) and there is, as far as I can see, nothing to suggest that personality traits explain large differences in life satisfaction during the pandemic (relative to before the pandemic).

Health. There has also been more focus on physical health. I have without a doubt exercised less over the past few months and am probably in worse shape now than six months ago. On the other hand, I have adjusted my diet accordingly. I already drank next to no alcohol, eat next to no sugar and have switched to decaffeinated coffee. A pandemic is in many ways about arranging your life in such a way that not doing certain things counts as a success. For the most part, I have succeeded in this.

Culture. Culture is, if anything, what binds us together. Especially in these times. I am watching far more films and series than usual – and playing considerably more Pandemic. I watched Contagion for the first time, and there is no doubt that it was a better experience than if I had watched it, say, a year ago. Likewise, I am convinced that more people have had a positive experience with Tiger King than would have been the case had it been released in “normal” times. Caspar Eric's Dagbog fra dage med COVID-19 is another example of culture putting words to something that is collectively experienced individually. As for TV series, I have, among other things, watched a number of (mini)series based on books (to name a few: Sharp Objects, I Know This Much Is True, Normal People, Little Fires Everywhere and I’ll Be Gone in the Dark).

Books. It is a good time to dive into different types of literature. Among the classics, I have read both non-fiction (for example Bowling Alone: The Collapse and Revival of American Community and Essence of Decision: Explaining the Cuban Missile Crisis) and fiction (I am slowly making my way through War and Peace, which has been on my reading list for years). As for newer literature, it has mostly been non-fiction – for example books such as Slowdown: The End of the Great Acceleration—and Why It’s Good for the Planet, the Economy, and Our Lives and Dark Data: Why What You Don’t Know Matters.

Figures. Since COVID-19 really took hold, there have been countless figures meant to help us understand the pandemic (especially “Flatten the Curve” figures and figures with comparative trends showing how bad things are in different countries). One of the questions I am interested in is how citizens actually react to and understand these figures. We know from the research that people have a hard time understanding slopes in figures (see, for instance, this study from 1984). A study from Canada has shown that whether you use a logarithmic or a linear scale to present the data does not change support for different COVID-19 measures. Other research (for example here, here and here) has shown that people's perceptions can be affected by how COVID-19 data is presented, with linear figures being easier to understand in terms of the absolute numbers. Some of this research, however, misses the whole point of using a logarithmic scale (that we are often more interested in communicating the rate of change than the absolute numbers).
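As a small illustration of that last point, here is a minimal ggplot2 sketch with made-up numbers: the same exponential series tells two quite different visual stories depending on the scale.

library(tidyverse)

# Made-up numbers: the same exponential growth on a linear and a log y-axis
df <- tibble(day = 0:30, cases = 10 * 1.2^day)

# Linear scale: emphasises the absolute numbers
ggplot(df, aes(day, cases)) +
  geom_line()

# Log scale: a constant growth rate becomes a straight line
ggplot(df, aes(day, cases)) +
  geom_line() +
  scale_y_log10()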

Geography. It has been interesting to follow how differently England and Denmark have approached COVID-19. In brief, I have followed the Danish recommendations and have been working from home for several months now. London is a (relatively) dead city and has therefore been a fantastic city for cycling. One can only hope that this will encourage fewer cars (this article in Nature Sustainability gives a good insight into how efforts to ban cars are going). Paradoxically, I have seen more of London recently than I would have if everything had been normal. One can only hope that this becomes an occasion for London to grow more Copenhagen-like going forward. Likewise, there is something fascinating about seeing big cities deserted (Copenhagen, for example).

Meetings. Most meetings could easily be replaced by an email or two, or so the saying goes. Over the past few months I have had meetings on Teams, Zoom, BlueJeans and Skype, and I think Zoom has the best product. I am not necessarily convinced, however, that these online meetings are better than physical meetings (or worse than emails). Physical meetings are, all else being equal, better, provided they serve a purpose and run efficiently. There has been ample occasion to reflect on and discuss the shift from physical meetings to Zoom meetings, and in that connection I (re)read this article, which offers good advice on why, how and when meetings should be held.

Time. Time is subjective.

History. In connection with the American presidential election in 2016, I had the feeling of having experienced something historic. Or, as I described it at the time, it was the feeling of falling asleep in one reality and waking up in another. Even though it probably was a historic event, it had a limited impact on my everyday life and my daily interactions. In that sense, the present period will probably stand out much more sharply in my memory.

Inequality. I have already written about this in a previous post, but it is and remains the most important issue in connection with the pandemic. I am still worried about the citizens who are not as privileged as to be able to work from home (or to have a job at all). Even though there are upsides associated with COVID-19, I unfortunately believe that the downsides in terms of inequality, across a wide range of parameters, will outweigh any potential upsides. Looking at academia, there is also something tragic about PhD students struggling while multi-million grants are handed out to established researchers who produce nothing but mediocre research (imagine how many exciting postdoc projects could be funded with that money!).

The four old parties in the opinion polls

In connection with COVID-19, the Social Democrats saw a large increase in the opinion polls. But how large is this increase in a historical perspective?

I have taken a closer look at 773 opinion polls from Gallup covering the four old parties from 1957 to 2020. The four old parties are, as is well known, Socialdemokraterne, Venstre, Det Konservative Folkeparti and Radikale Venstre. The first poll is from June 28, 1957, and the most recent poll is from June 11, 2020.

To do this, I have used an R script made by Simon Straubinger for visualizing the readability of prime ministers' speeches over time.
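The idea is straightforward. A minimal sketch of the kind of plot I have in mind could look like the code below (assuming a long-format data frame called polls with the hypothetical columns date, party and support; these names are not from the actual script).

library(tidyverse)

# 'polls' is assumed to hold one row per poll per party:
# date (Date), party (character) and support (per cent)
polls %>%
  ggplot(aes(x = date, y = support, colour = party)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  facet_wrap(~ party) +
  labs(x = NULL, y = "Support (%)")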

Here are all the polls:

In my view, there are at least four interesting observations.

First, it is easy to see that the Social Democrats have experienced a large increase in the polls over a short period of time in connection with COVID-19. The party is historically stable, although it is still well short of the 40% of the vote it was once consistently polling at. Kenneth Thue Nielsen has also produced a good analysis for Altinget in which he compares the Social Democrats' support with how other parties have fared in previous crises.

Second, the figure confirms that the polls are generally stable in the short run. We should not normally expect large changes in the polls over a short period of time, so you can, without much trouble, refrain from following what the polls show from week to week. Things do go up and down in politics, however, and it is especially interesting to see how the Social Democrats, over the past ten years, have gone from more than 30% of the vote to below 20% – and back to above 30%.

Third, we historically see more variation for Venstre, where support moves up and down considerably more over time, depending on the period. Venstre is admittedly polling low compared with the Social Democrats and their support in recent years, but there is nothing to suggest a historic crisis for the party.

Fourth, the Conservatives and Radikale Venstre have enjoyed strong support at very different points in time (related in particular to when they held the prime minister's office), but over the past five years they have tracked each other closely. It is also always a pleasure to look at the Conservatives' polls and remember that when Bendt Bendtsen was nicknamed 'Mr. 10 procent', it was meant as a criticism.

We are in the habit of only looking at the most recent polls when we want to see how the parties are doing, but it can be healthy to put on the historical glasses now and then. Accordingly, I plan to write a follow-up post in 63 years.

Data visualization: a reading list

Here is a collection of books and peer-reviewed articles on data visualization. There is a lot of good material on the philosophy, principles and practices of data visualization.

I plan to update the list with additional material in the future (consider the current version a draft). Do reach out if you have any recommendations.

Introduction

Graphs in Statistical Analysis (Anscombe 1973)
An Economist’s Guide to Visualizing Data (Schwabish 2014)
Data Visualization in Sociology (Healy and Moody 2014)
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (Weissgerber et al. 2015)
Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods (Cleveland and McGill 1984)
Graphic Display of Data (Wilkinson 2012)
Visualizing Data in Political Science (Traunmüller 2020)
Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks (Schwabish 2021)

History

Historical Development of the Graphical Representation of Statistical Data (Funkhouser 1937)
Quantitative Graphics in Statistics: A Brief History (Beniger and Robyn 1978)

Tips and recommendations

Ten Simple Rules for Better Figures (Rougier et al. 2014)
Designing Graphs for Decision-Makers (Zacks and Franconeri 2020)
Designing Effective Graphs (Frees and Miller 1998)
Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics (Donahue 2011)
Designing Better Graphs by Including Distributional Information and Integrating Words, Numbers, and Images (Lane and Sándor 2009)

Analysis and decision making

Statistical inference for exploratory data analysis and model diagnostics (Buja et al. 2009)
Statistics and Decisions: The Importance of Communication and the Power of Graphical Presentation (Mahon 1977)
The Eight Steps of Data Analysis: A Graphical Framework to Promote Sound Statistical Analysis (Fife 2020)

Uncertainty

Researchers Misunderstand Confidence Intervals and Standard Error Bars (Belia et al. 2005)
Error bars in experimental biology (Cumming et al. 2007)
Confidence Intervals and the Within-the-Bar Bias (Pentoney and Berger 2016)
Depicting Error (Wainer 1996)
When (ish) is My Bus?: User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems (Kay et al. 2016)
Decisions With Uncertainty: The Glass Half Full (Joslyn and LeClerc 2013)
Uncertainty Visualization (Padilla et al. 2020)
A Probabilistic Grammar of Graphics (Pu and Kay 2020)

Tables

Let’s Practice What We Preach: Turning Tables into Graphs (Gelman et al. 2002)
Why Tables Are Really Much Better Than Graphs (Gelman 2011)
Graphs or Tables (Ehrenberg 1978)
Using Graphs Instead of Tables in Political Science (Kastellec and Leoni 2007)
Ten Guidelines for Better Tables (Schwabish 2020)

Deciding on a chart

Graph and chart aesthetics for experts and laymen in design: The role of familiarity and perceived ease of use (Quispel et al. 2016)

Chart types

Boxplots

40 years of boxplots (Wickham and Stryjewski 2011)

Pie charts

No Humble Pie: The Origins and Usage of a Statistical Chart (Spence 2005)

Infographics

Infovis and Statistical Graphics: Different Goals, Different Looks (Gelman and Unwin 2013)
InfoVis Is So Much More: A Comment on Gelman and Unwin and an Invitation to Consider the Opportunities (Kosara 2013)
InfoVis and Statistical Graphics: Comment (Murrell 2013)
Graphical Criticism: Some Historical Notes (Wickham 2013)
Tradeoffs in Information Graphics (Gelman and Unwin 2013)

Maps

Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation (Lucchesi and Wikle 2017)

Scatterplot

The Many Faces of a Scatterplot (Cleveland and McGill 1984)
The early origins and development of the scatterplot (Friendly and Denis 2005)

Dot plots

Dot Plots: A Useful Alternative to Bar Charts (Robbins 2006)

3D charts

The Pseudo Third Dimension (Haemer 1951)

Teaching pedagogy

Correlational Analysis and Interpretation: Graphs Prevent Gaffes (Peden 2001)
Numbers, Pictures, and Politics: Teaching Research Methods Through Data Visualizations (Rom 2015)
Data Analysis and Data Visualization as Active Learning in Political Science (Henshaw and Meinke 2018)

Software

Excel

Effective Data Visualization: The Right Chart for the Right Data (Evergreen 2016)

R

Data Visualization (Healy 2018)
Data Visualization with R (Kabacoff 2018)
ggplot2: Elegant Graphics for Data Analysis (Wickham 2009)
Fundamentals of Data Visualization (Wilke 2019)
R Graphics Cookbook (Chang 2020)

Stata

A Visual Guide to Stata Graphics (Mitchell 2012)


Changelog
– 2021-03-01: Add ‘Better Data Visualizations’
– 2020-08-03: Add ‘Ten Guidelines for Better Tables’
– 2020-07-14: Add ‘Designing Graphs for Decision-Makers’ and ‘A Probabilistic Grammar of Graphics’ (ht: Simon Straubinger)

A response to Andrew Gelman

In a new blog post, Andrew Gelman writes that the findings in an article of ours are best explained by forking paths. I encourage you to read the blog post and, if you still care about the topic, continue and read this post as well.

This is going to be a (relatively) long post. In brief, I will show that the criticism is misleading: it is easy to find the effect we report in our paper (without “statistical rule-following that’s out of control”), and Andrew Gelman is either very unlucky or, what I find more likely, very selective in what he reports. I have no confidence that Andrew Gelman engaged with our material with an open mind; on the contrary, I believe he invested a non-trivial amount of time building up a (false) narrative about the validity and reliability of our findings.

That being said, Andrew Gelman was polite in reaching out and gave me the opportunity to comment on his criticism. Beyond a few clarifications, I decided not to offer most of the comments below in a private conversation. Again, based upon his language and his analysis of our data, I am convinced that he has no interest in engaging in a constructive discussion about the validity of our findings. For that reason, I find it better to keep everything public and transparent.

Our contribution

In our paper, we show that winning political office has significant implications for the longevity of candidates for US gubernatorial office. Here is the abstract:

Does political office cause worse or better longevity prospects? Two perspectives in the literature offer contradicting answers. First, increased income, social status, and political connections obtained through holding office can increase longevity. Second, increased stress and working hours associated with holding office can have detrimental effects on longevity. To provide causal evidence, we exploit a regression discontinuity design with unique data on the longevity of candidates for US gubernatorial office. The results show that politicians winning a close election live 5–10 years longer than candidates who lose.

And here is the table with the main result:

The paper is published in Political Science Research and Methods. You can find the replication material for the article here.

Is our finding replicated?

Before we get into the details of the analysis and whatnot, I am happy to confirm that our findings are similar to those in a recent study published by the economists Mark Borgschulte and Jacob Vogler in Journal of Economic Behavior & Organization. For the sample most similar to ours, they find a local average treatment effect of 6.26 years:

It is great to see that other people are working on the topic and reaching similar conclusions. Overall, I believe that the effect we report in our paper is not only reproducible but also replicated by another set of researchers. I did inform Andrew Gelman about this study, but I can understand his reasons for not linking to it in his blog post.

How difficult is it to reproduce our findings?

In brief, Andrew Gelman is not convinced by our study. That’s okay. I am rarely convinced when I see a new study (it’s easy to think about potential limitations and statistical issues). However, what we see here are several characterisations of the results in particular and the statistical approach in general, such as “silly”, a “scientific failure”, “statistical rule-following that’s out of control”, “fragile” and “fatally flawed”.

No scientific study is perfect. Data is messy and to err is human. Accordingly, it is important and healthy for science that we closely and thoroughly inspect the quality of each other's work (especially when we consider how many errors can easily slip through peer review). Not only do I appreciate that smart colleagues devote their scarce time to looking at my work, I also encourage people to check out my work and reach out if something is not working out (or write a blog post, publish a paper or make a tweet). For that reason, I make all my material publicly available (most often on Harvard Dataverse and GitHub). That being said, in this case, I am simply not convinced that our study is fatally flawed.

Andrew Gelman argues that we report an estimate that is very difficult to reproduce (and, at the end of the day, unrepresentative of a true effect): “I tried a few other things but I couldn’t figure out how to get that huge and statistically significant coefficient in the 5-10 year range. Until . . . I ran their regression discontinuity model”.

Seriously? Is it really that difficult to obtain a significant effect in line with what we report in the paper? Based upon Andrew Gelman’s post, I can understand if that is the impression people might have. Maybe Andrew Gelman should consider reading Gelman and Hill (2007) and following the first suggestion on how to run an RDD analysis. From page 214:

Without any complicated procedures, what happens if we get the data and follow the procedure as described in the introductory textbook? Well, let us look at the data and run the suggested regression.

# Load tidyverse (to make life easy)
library("tidyverse")

# Load the data
df_rdd <- data.table::fread("longevity.csv")

# Make the data ready for the analysis
df_m <- df_rdd %>% 
  filter(year >= 1945, living_day_imp_post > 0) %>% 
  mutate(won = ifelse(margin_pct_1 >= 0, 1, 0),
         margin = margin_pct_1)

# Run regression
df_m %>% 
  filter(abs(margin) < 5) %>% 
  lm(living_day_imp_post ~ won + margin, data = .) %>% 
  broom::tidy()
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    8429.      676.     12.5  4.63e-28
2 won            3930.     1193.      3.29 1.13e- 3
3 margin         -656.      219.     -3.00 2.98e- 3

The effect of winning office is ~10 years according to this model. I am not saying that this is the best model to estimate (it is not a model we report in the paper). However, that's it. Nothing more. We don't need rocket science (or a series of covariates and weird analytical choices) to reproduce the main result. You might hate the result, not believe it, believe it is nothing but noise, etc., but at least have the decency to acknowledge that it is there.
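For reference, the coefficient on won is measured in days; a quick conversion shows where the ~10 years come from.

# Convert the 'won' coefficient from days to years
round(3930 / 365.24, 2)
[1] 10.76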

How can it be that difficult to get this estimate? Gelman and Hill (2007) is more than 10 years old (I can still very much recommend it though), but I would suggest that you at least try to follow your own guidelines before you try to convince your readers that you could not reproduce a set of results. That is, if you don’t want to make a fool of yourself.

There is a certain irony to all of this, especially when the reason we didn't pursue certain analytical choices was that it “would take additional work” (or at least that is what Andrew Gelman speculates). I know that I can't expect too much of Andrew's time; he is on a tight schedule with a new blog post every day (with an incentive – or at least a motivation – to point out flaws in research, especially in studies using regression discontinuity designs), but… why not show how easy it is to get the main effect, however much you dislike it or the design that gave birth to it, without “statistical rule-following that’s out of control”? Why pretend that you can only find the effect when you follow our regression discontinuity model?

(Also, do note the number of cases in the example in the textbook, i.e. n = 68. Not a single word about power there. There is something funny about how the textbook example is good enough when you believe the result, but the sample size turns this into a “fatally flawed” study when you don't believe the result. I know, the book is old and things have changed over the last few years (“New Statistics“, all for the better). However, I have seen hundreds of RDD studies with less power than ours that are nowhere near as reliable. And don't get me started on IV regression studies with weak instruments.)

The conclusion that our results are so difficult to reproduce is misleading. Or, in other words, it is convenient for his blog post that he didn't bother to run the first RDD specification suggested by Gelman and Hill (2007).

So what did he do? Interestingly, in order to show how difficult it is to reproduce our results, you need to take a few analytical steps. I don’t want to speculate too much but, for the lack of better words, we can say that Andrew Gelman fell prey to the garden of forking paths.

Specifically, Andrew Gelman is not trying to limit the number of decisions he has to make in order to find a non-significant effect (or rather, I think he was, but that didn't work out). On the contrary, he is very much interested in getting “the most natural predictors” (I know, I laughed too!) right from the get-go. What is the difference between falling prey to the garden of forking paths and having common sense insights into ‘the most natural predictors’? You tell me.

Let us look at what exactly Andrew Gelman is saying that he did in order to get to his statistically non-significant effect:

I created the decades_since_1950 variable as I had the idea that longevity might be increasing, and I put it in decades rather than years to get a more interpretable coefficient. I restricted the data to 1945-2012 and to candidates who were no longer alive at this time because that’s what was done in the paper, and I considered election margins of less than 10 percentage points because that’s what they showed in their graph, and also this did seem like a reasonable boundary for close elections that could’ve gone either way (so that we could consider it as a randomly assigned treatment).

Here is the problem. When I do all of this, I get an effect! Hmm. Hmm. Hmm. We better dig into the code reported in the blog post. Here is the code that he used to get at the first reported regression:

df_rdd <- data.table::fread("longevity.csv")
death_date <- sapply(df_rdd[,"death_date_imp"], as.character) 
living <- df_rdd[,"living"] == "yes"
death_date[living] <- "2020-01-01"
election_year <- as.vector(unlist(df_rdd[,"year"]))
election_date <- paste(election_year, "-11-05", sep="")
more_days <- as.vector(as.Date(death_date) - as.Date(election_date))
more_years <- more_days/365.24
age <- as.vector(unlist(df_rdd[,"living_day_imp_pre"]))/365.24
n <- nrow(df_rdd)
name <- paste(unlist(df_rdd[,"cand_last"]), unlist(df_rdd[,"cand_first"]), unlist(df_rdd[,"cand_middle"]))
first_race <- c(TRUE, name[2:n] != name[1:(n-1)])
margin <- as.vector(unlist(df_rdd[,"margin_pct_1"]))
won <- ifelse(margin > 0, 1, 0)
lifetime <- age + more_years
decades_since_1950 <- (election_year - 1950)/10
data <- data.frame(margin, won, election_year, age, more_years, living, lifetime, decades_since_1950)
subset <- first_race & election_year >= 1945 & election_year <= 2012 & abs(margin) < 10 & !living
library("arm")
fit_1a <- lm(more_years ~ won + age + decades_since_1950 + margin, data=data, subset=subset) 
display(fit_1a)
lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset)
                   coef.est coef.se
(Intercept)        78.60     4.05  
won                 2.39     2.44  
age                -0.98     0.08  
decades_since_1950 -0.21     0.51  
margin             -0.11     0.22  
---
n = 311, k = 5
residual sd = 10.73, R-Squared = 0.35

(No, the code is not a historical document on how people wrote R code in the 90s – nor a paid ad for tidyverse.)

Here the effect of winning office is only 2.39 years (all his models show estimates between 1 and 3 years). The first thing I notice here is that the sample size is substantially different from mine, so it must be something with the subsetting. Ah, I get it! He also restricted the sample with the first_race variable. Let us try to subset according to the actual procedure outlined in the paragraph above and estimate the model again.

subset_reported <- election_year >= 1945 & election_year <= 2012 & abs(margin) < 10 & !living
fit_1a_reported <- lm(more_years ~ won + age + decades_since_1950 + margin, data=data, subset=subset_reported) 
display(fit_1a_reported)
lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset_reported)
                   coef.est coef.se
(Intercept)        74.65     3.18  
won                 3.15     1.89  
age                -0.91     0.06  
decades_since_1950 -0.03     0.41  
margin             -0.19     0.17  
---
n = 499, k = 5
residual sd = 10.93, R-Squared = 0.33

That makes more sense. Somebody might even want to call it statistically significant (I will, for the sake of argument, not do so here). My theory is that Andrew Gelman initially did as he wrote in the blog post but decided that actually finding an effect would not be good for his criticism and, accordingly, took another path in the garden. In other words, the effect does not make a good first model in a blog post about a paper with forking paths and “statistical rule-following that’s out of control”. However, I can say that what Andrew Gelman is doing here is the simple act of ‘p-hacking in reverse’ (type-2 professor-level p-hacking instead of the well-known newb type-1 p-hacking).

Here is another funny thing: Later in the blog post, Andrew Gelman follows the actual procedure as described in the blog post. Now, however, it is described as an explicit choice to include “duplicate cases” — and just to laugh at the “silly” results: “Just for laffs, I re-ran the analysis including the duplicate cases”.

That’s a fucked-up sense of humour. Different strokes for different folks, I guess. Andrew Gelman played around with the data till he got the insignificant finding he wanted and then he decides to attribute effects consistent with those in the paper to ‘just for laughs’ or by ‘including data’ (that was not excluded in the first place). What is the difference between selecting “the most natural predictors” and including variables just for laughs? Garden of forking paths, I guess.

In any case, congratulations! You could select a set of covariates that returns a sample size of 311 and the standard errors you wanted when you decided to go into the garden.

Using the non-significant effect, Gelman then continues to introduce a set of follow-up regressions to show that it is not possible for him to get anywhere near a significant effect, implying that a significant effect can only be obtained by forking paths: “What about excluding decades_since_1950? […] Nahhh, it doesn’t do much. We could exclude age also: […] Now the estimate’s even smaller and noisier! We should’ve kept age in the model in any case. We could up the power by including more elections: […] Now we have almost 500 cases, but we’re still not seeing that large and statistically significant effect.”

How (un)lucky can you be? Or, how little scientific integrity can you show when you engage with the material?

Here’s a thought experiment: If we had used the “most natural predictors” in the paper and found an effect (which we still did afterall), would Andrew Gelman then have agreed with us that those were the most natural predictors? What about if the simple model presented above (with no covariates) would have returned no effect, would Andrew Gelman still have found “the most natural predictors” to be relevant as the most natural predictors? Of course not.

As you will see in his blog post, he explains that he limits the sample significantly, but only to keep things simple: “To keep things simple, I just kept the first election in the dataset for each candidate”. I suggest he replace “keep things simple” with “make sure I have a small sample size and a non-significant effect and a chance to keep this blog post interesting”. Sure, there can be valid reasons to exclude these observations (or at least to reflect upon how best to model the data at hand), but if you are trying to tell us that our study is useless, please provide better reasons for discarding a significant number of cases than wanting to “keep things simple”.

Again, I am not convinced by the argument that he was unable to reproduce an effect similar to that reported in the paper.

What we did in the paper was not to use a simple OLS regression to estimate the effect, but rather the rdrobust package for robust nonparametric regression discontinuity estimates. Here is the main result reported in the paper:

rdrobust::rdrobust(y = df_m$living_day_imp_post, 
                   x = df_m$margin) %>% 
  summary()
Call: rdrobust

Number of Obs.                 1092
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 516         576
Eff. Number of Obs.            236         243
Order est. (p)                   1           1
Order bias  (q)                  2           2
BW est. (h)                  9.541       9.541
BW bias (b)                 19.017      19.017
rho (h/b)                    0.502       0.502
Unique Obs.                    516         555

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional  2749.283   873.601     3.147     0.002  [1037.057 , 4461.509]  
        Robust         -         -     3.188     0.001  [1197.646 , 5020.823]  
=============================================================================

This is the effect. Again, nothing more. Andrew Gelman also implies that including age as a covariate in this approach is needed, so here is the model with age as a predictor (conveniently not reported in his blog post):

rdrobust::rdrobust(y = df_m$living_day_imp_post, 
                   x = df_m$margin, 
                   covs = df_m$living_day_imp_pre) %>% 
  summary()
Call: rdrobust

Number of Obs.                 1092
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 516         576
Eff. Number of Obs.            255         265
Order est. (p)                   1           1
Order bias  (q)                  2           2
BW est. (h)                 10.412      10.412
BW bias (b)                 20.323      20.323
rho (h/b)                    0.512       0.512
Unique Obs.                    516         555

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional  1948.131   701.527     2.777     0.005   [573.164 , 3323.098]  
        Robust         -         -     2.838     0.005   [689.788 , 3770.580]  
=============================================================================

This is indeed a more precise estimate.
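In years, the two conventional point estimates above correspond to roughly 7.5 and 5.3 years, respectively.

# Convert the conventional estimates from days to years
round(c(2749.283, 1948.131) / 365.24, 1)
[1] 7.5 5.3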

We do present a lot of results in the paper and the appendix. You have a lot of potential choices when you do an RDD analysis – bandwidth selection, covariates, local-polynomial order, etc. – and that definitely opens up a lot of possibilities in terms of forking paths.

We report the most parsimonious results in the paper in order not to give ourselves the freedom to pursue several (forking) paths (including making our own case for what the most natural predictors are). We are transparent about the models we estimated in our paper, including when the results are not statistically significant. Specifically, not all estimates reported in our paper are significant, and hopefully our approach shows that it is indeed possible to get non-significant effects (though still interesting effect sizes). For the choice of bandwidth, for example, Andrew Gelman writes that the “analysis is very sensitive to the bandwidth”. We do show that it is possible to obtain insignificant effects when you estimate the models with a specific bandwidth. Here is the first figure in the appendix:

We find the largest effect estimates with a bandwidth of ~5 using no covariates (Model 1). For the covariates, I am sure, as Andrew Gelman most likely can confirm, it is possible to reduce this effect further with some forking paths.
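If you want to probe the bandwidth sensitivity yourself, here is a minimal sketch (assuming the df_m data frame from the code above) that re-estimates the effect across a grid of manual bandwidths via rdrobust's h argument.

library(rdrobust)

# Re-estimate the RD effect for a grid of manually fixed bandwidths
bandwidths <- 2:20
estimates <- sapply(bandwidths, function(bw) {
  fit <- rdrobust(y = df_m$living_day_imp_post, x = df_m$margin, h = bw)
  fit$coef["Conventional", ] / 365.24  # convert days to years
})

# Plot the estimated effect (in years) against the bandwidth
plot(bandwidths, estimates, type = "b",
     xlab = "Bandwidth (percentage points)",
     ylab = "Estimated effect (years)")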

There is, as Andrew Gelman also points out, no smoking gun. What he provides instead is a lot of scribbles about the sociology of science, common sense and fatally flawed statistics. The narrative mode is ‘stream of consciousness’ with a lot of negative words. The purpose is to show that there are so many issues with our study that it is, again, fatally flawed.

From what I can see, it is mostly a series of misunderstandings or, at best, comments that are completely irrelevant. I guess his aim is to show that there are so many issues that the sum of these proves his bigger point that the study is silly.

The first illustrative example is this point: “Then I looked above, and I saw the effective number of observations was only 236 or 243, not even as many as the 311 I had earlier!”

I suggest that you consult the documentation for rdrobust:

It’s the number of observations on each side of the threshold. The total number of observations is 479. And just to see if we can agree on one thing here (however, having seen how our analysis is being treated in the blog post, I remain skeptical):

479 > 311
[1] TRUE

Also, even if it were “only” around 240, it is still a sample size almost four times greater than the textbook example provided in Gelman and Hill (2007). I am not using this as an argument, but as a recommendation to update the material if you want to conclude that what everybody else is doing is fatally flawed.

Andrew Gelman also hints at issues with the data. Specifically, it is supposedly a problem that we have additional data that we do not use, such as politicians who died before the next election. I am unable to see how it should be a limitation that we provide all the data we have, including cases that we do not use in our analysis. He also says that it is not possible to get the dataset in the correct format, which is simply incorrect (it is actually easy to download .csv files from the Harvard Dataverse).

Enough about this ‘circumstantial evidence’. The overall conclusion of the blog post is the following paragraph: “If you do a study of an effect that is small and highly variable (which this one is: to the extent that winning or losing can have large effects on your lifespan, the effect will surely vary a lot from person to person), you’ve set yourself up for scientific failure: you’re working with noise.”

I believe there is more signal than noise in this data. I definitely do not see our work as a scientific failure. The effects are all positive and not highly variable. Yes, they tend to be closer to 5 years than to 10 years, but that’s not necessarily noise. Again, this is in line with the finding for governors running in elections post-1908 reported by other researchers, where the effect size is ~6 years.

Next, these are not small effects. Even if you manage to press them all the way down to a few years, they are still substantial.

I do agree a lot with the point on heterogeneity, i.e. that the effects will vary a lot from person to person. We explore that to some extent in the paper, but we try not to make too strong inferences from any of these subsample analyses. However, I will be happy to see future research deal with this exact point.

An effect size of up to 10 years is a large effect and, yes, it would also have been interesting if the effect had been 2 years. If the true effect of winning office on longevity is, say, 0.5 years, would we have sufficient statistical power to detect such an effect and conclude that it was statistically different from 0? I don't think so, and I believe that Andrew Gelman has some good reflections on Type M and Type S errors in relation to effect sizes that are also relevant to our study.
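For what it is worth, here is a rough back-of-the-envelope power sketch under purely hypothetical assumptions: a simple two-group comparison with ~240 winners and ~240 losers near the threshold, a true effect of 0.5 years and a residual SD of ~10.7 years (the latter taken from the OLS output above).

set.seed(1)

# Share of simulated samples in which a 0.5-year effect reaches p < .05
power_sim <- replicate(2000, {
  winners <- rnorm(240, mean = 0.5, sd = 10.7)
  losers  <- rnorm(240, mean = 0.0, sd = 10.7)
  t.test(winners, losers)$p.value < 0.05
})

mean(power_sim)

Under these (admittedly crude) assumptions, the share comes out far below conventional power targets.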

Is it against common sense that politicians who win office live significantly different lives and that these differences might add up to a substantial difference in longevity prospects? Maybe. I don't think so. If I did, I would not have worked on a paper testing this dynamic. However, as always, it is important to remain skeptical towards the findings of our study (even when they are in line with the findings in another study).

I will not speculate about whether it is against common sense or not. As always, everything is obvious once you know the answer. What I can say is that I would never let the size of an effect decide whether I would like to publish a result or not (for that reason, I have also published several null findings). Also, when we are dealing with effect sizes, most people are unimpressed by small effect sizes (and a Cohen's d of .9 is considered a small-to-moderate effect by lay people). Accordingly, I am not sure how strong an approach common sense is (in and of itself) when we are to evaluate effect sizes.

That being said, my skepticism always increases as the effect size increases. And I would be surprised (and worried) if nobody decided to take a look at our data. Again, I am thankful that Andrew Gelman spent a significant amount of his time on this. Interestingly, Jon Mellon, co-director of the British Election Study, was also critical of the effect size. He did look at the data. However, he did not reach the same conclusion as Andrew Gelman:

I am not including this to say that Jon Mellon is on my side, as this is not about being on one side or the other. I am saying that I find it interesting when other people look at the data as well without reaching the supposedly obvious conclusion that it is impossible to produce our models without making weird analytical choices. Also, I am open to the possibility that Jon Mellon – like anybody else – will update what he thinks about the findings in our paper in light of these blog posts.

The role of visual inference in RDDs

The first thing Andrew Gelman presents in his blog post is a visual summary of our data (a scatter plot). He uses this to say that there is nothing going on around the threshold (the discontinuity) and that all we can see is evidence of noise (and even the garden of forking paths!). I am not impressed by this reasoning, but I still find it relevant to comment on. Specifically, we know that a visual presentation of a discontinuity can be strong evidence for an effect, but it is not sufficient to say that there is no effect simply because we cannot eyeball a difference.

To understand this, consider this figure made by the economist C. Kirabo Jackson on how RD plots are not useful to assess the existence of an effect.

Specifically, the figure shows a statistically significant effect, but it is clear that we cannot see this effect in the plot. I do still believe it is relevant to look at the raw data to get a sense of what we are looking at, but I am not sure why our evidence should be less compelling just because the effect is not obvious in the raw data.
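To see the point, here is a minimal simulated sketch (made-up numbers): a true jump at the cutoff that is hard to spot in a raw scatter plot but can still be assessed formally.

set.seed(42)

# Simulate a running variable and an outcome with a true jump of 3 at the cutoff
n <- 2000
margin  <- runif(n, -10, 10)
outcome <- 0.1 * margin + 3 * (margin >= 0) + rnorm(n, sd = 10)

# The raw scatter plot looks like noise to the eye
plot(margin, outcome, pch = 16, col = rgb(0, 0, 0, 0.2))

# A binned RD plot and a formal estimate make the jump much easier to assess
rdrobust::rdplot(y = outcome, x = margin)
summary(rdrobust::rdrobust(y = outcome, x = margin))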

To further illustrate this, I took a look at one of my favourite RDD studies, namely ‘What Happens When Extremists Win Primaries?‘ by Andrew B. Hall (published in American Political Science Review). From the abstract: “When an extremist — as measured by primary-election campaign receipt patterns — wins a “coin-flip” election over a more moderate candidate, the party’s general-election vote share decreases on average by approximately 9–13 percentage points, and the probability that the party wins the seat decreases by 35–54 percentage points.”

Here are figures of the raw data from that study, similar to the ones Gelman presented in his blog post using our raw data:

What is clear here is that the raw data does not allow us to assess whether there is a strong effect (i.e. that when an extremist wins an election, the vote share decreases on average by approximately 9–13 percentage points). I am not picking this study to criticize the design (or say that “look, everybody else is doing this!”). Again, I pick this study as it is one of my favourite RDD studies that also shows large effects, or as Andrew B. Hall writes in the paper: “These are large effects.”

In sum, I am not convinced that the raw data says anything meaningful about whether we are left with noise-mining.

Concluding remarks

Overall, the purpose of this post is not to say that I am correct and Andrew Gelman is wrong. I go where the data brings me (even if it is not “common sense”). If it turns out there is no large effect of winning office on longevity, that’s fine with me. I’m not a politician.

Do download the data, read the paper, play around with the data, update the data (alas, in the long run we will have no missing data), try different models, find out how much work you need to put into this in order to make the estimates statistically non-significant etc. Again, you can make a lot of decisions when conducting this type of analysis, including but not limited to bandwidth choices, outliers, second order polynomial, alternative cutoffs and various restricted samples (we provide tests on all of these in the Appendix, but of course this is not enough — and we have provided the replication material for that exact reason).

I am not saying that any of the above is proof that Andrew Gelman deliberately presented only the models that suited his “common sense” belief that our study must be a failure (“we knew ahead of time not to trust this result because of the unrealistically large effect sizes”). However, I can say that I will, in the future, be much more critical towards his way of presenting his analyses of other people's work on his blog (and maybe even in his published work).

In general, I am sure I agree with Andrew Gelman on more topics than I disagree. As he wrote in a blog post the other day (in relation to another topic): “Build strong models, make big assumptions, issue strong statements, and then when (inevitably) you’re proven wrong, when you’re embarrassed in front of the world, own your errors and figure out what went wrong in your reasoning. Science is self-correcting—but only if we self-correct.” That’s spot on.

We definitely agree on the importance of being able to look into other people's analyses, results and interpretations. Nullius in verba and whatnot. Furthermore, the last thing I want to do is to look ungrateful when people are reproducing my work (I know from my prior work how pathetic scientists can be when you show that it is difficult to reproduce their work).

That being said, I was wondering whether I should dignify Andrew Gelman’s criticism with a response. He did little to engage with the material (and it shows). Here is my view: Andrew Gelman is an academic Woody Allen. Some of his work is very good, but his blog post on our study is closer to A Rainy Day in New York than, say, Annie Hall.

Overall, I see a contribution in our paper. As always, a single study is of little value in and of itself, but I do see a contribution. For that reason, I do not agree that our paper is a “scientific failure”, but I can see why such a categorisation is needed, as the criticism would be less effective in this case without these exaggerations.

To Andrew Gelman, any effect size he’s not convinced by is best explained by ‘forking paths’ (if all you have is a hammer, everything looks like a nail). Even if it requires a few detours in the garden to get to that point. I believe we agree that it’s possible to fool yourself with forking paths without ever realizing it. The disagreement here is primarily about who is not realizing it.