In a new blog post, Andrew Gelman writes that the findings in an article of ours are best explained by forking paths. I encourage you to read the blog post and, if you still care about the topic, continue and read this post as well.
This is going to be a (relatively) long post. In brief, I will show that the criticism is misleading. Specifically, I will show that it is easy to find the effect we report in our paper (without “statistical rule-following that’s out of control”) and that Andrew Gelman is either very unlucky or, what I find more likely, very selective in what he reports. I have no confidence that Andrew Gelman engaged with our material with an open mind; on the contrary, I believe he invested a non-trivial amount of time building up a (false) narrative about the validity and reliability of our findings.
That being said, Andrew Gelman was polite in reaching out, and he gave me the opportunity to comment on his criticism. Beyond a few clarifications, I decided against providing most of the comments below in a private conversation. Again, based upon his language and his analysis of our data, I am convinced that he has no interest in engaging in a constructive discussion about the validity of our findings. For that reason, I find it better to keep everything public and transparent.
In our paper, we show that winning political office has significant implications for the longevity of candidates for US gubernatorial office. Here is the abstract:
Does political office cause worse or better longevity prospects? Two perspectives in the literature offer contradicting answers. First, increased income, social status, and political connections obtained through holding office can increase longevity. Second, increased stress and working hours associated with holding office can have detrimental effects on longevity. To provide causal evidence, we exploit a regression discontinuity design with unique data on the longevity of candidates for US gubernatorial office. The results show that politicians winning a close election live 5–10 years longer than candidates who lose.
And here is the table with the main result:
The paper is published in Political Science Research and Methods. You can find the replication material for the article here.
Is our finding replicated?
Before we get into the details of the analysis and whatnot, I am happy to confirm that our findings are similar to those in a recent study published by the economists Mark Borgschulte and Jacob Vogler in Journal of Economic Behavior & Organization. For the sample most similar to ours, they find a local average treatment effect of 6.26 years:
It is great to see that other people are working on the topic and reaching similar conclusions. Overall, I believe that the effect we report in our paper is not only reproducible but also replicated by another set of researchers. I did inform Andrew Gelman about this study, but I can understand his reasons for not linking to it in his blog post.
How difficult is it to reproduce our findings?
In brief, Andrew Gelman is not convinced by our study. That’s okay. I am rarely convinced when I see a new study (it’s easy to think about potential limitations and statistical issues). However, what we see here are several characterisations of the results in particular and the statistical approach in general, such as “silly”, a “scientific failure”, “statistical rule-following that’s out of control”, “fragile” and “fatally flawed”.
No scientific study is perfect. Data is messy and to err is human. Accordingly, it is important and healthy for science that we closely and thoroughly inspect the quality of each other’s work (especially when we consider how many errors can easily slip through peer review). Not only do I appreciate that smart colleagues devote their scarce time to looking at my work, I also encourage people to check out my work and reach out if something is not working out (or write a blog post or publish a paper or make a tweet). For that reason, I make all my material publicly available (most often on Harvard Dataverse and GitHub). That being said, in this case, I am simply not convinced that our study is fatally flawed.
Andrew Gelman argues that we report an estimate that is very difficult to reproduce (and, at the end of the day, unrepresentative of a true effect): “I tried a few other things but I couldn’t figure out how to get that huge and statistically significant coefficient in the 5-10 year range. Until . . . I ran their regression discontinuity model”.
Seriously? Is it really that difficult to obtain a significant effect in line with what we report in the paper? Based upon Andrew Gelman’s post, I can understand if that’s the impression people might have. Maybe Andrew Gelman should consider reading Gelman and Hill (2007) and follow the first suggestion on how to run an RDD analysis. From page 214:
Without any complicated procedures, what happens if we get the data and follow the procedure as described in the introductory textbook? Well, let us look at the data and run the suggested regression.
```r
# Load tidyverse (to make life easy)
library("tidyverse")

# Load the data
df_rdd <- data.table::fread("longevity.csv")

# Make the data ready for the analysis
df_m <- df_rdd %>%
  filter(year >= 1945, living_day_imp_post > 0) %>%
  mutate(won = ifelse(margin_pct_1 >= 0, 1, 0),
         margin = margin_pct_1)

# Run regression
df_m %>%
  filter(abs(margin) < 5) %>%
  lm(living_day_imp_post ~ won + margin, data = .) %>%
  broom::tidy()
```
```
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
1 (Intercept)    8429.      676.     12.5   4.63e-28
2 won            3930.     1193.      3.29  1.13e- 3
3 margin         -656.      219.     -3.00  2.98e- 3
```
The effect of winning office is ~10 years according to this model (3,930 days ≈ 10.8 years). I am not saying that this is the best model to estimate (it’s not a model we report in the paper). However, that’s it. Nothing more. We don’t need rocket science (or a series of covariates and weird analytical choices) to reproduce the main result. You might hate the result, not believe it, believe it is nothing but noise etc., but at least have the decency to acknowledge that it is there.
How can it be that difficult to get this estimate? Gelman and Hill (2007) is more than 10 years old (I can still very much recommend it though), but I would suggest that you at least try to follow your own guidelines before you try to convince your readers that you could not reproduce a set of results. That is, if you don’t want to make a fool of yourself.
There is a certain irony to all of this, especially when the reason we didn’t pursue certain analytical choices was that it “would take additional work” (or, that’s at least what Andrew Gelman speculates). I know that I can’t expect too much of Andrew’s time; he is on a tight schedule with a new blog post every day (with an incentive, or at least a motivation, to point out flaws in research, especially in studies using regression discontinuity designs). But why not show how easy it is to get the main effect, however much you dislike it or the design that gave birth to it, without “statistical rule-following that’s out of control”? Why pretend that you can only find the effect when you follow our regression discontinuity model?
(Also, do note the number of cases in the example in the textbook, i.e. n = 68. Not a single word about power here. There is something funny about how the textbook example is good enough when you believe the result, but the sample size turns this into a “fatally flawed” study when you don’t believe in the result. I know, the book is old and things have changed over the last few years (“New Statistics”, all for the better). However, I have seen hundreds of RDD studies with less power than ours that are nowhere near as reliable. And don’t get me started on IV regression studies with weak instruments.)
The conclusion that our results are so difficult to reproduce is misleading. Or, in other words, it is convenient for his blog post that he didn’t bother to run the first RDD analysis suggested by Gelman and Hill (2007).
So what did he do? Interestingly, in order to show how difficult it is to reproduce our results, you need to take a few analytical steps. I don’t want to speculate too much but, for lack of a better phrase, we can say that Andrew Gelman fell prey to the garden of forking paths.
Specifically, Andrew Gelman is not trying to limit the number of decisions he has to make in order to find a non-significant effect (or, I think he was, but that didn’t work out). On the contrary, he is very much interested in getting “the most natural predictors” (I know, I laughed too!) right from the get-go. What’s the difference between falling prey to the garden of forking paths and having common sense insights into ‘the most natural predictors’? You tell me.
Let us look at what exactly Andrew Gelman is saying that he did in order to get to his statistically non-significant effect:
I created the decades_since_1950 variable as I had the idea that longevity might be increasing, and I put it in decades rather than years to get a more interpretable coefficient. I restricted the data to 1945-2012 and to candidates who were no longer alive at this time because that’s what was done in the paper, and I considered election margins of less than 10 percentage points because that’s what they showed in their graph, and also this did seem like a reasonable boundary for close elections that could’ve gone either way (so that we could consider it as a randomly assigned treatment).
Here is the problem. When I do all of this, I get an effect! Hmm. Hmm. Hmm. We better dig into the code reported in the blog post. Here is the code that he used to get at the first reported regression:
```r
df_rdd <- data.table::fread("longevity.csv")
death_date <- sapply(df_rdd[,"death_date_imp"], as.character)
living <- df_rdd[,"living"] == "yes"
death_date[living] <- "2020-01-01"
election_year <- as.vector(unlist(df_rdd[,"year"]))
election_date <- paste(election_year, "-11-05", sep="")
more_days <- as.vector(as.Date(death_date) - as.Date(election_date))
more_years <- more_days/365.24
age <- as.vector(unlist(df_rdd[,"living_day_imp_pre"]))/365.24
n <- nrow(df_rdd)
name <- paste(unlist(df_rdd[,"cand_last"]), unlist(df_rdd[,"cand_first"]),
              unlist(df_rdd[,"cand_middle"]))
first_race <- c(TRUE, name[2:n] != name[1:(n-1)])
margin <- as.vector(unlist(df_rdd[,"margin_pct_1"]))
won <- ifelse(margin > 0, 1, 0)
lifetime <- age + more_years
decades_since_1950 <- (election_year - 1950)/10
data <- data.frame(margin, won, election_year, age, more_years, living,
                   lifetime, decades_since_1950)
subset <- first_race & election_year >= 1945 & election_year <= 2012 &
  abs(margin) < 10 & !living
library("arm")
fit_1a <- lm(more_years ~ won + age + decades_since_1950 + margin,
             data=data, subset=subset)
display(fit_1a)
```
```
lm(formula = more_years ~ won + age + decades_since_1950 + margin,
   data = data, subset = subset)
                   coef.est coef.se
(Intercept)         78.60    4.05
won                  2.39    2.44
age                 -0.98    0.08
decades_since_1950  -0.21    0.51
margin              -0.11    0.22
---
n = 311, k = 5
residual sd = 10.73, R-Squared = 0.35
```
(No, the code is not a historical document on how people wrote R code in the 90s – nor a paid ad for tidyverse.)
Here the effect of winning office is only 2.39 years (all his models show estimates between 1 and 3 years). The first thing I notice here is that the sample size is substantially different from mine, so it must be something with the subsetting. Ah, I get it! He also restricted the sample with the first_race variable. Let us try to subset according to the actual procedure outlined in the paragraph above and estimate the model again.
```r
subset_reported <- election_year >= 1945 & election_year <= 2012 &
  abs(margin) < 10 & !living
fit_1a_reported <- lm(more_years ~ won + age + decades_since_1950 + margin,
                      data=data, subset=subset_reported)
display(fit_1a_reported)
```
```
lm(formula = more_years ~ won + age + decades_since_1950 + margin,
   data = data, subset = subset_reported)
                   coef.est coef.se
(Intercept)         74.65    3.18
won                  3.15    1.89
age                 -0.91    0.06
decades_since_1950  -0.03    0.41
margin              -0.19    0.17
---
n = 499, k = 5
residual sd = 10.93, R-Squared = 0.33
```
That makes more sense. Somebody might even want to call it statistically significant (I will, for the sake of argument, not do this here). My theory is that Andrew Gelman initially did as he wrote in the blog post but decided that it would not be good for his criticism to actually find an effect and, accordingly, took another path in the garden. In other words, that effect would not have been a good opening result for a blog post about a paper with forking paths and “statistical rule-following that’s out of control”. However, I can say that what Andrew Gelman is doing here is the simple act of ‘p-hacking in reverse’ (type-2 professor-level p-hacking instead of the well-known newb type-1 p-hacking).
Here is another funny thing: Later in the blog post, Andrew Gelman follows the actual procedure as described in the blog post. Now, however, it is described as an explicit choice to include “duplicate cases” — and just to laugh at the “silly” results: “Just for laffs, I re-ran the analysis including the duplicate cases”.
That’s a fucked-up sense of humour. Different strokes for different folks, I guess. Andrew Gelman played around with the data till he got the insignificant finding he wanted and then decided to attribute effects consistent with those in the paper to ‘just for laughs’ or to ‘including data’ (that was not excluded in the first place). What is the difference between selecting “the most natural predictors” and including variables just for laughs? Garden of forking paths, I guess.
In any case, congratulations! You could select a set of covariates that returns a sample size of 311 and the standard errors you wanted when you decided to go into the garden.
Using the non-significant effect, Gelman then continues to introduce a set of follow-up regressions to show that it is not possible for him to get anywhere near a significant effect, implying that a significant effect can only be obtained by forking paths: “What about excluding decades_since_1950? […] Nahhh, it doesn’t do much. We could exclude age also: […] Now the estimate’s even smaller and noisier! We should’ve kept age in the model in any case. We could up the power by including more elections: […] Now we have almost 500 cases, but we’re still not seeing that large and statistically significant effect.”
How (un)lucky can you be? Or, how little scientific integrity can you show when you engage with the material?
Here’s a thought experiment: If we had used the “most natural predictors” in the paper and found an effect (which we still did, after all), would Andrew Gelman then have agreed with us that those were the most natural predictors? And if the simple model presented above (with no covariates) had returned no effect, would Andrew Gelman still have found “the most natural predictors” to be relevant as the most natural predictors? Of course not.
As you will see in his blog post, he describes that he limits the sample significantly, but this is only to keep things simple: “To keep things simple, I just kept the first election in the dataset for each candidate”. I suggest he replace “keep things simple” with “make sure I have a small sample size and a non-significant effect and a chance to keep this blog post interesting”. Sure, there can be valid reasons to exclude these observations (or at least reflect upon how to best model the data at hand), but if you are trying to tell us that our study is useless, please provide better reasons for discarding a significant number of cases than to “keep things simple”.
Again, I am not convinced by the argument that he was unable to reproduce an effect similar to that reported in the paper.
What we did in the paper was not to use a simple OLS regression to estimate the effect, but the rdrobust package for robust nonparametric regression discontinuity estimates. Here is the main result reported in the paper:
```r
rdrobust::rdrobust(y = df_m$living_day_imp_post, x = df_m$margin) %>%
  summary()
```
```
Call: rdrobust

Number of Obs.           1092
BW type                 mserd
Kernel             Triangular
VCE method                 NN

Number of Obs.            516          576
Eff. Number of Obs.       236          243
Order est. (p)              1            1
Order bias  (q)             2            2
BW est. (h)             9.541        9.541
BW bias (b)            19.017       19.017
rho (h/b)               0.502        0.502
Unique Obs.               516          555

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]
=============================================================================
  Conventional  2749.283   873.601     3.147     0.002  [1037.057 , 4461.509]
        Robust         -         -     3.188     0.001  [1197.646 , 5020.823]
=============================================================================
```
This is the effect. Again, nothing more. Andrew Gelman also implies that including age as a covariate is needed in this approach, so here is the model with age as a predictor (conveniently not reported in his blog post):
```r
rdrobust::rdrobust(y = df_m$living_day_imp_post,
                   x = df_m$margin,
                   covs = df_m$living_day_imp_pre) %>%
  summary()
```
```
Call: rdrobust

Number of Obs.           1092
BW type                 mserd
Kernel             Triangular
VCE method                 NN

Number of Obs.            516          576
Eff. Number of Obs.       255          265
Order est. (p)              1            1
Order bias  (q)             2            2
BW est. (h)            10.412       10.412
BW bias (b)            20.323       20.323
rho (h/b)               0.512        0.512
Unique Obs.               516          555

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]
=============================================================================
  Conventional  1948.131   701.527     2.777     0.005  [ 573.164 , 3323.098]
        Robust         -         -     2.838     0.005  [ 689.788 , 3770.580]
=============================================================================
```
This is indeed a more precise estimate.
We do present a lot of results in the paper and the appendix. You have a lot of potential choices when you do an RDD analysis. Bandwidth selection, covariates, local-polynomial order etc., and that’s definitely something that can give a lot of possibilities in terms of forking paths.
We report the most parsimonious results in the paper in order not to give ourselves the freedom to pursue several (forking) paths (including making our case for what the most natural predictors are). We are transparent about the models we estimated in our paper, also when a result is not statistically significant. Specifically, not all estimates reported in our paper are significant, and hopefully our approach shows that it is indeed possible to get non-significant effects (though still interesting effect sizes). On the choice of bandwidth, for example, Andrew Gelman writes that the “analysis is very sensitive to the bandwidth”. We do show that it is possible to obtain insignificant effects when you estimate the models with a specific bandwidth. Here is the first figure in the appendix:
We find the largest effect estimates with a bandwidth of ~5 using no covariates (Model 1). For the covariates, I am sure, as Andrew Gelman most likely can confirm, it is possible to reduce this effect further with some forking paths.
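For readers who want to probe this sensitivity themselves, rdrobust makes it straightforward to loop over manually chosen bandwidths. Here is a minimal sketch, assuming the df_m data frame constructed earlier; the h argument fixes the main bandwidth instead of letting the MSE-optimal selector choose it:

```r
# Sketch of a bandwidth sensitivity check (assumes df_m as built above).
library(rdrobust)

bandwidths <- c(5, 7.5, 10, 12.5, 15)
results <- t(sapply(bandwidths, function(bw) {
  fit <- rdrobust(y = df_m$living_day_imp_post, x = df_m$margin, h = bw)
  c(bandwidth   = bw,
    est_years   = fit$coef[1] / 365.24,  # conventional estimate, days -> years
    robust_pval = fit$pv[3])             # robust p-value
}))
results
```

That is, anyone with the replication data can map out how the estimate moves with the bandwidth in a handful of lines, rather than taking either side’s word for it.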
There is, as Andrew Gelman also points out, no smoking gun. What he does instead is to provide a lot of scribbles about the sociology of science, common sense and fatally flawed statistics. The narrative mode is ‘stream of consciousness’ with a lot of negative words. The purpose is to show that there are so many issues with our study that it is, again, fatally flawed.
From what I can see, it is mostly a series of misunderstandings or, at best, comments that are completely irrelevant. I guess his aim is to show that there are so many issues that the sum of these proves his bigger point that the study is silly.
The first illustrative example is this point: “Then I looked above, and I saw the effective number of observations was only 236 or 243, not even as many as the 311 I had earlier!”
I suggest that you consult the documentation for rdrobust. The two columns give the number of observations on each side of the threshold, so the total effective number of observations is 479. And just to see if we can agree on one thing here (however, having seen how our analysis is being treated in the blog post, I remain skeptical):
479 > 311
Also, even if it were “only” around 240, that is still a sample size almost four times greater than the textbook example provided in Gelman and Hill (2007). I am not using this as an argument but as a recommendation to update the material if you want to conclude that what everybody else is doing is fatally flawed.
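As a side note, the effective sample sizes do not have to be read off the printed summary; the fitted rdrobust object stores them directly. A quick sketch (assuming the df_m data frame from earlier; as far as I can tell, the left/right effective counts live in the N_h component):

```r
fit <- rdrobust::rdrobust(y = df_m$living_day_imp_post, x = df_m$margin)

# N_h holds the effective number of observations on each side of the
# threshold; the total effective sample is their sum.
fit$N_h
sum(fit$N_h)
```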
Andrew Gelman also hints at issues with the data. Specifically, it is a problem that we have additional data that we do not use, such as politicians that died before the next election. I am unable to see how it should be a limitation that we provide all the data we have, including cases that we do not use in our analysis. Also, he says that it is not possible to get the dataset in the correct format, which is very much incorrect (it’s actually easy to download .csv files from the Harvard Dataverse).
Enough about this ‘circumstantial evidence’. The overall conclusion of the blog post is the following paragraph: “If you do a study of an effect that is small and highly variable (which this one is: to the extent that winning or losing can have large effects on your lifespan, the effect will surely vary a lot from person to person), you’ve set yourself up for scientific failure: you’re working with noise.”
I believe there is more signal than noise in this data. I definitely do not see our work as a scientific failure. The effects are all positive and not highly variable. Yes, they tend to be closer to 5 years than to 10 years, but that’s not necessarily noise. Again, this is in line with the finding for governors running in elections post-1908 reported by other researchers, where the effect size is ~6 years.
Next, these are not small effects. Even if you manage to press them all the way down to a few years, they are still substantial.
I do agree a lot with the point on heterogeneity, i.e. that the effects will vary a lot from person to person. We do explore that to some extent in the paper, but we try not to make too strong inferences about any of these subsample analyses. However, I will be happy to see future research deal with this exact point.
An effect size of up to 10 years is a large effect and, yes, it would have been interesting as well if the effect was 2 years. If the true effect of winning office on longevity is, say, 0.5 years, would we have sufficient statistical power to detect such an effect and conclude that it was statistically different from 0? I don’t think so, and I believe that Andrew Gelman has some good reflections on Type M and Type S errors in relation to effect sizes that are also relevant to our study.
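That back-of-the-envelope claim is easy to check with a simulation. The sketch below uses assumed numbers for illustration only: roughly 480 candidates in close elections, a residual standard deviation of about 11 years (in line with the OLS fits above), and a true effect of 0.5 years:

```r
# Rough power check: how often would we detect a true effect of 0.5 years
# at alpha = 0.05 with ~480 candidates and a residual SD of ~11 years?
set.seed(1)

simulate_once <- function(n = 480, effect = 0.5, sd = 11) {
  won <- rbinom(n, 1, 0.5)                      # close elections ~ coin flips
  years_lived <- effect * won + rnorm(n, 0, sd)
  summary(lm(years_lived ~ won))$coefficients["won", "Pr(>|t|)"]
}

p_values <- replicate(2000, simulate_once())
mean(p_values < 0.05)  # estimated power; far below the conventional 0.8
```

Under these assumptions, the power lands nowhere near 80 percent, so a true effect of half a year would be all but undetectable with a design of this size.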
Is it against common sense that politicians who win office live significantly different lives, and that these differences might add up to a substantial difference in longevity prospects? Maybe. I don’t think so. If I did, I would not have worked on a paper testing this dynamic. However, as always, it is important to remain skeptical towards the findings of our study (even when it’s in line with the findings in another study).
I will not speculate about whether it is against common sense or not. As always, everything is obvious once you know the answer. What I can say is that I would never let the size of an effect be decisive for whether I would like to publish the result or not (for that reason, I have also published several null findings). Also, when we are dealing with effect sizes, most people are unimpressed by small effect sizes (and a Cohen’s d of .9 is considered a small-to-moderate effect by lay people). Accordingly, I am not sure how strong of an approach common sense is (in and by itself) when we are to evaluate effect sizes.
That being said, my skepticism always increases as the effect size increases. And I would be surprised (and worried) if nobody decided to take a look at our data. Again, I am thankful that Andrew Gelman spent a significant amount of his time on this. Interestingly, Jon Mellon, co-director of the British Election Study, was also critical towards the effect size. He did look at the data. However, he did not reach the same conclusion as Andrew Gelman:
I am not including this here to say that Jon Mellon is on my side as this is not about being on one side or the other. However, I am saying that I find it interesting if other people look at the data as well without reaching the obvious conclusion that it is impossible to produce our models without making weird analytical choices. Also, I am open to the possibility that Jon Mellon – as anybody else – will update what they think about the findings in our paper in the light of these blog posts.
The role of visual inference in RDDs
The first thing Andrew Gelman presents in his blog post is a visual summary of our data (a scatter plot). He uses this to argue that there is nothing going on around the threshold (the discontinuity) and that all we can see is evidence of noise (and even the garden of forking paths!). I am not impressed by this reasoning, but I still find it relevant to comment on. Specifically, we know that a visual presentation of a discontinuity can be strong evidence for an effect, but the fact that we cannot eyeball a difference is not sufficient to say that there is no effect.
To understand this, consider this figure made by the economist C. Kirabo Jackson, which demonstrates how RD plots are not useful for assessing the existence of an effect.
Specifically, the figure shows a statistically significant effect, but it is clear that we cannot observe this effect visually. I do still believe it is relevant to look at the raw data to get a sense of what we are working with, but I am not sure why our evidence should be less compelling just because the effect is hard to see in the raw data.
To further illustrate this, I took a look at one of my favourite RDD studies, namely ‘What Happens When Extremists Win Primaries?‘ by Andrew B. Hall (published in American Political Science Review). From the abstract: “When an extremist — as measured by primary-election campaign receipt patterns — wins a “coin-flip” election over a more moderate candidate, the party’s general-election vote share decreases on average by approximately 9–13 percentage points, and the probability that the party wins the seat decreases by 35–54 percentage points.”
Here are figures of the raw data from Hall’s study, similar to those Gelman presented in his blog post using our data:
What is clear here is that the raw data does not allow us to assess whether there is a strong effect (i.e. that when an extremist wins an election, the vote share decreases on average by approximately 9–13 percentage points). I am not picking this study to criticize the design (or say that “look, everybody else is doing this!”). Again, I pick this study as it is one of my favourite RDD studies that also shows large effects, or as Andrew B. Hall writes in the paper: “These are large effects.”
In sum, I am not convinced that the raw data says anything meaningful about whether we are left with noise-mining.
Overall, the purpose of this post is not to say that I am correct and Andrew Gelman is wrong. I go where the data brings me (even if it is not “common sense”). If it turns out there is no large effect of winning office on longevity, that’s fine with me. I’m not a politician.
Do download the data, read the paper, play around with the data, update the data (alas, in the long run we will have no missing data), try different models, find out how much work you need to put into this in order to make the estimates statistically non-significant etc. Again, you can make a lot of decisions when conducting this type of analysis, including but not limited to bandwidth choices, outliers, second order polynomial, alternative cutoffs and various restricted samples (we provide tests on all of these in the Appendix, but of course this is not enough — and we have provided the replication material for that exact reason).
I am not saying that any of the above is proof that Andrew Gelman deliberately only presented models that suited his “common sense” belief about the fact that our study must be a failure (“we knew ahead of time not to trust this result because of the unrealistically large effect sizes”). However, I can say that I, in the future, will be much more critical towards his way of presenting his analysis of other people’s work on his blog (and maybe even in his published work).
In general, I am sure I agree with Andrew Gelman on more topics than I disagree. As he wrote in a blog post the other day (in relation to another topic): “Build strong models, make big assumptions, issue strong statements, and then when (inevitably) you’re proven wrong, when you’re embarrassed in front of the world, own your errors and figure out what went wrong in your reasoning. Science is self-correcting—but only if we self-correct.” That’s spot on.
We definitely agree on the importance of being able to look into other people’s analyses, results and interpretations. Nullius in verba and whatnot. Furthermore, the last thing I want to do is to look ungrateful when people are reproducing my work (I know from my prior work how pathetic scientists can be when you show that it’s difficult to reproduce their work).
That being said, I was wondering whether I should dignify Andrew Gelman’s criticism with a response. He did little to engage with the material (and it shows). Here is my view: Andrew Gelman is an academic Woody Allen. Some of his work is very good, but his blog post on our study is closer to A Rainy Day in New York than, say, Annie Hall.
Overall, I see a contribution in our paper. As always, a single study is of little value in and by itself, but I do see a contribution. For that reason, I don’t agree that our paper is a “scientific failure”, but I can see how such a categorisation is needed as the criticism is less effective in this case without these exaggerations.
To Andrew Gelman, any effect size he’s not convinced by is best explained by ‘forking paths’ (if all you have is a hammer, everything looks like a nail). Even if it requires a few detours in the garden to get to that point. I believe we agree that it’s possible to fool yourself with forking paths without ever realizing it. The disagreement here is primarily about who is not realizing it.