In 2017, I pushed the replication material for my article, ‘Welfare Retrenchments and Government Support’, to a GitHub repository. I had been working on the article for years and the code was not necessarily up to date. It worked perfectly, gave the exact estimates and was relatively easy to read. Accordingly, everything was good, life was simple and I felt confident that I would never have to look at the code again.
This turned out not to be the case. I recently got an email from a student who was unable to get the exact estimates as reported in Table 1 in the paper, even when following my script and using the data I made publicly available. I went through the code and noticed that I could not reproduce the exact estimates with my current R setup either. Sure, the results were substantively identical, but they were not exactly the same, and the N was also different.
I looked into the issue and could see that changes were made to the random number generation defaults in R 3.6.0: the sampling algorithm used by functions such as `sample()` changed from `"Rounding"` to `"Rejection"`, and `set.seed()` gained a `sample.kind` argument to control it. As I ran the original analyses in R 3.3.1 and I am now using R 4.1.0, this could explain why the matching procedure I rely on is not returning the exact same matches. For that reason, I decided to make some updates to the replication material, so there is now a dataset with the matched data. The script is doing the same as before, but it is now relying on the matched data obtained with the setup in R 3.3.1. This should make it a lot easier to get the exact same estimates as provided throughout the paper.
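To illustrate the change (this snippet is not part of the replication material), the pre-3.6.0 sampling behavior can be requested explicitly on newer versions of R:

```r
# Minimal sketch, not from the replication script: comparing the new and
# old sampling defaults on R >= 3.6.0.
set.seed(1234)                            # default since R 3.6.0: sample.kind = "Rejection"
sample(1:10, 3)

set.seed(1234, sample.kind = "Rounding")  # emulate the pre-3.6.0 default (emits a warning)
sample(1:10, 3)                           # matches what R 3.3.1 would have drawn

RNGkind()                                 # shows the active RNG settings
```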
To increase the chances of long-term reproducibility, I could consider using packrat or a Docker container (I primarily use Docker for my Shiny dashboards). However, as the analyses are mostly a few OLS regressions, I believe this would be overkill and would not necessarily make it easier for most people to download the data and script and play around with the results. And I don't mind making extra updates in the future if that is what it takes to reproduce the results with different setups.
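For reference, the packrat route I decided against would look roughly like this (a sketch, assuming the replication files live in the project directory):

```r
# Sketch of the packrat setup I decided against; run from the project directory.
install.packages("packrat")
packrat::init()       # create a project-specific package library
packrat::snapshot()   # record the exact package versions in packrat.lock
packrat::restore()    # later, on another machine: reinstall those exact versions
```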
Interestingly, I did all of these analyses before I doubled down on the tidyverse, and for that reason I decided to make a series of additional updates to the material, including:
- More spaces to make the code easier to read. For example, instead of `x=week, y=su` it is now `x = week, y = su`.
- The use of underscores (snake case) instead of dots. For example, the object `ess.matched` is now `ess_matched`.
- A significant reduction in the use of dollar signs (primarily by the use of `mutate()`).
- The use of `pivot_longer()` instead of `gather()`.
- No double mention of the variable `edulevel` in the variable selection.
- Removing the deprecated `type.dots` argument from `rdplot()`.
- The use of `seq(0.01, 0.25, 0.01)` instead of having 0.01, 0.02, 0.03, 0.04, etc. all the way to 0.25!
- The use of `map_df()` instead of a for loop (a few before/after sketches of these changes follow below).
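To make some of the items above more concrete, here are a few hedged before/after sketches; the data frame and variable names are made up for illustration and are not the ones used in the paper:

```r
# Illustrative before/after sketches of the kinds of changes listed above;
# the objects and variables here are toy examples, not the replication data.
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(id = 1:4, y2016 = rnorm(4), y2017 = rnorm(4))

# Fewer dollar signs: mutate() instead of df$new <- ...
# Before: df$change <- df$y2017 - df$y2016
df <- df %>% mutate(change = y2017 - y2016)

# pivot_longer() instead of the superseded gather()
# Before: gather(df, key = "year", value = "value", y2016, y2017)
df_long <- df %>% pivot_longer(cols = c(y2016, y2017),
                               names_to = "year", values_to = "value")

# seq() instead of typing every value by hand
bandwidths <- seq(0.01, 0.25, 0.01)

# map_df() instead of a for loop that rbind()s the results together
results <- map_df(bandwidths, function(bw) {
  tibble(bandwidth = bw, n_within = nrow(filter(df, abs(change) < bw)))
})
```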
And there are a series of other minor changes that make the code easier to read and use in 2021. I have made the updated material available in the GitHub repository. There is a revised R script for the analysis, a dataset with the matched observations, and a file with the session info on the current setup I used to reproduce the results.
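Saving the session info only takes a line; a minimal version (the exact file name in the repository may differ) looks like this:

```r
# Save the R version and package versions used to reproduce the results
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```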
I have started using the new native pipe operator in R (`|>`) instead of the tidyverse pipe (`%>%`), but I decided not to change this in the current version to make sure that the script also works with the version of R I used to conduct the analysis years ago (the native pipe was only introduced in R 4.1.0). In other words, the 2021 script should work using both R 3.3.1 and R 4.1.0.
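For simple calls the two pipes are interchangeable, as in this small illustration (not taken from the replication code):

```r
library(dplyr)

# magrittr pipe: works on both old and new R versions (given the packages install)
mtcars %>% filter(cyl == 4) %>% summarise(mean_mpg = mean(mpg))

# native pipe: only available from R 4.1.0 onwards
mtcars |> filter(cyl == 4) |> summarise(mean_mpg = mean(mpg))
```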
I also thought about using the `essurvey` package to get the data from the European Social Survey (we have an example of how to do that in the Quantitative Politics with R book), but I find it safer to only work with local copies of the data and not rely on this package being available in the future.
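For completeness, the `essurvey` route would have looked roughly like this; the round number and e-mail address are placeholders, and relying on the package staying available is exactly the concern mentioned above:

```r
# Hypothetical sketch of the essurvey route I decided against; the round
# number and e-mail address are placeholders, not the ones used in the paper.
library(essurvey)

set_email("your.registered@email.com")  # e-mail registered with the ESS website
ess_round <- import_rounds(1)           # download one ESS round as a data frame
```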
In a parallel universe, a more productive version of myself would spend time and energy on more fruitful endeavors than updating the material for an article published years ago. However, I can highly recommend going through old material and seeing whether it still works. Some of the issues you encounter will help you ensure that the replication material you create for future projects is also more likely to stand the test of time.