Updating the replication material for “Welfare Retrenchments and Government Support”

In 2017, I pushed the replication material for my article, ‘Welfare Retrenchments and Government Support’, to a GitHub repository. I had been working on the article for years and the code was not necessarily up to date, but it worked perfectly, gave the exact estimates and was relatively easy to read. Accordingly, everything was good, life was simple and I felt confident that I would never have to look at the code again.

This turned out not to be the case. I recently received an email from a student who was unable to get the exact estimates reported in Table 1 in the paper, even when following my script and using the data I made publicly available. I went through the code and noticed that I could not reproduce the exact estimates with my current R setup. Sure, the results were substantively identical, but they were not exactly the same, and the N was also different.

I looked into the issue and I could see that the default sampling method used after set.seed() changed in R 3.6.0 (the sample.kind default went from "Rounding" to "Rejection"). As I ran the original analyses in R 3.3.1, and I am now using R 4.1.0, this could explain why the matching procedure I rely on is not returning the exact same matches. For that reason, I decided to make some updates to the replication material so there is now a dataset with the matched data. The script is doing the same as before, but it is now relying on the matched data obtained with the setup in R 3.3.1. This should make it a lot easier to get the exact same estimates as provided throughout the paper.
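
As a minimal sketch of what that change means in practice (and not a claim that this is exactly what happens inside the matching procedure), the same seed can give different draws from sample() depending on the sampler. The seed value and the toy vector below are arbitrary choices for illustration; the sample.kind argument is available in R 3.6.0 and later.

```r
# Same seed, two samplers: "Rounding" was the default before R 3.6.0,
# "Rejection" is the default from R 3.6.0 onwards.
# R warns that "Rounding" is non-uniform, hence suppressWarnings().
suppressWarnings(set.seed(2017, sample.kind = "Rounding"))
old_draw <- sample(1:1000, 5)

set.seed(2017, sample.kind = "Rejection")
new_draw <- sample(1:1000, 5)

# Compare the two draws; any procedure that relies on sample()
# can be affected in the same way.
print(old_draw)
print(new_draw)
```

Setting sample.kind = "Rounding" in a modern R session is one way to try to recover pre-3.6.0 draws, although it does not guard against changes in the packages themselves.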

To increase the chances of long-term reproducibility, I should consider using packrat or a Docker container (I primarily use Docker for my Shiny dashboards). However, as the analyses are mostly a few OLS regressions, I believe this would be overkill and would not necessarily make it easier for most people to download the data and script and play around with the results. And I don’t mind making extra updates in the future if needed in order to reproduce the results with different setups.

Interestingly, I did all of these analyses before I doubled down on tidyverse and for that reason I decided to make a series of additional updates to the material, including:

  • More spaces to make the code easier to read. For example, instead of x=week, y=su it is now x = week, y = su.
  • The use of underscores (snake case) instead of dots. For example, the object ess.matched is now ess_matched.
  • A significant reduction in the use of dollar signs (primarily by the use of mutate()).
  • The use of pivot_longer() instead of gather().
  • No double mention of the variable edulevel in the variable selection.
  • Removing the deprecated type.dots argument from rdplot().
  • The use of seq(0.01, 0.25, 0.01) instead of having 0.01, 0.02, 0.03, 0.04, etc. all the way to 0.25!
  • The use of map_df() instead of a for loop.

And a series of other minor changes that make the code easier to read and use in 2021. I have made the updated material available in the GitHub repository. There is a revised R-script for the analysis, a dataset with the matched observations and a file with the session info on the current setup I used to reproduce the results.
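
Two of the changes in the list above, seq() and map_df(), can be sketched on toy data. The function fit_at_value() is a hypothetical stand-in and not code from the replication script; I only assume that something is estimated once per value in the sequence.

```r
library(purrr)

# Hypothetical stand-in for whatever is estimated at each value
fit_at_value <- function(x) {
  data.frame(value = x, result = round(1000 * x))
}

# Instead of typing 0.01, 0.02, ..., 0.25 by hand
values <- seq(0.01, 0.25, 0.01)

# Old style: grow a data frame inside a for loop
results_loop <- data.frame()
for (x in values) {
  results_loop <- rbind(results_loop, fit_at_value(x))
}

# New style: map over the values and row-bind the results
results_map <- map_df(values, fit_at_value)
```

Both versions return the same 25 rows; the map_df() version simply avoids the bookkeeping of the loop.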

I have started using the new native pipe operator in R (|>) instead of the tidyverse pipe (%>%), but I decided not to change this in the current version to make sure that the script is also working well using the version of R I used to conduct the analysis years ago. In other words, the 2021 script should work using both R 3.3.1 and R 4.1.0.

I also thought about using the essurvey package to get the data from the European Social Survey (we have an example of how to do that in the Quantitative Politics with R book), but I find it safer to only work with local copies of the data and not rely on this package being available in the future.

In a parallel universe a more productive version of myself would spend time and energy on more fruitful endeavors than updating the material for an article published years ago. However, I can highly recommend going through old material and seeing whether it still works. Some of the issues you might encounter will help you a lot in ensuring that the replication material you create for future projects is also more likely to stand the test of time.

Calculate seats in the Folketing with R

In many opinion polls, party support is reported not only as the share of votes in percent but also as a number of seats. As is well known, the D’Hondt method is used to allocate the constituency seats in Danish general elections, which together with the compensatory seats ensure a proportional relationship between votes and seats.

If you would like to estimate how many seats the respective parties stand to get, I can warmly recommend the seatdist package for R. It is developed by Juraj Medzihorsky and can be found here. Once you have installed the package, you can easily load it into R and use the giveseats() function to calculate seats:

giveseats(c(33, 6, 10, 7, 8, 1, 3, 1, 5, 17, 8, 1), 
          ns = 175, 
          thresh = 0.02,
          method = "dh")

The first thing we give the function is a vector with the support for the parties in percent (I have left out the decimals here simply to make it easier to read). 33, for example, is the support for the Social Democrats in percent. ns specifies how many seats we are to allocate (number of seats, in this case 175), thresh specifies the electoral threshold (2% in this case) and method is our allocation method (dh for D’Hondt).

Here we can see that the Social Democrats would get around 61 seats at the next general election. This is of course an estimate, as 1) there is uncertainty in the opinion poll and 2) not all seats are allocated this simply at the election. Nor do we take the four North Atlantic seats into account. Nevertheless, it is relatively easy to get an estimate of how large the support for the parties is in seats. The package also offers a wide range of options for examining the parties’ seat numbers if other allocation methods were used.
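
To make the logic of the allocation concrete, here is a minimal base R sketch of the D’Hondt rule with a threshold. The dhondt() function is written for illustration and is not part of seatdist; in particular, I assume that a party is eligible when its share is at least the threshold, which may differ from how giveseats() treats a party exactly at the limit.

```r
# Illustrative D'Hondt allocation (highest averages with divisors 1, 2, 3, ...)
dhondt <- function(votes, n_seats, threshold = 0) {
  shares   <- votes / sum(votes)
  eligible <- shares >= threshold       # assumption: ">=" at the threshold
  seats    <- integer(length(votes))

  for (i in seq_len(n_seats)) {
    # Each party's current quotient; parties below the threshold never win
    quotients <- ifelse(eligible, votes / (seats + 1), -Inf)
    winner    <- which.max(quotients)   # ties go to the first party listed
    seats[winner] <- seats[winner] + 1L
  }
  seats
}

support <- c(33, 6, 10, 7, 8, 1, 3, 1, 5, 17, 8, 1)
seats   <- dhondt(support, n_seats = 175, threshold = 0.02)
print(seats)
```

Each seat goes to the party with the highest quotient of votes divided by one more than the seats it already holds, which is why larger parties tend to be slightly favoured.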

Data visualization: a reading list

Here is a collection of books and peer-reviewed articles on data visualization. There is a lot of good material on the philosophy, principles and practices of data visualization.

I plan to update the list with additional material in the future (see the current version as a draft). Do reach out if you have any recommendations.

Introduction

Graphs in Statistical Analysis (Anscombe 1973)
An Economist’s Guide to Visualizing Data (Schwabish 2014)
Data Visualization in Sociology (Healy and Moody 2014)
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (Weissgerber et al. 2015)
Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods (Cleveland and McGill 1984)
Graphic Display of Data (Wilkinson 2012)
Visualizing Data in Political Science (Traunmüller 2020)
Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks (Schwabish 2021)

History

Historical Development of the Graphical Representation of Statistical Data (Funkhouser 1937)
Quantitative Graphics in Statistics: A Brief History (Beniger and Robyn 1978)

Tips and recommendations

Ten Simple Rules for Better Figures (Rougier et al. 2014)
Designing Graphs for Decision-Makers (Zacks and Franconeri 2020)
Designing Effective Graphs (Frees and Miller 1998)
Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics (Donahue 2011)
Designing Better Graphs by Including Distributional Information and Integrating Words, Numbers, and Images (Lane and Sándor 2009)

Analysis and decision making

Statistical inference for exploratory data analysis and model diagnostics (Buja et al. 2009)
Statistics and Decisions: The Importance of Communication and the Power of Graphical Presentation (Mahon 1977)
The Eight Steps of Data Analysis: A Graphical Framework to Promote Sound Statistical Analysis (Fife 2020)

Uncertainty

Researchers Misunderstand Confidence Intervals and Standard Error Bars (Belia et al. 2005)
Error bars in experimental biology (Cumming et al. 2007)
Confidence Intervals and the Within-the-Bar Bias (Pentoney and Berger 2016)
Depicting Error (Wainer 1996)
When (ish) is My Bus?: User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems (Kay et al. 2016)
Decisions With Uncertainty: The Glass Half Full (Joslyn and LeClerc 2013)
Uncertainty Visualization (Padilla et al. 2020)
A Probabilistic Grammar of Graphics (Pu and Kay 2020)

Tables

Let’s Practice What We Preach: Turning Tables into Graphs (Gelman et al. 2002)
Why Tables Are Really Much Better Than Graphs (Gelman 2011)
Graphs or Tables (Ehrenberg 1978)
Using Graphs Instead of Tables in Political Science (Kastellec and Leoni 2007)
Ten Guidelines for Better Tables (Schwabish 2020)

Deciding on a chart

Graph and chart aesthetics for experts and laymen in design: The role of familiarity and perceived ease of use (Quispel et al. 2016)

Chart types

Boxplots

40 years of boxplots (Wickham and Stryjewski 2011)

Pie charts

No Humble Pie: The Origins and Usage of a Statistical Chart (Spence 2005)

Infographics

Infovis and Statistical Graphics: Different Goals, Different Looks (Gelman and Unwin 2013)
InfoVis Is So Much More: A Comment on Gelman and Unwin and an Invitation to Consider the Opportunities (Kosara 2013)
InfoVis and Statistical Graphics: Comment (Murrell 2013)
Graphical Criticism: Some Historical Notes (Wickham 2013)
Tradeoffs in Information Graphics (Gelman and Unwin 2013)

Maps

Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation (Lucchesi and Wikle 2017)

Scatterplot

The Many Faces of a Scatterplot (Cleveland and McGill 1984)
The early origins and development of the scatterplot (Friendly and Denis 2005)

Dot plots

Dot Plots: A Useful Alternative to Bar Charts (Robbins 2006)

3D charts

The Pseudo Third Dimension (Haemer 1951)

Teaching pedagogy

Correlational Analysis and Interpretation: Graphs Prevent Gaffes (Peden 2001)
Numbers, Pictures, and Politics: Teaching Research Methods Through Data Visualizations (Rom 2015)
Data Analysis and Data Visualization as Active Learning in Political Science (Henshaw and Meinke 2018)

Software

Excel

Effective Data Visualization: The Right Chart for the Right Data (Evergreen 2016)

R

Data Visualization (Healy 2018)
Data Visualization with R (Kabacoff 2018)
ggplot2: Elegant Graphics for Data Analysis (Wickham 2009)
Fundamentals of Data Visualization (Wilke 2019)
R Graphics Cookbook (Chang 2020)

Stata

A Visual Guide to Stata Graphics (Mitchell 2012)


Changelog
– 2021-03-01: Add ‘Better Data Visualizations’
– 2020-08-03: Add ‘Ten Guidelines for Better Tables’
– 2020-07-14: Add ‘Designing Graphs for Decision-Makers’ and ‘A Probabilistic Grammar of Graphics’ (ht: Simon Straubinger)

Potpourri: Statistics #63 (COVID-19)

Why outbreaks like coronavirus spread exponentially, and how to “flatten the curve”
Top 15 R resources on Novel COVID-19 Coronavirus
Collection of analyses, packages, visualisations of COVID19 data in R
Coronavirus (Covid-19) Data in the United States
How to Flatten the Curve, a Social Distancing Simulation and Tutorial
Forecasting COVID-19
Flatten the COVID-19 curve
Tidying the Johns Hopkins Covid-19 data
Our World in Data: Coronavirus Source Data
A COVID Small Multiple
– Some R-scripts available on GitHub: andrewheiss/flatten_the_curve.R, acoppock/covid, troelst/covid.Rmd

New resource: awesome-ggplot2

I use ggplot2 every day. It is a great R package and the best tool available to make beautiful data visualisations. The logic of the grammar of graphics makes it easy to learn and lets you gradually improve your plots.

Luckily, there are a lot of resources available for the package. I have created a repository with a list of packages, tutorials and other useful resources.

It is available here. It was also featured in R Weekly.

Potpourri: Statistics #58

Mastering R presentations
The Little Handbook of Statistical Practice
Create regular expressions easily
Data Integrity Tests for R
Quantitative Economics with Python
Doing Meta-Analysis in R: A Hands-On Guide
Appreciating R: The Ease of Testing Linear Model Assumptions
Just Quickly: The unexpected use of functions as arguments
Glue magic Part I
Pivoting tidily
Introducing fable
Advancing Text Mining with R and quanteda
Map coloring: the color scale styles available in the tmap package
Practical ggplot2

A Guide to Getting International Statistics into R

In political science, some of the data we use is from international databases such as the World Bank, ILOSTAT, OECD, WHO and Eurostat. One possibility to access data from these sources is to manually download data from their webpages. This is, however, often time-consuming and not an efficient way to obtain data.

Luckily, there are easier ways to access international statistics. In this post, I will show you how to get data from the World Bank, ILOSTAT, OECD, WHO, Eurostat and Our World in Data into R. The R packages available to access data are called WDI, Rilostat, OECD, WHO, eurostat and owidR, respectively. In brief, the packages make it easy for you to get the most recent data on a series of indicators into R.

To begin, while technically not required to obtain the data, load tidyverse (for the data management tools). Next, load the six packages mentioned above. Make sure to install the packages first (they are all, with the exception of owidR, available on CRAN).

# load relevant packages
## data management etc.
library("tidyverse")
## the six packages to access data
library("WDI")
library("Rilostat")
library("OECD")
library("WHO")
library("eurostat")
library("owidR")

The packages have some similarities. Specifically, there are two steps you need to go through. First, you will have to find the data you would like to use. Second, you will need to download the data. In the table below I outline the relevant functions for each step in the six packages.

Package     Find data             Download data
WDI         WDIsearch()           WDI()
Rilostat    get_ilostat_toc()     get_ilostat()
OECD        search_dataset()      get_dataset()
WHO         get_codes()           get_data()
eurostat    get_eurostat_toc()    get_eurostat()
owidR       owid_search()         owid()

You might not be sure what exact source to use. Instead, you will know what type of data you are looking for, e.g. data on unemployment. Accordingly, I find it useful to save the string of relevance (in this example unemployment) and search through the individual sources. Below, I search for unemployment in each data source and examine the output in the View window.

# finding data
## search string
searchText <- "unemployment"

## World Bank
searchText %>%
  WDIsearch() %>%
  View()

## ILOSTAT
ilostat_list <- get_ilostat_toc()
ilostat_list %>%
  filter(str_detect(tolower(indicator.label), tolower(searchText))) %>%
  View()

## OECD
oecd_list <- get_datasets()
search_dataset(searchText, data = oecd_list) %>% 
  View()

## WHO
who_list <- get_codes()
who_list %>%
  filter(str_detect(tolower(display), tolower(searchText))) %>%
  View()

## Eurostat
eurostat_list <- get_eurostat_toc()
eurostat_list %>%
  filter(str_detect(tolower(title), tolower(searchText))) %>%
  View()

## Our World in Data
searchText %>%
  owid_search() %>%
  View()

In the View() window you will get a list of the variables containing the search string in the label. Next to each of the labels you will see what the unique indicator id is for the variable. This is the information we will use to download the data.

In the World Bank and ILOSTAT, the indicator variable is called indicator. In Our World in Data, the indicator variable is called chart_id. In OECD the indicator variable is called id and in WHO it is called label. For the unemployment rate in the World Bank data, for example, we can see that the indicator is SL.UEM.TOTL.ZS.

Using the code below, I download data from the various datasets. You can change the specific indicators to whatever data you would like to download.

# get data
## World Bank
data_worldbank <- WDI(indicator = "SL.UEM.TOTL.ZS")

## ILOSTAT
data_ilostat <- get_ilostat(id = "UNE_DYAP_NOC_RT_A")

## OECD
data_oecd <- get_dataset(dataset = "AVD_DUR")

## WHO
data_who <- get_data("tfr")

## Eurostat
data_eurostat <- get_eurostat("ei_lmhr_m")

## Our World in Data
data_owid <- owid("unemployment-rate")

We now have the data in our six objects (data_*). Usually, you want to restructure the data or link it to other datasets. This is where the functions in tidyverse come in handy.
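
As a small sketch of what linking can look like, the example below filters one data frame and joins it to another by country code and year. Both data frames are toy stand-ins with made-up values, and the column names (iso2c, year) are assumptions for illustration; check the names in the data you actually downloaded.

```r
library(dplyr)

# Toy stand-ins for two downloaded datasets (all values are made up)
unemployment <- data.frame(
  iso2c        = c("DK", "DK", "NO", "SE"),
  year         = c(2019, 2020, 2020, 2020),
  unemployment = c(1, 2, 3, 4)
)

population <- data.frame(
  iso2c      = c("DK", "NO", "SE"),
  year       = c(2020, 2020, 2020),
  population = c(10, 20, 30)
)

# Keep one year and add the second dataset by country and year
combined <- unemployment %>%
  filter(year == 2020) %>%
  left_join(population, by = c("iso2c", "year"))

print(combined)
```

Joining on a standardised country code rather than the country name is usually safer, as the sources do not always spell country names the same way.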

Using the data we got from the World Bank, we can show the unemployment rate in the Scandinavian countries:

# create figure
data_worldbank %>% 
  drop_na(SL.UEM.TOTL.ZS) %>%
  filter(country %in% c("Denmark", "Norway", "Sweden")) %>%
  ggplot(aes(x = year, y = SL.UEM.TOTL.ZS, colour = country)) +
  geom_line(size = 1) +
  theme_minimal() + 
  labs(title = "Unemployment (% of total labor force), Scandinavia",
       colour = NULL,
       y = NULL,
       x = NULL) +
  theme(legend.position = "bottom")

This is basically all you need to get data into R. Some of the packages have extra features that I recommend that you check out (e.g. the ability to download data on multiple indicators at once with the WDI() function).

Notably, there are other packages that will help you get international statistics into R. The BIS package, for example, makes it possible to get data from the Bank for International Settlements into R. In this specific example, however, there are only a few variables available and no need for a search string (for an example on getting data from BIS, I have updated the code on GitHub). Another example is the imfr package that enables you to get data from the International Monetary Fund (you can read more about this package here).

Last, there are a few principles that I recommend that you follow. First, only download the data you need. For some of the functions, you can specify the period and countries you want data from. This will ensure that you do not download the full data (e.g. WDI(country = c("DK", "NO", "SE"), indicator = "SL.UEM.TOTL.ZS")).

Second, only download the data once and save it in a local file. Instead of having one script where you both download and manipulate the data, consider having a script where you download and save the data and another script where you work with the data. There is simply no need to download the data again and again, especially if you run all of your code several times a day.
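
One way to follow this second principle is a small helper that only downloads when no local copy exists. This is a minimal sketch: read_or_download() and fake_download() are hypothetical names, and fake_download() stands in for a real call such as WDI(indicator = "SL.UEM.TOTL.ZS") so that the example runs without an internet connection.

```r
# Hypothetical helper: download once, then reuse the local copy
read_or_download <- function(path, download_fn) {
  if (!file.exists(path)) {
    saveRDS(download_fn(), path)  # first run: fetch the data and store it
  }
  readRDS(path)                   # every run: read the local file
}

# Stand-in for a real download; counts how often it is called
n_downloads <- 0
fake_download <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(country = c("Denmark", "Norway", "Sweden"),
             value   = c(1, 2, 3))  # made-up values
}

# A temporary file is used here; in a project, use a path in your data folder
cache_file <- file.path(tempdir(), "data_worldbank.rds")
unlink(cache_file)  # make sure the example starts without a cached file

first  <- read_or_download(cache_file, fake_download)
second <- read_or_download(cache_file, fake_download)
```

The second call reads the saved file instead of downloading again. To refresh the data, delete the local file and run the script once more.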

Changelog
– 2021-09-16: Add Our World in Data (the owidR package) to the guide.
– 2020-05-13: Add note on the imfr package.
– 2019-10-08: Add note on other packages, including an example with the BIS package.
– 2019-10-05: Add Eurostat (the eurostat package) to the guide.

Teaching material: Quantitative Politics with R

If you are interested in learning R, I can recommend this resource: Quantitative Politics with R. It is a guide in development (together with Zoltán Fazekas).

In the current version, you will find an introduction to the basics of R (e.g. how to import and manipulate data), how to collect political data (primary and secondary data), how to visualise data and a brief introduction to OLS regression.

The material follows – for the most part – tools within the tidyverse (such as the dplyr and ggplot2 packages). In future versions you will find additional techniques to analyse data, more on scoped verbs, functional programming tools etc. Any suggestions, feedback and comments are more than welcome.

(This is old news for people following me on Twitter.)

Potpourri: Statistics #51

– 2018 in Graphics: Bloomberg, FiveThirtyEight, Reuters, Nathan Yau
Survey Raking: An Illustration
textrecipes 0.0.1
Topics in Econometrics: Advances in Causality and Foundations of Machine Learning
Learning Statistics with R
EDUC 263: Introduction to Data Management Using R
Practical R for Mass Communication and Journalism: How Do I? …
Text classification with tidy data principles
Easily generate information-rich, publication-quality tables from R
gganimate: Getting Started
Text as Data
A biased tour of the uncertainty visualization zoo

Potpourri: Statistics #50

Generating data to explore the myriad causal effects that can be estimated in observational data analysis
A Practical Guide to Mixed Models in R
Ask the Question, Visualize the Answer
Statistical Rethinking with brms, ggplot2, and the tidyverse
Twitter, political ideology & the 115th US Senate
You Can’t Test Instrument Validity
Introduction to Econometrics with R
Hands-On Programming with R
Tidytext Tutorials
What’s the best way to learn the programming language R? (Preferably, for free)