Erik Gahner Larsen

A Guide to Getting International Statistics into R

In political science, some of the data we use comes from international databases such as the World Bank, ILOSTAT, OECD, WHO and Eurostat. One way to access data from these sources is to download it manually from their webpages. This is, however, often time-consuming and not an efficient way to obtain data.

Luckily, there are easier ways to access international statistics. In this post, I will show you how to get data from the World Bank, ILOSTAT, OECD, WHO and Eurostat into R. The R packages available to access data are called WDI, Rilostat, OECD, WHO and eurostat, respectively. In brief, the packages make it easy for you to get the most recent data on a series of indicators into R.

To begin, load tidyverse (for the data management tools); it is not technically required to obtain the data, but it will come in handy. Next, load the five packages mentioned above. Make sure to install the packages first (they are all available on CRAN).

# load relevant packages
## data management etc.
library("tidyverse")
## the five packages to access data
library("WDI")
library("Rilostat")
library("OECD")
library("WHO")
library("eurostat")

The packages have some similarities. Specifically, there are two steps you need to go through. First, you will have to find the data you would like to use. Second, you will need to download the data. In the table below I outline the relevant functions for each step in the five packages.

Package    Find data            Download data
WDI        WDIsearch()          WDI()
Rilostat   get_ilostat_toc()    get_ilostat()
OECD       search_dataset()     get_dataset()
WHO        get_codes()          get_data()
eurostat   get_eurostat_toc()   get_eurostat()

You might not be sure what exact source to use. Instead, you will know what type of data you are looking for, e.g. data on unemployment. Accordingly, I find it useful to save the search string of interest (in this example unemployment) and use it to search through the individual sources. Below, I search for unemployment in each data source and examine the output in the View window.

# finding data
## search string
searchText <- "unemployment"

## World Bank
searchText %>%
  WDIsearch() %>%
  View()

## ILOSTAT
ilostat_list <- get_ilostat_toc()
ilostat_list %>%
  filter(str_detect(tolower(indicator.label), tolower(searchText))) %>%
  View()

## OECD
oecd_list <- get_datasets()
search_dataset(searchText, data = oecd_list) %>%
  View()

## WHO
who_list <- get_codes()
who_list %>%
  filter(str_detect(tolower(display), tolower(searchText))) %>%
  View()

## Eurostat
eurostat_list <- get_eurostat_toc()
eurostat_list %>%
  filter(str_detect(tolower(title), tolower(searchText))) %>%
  View()

In the View window you will get a list of the variables containing the search string in their label. Next to each label you will see the unique indicator id for the variable. This is the information we will use to download the data.

In the World Bank and ILOSTAT, the indicator variable is called indicator. In OECD the indicator variable is called id and in WHO it is called label. For the unemployment rate in the World Bank data, for example, we can see that the indicator is SL.UEM.TOTL.ZS.

Using the code below, I download data from the various datasets. You can change the specific indicators to whatever data you would like to download.

# get data
## World Bank
data_worldbank <- WDI(indicator = "SL.UEM.TOTL.ZS")

## ILOSTAT
data_ilostat <- get_ilostat(id = "UNE_DYAP_NOC_RT_A")

## OECD
data_oecd <- get_dataset(dataset = "AVD_DUR")

## WHO
data_who <- get_data("tfr")

## Eurostat
data_eurostat <- get_eurostat("ei_lmhr_m")

We now have the data in our five objects (data_*). Usually, you want to restructure the data or link it to other datasets. This is where the functions in tidyverse come in handy.

Using the data we got from the World Bank, we can show the unemployment rate in the Scandinavian countries:

# create figure
data_worldbank %>% 
  drop_na(SL.UEM.TOTL.ZS) %>%
  filter(country %in% c("Denmark", "Norway", "Sweden")) %>%
  ggplot(aes(x = year, y = SL.UEM.TOTL.ZS, colour = country)) +
  geom_line(size = 1) +
  theme_minimal() + 
  labs(title = "Unemployment (% of total labor force), Scandinavia",
       colour = NULL,
       y = NULL,
       x = NULL) +
  theme(legend.position = "bottom")

This is basically all you need to get data into R. Some of the packages have extra features that I recommend you check out (e.g. the ability to download data on multiple indicators at once with the WDI() function).
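As a sketch of this multiple-indicator feature, you can pass a vector of indicator codes to WDI() (here I add NY.GDP.PCAP.KD, GDP per capita, as a second indicator; swap in whatever indicators you need):

# download data on multiple indicators in one call
data_multi <- WDI(indicator = c("SL.UEM.TOTL.ZS", "NY.GDP.PCAP.KD"))

Each indicator will show up as its own column in the resulting data frame.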

Notably, there are other packages that will help you get international statistics into R. The BIS package, for example, makes it possible to get data from the Bank for International Settlements into R. In this specific case, however, only a few variables are available, so there is no need for a search string (for an example of getting data from BIS, I have updated the code on GitHub).

Last, there are a few principles that I recommend that you follow. First, only download the data you need. For some of the functions, you can specify the period and countries you want data from. This will ensure that you do not download the full data (e.g. WDI(country = c("DK", "NO", "SE"), indicator = "SL.UEM.TOTL.ZS")).
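As a sketch of this first principle, the WDI() function also takes start and end arguments, so you can restrict both the countries and the period in a single call:

# only Scandinavia, only 2000-2018
data_scandinavia <- WDI(country = c("DK", "NO", "SE"),
                        indicator = "SL.UEM.TOTL.ZS",
                        start = 2000, end = 2018)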

Second, only download the data once and save it in a local file. Instead of having one script where you both download and manipulate the data, consider having a script where you download and save the data and another script where you work with the data. There is simply no need to download the data again and again, especially if you run all of your code several times a day.
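A minimal sketch of this two-script setup (the file name data_worldbank.csv is my own choice):

# script 1: download the data once and save it locally
data_worldbank <- WDI(indicator = "SL.UEM.TOTL.ZS")
write_csv(data_worldbank, "data_worldbank.csv")

# script 2: load the local copy and work with it from here
data_worldbank <- read_csv("data_worldbank.csv")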

– 2019-10-08: Note on other packages added, including an example with the BIS package.
– 2019-10-05: Eurostat (the eurostat package) added to the guide.

Seven ways to find data

Data might be the new oil (there are arguments for and against this). While there definitely is a lot of data out there for you to drill, it can be difficult to find the exact data you need.

In this post I will outline seven different strategies to 1) keep yourself updated on new data sources and 2) find older datasets. I do not recommend that you necessarily go with all of them (there is a significant overlap between what you will find using the different strategies), and I have ranked the strategies according to my own personal preferences.

1. Newsletters
One of the best ways to keep yourself updated on new datasets is by getting the updates directly to your mailbox. Here, I can highly recommend the weekly newsletter Data Is Plural by Jeremy Singer-Vine.

While there are other newsletters out there, my impression is that if you subscribe to Data Is Plural, you should be covered. In addition, you can take a look at the structured archive of datasets covered in the newsletter (841 datasets at the time of writing). If you do not already subscribe to the newsletter, do yourself a favour and sign up.

2. GitHub repositories
Another good way to find data is to explore GitHub repositories. A lot of repositories host data (e.g. media outlets like FiveThirtyEight), and by exploring popular repositories, you will often find interesting data.

However, there are repositories that also list datasets you might find interesting. Awesome Public Datasets, for example, is a list of open datasets from a wide range of fields (GIS, neuroscience, sports, climate etc.). I curate the PolData repository where you can find a list of political datasets (elections, international relations, parties, policies etc.).

3. Twitter
Twitter is, as always, a good way to keep yourself in the loop. While there are specific accounts on Twitter that tweet about new and old datasets (such as GetTheData and Pew Research Methods), the most useful strategy here is to follow researchers.

Researchers care about sharing useful resources such as datasets. To illustrate, I found this amazing resource on free and open psychological datasets on Twitter.

4. Harvard Dataverse
The Harvard Dataverse is another great place to find datasets. The search function is working well and there is publicly available data related to various topics (especially for political scientists).

Notably, I use this service to get a sense of forthcoming articles (as the data is usually stored online before the articles hit your RSS and/or Twitter feed). For example, journals such as the American Journal of Political Science and the Journal of Politics have their own dataverses where they archive datasets well in advance of the actual publication.

Psychologists might prefer OSF instead of the Harvard Dataverse. However, I find OSF cumbersome to use and a mess when you want to explore potential datasets.

5. Facebook groups
Facebook is usually not my cup of tea (let us be honest: it is shit). That being said, there are some good groups for academics to explore. One of these is Political Science Data where people are good at sharing links to new resources. Furthermore, this is also a good place to ask for data suggestions. My impression is that there are similar Facebook groups available for other scientific domains as well.

6. Reddit
If you already use Reddit, the subreddit r/datasets is worth looking into. The quality of the submissions is not always great, but you will often find interesting datasets from various fields.

Another subreddit to check out is r/dataisbeautiful, where people share data visualizations (mostly original content). While sharing data is not the main objective of the subreddit, you will most likely find a lot of interesting data there.

7. Google Dataset Search
Last, we have Google Dataset Search. I like the idea of having a Google for datasets. And this is literally a Google for datasets. That being said, I have not used this service a lot and whenever I use it to find data, I am not convinced that this is the best strategy to use. Accordingly, I recommend following the six resources introduced above before using this service.

Potpourri: Statistics #57

Word limits in political science journals

Different political science journals have different article formats with different word/page limits. Consequently, whenever you want to submit an article to a journal, the first thing to look up is the exact word limit.

In order to get a sense of the different article formats and word limits in political science journals, I have created an overview. The overview shows word limits for long articles, short articles and review essays/articles.

The overview currently consists of 65 journals, and I will most likely add more journals (and more features) in the future. Do reach out on Twitter or drop me a mail if you have any feedback or if there is a specific journal of relevance to political scientists that I should add to the overview.

Last, the overview is sorted by impact factor (obtained with the excellent scholar package in R).

25 interesting facts

1. Associations with cancer risk or benefits have been claimed for most food ingredients (Schoenfeld and Ioannidis 2013)

2. People in non-English speaking countries with subtitled TV are better at English than people in countries with dubbed television (Micola et al. 2019)

3. Walking speed is a function of city size in that pedestrians move more quickly in big cities than in small towns (Walmsley and Lewis 1989)

4. Littered cigarette filters reduce growth and alter short-term primary productivity of terrestrial plants (Green et al. 2019)

5. In soccer penalty kicks, goalkeepers almost always jump right or left (the optimal strategy is to stay in the goal’s center) (Bar-Eli et al. 2007)

6. Credit card payments increase unhealthy food purchases (Thomas et al. 2011)

7. Autocracies systematically build more new skyscrapers than democracies (Gjerløw and Knutsen 2019)

8. Bacteria persist more efficiently on laminated restaurant menus as compared to paper menus (Sirsat et al. 2013)

9. People view their own perceptions and beliefs as objective reflections of reality but others’ as distorted by bias (Pronin 2008)

10. The price of champagne falls before New Year’s Eve due to the entry of a large share of new consumers (Bayot and Caminade 2014)

11. In prison, inmates cooperate in Prisoner’s Dilemma (Khadjavi and Lange 2013)

12. The World Cup in soccer increases state aggression (Bertoli 2017)

13. Vaccines are not associated with autism (Taylor et al. 2014)

14. Open office spaces make workers rely more on email while decreasing face-to-face interaction (Bernstein and Turban 2018)

15. GDP data can be systematically manipulated for political ends (Wallace 2016)

16. In used-car transactions, there is left-digit bias in the processing of odometer values, i.e. people focus on the number’s leftmost digits, with implications for the sale price (Lacetera et al. 2012)

17. Thanksgiving dinners attended by residents from opposing political party precincts are 30 to 50 minutes shorter than same-party dinners (Chen and Rohla 2018) (although see this eLetter)

18. Bronze medalists tend to be happier than silver medalists (Medvec et al. 1995)

19. Warming oceans are killing coral reefs (Hughes et al. 2018)

20. National trust levels are negatively associated with the length of countries’ constitutions (Bjørnskov and Voigt 2014)

21. In an experiment, increased sexual frequency did not lead to increased happiness (Loewenstein et al. 2015)

22. WTO membership is likely to have no causal effect on domestic corruption overall (if anything, it is likely to increase corrupt practices, particularly among firms that are government owned) (Choudhury 2019)

23. Walking is good for creative thinking (Oppezzo and Schwartz 2014)

24. Media outlets are more likely to report opinion polls that show larger changes (Larsen and Fazekas 2019; Searles et al. 2016)

25. Listening to Mozart does not improve your spatial-reasoning performance (Steele et al. 1999; McKelvie and Low 2002; Črnčec et al. 2006)

Teaching material: Quantitative Politics with R

If you are interested in learning R, I can recommend this resource: Quantitative Politics with R. It is a guide in development, written together with Zoltán Fazekas.

In the current version, you will find an introduction to the basics of R (e.g. how to import and manipulate data), how to collect political data (primary and secondary data), how to visualise data and a brief introduction to OLS regression.

The material follows – for the most part – tools within the tidyverse (such as the dplyr and ggplot2 packages). In future versions you will find additional techniques to analyse data, more on scoped verbs, functional programming tools etc. Any suggestions, feedback and comments are more than welcome.

(This is old news for people following me on Twitter.)

Alas, it’s not rocket science

Boris Johnson writes in The Telegraph that since we could get to the Moon, we should be able to get out of the EU: “They went to the Moon 50 years ago. Surely today we can solve the logistical issues of the Irish border”.

I sympathise with the sentiment in the argument. A lot of smart people – including a lot of social scientists – are working on solving complicated and difficult social issues, and it should be possible to solve the issue at hand. And why not expect this when scientists can solve complicated issues in the natural sciences? After all, it’s not rocket science.

However, there are very good reasons why we cannot simply solve social issues that – intuitively – should be easy to solve. In brief, social science is much more complicated than a lot of the issues we deal with in the natural sciences. We simply believe that we understand complex social phenomena, when the truth is that we are not good at understanding and/or predicting such phenomena. Accordingly, when social scientists say that social science is not rocket science, we envy the simplicity of rocket science.

Duncan J. Watts describes this clearly in his book, Everything Is Obvious: How Common Sense Fails Us: “Well, I’m no rocket scientist, and I have immense respect for the people who can land a machine the size of a small car on another planet. But the sad fact is that we’re actually much better at planning the flight path of an interplanetary rocket than we are at managing the economy, merging two corporations, or even predicting how many copies of a book will sell.”

The problem is that Boris Johnson assumes that the ontological parsimony of the natural sciences easily applies to the social sciences. There are specific reasons why this perspective works only for the natural sciences and not the social sciences. Seva Gunitsky (2019), for example, describes why ontological parsimony works in the natural sciences: “The scientific version of ontological parsimony, most often associated with theoretical physics and mathematics (but sometimes imported into social science), argues that reality itself is governed by parsimonious physical laws. The fundamental physical nature of matter itself, at least at the subatomic level, possesses a symmetry that abets and even demands parsimonious explanations. Parsimonious theories that take advantage of this symmetry are appealing not just because they are elegant, but because they are more likely to be true.”

However, we cannot draw on the simplicity of the natural sciences to infer the potential to identify and suggest solutions to social issues. In the social sciences, we do not have the luxury of studying parsimonious physical laws. On the contrary, the social world is much more complicated. Acknowledging this is important if we are to actually understand and solve social issues – including the logistical issues of the Irish border.

New article in European Political Science Review: Bailout or bust?

Robert Klemmensen, Michael Baggesen Klitgaard and I have a new article in the May issue of the European Political Science Review. The article is titled ‘Bailout or bust? Government evaluations in the wake of a bailout‘. Here is the abstract:

Governments are often punished for negative events such as economic downturns and financial shocks. However, governments can address such shocks with salient policy responses that might mitigate public punishment. We use three high-quality nationally representative surveys collected around a key event in the history of the Dutch economy, namely the outbreak of the financial crisis in 2008, to examine how voters responded to a salient government bailout. The results illustrate that governments can get substantial credit for pursuing a bailout in the midst of a financial crisis. Future research should take salient policy responses into account to fully understand the public response to the outbreak of financial and economic crises.

You can find the article here. Replication material is available at GitHub and the Harvard Dataverse.

Potpourri: Statistics #56