How (not) to study suicide terrorism

Today is the 20th anniversary of 9/11. That made me look into one of the most salient methodological discussions on how to study suicide terrorism within political science.

Suicide terrorism is a difficult topic to study. Why? Because we cannot learn about the causes (or correlates) of suicide terrorism by studying only cases of suicide terrorism. Pape (2003) studies 188 suicide attacks in the period 1980-2001. He concludes that there is a strategic logic to these attacks, namely that they pay off for the organisations and groups pursuing such attacks.

Ashworth et al. (2008) use simple statistics such as conditional probabilities to show that there are problems with the paper in question, namely that the original paper “samples on the dependent variable.” I especially liked this formulation in the conclusion: “It is important to note that our critique of Pape’s (2003) analysis does not make the well-known point that association does not imply causation. Rather, because Pape collects only instances of suicide terrorism, his data do not even let him calculate the needed associations.”
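To make the point about conditional probabilities concrete, here is a minimal sketch in R with made-up counts. The numbers and the condition labelled `occupation` are purely hypothetical and are not taken from Pape (2003) or Ashworth et al. (2008). The sketch only illustrates what can and cannot be calculated once the cases without attacks are dropped.

```r
# Hypothetical counts of cases; all numbers are invented for illustration.
cases <- data.frame(
  occupation = c(TRUE, TRUE, FALSE, FALSE),
  attack     = c(TRUE, FALSE, TRUE, FALSE),
  n          = c(18, 182, 2, 798)
)

# Share of cases with an attack among cases where occupation == occ.
p_attack_given <- function(data, occ) {
  rows <- data[data$occupation == occ, ]
  sum(rows$n[rows$attack]) / sum(rows$n)
}

# With the full data, the two conditional probabilities can be compared.
p_occ    <- p_attack_given(cases, TRUE)   # 18 / 200
p_no_occ <- p_attack_given(cases, FALSE)  # 2 / 800

# Sampling on the dependent variable: keep only the cases with an attack.
attacks_only <- cases[cases$attack, ]

# The only quantity left is P(occupation | attack).
p_occ_given_attack <-
  sum(attacks_only$n[attacks_only$occupation]) / sum(attacks_only$n)  # 18 / 20

# P(attack | occupation) and P(attack | no occupation) are both 1 by
# construction, so the association cannot be calculated.
p_occ_sel    <- p_attack_given(attacks_only, TRUE)
p_no_occ_sel <- p_attack_given(attacks_only, FALSE)
```

In the selected data, `p_occ_given_attack` is the only informative number, and it would be the same regardless of how many cases without attacks exist in either group. That is the sense in which the data do not let you calculate the needed associations.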

Pape (2008) provides a reply to the critique raised by Ashworth and colleagues. He first quotes a long excerpt from his book that does not take the critique of Ashworth et al. into account. Then, he writes: “One might still wonder whether the article is flawed by sample bias because it considered systematically only actual instances of suicide terrorism. The answer is no, for two reasons. First, the article did not sample suicide terrorism, but collected the universe of suicide terrorist attacks worldwide from 1980 through 2001. […] There is no such thing as sample bias in collecting a universe. Second, although it is true that the universe systematically studied did not include suicide terrorist campaigns that did not happen, and that this limits the claims that my article could make, this does not mean that my analysis could not support any claims or that it could not support the claims I actually made.”

Importantly, even if you have the universe of suicide terrorist attacks, you should still treat it as a sample (especially if you want to make policy recommendations about future cases we have not seen yet). In other words, this is a weird way of defending a flawed analysis. In an unpublished rejoinder, Ashworth et al. (2008) provide some additional arguments for why the response to the criticism is flawed. Also, Horowitz (2010) shows that when you increase the universe of cases, Pape’s findings do not hold.

The debate is more than ten years old but reminiscent of similar contemporary debates on data and causality. Accordingly, I find it to be a good read for people interested in research design, data and inference — and it’s a good case to discuss what can (not) be learned from ‘selecting on the dependent variable’. Last, and most importantly, if you want to understand this amazing tweet, it is good to be familiar with the debate.

Potpourri: Statistics #63 (COVID-19)

Why outbreaks like coronavirus spread exponentially, and how to “flatten the curve”
Top 15 R resources on Novel COVID-19 Coronavirus
Collection of analyses, packages, visualisations of COVID19 data in R
Coronavirus (Covid-19) Data in the United States
How to Flatten the Curve, a Social Distancing Simulation and Tutorial
Forecasting COVID-19
Flatten the COVID-19 curve
Tidying the Johns Hopkins Covid-19 data
Our World in Data: Coronavirus Source Data
A COVID Small Multiple
– Some R-scripts available on GitHub: andrewheiss/flatten_the_curve.R, acoppock/covid, troelst/covid.Rmd

Potpourri: Statistics #58

Mastering R presentations
The Little Handbook of Statistical Practice
Create regular expressions easily
Data Integrity Tests for R
Quantitative Economics with Python
Doing Meta-Analysis in R: A Hands-On Guide
Appreciating R: The Ease of Testing Linear Model Assumptions
Just Quickly: The unexpected use of functions as arguments
Glue magic Part I
Pivoting tidily
Introducing fable
Advancing Text Mining with R and quanteda
Map coloring: the color scale styles available in the tmap package
Practical ggplot2

A Guide to Getting International Statistics into R

In political science, some of the data we use is from international databases such as the World Bank, ILOSTAT, OECD, WHO and Eurostat. One possibility to access data from these sources is to manually download data from their webpages. This is, however, often time-consuming and not an efficient way to obtain data.

Luckily, there are easier ways to access international statistics. In this post, I will show you how to get data from the World Bank, ILOSTAT, OECD, WHO, Eurostat and Our World in Data into R. The R packages available to access data are called WDI, Rilostat, OECD, WHO, eurostat and owidR, respectively. In brief, the packages make it easy for you to get the most recent data on a series of indicators into R.

To begin, while technically not required to obtain the data, load tidyverse (for the data management tools). Next, load the six packages mentioned above. Make sure to install the packages first (they are all, with the exception of owidR, available on CRAN).

# load relevant packages
## data management etc.
library("tidyverse")
## the six packages to access data
library("WDI")
library("Rilostat")
library("OECD")
library("WHO")
library("eurostat")
library("owidR")

The packages have some similarities. Specifically, there are two steps you need to go through. First, you will have to find the data you would like to use. Second, you will need to download the data. In the table below, I outline the relevant functions for each step in the six packages.

Package     Find data             Download data
WDI         WDIsearch()           WDI()
Rilostat    get_ilostat_toc()     get_ilostat()
OECD        search_dataset()      get_dataset()
WHO         get_codes()           get_data()
eurostat    get_eurostat_toc()    get_eurostat()
owidR       owid_search()         owid()

You might not be sure which source to use, but you will usually know what type of data you are looking for, e.g. data on unemployment. Accordingly, I find it useful to save the search string (in this example unemployment) and search through the individual sources. Below, I search for unemployment in each data source and examine the output in the View window.

# finding data
## search string
searchText <- "unemployment"

## World Bank
searchText %>%
  WDIsearch() %>%
  View()

## ILOSTAT
ilostat_list <- get_ilostat_toc()
ilostat_list %>%
  filter(str_detect(tolower(indicator.label), tolower(searchText))) %>%
  View()

## OECD
oecd_list <- get_datasets()
search_dataset(searchText, data = oecd_list) %>% 
  View()

## WHO
who_list <- get_codes()
who_list %>%
  filter(str_detect(tolower(display), tolower(searchText))) %>%
  View()

## Eurostat
eurostat_list <- get_eurostat_toc()
eurostat_list %>%
  filter(str_detect(tolower(title), tolower(searchText))) %>%
  View()

## Our World in Data
searchText %>%
  owid_search() %>%
  View()

In the View() window you will get a list of the variables containing the search string in the label. Next to each of the labels you will see what the unique indicator id is for the variable. This is the information we will use to download the data.

In the World Bank and ILOSTAT, the indicator variable is called indicator. In Our World in Data, the indicator variable is called chart_id. In OECD the indicator variable is called id, in WHO it is called label and in Eurostat it is called code. For the unemployment rate in the World Bank data, for example, we can see that the indicator is SL.UEM.TOTL.ZS.

Using the code below, I download data from the various datasets. You can change the specific indicators to whatever data you would like to download.

# get data
## World Bank
data_worldbank <- WDI(indicator = "SL.UEM.TOTL.ZS")

## ILOSTAT
data_ilostat <- get_ilostat(id = "UNE_DYAP_NOC_RT_A")

## OECD
data_oecd <- get_dataset(dataset = "AVD_DUR")

## WHO
data_who <- get_data("tfr")

## Eurostat
data_eurostat <- get_eurostat("ei_lmhr_m")

## Our World in Data
data_owid <- owid("unemployment-rate")

We now have the data in our six objects (data_*). Usually, you want to restructure the data or link it to other datasets. This is where the functions in tidyverse come in handy.
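As a small illustration of linking two sources, here is a minimal sketch that merges two toy data frames on country code and year. The data frames, their values and the column name other_indicator are made up for the example and only stand in for downloaded data. The sketch uses base R merge() so that it runs without extra packages; left_join() from tidyverse does the same job.

```r
# Toy stand-ins for two downloaded datasets; all values are made up.
unemployment <- data.frame(
  iso2c        = c("DK", "DK", "SE", "SE"),
  year         = c(2019, 2020, 2019, 2020),
  unemployment = c(5, 6, 7, 8)
)
other_source <- data.frame(
  iso2c           = c("DK", "SE", "SE", "NO"),
  year            = c(2019, 2019, 2020, 2019),
  other_indicator = c(10, 20, 30, 40)
)

# Keep every row of the unemployment data and add the other indicator
# where the country code and year match (missing matches become NA).
linked <- merge(unemployment, other_source,
                by = c("iso2c", "year"), all.x = TRUE)
```

Rows without a match in the second source (here Denmark in 2020) are kept with a missing value, and rows that only exist in the second source (here Norway) are dropped.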

Using the data we got from the World Bank, we can show the unemployment rate in the Scandinavian countries:

# create figure
data_worldbank %>% 
  drop_na(SL.UEM.TOTL.ZS) %>%
  filter(country %in% c("Denmark", "Norway", "Sweden")) %>%
  ggplot(aes(x = year, y = SL.UEM.TOTL.ZS, colour = country)) +
  geom_line(linewidth = 1) +  # use size = 1 with ggplot2 versions before 3.4.0
  theme_minimal() + 
  labs(title = "Unemployment (% of total labor force), Scandinavia",
       colour = NULL,
       y = NULL,
       x = NULL) +
  theme(legend.position = "bottom")

This is basically all you need to get data into R. Some of the packages have extra features that I recommend that you check out (e.g. the ability to download data on multiple indicators at once with the WDI() function).

Note that there are other packages that will help you get international statistics into R. The BIS package, for example, makes it possible to get data from the Bank for International Settlements into R. In this specific example, however, there are only a few variables available and no need for a search string (for an example of getting data from BIS, I have updated the code on GitHub). Another example is the imfr package that enables you to get data from the International Monetary Fund (you can read more about this package here).

Last, there are a few principles that I recommend that you follow. First, only download the data you need. For some of the functions, you can specify the period and countries you want data from. This will ensure that you do not download the full data (e.g. WDI(country = c("DK", "NO", "SE"), indicator = "SL.UEM.TOTL.ZS")).

Second, only download the data once and save it in a local file. Instead of having one script where you both download and manipulate the data, consider having a script where you download and save the data and another script where you work with the data. There is simply no need to download the data again and again, especially if you run all of your code several times a day.
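One way to follow the second principle is a small helper that only downloads when no local copy exists. The sketch below is a hypothetical illustration: download_once() and fake_download() are names I made up, and fake_download() merely stands in for a real call such as WDI(indicator = "SL.UEM.TOTL.ZS") so that the example runs without an internet connection.

```r
# Hypothetical helper: call `fetch` only if `path` does not exist yet,
# save the result, and always return what is stored on disk.
download_once <- function(path, fetch) {
  if (!file.exists(path)) {
    saveRDS(fetch(), path)
  }
  readRDS(path)
}

# Stand-in for a real download; it counts how often it is called.
n_downloads <- 0
fake_download <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(country = c("Denmark", "Norway", "Sweden"), value = c(1, 2, 3))
}

# A temporary file is used here; in a project you would use a fixed path.
cache_file <- tempfile(fileext = ".rds")
first  <- download_once(cache_file, fake_download)  # downloads and saves
second <- download_once(cache_file, fake_download)  # reads the saved file
```

In a real project, the path would be a file in your project folder and fetch would wrap the actual download, e.g. function() WDI(indicator = "SL.UEM.TOTL.ZS"). Delete the file when you want fresh data.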

Changelog
– 2021-09-16: Add Our World in Data (the owidR package) to the guide.
– 2020-05-13: Add note on the imfr package.
– 2019-10-08: Add note on other packages added, including an example with the BIS package.
– 2019-10-05: Add Eurostat (the eurostat package) to the guide.

Potpourri: Statistics #57

Keep It Together: Using the tidyverse for machine learning
Learn to purrr
Mastering Shiny
A Comprehensive List of Handy R Packages
The challenges of using machine learning to identify gender in images
How is polling done around the world?
How to Get Better at Embracing Unknowns
Drawing maps in R
Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics
Visualizing Locke and Mill: a tidytext analysis
Tutorial: Cleaning UK Office for National Statistics data in R
– Transitioning into the tidyverse: part 1, part 2
Your Friendly Guide to Colors in Data Visualisation
Optimising your R code – a guided example
Learning data visualization
Reference Collection to push back against “Common Statistical Myths”
mutate_all(), select_if(), summarise_at()… what’s the deal with scoped verbs?!
Tools for Exploring and Comparing Data Frames
Tom’s Cookbook for Better Viz
Themes to Improve Your ggplot Figures
Lesser Known R Features
What Statistics Can and Can’t Tell Us About Ourselves
A Graphical Introduction to tidyr’s pivot_*()
n() cool #dplyr things
Bayesian Linear Mixed Models: Random Intercepts, Slopes, and Missing Data
Prepping data for #rstats #tidyverse and a priori planning
NYT-style urban heat island maps

Potpourri: Statistics #54

A data.table and dplyr tour
Mistakes, we’ve drawn a few
Twenty rules for good graphics
gganimate: The grammar of animation
Visualising Intersecting Sets Of Twitter Followers
Docker and Packrat
Explore your Researcher Degrees of Freedom
Teaching material: Data analytics and visualization
10 things R can do that might surprise you
Scraping Data from the Web with rvest
Common statistical tests are linear models (or: how to teach stats)
8 Useful R Packages for Data Science You Aren’t Using (But Should!)
Easy multi-panel plots in R using facet_wrap() and facet_grid() from ggplot2
Winners of the 1st Shiny Contest
Rachael’s R Tutorials
Web Scraping for Broad City Charts
Implementing the super learner with tidymodels
Three things to know beyond base R

Potpourri: Statistics #52

Here’s why 2019 is a great year to start with R: A story of 10 year old R code then and now
How the BBC Visual and Data Journalism team works with graphics in R
Special Topics in Data Science: Responsible Data Science
Causal Data Science
From Psychologist to Data Scientist
Causal Graphs Seminar
R Coding Style Guide
Explaining the 2016 Democratic Primary with Machine Learning
A guide to making your data analysis more reproducible
Exploring the multiplication table with R
hcandersenr: An R Package for H.C. Andersens fairy tales
Solving the model representation problem with broom
Basic Stata Syntax Workshop
Bayesian Logistic Regression using brms, Part 1
Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups
Understanding propensity score weighting
Causal Inference Book
15 new ideas and new tools for R gathered from the RStudio Conference 2019
Keeping up to date with R news
tidylog

Potpourri: Statistics #51

– 2018 in Graphics: Bloomberg, FiveThirtyEight, Reuters, Nathan Yau
Survey Raking: An Illustration
textrecipes 0.0.1
Topics in Econometrics: Advances in Causality and Foundations of Machine Learning
Learning Statistics with R
EDUC 263: Introduction to Data Management Using R
Practical R for Mass Communication and Journalism: How Do I? …
Text classification with tidy data principles
Easily generate information-rich, publication-quality tables from R
gganimate: Getting Started
Text as Data
A biased tour of the uncertainty visualization zoo

Potpourri: Statistics #46

ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus
Xenographics: Weird but (sometimes) useful charts
An R package for sensitivity analysis (konfound)
How to Choose and Design the Perfect Chart
Digging deeper: online resources for intermediate to advanced R users
Animated Directional Chord Diagrams
Making a twitter dashboard with R
Why not to use two axes, and what to use instead
Teaching difference-in-differences
factoextra: Extract and Visualize the Results of Multivariate Data Analyses
What to consider when choosing colors for data visualization
Clickbait-Corrected p-Values