## Causality models: Campbell, Rubin and Pearl

In political science, the predominant way to discuss causality is in relation to experiments and counterfactuals (within the potential outcomes framework). However, we also use concepts such as internal and external validity and sometimes we use arrows to show how different concepts are connected. When I was introduced to causality, it was on a PowerPoint slide with the symbol X, a rightwards arrow, and the symbol Y, together with a few bullet points on the specific criteria that should be met before we can say that a relationship is causal (inspired by John Gerring’s criterial approach; see, e.g., Gerring 2005).

Importantly, there are multiple models we can consider when we want to discuss causality. In brief, there are three popular causality models today: 1) the Campbell model (focusing on threats to validity), 2) the Rubin model (focusing on potential outcomes), and 3) the Pearl model (focusing on directed acyclic graphs). The names of the models are based on the names of the researchers who have been instrumental in the development of these models (Donald Campbell, Donald Rubin and Judea Pearl). I believe a good understanding of these three models is a prerequisite to be able to discuss causal inference within quantitative social science.

Luckily, we have good introductions to the three frameworks that compare the main similarities and differences. The special issue introduced by Maxwell (2010) focuses on two of the frameworks, namely those related to Campbell and Rubin. What is great about the special issue is that it focuses on important differences between the two frameworks but also on how they are complementary. That being said, it does not pay a lot of attention to Pearl’s framework. Shadish (2010) and West and Thoemmes (2010) provide comparisons of the work by Campbell and Rubin on causal inference. Rubin (2010) and Imbens (2010) further provide some additional reflections on the causal models from their own perspectives.

The best primer to understand the three frameworks is the book chapter by Shadish and Sullivan (2012). They make it clear that all three models of causality acknowledge the importance of manipulable causes and bring an experimental terminology into observational research. In addition, they highlight the importance of assumptions (as causal inference without assumptions is impossible). Unfortunately, they do not summarise the key similarities and differences between the models in a table. For that reason, I decided to create the table below to provide a brief overview of the three models. Keep in mind that the table provides a simplified comparison, and there are important nuances that you will only fully understand by consulting the relevant literature.

| | Campbell | Rubin | Pearl |
| --- | --- | --- | --- |
| Core | Validity typology and the associated threats to validity | Precise conceptualization of causal inference | Directed acyclic graphs (DAGs) |
| Goal | Create a generalized causal theory | Define an effect clearly and precisely | State the conditions under which a given DAG can support a causal inference |
| Fields of development | Psychology | Statistics, program evaluation | Artificial intelligence, machine learning |
| Examples of main concepts | Internal validity, external validity, statistical conclusion validity, construct validity | Potential outcomes, causal effect, stable-unit-treatment-value assumption | Node, edge, collider, d-separation, back-door criterion, do(x) operator |
| Definition of effect | Difference between counterfactuals | Difference between potential outcomes | The space of probability distributions on Y using the do(x) operator |
| Causal generalisation | Meta-analysis, construct and external validity | Response surface analysis, mediational modeling | Specified within the DAG |
| Assumption for valid inference in observational research | All threats to validity ruled out | Strong ignorability | Correct DAG |
| Examples of application | Quasi-experiments | Missing data imputation, propensity scores | Mediational paths |
| Conceptual and philosophical scope | Wide-ranging | Narrow, formal statistical model | Narrow, formal statistical model |
| Emphasis | Descriptive causation | Descriptive causation | Explanatory causation |
| Preference for randomized experiments | Yes | Yes | No |
| Focus on effect or mechanism | Effect | Effect | Mechanism |
| Limitation | General lack of quantification, no formal statistical model (lacks analytic sophistication) | Limited focus on features of research designs with observational data | Vulnerability to misspecification |

The Campbell model focuses on validity, i.e., the quality of the conclusions you can make based on your research. The four types of validity to consider here are: 1) (statistical) conclusion validity, 2) internal validity, 3) construct validity, and 4) external validity. Most important for the causal model is internal validity, i.e., the extent to which the research design identifies a causal relationship. External validity refers to the extent to which we can generalise the causal relationship to other populations/contexts. I believe one of the key advantages here is the comprehensive list of potential threats to validity listed in this work. Some of these potential threats are more relevant for specific designs or results, and being familiar with these potential threats will make you a much more critical (and thereby better) researcher. The best comprehensive introduction to the Campbell model is Shadish et al. (2002).

The Rubin model focuses on potential outcomes and how units have potential outcomes in different conditions (most often with and without a binary treatment). For example, Y(1) is an array of potential outcomes under treatment 1 and Y(0) is an array of potential outcomes under treatment 0. This is especially useful when considering an experiment and how randomisation can realise one potential outcome for a unit that can, in combination with other units, be used to calculate the average treatment effect (as we cannot estimate individual-level causal effects). To solve the fundamental problem of causal inference (that we can only observe one unit in one world) we would need a time machine, and in the absence of such science fiction tools, we are left with the importance of the assignment mechanism for causal inference (to estimate effects such as ATE, LATE, PATE, ATT, ATC, and ITT). One of the key advantages of this model is to understand how potential outcomes are turned into one realised outcome and the assumptions we rely on. For example, the Stable Unit Treatment Value Assumption (SUTVA) implies that potential outcomes for one unit are unaffected by the treatment of another unit. This emphasises the importance of minimising the interference between units. The best comprehensive introduction to the Rubin model is Imbens and Rubin (2015).
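To make the potential outcomes logic concrete, here is a small simulation sketch in stdlib Python (the numbers and setup are made up for illustration and are not from Imbens and Rubin):

```python
import random

random.seed(42)

# A made-up simulation: each unit has two potential outcomes, Y(0) and Y(1),
# but we only ever observe one of them. Randomisation lets us estimate the
# average treatment effect even though individual effects are unobservable.
n = 10_000
y0 = [random.gauss(10, 2) for _ in range(n)]        # potential outcome without treatment
y1 = [y + 3 for y in y0]                            # potential outcome with treatment (true ATE = 3)

treated = [random.random() < 0.5 for _ in range(n)] # random assignment realises one outcome per unit
observed = [y1[i] if treated[i] else y0[i] for i in range(n)]

mean_treated = sum(observed[i] for i in range(n) if treated[i]) / sum(treated)
mean_control = sum(observed[i] for i in range(n) if not treated[i]) / (n - sum(treated))
ate_hat = mean_treated - mean_control               # difference in means, close to 3
```

The point of the sketch is that `y0` and `y1` exist for every unit, but `observed` contains only one of them, which is exactly why the difference in means across randomised groups is doing the work.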

The Pearl model provides causal identification through directed acyclic graphs (DAGs), i.e., how conditioning on a variable along a path can block that path, and which paths need to be blocked in order to make causal inferences. When working with this model of causality, you are often dealing with multiple paths and not a simple setup where you only have two groups, one outcome and a single treatment. DAGs can also be understood as non-parametric structural equation models, and are particularly useful when working with conditional probabilities and Bayes networks/graphical models.

One of the main advantages of the Pearl model is that it forces you to think much more carefully about your causal model, including what not to control for. For that reason, the model is much better geared to causal inference in complicated settings than, say, the Rubin model.
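A classic illustration of "what not to control for" is the collider. Here is a small stdlib-Python simulation (made up for this post, not taken from the cited literature): two independent variables both cause a third, and conditioning on that third variable manufactures an association out of nothing.

```python
import random

random.seed(1)

def corr(a, b):
    """Pearson correlation, stdlib only."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Made-up DAG: X -> C <- Y, where C is a collider and X, Y are independent.
n = 20_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
c = [xi + yi + random.gauss(0, 1) for xi, yi in zip(x, y)]

r_all = corr(x, y)                       # close to 0: X and Y share no cause
sel = [i for i in range(n) if c[i] > 1]  # "controlling for" C by selecting on it
r_sel = corr([x[i] for i in sel], [y[i] for i in sel])  # clearly negative: collider bias
```

The full sample shows (correctly) no association, while the subsample conditioned on the collider shows a spurious negative one. This is the kind of mistake a DAG makes visible before you run a single regression.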

However, there are also some noteworthy limitations. Interactions and effect heterogeneity are only implicit in the model, and it can be difficult to convey such ideas (whereas it is easier to consider conditional average treatment effects in the Rubin model). And while DAGs are helpful for understanding complex causal models, they are often less helpful when we have to consider the parametric assumptions we need to estimate causal effects in practice.

The best introduction to the Pearl model is, surprisingly, not the work by Pearl himself (although I did enjoy The Book of Why). As a political scientist (or a social scientist more generally), I find introductions such as Morgan and Winship (2014), Elwert (2013), Elwert and Winship (2014), Dablander (2020), and Rohrer (2018) much more accessible.

(For Danish readers, you can also check out my lecture slides from 2016 on the Rubin model, the Campbell model and the Pearl model. I also made a different version of the table presented above in Danish that you can find here.)

In political science, researchers have mostly relied on the work by Rubin and Campbell, and less so on the work by Pearl. However, we have recently seen some good work that relies on the insights provided by DAGs. Great examples include the work on racially biased policing in the U.S. (see Knox et al. 2020) and the work on estimating controlled direct effects (Acharya et al. 2016).

Imbens (2020) provides a good and critical discussion of DAGs in relation to the Rubin model (in favour of the potential outcomes over DAGs as the preferred model to causality within the social sciences). Matthay and Glymour (2020) show how the threats to internal, external, construct and statistical conclusion validity can be presented as DAGs. Lundberg et al. (2021) show how both potential outcomes and DAGs can be used to outline the identification assumptions linking a theoretical estimand to an empirical estimand. This is amazing work and everybody with an interest in strong causal inference connecting statistical evidence to theory should read it.

My opinionated take is that the three models work well together but not necessarily at the same time when thinking about theories, research designs and data. Specifically, I prefer Pearl → Rubin → Campbell. First, use Pearl to outline the causal model (with a particular focus on what not to include). Second, use Rubin to focus on the causal estimand of interest and to consider different estimators and assumptions (SITA/SUTVA). Third, use Campbell to discuss threats to validity, measurement error, etc.

In sum, all three models are good to be familiar with if you do quantitative (and even qualitative) social science.

## How (not) to study suicide terrorism

Today is the 20-year anniversary of 9/11. That made me look into one of the most salient methodological discussions within political science on how to study suicide terrorism.

Suicide terrorism is a difficult topic to study. Why? Because we cannot learn about the causes (or correlates) of suicide terrorism from only studying cases of terrorism. Pape (2003) studies 188 suicide attacks in the period 1980-2001. He concludes that there is a strategic logic to these attacks, namely that they pay off for the organisations and groups pursuing such attacks.

Ashworth et al. (2008) use simple statistics such as conditional probabilities to show that there are problems with the paper in question, namely that the original paper “samples on the dependent variable.” I especially liked this formulation in the conclusion: “It is important to note that our critique of Pape’s (2003) analysis does not make the well-known point that association does not imply causation. Rather, because Pape collects only instances of suicide terrorism, his data do not even let him calculate the needed associations.”
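The conditional probability point is easy to demonstrate with a toy example. The numbers below are entirely made up for illustration; they simply show why the denominators disappear when you keep only the cases where the outcome occurred:

```python
# Purely illustrative numbers: whether a group has some characteristic X,
# and whether it carried out suicide attacks (Y). With the full table we can
# compare P(Y | X) and P(Y | not X); with only the Y = 1 cases, the
# denominators are gone and no such comparison is possible.
full_table = {
    # (has_x, used_suicide_attacks): number of groups
    (True, True): 40,
    (True, False): 160,
    (False, True): 10,
    (False, False): 190,
}

def p_attack(has_x):
    yes = full_table[(has_x, True)]
    no = full_table[(has_x, False)]
    return yes / (yes + no)

p_attack(True)    # 0.20
p_attack(False)   # 0.05

# Sampling on the dependent variable keeps only the 50 attack cases.
# From those we can compute P(X | Y = 1) = 40/50, but not P(Y | X).
attacks_only = {k: v for k, v in full_table.items() if k[1]}
p_x_given_attack = attacks_only[(True, True)] / sum(attacks_only.values())
```

Seeing that 80% of attacks involve X tells you nothing about whether X makes attacks more likely; for that you need the groups that did not attack.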

Pape (2008) provides a reply to the critique raised by Ashworth and colleagues. He first brings a long excerpt from his book not taking the critique of Ashworth et al. into account. Then, he writes: “One might still wonder whether the article is flawed by sample bias because it considered systematically only actual instances of suicide terrorism. The answer is no, for two reasons. First, the article did not sample suicide terrorism, but collected the universe of suicide terrorist attacks worldwide from 1980 through 2001. […] There is no such thing as sample bias in collecting a universe. Second, although it is true that the universe systematically studied did not include suicide terrorist campaigns that did not happen, and that this limits the claims that my article could make, this does not mean that my analysis could not support any claims or that it could not support the claims I actually made.”

Importantly, even if you have the universe of suicide terrorist attacks, you should still treat it as a sample (especially if you want to make policy recommendations about future cases we have not seen yet). In other words, this is a weird way of defending a flawed analysis. In an unpublished rejoinder, Ashworth (2008) provides some additional arguments for why the response to the criticism is flawed. Also, Horowitz (2010) shows that when you increase the universe of cases, Pape’s findings do not hold.

The debate is more than ten years old but reminiscent of similar contemporary debates on data and causality. Accordingly, I find it to be a good read for people interested in research design, data and inference — and it’s a good case to discuss what can (not) be learned from ‘selecting on the dependent variable’. Last, and most importantly, if you want to understand this amazing tweet, it is good to be familiar with the debate.

## Happy Danes

I finally got to read The Little Book of Hygge: The Danish Way to Live Well, written by Meik Wiking. It’s actually a fine book and if you are moving to Denmark, I would definitely recommend picking it up (it’s an easy read). There are a lot of things that are good to know about Danish culture before you get first-hand experience with it.

There were a few statements in the book on how happy Danes are according to the European Social Survey. The most important one is the claim that “Danes are the happiest people in Europe according to the European Social Survey”. I have worked a lot with European Social Survey data over the years (including in a few academic publications) and it would be easy for me to look at the data and examine how much happier Danes are than other people in Europe (at least when looking at self-reported data).

The question we can use to measure how happy people are in the European Social Survey is ‘Taking all things together, how happy would you say you are?’, which is measured on an 11-point scale from 0 to 10 with 0 being ‘Extremely unhappy’ and 10 being ‘Extremely happy’. Using this data, I can confirm that Danes are indeed the happiest people in Europe. On average, respondents in Denmark report higher levels of happiness than respondents elsewhere.

For each country and round of the European Social Survey, I calculated a (weighted) average of how happy the people in the survey are within each country and year of the survey. This gives us 223 estimates (i.e., one for each sample). Notably, not all countries are included in all rounds of the European Social Survey. The figure below shows the distribution of estimates with the estimates from Denmark in blue (to show how they are at the top of the scale).
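For readers who want the mechanics, the per-sample estimate is just a weighted mean computed within each country-round group. A minimal sketch in Python (the rows, weights, and column layout here are made up; the real ESS data come with survey weights and the 0-10 happiness item described above):

```python
from collections import defaultdict

# Hypothetical micro-data: (country, survey_year, happiness score, weight).
rows = [
    ("DK", 2012, 9, 1.1),
    ("DK", 2012, 8, 0.9),
    ("SE", 2012, 7, 1.0),
    ("SE", 2012, 8, 1.0),
]

totals = defaultdict(lambda: [0.0, 0.0])   # (country, year) -> [sum of w*y, sum of w]
for country, year, happy, w in rows:
    totals[(country, year)][0] += w * happy
    totals[(country, year)][1] += w

# One weighted mean per country-round sample.
weighted_means = {k: wy / w for k, (wy, w) in totals.items()}
```

With the full data this produces one estimate per country-round, i.e., the 223 estimates plotted below.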

The one non-Danish estimate that beats most of the Danish estimates is Iceland in 2004. Of the 223 estimates, 7 out of the 8 estimates from Denmark are in the top 10. Five countries, namely Iceland, Denmark, Switzerland, Norway, and Finland, fill up the top 30 estimates.

In the figure below I take these five countries and Sweden to show how they rank over the years (with the country at the top being number one). Here you can see how Denmark has been number 1 in all years but 2004 (where Iceland had the greatest average level of happiness) and 2016 (where there is no data from Denmark).

We see a few things here in addition to the fact that Denmark is at the top. First, Swedes are consistently less happy, on average, than people in the other five countries. Second, there is quite a lot of variation in the ranking of Switzerland over time (from number 5 in 2012 to number 1 in 2016).

In sum, Danes are indeed – at least when asked in the European Social Survey – quite happy.

## How to improve your figures #8: Make probabilities tangible

Statistics is about learning from data in the context of uncertainty. Often we communicate uncertainty in the form of probabilities. How should we best communicate such probabilities in our figures? The key point in this post is that we should not simply present probabilities as bare numbers. Instead, we need to work hard on making our numbers tangible.

Why is it not sufficient to simply present estimates of probabilities? Because people easily interpret the same probability differently. When people hear that a candidate is 80% likely to win an election, some will see that as a much more likely outcome than others. In other words, there are uncertainties in how people perceive uncertainties. We have known for decades that people assign very different probabilities to different probability terms (see e.g. Wallsten et al. 1986; 1988), and the meaning of a term such as “nearly certain” will for one person be close to 100% and for another closer to 50%.

To make matters worse, risks and probabilities can be expressed in different ways. Consider the example of the study that showed how the show “13 Reasons Why” was associated with a 28.9% increase in suicide rates. The study sounded much more interesting because it reported the change in relative terms instead of as an increase from 0.35 in 100,000 to 0.45 in 100,000 (see also this article on why you should not believe the study in question). To illustrate such differences, @justsaysrisks reports the absolute and relative risk from different articles communicating research findings.

In David Spiegelhalter’s great book, The Art of Statistics: Learning From Data, he looks at how the risk of getting bowel cancer increases by 18% for a group of people who eat 50g of processed meat a day. In Table 1.2 in the book, Spiegelhalter shows how a difference between two groups in one percentage point can be turned into a relative risk of 18%:

| Method | Non-bacon eaters | Daily bacon eaters |
| --- | --- | --- |
| Event rate | 6% | 7% |
| Expected frequency | 6 out of 100 | 7 out of 100 |
| | 1 in 16 | 1 in 14 |
| Odds | 6/94 | 7/93 |

| Comparative measure | Value |
| --- | --- |
| Absolute risk difference | 1%, or 1 out of 100 |
| Relative risk | 1.18, or an 18% increase |
| ‘Number Needed to Treat’ | 100 |
| Odds ratio | (7/93) / (6/94) = 1.18 |

As you can see, event rates of 6% and 7% in the two groups with an absolute risk difference of 1% can be turned into a relative risk of 18% (with an odds ratio of 1.18). Spiegelhalter’s book provides other good examples and I can highly recommend it.
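The arithmetic behind the table is simple enough to put in a few lines of Python. Note one rounding wrinkle: from the rounded event rates of 6% and 7%, the risk ratio comes out at about 1.17; the 1.18 in the book presumably comes from the unrounded rates in the underlying study.

```python
# Event rates from Spiegelhalter's Table 1.2.
p_control, p_treat = 0.06, 0.07

ard = p_treat - p_control    # absolute risk difference: 1 out of 100
rr = p_treat / p_control     # relative risk: ~1.17 from the rounded rates
nnt = 1 / ard                # 'Number Needed to Treat': 100
odds_ratio = (p_treat / (1 - p_treat)) / (p_control / (1 - p_control))  # ~1.18
```

The same pair of event rates thus yields a "1 in 100" framing, an "18% increase" framing, and an odds ratio, which is exactly why the choice of framing matters so much.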

Accordingly, probabilities are tricky and we need to be careful in how we communicate them. We have seen a lot of discussions on how best to communicate electoral forecasts (if the probability that a candidate will win more than 50% of the votes is 85%, how confident will people be that the candidate will win?). One great suggestion offered by Spiegelhalter in his book is to not think about percentages per se, but rather make probabilities tangible by showing the outcomes for, say, 100 people (or 100 elections, if you are working on a forecast).

To do this, we use unit charts to show counts of a variable. Here, we can use a 10×10 grid where each cell represents one percentage point. A specific version of the unit chart is an isotype chart, where we use icons or images instead of simple shapes.

There is evidence that such visualisations work better than simply presenting the information numerically. Galesic et al. (2009) show, in the context of medical risks, how icon arrays increase the accuracy of the understanding of risks (see also Fagerlin et al. 2005 on how pictographs can reduce the undue influence of anecdotal reasoning).

When we hear that the probability that the Democrats will win an election is 75%, we think about the outcome of one election and how one outcome is much more likely than the other. However, when we use an isotype chart showing 100 outcomes, 75 of them won by the Democrats, we make the 25 out of 100 Republican outcomes more salient.

There are different R packages you can use to make such visualisations, e.g. waffle and ggwaffle. In the figure below, I used the waffle package to demonstrate how the Democrats got a probability of 75% of winning a (hypothetical) election.
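The underlying structure of such a chart is language-agnostic. As a minimal sketch (in Python rather than R, and with the 75% figure being the hypothetical probability from above), the whole chart is just a 10×10 grid where each cell is one percentage point:

```python
# Each cell in the 10x10 grid represents one percentage point;
# packages like waffle (R) essentially draw this grid with colours/icons.
def waffle_grid(pct, size=10):
    """Return a size x size grid of booleans, True for the first `pct` cells."""
    return [[(r * size + c) < pct for c in range(size)] for r in range(size)]

grid = waffle_grid(75)
n_dem = sum(cell for row in grid for cell in row)   # 75 cells for the Democrats
n_rep = 100 - n_dem                                  # the 25 salient counter-outcomes
```

Once the grid exists, the plotting package of your choice only has to colour the two groups of cells.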

There are many different ways to communicate probabilities. However, try to avoid simply presenting the numerical probabilities in your figures, or the odds ratios, and consider how you can make the probabilities more tangible and easier for the reader to process.

## The plural of anecdote

There are two very different quotes. The first is “the plural of anecdote is not data”. The second is “the plural of anecdote is data”. They are pretty much opposites. You can find more information on the two quotes here.

I often see the former quote being used to dismiss anecdotal evidence and ask for non-anecdotal data. We all agree that N>1 is better than N=1, all else equal, but I believe the latter quote is better, i.e. that the plural of anecdote is data. The focus on data as the aggregation of individual observations makes you think a lot more critically about what is part of the data – and what is not part of the data, what types of measurement error you are most likely dealing with, etc.

Similarly, it is easier to think about cherry picking and other selection biases when we consider anecdote the singular of data. A single data point is of little relevance in and by itself, but the mere aggregation of such data points is not sufficient – or even necessary – to say that we are no longer looking at anecdotes.

## Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found together with the main advice/rules offered in the respective papers (do consult the individual papers for examples and explanations).

Notably, there is overlap in the advice the papers give. I do not agree with everything (and as you can see, they are written for people working with tabular data and definitely not for people working with more modern workflows with several gigabytes of data stored in NoSQL databases). However, overall, I can highly recommend that you take most of these recommendations on board.

Last, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science. This paper deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

#### Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.
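Rules 15 and 16 are worth making concrete. A small sketch in Python with a made-up codebook (Nagler's own examples use statistical packages of the time, not Python):

```python
from collections import Counter

# Hypothetical codebook: 9 = "refused". Keep missing values missing when
# recoding (rule 15), then inspect frequencies of the new variable (rule 16).
MISSING = None
raw_income = [1, 2, 3, 9, 2, MISSING, 1]

def recode_high_income(x):
    if x is MISSING or x == 9:      # do not silently fold missings into a category
        return MISSING
    return 1 if x >= 3 else 0

high_income = [recode_high_income(x) for x in raw_income]
freqs = Counter(high_income)        # check that the recode did what we intended
```

The frequency table immediately reveals whether the "refused" code leaked into the new variable, which is exactly the kind of silent error rules 15-16 are guarding against.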

#### Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name.
3. Write dates as YYYY-MM-DD.
4. Fill in all cells and use some common code for missing data.
5. Put just one thing in a cell.
6. Don’t use more than one row for the variable names.
7. Create a data dictionary.
8. No calculations in the raw data files.
9. Don’t use font color or highlighting as data.
10. Make backups.
11. Use data validation to avoid errors.
12. Save the data in plain text files.

#### Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
3. Look for opportunities for code reuse
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
10. Go explore and be rigorous when you publish

#### Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project.
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Backup (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.

#### Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end.
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.

## Replace equations with code

Here is a suggestion: In empirical research, academics should move equations from the methods section to the appendix and, if anything, show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without having to read any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical towards how empirical research often includes equations for the sake of … including equations.

I have a theory that academics, and in particular political scientists, put more equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, equations are not needed and are often there only to impress reviewers and peers, who of course are the same people (hence, peer-review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code instead of equations. Take this example on the equation and code for a binomial regression model (estimated with glmer()) from Twitter:

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. Not only that, I also find it easier to think of potential alternatives to the regression model, e.g., that I can easily change the functional form and see how such changes will affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everybody who understands the equation above will also be able to understand the few lines of code. And when people use Stata, it is often even easier to read the code (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be much better off by showing and citing code instead of equations. Accordingly, please replace equations with code.