Potpourri: Statistics #78

Investigation of Data Irregularities in Doing Business 2018 and Doing Business 2020
Dyadic Clustering in International Relations
Forecasting: Principles and Practice
Data Disasters
A Quick How-to on Labelling Bar Graphs in ggplot2
Data visualisation using R, for researchers who don’t use R
Easy access to high-resolution daily climate data for Europe
Put R Models in Production
Machine learning, explained
Three ways to visualize binary survey data
In defense of simple charts
Modern Statistics with R
How to avoid machine learning pitfalls: a guide for academic researchers
Tune xgboost models with early stopping to predict shelter animal status
Machine-learning on dirty data in Python: a tutorial
I saw your RCT and I have some worries! FAQs
Up and running with officedown
Use racing methods to tune xgboost models and predict home runs
The 5-minute learn: Create pretty and geographically accurate transport maps in R
R’s Internal Data Formats: .Rda, .RData, .rds
Improve Your Code – Best Practices for Durable Code
An educator’s perspective of the tidyverse
Estimating regression coefficients using a Neural Network (from scratch)
Let users choose which plot you want to show
A look into ANOVA. The long way.
3 alternatives to a discrete color scale legend in ggplot2
Downloading the Census Household Pulse Survey in R
The Stata Guide
The Four Pipes of magrittr
Introducing {facetious} – alternate facets for ggplot2
Alternatives to Simple Color Legends in ggplot2
Top 3 Coding Best Practices from the Shiny Contest
Visualizing ordinal variables
Making Shiny apps mobile friendly
Climate circles
Elegant and informative maps with tmap
Exploring R² and regression variance with Euler/Venn diagrams
Exploring Pamela Jakiela’s simple TWFE diagnostics with R
The marginaleffects package for R
A lightweight data validation ecosystem with R, GitHub, and Slack
Create spatial square/hexagon grids and count points inside in R with sf
A daily updated JSON dataset of all the Open House London venues, events, and metadata
Animating Network Evolutions with gganimate
Beyond Bar and Box Plots
Causal Inference in R Workshop
Odds != Probability
How to visualize polls and results of the German election with Datawrapper
Irreproducibility in Machine Learning
A collection of themes for RStudio
Shiny, Tableau, and PowerBI: Better Business Intelligence
Automate PowerPoint Production Using R
Estimating graph dimension with cross-validated eigenvalues
Understanding text size and resolution in ggplot2
Introduction to linear mixed models

Previous posts: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77

Causality models: Campbell, Rubin and Pearl

In political science, the predominant way to discuss causality is in relation to experiments and counterfactuals (within the potential outcomes framework). However, we also use concepts such as internal and external validity and sometimes we use arrows to show how different concepts are connected. When I was introduced to causality, it was on a PowerPoint slide with the symbol X, a rightwards arrow, and the symbol Y, together with a few bullet points on the specific criteria that should be met before we can say that a relationship is causal (inspired by John Gerring’s criterial approach; see, e.g., Gerring 2005).

Importantly, there are multiple models we can consider when we want to discuss causality. In brief, there are three popular causality models today: 1) the Campbell model (focusing on threats to validity), 2) the Rubin model (focusing on potential outcomes), and 3) the Pearl model (focusing on directed acyclic graphs). The names of the models are based on the researchers who have been instrumental in their development (Donald Campbell, Donald Rubin and Judea Pearl). I believe a good understanding of these three models is a prerequisite for discussing causal inference within quantitative social science.

Luckily, we have good introductions to the three frameworks that compare the main similarities and differences. The special issue introduced by Maxwell (2010) focuses on two of the frameworks, namely the frameworks related to Campbell and Rubin. What is great about the special issue is that it focuses on important differences between the two frameworks but also on how the two frameworks are complementary. That being said, it does not pay a lot of attention to Pearl’s framework. Shadish (2010) and West and Thoemmes (2010) provide comparisons of the work by Campbell and Rubin on causal inference. Rubin (2010) and Imbens (2010) further provide some additional reflections on the causal models from their own perspectives.

The best primer to understand the three frameworks is the book chapter by Shadish and Sullivan (2012). They make it clear that all three models of causality acknowledge the importance of manipulable causes and bring an experimental terminology into observational research. In addition, they highlight the importance of assumptions (as causal inference without assumptions is impossible). Unfortunately, they do not summarise the key similarities and differences between the models in a table. For that reason, I decided to create the table below to provide a brief overview of the three models. Keep in mind that the table provides a simplified comparison and there are important nuances that you will only fully understand by consulting the relevant literature.

Campbell | Rubin | Pearl
Core: Validity typology and the associated threats to validity | Precise conceptualization of causal inference | Directed acyclic graphs (DAGs)
Goal: Create a generalized causal theory | Define an effect clearly and precisely | State the conditions under which a given DAG can support a causal inference
Fields of development: Psychology | Statistics, program evaluation | Artificial intelligence, machine learning
Examples of main concepts: Internal validity, external validity, statistical conclusion validity, construct validity | Potential outcomes, causal effect, stable-unit-treatment-value assumption | Node, edge, collider, d-separation, back-door criterion, do(x) operator
Definition of effect: Difference between counterfactuals | Difference between potential outcomes | The space of probability distributions on Y using the do(x) operator
Causal generalisation: Meta-analysis, construct and external validity | Response surface analysis, mediational modeling | Specified within the DAG
Assumption for valid inference in observational research: All threats to validity ruled out | Strong ignorability | Correct DAG
Examples of application: Quasi-experiments | Missing data imputation, propensity scores | Mediational paths
Conceptual and philosophical scope: Wide-ranging | Narrow, formal statistical model | Narrow, formal statistical model
Emphasis: Descriptive causation | Descriptive causation | Explanatory causation
Preference for randomized experiments: Yes | Yes | No
Focus on effect or mechanism: Effect | Effect | Mechanism
Limitation: General lack of quantification and no formal statistical model (lacks analytic sophistication) | Limited focus on features of research designs with observational data | Vulnerability to misspecification

The Campbell model focuses on validity, i.e., the quality of the conclusions you can make based on your research. The four types of validity to consider here are: 1) statistical conclusion validity, 2) internal validity, 3) construct validity, and 4) external validity. Most important for the causal model is internal validity, that is, the extent to which the research design identifies a causal relationship. External validity refers to the extent to which we can generalise the causal relationship to other populations/contexts. I believe one of the key advantages here is the comprehensive list of potential threats to validity listed in this work. Some of these potential threats are more relevant for specific designs or results, and being familiar with them will make you a much more critical (and thereby better) researcher. The best comprehensive introduction to the Campbell model is Shadish et al. (2002).

The Rubin model focuses on potential outcomes and how units have potential outcomes under different conditions (most often with and without a binary treatment). For example, Y(1) is an array of potential outcomes under treatment 1 and Y(0) is an array of potential outcomes under treatment 0. This is especially useful when considering an experiment, where randomisation realises one potential outcome for each unit, which can, in combination with the other units, be used to calculate the average treatment effect (as we cannot estimate individual-level causal effects). To solve the fundamental problem of causal inference (that we can only observe one unit in one world) we would need a time machine, and in the absence of such science fiction tools, we are left with the importance of the assignment mechanism for causal inference (to estimate effects such as ATE, LATE, PATE, ATT, ATC, and ITT). One of the key advantages of this model is that it clarifies how potential outcomes are turned into one realised outcome and which assumptions we rely on. For example, the Stable Unit Treatment Value Assumption (SUTVA) implies that potential outcomes for one unit are unaffected by the treatment of another unit. This emphasises the importance of minimising interference between units. The best comprehensive introduction to the Rubin model is Imbens and Rubin (2015).
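To make this logic concrete, here is a minimal simulation sketch (in Python, with made-up potential outcomes and an assumed constant individual effect of 1.5) showing how randomisation lets a simple difference in observed means recover the average treatment effect even though no individual effect is ever observed:

```python
import random

random.seed(1)

# Hypothetical potential outcomes for 10,000 units: every unit has a
# Y(0) and a Y(1), but only one of them is ever observed.
n = 10_000
y0 = [random.gauss(10, 2) for _ in range(n)]
y1 = [y + 1.5 for y in y0]  # assume a constant individual effect of 1.5

true_ate = sum(a - b for a, b in zip(y1, y0)) / n  # 1.5 by construction

# Randomisation realises exactly one potential outcome per unit.
treated = [random.random() < 0.5 for _ in range(n)]
observed = [a if t else b for a, b, t in zip(y1, y0, treated)]

# The difference in observed group means estimates the ATE.
n_t = sum(treated)
mean_t = sum(y for y, t in zip(observed, treated) if t) / n_t
mean_c = sum(y for y, t in zip(observed, treated) if not t) / (n - n_t)
est_ate = mean_t - mean_c  # close to the true ATE of 1.5
```

The same sketch also shows the fundamental problem of causal inference: the list `observed` is all we ever get, while `true_ate` is only computable here because the data are simulated.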

The Pearl model provides causal identification through directed acyclic graphs (DAGs), i.e., rules for how conditioning on a variable along a path blocks (or opens) that path, and for which paths need to be blocked in order to make causal inferences. When working with this model of causality, you often deal with multiple paths rather than a simple setup with only two groups, one outcome and a single treatment. DAGs can also be understood as non-parametric structural equation models, and they are particularly useful when working with conditional probabilities and Bayesian networks/graphical models.

One of the main advantages of the Pearl model is that it forces you to think much more carefully about your causal model, including what not to control for. For that reason, the model is much better geared to causal inference in complicated settings than, say, the Rubin model.
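The classic illustration of “what not to control for” is the collider. In the hypothetical DAG X → C ← Y, X and Y are independent, but conditioning on the collider C opens the path and induces a spurious association. A minimal Python sketch (all numbers made up):

```python
import random

random.seed(2)

# DAG: X -> C <- Y, with X and Y marginally independent.
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
c = [a + b for a, b in zip(x, y)]  # C is a collider

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Marginally, X and Y are (almost exactly) uncorrelated.
r_marginal = corr(x, y)

# 'Controlling for' the collider, here by selecting on C > 0,
# induces a strong spurious negative association between X and Y.
sel = [i for i in range(n) if c[i] > 0]
r_conditional = corr([x[i] for i in sel], [y[i] for i in sel])
```

Running this, `r_marginal` sits near zero while `r_conditional` is clearly negative, even though nothing causal connects X and Y.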

However, there are also some noteworthy limitations. Interactions and effect heterogeneity are only implicit in the model, and it can be difficult to convey such ideas (whereas it is easier to consider conditional average treatment effects in the Rubin model). And while DAGs are helpful for understanding complex causal models, they are often less helpful when we have to consider the parametric assumptions needed to estimate causal effects in practice.

The best introduction to the Pearl model is, surprisingly, not the work by Pearl himself (although I did enjoy The Book of Why). As a political scientist (or a social scientist more generally), I find introductions such as Morgan and Winship (2014), Elwert (2013), Elwert and Winship (2014), Dablander (2020), and Rohrer (2018) much more accessible.

(For Danish readers, you can also check out my lecture slides from 2016 on the Rubin model, the Campbell model and the Pearl model. I also made a different version of the table presented above in Danish that you can find here.)

In political science, researchers have mostly relied on the work by Rubin and Campbell, and less so on the work by Pearl. However, recently we have seen some good work that relies on the insights provided by DAGs. Great examples include the work on racially biased policing in the U.S. (see Knox et al. 2020) and the work on estimating controlled direct effects (Acharya et al. 2016).

Imbens (2020) provides a good and critical discussion of DAGs in relation to the Rubin model (in favour of the potential outcomes over DAGs as the preferred model to causality within the social sciences). Matthay and Glymour (2020) show how the threats to internal, external, construct and statistical conclusion validity can be presented as DAGs. Lundberg et al. (2021) show how both potential outcomes and DAGs can be used to outline the identification assumptions linking a theoretical estimand to an empirical estimand. This is amazing work and everybody with an interest in strong causal inference connecting statistical evidence to theory should read it.

My opinionated take is that the three models work well together, but not necessarily at the same time, when thinking about theories, research designs and data. Specifically, I prefer Pearl → Rubin → Campbell. First, use Pearl to outline the causal model (with a particular focus on what not to include). Second, use Rubin to focus on the causal estimand of interest and to consider different estimators and assumptions (SITA/SUTVA). Third, use Campbell to discuss threats to validity, measurement error, etc.

In sum, all three models are good to be familiar with if you do quantitative (and even qualitative) social science.

How (not) to study suicide terrorism

Today is the 20-year anniversary of 9/11. That made me look into one of the most salient methodological discussions in political science on how to study suicide terrorism.

Suicide terrorism is a difficult topic to study. Why? Because we cannot learn about the causes (or correlates) of suicide terrorism from only studying cases of terrorism. Pape (2003) studies 188 suicide attacks in the period 1980-2001. He concludes that there is a strategic logic to these attacks, namely that they pay off for the organisations and groups pursuing such attacks.

Ashworth et al. (2008) use simple statistics such as conditional probabilities to show that there are problems with the paper in question, namely that the original paper “samples on the dependent variable.” I especially liked this formulation in the conclusion: “It is important to note that our critique of Pape’s (2003) analysis does not make the well-known point that association does not imply causation. Rather, because Pape collects only instances of suicide terrorism, his data do not even let him calculate the needed associations.”
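The point can be made with a hypothetical toy example (the numbers below are made up and are not from Pape’s data): with only cases of attacks, you can compute what share of attacks involve some characteristic, but not whether that characteristic raises the probability of an attack.

```python
# Made-up counts for 200 hypothetical groups: 100 face some condition
# (say, foreign occupation) and 100 do not; attacks occur in 20 of each.
occupied_attack, occupied_no_attack = 20, 80
free_attack, free_no_attack = 20, 80

# What a dataset of attacks alone lets you compute: among attacks,
# the share involving occupation.
p_occupied_given_attack = occupied_attack / (occupied_attack + free_attack)  # 0.5

# What the causal claim needs: does the condition raise the rate of attacks?
p_attack_given_occupied = occupied_attack / (occupied_attack + occupied_no_attack)  # 0.2
p_attack_given_free = free_attack / (free_attack + free_no_attack)                  # 0.2

# The two rates are identical, yet half of all attacks involve occupation.
# Without the non-attack cases, the comparison cannot be made at all.
```

In this toy world the condition is completely unrelated to attacks, yet a dataset containing only attacks would make it look salient.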

Pape (2008) provides a reply to the critique raised by Ashworth and colleagues. He first brings a long excerpt from his book not taking the critique of Ashworth et al. into account. Then, he writes: “One might still wonder whether the article is flawed by sample bias because it considered systematically only actual instances of suicide terrorism. The answer is no, for two reasons. First, the article did not sample suicide terrorism, but collected the universe of suicide terrorist attacks worldwide from 1980 through 2001. […] There is no such thing as sample bias in collecting a universe. Second, although it is true that the universe systematically studied did not include suicide terrorist campaigns that did not happen, and that this limits the claims that my article could make, this does not mean that my analysis could not support any claims or that it could not support the claims I actually made.”

Importantly, even if you have the universe of suicide terrorist attacks, you should still treat it as a sample (especially if you want to make policy recommendations about future cases we have not yet seen). In other words, this is a weird way of defending a flawed analysis. In an unpublished rejoinder, Ashworth (2008) provides some additional arguments for why the response to the criticism is flawed. Also, Horowitz (2010) shows that Pape’s findings do not hold when you expand the universe of cases.

The debate is more than ten years old but reminiscent of similar contemporary debates on data and causality. Accordingly, I find it to be a good read for people interested in research design, data and inference — and it’s a good case to discuss what can (not) be learned from ‘selecting on the dependent variable’. Last, and most importantly, if you want to understand this amazing tweet, it is good to be familiar with the debate.

Happy Danes

I finally got to read The Little Book of Hygge: The Danish Way to Live Well, written by Meik Wiking. It’s actually a fine book, and if you are moving to Denmark, I would definitely recommend picking it up (it’s an easy read). There are a lot of things that are good to know about Danish culture before you get first-hand experience with it.

There were a few statements in the book on how happy Danes are according to the European Social Survey. The most important is the claim that “Danes are the happiest people in Europe according to the European Social Survey”. I have worked a lot with European Social Survey data over the years (including in a few academic publications), so it would be easy for me to look at the data and examine how much happier Danes are than other people in Europe (at least when looking at self-reported data).

The question we can use to measure how happy people are in the European Social Survey is ‘Taking all things together, how happy would you say you are?’, which is measured on an 11-point scale from 0 to 10 with 0 being ‘Extremely unhappy’ and 10 being ‘Extremely happy’. Using this data, I can confirm that Danes are indeed the happiest people in Europe. On average, respondents in Denmark are more likely to report that they are happy.

For each country and round of the European Social Survey, I calculated a (weighted) average of how happy the respondents in each country-round are. This gives us 223 estimates (i.e., one for each sample). Notably, not all countries are included in all rounds of the European Social Survey. The figure below shows the distribution of estimates, with the estimates from Denmark in blue (to show that they are at the top of the scale).
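As a minimal sketch of the calculation (with made-up respondents and weights, not the actual ESS data), the weighted average for a single country-round is just:

```python
# Hypothetical respondents in one country-round:
# (happiness score on the 0-10 scale, post-stratification weight).
respondents = [(8, 1.2), (9, 0.8), (7, 1.0), (10, 0.5), (6, 1.5)]

# Weighted mean: sum of score*weight divided by the sum of weights.
weighted_mean = sum(score * w for score, w in respondents) / sum(w for _, w in respondents)
```

Repeating this for every country-round in the survey yields the 223 estimates plotted in the figure.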

The only estimate that beats most of the Danish estimates is Iceland in 2004. Of the 223 estimates, 7 of the 8 estimates from Denmark are in the top 10. Five countries, namely Iceland, Denmark, Switzerland, Norway, and Finland, fill up the top 30 estimates.

In the figure below I take these five countries and Sweden to show how they rank over the years (with the country at the top being number one). Here you can see how Denmark has been number 1 in all years but 2004 (where Iceland had the greatest average level of happiness) and 2016 (where there is no data from Denmark).

We see a few things here in addition to the fact that Denmark is in top. First, Swedes are consistently less happy, on average, than people in the other five countries. Second, there is quite a lot of variation in the ranking of Switzerland over time (from number 5 in 2012 to number 1 in 2016).

In sum, Danes are indeed – at least when asked in the European Social Survey – quite happy.

Potpourri: Statistics #77 (Excel)

333 Excel Shortcuts for Windows and Mac
Excel VBA Introduction
101 Excel Functions you should know
How to Create a Dot Plot in Excel
How to Create a Fan Chart in Excel
How to Create a Non-Ribbon Sankey Diagram in Excel
How to Create a Horizontal Bar Graph With Endpoints In Excel
How to Create a Dumbbell Chart in Excel
How to Create a Lollipop Chart in Excel
How To Create a Waffle Fourfold Chart in Excel Using Conditional Formatting
How to Create a Bivariate Area Chart in Excel
How to Create a Range Bar Graph in Excel
How to Create a Fourfold Chart in Excel
How to Create a Bar Chart With Color Ranges in Excel
How to Create a Grid Map In Excel
How to Create a Unit Chart in Excel
How to Create a Scatterplot with Dynamic Reference Lines in Excel
How to Create a Barcode Plot in Excel
How to Create a Strip Plot in Excel
How to Create a Heatmap In Excel
How to Create a Grid Map With Circles In Excel
How to Create a Grid Map With Sparklines in Excel
How to Create a Density Scatterplot In Excel
How to Create a Bar Chart With Labels Above Bar in Excel
How to Create a Scatterplot Matrix In Excel
Tufte in Excel – The Bar Chart
Tufte in Excel – The Box Plot
Tufte in Excel – The Slopegraph
Tufte in Excel – The Dot-Dash-Plot
Tufte in Excel – Sparklines

Previous posts: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76

How to improve your figures #8: Make probabilities tangible

Statistics is about learning from data in the context of uncertainty. Often we communicate uncertainty in the form of probabilities. How should we best communicate such probabilities in our figures? The key point in this post is that we should not simply present probabilities as bare numbers. Instead, we need to work hard on making our numbers tangible.

Why is it not sufficient to simply present estimates of probabilities? Because people easily interpret the same probability differently. When people hear that a candidate is 80% likely to win an election, some will see that as a much more likely outcome than others. In other words, there are uncertainties in how people perceive uncertainties. We have known for decades that people assign very different probabilities to different probability terms (see, e.g., Wallsten et al. 1986; 1988): the meaning of a term such as “nearly certain” will be close to 100% for one person and closer to 50% for another.

To make matters worse, risks and probabilities can be expressed in different ways. Consider the study that showed how the show “13 Reasons Why” was associated with a 28.9% increase in suicide rates. The study sounded much more interesting because it reported the change in relative terms, instead of saying an increase from 0.35 to 0.45 per 100,000 (see also this article on why you should not believe the study in question). To illustrate such differences, @justsaysrisks reports the absolute and relative risks from different articles communicating research findings.

In his great book, The Art of Statistics: Learning From Data, David Spiegelhalter looks at how the risk of getting bowel cancer increases by 18% for a group of people who eat 50g of processed meat a day. In Table 1.2 in the book, Spiegelhalter shows how a one-percentage-point difference between two groups can be turned into a relative risk of 18%:

Method | Non-bacon eaters | Daily bacon eaters
Event rate | 6% | 7%
Expected frequency | 6 out of 100 (1 in 16) | 7 out of 100 (1 in 14)
Odds | 6/94 | 7/93

Comparative measures
Absolute risk difference | 1%, or 1 out of 100
Relative risk | 1.18, or an 18% increase
‘Number Needed to Treat’ | 100
Odds ratio | (7/93) / (6/94) = 1.18

As you can see, event rates of 6% and 7% in the two groups with an absolute risk difference of 1% can be turned into a relative risk of 18% (with an odds ratio of 1.18). Spiegelhalter’s book provides other good examples and I can highly recommend it.
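All of these comparative measures are simple transformations of the two event rates, which a few lines of Python make explicit (note that with the rounded rates of 6% and 7%, the relative risk comes out at about 1.17 rather than exactly 1.18):

```python
# Recomputing Spiegelhalter-style comparative measures from two event rates.
p_control, p_treated = 0.06, 0.07  # non-bacon vs daily bacon eaters

abs_risk_diff = p_treated - p_control  # 0.01, i.e. 1 out of 100
nnt = 1 / abs_risk_diff                # 'Number Needed to Treat': 100
relative_risk = p_treated / p_control  # ~1.17 with these rounded rates
odds_ratio = (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))  # ~1.18
```

The exercise makes the rhetorical point tangible: the same pair of rates can be reported as a tiny absolute difference or as an 18%-sounding relative increase.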

Accordingly, probabilities are tricky and we need to be careful in how we communicate them. We have seen a lot of discussions on how best to communicate electoral forecasts (if the probability that a candidate will win more than 50% of the votes is 85%, how confident will people be that the candidate will win?). One great suggestion offered by Spiegelhalter in his book is to not think about percentages per se, but rather make probabilities tangible by showing the outcomes for, say, 100 people (or 100 elections, if you are working on a forecast).

To do this, we use unit charts to show counts of a variable. Here, we can use a 10×10 grid where each cell represents one percentage point. A specific version of the unit chart is an isotype chart, where we use icons or images instead of simple shapes.

There is evidence that such visualisations work better than simply presenting the information numerically. Galesic et al. (2009) show, in the context of medical risks, how icon arrays increase the accuracy of the understanding of risks (see also Fagerlin et al. 2005 on how pictographs can reduce the undue influence of anecdotal reasoning).

When we hear that the probability that the Democrats will win an election is 75%, we think about the outcome of one election and how a Democratic win is much more likely to happen. However, when we use an isotype chart showing 100 outcomes, 75 of them won by the Democrats, we make the 25 out of 100 Republican outcomes more salient.

There are different R packages you can use to make such visualisations, e.g. waffle and ggwaffle. In the figure below, I used the waffle package to demonstrate how the Democrats have a probability of 75% of winning a (hypothetical) election.
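If you just want to prototype the idea without a plotting library, the underlying 10×10 grid can be sketched in a few lines (here in Python; the 75% figure is the hypothetical probability from the example above):

```python
# A text-based 10x10 waffle chart: each cell is one percentage point.
# 75 Democratic outcomes (D) vs 25 Republican outcomes (R).
p_win = 75

cells = ["D"] * p_win + ["R"] * (100 - p_win)
rows = ["".join(cells[i:i + 10]) for i in range(0, 100, 10)]
print("\n".join(rows))
```

Each row of the printed grid is ten cells; the losing outcomes take up a visible quarter of the grid, which is exactly what makes the chart work.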

There are many different ways to communicate probabilities. However, try to avoid simply presenting the numerical probabilities in your figures, or the odds ratios, and consider how you can make the probabilities more tangible and easier for the reader to process.

The plural of anecdote

There are two very different quotes. The first is “the plural of anecdote is not data”. The second is “the plural of anecdote is data”. They are pretty much opposites. You can find more information on the two quotes here.

I often see the former quote being used to dismiss anecdotal evidence and ask for non-anecdotal data. We can all agree that N>1 is better than N=1, all else equal, but I believe the latter quote is better, i.e. that the plural of anecdote is data. Seeing data as the aggregation of individual observations makes you think a lot more critically about what is part of the data (and what is not), what types of measurement error you are most likely dealing with, etc.

Similarly, it is easier to think about cherry-picking and other selection biases when we consider the anecdote the singular of data. A single data point is of little relevance in and of itself, but the mere aggregation of such data points is not sufficient – or even necessary – to say that we are no longer looking at anecdotes.

Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found, together with the main advice/rules offered in the respective papers (do consult the individual papers for examples and explanations).

Notably, there is an overlap in the advice the papers give. I do not agree with everything (and, as you can see, they are written for people working with tabular data and definitely not for people working with more modern workflows with several gigabytes of data stored in NoSQL databases). However, overall, I can highly recommend that you take most of these recommendations on board.

Last, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science. This paper deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.
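As a minimal illustration of rules 15 and 16 (handle missing data explicitly when recoding, then inspect frequencies of the new variable), here is a short Python sketch with made-up data:

```python
from collections import Counter

# Made-up party variable with explicit missing values.
party = ["dem", "rep", "dem", None, "ind", "rep", None]

# Recode to a binary 'is_dem' indicator, keeping missing values missing
# rather than silently folding them into 0.
is_dem = [None if p is None else int(p == "dem") for p in party]

# Rule 16: produce frequencies of the new variable and check them
# against expectations before using it in any analysis.
freq = Counter(is_dem)
print(freq)
```

The frequency table immediately shows whether the missing values survived the recode, which is exactly the check Nagler recommends.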

Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name.
3. Write dates as YYYY-MM-DD
4. Fill in all cells and use some common code for missing data.
5. Put just one thing in a cell
6. Don’t use more than one row for the variable names
7. Create a data dictionary
8. No calculations in the raw data files
9. Don’t use font color or highlighting as data
10. Make backups
11. Use data validation to avoid errors
12. Save the data in plain text files

Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
2. Start with prototypes and expand them in short development cycles
3. Look for opportunities for code reuse
4. Modularize your code
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
9. Grow your libraries and tools organically from your research
10. Go explore and be rigorous when you publish

Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the license explicit.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project.
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Backup (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.
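
The project-organization advice (point 4) can be set up in a few lines of R. The project name below is hypothetical; the subdirectory names follow Wilson et al.'s recommendations:

```r
# A minimal sketch of the project layout Wilson et al. recommend:
# doc/ for text documents, data/ for raw data and metadata,
# results/ for generated files, src/ for source code, bin/ for
# external scripts or compiled programs.
dirs <- file.path("myproject", c("doc", "data", "results", "src", "bin"))
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))

# Point 5 also suggests a CHANGELOG.txt in the docs subfolder.
invisible(file.create("myproject/doc/CHANGELOG.txt"))
```
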

Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end.
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– Manage tasks with a task management system.
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Make your functions shy.
– Order your functions for linear reading.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.
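
The automation rule (point 1) amounts to one master script that runs the whole pipeline from beginning to end. In this sketch the step scripts are generated on the fly so the example is self-contained; in a real project they would already live in src/, and their names here are illustrative, not from the original text:

```r
# Hypothetical pipeline steps, written out only so this sketch runs as-is.
dir.create("src", showWarnings = FALSE)
writeLines('message("cleaning data")',     "src/01_clean_data.R")
writeLines('message("estimating models")', "src/02_estimate_models.R")

# The master script: source every step in order. source() stops at the
# first error, so a broken step never goes unnoticed.
for (s in sort(list.files("src", full.names = TRUE))) {
  source(s)
}
```

Numbering the step files keeps the execution order explicit and lets anyone rerun the full analysis with one command.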

Potpourri: Statistics #76

Introduction to Deep Learning — 170 Video Lectures from Adaptive Linear Neurons to Zero-shot Classification with Transformers
The Identification Zoo: Meanings of Identification in Econometrics
Why you sometimes need to break the rules in data viz
A Concrete Introduction to Probability (using Python)
R packages that make ggplot2 more beautiful (Vol. I)
R packages that make ggplot2 more powerful (Vol. II)
Estimating multilevel models for change in R
Static and dynamic network visualization with R
Open Source RStudio/Shiny on AWS Fargate
Functional PCA with R
When Graphs Are a Matter of Life and Death
Python Projects with Source Code
The Stata workflow guide
15 Tips to Customize lines in ggplot2 with element_line()
7 Tips to customize rectangle elements in ggplot2 element_rect()
8 tips to use element_blank() in ggplot2 theme
Introduction to Machine Learning Interviews Book
Introduction to Python for Social Science
21 Must-Read Data Visualization Books, According to Experts
Introduction to Modern Statistics
The Difference Between Random Factors and Random Effects
Creating a figure of map layers in R
Reasons to Use Tidymodels
Professional, Polished, Presentable: Making great slides with xaringan
Polished summary tables in R with gtsummary
Top 10 Ideas in Statistics That Have Powered the AI Revolution
The New Native Pipe Operator in R
RMarkdown Tips and Tricks
Iterative visualizations with ggplot2: no more copy-pasting
Scaling Models in Political Science
Setting up and debugging custom fonts
A Pirate’s Favorite Programming Language
Tired: PCA + kmeans, Wired: UMAP + GMM
Three simple ideas for better election poll graphics
Exploratory Functional PCA with Sparse Data
Efficient simulations in R
The Beginner’s Guide to the Modern Data Stack
How to become a better R code detective?
A short introduction to grammar of graphics (via ggplot2)
Workflows for querying databases via R
A Handbook for Teaching and Learning with R and RStudio
Writing reproducible manuscripts in R
ggpairs in R – A Brief Introduction to ggpairs

Previous posts: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75

Replace equations with code

Here is a suggestion: in empirical research, academics should move equations from the methods section to the appendix and instead show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without reading any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical of how empirical research often includes equations for the sake of … including equations.

I have a theory that academics, and political scientists in particular, put equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, the equations are not needed and are there only to impress reviewers and peers, who are of course the same people (hence, peer review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code than equations. Take this example from Twitter of the equation and code for a binomial regression model (estimated with glmer()):
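
The tweet itself is not reproduced here, but the comparison looks roughly like the sketch below. The data, variable names, and model specification are hypothetical stand-ins, not those from the tweet:

```r
# Equation: Pr(y_ij = 1) = logit^-1(b0 + b1 * x_ij + u_j),
# with random intercepts u_j for each country j.
# The same model as a few lines of lme4 code:
library(lme4)

# Simulated data so the sketch runs on its own.
set.seed(42)
d <- data.frame(
  y       = rbinom(500, 1, 0.5),
  x       = rnorm(500),
  country = factor(sample(letters[1:10], 500, replace = TRUE))
)

# Mixed-effects logistic regression with a random intercept by country.
m <- glmer(y ~ x + (1 | country), data = d, family = binomial)
summary(m)
```

Everything the equation says (the link function, the covariate, the grouping structure) is visible in the model formula, and the formula is directly editable.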

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. I also find it easier to think of potential alternatives to the regression model; for example, I can easily change the functional form and see how such changes affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everyone who understands the equation will also be able to understand the few lines of code. And when people use Stata, the code is often even easier to read (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be better off showing and citing code instead of equations. Accordingly, please replace equations with code.