## How to improve your figures #8: Make probabilities tangible

Statistics is about learning from data in the context of uncertainty. Often we communicate uncertainty in the form of probabilities. How should we best communicate such probabilities in our figures? The key point in this post is that we should not simply present probabilities as bare numbers. Instead, we need to work hard on making our numbers tangible.

Why is it not sufficient simply to present probability estimates? Probabilities are difficult because people easily interpret the same probability differently. When people hear that a candidate is 80% likely to win an election, some will see that as a much more likely outcome than others. In other words, there are uncertainties in how people perceive uncertainties. We have known for decades that people assign very different probabilities to different probability terms (see e.g. Wallsten et al. 1986; 1988); the meaning of a term such as “nearly certain” will be close to 100% for one person and closer to 50% for another.

To make matters worse, risks and probabilities can be expressed in different ways. Consider the example of the study that showed how the show “13 Reasons Why” was associated with a 28.9% increase in suicide rates. The study sounded much more interesting because the rate change was reported in relative terms instead of as an increase from 0.35 in 100,000 to 0.45 in 100,000 (see also this article on why you should not believe the study in question). To illustrate such differences, @justsaysrisks reports the absolute and relative risks from different articles communicating research findings.

In David Spiegelhalter’s great book, The Art of Statistics: Learning From Data, he looks at how the risk of getting bowel cancer increases by 18% for people who eat 50g of processed meat a day. In Table 1.2 in the book, Spiegelhalter shows how a one-percentage-point difference between two groups can be turned into a relative risk of 18%:

| Method | Non-bacon eaters | Daily bacon eaters |
| --- | --- | --- |
| Event rate | 6% | 7% |
| Expected frequency | 6 out of 100 | 7 out of 100 |
| 1 in X | 1 in 16 | 1 in 14 |
| Odds | 6/94 | 7/93 |

| Comparative measures | |
| --- | --- |
| Absolute risk difference | 1%, or 1 out of 100 |
| Relative risk | 1.18, or an 18% increase |
| ‘Number Needed to Treat’ | 100 |
| Odds ratio | (7/93) / (6/94) = 1.18 |

As you can see, event rates of 6% and 7% in the two groups with an absolute risk difference of 1% can be turned into a relative risk of 18% (with an odds ratio of 1.18). Spiegelhalter’s book provides other good examples and I can highly recommend it.
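The comparative measures in the table are easy to reproduce yourself. Here is a quick sketch in Python (the variable names are my own; the numbers are those from the table above):

```python
# Event rates from Spiegelhalter's bacon example (Table 1.2).
non_bacon_rate = 6 / 100   # 6% of non-bacon eaters get bowel cancer
bacon_rate = 7 / 100       # 7% of daily bacon eaters get bowel cancer

# Absolute risk difference: one extra case per 100 people.
absolute_risk_difference = bacon_rate - non_bacon_rate

# Odds ratio: (7/93) / (6/94), which rounds to 1.18.
odds_ratio = (7 / 93) / (6 / 94)

# 'Number Needed to Treat': how many people would have to change
# behaviour to expect one fewer case.
nnt = 1 / absolute_risk_difference

print(round(absolute_risk_difference, 2))  # 0.01
print(round(odds_ratio, 2))                # 1.18
print(round(nnt))                          # 100
```

Seeing the same comparison expressed four ways makes it obvious how a one-point difference can be sold as an “18% increase”.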

Accordingly, probabilities are tricky and we need to be careful in how we communicate them. We have seen a lot of discussions on how best to communicate electoral forecasts (if the probability that a candidate will win more than 50% of the votes is 85%, how confident will people be that the candidate will win?). One great suggestion offered by Spiegelhalter in his book is to not think about percentages per se, but rather make probabilities tangible by showing the outcomes for, say, 100 people (or 100 elections, if you are working on a forecast).

To do this, we use unit charts to show counts of a variable. Here, we can use a 10×10 grid where each cell represents one percentage point. A specific version of the unit chart is an isotype chart, where we use icons or images instead of simple shapes.

There is evidence that such visualisations work better than simply presenting the information numerically. Galesic et al. (2009) show, in the context of medical risks, how icon arrays increase the accuracy of the understanding of risks (see also Fagerlin et al. 2005 on how pictographs can reduce the undue influence of anecdotal reasoning).

When we hear that the probability that the Democrats will win an election is 75%, we think about the outcome of a single election and how one outcome is much more likely than the other. However, when we use an isotype chart that shows 100 outcomes, 75 of them won by the Democrats, we make the 25 out of 100 Republican outcomes more salient.

There are different R packages you can use to make such visualisations, e.g. waffle and ggwaffle. In the figure below, I used the waffle package to visualise a hypothetical election in which the Democrats have a 75% probability of winning.
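As a quick illustration of the logic (one cell per percentage point in a 10×10 grid), here is a minimal sketch in Python. The figure itself was made with waffle in R; the helper below is just my own toy version:

```python
def waffle_grid(p_democrat, rows=10, cols=10):
    """Build a rows x cols unit-chart grid with one cell per
    percentage point: 'D' for a Democratic win, 'R' for a
    Republican win."""
    cells = ["D"] * p_democrat + ["R"] * (rows * cols - p_democrat)
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]

# A 75% probability of a Democratic win becomes 75 'D' cells
# and 25 visible 'R' cells out of 100.
grid = waffle_grid(75)
flat = [cell for row in grid for cell in row]
print(flat.count("D"), flat.count("R"))  # 75 25
```

Each of those 25 `R` cells is an outcome the reader can point at, which is exactly what makes the probability tangible.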

There are many different ways to communicate probabilities. However, try to avoid simply presenting the numerical probabilities in your figures, or the odds ratios, and consider how you can make the probabilities more tangible and easier for the reader to process.

## The plural of anecdote

There are two very different quotes. The first is “the plural of anecdote is not data”. The second is “the plural of anecdote is data”. They are pretty much opposites. You can find more information on the two quotes here.

I often see the former quote being used to dismiss anecdotal evidence and ask for non-anecdotal data. We can all agree that N>1 is better than N=1, all else equal, but I believe the latter quote is better, i.e. that the plural of anecdote is data. Seeing data as the aggregation of individual observations makes you think a lot more critically about what is part of the data and what is not, what types of measurement error you are most likely dealing with, and so on.

Similarly, it is easier to think about cherry picking and other selection biases when we consider anecdote the singular of data. A single data point is of little relevance in and by itself, but the mere aggregation of such data points is not sufficient – or even necessary – to say that we are no longer looking at anecdotes.

## Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found, together with the main advice/rules offered in the respective papers (do consult the individual papers for examples and explanations).

Notably, there is overlap in the advice the papers give. I do not agree with everything (and as you can see, they are written for people working with tabular data, not for people working with more modern workflows with several gigabytes of data stored in NoSQL databases). However, overall, I can highly recommend that you take most of these recommendations on board.

Last, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science. This paper deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

#### Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.

#### Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name.
3. Write dates as YYYY-MM-DD
4. Fill in all cells and use some common code for missing data.
5. Put just one thing in a cell
6. Don’t use more than one row for the variable names
7. Create a data dictionary
8. No calculations in the raw data files
9. Don’t use font color or highlighting as data
10. Make backups
11. Use data validation to avoid errors
12. Save the data in plain text files
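Rule 3 (write dates as YYYY-MM-DD, i.e. the ISO 8601 format) is also easy to enforce programmatically. A small sketch in Python (the helper is my own, not from the paper):

```python
from datetime import datetime

def is_iso_date(value):
    """True if value parses as a YYYY-MM-DD date, per Broman and
    Woo's rule 3; anything ambiguous or invalid is rejected."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(is_iso_date("2018-04-24"))  # True
print(is_iso_date("24/04/2018"))  # False: ambiguous day/month order
print(is_iso_date("2018-13-01"))  # False: there is no month 13
```

Running a check like this over a date column catches the Excel-style formats before they contaminate an analysis.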

#### Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
3. Look for opportunities for code reuse
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
10. Go explore and be rigorous when you publish

#### Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project.
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Backup (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.
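As a small illustration of the project-organization advice (point 4 above), here is a minimal Python sketch that creates Wilson et al.'s suggested directory layout; the project name is my own invention:

```python
from pathlib import Path
import tempfile

def create_project(root, name):
    """Create Wilson et al.'s layout: doc/, data/, results/, src/
    and bin/ inside a directory named after the project."""
    project = Path(root) / name
    for sub in ["doc", "data", "results", "src", "bin"]:
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project

# Build the skeleton in a temporary directory for demonstration.
project = create_project(tempfile.mkdtemp(), "bacon-risk-analysis")
print(sorted(p.name for p in project.iterdir()))
# ['bin', 'data', 'doc', 'results', 'src']
```

Starting every project from the same skeleton makes the later rules (separating raw data from results, keeping source in one place) almost automatic.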

#### Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end.
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.

## Replace equations with code

Here is a suggestion: In empirical research, academics should move equations from the methods section to the appendix and, if anything, show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without having to read any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical of how empirical research often includes equations for the sake of … including equations.

I have a theory that academics, and in particular political scientists, put more equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, equations are not needed and are often there only to impress reviewers and peers, who are of course the same people (hence, peer review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code than equations. Take, for example, a comparison from Twitter of the equation and the code for a binomial regression model (estimated with glmer()).
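To illustrate the general point with a toy example (in Python rather than R, and far simpler than a mixed model): the same logistic model can be written once as an equation and once as near-verbatim code.

```python
import math

# The equation: p = 1 / (1 + exp(-(b0 + b1 * x)))
# The code below is essentially a transcription of it:
def predicted_probability(x, b0, b1):
    """Predicted probability from a simple logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# With b0 = 0 and b1 = 1, a covariate value of 0 gives p = 0.5,
# and larger x pushes p towards 1.
print(predicted_probability(0, 0, 1))  # 0.5
```

Reading the function, it is also immediately obvious what to edit to try an alternative functional form, which is exactly the kind of thinking equations rarely invite.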

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. Not only that, I also find it easier to think of potential alternatives to the regression model, e.g., that I can easily change the functional form and see how such changes will affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everybody who understands the equation above will also be able to understand the few lines of code. And when people use Stata, it is often even easier to read the code (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be much better off by showing and citing code instead of equations. Accordingly, please replace equations with code.

## New article in Personality and Individual Differences: Personality in a pandemic

In the July issue of Personality and Individual Differences, you will find an article I have co-authored with Steven G. Ludeke, Joseph A. Vitriol and Miriam Gensowski. In the paper, titled Personality in a pandemic: Social norms moderate associations between personality and social distancing behaviors, we demonstrate when Big Five personality traits are more likely to predict social distancing behaviors.

Here is the abstract:

To limit the transmission of the coronavirus disease 2019 (COVID-19), it is important to understand the sources of social behavior for members of the general public. However, there is limited research on how basic psychological dispositions interact with social contexts to shape behaviors that help mitigate contagion risk, such as social distancing. Using a sample of 89,305 individuals from 39 countries, we show that Big Five personality traits and the social context jointly shape citizens’ social distancing during the pandemic. Specifically, we observed that the association between personality traits and social distancing behaviors were attenuated as the perceived societal consensus for social distancing increased. This held even after controlling for objective features of the environment such as the level of government restrictions in place, demonstrating the importance of subjective perceptions of local norms.

You can find the article here. The replication material is available on Harvard Dataverse and GitHub.

## How to improve your figures #7: Don’t use a third dimension

Most static figures show information in two dimensions (a horizontal dimension and a vertical dimension). This works really well on screen as well as on paper. However, once in a while you also see figures presented in three dimensions (3D). A third dimension is not necessarily a problem if you are actually using it to visualise additional data in a meaningful manner. If you have three continuous variables constituting a space, it can make sense to show how observations are positioned in this space. Take for example this figure generated from the interflex package in R:

However, unless you have a clear plan for how, and specifically why, you want to present your data in three dimensions, my recommendation is to stick to two dimensions. I will, as always, present a few examples from the peer-reviewed scientific literature. These are not necessarily bad figures, just figures where I believe not having the third dimension would significantly improve the data visualisation.

First, take a look at a bar chart from this article on the willingness to trade off civil liberties for security before and after the 7/7 London terrorist attack:

The third dimension here is not adding anything of value to the presentation of the data. On the contrary, it is difficult to see whether item 3 is greater than or equal to 3 on the average willingness to trade-off. As Schwabish (2014) describes in relation to the use of 3D charts: “the third dimension does not plot data values, but it does add clutter to the chart and, worse, it can distort the information.” This is definitely the case here.

Second, let us look at a figure from this article showing correlations between voting preferences and ten distinct values:

Again, the third dimension is not adding anything of value to the visualisation. It is only making it more difficult to compare the values. The good thing about the figure is that it shows the exact values. However, it would be much better to present the information in a simple bar chart like this one:

In the figure above, I show the values without the use of any third dimension. Notice how it is easier to identify the specific values (e.g. “Uni”) in the bar chart compared to the 3D chart, when we do not need to move our eyes in a three-dimensional space.

Third, consider a figure showing the number of articles referring to domestic nuclear energy in four different countries from March 12, 2011 to April 10, 2011. It is very difficult to compare the numbers across the countries, and if there are more articles in the UK and/or France than in Switzerland, we will not be able to see how many articles there are in the latter country. This is not a good visualisation. Instead, a simple line graph in two dimensions would have been better.

Again, there can be good reasons to visualise information in three dimensions, but unless you have very good reasons to do so, my recommendation is to keep it simple in 2D. In other words, in most cases, 2D > 3D.

## How to improve your figures #6: Don’t use bar graphs to mislead

In a previous post, I argued that the y-axis can be misleading under certain conditions. One of these conditions is when using a bar graph with a non-zero starting point. In this post I will show that bar graphs can be misleading even when the y-axis is not misleading.

In brief, bar graphs convey not only certain estimates or data summaries but also an idea about how the data is distributed. The point here is that our perception of data is shaped by the bar graph, and in particular that we are inclined to believe that the data is placed within the bar. For that reason, it is often better to replace the bar graph with an alternative such as a box plot. Here is a visual summary of one of the key points:

There is a name for the bias: the within-the-bar bias. Newman and Scholl (2012) showed that this bias is present: “(a) for graphs with and without error bars, (b) for bars that originated from both lower and upper axes, (c) for test points with equally extreme numeric labels, (d) both from memory (when the bar was no longer visible) and in online perception (while the bar was visible during the judgment), (e) both within and between subjects, and (f) in populations including college students, adults from the broader community, and online samples.” In other words, the bias is the norm rather than the exception in how we process bar charts.

Godau et al. (2016) found that people are more likely to underestimate the mean when data is presented in bar graphs. Interestingly, they did not find any evidence that the height of the bars affected the underestimation. There is even some disagreement about whether bar charts should include zero (e.g., Witt 2019). Most recently, however, Yang et al. (2021) have demonstrated how truncating a bar graph persistently (even when presented with an explicit warning) misleads readers.

This is an important issue to focus on. Weissgerber et al. (2015) looked at papers in top physiology journals and found that 85.6% of the papers used at least one bar graph. I have no reason to believe that these numbers should differ significantly from other fields using quantitative data. For that reason, we need to focus on the limitations of bar graphs and potential improvements.

A limitation with the bar graph is that different distributions of the data can give you the same bar graph. Consider this illustration from Weissgerber et al. (2015) on how different distributions of the data (with different issues such as outliers and unequal n) can give you the same bar graph:
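The claim is easy to verify with a quick sketch (the three samples below are hypothetical, my own):

```python
from statistics import mean

# Three samples with very different shapes...
symmetric = [3, 4, 5, 6, 7]
with_outlier = [4, 4, 4, 4, 9]   # one extreme observation
bimodal = [1, 1, 5, 9, 9]        # two clusters, nothing near the mean

# ...but the identical mean, so a bar graph of the means would
# draw exactly the same three bars.
print([mean(s) for s in (symmetric, with_outlier, bimodal)])  # [5, 5, 5]
```

A reader of the bar graph alone would have no way of telling these three datasets apart.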

Accordingly, bar graphs will often not provide sufficient information on what the data actually looks like and can even give you a biased perception of what the data looks like (partially explained by the within-the-bar bias). The solution is to show more of the data in your visualisations.

Ho et al. (2019) provide one illustrative example on how to do this when you want to examine the difference between two groups. Here is their evolution of two-group data graphics (from panel a, the traditional bar graph, to panel e, an estimation graphic showing the mean difference with 95% confidence intervals as well):

From panel a to panel b, you can see how we address some of the within-the-bar bias, and panel c further shows how the data points are actually distributed. This is just one example of how we can improve the bar graph to show more of the data; often the right choice of visualisation will depend on the message you need to convey and how much data you have to show.

That being said, there are some general recommendations that will make it more likely that you create a good visualisation. Specifically, Weissgerber et al. (2019) provide seven recommendations, four of which I find relevant in this context (read the paper for the full list as well as the rationale for each):

1. Replace bar graphs with figures that show the data distribution
2. Consider adding dots to box plots
3. Use symmetric jittering in dot plots to make all data points visible
4. Use semi-transparency or show gradients to make overlapping points visible in scatter plots and flow-cytometry figures
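Recommendation 3 deserves a short note: random jitter can pile tied points up asymmetrically, whereas symmetric jitter spreads them evenly around the category's position. A minimal sketch of symmetric offsets in Python (the helper is my own, not from the paper):

```python
def symmetric_offsets(n, spacing=0.05):
    """Return n x-axis offsets centred on 0, so that n tied data
    points fan out evenly on both sides of their category."""
    return [(i - (n - 1) / 2) * spacing for i in range(n)]

print(symmetric_offsets(3))  # [-0.05, 0.0, 0.05]
print(symmetric_offsets(4))  # four offsets, symmetric around zero
```

Adding these offsets to the x-position of tied values in a dot plot makes every data point visible without shifting the group's visual centre.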

Bar graphs are great, and definitely better than pie charts, but do consider how you can improve them in order to show what your data actually looks like beyond the bar.

## Visualizing climate change with stripes

Climate change is abstract. We do not personally experience climate change in our day-to-day activities (although climate change is detectable from any single day of weather at global scale, cf. Sippel 2020), and if we are to understand climate change, data – and in particular data visualisation – is crucial. I have recently been reading some literature on the relevance of visualisations and uncertainty in relation to climate change. There has, for example, been some work on the role of visual imagery on public attitudes towards climate change (e.g., Bolsen et al. 2019 and van der Linden et al. 2014) and how uncertainty may make people more likely to accept climate scientists’ predictions (see Howe et al. 2019).

Scientific evidence and data are not enough; we also need to consider the best possible ways to visualise climate change. One of the most popular visualisations is the iconic #ShowYourStripes figure that shows global average annual temperatures from 1850 to 2019:

I believe it is a good visualisation but I have a few objections. First and foremost, I like numbers and I do not like how simplified the presentation is. What exactly are the numbers we are looking at here? Should I be concerned? If the dark blue is -0.01°C and the dark red is 0.01°C, is the story here one of change or stability? What is the average temperature in the sample and how much variation is there? Call me old-fashioned, but I don’t think a data visualisation is working if you are simply saying that something is increasing over time.

Interestingly, you can also download the figure with labels, but this provides no information on what values the colours are showing – only the time dimension:

The lack of a meaningful legend is an issue here. It would not make the visualisation more complicated; it would only help the reader better understand the changes.
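To sketch what such a legend would document, here is a hypothetical mapping in Python from anomaly values to positions on a blue-to-red scale (the endpoint and anomaly values below are made up for illustration):

```python
def anomaly_to_position(anomaly, coolest=-0.6, warmest=0.8):
    """Map a temperature anomaly (deg C) to a 0..1 position on a
    blue-to-red colour scale. A proper legend should state these
    endpoint values explicitly."""
    position = (anomaly - coolest) / (warmest - coolest)
    return min(1.0, max(0.0, position))  # clamp values outside the range

# Hypothetical anomalies for a handful of years:
anomalies = {1850: -0.4, 1950: 0.0, 2019: 0.7}
print({year: round(anomaly_to_position(a), 2) for year, a in anomalies.items()})
```

With the endpoints printed next to the stripes, the reader can finally answer the question the chart otherwise leaves open: how big are the changes we are looking at?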

Second, I am not convinced that the tool is actually good if you want to show your stripes (and that is what we are being told to do, after all). How useful is the visualisation when we go beyond the global averages? To illustrate my concern, here is the visualisation I got for Denmark:

Sure, there is a story to tell, but I begin to miss certain details. Again, what are the values I am looking at? How much variation is there? And most importantly, how much uncertainty is there over time?

Third, I do not like the extreme colour scale used to illustrate the changes from 1850 (or 1901) to 2019. We know that the temperatures are going to increase in the future and the visualisation can give the false impression that we are already at a peak. I know this is not the message that the figure wants to convey, but people might look at the figure and conclude that we have seen the worst of what is to come.

It is not a bad visualisation. However, it is definitely not the best. You can check out the Vis for Future project from 2019 and find a lot of other great visualisations (the ‘Warming Stripes’ was one of the winners). I can also recommend a lot of the work by Neil Kaye, e.g. this and this. A recent example of a good visualisation is this visualisation from the National Oceanic and Atmospheric Administration on annual temperatures compared to the 20th-century average in the United States (notice how the legend is making it easier to see what we are actually looking at):

Climate change is abstract, but good visualisations with labels can help us better understand the global changes in our climate.