How to improve your figures #8: Make probabilities tangible

Statistics is about learning from data in the context of uncertainty. Often we communicate uncertainty in the form of probabilities. How should we best communicate such probabilities in our figures? The key point in this post is that we should not only present probabilities in the form of probabilities and the like. Instead, we need to work hard on making our numbers tangible.

Why is it not sufficient to simply present estimates of probabilities? Probabilities are difficult because people easily interpret them differently. When people hear that a candidate is 80% likely to win an election, some will see that as a much more likely outcome than others. In other words, there are uncertainties in how people perceive uncertainties. We have known for decades that people assign very different probabilities to different probability terms (see e.g. Wallsten et al. 1986; 1988), and the meaning of a term such as “nearly certain” will for one person be close to 100% and for another person be closer to 50%.

To make matters worse, risks and probabilities can be expressed in different ways. Consider the example of the study that showed how the show “13 Reasons Why” was associated with a 28.9% increase in suicide rates. The study sounded much more interesting because the authors focused on the rate change in relative terms instead of reporting an increase from 0.35 in 100,000 to 0.45 in 100,000 (see also this article on why you should not believe the study in question). To illustrate such differences, @justsaysrisks reports the absolute and relative risk from different articles communicating research findings.

In David Spiegelhalter’s great book, The Art of Statistics: Learning From Data, he looks at how the risk of getting bowel cancer increases by 18% for a group of people who eat 50g of processed meat a day. In Table 1.2 in the book, Spiegelhalter shows how a difference of one percentage point between two groups can be turned into a relative risk of 18%:

Method               Non-bacon eaters   Daily bacon eaters
Event rate           6%                 7%
Expected frequency   6 out of 100       7 out of 100
                     1 in 16            1 in 14
Odds                 6/94               7/93

Comparative measures
Absolute risk difference    1%, or 1 out of 100
Relative risk               1.18, or an 18% increase
‘Number Needed to Treat’    100
Odds ratio                  (7/93) / (6/94) = 1.18

As you can see, event rates of 6% and 7% in the two groups with an absolute risk difference of 1% can be turned into a relative risk of 18% (with an odds ratio of 1.18). Spiegelhalter’s book provides other good examples and I can highly recommend it.
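To make the arithmetic concrete, here is a minimal Python sketch that derives the comparative measures from the two event rates. The function name risk_measures is my own and not something from the book. Note that the rounded rates of 6% and 7% give a relative risk of roughly 1.17 rather than the 1.18 in the table, presumably because the event rates shown there are rounded.

```python
def risk_measures(rate_control, rate_exposed):
    """Return comparative risk measures for two event rates (as proportions)."""
    odds_control = rate_control / (1 - rate_control)
    odds_exposed = rate_exposed / (1 - rate_exposed)
    difference = rate_exposed - rate_control
    return {
        "absolute_risk_difference": difference,
        "relative_risk": rate_exposed / rate_control,
        "number_needed_to_treat": 1 / difference,
        "odds_ratio": odds_exposed / odds_control,
    }


# Event rates from the table: 6% for non-bacon eaters, 7% for daily bacon eaters.
measures = risk_measures(0.06, 0.07)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

The same one percentage point difference shows up as a small absolute number and a much larger-sounding relative number, which is exactly why it matters which one you report.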

Accordingly, probabilities are tricky and we need to be careful in how we communicate them. We have seen a lot of discussions on how best to communicate electoral forecasts (if the probability that a candidate will win more than 50% of the votes is 85%, how confident will people be that the candidate will win?). One great suggestion offered by Spiegelhalter in his book is to not think about percentages per se, but rather make probabilities tangible by showing the outcomes for, say, 100 people (or 100 elections, if you are working on a forecast).

To do this, we use unit charts to show counts of a variable. Here, we can use a 10×10 grid where each cell represents one percentage point. A specific version of the unit chart is an isotype chart, where we use icons or images instead of simple shapes.

There is evidence that such visualisations work better than simply presenting the information numerically. Galesic et al. (2009) show, in the context of medical risks, how icon arrays increase the accuracy of the understanding of risks (see also Fagerlin et al. 2005 on how pictographs can reduce the undue influence of anecdotal reasoning).

When we hear that the probability that the Democrats will win an election is 75%, we think about the outcome of one election and how that is significantly more likely to happen. However, when we use an isotype chart where we show 100 outcomes, 75 of them being won by the Democrats, we make the 25 out of 100 Republican outcomes more salient.

There are different R packages you can use to make such visualisations, e.g. waffle and ggwaffle. In the figure below, I used the waffle package to demonstrate how the Democrats got a probability of 75% of winning a (hypothetical) election.
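If you just want to see the idea without any plotting library, the following is a rough text-only sketch in Python (not the waffle package, and not the code behind my figure). The function unit_chart and the letters D and R are assumptions made for illustration: each of the 100 cells stands for one hypothetical election.

```python
def unit_chart(probability, rows=10, columns=10, filled="D", empty="R"):
    """Return a text grid where each cell represents one outcome out of rows * columns."""
    cells = rows * columns
    n_filled = round(probability * cells)
    symbols = [filled] * n_filled + [empty] * (cells - n_filled)
    lines = [
        " ".join(symbols[row * columns:(row + 1) * columns])
        for row in range(rows)
    ]
    return "\n".join(lines)


# A 75% probability shown as 100 hypothetical elections.
chart = unit_chart(0.75)
print(chart)
```

Even in this crude form, the 25 cells marked R are hard to overlook in a way that the number 75% makes easy.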

There are many different ways to communicate probabilities. However, try to avoid simply presenting the numerical probabilities in your figures, or the odds ratios, and consider how you can make the probabilities more tangible and easier for the reader to process.

How to improve your figures #7: Don’t use a third dimension

Most static figures show information in two dimensions (with a horizontal dimension and a vertical dimension). This works really well on the screen as well as on paper. However, once in a while you also see figures presenting data with a third dimension (3D). It is not necessarily a problem to add a third dimension if you are actually using it to visualise additional data in a meaningful manner. If you have three continuous variables constituting a space, it can make sense to show how observations are positioned in this space. Take for example this figure generated from the interflex package in R:

However, unless you have a clear plan for how, and specifically why, you want to present your data in three dimensions, my recommendation is to stick to two dimensions. I will, as always, present a few examples from the peer-reviewed scientific literature. These are not necessarily bad figures, just figures where I believe not having the third dimension would significantly improve the data visualisation.

First, take a look at a bar chart from this article on the willingness to trade off civil liberties for security before and after the 7/7 London terrorist attack:

The third dimension here is not adding anything of value to the presentation of the data. On the contrary, it is difficult to see whether item 3 is greater than or equal to 3 on the average willingness to trade-off. As Schwabish (2014) describes in relation to the use of 3D charts: “the third dimension does not plot data values, but it does add clutter to the chart and, worse, it can distort the information.” This is definitely the case here.

Second, let us look at a figure from this article showing correlations between voting preferences and ten distinct values:

Again, the third dimension is not adding anything of value to the visualisation. It is only making it more difficult to compare the values. The good thing about the figure is that it shows the exact values. However, it would be much better to present the information in a simple bar chart like this one:

In the figure above, I show the values without the use of any third dimension. Notice how it is easier to identify the specific values (e.g. “Uni”) in the bar chart compared to the 3D chart, when we do not need to move our eyes in a three-dimensional space.

The figure shows the number of articles referring to domestic nuclear energy in four different countries from March 12, 2011 to April 10, 2011. However, it is very difficult to compare the numbers in the different countries, and if there are more articles in the UK and/or France than in Switzerland, we will not be able to see how many articles there are in the latter country. This is not a good visualisation. Instead, it would have been better with a simple line graph in two dimensions.

Again, there can be good reasons to visualise information in three dimensions, but unless you have very good reasons to do so, my recommendation is to keep it simple in 2D. In other words, in most cases, 2D > 3D.

How to improve your figures #6: Don’t use bar graphs to mislead

In a previous post, I argued that the y-axis can be misleading under certain conditions. One of these conditions is when using a bar graph with a non-zero starting point. In this post I will show that bar graphs can be misleading even when the y-axis is not misleading.

In brief, bar graphs do not only convey certain estimates or data summaries but also an idea about how the data is distributed. The point here is that our perception of data is shaped by the bar graph, and in particular that we are inclined to believe that the data is placed within the bar. For that reason, it is often better to replace the bar graph with an alternative such as a box plot. Here is a visual summary of one of the key points:

There is a name for the bias: the within-the-bar bias. Newman and Scholl (2012) showed that this bias is present: “(a) for graphs with and without error bars, (b) for bars that originated from both lower and upper axes, (c) for test points with equally extreme numeric labels, (d) both from memory (when the bar was no longer visible) and in online perception (while the bar was visible during the judgment), (e) both within and between subjects, and (f) in populations including college students, adults from the broader community, and online samples.” In other words, the bias is the norm rather than the exception in how we process bar charts.

Godau et al. (2016) found that people are more likely to underestimate the mean when data is presented in bar graphs. Interestingly, they did not find any evidence that the height of the bars affected the underestimation. There is even some disagreement about whether bar charts should include zero (e.g., Witt 2019). Most recently, however, Yang et al. (2021) have demonstrated how truncating a bar graph persistently (even when presented with an explicit warning) misleads readers.

This is an important issue to focus on. Weissgerber et al. (2015) looked at papers in top physiology journals and found that 85.6% of the papers used at least one bar graph. I have no reason to believe that these numbers should differ significantly from other fields using quantitative data. For that reason, we need to focus on the limitations of bar graphs and potential improvements.

A limitation with the bar graph is that different distributions of the data can give you the same bar graph. Consider this illustration from Weissgerber et al. (2015) on how different distributions of the data (with different issues such as outliers and unequal n) can give you the same bar graph:
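The point can also be checked numerically. Below is a small Python sketch with made-up numbers (they are not the data from Weissgerber et al.): four groups that would produce bars of identical height, even though the underlying data look nothing alike.

```python
from statistics import mean, median

# Hypothetical groups chosen so that every mean is exactly 10.
groups = {
    "symmetric": [8, 9, 10, 11, 12],
    "outlier": [7, 7, 7, 7, 22],
    "bimodal": [4, 4, 10, 16, 16],
    "unequal n": [5, 15],
}

summaries = {}
for name, values in groups.items():
    summaries[name] = {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }
    print(name, summaries[name])
```

A bar graph of the means would show four identical bars, while the medians, ranges and sample sizes tell four different stories.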

Accordingly, bar graphs will often not provide sufficient information on what the data actually looks like and can even give you a biased perception of what the data looks like (partially explained by the within-the-bar bias). The solution is to show more of the data in your visualisations.

Ho et al. (2019) provide one illustrative example on how to do this when you want to examine the difference between two groups. Here is their evolution of two-group data graphics (from panel a, the traditional bar graph, to panel e, an estimation graphic showing the mean difference with 95% confidence intervals as well):

From panel a to panel b, you can see how we address some of the within-the-bar bias, and further show how the data points are actually distributed when looking at panel c. This is just one example of how we can improve the bar graph to show more of the data, and often the right choice of visualisation will depend upon what message you will need to convey and how much data you will have to show.

That being said, there are some general recommendations that will make it more likely that you create a good visualisation. Specifically, Weissgerber et al. (2019) provide seven recommendations, of which I find four relevant in this context (read the paper for the full list as well as the rationale for each):

1. Replace bar graphs with figures that show the data distribution
2. Consider adding dots to box plots
3. Use symmetric jittering in dot plots to make all data points visible
4. Use semi-transparency or show gradients to make overlapping points visible in scatter plots and flow-cytometry figures
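To give a sense of what symmetric jittering means in practice, here is a minimal Python sketch. It is my own simplification and not the procedure from the paper: the function symmetric_jitter only handles exact ties, and the step size of 0.1 is an arbitrary assumption. Points with the same value are spread evenly to the left and right of the group's centre so that none of them overlap.

```python
from collections import defaultdict


def symmetric_jitter(values, center=0.0, step=0.1):
    """Return (x, y) pairs where tied y values are spread symmetrically around center."""
    indices_by_value = defaultdict(list)
    for index, value in enumerate(values):
        indices_by_value[value].append(index)

    points = [None] * len(values)
    for value, indices in indices_by_value.items():
        count = len(indices)
        for rank, index in enumerate(indices):
            # Offsets are centred on zero, e.g. -0.1, 0.0, 0.1 for three tied points.
            offset = (rank - (count - 1) / 2) * step
            points[index] = (center + offset, value)
    return points


# Hypothetical data with ties at 5 and 7.
points = symmetric_jitter([5, 5, 5, 7, 7, 9])
for x, y in points:
    print(f"x = {x:+.2f}, y = {y}")
```

With real, continuous data you would typically bin nearby values before spreading them, but the principle is the same: the horizontal position carries no information and is only used to make every observation visible.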

Bar graphs are great, and definitely better than pie charts, but do consider how you can improve them in order to show what your data actually looks like beyond the bar.

Visualizing climate change with stripes

Climate change is abstract. We do not personally experience climate change in our day-to-day activities (although climate change is detectable from any single day of weather at global scale, cf. Sippel 2020), and if we are to understand climate change, data – and in particular data visualisation – is crucial. I have recently been reading some literature on the relevance of visualisations and uncertainty in relation to climate change. There has, for example, been some work on the role of visual imagery on public attitudes towards climate change (e.g., Bolsen et al. 2019 and van der Linden et al. 2014) and how uncertainty may make people more likely to accept climate scientists’ predictions (see Howe et al. 2019).

Scientific evidence and data are not enough and we need to consider the best possible ways to visualise climate change. One of the most popular visualisations is the iconic #ShowYourStripes figure that shows global average annual temperatures from 1850 to 2019:

I believe it is a good visualisation but I have a few objections. First and foremost, I like numbers and I do not like how simplified the presentation is. What exactly are the numbers we are looking at here? Should I be concerned? If the dark blue is -0.01°C and the dark red is 0.01°C, is the story here one of change or stability? What is the average temperature in the sample and how much variation is there? Call me old-fashioned, but I don’t think a data visualisation is working if you are simply saying that something is increasing over time.

Interestingly, you can also download the figure with labels, but this provides no information on what values the colours are showing – only the time dimension:

The lack of a meaningful legend is an issue here. Adding one would not make the visualisation more complicated but would help readers better understand the changes.

Second, I am not convinced that the tool is actually good if you want to show your stripes (and that is what we are being told to do after all). How useful is the visualisation when we go beyond the global averages? To illustrate my concern, here is the visualisation I got for Denmark:

Sure, there is a story to tell, but I begin to miss certain details. Again, what are the values I am looking at? How much variation is there? And most importantly, how much uncertainty is there over time?

Third, I do not like the extreme colour scale used to illustrate the changes from 1850 (or 1901) to 2019. We know that the temperatures are going to increase in the future and the visualisation can give the false impression that we are already at a peak. I know this is not the message that the figure wants to convey, but people might look at the figure and conclude that we have seen the worst of what is to come.

It is not a bad visualisation. However, it is definitely not the best. You can check out the Vis for Future project from 2019 and find a lot of other great visualisations (the ‘Warming Stripes’ was one of the winners). I can also recommend a lot of the work by Neil Kaye, e.g. this and this. A recent example of a good visualisation is this visualisation from the National Oceanic and Atmospheric Administration on annual temperatures compared to the 20th-century average in the United States (notice how the legend is making it easier to see what we are actually looking at):

Climate change is abstract, but good visualisations with labels can help us better understand the global changes in our climate.

How to improve your figures #5: Don’t use pie charts

The pie chart is more than 200 years old. And I am sure people will use pie charts 200 years from now (cf. the Lindy effect). It is a popular chart type and I am not universally against using pie charts. However, in most cases – especially in academic publications – the pie chart is not the best choice.

Specifically, often we are simply better off not providing a figure at all or turning the pie chart into a bar chart. I will provide an example of both these cases. First, to understand why no chart at all might be better than a pie chart, I will show the full context of a pie chart from this article to illustrate the issue:

As you can see, the pie chart and the figure legend take up more than half of the space on the page (and the text on the page is then used to comment on what the figure is showing). The pie chart is adding nothing of value here that couldn’t be described in a few sentences, including the examples highlighted for each category in the pie chart. Accordingly, even (or especially) when pie charts are quite simple, their shape makes them take up more space than simply adding a few lines of text. Of course, in some settings, such as in PowerPoint presentations or tweets, text can be a problem and a pie chart like the one above might be a useful way to illustrate the information of interest.

Second, when we add more categories to a pie chart, the visualisation will only be even worse. That is, when we want to show more information, the pie chart is not the ideal choice. Consider a figure from this article on the occupations of members of Congress:

There are at least three issues with the pie chart in question. First, and most importantly when we look at pie charts with multiple categories, it is very difficult to compare the relative size of the different categories. Second, pie charts often use different colours for the different categories. However, the colours are not adding any information and it is almost impossible to read the “Business employee” category. Third, notice how “Other/unknown” is 0% but still has a non-zero sized piece of the pie. It might be a rounding issue but it is not looking good.

To address all of the three issues, we can create a simple bar chart showing the same information:

This figure makes it a lot easier to compare the different categories, works well in black and white, and does not need to use any ink to show a value of 0%.

Pie charts are not always bad, but make sure that you consider whether your pie chart is actually better than no pie chart at all or better than an alternative chart (such as a bar chart).

How to improve your figures #4: Show labels

This is a brief follow-up post to my previous post with advice on how you can improve your figures. Can it be worse than showing variable names instead of actual labels on your figures? Yes. You can have no labels at all.

Take a look at this article. It’s great and includes references to a lot of good material. However, for both figures provided in the article, it is not clear what exactly the y-axis is showing. Take the figure with the Fragile States Index as an example. What is a value of 5? What is a value of 6? Ideally, the figure should show that without you having to track down the source material (also, do notice how only countries doing better than the US are included, which makes the US look worse than it actually is).

I see missing labels now and then. Consider, for example, this article from Comparative Political Studies. Here is Figure 1 (and the title legend):

What you see is that there is no information on what is shown on the respective axes. The only thing you have is a series of numbers. “Satisfaction With Government, Perception of the Economy, and Clarity of Responsibility”, sure, but what is on the x-axis? What is on the y-axis? Figure 2 in the paper is similar to Figure 1, only with “Trust in Parliament” instead of “Satisfaction With Government”.

This is an extreme example but my recommendation is simple: Make sure that you always have labels that tell the reader what the figure is actually showing. Having no labels is highly problematic, showing variable names is problematic and showing informative labels is great.

How to improve your figures #3: Don’t show variable names

When you plot a figure in your favourite statistical software, you will most likely see the name of the variable(s) you are plotting. If your income variable is called inc, your software will label the income axis inc and not income. In most cases variable names are not sufficient and you should, for that reason, not show variable names in your figures.

Good variable names are easy to read and write – and follow specific naming conventions. For example, you cannot (and should not) include spaces in your variable names. That is why we use underscores (_) to separate words in variable names. However, R, SPSS and Stata will happily show such underscores in your figures – and you need to fix that.
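One way to make this a habit is to keep a small lookup from variable names to proper labels next to your plotting code. The Python sketch below is only an illustration: the dictionary LABELS, the function axis_label and the variable name hh_size are hypothetical, and the fallback that swaps underscores and dots for spaces is a stopgap rather than a substitute for writing a real label.

```python
# Hypothetical mapping from variable names to reader-friendly labels.
LABELS = {
    "inc": "Income",
    "hh_size": "Household size",
}


def axis_label(variable_name, labels=LABELS):
    """Return a proper label if one is defined, otherwise a cleaned-up variable name."""
    if variable_name in labels:
        return labels[variable_name]
    # Stopgap: replace underscores and dots with spaces and capitalise the first letter.
    cleaned = variable_name.replace("_", " ").replace(".", " ").strip()
    return cleaned[:1].upper() + cleaned[1:]


print(axis_label("inc"))
print(axis_label("trust_in_parliament"))
```

The explicit entries are what you want in the final figure. The fallback only makes sure that a forgotten variable at least does not show up with underscores.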

I believe this is data visualisation 101 but it is something I see a lot, including in published research. For example, take a look at this figure (Figure 1 from this paper):

As you can see, we have Exitfree, Anti_EU and some GDP* variables. The good thing about this paper is that the variable names are mentioned in the main text as well: “Individuals and parties may have ideological objections to European integration and hence desire a free exit right irrespective of whether their country is peripheral. To control for this, a variable ‘Anti_EU’ is constructed based on the variable ‘eu_anti_pro’ in the ParlGov database”. However, I would still recommend that you do not show the actual variable names in the figures but use proper labels (with spaces and everything).

Let’s look at another few examples from this paper. Here is the first figure:

The important thing is not what the figure is about, but the labels. You will see labels such as PID_rep_dem and age_real. These are not good labels to have in a figure in a paper. age_real is not mentioned anywhere in the paper (only age as a covariate is mentioned).

Let us take a look at Figure 3 from the same paper:

Here you will see a variable called form2. What was form 1? Is there a form 3? When we rely on variable names instead of clear labels, we introduce ambiguity and make it difficult for the reader to understand what is going on. Notice also the difference between Figure 1 and Figure 3 for age, i.e. age_real and real_age. Are those variables the same (i.e. a correlation of 1)? And if that is the case, why have two age variables?

Okay, next example. Look at Figure 6 from this paper:

Here we see a variable on the x-axis called yrs_since1920 (years since 1920). It would be better to label this axis simply “Years since 1920”. Or even better: just “Year”, with the actual years on the axis. Notice also here the 1.sønderjylland_ny label. Sønderjylland is not mentioned in the paper and it is not clear how ny (new in Danish) should be understood here (most likely that it wasn’t the first Sønderjylland variable that was created in the data).

Let’s take another example, specifically Figure 3 from this paper:

Here we see the good old underscores en masse. anti_elite, immigrant_blame, ring_wing_complete_populism, rich_blame and left_wing_complete_populism. There are 29 authors on the article in question. Too many cooks spoil the broth? Nahh, I am sure most of the authors on the manuscript didn’t even bother looking at the figures (also, if you want to have fun, take a critical look at the results provided in the appendix!).

And now I notice that all of the examples I have provided above are from Stata. I promise it is a coincidence. However, let’s take one last example from R just to confirm that it is not only an issue in Stata. Specifically, look at Figure 3 in this paper (or Figure 4, Figure 5 and Figure 6):

The figure shows trends in public opinion on economic issues in the United States from 1972 to 2016. There are too many dots in the labels here. guar.jobs.n.income, FS.aid.4.college etc. are not ideal labels in your figure.

In sum, I like most of the papers above (there is a reason I found the examples in the first place). However, it is a major turn-off that the figures do not show actual labels but simply rely on the variable names or weird abbreviations to show crucial information.

Here is a collection of books and peer-reviewed articles on data visualization. There is a lot of good material on the philosophy, principles and practices of data visualization.

I plan to update the list with additional material in the future (see the current version as a draft). Do reach out if you have any recommendations.

Introduction

Graphs in Statistical Analysis (Anscombe 1973)
An Economist’s Guide to Visualizing Data (Schwabish 2014)
Data Visualization in Sociology (Healy and Moody 2014)
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (Weissgerber et al. 2015)
Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods (Cleveland and McGill 1984)
Graphic Display of Data (Wilkinson 2012)
Visualizing Data in Political Science (Traunmüller 2020)
Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks (Schwabish 2021)

History

Historical Development of the Graphical Representation of Statistical Data (Funkhouser 1937)
Quantitative Graphics in Statistics: A Brief History (Beniger and Robyn 1978)

Tips and recommendations

Ten Simple Rules for Better Figures (Rougier et al. 2014)
Designing Graphs for Decision-Makers (Zacks and Franconeri 2020)
Designing Effective Graphs (Frees and Miller 1998)
Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics (Donahue 2011)
Designing Better Graphs by Including Distributional Information and Integrating Words, Numbers, and Images (Lane and Sándor 2009)

Uncertainty

Researchers Misunderstand Confidence Intervals and Standard Error Bars (Belia et al. 2005)
Error bars in experimental biology (Cumming et al. 2007)
Confidence Intervals and the Within-the-Bar Bias (Pentoney and Berger 2016)
Depicting Error (Wainer 1996)
When (ish) is My Bus?: User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems (Kay et al. 2016)
Decisions With Uncertainty: The Glass Half Full (Joslyn and LeClerc 2013)
Uncertainty Visualization (Padilla et al. 2020)
A Probabilistic Grammar of Graphics (Pu and Kay 2020)

Tables

Let’s Practice What We Preach: Turning Tables into Graphs (Gelman et al. 2002)
Why Tables Are Really Much Better Than Graphs (Gelman 2011)
Graphs or Tables (Ehrenberg 1978)
Using Graphs Instead of Tables in Political Science (Kastellec and Leoni 2007)
Ten Guidelines for Better Tables (Schwabish 2020)

Chart types

Boxplots

40 years of boxplots (Wickham and Stryjewski 2011)

Infographics

Infovis and Statistical Graphics: Different Goals, Different Looks (Gelman and Unwin 2013)
InfoVis Is So Much More: A Comment on Gelman and Unwin and an Invitation to Consider the Opportunities (Kosara 2013)
InfoVis and Statistical Graphics: Comment (Murrell 2013)
Graphical Criticism: Some Historical Notes (Wickham 2013)
Tradeoffs in Information Graphics (Gelman and Unwin 2013)

Maps

Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation (Lucchesi and Wikle 2017)

Scatterplot

The Many Faces of a Scatterplot (Cleveland and McGill 1984)
The early origins and development of the scatterplot (Friendly and Denis 2005)

Dot plots

Dot Plots: A Useful Alternative to Bar Charts (Robbins 2006)

3D charts

The Pseudo Third Dimension (Haemer 1951)

Software

Excel

Effective Data Visualization: The Right Chart for the Right Data (Evergreen 2016)

R

Data Visualization (Healy 2018)
Data Visualization with R (Kabacoff 2018)
ggplot2: Elegant Graphics for Data Analysis (Wickham 2009)
Fundamentals of Data Visualization (Wilke 2019)
R Graphics Cookbook (Chang 2020)

Stata

A Visual Guide to Stata Graphics (Mitchell 2012)

Changelog
– 2021-03-01: Add ‘Better Data Visualizations’
– 2020-08-03: Add ‘Ten Guidelines for Better Tables’
– 2020-07-14: Add ‘Designing Graphs for Decision-Makers’ and ‘A Probabilistic Grammar of Graphics’ (ht: Simon Straubinger)

How to improve your figures #1: Don’t use the y-axis to mislead

There are good reasons to think carefully about the y-axis when you design figures, including considerations on whether to start your y-axis at zero or not. In this post, I provide a simple piece of advice: when presenting bar charts on a linear scale, start at 0. Not 0.38. Not 0.31. Not 0.04. 0.

The figure below, from Hanel et al. (2018), depicts the same data in three panels. It shows how the same data can be presented in different ways with implications for how we perceive differences between groups.

In the first panel, we see the distributions of the two groups. In the second panel, we see that the y-axis starts at 4.6. In that figure, it looks like the value for Poland (the red bar) is three times greater than the value for the UK (the blue bar). In the third panel, relative to the second panel, we see a much better presentation of the two groups with a y-axis starting at zero.

Despite the fact that bar charts with arbitrary and non-zero starting y-axes are problematic, I see it again and again in scientific publications. Take for example this new article in the American Political Science Review, where the bar charts use the y-axis to mislead. Specifically, they leave the impression of a greater difference between the two groups than is supported by the data:

For another example, take this new article in Political Communication where the bar chart conveniently starts at 4.00% to give the impression of a large difference between the groups. (On a side note, I can’t believe how unlucky the authors were. The only statistical finding in the article is the finding that wasn’t preregistered.)

Alas, journals and books are filled with examples of bar charts that use the y-axis to mislead. The general issue is that these figures do not comply with the principle of proportional ink: “The sizes of shaded areas in a visualization need to be proportional to the data values they represent.”
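A quick calculation shows how strongly a truncated axis can violate this principle. In the Python sketch below, the group values 4.9 and 4.7 are made up for illustration (they are not taken from Hanel et al.), while the baseline of 4.6 matches the starting point of the y-axis in the second panel discussed above. The function apparent_ratio is my own name for the ratio between the drawn bar heights.

```python
def apparent_ratio(value_a, value_b, baseline=0.0):
    """Return the ratio of the drawn bar heights when the y-axis starts at baseline."""
    return (value_a - baseline) / (value_b - baseline)


# Hypothetical group means.
ratio_from_zero = apparent_ratio(4.9, 4.7)
ratio_truncated = apparent_ratio(4.9, 4.7, baseline=4.6)

print(f"Bars starting at 0:   {ratio_from_zero:.2f} times as tall")
print(f"Bars starting at 4.6: {ratio_truncated:.2f} times as tall")
```

With the axis starting at zero, the taller bar uses about 4% more ink than the shorter one, which matches the data. With the axis starting at 4.6, it uses about three times as much ink for the very same numbers.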

This is not to say that y-axes should always start at zero. On the contrary, there are many cases where figures should definitely not start at zero (see this article from Quartz and this video from Vox for more information). However, when creating a bar plot, the best way to improve your figure is to comply with the principle of proportional ink. Start at 0.