How to improve your figures #3: Don’t show variable names

When you plot a figure in your favourite statistical software, you will most likely see the name of the variable(s) you are plotting. If your income variable is called inc, your software will label the axis inc and not income. In most cases variable names are not sufficient as labels and you should, for that reason, not show variable names in your figures.

Good variable names are easy to read and write – and follow specific naming conventions. For example, you cannot (and should not) include spaces in your variable names. That is why we use underscores (_) to separate words in variable names. However, R, SPSS and Stata will happily show such underscores in your figures – and you need to fix that.
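To make the fix concrete, here is a minimal sketch in Python of one way to keep variable names out of a figure: a lookup table from variable names to labels, with a rough fallback for names that have not been given a label yet. The function name axis_label, the LABELS table and the gdp_growth entry are all hypothetical and only there for illustration; nothing here is specific to any plotting package.

```python
import re

# Hypothetical lookup table: variable name -> label the reader should see.
LABELS = {
    "inc": "Income",
    "gdp_growth": "GDP growth (%)",
}


def axis_label(variable, labels=LABELS):
    """Return a reader-facing label for a variable name."""
    if variable in labels:
        return labels[variable]
    # Fallback for unlabelled variables: turn underscores and dots into
    # spaces and capitalise the first letter. This is a stopgap only;
    # a hand-written label is almost always better.
    words = re.sub(r"[_.]+", " ", variable).strip()
    return words[:1].upper() + words[1:]


print(axis_label("inc"))
print(axis_label("immigrant_blame"))
```

The string returned here is what you would hand to your plotting tool instead of the raw name, for example via set_xlabel() in matplotlib, labs() in ggplot2 or the xtitle() option in Stata.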

I believe this is data visualisation 101 but it is something I see a lot, including in published research. For example, take a look at this figure (Figure 1 from this paper):

As you can see, we have Exitfree, Anti_EU and some GDP* variables. The good thing about this paper is that the variable names are mentioned in the main text as well: “Individuals and parties may have ideological objections to European integration and hence desire a free exit right irrespective of whether their country is peripheral. To control for this, a variable variable ‘Anti_EU’ is constructed based on the variable ‘eu_anti_pro’ in the ParlGov database”. However, I would still recommend that you do not show the actual variable names in the figures but use proper, readable labels (with spaces and everything).

Let’s look at another few examples from this paper. Here is the first figure:

The important thing is not what the figure is about, but the labels. You will see labels such as PID_rep_dem and age_real. These are not good labels to have in a figure in a paper. age_real is not mentioned anywhere in the paper (only age as a covariate is mentioned).

Let us take a look at Figure 3 from the same paper:

Here you will see a variable called form2. What was form 1? Is there a form 3? When we rely on variable names instead of clear labels, we introduce ambiguity and make it difficult for the reader to understand what is going on. Notice also the difference between Figure 1 and Figure 3 for age, i.e. age_real and real_age. Are those variables the same (i.e. a correlation of 1)? And if that is the case, why have two age variables?

Okay, next example. Look at Figure 6 from this paper:

Here we see a variable on the x-axis called yrs_since1920 (years since 1920). It would be better to label this axis simply “Years since 1920”. Or even better: plot the year itself and show the actual years on the axis. Notice also here the 1.sønderjylland_ny label. Sønderjylland is not mentioned in the paper and it is not clear how ny (new in Danish) should be understood here (most likely that it wasn’t the first Sønderjylland variable that was created in the data).
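Showing actual years is a small transformation. The sketch below, again in Python, assumes that the variable counts whole years from 1920 (which is what the name yrs_since1920 suggests, although I have not checked the data), and the helper names and example offsets are hypothetical.

```python
BASE_YEAR = 1920  # assumption, read off the variable name yrs_since1920


def calendar_year(years_since_base, base_year=BASE_YEAR):
    """Convert an offset such as yrs_since1920 into a calendar year."""
    return base_year + years_since_base


def tick_labels(offsets, base_year=BASE_YEAR):
    """Build axis tick labels that show years instead of offsets."""
    return [str(calendar_year(offset, base_year)) for offset in offsets]


# Hypothetical tick positions on the original x-axis.
print(tick_labels([0, 10, 20, 30]))
```

With labels like these on the ticks, the axis title can simply be “Year”.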

Let’s take another example, specifically Figure 3 from this paper:

Here we see the good old underscores en masse. anti_elite, immigrant_blame, ring_wing_complete_populism, rich_blame and left_wing_complete_populism. There are 29 authors on the article in question. Too many cooks spoil the broth? Nahh, I am sure most of the authors on the manuscript didn’t even bother looking at the figures (also, if you want to have fun, take a critical look at the results provided in the appendix!).

And now I notice that all of the examples I have provided above are from Stata. I promise it is a coincidence. However, let’s take one last example from R just to confirm that it is not only an issue in Stata. Specifically, look at Figure 3 in this paper (or Figure 4, Figure 5 and Figure 6):

The figure shows trends in public opinion on economic issues in the United States from 1972 to 2016. There are too many dots in the labels here. guar.jobs.n.income, FS.aid.4.college etc. are not ideal labels in your figure.

In sum, I like most of the papers above (there is a reason I found the examples in the first place). However, it is a major turn-off that the figures do not show actual labels but simply rely on the variable names or weird abbreviations to show crucial information.

Data visualization: a reading list

Here is a collection of books and peer-reviewed articles on data visualization. There is a lot of good material on the philosophy, principles and practices of data visualization.

I plan to update the list with additional material in the future (see the current version as a draft). Do reach out if you have any recommendations.

Introduction

Graphs in Statistical Analysis (Anscombe 1973)
An Economist’s Guide to Visualizing Data (Schwabish 2014)
Data Visualization in Sociology (Healy and Moody 2014)
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (Weissgerber et al. 2015)
Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods (Cleveland and McGill 1984)
Graphic Display of Data (Wilkinson 2012)
Visualizing Data in Political Science (Traunmüller 2020)
Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks (Schwabish 2021)

History

Historical Development of the Graphical Representation of Statistical Data (Funkhouser 1937)
Quantitative Graphics in Statistics: A Brief History (Beniger and Robyn 1978)

Tips and recommendations

Ten Simple Rules for Better Figures (Rougier et al. 2014)
Designing Graphs for Decision-Makers (Zacks and Franconeri 2020)
Designing Effective Graphs (Frees and Miller 1998)
Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics (Donahue 2011)
Designing Better Graphs by Including Distributional Information and Integrating Words, Numbers, and Images (Lane and Sándor 2009)

Analysis and decision making

Statistical inference for exploratory data analysis and model diagnostics (Buja et al. 2009)
Statistics and Decisions: The Importance of Communication and the Power of Graphical Presentation (Mahon 1977)
The Eight Steps of Data Analysis: A Graphical Framework to Promote Sound Statistical Analysis (Fife 2020)

Uncertainty

Researchers Misunderstand Confidence Intervals and Standard Error Bars (Belia et al. 2005)
Error bars in experimental biology (Cumming et al. 2007)
Confidence Intervals and the Within-the-Bar Bias (Pentoney and Berger 2016)
Depicting Error (Wainer 1996)
When (ish) is My Bus?: User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems (Kay et al. 2016)
Decisions With Uncertainty: The Glass Half Full (Joslyn and LeClerc 2013)
Uncertainty Visualization (Padilla et al. 2020)
A Probabilistic Grammar of Graphics (Pu and Kay 2020)

Tables

Let’s Practice What We Preach: Turning Tables into Graphs (Gelman et al. 2002)
Why Tables Are Really Much Better Than Graphs (Gelman 2011)
Graphs or Tables (Ehrenberg 1978)
Using Graphs Instead of Tables in Political Science (Kastellec and Leoni 2007)
Ten Guidelines for Better Tables (Schwabish 2020)

Deciding on a chart

Graph and chart aesthetics for experts and laymen in design: The role of familiarity and perceived ease of use (Quispel et al. 2016)

Chart types

Boxplots

40 years of boxplots (Wickham and Stryjewski 2011)

Pie charts

No Humble Pie: The Origins and Usage of a Statistical Chart (Spence 2005)

Infographics

Infovis and Statistical Graphics: Different Goals, Different Looks (Gelman and Unwin 2013)
InfoVis Is So Much More: A Comment on Gelman and Unwin and an Invitation to Consider the Opportunities (Kosara 2013)
InfoVis and Statistical Graphics: Comment (Murrell 2013)
Graphical Criticism: Some Historical Notes (Wickham 2013)
Tradeoffs in Information Graphics (Gelman and Unwin 2013)

Maps

Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation (Lucchesi and Wikle 2017)

Scatterplot

The Many Faces of a Scatterplot (Cleveland and McGill 1984)
The early origins and development of the scatterplot (Friendly and Denis 2005)

Dot plots

Dot Plots: A Useful Alternative to Bar Charts (Robbins 2006)

3D charts

The Pseudo Third Dimension (Haemer 1951)

Teaching pedagogy

Correlational Analysis and Interpretation: Graphs Prevent Gaffes (Peden 2001)
Numbers, Pictures, and Politics: Teaching Research Methods Through Data Visualizations (Rom 2015)
Data Analysis and Data Visualization as Active Learning in Political Science (Henshaw and Meinke 2018)

Software

Excel

Effective Data Visualization: The Right Chart for the Right Data (Evergreen 2016)

R

Data Visualization (Healy 2018)
Data Visualization with R (Kabacoff 2018)
ggplot2: Elegant Graphics for Data Analysis (Wickham 2009)
Fundamentals of Data Visualization (Wilke 2019)
R Graphics Cookbook (Chang 2020)

Stata

A Visual Guide to Stata Graphics (Mitchell 2012)


Changelog
– 2021-03-01: Add ‘Better Data Visualizations’
– 2020-08-03: Add ‘Ten Guidelines for Better Tables’
– 2020-07-14: Add ‘Designing Graphs for Decision-Makers’ and ‘A Probabilistic Grammar of Graphics’ (ht: Simon Straubinger)

Potpourri: Statistics #52

Here’s why 2019 is a great year to start with R: A story of 10 year old R code then and now
How the BBC Visual and Data Journalism team works with graphics in R
Special Topics in Data Science: Responsible Data Science
Causal Data Science
From Psychologist to Data Scientist
Causal Graphs Seminar
R Coding Style Guide
Explaining the 2016 Democratic Primary with Machine Learning
A guide to making your data analysis more reproducible
Exploring the multiplication table with R
hcandersenr: An R Package for H.C. Andersens fairy tales
Solving the model representation problem with broom
Basic Stata Syntax Workshop
Bayesian Logistic Regression using brms, Part 1
Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups
Understanding propensity score weighting
Causal Inference Book
15 new ideas and new tools for R gathered from the RStudio Conference 2019
Keeping up to date with R news
tidylog

New book chapter: Logistic regression with a binary outcome

I have written a chapter on binary logistic regression analysis for the new book, Videregående kvantitative metoder (Advanced Quantitative Methods). The book is edited by M. Azhar Hussain and Jørgen Trankjær Lauridsen, who have both done a fantastic job.

Chapter 3, Logistisk regression med binært udfald (Logistic regression with a binary outcome), gives an introduction to how to carry out logistic regression analyses. Specifically, it describes three procedures that should be carried out as part of a good logistic regression: 1) estimating a model, 2) calculating predicted probabilities and 3) visualising the results.

The whole book is built around a practical focus on how things are done in Stata. The do-file for my chapter can be found here.

Potpourri: Statistics #35

How to better communicate election forecasts — in one simple chart
What data patterns can lie behind a correlation coefficient?
Electoral Vote Prediction Map in R
Plotly R cheat sheet
Stata Figure Schemes Latest version + inclusion in Stata’s SSC archive
The hard road to reproducibility
Equivalence, non-inferiority and superiority testing
Writing Good R Code and Writing Well
December ’16 RStudio Tips and Tricks
A non-comprehensive list of awesome things other people did in 2016
The 10 Best Data Visualization Articles of 2016 (and Why They Were Awesome)

Potpourri: Statistics #31

A FiveThirtyEight-inspired theme for ggplot2
Comparing ggplot2 and R Base Graphics
List of useful RStudio addins made by useRs
Creating a LOESS animation with gganimate
interplot: Plot the Effects of Variables in Interaction Terms
How Much Should We Trust Estimates from Multiplicative Interaction Models? Simple Tools to Improve Empirical Practice
Stata Figure Schemes (+ update)
An unhealthy obsession with p-values is ruining science
Statistikere udsender sjælden form for advarsel
A reading list for the Replicability Crisis
Rock ‘n Poll
Jagten på den perfekte stikprøve
Demystifying Box-and-whisker plots: Part 1, Part 2
Polipredict.dk: markedets forventninger til politiske begivenheder
Rbitrary Standards

Potpourri: Statistics #17

How Do We Know if a Program Made a Difference? A Guide to Statistical Methods for Program Impact Evaluation
Common Misconceptions about Data Analysis and Statistics
Taking a Chance in the Classroom: Five Concrete Reasons Your Students Should Be Learning to Analyze Data in the Reproducible Paradigm
What p-hacking really looks like: A comment on Masicampo & LaLande (2012)
All English soccer results 1888-2014
Qualitative Comparative Analysis in Critical Perspective
Data wrangling, exploration, and analysis with R (link updated, 2019-09-27)
How to Make More Published Research True
Reproducible Research: What, Why, and How?
Sailing between the Scylla of hyping of sexy research and the Charybdis of reflexive skepticism

Potpourri: Statistics #16

Folketingets åbne data
How The FiveThirtyEight Senate Forecast Model Works
ggthemr
Google Correlations: New approaches to collecting data for statistical network analysis
Digitalisering af historisk statistik
List of resources for the Stata commands ‘margins’ and ‘marginsplot’
Covering election night with R
How to share data with a statistician
Replication and reputation: Whose career matters?
Research replication in social science: reflections from Nathaniel Beck