Jytte from Marketing is unfortunately back

The other day I finished reading Jytte vender tilbage. The book is written by Morten Münster and is a sequel to Jytte fra Marketing er desværre gået for i dag (which I reviewed for Kommunikationsforum). If you are already familiar with basic concepts in statistics (e.g. the birthday paradox), behavioural economics (e.g. loss aversion), and economics more generally (e.g. opportunity costs), there is not much new to be found in this new book.

I had originally not planned to write a post about the book, but one example in it meant that not even a seven nation army could keep my fingers off the keyboard. Here is the introduction to chapter 4 of the book:

“There is a wonderful concept within behavioural research known in English as 9-enders. 9-enders are people whose round-number birthday is next year; they are, say, 29, 39 or 49 years old. We can call them 9’ers. The concept was named by the two psychologists Adam Alter and Hal Hershfield, who examined 9’ers and their behaviour in a study back in 2015. They had a hypothesis that 9’ers behave differently from the rest of us. To test this hypothesis, they found various larger data sources covering, for example, marathon registrations, suicide rates and infidelity.”

The study was published in 2014 (not 2015), but let that slide. We got bigger fish to fry. I was already familiar with the example, as I have spent countless hours reproducing as well as replicating the study in question. There is absolutely no empirical support for a so-called 9-ender effect. In 2015, I published an article in the journal Frontiers in Psychology showing that their data simply did not square with their conclusions. Likewise, the three researchers Simon Kühne, Thorsten Schneider and David Richter have shown that the study does not replicate. You can therefore breathe a sigh of relief if you are 29 or 39 or any other age ending in 9. You are not more likely to commit suicide for that reason.

The above is, unfortunately, a good example of what happens when people without any research background throw around concepts, theories and empirical findings without the qualifications needed to judge whether we are really dealing with a “wonderful concept”, or at least the ability to carry out a simple literature search that would easily confirm that the study has problems.

This kept my guard up throughout the rest of my reading, and it pains me to conclude that the book unfortunately contains many errors and shortcomings: everything from mistakes in titles to the presentation of (pseudo)scientific findings in the literature, along with material that has nothing to do with (behavioural) research and reeks a little too much of being get-smart-quick literature. Ironically, the book seems primarily written for middle managers who, when they hear about Dunning–Kruger, the illusion of explanatory depth and overconfidence, can feel more knowledgeable about behavioural research.

One example deserves a few extra words, as it illustrates my problem with the book to perfection. Specifically, the example about how our (first) name affects major life decisions:

Even something as irrelevant as our names affects our choices in life. The researcher Brett Pelham, in particular, has been behind research showing statistically that people make major decisions based on their name. In his study Why Susie Sells Seashells by the Seashore: Implicit Egotism and Major Life Decisions, he shows that if you are American and named George, you are more likely to live in Georgia, and if you are named Louise, the odds are greater that you live in Louisiana.

This is, unfortunately and predictably, too good to be true. Anyone with even minimal knowledge of behavioural research will know that this is one of the findings that is not robust. The psychologist Uri Simonsohn has written several studies on the matter. In one study, he shows how these findings can be explained by “a combination of cohort, geographic and ethnic confounds, and reverse causality.” In another, he concludes that “controlling for reverse causality entirely eliminated the name-similarity effect.” In other words, it appears we are “mistaking correlation for causation” (remember this phrasing, as you will see it again very soon).

Why do I single out this example among the several in the book? There are, after all, plenty of examples of studies that do not replicate, and one can hardly fault an author for not knowing all the research in an area. I mention it here because the excerpt above is followed by the flatly incorrect claim below:

No one has managed to refute these colourful studies, even though it could be that we are committing the classic error of mistaking correlation for causation. Could it be, for instance, that parents living in Louisiana think it is cute to name their daughter Louise? Maybe. But studies of the irrational power of our familiar name go back many years and show associations in peculiar contexts. For example, there is a statistical overrepresentation of people named Dennis and Denise who become ‘dentists’, and the same goes for Lawrence and Larry, who become ‘lawyers’. Whether that is why Lars Hansen is a doctor must remain unknown, but names have been shown to influence which products we prefer, which politicians we donate money to, which spouses we end up with, and thus also which jobs we pursue.

No one has managed to refute these colourful studies? I do not know what rock one must have been sleeping under to be able, in 2021, as an authority on behavioural research, to make such a claim. It annoys me that not even the smallest literature search was carried out to establish whether someone has in fact “managed to refute these colourful studies”. Looking at the notes to the book, you can also see that the author refers to the study with a link to a file on Andrew Gelman’s blog: “12. You can read more about Pelham’s study of names here: http://www.stat.columbia.edu/~gelman/stuff_for_blog/susie.pdf”. Without giving too much away, I can reveal that Andrew Gelman does not hold a positive view of the study in question.

I highlight this example because it is not merely a matter of uncritically relaying what a study has shown. What we are dealing with here is more problematic than uncritical communication, namely “critically” erroneous communication, where readers get the impression that the author has actually engaged critically with the studies and concluded that there is no criticism. I fully understand if readers of the book come away convinced that the aforementioned effects are robust and strong, even though subsequent research has shown that this is in no way the case.

And now that we are on the subject of the book’s notes, another problem presents itself. Numerous claims are not backed by references (I would, for example, like to look more closely at how strong the evidence is for claiming that “It turns out that the majority of prison inmates believe they are more upstanding than the average citizen.”). A large part of the book seems to be written on the basis of anecdotal references to studies and examples the author does not know from the source material, but has often merely heard about in a podcast or two.

I also find the humour and tone of the book arrogant and cocky. That is of course a matter of taste and should not be understood as criticism (whereas everything mentioned above, in case anyone is in doubt, most certainly should). The overall product unfortunately makes the book come across as behavioural research’s answer to a Big Mac: an appealing look and easy to digest, but most of all empty calories once you inspect the product closely. That is a shame, as Morten Münster is indisputably a popular communicator of behavioural research.

The book at hand is unfortunately a good example of how it is by no means easy to write a book that is both engaging and academically sound. Behavioural research is popular, and many books in this genre are being sold these days, but I do worry whether that popularity comes at the expense of the rigour that should form the foundation of this research and its communication.

The plural of anecdote

There are two very different quotes. The first is “the plural of anecdote is not data”. The second is “the plural of anecdote is data”. They are pretty much opposites. You can find more information on the two quotes here.

I often see the former quote used to dismiss anecdotal evidence and demand non-anecdotal data. We can all agree that N>1 is better than N=1, all else equal, but I believe the latter quote is the better one, i.e. that the plural of anecdote is data. Focusing on data as the aggregation of individual observations makes you think much more critically about what is part of the data and what is not, what types of measurement error you are most likely dealing with, and so on.

Similarly, it is easier to think about cherry-picking and other selection biases when we consider an anecdote the singular of data. A single data point is of little relevance in and of itself, but the mere aggregation of such data points is neither sufficient nor even necessary to say that we are no longer looking at anecdotes.

How to improve your figures #8: Make probabilities tangible

Statistics is about learning from data in the context of uncertainty. Often we communicate uncertainty in the form of probabilities. How should we best communicate such probabilities in our figures? The key point in this post is that we should not simply present probabilities as bare numbers. Instead, we need to work hard on making our numbers tangible.

Why is it not sufficient to simply present probability estimates? Probabilities are difficult because people interpret them differently. When people hear that a candidate is 80% likely to win an election, some will see that as a much more likely outcome than others. In other words, there are uncertainties in how people perceive uncertainties. We have known for decades that people assign very different probabilities to different probability terms (see e.g. Wallsten et al. 1986; 1988): the meaning of a term such as “nearly certain” will for one person be close to 100% and for another closer to 50%.

To make matters worse, risks and probabilities can be expressed in different ways. Consider the study that showed how the show “13 Reasons Why” was associated with a 28.9% increase in suicide rates. The study sounded much more interesting because the change was reported in relative terms rather than as an increase from 0.35 in 100,000 to 0.45 in 100,000 (see also this article on why you should not believe the study in question). To illustrate such differences, @justsaysrisks reports the absolute and relative risks from different articles communicating research findings.

In David Spiegelhalter’s great book, The Art of Statistics: Learning From Data, he looks at how the risk of getting bowel cancer increases by 18% for a group of people who eat 50g of processed meat a day. In Table 1.2, Spiegelhalter shows how a one-percentage-point difference between two groups can be turned into a relative increase of 18%:

Method                      Non-bacon eaters      Daily bacon eaters
Event rate                  6%                    7%
Expected frequency          6 out of 100          7 out of 100
                            (i.e. 1 in 16)        (i.e. 1 in 14)
Odds                        6/94                  7/93

Comparative measures
Absolute risk difference    1%, or 1 out of 100
Relative risk               1.18, or an 18% increase
‘Number Needed to Treat’    100
Odds ratio                  (7/93) / (6/94) = 1.18

As you can see, event rates of 6% and 7% in the two groups, with an absolute risk difference of 1%, can be turned into a relative increase of 18% (with an odds ratio of 1.18). Spiegelhalter’s book provides other good examples, and I can highly recommend it.
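
To make the arithmetic explicit, here is a minimal sketch in R computing each comparative measure from the two event rates (note that the rounded rates of 6% and 7% give a relative risk of about 1.17; the 1.18 figure comes from the less rounded numbers in the underlying study):

    # Event rates from Spiegelhalter's Table 1.2
    p0 <- 0.06  # non-bacon eaters
    p1 <- 0.07  # daily bacon eaters

    p1 - p0                            # absolute risk difference: 0.01, i.e. 1 out of 100
    1 / (p1 - p0)                      # 'Number Needed to Treat': 100
    p1 / p0                            # relative risk: ~1.17 with these rounded rates
    (p1 / (1 - p1)) / (p0 / (1 - p0))  # odds ratio: (7/93) / (6/94) = ~1.18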

Accordingly, probabilities are tricky and we need to be careful in how we communicate them. We have seen a lot of discussion on how best to communicate electoral forecasts (if the probability that a candidate will win more than 50% of the votes is 85%, how confident will people be that the candidate will win?). One great suggestion offered by Spiegelhalter in his book is not to think about percentages per se, but rather to make probabilities tangible by showing the outcomes for, say, 100 people (or 100 elections, if you are working on a forecast).

To do this, we use unit charts to show counts of a variable. Here, we can use a 10×10 grid where each cell represents one percentage point. A specific version of the unit chart is an isotype chart, where we use icons or images instead of simple shapes.

There is evidence that such visualisations work better than simply presenting the information numerically. Galesic et al. (2009) show, in the context of medical risks, how icon arrays increase the accuracy of the understanding of risks (see also Fagerlin et al. 2005 on how pictographs can reduce the undue influence of anecdotal reasoning).

When we hear that the probability that the Democrats will win an election is 75%, we think about the outcome of a single election and simply note that one outcome is much more likely than the other. However, when we use an isotype chart showing 100 outcomes, 75 of them won by the Democrats, we make the 25 Republican outcomes out of 100 much more salient.

There are different R packages you can use to make such visualisations, e.g. waffle and ggwaffle. In the figure below, I used the waffle package to show a (hypothetical) election in which the Democrats have a 75% probability of winning.
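
For reference, here is a minimal sketch of the kind of code behind such a figure (the labels and colours are illustrative choices, not necessarily those of the figure itself):

    library(waffle)

    # 10x10 grid: each square represents one percentage point
    waffle(
      c("Democrats (75%)" = 75, "Republicans (25%)" = 25),
      rows   = 10,
      colors = c("#2E74C0", "#CB454A"),
      title  = "Probability of winning the (hypothetical) election"
    )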

There are many different ways to communicate probabilities. However, try to avoid simply presenting numerical probabilities or odds ratios in your figures, and consider how you can make the probabilities more tangible and easier for the reader to process.

Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found, together with the main advice/rules offered in each (do consult the individual papers for examples and explanations).

Notably, there is considerable overlap in the advice the papers give. I do not agree with everything (and, as you can see, they are written for people working with tabular data, definitely not for people with more modern workflows involving several gigabytes of data stored in NoSQL databases). Overall, however, I can highly recommend taking most of these recommendations on board.

Lastly, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science, which deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended (see the sketch after this list).
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.

Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name.
3. Write dates as YYYY-MM-DD
4. Fill in all cells and use some common code for missing data (see the example after this list).
5. Put just one thing in a cell
6. Don’t use more than one row for the variable names
7. Create a data dictionary
8. No calculations in the raw data files
9. Don’t use font color or highlighting as data
10. Make backups
11. Use data validation to avoid errors
12. Save the data in plain text files
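
As a small illustration of several of these rules (consistent subject identifiers, YYYY-MM-DD dates, a fixed code for missing values, one thing per cell), here are a few made-up rows of a well-behaved plain-text data file:

    subject_id,visit_date,glucose_level,notes
    mouse_01,2018-03-12,95,baseline measurement
    mouse_02,2018-03-12,NA,sample lost
    mouse_01,2018-04-02,103,follow-up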

Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
2. Start with prototypes and expand them in short development cycles
3. Look for opportunities for code reuse
4. Modularize your code
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
9. Grow your libraries and tools organically from your research
10. Go explore and be rigorous when you publish

Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the license explicit.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project (see the sketch after this list).
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Backup (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.
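
As a sketch of the directory layout described under rule 4 (the project name is a placeholder):

    myproject/
        doc/      text documents associated with the project
        data/     raw data and metadata
        results/  files generated during cleanup and analysis
        src/      project source code
        bin/      external scripts or compiled programs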

Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end (see the sketch after this list).
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– Manage tasks with a task management system.
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Make your functions shy.
– Order your functions for linear reading.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.
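
As an illustration of the automation rules, a minimal sketch of such a run-everything script in R (the file names are hypothetical):

    # run_all.R: executes the full pipeline from raw data to final tables and figures
    source("code/01_clean_raw_data.R")
    source("code/02_build_analysis_data.R")
    source("code/03_estimate_models.R")
    source("code/04_make_figures_and_tables.R")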

Potpourri: Statistics #76

Introduction to Deep Learning — 170 Video Lectures from Adaptive Linear Neurons to Zero-shot Classification with Transformers
The Identification Zoo: Meanings of Identification in Econometrics
Why you sometimes need to break the rules in data viz
A Concrete Introduction to Probability (using Python)
R packages that make ggplot2 more beautiful (Vol. I)
R packages that make ggplot2 more powerful (Vol. II)
Estimating multilevel models for change in R
Static and dynamic network visualization with R
Open Source RStudio/Shiny on AWS Fargate
Functional PCA with R
When Graphs Are a Matter of Life and Death
Python Projects with Source Code
The Stata workflow guide
15 Tips to Customize lines in ggplot2 with element_line()
7 Tips to customize rectangle elements in ggplot2 element_rect()
8 tips to use element_blank() in ggplot2 theme
Introduction to Machine Learning Interviews Book
Introduction to Python for Social Science
21 Must-Read Data Visualization Books, According to Experts
Introduction to Modern Statistics
The Difference Between Random Factors and Random Effects
Creating a figure of map layers in R
Reasons to Use Tidymodels
Professional, Polished, Presentable: Making great slides with xaringan
Polished summary tables in R with gtsummary
Top 10 Ideas in Statistics That Have Powered the AI Revolution
The New Native Pipe Operator in R
RMarkdown Tips and Tricks
Iterative visualizations with ggplot2: no more copy-pasting
Scaling Models in Political Science
Setting up and debugging custom fonts
A Pirate’s Favorite Programming Language
Tired: PCA + kmeans, Wired: UMAP + GMM
Three simple ideas for better election poll graphics
Exploratory Functional PCA with Sparse Data
Efficient simulations in R
The Beginner’s Guide to the Modern Data Stack
How to become a better R code detective?
A short introduction to grammar of graphics (via ggplot2)
Workflows for querying databases via R
A Handbook for Teaching and Learning with R and RStudio
Writing reproducible manuscripts in R
ggpairs in R - A Brief Introduction to ggpairs


Previous posts: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75

Assorted links #5

121. A Supercut of Supercuts: Aesthetics, Histories, Databases
122. UbuWeb: Film & Video
123. The Behavioral Economics Guide 2021
124. Climate Solutions 101
125. On Noise (the book)
126. PSY 1 | Introduction to Psychological Science | Lectures
127. Why Are Gamers So Much Better Than Scientists at Catching Fraud?
128. A Visual Guide to The Big Lebowski
129. Louvre site des collections
130. Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
131. Notes from a Moab Trailer
132. Boris Johnson Knows Exactly What He’s Doing
133. What does global warming spell for you… and your loved ones?
134. Day of Rage: An In-Depth Look at How a Mob Stormed the Capitol
135. We Are Living in a Climate Emergency, and We’re Going to Say So
136. Obama’s Words
137. The Problem With Bo Burnham’s Inside
138. Minard’s Famous “Napoleon’s March” Chart – What It Shows, What It Doesn’t
139. Story of Sosumi & the Mac Startup Sound
140. The 9/11 Pager Leaks
141. The 1904 Olympic Marathon May Have Been the Strangest Ever
142. The Library for The Study of What We Are
143. Creative Destruction: The Structural Consequences of Scientific Curation
144. Bear plus snowflake equals polar bear
145. Why you should put salaries on your job ads
146. My current HTML boilerplate
147. John McAfee Fled to Belize, But He Couldn’t Escape Himself
148. Myanmar Burning
149. 100 Things to Know
150. Life tips


Previous posts: #1 #2 #3 #4

New article in Party Politics: Party activism in the populist radical right

In the new issue of Party Politics, you will find an article I have written together with Paul Whiteley, Matthew Goodwin and Harold Clarke. The article deals with the predictors of party activism within the populist radical right.

Here is the abstract:

Recent decades have seen an upsurge of interest in populist radical right (PRR) parties. Yet despite a large body of research on PRR voters, there are few studies of the internal life of these parties. In particular, there is a dearth of research about why people are active in them. This article uses data from a unique large-scale survey of United Kingdom Independence Party (UKIP) members to investigate if drivers of voting support for these parties are also important for explaining party activism. Analyses show that traditional models of party activism are important for understanding engagement in UKIP, but macro-level forces captured in an expanded relative deprivation model also stimulate participation in the party. That said, macro-level forces are not the dominant driver of activism.

You can find the paper here. Kai Arzheimer provides some great thoughts on the paper here.

Replace equations with code

Here is a suggestion: In empirical research, academics should move equations from the methods section to the appendix and, if anything, show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without having to read any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical of how empirical research often includes equations for the sake of … including equations.

I have a theory that academics, and political scientists in particular, put equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, the equations are not needed and are often there only to impress reviewers and peers, who are of course the same people (hence, peer review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code instead of equations. Take this example from Twitter of the equation and code for a binomial regression model (estimated with glmer()):
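
The tweet contrasts a model equation, extracted from a fitted model, with the few lines of code used to estimate it. As a minimal sketch of that contrast (the variable and data names are hypothetical, not those from the tweet), an equation like Pr(y_ij = 1) = logit^-1(b0 + b1*x_ij + u_j), with u_j ~ N(0, sigma_u^2), corresponds to nothing more than:

    library(lme4)

    # binomial (logistic) regression with a random intercept u_j for each group
    m <- glmer(
      y ~ x + (1 | group),
      data   = d,
      family = binomial(link = "logit")
    )
    summary(m)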

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. Not only that, I also find it easier to think of potential alternatives to the regression model, e.g. how easily I can change the functional form and see how such changes affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everybody who understands the equation above will also be able to understand the few lines of code. And when people use Stata, the code is often even easier to read (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be much better off by showing and citing code instead of equations. Accordingly, please replace equations with code.

PPEPE added to Party Facts

Our data on populist parties in European Parliament elections (PPEPE), from our article in Electoral Studies, is now linked to Party Facts. You can find it here.

Party Facts, for people not already familiar with this amazing resource, links several datasets on political parties to make it easier for researchers to work with such data. At the time of writing, Party Facts covers 5,765 core parties from 224 countries.

This means that it is now possible to link our data to several other high-quality political datasets such as ParlGov, V-Party, CHES, etc. You can find an example in R showing how you can easily link our PPEPE data to ParlGov, the Manifesto Project and CHES here.

New article in Personality and Individual Differences: Personality in a pandemic

In the July issue of Personality and Individual Differences, you will find an article I have co-authored with Steven G. Ludeke, Joseph A. Vitriol and Miriam Gensowski. In the paper, titled Personality in a pandemic: Social norms moderate associations between personality and social distancing behaviors, we demonstrate when Big Five personality traits are more likely to predict social distancing behaviors.

Here is the abstract:

To limit the transmission of the coronavirus disease 2019 (COVID-19), it is important to understand the sources of social behavior for members of the general public. However, there is limited research on how basic psychological dispositions interact with social contexts to shape behaviors that help mitigate contagion risk, such as social distancing. Using a sample of 89,305 individuals from 39 countries, we show that Big Five personality traits and the social context jointly shape citizens’ social distancing during the pandemic. Specifically, we observed that the associations between personality traits and social distancing behaviors were attenuated as the perceived societal consensus for social distancing increased. This held even after controlling for objective features of the environment such as the level of government restrictions in place, demonstrating the importance of subjective perceptions of local norms.

You can find the article here. The replication material is available on Harvard Dataverse and GitHub.