The plural of anecdote

There are two very different quotes. The first is “the plural of anecdote is not data”. The second is “the plural of anecdote is data”. They are pretty much opposites. You can find more information on the two quotes here.

I often see the former quote being used to dismiss anecdotal evidence and ask for non-anecdotal data. We all agree that N>1 is better than N=1, all else equal, but I believe the latter quote is better, i.e. that the plural of anecdote is data. The focus on data as the aggregation of individual observations makes you think a lot more critically about what is part of the data – and what is not part of the data, what types of measurement error you are most likely dealing with, etc.

Similarly, it is easier to think about cherry picking and other selection biases when we consider anecdote the singular of data. A single data point is of little relevance in and by itself, but the mere aggregation of such data points is not sufficient – or even necessary – to say that we are no longer looking at anecdotes.
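
To make the point about selection concrete, here is a small simulation in R. It is only a sketch with made-up numbers: the population, the cutoff for what counts as a "striking" outcome, and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical population of outcomes with a true mean of 0.
population <- rnorm(100000, mean = 0, sd = 1)

# 1,000 "anecdotes" that were only reported because the outcome was striking.
cherry_picked <- sample(population[population > 1], size = 1000)

# 30 observations drawn at random from the same population.
random_sample <- sample(population, size = 30)

mean(cherry_picked)  # far above the true mean, despite N = 1000
mean(random_sample)  # close to the true mean, despite N = 30
```

The larger sample is the worse one here, because what matters is how the observations ended up in the data and not how many of them there are.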

Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found, together with the main advice/rules offered in each paper (do consult the individual papers for examples and explanations).

Notably, the advice in the papers overlaps. I do not agree with everything (and as you can see, the papers are written for people working with tabular data, not for people working with more modern workflows with several gigabytes of data stored in NoSQL databases). However, overall, I can highly recommend that you take most of these recommendations on board.

Last, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science. This paper deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.
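
Rules 15 and 16 are easy to skip, so here is a minimal sketch in R of what they can look like in practice. The variable age and the cut points are hypothetical and only serve as an illustration.

```r
# Hypothetical raw variable with two missing values.
age <- c(23, 35, NA, 61, 47, NA, 18)

# Recode into a new variable whose name and labels say what it measures.
age_group <- cut(age,
                 breaks = c(17, 29, 59, Inf),
                 labels = c("18-29", "30-59", "60+"))

# Rule 16: look at the frequencies, including missing values.
table(age_group, useNA = "always")

# Cross-tabulate old against new to confirm each value landed where intended.
table(age, age_group, useNA = "always")

# Rule 15: the recode must not create or swallow missing values.
stopifnot(sum(is.na(age_group)) == sum(is.na(age)))
```

The last line turns the visual check into an automated one, so the script stops if a later change to the recode mishandles missing data.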

Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name
3. Write dates as YYYY-MM-DD
4. Fill in all cells and use some common code for missing data.
5. Put just one thing in a cell
6. Don’t use more than one row for the variable names
7. Create a data dictionary
8. No calculations in the raw data files
9. Don’t use font color or highlighting as data
10. Make backups
11. Use data validation to avoid errors
12. Save the data in plain text files
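
A payoff of rules 1, 3 and 12 is that the data can be read without any special handling. The sketch below assumes a small, made-up plain text file with dates written as YYYY-MM-DD and NA as the single code for missing values; the column names are hypothetical.

```r
# Hypothetical contents of a plain text (CSV) file.
csv_text <- "subject_id,visit_date,score
s01,2020-03-01,4
s02,2020-03-15,NA
s03,NA,7"

# One fixed missing-value code means one na.strings entry is enough.
visits <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)

# YYYY-MM-DD dates convert without any guessing about day/month order.
visits$visit_date <- as.Date(visits$visit_date, format = "%Y-%m-%d")

# Date arithmetic now works directly.
days_between_visits <- as.numeric(diff(visits$visit_date[1:2]))
```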

Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
2. Start with prototypes and expand them in short development cycles
3. Look for opportunities for code reuse
4. Modularize your code
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
9. Grow your libraries and tools organically from your research
10. Go explore and be rigorous when you publish

Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the license explicit.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project.
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Backup (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.

Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end.
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– Manage tasks with a task management system.
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Make your functions shy.
– Order your functions for linear reading.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.
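
The advice on keys is simple to check in code. Below is a minimal sketch in R; has_valid_key is a hypothetical helper I made up for illustration (it is not from the paper), and the party data are invented.

```r
# Returns TRUE if the key columns are never missing and jointly unique.
has_valid_key <- function(data, key) {
  key_values <- data[, key, drop = FALSE]
  no_missing <- !any(is.na(key_values))
  no_duplicates <- anyDuplicated(key_values) == 0
  no_missing && no_duplicates
}

# Invented example table: one row per party within a country.
parties <- data.frame(
  country = c("DK", "DK", "SE"),
  party   = c("A", "B", "A"),
  votes   = c(10, 20, 30)
)

has_valid_key(parties, c("country", "party"))  # TRUE: the pair identifies each row
has_valid_key(parties, "country")              # FALSE: "DK" appears twice
```

Running a check like this right after cleaning, and before any merge, catches duplicated or missing keys before they silently multiply or drop rows.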

PPEPE added to Party Facts

Our data on populist parties in European Parliament elections (PPEPE), from our article in Electoral Studies, is now linked to Party Facts. You can find it here.

Party Facts, for people not already familiar with this amazing resource, links several datasets on political parties to make it easier for researchers to work with such data. At the time of writing, Party Facts covers 5765 core parties from 224 countries.

This means that it is now possible to link our data to several other high-quality political datasets such as ParlGov, V-Party, CHES, etc. You can find an example in R showing how you can easily link our PPEPE data to ParlGov, the Manifesto Project and CHES here.

New article in Electoral Studies: Populist parties in European Parliament elections

In the June issue of Electoral Studies, you will find an article I’ve written together with Mattia Zulianello. In the article, we introduce a comparative dataset on left, right and valence populist parties in European Parliament elections from 1979 to 2019. Here is the abstract:

Despite the increasing interest in populism, there is a lack of comparative and longterm evidence on the electoral performance of populist parties. We address this gap by using a novel dataset covering 92 populist parties in the European Parliament elections from 1979 to 2019. Specifically, we provide aggregate data on the electoral performance of all populist parties as well as the three ideational varieties of populism, i.e. right-wing, left-wing and valence populist parties. We show that there is significant variation both across countries as well as between the ideational varieties of populism. Most notably, while the success of left-wing and valence populists is concentrated in specific areas, right-wing populist parties have consolidated as key players in the vast majority of EU countries.

You can find the article here. I also recommend that you check out this great Twitter thread. You can find the data on GitHub and the Harvard Dataverse.

Potpourri: Statistics #64

Tidymodels: tidy machine learning in R
The Seven Key Things You Need To Know About dplyr 1.0.0
Introduction to Data Science
When Is Anonymous Not Really Anonymous?
Empirical Papers for Teaching Causal Inference
Why log ratios are useful for tracking COVID-19
Effect Sizes and Power for Interactions in ANOVA Designs
Why I’m not making COVID19 visualizations, and why you (probably) shouldn’t either
Word Rank Slope Charts
Displaying time series with R
New parsnip-adjacent packages
Exploring tidymodels With Hockey Data
Conducting and Visualizing Specification Curve Analyses
How John Burn-Murdoch’s Influential Dataviz Helped The World Understand Coronavirus
Spatial Aggregation
Bayes’ theorem in three panels
The Evolution of the American Census
How to standardize group colors in data visualizations in R
Calibrating time zones: an early bird or a night owl?
Program Evaluation

Does the statistics database Numbeo have reliable data?

Over the past few years, media outlets such as DR, TV 2, Berlingske, B.T. and others have run stories based on data from the statistics database Numbeo.

The fact-checking outlet TjekDet has taken a closer look at whether these data are reliable. In that connection, I offer, among other things, the following comment:

It may be that people who are more worried about crime are more inclined to contribute data in one city than in another. Even if the two cities had the same amount of crime, that would lead to a difference in their crime index.

In other words, I too am sceptical about how reliable the data are. Read the full article here.
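
The mechanism in the comment above can be illustrated with a small simulation in R. This is a sketch under invented assumptions, not a model of how Numbeo actually computes its index: both cities have the same distribution of worry about crime, and the only difference is who chooses to contribute. The function crime_index and its selection parameter are hypothetical.

```r
set.seed(42)

# Returns the average reported worry among residents who choose to contribute.
# `selection` is the extra contribution probability of the most worried
# residents relative to the least worried (0 means no self-selection).
crime_index <- function(n_residents, selection) {
  worry <- runif(n_residents, min = 0, max = 100)  # same in both cities
  p_contribute <- 0.01 + selection * worry / 100
  contributes <- runif(n_residents) < p_contribute
  mean(worry[contributes])
}

index_city_a <- crime_index(100000, selection = 0)     # contributors are a random subset
index_city_b <- crime_index(100000, selection = 0.04)  # worried residents contribute more
```

City B ends up with a clearly higher index than city A, even though the underlying worry is drawn from the same distribution in both cities.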

How many schools have closed?

A year ago, Altinget reported that 603 schools had closed in 10 years. However, there are indications that this figure is exaggerated and that the real number is considerably lower.

Johan Ries Møller, a master's thesis student at the University of Southern Denmark (Syddansk Universitet), has just published new data for committee-governed municipalities (udvalgsstyrede kommuner), showing that only 148 public schools (folkeskoler) were decided closed in the period 2009-2013 and 54 in the period 2013-2017. These figures are backed up by several different ways of estimating the number of school closures.

Why did Altinget arrive at a different answer, indicating that far more schools had closed? They relied on answers given to the Folketing (the Danish parliament), which use a particular technical definition of what counts as a school closure:

This figure is somewhat exaggerated, as many of the schools on the list have merely been placed under joint management or have seen a reduction in the number of grade levels. The reason they appear in the official counts is that Institutionsregisteret (the register of institutions) is based on so-called institution numbers. There is no fixed practice for whether sub-units must cancel their institution number when they are placed under joint management.

This is a fine example of scrutinising data closely and understanding exactly how they were collected and how they should be understood. Johan Ries Møller is also finishing his thesis on the effect of school closures on voter behaviour, which has some very interesting results as well.

Seinfeld

Seinfeld (1989–1998) is an excellent sitcom. Together with series such as Frasier (1993-2004) and Friends (1994-2004), it stands as one of the strongest comedy series of the 90s that can still be watched in 2017. I recently decided to rewatch the show's 173 episodes, spread over 9 seasons, and I can only recommend (re)watching it.

While watching the series, I also read Seinfeldia: How a Show About Nothing Changed Everything, which gives a meticulous account of the history behind the show. It was a little too detailed for my taste, but for true fans of the series it can also be warmly recommended.

For each episode I watched, I also chose, as with most of what I watch, to rate it on IMDb. On IMDb, short for Internet Movie Database, I thus rated each episode of Seinfeld on a scale from 1 to 10.

Fortunately, I was not the only one who chose to rewatch Seinfeld and rate every episode on IMDb. My good friend Knud did the same. This gives us a dataset that lets us put some numbers on how each of us feels about the show. Figure 1 shows the distribution of our respective ratings of all Seinfeld episodes.

Figure 1: Distribution of ratings, episodes of Seinfeld

The figure clearly shows that Knud is, on average, fonder of the individual episodes than I am. Knud's average rating is 7.46, whereas mine is 6.67. This is mainly because I have been extra hard on the episodes I consider subpar.

The lowest rating I give an episode is 2, whereas for Knud it is 3. Overall, I give 39 episodes a rating of 2, 3, 4 or 5, whereas Knud does so for only two episodes. This is also reflected in the standard deviation of our ratings, which is 1.07 for Knud and 1.88 for me.

There is, however, more relevant information here than just the distribution of ratings. It turns out that I become more sceptical over time. Figure 2 shows our ratings over time, along with the average IMDb rating for each episode across all IMDb users who have rated it.

Figure 2: Ratings of episodes over time

First, we see that Knud tracks the IMDb ratings very closely, albeit with a slightly lower average. I am clearly more positive in my ratings of the first seasons (that is, the early 90s). It was seasons 8 and 9 in particular that I struggled with. The premise of many of the episodes was simply too weak, and although they were not terrible episodes, they fell far below the standard established during the earlier seasons of the show.

The above also shows that we give only a limited number of episodes the top rating, 10 out of 10. Specifically, only eight Seinfeld episodes received a 10 from either Knud or me. Table 1 shows which episodes these are. The first column gives the season and episode, the second column the title of the episode, the third column my rating, the fourth column Knud's rating, and the fifth column the average IMDb rating.

Table 1: Episodes that at least one of us rated 10

Episode  Title                  Rating: Erik  Rating: Knud  Rating: IMDb
3.16     The Fix Up             10            9             8.5
3.17     The Boyfriend: Part 1  10            9             9.0
3.18     The Boyfriend: Part 2  10            9             8.9
4.11     The Contest            10            9             9.6
4.17     The Outing             10            9             9.4
5.20     The Hamptons           10            10            9.1
6.12     The Label Maker        10            9             8.7
7.6      The Soup Nazi          9             10            9.6

First, we see that I have given more episodes 10 out of 10. Knud has given two episodes 10 out of 10, one of which I also rated 10 (the fantastic episode The Hamptons). Second, we can see that we do not disagree about these episodes. For the eight episodes where one of us gave a 10, the other gave at least a 9. Knud gave The Soup Nazi 10 out of 10, and it has an average rating of 9.6 on IMDb. It is indeed a really good episode, but it is not as good as the show's best episodes (such as The Fix Up and The Contest).

As I said at the outset, this is an excellent sitcom. The more time passes, the greater the likelihood of meeting people who have not seen it. That is a shame, because it still holds up today and is considerably better than the sitcoms currently running on TV.

Data from SPSS to Stata and R

There was a time when people used SPSS. Today people use Stata and/or R, but unfortunately you will occasionally come across a dataset that will only open in SPSS (.sav files). There are, however, ways to get a dataset from SPSS into Stata as well as R. In this post, I describe a couple of them.

If you have SPSS installed, you can open the dataset and save it in a format that can be opened in both Stata and R. The procedure is simple: open the dataset in SPSS, go to “File”, choose “Save As” and save the file as Comma delimited (*.csv). This file can be imported into Stata with the following command (where the dataset has, of course, been saved under the name data):

insheet using "data.csv", delimiter(";")

In R, you can use the read.csv function and store the dataset in a data frame. For the same dataset as in the example above, the command is:

data = read.csv("data.csv", sep = ";")

In both cases, remember to set the correct working directory beforehand so that the file can be found. SPSS can also save datasets in Stata format (.dta), in which case you can afterwards open the dataset in Stata in the usual way. However, SPSS has crashed on me several times when I did this, so it is not something I personally can or will recommend.

If you do not have SPSS installed, there are packages that can help. In R, you can use the foreign package. First we install and load the package, and then we import the dataset into a data frame with read.spss:

install.packages("foreign")
library(foreign)
data = read.spss("data.sav", to.data.frame=TRUE)
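
Once the data are in a data frame, the foreign package can also write them to Stata format with write.dta, which gives a route from SPSS to Stata that goes through R. The sketch below uses a small made-up data frame (survey) and a temporary file in place of the imported SPSS data, so that it runs on its own. Note that write.dta writes an older Stata file format, so treat this as a starting point and check the result in your own Stata version.

```r
library(foreign)

# Made-up stand-in for a data frame imported with read.spss.
survey <- data.frame(id = 1:3, vote = c(1, 0, NA))

# Write the data frame to a Stata .dta file (a temporary file in this sketch).
dta_path <- tempfile(fileext = ".dta")
write.dta(survey, dta_path)

# Read the file back in to confirm that values and missing data survived.
roundtrip <- read.dta(dta_path)
```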

If you want to import SPSS files directly into Stata, you can use the usespss package. However, it only works on the 32-bit version, which is why it has been a couple of years since I last used it. You can read more about that approach here.