## Is there sexual harassment at Christiansborg?

In 2018, employees of Folketinget took part in a workplace assessment (APV). In Politiken, I join other experts in criticising the methodological basis of this survey. Specifically, I say:

It appears to be an incredibly poorly worded question, and I would generally be very, very careful about using it to say that there is no problem. Or the opposite, for that matter. It is in no way something you would see in a professional survey.

Read the full article here (behind a paywall).

## Political science syllabi

Over the years I have saved various syllabi in a local folder. I decided to do some digital housekeeping the other day with the aim of getting rid of the folder. Instead, I found links to the syllabi online and deleted the folder (including the ones that I couldn’t find).

This is an overview I made for myself, but I share it here in case it might be helpful to others. I provide the name of the teacher in parentheses.

### Experiments

Experimentation and Causal Inference (Thomas J. Leeper)
GOVT 83.21 / QSS 30.03: Experiments in Politics (Brendan Nyhan)
Experimental Methods in Political Science (Bethany Albertson and Mike Findley)
Experimental Design and Social Behavior (Rick Wilson)
Gov 2008 Experimental Political Science (Ryan Enos and Dustin Tingley)
G4068: Experimentation in the Social Sciences (Costas Panagopoulos)

### Political psychology

Government 2749: Political Psychology and International Relations (J. D. Kertzer)
GOVT 30: Political Misinformation and Conspiracy Theories (Brendan Nyhan)
PSCI4221: Political Psychology (Pavel Bacovsky)
Political Psychology (Michael F. Meffert)

### Parties and institutions

PSCI 3031: Political Parties & Interest Groups (Nancy Billica)

### Political communication and media

Political Communication, Media and Public Policy (Rasmus K. Nielsen)
Political Communication and Media Effects (Michael F. Meffert)

### European Union

PSCI 4302: Politics of the European Union (Joe Jupille)

### Politics

Electoral Politics (Miguel R. Rueda)

### Public opinion and political behaviour

Public Opinion & Political Communication (Daniel Flynn)
Voters, Public Opinion and Participation (Tom O’Grady)
Public opinion & political behavior (Jennifer Wolak)

### Text as data

Psych/CSCI 626: Text as Data (Morteza Dehghani)
Quantitative Text Analysis (Stefan Müller)
Automated Text Analysis in Political Science (Martijn Schoonvelde)

### Surveys

Survey Design and Analysis (Carey E. Stapleton)
Course on Design and Analysis of Sample Surveys (Andrew Gelman)
Using Surveys for Research and Evaluation (Thomas J. Leeper)

### Research design

PLSC508b: Dissertation Workshop: Research Design & Causal Inference (Chris Blattman)
Sociology 750 – Research Design and Practice in Sociology (Jeremy Freese)
Political Analysis (John Gerring)
Political Analysis: A Primer (John Gerring)
Political Science 522 Research Design and Analysis in Quantitative Research (James H. Kuklinski and Jake Bowers)
PAD 6707 Logics of Inquiry (Rick Feiock)
POL580 Methods of Political Inquiry (William Mishler)
Causal Analysis in Data Science (Tom O’Grady)

### Statistics

Statistical Inference: Linear Models, Descriptive, Causal and Statistical Inference (Jake Bowers)
POS 5747 Special Topics in Advanced Quantitative Analysis: Causality, Matching, and Multilevel Models (Jason Barabas)
PUBL0050 Advanced Quantitative Methods (Jack Blumenau)
17.835: Machine Learning and Data Science in Politics (In Song Kim)
Machine Learning for Social Scientists (Rochelle Terman)
POL-GA 1251 Quantitative Political Analysis II (Cyrus Samii)
POS 5746 Advanced Quantitative Analysis in Political Science (Jason Barabas)
POS 3930 Advanced Research Methods in Political Science (Jason Barabas)
Dynamic Analysis (Time Series Modeling in Politics) (Jan Box-Steffensmeier, John R. Freeman & Jon Pevehouse)
Political Science 688: Applied Bayesian and Robust Statistical Methods in Political Research (Jake Bowers)
GOV 2003: Topics in Quantitative Methodology (Kosuke Imai & Santiago Olivella)
POL 245: Visualizing Data (Kosuke Imai)
POL345/SOC305: Introduction to Quantitative Social Science (Margaret Frye & Kosuke Imai)
POL 451: Statistical Methods in Political Science (Kosuke Imai)
Stat186/Gov2002: Causal Inference (Kosuke Imai)
Topics in Applied Econometrics (J. Angrist & W. Newey)
Introduction to Machine Learning (Justin Grimmer)
POLS 509: The Linear Model (Justin Esarey)
17.800: Quantitative Research Methods I (Teppei Yamamoto)
17.802: Quantitative Research Methods II (Teppei Yamamoto)
17.804: Quantitative Research Methods III (Teppei Yamamoto)
17.806: Quantitative Research Methods IV (In Song Kim)
Political Science 208: Political Science Methods (Justin Esarey)
Political Science 590: Matching for Adjustment and Causal Inference
GOVT 10: Quantitative Political Analysis
Political Science 2580: Introduction to Quantitative Research Methods (Paul Testa and Marie Schenk)
Time Series Analysis (Jamie Monogan)

### Mathematics

Mathematics for Political and Social Research (i.e., Extended Math Camp) (David A. Siegel)
Mathematical Tools for Political Scientists (Miguel R. Rueda)

## Problems with the Big Five assessment in the World Values Survey #2

In 2017, I published a study in Personality and Individual Differences with Steven G. Ludeke. Our motivation for conducting the study was that other studies uncritically used the Big Five data in the World Values Survey without evaluating the reliability of the data.

In brief, and to recap, the data was unable to capture inter-individual variation in Big Five personality traits and should be used with caution. Specifically, we showed that the distribution of item-item correlations for the Big Five personality traits was unsatisfactory.
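To make the idea of an item-item correlation concrete, here is a minimal sketch. It assumes a short instrument with two items per trait on a 1-5 scale, one of them reverse-keyed. The answers are made up for illustration and are not World Values Survey data, and `pearson` and `reverse_code` are hypothetical helper names.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def reverse_code(item, low=1, high=5):
    """Flip a reverse-keyed item so that high values mean more of the trait."""
    return [low + high - value for value in item]

# Hypothetical answers from seven respondents on a 1-5 scale (not WVS data).
item_a = [1, 2, 3, 4, 5, 2, 4]      # positively keyed item
item_b_raw = [5, 3, 3, 2, 2, 4, 1]  # reverse-keyed item for the same trait

r_raw = pearson(item_a, item_b_raw)
r_recoded = pearson(item_a, reverse_code(item_b_raw))
print(round(r_raw, 2), round(r_recoded, 2))
```

If two items measure the same trait, the correlation after recoding should be clearly positive. Correlations near zero (or negative) across many countries are the kind of pattern that signals a reliability problem.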

The main reason we decided to write up the short paper was that we could see that different researchers published studies using this data. Accordingly, we hoped that people would read our study and not use the data (and thereby not cite our study).

Of course, some of the citations to our study point to the challenges with measuring Big Five traits. For example, in a study by Laajaj et al. (2019), which I have commented on in New Scientist, the authors simply point out what we find:

Related, Ludeke and Larson [sic.] (29) flag concerns with the use of the BFI-10 (30), a short 10-item Big Five instrument used in the World Values Survey, showing low correlations between items meant to measure the same PT.

We are happy to see researchers pay attention to our study and share the concern. However, what I have noticed – and what I am concerned we will see more of in the future (and the reason I write this post) – is that some researchers continue to use the data even when they are aware of the limitations/problems.

The most recent example is this study. Here is a brief description from the abstract of what they do: “Using the most recent wave of the World Values Survey, this study investigates the impact of personality on individual protest participation in 20 countries using the multilevel modelling.”

Importantly, the authors do nothing to take the problems with the data into account. Furthermore, they are aware of the problems with the data. As they write in a footnote:

A recent study done by Ludeke and Larsen (2017) points out the problems with the Big Five assessment in the WVS. However, they are not able to come up with any solution to the data challenges posed to the WVS. Given data availability, the WVS is the only choice to conduct cross-national comparative research on personality.

I disagree very much with the reasoning in the footnote.

First, the fact that we do not come up with a solution is a major red flag. People have looked into what can explain the low reliability and, similar to us, have been unable to find a solution (see, for example, this post by Rene Bekkers). Since we do not know what causes the problem, we do not have a solution that can make the data useful. Accordingly, I am not sure I understand the meaning of “however” in the argument, implying that there is a problem but it’s not really a problem. Until somebody can offer a solution (if such a solution exists), I highly recommend that you do not use the Big Five measures in the World Values Survey.

Second, the World Values Survey is not “the only choice to conduct cross-national comparative research on personality.” For example, the 2010 wave of the Latin American Public Opinion Project includes data on Big Five traits in a comparative setting (we also use this data in our study in the European Journal of Personality). There are also several studies using Big Five traits in a comparative perspective (e.g. Curtis and Nielsen 2018, Gravelle et al. 2020 and Vecchione et al. 2011). In other words, the World Values Survey is clearly not the “only choice to conduct cross-national comparative research on personality”. My recommendation is that if you have no choice but to use the Big Five data in the World Values Survey, limit the data to the Netherlands and Germany (where the reliability measures are satisfactory).

My general recommendation: Stop using the Big Five measures in the World Values Survey – even if it’s a good opportunity to cite our critique (seriously, I couldn’t care less about the citations).

## Has Socialdemokratiet declined in the polls?

When the coronavirus broke out, Socialdemokratiet (the Danish Social Democrats) saw a dramatic surge in the polls. The other day, a Megafon poll showed that Socialdemokratiet “is holding on to [a] huge level of voter support”. This suggests that support for Socialdemokratiet in the polls remains high.

A new weighted average from Altinget, however, reports that “recent months' tailwind for Socialdemokratiet [has] begun to die down”. Specifically, it reports that the party's support is now at 31.5 percent.

So has support begun to die down? 31.5 percent sounds very plausible, and my latest forecast gives Socialdemokratiet 32 percent of the vote. I would argue, however, that we still need to see a larger drop before we can conclude that there is evidence of a change in support for Socialdemokratiet.

To understand this, we can look at the most recent polls for Socialdemokratiet:

Here we can see that there was a large increase in support for Socialdemokratiet during March. Since then, support has been stable at around 33 percent, with some polls higher and others lower. We have had two polls this week, putting Socialdemokratiet at 33.4 percent (Voxmeter) and 33.2 percent (Epinion), respectively. Once we take statistical uncertainty into account, it is hard to see a pattern showing any change since Socialdemokratiet's support rose.
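As a rough illustration of what statistical uncertainty means here, the sketch below computes a normal-approximation margin of error for the two polls. The sample size of 1,000 is an assumption (the actual sample sizes are not stated in this post), and the calculation treats each poll as a simple random sample, which real polls are not.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(share, n, level=0.95):
    """Half-width of a normal-approximation interval for a proportion, in percentage points."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 100 * z * sqrt(share * (1 - share) / n)

ASSUMED_N = 1000  # hypothetical sample size; the actual poll sizes are not given here

polls = {"Voxmeter": 0.334, "Epinion": 0.332}
for name, share in polls.items():
    moe = margin_of_error(share, ASSUMED_N)
    low, high = 100 * share - moe, 100 * share + moe
    print(f"{name}: {100 * share:.1f} percent, interval {low:.1f} to {high:.1f}")
```

Under these assumptions the margin is close to three percentage points, so a figure such as 31.5 percent sits comfortably inside the interval around either poll.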

It is likely that support will not remain as high as it is now, but in my view nothing suggests that anything new has happened in the polls when we look at Socialdemokratiet. My assessment is that it will take more than what we have seen so far before Socialdemokratiet declines in the polls, but this does of course not rule out that the party's support may become smaller (or larger) in the future.

## Reproduce before you replicate

There is an important distinction between reproduction and replication in scientific research. Reproduction is when you use the same data from a study to (re)produce the findings in that particular study. Replication is when you use different data to examine whether you will get the same results (i.e. cross-validation).

I often think about this distinction when I encounter research I would like to replicate. For example, when the sample size is small or the statistical models rely on arbitrary choices, I see an urgent need to replicate a study to see whether the findings generalise beyond the specific study. However, it is expensive to replicate a study. What I end up doing instead is, whenever possible, to access the data, reproduce the findings and see how robust these findings are to (simple) alternative model specifications.

In a lot of cases I find that it would be a waste of resources to even try to replicate the study of interest. Accordingly, there is a temporal order of importance when we engage with the validation of research, namely: reproduce before you replicate. If the reproduction of a study shows that the evidence presented is weak, that should factor into our considerations of whether it is worth the time and money to pursue a replication.

I was thinking about this when I saw a replication study published in the Journal of Politics (a better-than-average journal within political science). In brief, the study is unable to replicate a study published in Science. What caught my interest was this recommendation made by one of the authors on Twitter: “Show maximum caution with psychophys until these issues have been solved. Resources and careers can disappear into this black hole. This null finding costed us about $80,000, 4 years of postdoc salary and more headaches than I can count.” Why did I find this point interesting and important? Because I read the original study in Science years ago, found it interesting and decided to look into it. I then decided not to pursue any research taking this specific study seriously. Headache count: 0. In brief, I quickly decided that it would be a waste of resources to expect anything other than a null finding. (Not that there is anything wrong with replicating a null finding.) It took me a few minutes and open source statistical software which, for the record, was significantly less costly than $80,000.

The original study in Science, published in 2008, is titled ‘Political Attitudes Vary with Physiological Traits’. There are a lot of red flags that we should pay attention to, such as the data (e.g. a small sample size, total N = 46) and the statistical models (e.g. several covariates).

There is a statistically significant result. However, unsurprisingly, the results are not robust to even simple model specifications (one might say especially to simple model specifications). Interestingly, one of the key findings in the paper is only significant when we include education as a (misspecified) covariate in our model.

I will not go into too much detail here about what the paper is about. Do read the paper if you are bored, though. The only thing you need to understand here is that we are interested in the variable ‘Mean amplitude’. Model 1 in the table below reproduces the results in Table 3 in the paper. As we can see, the coefficient and the standard error confirm that there is indeed a statistically significant effect (p < .05). In Model 2, we run a simple bivariate OLS regression (i.e. without all the covariates) and see that the finding is no longer anywhere near statistical significance. In Model 3, we add all covariates except for education and see that there is still no significant effect. It’s only when we add education to the model that we find a significant effect.

One might say that this is not an issue. Maybe it is even important to take differences in education into account? Maybe. However, even if this is the case, we should make sure that we don’t treat an ordinal variable as a continuous variable in our models. Education is measured with different categories and we should take this into account in the models. I do that in Models 4 and 5 below, and we see here that there is simply no effect of ‘Mean amplitude’.
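The difference between the two ways of coding education can be sketched as follows. The category names and answers are made up for illustration, and `dummy_code` is a hypothetical helper, not the code used for the models in this post.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into 0/1 indicator columns, leaving out the reference category."""
    categories = [c for c in dict.fromkeys(values) if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical education answers; the category names are made up for illustration.
education = ["high school", "some college", "college degree", "some college", "high school"]

# Ordinal-as-continuous coding forces equal steps between categories.
as_number = {"high school": 1, "some college": 2, "college degree": 3}
linear = [as_number[v] for v in education]

# Dummy coding lets each category have its own coefficient relative to the reference.
dummies = dummy_code(education, reference="high school")
print(linear)
print(dummies)
```

With the numeric coding, a regression estimates a single slope and assumes that each step up in education shifts the outcome by the same amount. With dummy coding, each category gets its own estimate, at the cost of extra parameters, which matters with N = 46.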

| Predictors | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| Mean amplitude | 1.67 (0.75) | 0.94 (0.87) | 1.22 (0.84) | 0.97 (0.82) | 1.11 (0.75) |
| Female | -2.72 (1.46) | | -3.05 (1.65) | | -3.85 (1.46) |
| Age | 0.19 (0.10) | | 0.15 (0.12) | | 0.16 (0.10) |
| Income | -0.32 (0.50) | | -0.61 (0.56) | | -0.31 (0.48) |
| Education | -1.76 (0.50) | | | | |
| Edu: [label missing] | | | | [missing] (3.50) | 3.03 (3.29) |
| Edu: some college | | | | -4.02 (3.14) | -4.15 (2.94) |
| Edu: [label missing] | | | | [missing] (2.63) | -0.16 (2.50) |
| Edu: college degree plus | | | | -6.88 (2.25) | -5.86 (2.11) |
| N | 46 | 46 | 46 | 46 | 46 |
| R2 / R2 adjusted | 0.375 / 0.297 | 0.026 / 0.004 | 0.178 / 0.098 | 0.325 / 0.241 | 0.480 / 0.368 |

Estimates with standard errors in parentheses.
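A quick way to read the table is to divide each ‘Mean amplitude’ estimate by its standard error. The sketch below does this with the numbers reported in the table and compares the ratio to the normal cut-off of about 1.96. That cut-off is an approximation: with N = 46 the exact t cut-off is slightly larger, so the check is, if anything, lenient.

```python
from statistics import NormalDist

# Coefficient and standard error for 'Mean amplitude' in Models 1-5, copied from the table.
mean_amplitude = {
    1: (1.67, 0.75),
    2: (0.94, 0.87),
    3: (1.22, 0.84),
    4: (0.97, 0.82),
    5: (1.11, 0.75),
}

# Two-sided 5 percent cut-off from the normal distribution (about 1.96).
# With N = 46 the exact t cut-off is slightly larger, so this is a lenient check.
critical = NormalDist().inv_cdf(0.975)

ratios = {model: est / se for model, (est, se) in mean_amplitude.items()}
significant = {model: abs(ratio) > critical for model, ratio in ratios.items()}

for model, ratio in ratios.items():
    print(f"Model {model}: estimate/SE = {ratio:.2f}, above cut-off: {significant[model]}")
```

Only Model 1, the specification with education entered as a single continuous covariate, clears the cut-off.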

It is possible that there is a true effect and the study was simply limited by the small sample size. However, what I am saying here is that I was not surprised to see a failed replication of the study in question and I will be extra critical towards any similar studies and how they use covariates, especially in (political science) top journals. Notably, the replication study deals with several other topics and you should read that study in any case.

Importantly, this is not one unique episode and I have seen several papers where I would be surprised if a replication would find results consistent with the original study. Let’s take one extra example.

I was reading this working paper the other day, No Additional Evidence that Proximity to the July 4th Holiday Affects Affective Polarization, that is unable to replicate a finding in this study, Americans, Not Partisans: Can Priming American National Identity Reduce Affective Polarization?. Specifically, the finding in the original study (that the replication study is unable to replicate) is that proximity to the 4th of July reduces affective polarization.

Again, I was not surprised to see this failed replication as I reproduced the finding in the original study when it was published in the Journal of Politics (a decent journal within political science). Here is a summary of what the original study argues: “Subjects who are interviewed around July 4th should have a slightly more positive impression of the other party and its leaders, all else equal.” And here is the key finding: “those interviewed in the 14-day window around July 4th rate the opposing party’s nominee 1.9 degrees warmer than those interviewed at otherwise similar periods in early June or August.”

I am able to easily get that estimate but with simple robustness checks I was also able to see that a replication study would most likely not find a significant effect. In brief, I decided to reproduce the original study and assess how robust the findings were. One way to assess unconfoundedness that I looked into was to use a lagged outcome as the outcome. Here, we should definitely not find a treatment effect (a treatment should not be able to affect the past — an aspirin today should not give you a headache yesterday).

Importantly, the variables I needed to conduct such a robustness test were available in the replication material. This is because the original study argues that: “I can also guard against potential unobserved heterogeneity by using the feeling thermometer ratings from wave 2 (January–March 2008) as a control variable.” That’s a good point but I am not convinced by that analysis.

So, we have two waves of data in this study. Wave 2 (our placebo wave with no July 4th, interviews conducted January 1, 2008, through March 31, 2008; n = 17,747) and Wave 3 (our treatment wave with July 4th, interviews conducted April 2, 2008, through August 29, 2008; n = 20,052).

The problem is simple: When we look at the outcome measured before the treatment (i.e. look at Wave 2), we find a significant treatment effect similar to that reported as the key finding in the paper. This is illustrated in the left panel and the centre panel in the figure below (with varying window sizes around July 4th). In addition, when we control for the outcome measured in Wave 2, the main results disappear (see the right panel in the figure).
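The logic of the placebo check can be sketched with a toy panel. All numbers are made up so that the pattern described above appears by construction, and the field names (`near_july4`, `wave2`, `wave3`) are hypothetical. The last comparison uses change scores as a simple stand-in for controlling for the Wave 2 rating in a regression, which is not identical.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_means(rows, outcome):
    """Mean of `outcome` among treated rows minus the mean among control rows."""
    treated = [r[outcome] for r in rows if r["near_july4"]]
    control = [r[outcome] for r in rows if not r["near_july4"]]
    return mean(treated) - mean(control)

# Hypothetical panel: feeling thermometer in wave 2 (before July 4th) and wave 3.
wave2 = [20, 25, 30, 35, 22, 27, 32, 37]
wave3 = [21, 26, 31, 36, 23, 28, 33, 38]
near = [False] * 4 + [True] * 4
rows = [
    {"near_july4": t, "wave2": a, "wave3": b, "change": b - a}
    for t, a, b in zip(near, wave2, wave3)
]

naive = diff_in_means(rows, "wave3")      # the headline comparison
placebo = diff_in_means(rows, "wave2")    # outcome measured before the treatment
adjusted = diff_in_means(rows, "change")  # wave 3 minus wave 2 for each respondent
print(naive, placebo, adjusted)
```

In this toy example the "effect" is just as large on the outcome measured months before July 4th, and it vanishes once each respondent's earlier rating is taken into account. That is the signature of pre-existing differences between the groups rather than a treatment effect.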

Again, I was not surprised to see that researchers – when trying to replicate this finding – were unable to do so. Or, in other words, I find the replication consistent with what the data actually shows when you conduct relatively simple robustness tests.

I am not picking these examples because the research is bad. Actually, my motivation is the opposite. I found these studies important and relevant for our understanding of contemporary political behaviour and I would most likely not have engaged with the data if I believed the research was bad. Also, I am not saying that any of the authors replicating these studies did not reproduce the main findings before conducting the respective replications.

Also, as anybody who has engaged with statistics knows, it is always possible to “break a result”. (If you conduct enough robustness tests, these tests will at the end of the day return some false positives just by pure chance.) I am not aware of any finding within political science that cannot “break” at some point. And just looking for an insignificant result and writing a blog post about that would be p-hacking as well. However, the reason I find the two examples above interesting is that we are pursuing very simple tests that I would not even call robustness tests but simply tests of the main effect. In other words, the question is not whether it is possible to find an insignificant result but when.
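The point about chance findings can be illustrated with a small calculation. It assumes that the tests are independent and that each is run at the 5 percent level, which is a simplification, since robustness tests on the same data are correlated.

```python
def chance_of_any_rejection(k, alpha=0.05):
    """Probability that at least one of k independent tests rejects a true null at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(k, round(chance_of_any_rejection(k), 3))
```

Under these assumptions, twenty tests give a chance of roughly 64 percent of at least one spurious result, which is why a single deviating test after a long search says little on its own.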

Again, I do believe that the replication studies mentioned above are worth the work and they both contribute significantly to their respective literatures. However, I do agree with the argument that not all studies are worth replicating (for more on this, see this paper by Peder Isager and colleagues). So, my recommendation: Before you replicate anything, make sure you reproduce and fully understand the robustness of the finding you want to replicate.