Potpourri: Statistics #72 (Monty Hall problem)

Monty Hall Simulations
Making the Monty Hall problem weirder but obvious
The Intuitive Monty Hall Problem
The psychology of the Monty Hall problem: Discovering psychological mechanisms for solving a tenacious brain teaser
The Collider Principle in Causal Reasoning: Why the Monty Hall Dilemma Is So Hard
Rationality, the Bayesian standpoint, and the Monty-Hall problem
Josh Miller’s alternative, more intuitive, formulation of Monty Hall problem
Monty Hall problem solved in tidyverse


Previous posts: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71

How to improve your figures #4: Show labels

This is a brief follow-up post to my previous post with advice on how you can improve your figures. Can it be worse than showing variable names instead of actual labels on your figures? Yes. You can have no labels at all.

Take a look at this article. It’s great and includes references to a lot of good material. However, for both figures provided in the article, it is not clear what exactly the y-axis is showing. Take the figure with the Fragile States Index as an example. What is a value of 5? What is a value of 6? Ideally, the figure should show that without you having to track down the source material (also, do notice how only countries doing better than the US are included, which makes the US look worse than it actually is).

I see missing labels now and then. Consider, for example, this article from Comparative Political Studies. Here is Figure 1 (and its title):

What you see is that there is no information on what is shown on the respective axes. The only thing you have is a series of numbers. “Satisfaction With Government, Perception of the Economy, and Clarity of Responsibility”, sure, but what is on the x-axis? What is on the y-axis? Figure 2 in the paper is similar to Figure 1, only with “Trust in Parliament” instead of “Satisfaction With Government”.

This is an extreme example but my recommendation is simple: Make sure that you always have labels that tell the reader what the figure is actually showing. Having no labels is highly problematic, showing variable names is problematic and showing informative labels is great.

Resources with research writing advice

I was going through a few resources with some good advice on writing research papers. Might be of interest to some of you:

How to write a great research paper (Simon Peyton Jones from Microsoft gives seven suggestions for how to improve your research papers)
Ten simple rules for structuring papers (Table 1 in the paper gives a good summary of the ten “rules”)
Writing Empirical Articles: Transparency, Reproducibility, Clarity, and Memorability (some good advice on how to write good science, e.g. increasing the transparency)
Writing a scientific paper, step by painful step (I’m not a fan of some of the suggestions, such as organising p-values, but overall a lot of good advice)
Robert’s Rules: Suggestions for Writing (motivational piece, e.g. “Write fast, in multiple drafts”)
10 Tips on How to Write Less Badly (especially relevant for people starting in grad school)
How to construct a Nature summary paragraph (good example on how to write a summary paragraph)
Publication, Publication (Gary King on how to structure a publishable paper)
Writing Tips for Ph. D. Students (a must read)
Common Expositional Problems in Students’ Papers and Theses (great set of advice — including some detailed advice on details)
Doing a Literature Review (recommended reading if you are doing a literature review)
Managing Your Research Pipeline (practical advice on how to structure multiple papers)
Mathematical Writing (+100 pages subject-specific advice on mathematical writing)
Three Templates for Introductions to Political Science Articles (great templates for introductions)
Writing Guide (Daniel Simons’ recommendations – including a good revision worksheet)
Of Publishable Quality: Ideas for Political Science Seminar Papers (great paper on how to think about ideas for research projects)
Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts (another great paper on rookie mistakes in empirical papers)
How to Read (and Understand) a Social Science Journal Article (focus is on reading an article but also relevant for writing)

Ten great R functions

Here are ten R functions that have saved me a lot of time over the years.

1. forcats::fct_reorder()

The forcats package has a lot of great functions. The one I use the most is the fct_reorder() function. I have also seen David Robinson using it a lot in his YouTube videos (I recommend his videos in this post).

The function is useful for changing the order of the levels in a factor variable, e.g. if you want the bars in a bar chart to be ordered by their values rather than alphabetically.
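Here is a minimal sketch of how this can look. The data frame and its values are made up for illustration, and the plot is just one way to show the reordered factor:

```r
library(ggplot2)
library(forcats)

# Hypothetical example data: number of seats per party (made-up values)
df <- data.frame(
  party = factor(c("A", "B", "C", "D")),
  seats = c(12, 48, 5, 27)
)

# Reorder the levels of party by the values in seats (ascending by default)
df$party <- fct_reorder(df$party, df$seats)

# Inspect the new order of the levels
levels(df$party)

# The bars are now ordered by value instead of alphabetically
ggplot(df, aes(x = party, y = seats)) +
  geom_col() +
  coord_flip()
```

If you want the largest value first, fct_reorder() also has a .desc argument you can set to TRUE.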

2. countrycode::countrycode()

I have lost count of the number of times I have used the countrycode package. If you are doing comparative research and not using the countrycode() function, you are in for a treat.

In a lot of datasets you will not have the full country name (e.g. Denmark), but something like ISO 3166-1 alpha-2 codes (e.g. DK). The countrycode() function can easily return country names based on ISO codes (or vice versa). Here is an example:

library(countrycode)

# Convert ISO 3166-1 alpha-2 codes to full country names
countrycode(c("DK", "SE"),
            origin = "iso2c",
            destination = "country.name")

This code will return Denmark and Sweden. As you can see, you simply provide the “origin” (i.e. the type of data you have) and the “destination” (i.e. the type of data you would like). I especially find this function useful when I need to merge datasets with different country variables and when I want to present full country names in a visualisation instead of ISO codes.

Last, if you are working on a country-level dataset, make sure that it is easy to match the countries with any of the variables available in the countrycode package.
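To illustrate the merging use case, here is a small sketch. The two data frames and their values are made up, and using ISO 3166-1 alpha-3 codes as the shared key is just one possible choice:

```r
library(countrycode)

# Hypothetical datasets with different country identifiers (made-up values)
df_names <- data.frame(country = c("Denmark", "Sweden"), x = c(1, 2))
df_codes <- data.frame(iso2c = c("SE", "DK"), y = c(10, 20))

# Add a shared ISO 3166-1 alpha-3 key to both datasets
df_names$iso3c <- countrycode(df_names$country,
                              origin = "country.name",
                              destination = "iso3c")
df_codes$iso3c <- countrycode(df_codes$iso2c,
                              origin = "iso2c",
                              destination = "iso3c")

# Merge on the shared key
merged <- merge(df_names, df_codes, by = "iso3c")
merged
```

The point of converting both datasets to the same code is that the merge no longer depends on how the countries happen to be spelled in each source.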

3. tidyr::separate_rows()

I recently had to work with a dataset where each country had several priorities in relation to the Sustainable Development Goals (SDGs). However, there was only one SDG variable with information on the relevant SDGs for each country. The separate_rows() function is great for turning such data into multiple rows, one for each SDG.

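library(tidyr)

# Each country has all of its SDGs stored in a single string
df <- tibble(
  country = c(1, 2),
  SDG = c("SDG 5,SDG 17,SDG 3", "SDG 1,SDG 2,SDG 3")
)

# Split the SDG variable on the comma, giving one row per SDG
df %>% separate_rows(SDG,
                     sep = ",",
                     convert = TRUE)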

The sep argument specifies the separator used to split the information (in this case a comma). The code will return a tibble with two variables and six observations.

4. tidyr::crossing()

I often use the crossing() function when I need to create a data frame from scratch. For example, if you need to create a country-year data frame for a few countries from 1965 to 2021, you can create a data frame where each country has a row for each year. Here is an example:

library(tidyr)

# One row for each combination of country and year
crossing(country = c("Denmark", "Sweden"),
         year = 1965:2021,
         value = NA_real_)

5. stringi::stri_reverse()

I had to scrape a PDF file but the text I got from the document was reversed, e.g. ‘Agriculture’ was ‘erutlucirgA’. There are probably several ways to fix this, but the function stri_reverse() in the stringi package did the trick. Here is a simple example:

x <- "snoitcnuf R taerg neT"

stringi::stri_reverse(x)

And what we get is: “Ten great R functions”.

6. purrr::reduce()

The reduce() function is a great way to collapse repetitive piping. There is a good blog post on the function here. To illustrate, when I merged several data frames into one large data frame, I used to write multiple lines of left_join(). With reduce(), this can be done in a single call:

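library(dplyr)
library(purrr)

# df_1 to df_4 are assumed to be existing data frames with iso2c and year
reduce(list(df_1, df_2,
            df_3, df_4),
       left_join,
       by = c("iso2c", "year"))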

The code will left join all data frames on the iso2c and year variables.
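To make the comparison concrete, here is a small self-contained sketch. The three data frames and their values are made up for illustration:

```r
library(dplyr)
library(purrr)

# Hypothetical country-year data frames (made-up values)
df_a <- tibble(iso2c = c("DK", "SE"), year = 2020, gdp = c(1, 2))
df_b <- tibble(iso2c = c("DK", "SE"), year = 2020, pop = c(3, 4))
df_c <- tibble(iso2c = c("DK", "SE"), year = 2020, co2 = c(5, 6))

# The repetitive version: one left_join() per data frame
joined_pipe <- df_a %>%
  left_join(df_b, by = c("iso2c", "year")) %>%
  left_join(df_c, by = c("iso2c", "year"))

# The reduce() version: the same joins, applied pairwise from left to right
joined_reduce <- reduce(list(df_a, df_b, df_c),
                        left_join,
                        by = c("iso2c", "year"))

# Check that the two approaches give the same result
all.equal(joined_pipe, joined_reduce)
```

The reduce() version stays the same length no matter how many data frames you add to the list.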

7. dplyr::distinct()

If you have multiple rows for each unit in a data frame, e.g. several rows per country, but want a unique row for each country, you can use the distinct() function to get distinct rows. In the example below we have four rows, but we turn them into a data frame with one row for each distinct value of the variable x. With .keep_all = TRUE, the other variables are kept, using the first row for each value of x.

library(dplyr)

df <- tibble(
  x = c(1, 1, 2, 2),
  y = c(1, 1, 2, 4)
)

# Keep the first row for each distinct value of x
df %>% dplyr::distinct(x, .keep_all = TRUE)

8. fuzzyjoin::regex_left_join()

The regex_left_join() function from the fuzzyjoin package is great if you need to merge data frames based upon a regular expression. I found this useful when I had to join data frames with different country names.

Here is a simple example where we join two data frames and the regular expression matches the rows for both “Denmark” and “denmark”.

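library(dplyr)

df_1 <- data.frame(
  country = c("Denmark", "denmark"),
  year = 2020:2021
)

# data_frame() is deprecated, so data.frame() is used here; the single
# pattern is recycled so that it is paired with both values of type
df_2 <- data.frame(regex_country = c("[Dd]enmark"),
                   type = 1:2)

# Both spellings match the pattern, so each row in df_1 is joined
# to both rows in df_2
df_1 %>%
  fuzzyjoin::regex_left_join(df_2, by = c(country = "regex_country"))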

9. ggplot2::labs()

I used to look up the theme() function when I had to remove the title of a legend, or use scale_x_continuous() if I had to change the title of the x-axis. Not anymore. The labs() function is an easy way to change the labels in your figure. You can also use it to change the title and subtitle of your figure. Highly recommended.
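Here is a minimal sketch using the mtcars data that ships with R. The label text is only illustrative:

```r
library(ggplot2)

# All labels are set in one place with labs()
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(
    title = "Heavier cars use more fuel",
    subtitle = "32 cars from the 1974 Motor Trend data",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    colour = NULL  # NULL removes the title of the legend
  )

p
```

Setting a label to NULL removes it, so there is no need to go through theme() just to drop a legend title.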

10. tidyr::drop_na()

When I check some of my old code, I often see lines like this:

df %>% 
  filter(!is.na(var1))

However, there is a much easier way to do this, namely using the drop_na() function.

df %>% 
  drop_na(var1)

This is not only much easier to write than having to rely on two functions, but also a lot easier to read.

Assorted links #3

61. The American Abyss
62. 52 things I learned in 2020
63. The psychology of torture
64. Almost Wikipedia: Eight Early Encyclopedia Projects and the Mechanisms of Collective Action
65. 96 hour No-Sleep Challenge
66. Internet 3.0 and the Beginning of (Tech) History
67. Drink Me: The Kremlin’s Long, Evil History of Poisoning Its Enemies
68. When you browse Instagram and find former Australian Prime Minister Tony Abbott’s passport number
69. A Brief History of Word Games
70. This Is Why Your Holiday Travel Is Awful
71. The Problem With ‘Hey Guys’
72. Pain & Gain
73. We Need To Take CO2 Out Of The Sky
74. A Pretty-Good Mathematical Model of Perfectionism
75. How I work
76. What if it’s a big hoax and we create a better world for nothing?
77. A Beginner’s Garden of Chess Openings
78. Good sleep, good learning, good life
79. 64 Reasons To Celebrate Paul McCartney
80. The Making of “The Godfather”—Sort of a Home Movie
81. Why Hades is Polygon’s game of the year
82. The Big Here and Long Now
83. Pure Skill Minesweeper
84. Anticipatory Procrastination
85. Trading time for money
86. Radical Jokes (archive.org)
87. How To F#€k Up An Airport
88. Things I Learnt in 2020
89. My year in data
90. A list of popular/awesome video games, add-ons, maps, etc. hosted on GitHub


Previous posts: #1 #2

25 interesting facts #6

126. Climate change may have been an important factor in the outbreak of COVID-19 (Beyer et al. 2021)

127. On average, people underestimate how much their conversation partners enjoy their company (Boothby et al. 2018)

128. There are at least 137 design mistakes you can make in your PowerPoint presentations (Kosslyn et al. 2012)

129. Voting rights do not affect the political maturity of adolescents (Bergh 2013)

130. People consume more when the cost is split, resulting in a substantial loss of efficiency (Gneezy et al. 2004)

131. In Congo, increased tax enforcement substantially raised political participation and trust in city government (Weigel 2020)

132. Boredom leads to endorsement of more extreme political orientations (Van Tilburg and Igou 2016)

133. People tend to pursue urgency over importance when faced with choices between tasks (Zhu et al. 2018)

134. Radical right parties benefit more from malicious social media bots than other party families (Silva and Proksch 2020)

135. Lay estimates of genetic influence match heritability estimates from twin studies (Harden 2021)

136. In soccer, because of spectators, referees favour home teams when awarding yellow and red cards (Dawson and Dobson 2010; Pettersson-Lidbom and Priks 2010)

137. Authoritarian regimes that emerge out of violent social revolution have a greater longevity (Lachapelle et al. 2020)

138. Individuals with Dark Triad traits (Machiavellianism, Narcissism, Psychopathy) more frequently signal virtuous victimhood (Qian et al. 2020)

139. The Dunning-Kruger effect is (mostly) a statistical artefact (Gignac and Zajenkowski 2020)

140. The social cost of carbon is estimated to be US$417 per tCO2 (Ricke et al. 2018)

141. Declines in biodiversity have resulted in declines in human quality of life (Brauman et al. 2020)

142. Personal experiences with the weather and extreme weather events matter for climate opinions and perceptions (Choi et al. 2020, Damsbo-Svendsen 2020, Egan & Mulling 2012, Hazlett & Mildenberger 2020, McDonald et al. 2015, Motta 2020, Rudman et al. 2013, Sisco & Weber 2020 and Whitmarsh 2008)

143. Homo sapiens evolved via selection for prosociality (Hare 2017)

144. Households with solar installations are more politically active than their neighbours (Mildenberger et al. 2019)

145. People underestimate greenhouse gas emissions associated with air travel (Wynes et al. 2020)

146. Unemployment has no effect on the Big Five personality traits (Gnambs and Stiglbauer 2019)

147. Human-made mass exceeds all global living biomass on Earth (Elhacham et al. 2020)

148. Pictures by politicians in non-political settings increase audience engagement on Instagram (Peng 2020)

149. Intelligence predicts humour production ability (Greengross and Miller 2011)

150. Parents who have twins in their first parity are less likely to vote (Dahlgaard and Hansen 2020)


Previous posts: #5, #4, #3, #2, #1

Are today’s students not good enough? #2

Almost ten years ago I wrote a post in which I criticised a survey conducted by Politiken. The survey concluded that today’s students had become worse over the course of 5-10 years.

There were a number of methodological problems with this type of survey, which I already covered ten years ago. What I want to highlight in this post is a study published in Science Advances that can explain why people believe students are worse today, even if this is not necessarily the case.

The study shows that different mechanisms can explain why we are inclined to believe that today’s students are worse. First, we tend to notice the limitations of others when we ourselves are better, and as we improve (for example as teachers), we will increasingly see students as worse. Second, we tend to project our own current qualities onto earlier cohorts of students. These two mechanisms can quite easily explain why students today seem worse than before. It therefore simply makes no sense to just ask teachers whether students today are worse than they were, say, 10 years ago.

Another interesting finding in the study is that people with particular traits think that today’s youth do worse on exactly those traits. For example, more authoritarian people think that young people today show too little respect for their elders, intelligent people think that young people are less intelligent, and well-read people are convinced that young people enjoy reading less today.

Of course, it is possible that young people are worse today than five years ago (how, for example, will the many young people who have been hit hard by COVID-19 perform through the rest of their education?), but, as the study above shows, this is not something you can document simply by asking teachers whether today’s students are worse.