Ten great R functions

Here are ten R functions that have saved me a lot of time over the years.

1. forcats::fct_reorder()

The forcats package has a lot of great functions. The one I use the most is the fct_reorder() function. I have also seen David Robinson using it a lot in his YouTube videos (I recommend his videos in this post).

The function is useful for changing the order of the levels in a factor variable, e.g. if you want the bars in a bar chart to follow a meaningful order, such as from the smallest to the largest value.
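As a minimal sketch of how this can look, assuming a made-up data frame (df_bar and its values are hypothetical and only there for illustration), the levels of country are reordered by value before plotting:

```r
library("tidyverse")  # loads ggplot2, dplyr, tibble and forcats

# Hypothetical toy data, made up for illustration
df_bar <- tibble(
  country = c("Denmark", "Sweden", "Norway", "Finland"),
  value = c(3, 7, 1, 5)
)

# Reorder the levels of country by value (ascending by default)
df_bar <- df_bar %>%
  mutate(country = fct_reorder(country, value))

# Inspect the new order of the levels
levels(df_bar$country)

# The bars now follow the order of value instead of the alphabet
ggplot(df_bar, aes(country, value)) +
  geom_col()
```

If you would rather have the largest value first, fct_reorder() also takes a .desc = TRUE argument.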

2. countrycode::countrycode()

I have lost count of the number of times I have used the countrycode package. If you are doing comparative research and not using the countrycode() function, you are in for a treat.

In a lot of datasets you will not have the full country name (e.g. Denmark), but something like ISO 3166-1 alpha-2 codes (e.g. DK). The countrycode() function can easily return country names based on ISO codes (or vice versa). Here is an example:

library("countrycode")

countrycode(c("DK", "SE"),
            origin = "iso2c",
            destination = "country.name")

This code will return Denmark and Sweden. As you can see, you simply provide the “origin” (i.e. the type of data you have) and the “destination” (i.e. the type of data you would like). I especially find this function useful when I need to merge datasets with different country variables and when I want to present full country names in a visualisation instead of ISO codes.
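To sketch the merging case, here is a small example with two made-up data frames (df_iso, df_name and their values are hypothetical). One uses ISO codes and the other full country names, so a shared iso2c key is created before joining:

```r
library("tidyverse")
library("countrycode")

# Hypothetical toy data: one data frame with ISO codes, one with names
df_iso <- tibble(iso2c = c("DK", "SE"), value_a = c(1, 2))
df_name <- tibble(country = c("Sweden", "Denmark"), value_b = c(3, 4))

# Create a shared key from the country names and join on it
df_merged <- df_name %>%
  mutate(iso2c = countrycode(country,
                             origin = "country.name",
                             destination = "iso2c")) %>%
  left_join(df_iso, by = "iso2c")

df_merged
```

Converting both data frames to the same code scheme is usually safer than joining on names directly, as names are often spelled differently across sources.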

Last, if you are working on a country-level dataset, make sure that it is easy to match the countries with any of the variables available in the countrycode package.

3. tidyr::separate_rows()

I recently had to work with a dataset where each country had several priorities in relation to the Sustainable Development Goals (SDGs). However, there was only one SDG variable, with all the relevant SDGs for each country stored in a single string. The separate_rows() function is great for turning such data into multiple rows.

library("tidyverse")

df <- tibble(
  country = c(1, 2),
  SDG = c("SDG 5,SDG 17,SDG 3", "SDG 1,SDG 2,SDG 3")
)

df %>% separate_rows(SDG,
                     sep = ",",
                     convert = TRUE)

The sep argument specifies the separator used to split the information (in this case a comma). The code will return a tibble with two variables and six observations.

4. tidyr::crossing()

I often use the crossing() function when I need to create a data frame from scratch. For example, if you need to create a country-year data frame for a few countries from 1965 to 2021, you can create a data frame where each country has a row for each year. Here is an example:

crossing(country = c("Denmark", "Sweden"),
         year = 1965:2021,
         value = NA_real_)

5. stringi::stri_reverse()

I had to scrape a PDF file, but the text I got from the document was reversed, e.g. ‘Agriculture’ was ‘erutlucirgA’. There are probably several easy ways to fix this, but the stri_reverse() function in the stringi package did the trick. Here is a simple example:

x <- "snoitcnuf R taerg neT"

stringi::stri_reverse(x)

And what we get is: “Ten great R functions”.

6. purrr::reduce()

The reduce() function is a great way to collapse repetitive piping. There is a good blog post on the function here. To illustrate, when I used to merge several data frames into one large data frame, I used multiple lines of left_join(). With reduce(), this becomes a single call:

reduce(list(df_1, df_2,
            df_3, df_4), 
       left_join, 
       by = c("iso2c", "year"))

The code will left join all data frames on the iso2c and year variables.
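To make the comparison concrete, here is a small sketch with made-up data frames (df_a, df_b, df_c and their values are hypothetical) showing that the reduce() call gives the same result as the chained left_join() lines:

```r
library("tidyverse")

# Hypothetical toy data frames sharing the keys iso2c and year
df_a <- tibble(iso2c = c("DK", "SE"), year = 2020, gdp = c(1, 2))
df_b <- tibble(iso2c = c("DK", "SE"), year = 2020, pop = c(3, 4))
df_c <- tibble(iso2c = c("DK", "SE"), year = 2020, area = c(5, 6))

# Repetitive piping: one left_join() per data frame
joined_pipe <- df_a %>%
  left_join(df_b, by = c("iso2c", "year")) %>%
  left_join(df_c, by = c("iso2c", "year"))

# The same merge with a single reduce() call
joined_reduce <- reduce(list(df_a, df_b, df_c),
                        left_join,
                        by = c("iso2c", "year"))

# Check that the two approaches agree
all.equal(joined_pipe, joined_reduce)
```

The extra by argument is passed on by reduce() to every left_join() call, which is why it only needs to be written once.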

7. dplyr::distinct()

If a data frame has several rows per unit, e.g. several rows per country, but you want one row for each country, you can use the distinct() function. In the example below we have four rows, and we turn them into a data frame with one row for each distinct value of x. With .keep_all = TRUE the other variables are kept, using the first row for each value of x.

df <- tibble(
  x = c(1, 1, 2, 2),
  y = c(1, 1, 2, 4)
) 

df %>% dplyr::distinct(x, .keep_all = TRUE)

8. fuzzyjoin::regex_left_join()

The regex_left_join() function from the fuzzyjoin package is great if you need to merge data frames based on a regular expression. I found this useful when I had to join data frames with different country names.

Here is a simple example where a single regular expression matches the rows for both “Denmark” and “denmark”.

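df_1 <- data.frame(
  country = c("Denmark", "denmark"),
  year = 2020:2021
)

# tibble() replaces the deprecated data_frame(); the single pattern
# is recycled to match the length of type
df_2 <- tibble::tibble(regex_country = c("[Dd]enmark"),
                       type = 1:2)

# Match country in df_1 against the regular expressions in df_2
df_1 %>%
  fuzzyjoin::regex_left_join(df_2, by = c(country = "regex_country"))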

9. ggplot2::labs()

I used to look up the theme() function when I had to remove the title of a legend, or use scale_x_continuous() if I had to change the title of the x-axis. Not anymore. The labs() function is an easy way to change the labels in your figure. You can also use it to change the title and subtitle of your figure. Highly recommended.
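As a minimal sketch, using the mtcars dataset that ships with R (the label texts are made up for illustration), a single labs() call can set the title, subtitle and axis titles and remove the legend title:

```r
library("ggplot2")

# mtcars ships with R; the label texts are made up for illustration
p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title = "Heavier cars use more fuel",
       subtitle = "32 cars from the mtcars dataset",
       x = "Weight (1,000 lbs)",
       y = "Miles per gallon",
       colour = NULL)  # NULL removes the legend title

p
```

Setting a label to NULL removes it, so there is no need to go through theme() or a scale function for these cases.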

10. tidyr::drop_na()

When I check some of my old code, I often see lines like this:

df %>% 
  filter(!is.na(var1))

However, there is a much easier way to do this, namely using the drop_na() function.

df %>% 
  drop_na(var1)

This is not only much easier to write than having to rely on two functions, but also a lot easier to read.

How to improve your figures #2: Don’t show overlapping text labels

I was reading this study on the impact of Weberian bureaucracy on economic growth published in Comparative Political Studies. It’s a great article and I can highly recommend reading it.

I like that the study presents most of the results in figures. In fact, there are more figures than tables in the article. However, a few of the figures present several data points (countries) in scatter plots with labels to all points (country names). Here is one example:

As you can see, several country labels overlap with each other, making it difficult to read the country names. The problem is not as severe as it could have been (as the authors have made the height greater than the width, making more space for horizontal text). However, for a lot of the labels it is simply not possible to read the country names.

Importantly, this is not only about aesthetics. When several country labels overlap, it is no longer possible to see whether there are actual data points hidden by the labels.

To improve the figure, my suggestion would be to only show some of the value labels. In the figure below I have tried to only show the country names for the countries that you can actually read in the figure above.

In my view, this is a clear improvement of the original figure.

My R-script to create the figure is here:

library("tidyverse")
library("haven")

bureaucracygrowth <- read_dta("22725104_Replication_data_Bureaucracy_Growth.dta")

bureaucracygrowth %>% 
  mutate(country_name_show = case_when(
    v2stcritrecadmv9 < -1  ~ country_name,
    QoG_expert_q2_a > 6.3 | QoG_expert_q2_a < 2 ~ country_name,
    v2stcritrecadmv9 > 0.7 & QoG_expert_q2_a < 3.5 ~ country_name,
    v2stcritrecadmv9 < 1 & QoG_expert_q2_a > 4.4 ~ country_name,
    TRUE ~ ""
  )) %>% 
  ggplot(aes(v2stcritrecadmv9, QoG_expert_q2_a)) +
  geom_smooth(method = "lm", se = FALSE) +
  ggrepel::geom_text_repel(aes(label = country_name_show)) +
  geom_point() +
  theme_minimal() +
  labs(y = "Meritocratic recruitment (QoG expert-survey), 2014",
       x = "Meritocratic recruitment (V-Dem), 2014")

ggsave("bureaucracygrowth.png", width = 6, height = 6)

Figures in the ‘Expert report of 6 May 2020’

On Wednesday, all Danish media went into ‘breaking mode’ on the basis of a new report from Statens Serum Institut. The report is titled ‘Ekspertrapport af den 6. maj 2020: Matematisk modellering af COVID-19 smittespredning og sygehusbelastning ved scenarier for anden fase af genåbningen af Danmark’ (Expert report of 6 May 2020: Mathematical modelling of COVID-19 transmission and hospital burden under scenarios for the second phase of the reopening of Denmark) and can be found here.

I have no noteworthy opinions on the actual content of the report. What I will address here instead is the quality of the communication in the report. In this post I therefore give my concrete recommendations to Statens Serum Institut's expert group.

Quality problem

The first figure in the report shows that there are problems with the quality of the report:

It is easy to see that it is not easy to see what is going on. You have to squint hard to see exactly what is written on the axes.

This problem appears in several places in the report, for example here, where I have zoomed in on a figure to make the problem clear:

My first recommendation is to save the figures as vector graphics (preferably as PDF documents). This ensures that the figures are of a decent quality that does not hurt the eyes.
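As a minimal sketch of this recommendation, assuming the figure is made with ggplot2 (the plot and the file name figure_1.pdf are made up for illustration), ggsave() writes vector graphics when the file name ends in .pdf:

```r
library("ggplot2")

# Hypothetical example plot based on the mtcars dataset that ships with R
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_minimal()

# The file name is hypothetical; the .pdf extension makes ggsave()
# use a vector graphics device instead of a raster format such as PNG
out_file <- file.path(tempdir(), "figure_1.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

A PDF figure stays sharp at any zoom level, so axis text remains readable when the reader zooms in.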

Choose software

A second recommendation is to use one piece of software to make the figures that are to be presented. Here is, for example, an ugly figure from Excel that could easily have been made in R (which many of the figures are made in):

Note also in the figure above how difficult it is to compare across the different groups. And when the other figures in the report use colours, why go with black and white here? It seems that several people have contributed figures to the report without any coordination.

The closest I come to having an opinion on the content is that it is rarely a good sign when you see variation in how the figures in a report are built (e.g. R and Excel).

Therefore: Make all figures with the same software, so they can be reproduced in the same workflow.

Use one theme

Unfortunately, it is not sufficient merely to use the same software. You should also use the same theme, so that a report has a visual identity. Take, for example, a couple of the figures made in R with the theme ggplot2::theme_bw():

Why use this theme and not something that is consistent with what is otherwise shown in the report? It should be added, though, that the figure above has far bigger problems than the theme itself. The values on the x-axis alone ruin the figure and could have been simplified drastically (why, for example, mention 2020 at every single date?).

When preparing a report like this one, I can warmly recommend spending a couple of minutes considering which theme to use (for example, they could have made a theme_ssi() that could easily be applied to every figure made with ggplot2). Anything else is simply too unprofessional.
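To sketch what such a function could look like: only the name theme_ssi() comes from the text above, while the base theme and every design choice below are my own assumptions, made up for illustration.

```r
library("ggplot2")

# Hypothetical house theme: the name theme_ssi() comes from the post,
# the design choices below are made up for illustration
theme_ssi <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

# The theme is applied to a figure like any built-in ggplot2 theme
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Example figure") +
  theme_ssi()

p
```

Keeping the theme in one function means that a change to the visual identity only has to be made in one place.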

Take the above as a couple of free recommendations, and by no means as my wish to join the expert group. There are already 19 members in the group, and if further help is needed (which it obviously is), there ought to be more than one qualified woman (there is one woman in the expert group) who could help raise the quality of this piece of involuntary postmodern art.

New resource: awesome-ggplot2

I use ggplot2 every day. It is a great R package and the best tool available to make beautiful data visualisations. The logic of the grammar of graphics makes it easy to learn and lets you gradually improve your plots.

Luckily, there are a lot of resources available for the package. I have created a repository with a list of packages, tutorials and other useful resources.

It is available here. It was also featured in R Weekly.