## Ten great R functions #2

This is a follow-up to my previous post on great R functions. Some of these functions I use a lot, while a few of them have been very helpful at least once.

#### 11. dplyr::coalesce()

I have been working with data where the relevant information was split across two columns but needed to be in one. For example, a survey experiment might have one outcome variable for the treatment group and another for the control group, but each observation only has a value on one of these variables.

To collect all of the information in one variable, we can use the coalesce() function. This function finds the first non-missing element across several columns. In the example below we create a new variable (var3) that merges two other variables.

```r
df <- tibble(id = 1:4,
             var1 = c(1, 2, NA, NA),
             var2 = c(NA, NA, 3, 4))

df %>%
  mutate(var3 = coalesce(var1, var2))
```


The new variable will have the values 1, 2, 3 and 4.

#### 12. fs::dir_ls()

If you need a character vector with the files in a folder, optionally filtered by a regular expression, the dir_ls() function in the fs package has you covered. The example below returns all .csv files in your working directory (you can also specify a path if it should not be your working directory).

```r
fs::dir_ls(regexp = "\\.csv$")
```


#### 13. janitor::clean_names()

The clean_names() function does exactly what it promises: clean names. When I get Excel datasets to work with, the first row often has names that are not ideal variable names, including spaces and special characters.

In the example below I create an empty dataset with two variables with the horrible names Annual sales (USD) and Growth rate (%). Then I use the clean_names() function to get clean names from the data frame. Specifically, the function takes the variable names and edits them into snake_case names.

```r
df <- data.frame("Annual sales (USD)" = NA,
                 "Growth rate (%)" = NA,
                 check.names = FALSE)  # keep the original names intact

janitor::clean_names(df)
```


The variable names returned from the function are annual_sales_usd and growth_rate. Much better!

#### 14. dplyr::add_count()

Yet another function that does exactly what it promises. add_count() adds a count of the variable of interest, i.e., the number of observations sharing that specific value. The example below counts how many observations have each value of the gear variable in mtcars and adds that information to a new variable (gear_n).

```r
mtcars %>%
  add_count(gear, name = "gear_n")
```

The function is similar to count(), but it does not collapse all observations into one row per value of the selected variable. Accordingly, you should only use count() if you want to summarise your data without having to use group_by(), and add_count() if you want to keep all rows.

#### 15. performance::check_collinearity()

When you estimate a regression model, you often need to check whether certain assumptions hold. The performance package has a lot of relevant functions that make this easy, such as check_collinearity(). This function lets you easily examine the potential multicollinearity in your model.
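
A minimal sketch, assuming the performance package is installed; the model below is a hypothetical example of mine, not one from the original post.

```r
# Fit a model with several (correlated) predictors and check the
# variance inflation factors with performance::check_collinearity()
library(performance)

m <- lm(mpg ~ wt + hp + disp, data = mtcars)
check_collinearity(m)
```

The function reports a VIF per predictor; values above roughly 5 to 10 are commonly taken as a sign of problematic multicollinearity.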

#### 16. dplyr::across()

If you need to apply a function (or functions) across multiple columns, across() is a great function to use. In one of my scripts, I had to create confidence intervals for poll estimates, and I used the function to create new variables with the maximum and minimum estimates.

```r
polls %>%
  mutate(across(starts_with("party"),
                ~ .x + 1.96 * sqrt((.x * (100 - .x)) / n),
                .names = "ci_max_{.col}"),
         across(starts_with("party"),
                ~ .x - 1.96 * sqrt((.x * (100 - .x)) / n),
                .names = "ci_min_{.col}"))
```


As you can see, the function takes all variables that start with “party”, calculates the lower and upper estimates and saves the information in new variables.

#### 17. RVerbalExpressions::rx()

Writing regular expressions can be difficult and involves a lot of frustration. The rx() function lets you write readable code that returns the regular expression you want. You can see several good examples of how to use the function here.
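
For illustration, here is a sketch along the lines of the package README (assuming the piped rx_*() verbs of RVerbalExpressions): build a pattern that matches a URL.

```r
library(RVerbalExpressions)
library(magrittr)

# Build the regular expression step by step with readable verbs
pattern <- rx() %>%
  rx_start_of_line() %>%
  rx_find("http") %>%
  rx_maybe("s") %>%
  rx_find("://") %>%
  rx_maybe("www.") %>%
  rx_anything_but(" ") %>%
  rx_end_of_line()

grepl(pattern, "https://www.google.com")
```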

#### 18. lubridate::make_date()

make_date() is a great function that easily creates a date variable when you have the information on year, month and day in three separate variables. For example:

```r
df %>%
  mutate(date = make_date(year, month, day))
```


#### 19. dplyr::pull()

If you want to extract a single column from a data frame, you can use the pull() function. The example below pulls the gear variable from the data frame and then returns the summary of the variable.

```r
mtcars %>%
  pull(gear) %>%
  summary()
```


Similarly, if you want to extract an element from a list, you can use the pluck() function.
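
A quick sketch of pluck() on a nested list (the list here is a made-up example):

```r
x <- list(results = list(scores = 1:3))

# Extract the nested element by name, level by level
purrr::pluck(x, "results", "scores")
#> [1] 1 2 3
```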

#### 20. scales::show_col()

This is a function I was not familiar with until I saw Andrew Heiss mention it on Twitter. It is an amazing function for exploring different colour schemes. Do check it out.
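
A minimal example (the hex codes below are arbitrary choices of mine): pass show_col() a vector of colours and it plots them as labelled swatches.

```r
pal <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a")

# Plot the colours in a grid with their hex codes as labels
scales::show_col(pal)
```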

## Replace equations with code

Here is a suggestion: In empirical research, academics should move equations from the methods section to the appendix and, if anything, show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without having to read any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical of how empirical research often includes equations for the sake of … including equations.

I have a theory that academics, and in particular political scientists, put more equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, equations are not needed and are often there only to impress reviewers and peers, who of course are the same people (hence, peer review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code instead of equations. Take this example of the equation and code for a binomial regression model (estimated with glmer()) from Twitter:
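
The tweet itself is an image, but a comparable (entirely hypothetical) model of that type could look like this; the data, variable names and formula below are my own illustration, not the model from the tweet:

```r
# Hypothetical data and binomial mixed model, only to illustrate
# the code-vs-equation point (assumes the lme4 package)
library(lme4)

set.seed(1)
df <- data.frame(
  success   = rbinom(100, size = 10, prob = 0.4),
  trials    = 10,
  treatment = rep(c(0, 1), each = 50),
  site      = rep(letters[1:10], times = 10)
)

# Successes/failures modelled with a treatment effect and a random
# intercept per site
m <- glmer(cbind(success, trials - success) ~ treatment + (1 | site),
           data = df, family = binomial)
```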

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. Not only that, I also find it easier to think of potential alternatives to the regression model, e.g., how I can change the functional form and see how such changes will affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everybody who understands the equation above will also be able to understand the few lines of code. And when people use Stata, the code is often even easier to read (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be much better off by showing and citing code instead of equations. Accordingly, please replace equations with code.

## Ten great R functions

Here are ten R functions that have saved me a lot of time over the years.

#### 1. forcats::fct_reorder()

The forcats package has a lot of great functions. The one I use the most is the fct_reorder() function. I have also seen David Robinson using it a lot in his YouTube videos (I recommend his videos in this post).

The function is good for changing the order of the levels in a factor variable, e.g. if you want to make sure there is some structure to the values you present in a bar chart.
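
A minimal sketch with made-up data: reorder the factor by the value being plotted, so the bars come out sorted instead of alphabetical.

```r
library(dplyr)
library(forcats)
library(ggplot2)

df <- tibble(answer = c("Agree", "Disagree", "Neither"),
             share  = c(55, 30, 15))

df %>%
  mutate(answer = fct_reorder(answer, share)) %>%  # order levels by share
  ggplot(aes(x = answer, y = share)) +
  geom_col()
```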

#### 2. countrycode::countrycode()

I have lost count of the number of times I have used the countrycode package. If you are doing comparative research and not using the countrycode() function, you are in for a treat.

In a lot of datasets you will not have the full country name (e.g. Denmark), but something like ISO 3166-1 alpha-2 codes (e.g. DK). The countrycode() function can easily return country names based on ISO codes (or vice versa). Here is an example:

```r
countrycode(c("DK", "SE"),
            origin = "iso2c",
            destination = "country.name")
```


This code will return Denmark and Sweden. As you can see, you simply provide the “origin” (i.e. the type of data you have) and the “destination” (i.e. the type of data you would like). I especially find this function useful when I need to merge datasets with different country variables and when I want to present full country names in a visualisation instead of ISO codes.

Last, if you are working on a country-level dataset, make sure that it is easy to match the countries with any of the variables available in the countrycode package.

#### 3. tidyr::separate_rows()

I recently had to work with a dataset where each country had several priorities in relation to the Sustainable Development Goals (SDGs). However, there was only one SDG variable with information on the relevant SDGs for each country. The separate_rows() function is great to turn such data into multiple rows.

```r
df <- tibble(
  country = c(1, 2),
  SDG = c("SDG 5,SDG 17,SDG 3", "SDG 1,SDG 2,SDG 3")
)

df %>% separate_rows(SDG,
                     sep = ",",
                     convert = TRUE)
```


The sep argument specifies the separator used to split the information (in this case a comma). The code returns a tibble with two variables and six observations.

#### 4. tidyr::crossing()

I often use the crossing() function when I need to create a data frame from scratch. For example, if you need to create a country-year data frame for a few countries from 1965 to 2021, you can create a data frame where each country has a row for each year. Here is an example:

```r
crossing(country = c("Denmark", "Sweden"),
         year = 1965:2021,
         value = NA_real_)
```


#### 5. stringi::stri_reverse()

I had to scrape a PDF file, but the text I got from the document was reversed, e.g. ‘Agriculture’ was ‘erutlucirgA’. There might be easier ways to fix this, but the stri_reverse() function in the stringi package did the trick. Here is a simple example:

```r
x <- "snoitcnuf R taerg neT"

stringi::stri_reverse(x)
```


And what we get is: “Ten great R functions”.

#### 6. purrr::reduce()

The reduce() function is great for collapsing repetitive piping. There is a good blog post on the function here. To illustrate: when I needed to merge several data frames into one large data frame, I used to write multiple lines of left_join().

```r
reduce(list(df_1, df_2,
            df_3, df_4),
       left_join,
       by = c("iso2c", "year"))
```


The code will left join all data frames on the iso2c and year variables.

#### 7. dplyr::distinct()

If you have multiple rows in a data frame, e.g. multiple countries, but want a unique row for each country, you can use the distinct() function to get distinct rows. In the example below we have four rows but we turn them into a data frame with distinct rows on the variable x.

```r
df <- tibble(
  x = c(1, 1, 2, 2),
  y = c(1, 1, 2, 4)
)

df %>% dplyr::distinct(x, .keep_all = TRUE)
```


#### 8. fuzzyjoin::regex_left_join()

The regex_left_join() function from the fuzzyjoin package is great if you need to merge data frames based on a regular expression. I found this useful when I had to join data frames with different country names.

Here is a simple example where we join two data frames, matching the rows for both “Denmark” and “denmark”.

```r
df_1 <- data.frame(
  country = c("Denmark", "denmark"),
  year = 2020:2021
)

df_2 <- data.frame(regex_country = "[Dd]enmark",
                   type = 1:2)

df_1 %>%
  fuzzyjoin::regex_left_join(df_2, by = c(country = "regex_country"))
```


#### 9. ggplot2::labs()

I used to look up the theme() function when I had to remove the title of a legend, or use scale_x_continuous() if I had to change the title of the x-axis. Not anymore. The labs() function is an easy way to change the labels in your figure. You can also use it to change the title and subtitle of your figure. Highly recommended.
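
A small sketch with built-in data: everything label-related goes in one labs() call.

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) +
  geom_point() +
  labs(title = "Fuel economy by weight",
       x = "Weight (1,000 lbs)",
       y = "Miles per gallon",
       colour = "Gears")  # renames the legend title in the same call
```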

#### 10. tidyr::drop_na()

When I check some of my old code, I often see lines like this:

```r
df %>%
  filter(!is.na(var1))
```


However, there is a much easier way to do this, namely using the drop_na() function.

```r
df %>%
  drop_na(var1)
```


This is not only much easier to write than having to rely on two functions, but also a lot easier to read.

## How to improve your figures #3: Don’t show variable names

When you plot a figure in your favourite statistical software, you will most likely see the name of the variable(s) you are plotting. If your income variable is called inc, your software will label the axis inc, not income. In most cases variable names are not sufficient, and you should, for that reason, not show variable names in your figures.

Good variable names are easy to read and write – and follow specific naming conventions. For example, you cannot (and should not) include spaces in your variable names. That is why we use underscores (_) to separate words in variable names. However, R, SPSS and Stata will happily show such underscores in your figures – and you need to fix that.
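
In ggplot2, for example, the fix is a single labs() call; the data and variable names below are hypothetical:

```r
library(ggplot2)

df <- data.frame(inc = c(20, 40, 60),
                 happiness_score = c(5.1, 6.2, 6.8))

# Without labs(), the axes would show the raw names inc and happiness_score
ggplot(df, aes(x = inc, y = happiness_score)) +
  geom_point() +
  labs(x = "Income (1,000 USD)", y = "Happiness score")
```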

I believe this is data visualisation 101 but it is something I see a lot, including in published research. For example, take a look at this figure (Figure 1 from this paper):

As you can see, we have Exitfree, Anti_EU and some GDP* variables. The good thing about this paper is that the variable names are mentioned in the main text as well: “Individuals and parties may have ideological objections to European integration and hence desire a free exit right irrespective of whether their country is peripheral. To control for this, a variable ‘Anti_EU’ is constructed based on the variable ‘eu_anti_pro’ in the ParlGov database”. However, I would still recommend that you do not show the actual variable names in the figures but use proper labels (with spaces and everything).

Let’s look at another few examples from this paper. Here is the first figure:

The important thing is not what the figure is about, but the labels. You will see labels such as PID_rep_dem and age_real. These are not good labels to have in a figure in a paper. age_real is not mentioned anywhere in the paper (only age as a covariate is mentioned).

Let us take a look at Figure 3 from the same paper:

Here you will see a variable called form2. What was form 1? Is there a form 3? When we rely on variable names instead of clear labels, we introduce ambiguity and make it difficult for the reader to understand what is going on. Notice also the difference between Figure 1 and Figure 3 for age, i.e. age_real and real_age. Are those variables the same (i.e. a correlation of 1)? And if that is the case, why have two age variables?

Okay, next example. Look at Figure 6 from this paper:

Here we see a variable on the x-axis called yrs_since1920 (years since 1920). It would be better to label this axis simply “Years since 1920”. Or even better: show the actual years on the axis. Notice also here the 1.sønderjylland_ny label. Sønderjylland is not mentioned in the paper, and it is not clear how ny (‘new’ in Danish) should be understood here (most likely it wasn’t the first Sønderjylland variable created in the data).

Let’s take another example, specifically Figure 3 from this paper:

Here we see the good old underscores en masse. anti_elite, immigrant_blame, ring_wing_complete_populism, rich_blame and left_wing_complete_populism. There are 29 authors on the article in question. Too many cooks spoil the broth? Nahh, I am sure most of the authors on the manuscript didn’t even bother looking at the figures (also, if you want to have fun, take a critical look at the results provided in the appendix!).

And now I notice that all of the examples I have provided above are from Stata. I promise it is a coincidence. However, let’s take one last example from R just to confirm that it is not only an issue in Stata. Specifically, look at Figure 3 in this paper (or Figure 4, Figure 5 and Figure 6):

The figure shows trends in public opinion on economic issues in the United States from 1972 to 2016. There are too many dots in the labels here. guar.jobs.n.income, FS.aid.4.college etc. are not ideal labels for your figure.

In sum, I like most of the papers above (there is a reason I found the examples in the first place). However, it is a major turn-off that the figures do not show actual labels but simply rely on the variable names or weird abbreviations to show crucial information.

## Calculating Folketing seats with R

In many opinion polls, party support is reported not only as vote shares in percent but also as seat counts. The D’Hondt method is, as you may know, used to distribute the constituency seats in Danish general elections, and together with the compensatory seats it ensures a proportional relationship between votes and seats.

If you want to estimate how many seats the respective parties would win, I warmly recommend the seatdist package for R. It is developed by Juraj Medzihorsky and can be found here. Once you have installed the package, you can load it into R and use the giveseats() function to calculate seat counts:

```r
giveseats(c(33, 6, 10, 7, 8, 1, 3, 1, 5, 17, 8, 1),
          ns = 175,
          thresh = 0.02,
          method = "dh")
```

The first thing we pass the function is a vector with the parties’ support in percent (I have left out decimals just to make it easier to read). 33, for example, is the support for Socialdemokratiet in percent. ns specifies how many seats to distribute (number of seats, in this case 175), thresh specifies the electoral threshold (2% in this case), and method is the apportionment method (dh for D’Hondt).

Here we can see that Socialdemokratiet would get around 61 seats at the next general election. This is of course only an estimate, since 1) there is uncertainty in the poll and 2) not all seats are allocated this simply at an actual election. We also do not take the four North Atlantic seats into account. Nevertheless, it is relatively easy to get an estimate of party support measured in seats. The package also offers a long list of options for examining how the parties’ seat counts would look under other apportionment methods.
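
For intuition, the D’Hondt allocation itself can be sketched in a few lines of base R. The dhondt() helper below is my own simplified version (it ignores the split between constituency and compensatory seats), not part of seatdist:

```r
# Simplified D'Hondt: repeatedly award the next seat to the party with
# the largest quotient votes / (seats_won + 1)
dhondt <- function(shares, seats, threshold = 0) {
  shares[shares / sum(shares) < threshold] <- 0  # drop parties below the threshold
  won <- integer(length(shares))
  for (i in seq_len(seats)) {
    quotients <- shares / (won + 1)
    winner <- which.max(quotients)
    won[winner] <- won[winner] + 1
  }
  won
}

dhondt(c(33, 6, 10, 7, 8, 1, 3, 1, 5, 17, 8, 1), seats = 175, threshold = 0.02)
```

Running this on the poll numbers above returns one seat count per party, with all 175 seats allocated and the parties below the 2% threshold getting zero seats.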