Ten great R functions #2

This is a follow-up to my previous post on great R functions. I use some of these functions a lot, while a few of them have been very helpful at least once.

11. dplyr::coalesce()

I have been working with data where the relevant information was spread over two columns but needed to be in one. For example, there might be an outcome for the treatment group and an outcome for the control group in a survey experiment, but each observation only has a value on one of these variables.

To create one variable with all of the information, we can use the coalesce() function. This function will find the first non-missing element across several columns and add that to a variable. In the example below we create a new variable (var3) that is merged from two other variables.

library(dplyr)

df <- tibble(id = 1:4,
             var1 = c(1, 2, NA, NA),
             var2 = c(NA, NA, 3, 4))

df %>% 
  mutate(var3 = coalesce(var1, var2))

The new variable will have the values 1, 2, 3 and 4.
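coalesce() also accepts a fallback value as its last argument, which can be handy when an observation is missing in every column. A small sketch with made-up data (the 0 fallback is only there for illustration):

```r
library(dplyr)

# Made-up data: the last row is missing in both columns
df <- tibble(id = 1:3,
             var1 = c(1, NA, NA),
             var2 = c(NA, 2, NA))

# The trailing 0 is only used where both var1 and var2 are NA
df_filled <- df %>% 
  mutate(var3 = coalesce(var1, var2, 0))

df_filled
```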

12. fs::dir_ls()

If you need a character vector with the files in a folder, preferably matching a specific regular expression, the dir_ls() function in the fs package has you covered. The example below will return all *.csv files in your working directory (you can also specify a path if it should not be your working directory).

fs::dir_ls(regexp = "\\.csv$")
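To try this without touching your own files, here is a small sketch that sets up a temporary folder first. The folder and file names are hypothetical and only serve as an illustration:

```r
library(fs)

# Hypothetical folder and file names, created in a temporary directory
data_dir <- path(tempdir(), "dir_ls_demo")
dir_create(data_dir)
file_create(path(data_dir, c("a.csv", "b.csv", "notes.txt")))

# Only files whose names end in .csv are returned, as full paths
csv_files <- dir_ls(data_dir, regexp = "\\.csv$")

# path_file() strips the folder part and leaves the file names
path_file(csv_files)
```

The returned vector can be passed straight on to a function that reads each file, for example via purrr::map().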

13. janitor::clean_names()

The clean_names() function does exactly what it promises: clean names. When I get Excel datasets to work with, the first row often has names that are not ideal variable names, including spaces and various symbols.

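In the example below I create a dataset with two variables with the horrible names Annual sales (USD) and Growth rate (%). Note the check.names = FALSE argument: without it, data.frame() converts the names to syntactically valid ones (Annual.sales..USD. and Growth.rate....) before clean_names() ever sees them. Then I use the clean_names() function to get clean names from the data frame. Specifically, the function takes the variable names and edits them into snake_case names.

df <- data.frame("Annual sales (USD)" = NA,
                 "Growth rate (%)" = NA,
                 check.names = FALSE)

janitor::clean_names(df)

The variable names returned from the function are annual_sales_usd and growth_rate_percent (recent versions of janitor spell out the % sign). Much better!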

14. dplyr::add_count()

Yet another function that does exactly what it promises. add_count() adds a count for the variable of interest, i.e., the number of observations that share each value. The example below counts how many observations have each value of the gear variable in mtcars and adds that information to a new variable (gear_n).

mtcars %>% 
  add_count(gear, name = "gear_n")

The function is similar to the count() function, but count() collapses the data to one row per value of the selected variable, whereas add_count() keeps every observation and adds the count as a new column. Accordingly, you should only use count() if you want to summarise your data without having to use group_by() and summarise().
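To see the difference between the two functions, compare the number of rows they return. A quick sketch using mtcars:

```r
library(dplyr)

# count() returns one row per value of gear
gear_counts <- mtcars %>% 
  count(gear, name = "gear_n")

# add_count() keeps all rows of mtcars and adds gear_n as a new column
mtcars_counted <- mtcars %>% 
  add_count(gear, name = "gear_n")

nrow(gear_counts)     # 3, one row per value of gear
nrow(mtcars_counted)  # 32, the same as mtcars
```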

15. performance::check_collinearity()

When you estimate a regression model, you often need to check whether certain assumptions hold or not. The performance package has a lot of relevant functions that make this easy, such as check_collinearity(). This function lets you easily examine the potential multicollinearity in your model.

You can read more about the function and see examples here.
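As a minimal sketch, the function takes a fitted model and reports the variance inflation factor (VIF) for each term. The model below is an arbitrary example on mtcars that I picked for illustration:

```r
library(performance)

# An arbitrary example model with correlated predictors
model <- lm(mpg ~ wt + cyl + disp, data = mtcars)

# One row per term, including the variance inflation factor (VIF)
collinearity <- check_collinearity(model)
collinearity
```

High VIF values suggest that a predictor is strongly related to the other predictors in the model.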

16. dplyr::across()

If you need to apply a function (or functions) across multiple columns, across() is a great function to use. In one of my scripts, I had to create confidence intervals for poll estimates, and I used the function to create new variables with the maximum and minimum estimates.

polls %>% 
  mutate(across(starts_with("party"), 
                ~ .x + 1.96 * sqrt((.x * (100 - .x)) / n), 
                .names = "ci_max_{.col}"),
         across(starts_with("party"), 
                ~ .x - 1.96 * sqrt((.x * (100 - .x)) / n), 
                .names = "ci_min_{.col}")
  )

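As you can see, the function takes all variables that start with “party”, calculates the upper and lower bounds of a 95% confidence interval (with the estimates in percent and n holding the sample size) and saves the information in new variables prefixed with ci_max_ and ci_min_.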
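If you want to run the code yourself, here is a self-contained sketch. The polls data below is made up: I assume the party columns hold support in percent and n holds the sample size of each poll.

```r
library(dplyr)

# Made-up poll data: party support in percent, n is the sample size
polls <- tibble(pollster = c("A", "B"),
                party_x = c(30, 28),
                party_y = c(12, 15),
                n = c(1000, 1500))

polls_ci <- polls %>% 
  mutate(across(starts_with("party"), 
                ~ .x + 1.96 * sqrt((.x * (100 - .x)) / n), 
                .names = "ci_max_{.col}"),
         across(starts_with("party"), 
                ~ .x - 1.96 * sqrt((.x * (100 - .x)) / n), 
                .names = "ci_min_{.col}"))

# The original columns plus ci_max_* and ci_min_* for each party
names(polls_ci)
```

The second across() call does not pick up the new ci_max_ columns, because their names no longer start with “party”.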

17. RVerbalExpressions::rx()

Writing regular expressions can be difficult and involve a lot of frustration. The rx() function lets you easily write code that returns the regular expression you want. You can see several good examples of how to use the function here.
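To give an idea of how this looks, here is a sketch that builds a pattern for simple URLs step by step. It assumes the rx_*() helpers from the package behave as documented. The magrittr package is loaded only for the pipe, and the test URL is just an illustration.

```r
library(RVerbalExpressions)
library(magrittr)  # only for the %>% pipe

# Build the pattern one readable step at a time
url_pattern <- rx() %>% 
  rx_start_of_line() %>% 
  rx_find("http") %>% 
  rx_maybe("s") %>% 
  rx_find("://") %>% 
  rx_maybe("www.") %>% 
  rx_anything_but(" ") %>% 
  rx_end_of_line()

# The result can be used like any other regular expression
url_pattern
is_url <- grepl(url_pattern, "https://www.google.com", perl = TRUE)
```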

18. lubridate::make_date()

make_date() is a great function that easily creates a date variable when you have the information on year, month and day in three separate variables. For example:

df %>% 
  mutate(date = make_date(year, month, day))
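A self-contained version of the same idea, with a made-up data frame so that the code can be run as is:

```r
library(dplyr)
library(lubridate)

# Made-up data with year, month and day in separate columns
df <- tibble(year = c(2020, 2021),
             month = c(12, 3),
             day = c(31, 15))

df_dates <- df %>% 
  mutate(date = make_date(year, month, day))

# The new column is a proper Date vector
df_dates$date
```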

19. dplyr::pull()

If you want to extract a single column from a data frame, you can use the pull() function. The example below pulls the gear variable from the data frame and then returns the summary of the variable.

mtcars %>% 
  pull(gear) %>% 
  summary()

Similarly, if you want to extract an element from a list, you can use the pluck() function from the purrr package.
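A short sketch of pluck() from the purrr package, using a made-up nested list:

```r
library(purrr)

# A made-up nested list
results <- list(model = "ols",
                estimates = list(beta = c(0.5, 1.2), se = c(0.1, 0.3)))

# Extract by name, and go deeper by adding more names or positions
model_name <- pluck(results, "model")
second_beta <- pluck(results, "estimates", "beta", 2)
```

Here model_name is "ols" and second_beta is 1.2.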

20. scales::show_col()

This was a function I was not familiar with until I saw Andrew Heiss mention it on Twitter. It is an amazing function for exploring different colour schemes. Do check it out.
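A minimal way to try it out, assuming the colours come from scales::hue_pal() (any vector of colour names or hex codes should work as input):

```r
library(scales)

# Six colours from the default hue palette
colours <- hue_pal()(6)

# Draws a grid of colour swatches labelled with their hex codes
show_col(colours)
```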