In a recent post on R Weekly, there was a link to a question asked by Spencer Schien on BlueSky. Specifically, he asked about people’s preference when using dplyr::left_join(). The first option is to pipe the first data frame into left_join() with the second data frame specified within the function:
df3 <- df1 |>
  left_join(df2)
The second option is not to use the pipe at all but rather to specify the two data frames within the function:
df3 <- left_join(df1, df2)
I believe that both approaches are fine, but I also believe that each serves a specific purpose and should be used accordingly (and, of course, used consistently within that purpose). Most importantly, you will almost never use these functions as shown above without any further arguments; at a minimum, make explicit which variables you are using to match rows.
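To illustrate the point about explicit keys, here is a minimal sketch of both styles with the matching variable spelled out. The data frames and the column names (id, x, y) are made up for illustration, and the code assumes {dplyr} 1.1.0 or later (for join_by()) and R 4.1 or later (for the native pipe).

```r
library(dplyr)

# Hypothetical example data; the column names are assumptions for illustration
df1 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- tibble(id = c(1, 2, 4), y = c(10, 20, 40))

# Option 1: pipe the first data frame into left_join(), with an explicit key
df3_piped <- df1 |>
  left_join(df2, by = join_by(id))

# Option 2: pass both data frames to left_join(), with the same explicit key
df3_direct <- left_join(df1, df2, by = join_by(id))

# Both options should give the same result
identical(df3_piped, df3_direct)
```

Without the by argument, left_join() falls back to a natural join on all shared column names and prints a message about it, which is easy to overlook in a longer script.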
If you work within the tidyverse and use a series of other functions from the {dplyr} package, it makes a lot of sense to use the pipe throughout your code. That is, if the reader of your code is in the pipe mindset, a left join written with the pipe is a lot easier to read than one that abandons the pipe out of the blue. I personally also find it easier to read left_join() like this, most likely because it is reminiscent of SQL (with LEFT JOIN).
Similarly, even if you do not need to pipe your data to other functions, keep in mind that the tidyverse offers a lot of options that can be easier to interpret with the pipe. For example, when using join_by() and inequality joins (introduced in {dplyr} 1.1.0), the join specification can easily become long, and the pipe operator makes it as easy as possible to see which data frame is the starting point.
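As a rough sketch of what such a longer join specification can look like, the example below matches each order to the promotion whose date range contains it. The data frames (orders, promos) and their columns are hypothetical, and the code assumes {dplyr} 1.1.0 or later.

```r
library(dplyr)

# Hypothetical example data; all names are assumptions for illustration
orders <- tibble(
  order_id   = 1:3,
  order_date = as.Date(c("2024-01-15", "2024-02-10", "2024-03-05"))
)
promos <- tibble(
  promo = c("winter", "spring"),
  start = as.Date(c("2024-01-01", "2024-03-01")),
  end   = as.Date(c("2024-01-31", "2024-03-31"))
)

# The pipe puts the starting data frame first, ahead of the longer join condition
orders_with_promo <- orders |>
  left_join(
    promos,
    by = join_by(order_date >= start, order_date <= end)
  )
```

Because this is a left join, an order that falls outside every promotion window is still kept, with missing values in the columns coming from promos.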
If you, on the other hand, rely primarily on base R or packages like {data.table}, it makes a lot more sense not to use the pipe operator, so that you stay consistent within your workflow. And if you use {data.table}, it is much better to rely on merge.data.table() than left_join(). The point is that, in my view, it makes little sense to say one approach is universally better than the other; it depends very much upon your workflow. If you use {polars}, you will most likely use a different approach than if you use {tidypolars}.
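For comparison, a left join in a {data.table} workflow might look something like the sketch below. The tables and column names are again made up for illustration; calling merge() on two data.tables dispatches to merge.data.table().

```r
library(data.table)

# Hypothetical example data; the column names are assumptions for illustration
dt1 <- data.table(id = c(1, 2, 3), x = c("a", "b", "c"))
dt2 <- data.table(id = c(1, 2, 4), y = c(10, 20, 40))

# all.x = TRUE keeps every row of dt1, which makes this a left join
dt3 <- merge(dt1, dt2, by = "id", all.x = TRUE)
```

Note that, unlike left_join(), merge() sorts the result by the key columns by default (sort = TRUE).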
The interesting part of looking at this and similar discussions is that they often boil down to a proxy discussion of base R vs. tidyverse, similar to how you see proxy discussions within statistics that can be boiled down to Frequentist vs. Bayesian approaches. This is all good and fun, but also a bit pointless without taking the context of the workflow or problem being addressed into account.