Structured Query Language (SQL) is the most widely used database language. You hear people pronounce it as either ‘sequel’ or ‘S-Q-L’. In this post, I will briefly introduce SQL for users of R and provide some guidance on how to wrangle and extract data from SQL databases in R. SQL is not perfect (see this post for some of the shortcomings and this post for some arguments against SQL), but if you want to get a job working with data, having SQL skills on your CV will do you well in life.
In a blog post, Craig Kerstiens describes three specific reasons to learn SQL. First, SQL is a skill you will be able to put to good use in most organisations, roles and contexts where you will be working with databases. Second, SQL has been around for a long time and will most likely continue to be relevant for a very long time. Third, it will give you a comparative advantage relative to people who are not familiar with SQL.
Why should you learn SQL as an R user? In addition to the reasons mentioned above, SQL is not that strange if you are already familiar with packages such as dplyr where you have functions such as filter()
and select()
. Accordingly, if you are familiar with tidyverse in R, learning the basics of SQL will not be that difficult. As Irene Steves concludes in her presentation at rstudio::global 2021, SQL and R is a dynamic duo.
When we are working with SQL, we are working with databases. There are different relational database management systems (i.e., your database backend), including SQLite, MySQL and PostgreSQL. SQLite in particular is among the most popular database engines.
An SQL database is similar to an R environment where you can store multiple tables (as pointed out by Tom Carpenter). Or, a table in a database is similar to a data frame in a list in R. Accordingly, you can store a series of tables in a SQL database. For a good introduction to databases and SQL in R, see this chapter from R for Data Science.
To get started with databases in R, I highly recommend checking out the Best Practices in Working with Databases at Posit. Grant R. McDermott also have a good introduction to databases in R in his Data Science for Economists lectures. For an introduction to working with SQL within tidyverse, I can recommend Sciencificity’s Blog’s part I, part II, and part III. For other noteworthy introductions, I can recommend Irene Steves’ post on using SQL in RStudio and Jacqueline Nolis’ post on how to use SQL with R Markdown.
Last, for Shiny, there is also a lot of good material out there, including Emily Riederer’s tutorial on how to use databases with Shiny, Thomas Roh’s posts on database transactions and database syncing, and Sebastian Kranz’s post on the dbmisc package with Shiny.