In a technology feature in Nature, Jeffrey M. Perkel provides a guide for early-career researchers to help them make “the right choice” in terms of the best programming language to use. In this post, I will try to provide a few brief answers to the question. This post is more of a response to the thoughts in the article and is not intended to be a comprehensive deep dive into the best programming language to use, so I recommend reading the article first. So, which programming language should you use?
Here is my one-word answer: Python.
Here is my three-word answer: Python and R.
Here is my six-word answer: Python and R, no Jupyter Notebooks.
The article pays attention to Jupyter Notebook and similar tools. I have never used Jupyter Notebook and I would not recommend it. While tools such as Jupyter Notebook might be easy to pick up and use, it is much better to work with a robust (interactive) workflow that is built for code rather than optimised for mixing in non-code elements. For R, I prefer programming in an R script rather than an R Markdown file. In general, when picking up a programming language, try to separate the workflow so you are not writing prose and code in the same file (except for commenting your code, of course).
Why Python instead of R? Or, more specifically, why Python over R? Because if you want to learn programming, Python is simply a better programming language with much more potential (especially outside academia). If all you care about is being able to run a function or two that can estimate a regression model, R is more than fine. You can even use Stata for all I care (some might interpret this as me not caring at all).
This is not to say that Python is better than R at everything. For the sake of argument, it might not even be the best at any one particular thing. If you want to fit multilevel models, R will be much better and easier to use than Python. If you want to create a beautiful figure, you are much better off with {ggplot2} in R (although plotnine in Python is fine). But if you want to pick up one programming language, I would still recommend Python. And this comes from someone who is benefitting financially from more people learning R.
There are also a lot of similarities between Python and R, but my guess is that it is easier to learn R if you are trained in Python than it is to learn Python if you are trained in R. There are at least two reasons for this. First, if you can work with polars or pandas in Python, it will feel like a walk in the park to work with the tidyverse or data.table (or even base R). Second, if you can set up a Python environment with the relevant dependencies, you will find it very easy to set up a project in R.
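To make the first point a bit more concrete, here is a minimal sketch of a filter, group, and summarise step in pandas, with a rough dplyr equivalent in the comments. The data frame and the column names (`group`, `score`) are made up for illustration, and the R pipeline is only an approximate translation, not a line-for-line match.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "score": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# pandas: keep rows with score > 1, then compute the mean and count per group.
# Rough dplyr equivalent (an approximation, not an exact translation):
#   df |>
#     filter(score > 1) |>
#     group_by(group) |>
#     summarise(mean_score = mean(score), n = n())
summary = (
    df[df["score"] > 1]
    .groupby("group", as_index=False)
    .agg(mean_score=("score", "mean"), n=("score", "size"))
)

print(summary)
```

If the chained pandas version makes sense to you, the piped dplyr version should read as the same three steps with different names.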
In the article, it is recommended to stick to a programming language that your colleagues are using: “For many programming tasks, almost any language will do. But for beginners, it’s good to choose one that a colleague can help with. Furthermore, if everyone in your field is using a particular language, it helps to be using the same one, too.” I do not disagree with this, but I would like to add three relevant caveats. First, you should work with a programming language that is relevant to your future colleagues. This might seem like a small detail, but when you compare the popularity of R in academia with the popularity of Python in industry, it is not necessarily a good idea to consider only the popularity of R among your academic peers when picking a language. Second, there is also something to be said about comparative advantages. If all your friends only use R, but you are a master of Python, you will most likely be better off in the long run. Third, in the age of LLMs, while it is still good to be able to ask your colleagues for help, a good LLM might be more helpful at providing detailed guidance (i.e., it can be easier to learn Python with the help of an LLM than to learn R with the help of a colleague).
The article continues to talk about memory allocation, but this part is a bit weird. First, it ignores the fact that a lot of computing now takes place in the cloud and it is not really an issue for a scripting language to handle big data. Second, the article implies that you work with manual memory allocation in Rust when saying “In compiled languages such as Rust, coders must specifically allocate the memory they want, then ‘free’ it up when it is no longer needed”, which is misleading. Rust's ownership model normally frees memory automatically when a value goes out of scope, so other compiled languages, such as C, would work as better examples here. For most people, I do not really think memory allocation should be among the key priorities when deciding upon a programming language.
The end of the article talks about the help you can find online: “There’s no shortage of online help, whatever language you choose. Useful resources include The Carpentries, the Data Science Learning Community and Stack Overflow.” I am not familiar with The Carpentries or the DSLC, and while Stack Overflow is a decent resource, I would not really consider any of these resources relevant when deciding upon a programming language. LLMs are very good at a lot of stuff in Python (and OK at some stuff in R), so I would prioritise how well LLMs handle a language over the available online resources when choosing a programming language.
So which programming language should you use? Again, if you are an early-career researcher, I would opt for Python and R, and if you have a deep familiarity with one of the two, it is not that difficult to pick up the other. However, as always, keep in mind that this is my reading, shaped by my background in the social sciences and by the fact that I do not work in academia.