Seven ways to find data – Erik Gahner Larsen

Data might be the new oil (there are arguments for and against this). While there definitely is a lot of data out there for you to drill, it can be difficult to find the exact data you need.

In this post I will outline seven different strategies to 1) keep yourself updated on new data sources and 2) find older datasets. I do not recommend that you necessarily go with all of them (there is a significant overlap between what you will find using the different strategies), and I have ranked the strategies according to my own personal preferences.

1. Newsletters
One of the best ways to keep yourself updated on new datasets is by getting the updates directly to your mailbox. Here, I can highly recommend the weekly newsletter Data Is Plural by Jeremy Singer-Vine.

While there are other newsletters out there, my impression is that if you subscribe to Data Is Plural, you should be covered. In addition, you can take a look at the structured archive of datasets covered in the newsletter (841 datasets at the time of writing). If you do not already subscribe to the newsletter, do yourself a favour and sign up.

2. GitHub repositories
Another good way to find data is to explore GitHub repositories. Many repositories host data (e.g. those maintained by media outlets like FiveThirtyEight), and by exploring popular repositories, you will often find interesting data.

There are also repositories that list datasets you might find interesting. Awesome Public Datasets, for example, is a list of open datasets from a wide range of fields (GIS, neuroscience, sports, climate etc.). I curate the PolData repository where you can find a list of political datasets (elections, international relations, parties, policies etc.).
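
Once you have found a data file in a public GitHub repository, you can usually read it directly instead of downloading it by hand. The following is a minimal sketch, assuming the file sits in a public repository and is served through GitHub's raw.githubusercontent.com host. The owner, repository, branch and path below are hypothetical placeholders, not real datasets.

```python
from urllib.parse import quote


def raw_github_url(owner, repo, branch, path):
    """Build the raw-download URL for a file in a public GitHub repository."""
    # Percent-encode each part; keep "/" so nested paths stay intact.
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        quote(owner), quote(repo), quote(branch), quote(path)
    )


# Hypothetical placeholders: replace with the repository you actually found.
url = raw_github_url("example-owner", "example-repo", "main", "data/example.csv")
print(url)

# With pandas installed, the URL of a real CSV file can then be read directly:
# import pandas as pd
# df = pd.read_csv(url)
```

The pandas call is left commented out because the placeholder URL does not point to a real file.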

3. Twitter
Twitter is, as always, a good way to keep yourself in the loop. While there are specific users on Twitter who tweet about new and old datasets (such as GetTheData and Pew Research Methods), the most useful strategy here is to follow researchers.

Researchers care about sharing useful resources such as datasets. To illustrate, I found this amazing resource on free and open psychological datasets on Twitter.

4. Harvard Dataverse
The Harvard Dataverse is another great place to find datasets. The search function works well and there is publicly available data on a wide range of topics (especially for political scientists).
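
If you prefer to search from a script rather than the website, Dataverse installations also expose a Search API. The following is a minimal sketch that only builds the query URL, assuming the `/api/search` endpoint and its `q`, `type` and `per_page` parameters as described in the Dataverse API guide. The function name and the example query are my own illustrations.

```python
from urllib.parse import urlencode


def dataverse_search_url(query, per_page=10, base="https://dataverse.harvard.edu"):
    """Build a Search API URL that looks for datasets matching `query`."""
    # type=dataset restricts results to datasets (not files or collections).
    params = {"q": query, "type": "dataset", "per_page": per_page}
    return base + "/api/search?" + urlencode(params)


# Illustrative query; open the printed URL in a browser or fetch it with
# your HTTP client of choice to get the results as JSON.
print(dataverse_search_url("election results", per_page=5))
```

The sketch stops at the URL so that it runs without network access. How you fetch and parse the JSON response is up to you.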

Notably, I also use this service to get a sense of forthcoming articles (as the data is usually stored online before the articles hit your RSS and/or Twitter feed). For example, journals such as American Journal of Political Science and Journal of Politics have their own dataverses where they archive datasets well in advance of the actual publication.

Psychologists might prefer OSF instead of the Harvard Dataverse. However, I find OSF cumbersome to use and a mess when you want to explore potential datasets.

5. Facebook groups
Facebook is usually not my cup of tea (let us be honest: it is shit). That being said, there are some good groups for academics to explore. One of these is Political Science Data where people are good at sharing links to new resources. Furthermore, this is also a good place to ask for data suggestions. My impression is that there are similar Facebook groups available for other scientific domains as well.

6. Reddit
If you already use Reddit, the subreddit r/datasets is worth looking into. The quality of the submissions is not always great, but you will often find some interesting datasets from various fields.

Another subreddit to check out is r/dataisbeautiful, where people share data visualizations (mostly original content). While sharing data is not the main objective of the subreddit, you will most likely find a lot of interesting data there.

7. Google Dataset Search
Last, we have Google Dataset Search. I like the idea of having a Google for datasets. And this is literally a Google for datasets. That being said, I have not used this service a lot, and whenever I use it to find data, I am not convinced that it is the best strategy. Accordingly, I recommend trying the six strategies introduced above before using this service.