Advice on data and code

I have been reading a few papers on how to structure data and code. In this post, I provide a list of the papers I have found together with the main advice/rules offered in the respective papers (do consult the individual papers for examples and explanations).

Notably, the advice in the papers overlaps. I do not agree with everything (and as you can see, they are written for people working with tabular data and definitely not people working with more modern workflows with several gigabytes of data stored in NoSQL databases). However, overall, I can highly recommend that you take most of these recommendations on board.

Last, another good resource is the supplementary material to A Practical Guide for Transparency in Psychological Science. This paper deals with the details of folder structure, data documentation, analytical reproducibility, etc.

Here we go:

Nagler (1995): Coding Style and Good Computing Practice

1. Maintain a lab book from the beginning of a project to the end.
2. Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
3. Correct errors in code where they occur, and rerun the code.
4. Separate tasks related to data manipulation vs. data analysis into separate files.
5. Design each program to perform only one task.
6. Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
7. Set up each section of a program to perform only one task.
8. Use a consistent style regarding lower- and upper-case letters.
9. Use variable names that have substantive meaning.
10. Use variable names that indicate direction where possible.
11. Use appropriate white space in your programs, and do so in a consistent fashion to make the programs easy to read.
12. Include comments before each block of code describing the purpose of the code.
13. Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
14. Rewrite any code that is not clear.
15. Verify that missing data are handled correctly on any recode or creation of a new variable.
16. After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
17. When possible, automate things and avoid placing hard-wired values (those computed “by hand”) in code.
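
Rules 10, 15 and 16 are easy to skip in practice, so here is a minimal sketch of what they can look like in Python. The data, the variable names and the 16-year cut-off are all made up for illustration and do not come from Nagler's paper.

```python
from collections import Counter

# Hypothetical raw survey answers; None marks a missing response.
raw_education_years = [8, 12, 16, None, 12, 20, None, 10]


def recode_college(years):
    """Return 1 for 16+ years of schooling, 0 otherwise.

    Missing input stays missing (None) instead of silently becoming 0.
    """
    if years is None:
        return None
    return 1 if years >= 16 else 0


# The name says what a value of 1 means (rule 10).
college_graduate = [recode_college(y) for y in raw_education_years]

# Tabulate the new variable and cross-check it against the original (rule 16).
frequencies = Counter(college_graduate)
crosstab = Counter(zip(raw_education_years, college_graduate))

print(frequencies)
for (years, flag), count in crosstab.items():
    print(years, flag, count)
```

The cross-tabulation is the part that catches mistakes: if a missing value had been recoded to 0, the pair (None, 0) would show up in it.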

Broman and Woo (2018): Data Organization in Spreadsheets

1. Be consistent
– Use consistent codes for categorical variables.
– Use a consistent fixed code for any missing values.
– Use consistent variable names.
– Use consistent subject identifiers.
– Use a consistent data layout in multiple files.
– Use consistent file names.
– Use a consistent format for all dates.
– Use consistent phrases in your notes.
– Be careful about extra spaces within cells.
2. Choose good names for things
– Don’t use spaces, either in variable names or file names.
– Avoid special characters, except for underscores and hyphens.
– Keep names short, but meaningful.
– Never include “final” in a file name.
3. Write dates as YYYY-MM-DD
4. Fill in all cells and use some common code for missing data.
5. Put just one thing in a cell
6. Don’t use more than one row for the variable names
7. Create a data dictionary
8. No calculations in the raw data files
9. Don’t use font color or highlighting as data
10. Make backups
11. Use data validation to avoid errors
12. Save the data in plain text files
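
Several of these rules can be checked automatically once the data are saved as plain text. The following is a rough sketch, assuming a CSV export with a visit_date column and "NA" as the agreed missing-data code; the file content, the column names and the find_problems() helper are hypothetical.

```python
import csv
import io
import re
from datetime import date

MISSING = "NA"  # assumed project-wide code for missing values
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical spreadsheet export (row 1 is the header).
raw_csv = """subject_id,visit_date,glucose
s001,2018-03-07,5.4
s002,07/03/2018,NA
s003,2018-03-09,
s004 ,2018-03-10,6.1
"""


def is_iso_date(text):
    """True only for real calendar dates written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def find_problems(text):
    """Return (row number, column, problem) for every rule violation found."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if value == "":
                problems.append((row_number, column, "empty cell"))
            elif value != value.strip():
                problems.append((row_number, column, "extra spaces"))
        visit = row["visit_date"]
        if visit not in ("", MISSING) and not is_iso_date(visit):
            problems.append((row_number, "visit_date", "not YYYY-MM-DD"))
    return problems


problems = find_problems(raw_csv)
for problem in problems:
    print(problem)
```

A check like this does not replace data validation in the spreadsheet itself (rule 11), but it is cheap to rerun every time the file changes.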

Balaban et al. (2021): Ten simple rules for quick and dirty scientific programming

1. Think before you code
2. Start with prototypes and expand them in short development cycles
3. Look for opportunities for code reuse
4. Modularize your code
5. Avoid premature optimization
6. Use automated unit testing for critical components
7. Refactor frequently
8. Write self-documenting code for programmers and a readme file for users
9. Grow your libraries and tools organically from your research
10. Go explore and be rigorous when you publish
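
For rule 6, "critical components" are usually the small functions that every result depends on. Here is a minimal sketch using Python's built-in unittest module; the standardize() function is a made-up example of such a component, not something taken from the paper.

```python
import unittest


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot standardize an empty list")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("cannot standardize a constant variable")
    sd = variance ** 0.5
    return [(v - mean) / sd for v in values]


class StandardizeTest(unittest.TestCase):
    def test_mean_zero_and_sd_one(self):
        z = standardize([2.0, 4.0, 6.0, 8.0])
        self.assertAlmostEqual(sum(z) / len(z), 0.0)
        self.assertAlmostEqual(sum(v * v for v in z) / len(z), 1.0)

    def test_constant_input_is_rejected(self):
        with self.assertRaises(ValueError):
            standardize([3.0, 3.0, 3.0])


# Run the tests explicitly so the script keeps going afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StandardizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```

Two short tests are enough here: one for the property the function promises and one for the input that would otherwise produce a division by zero.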

Wilson et al. (2017): Good enough practices in scientific computing

1. Data management
– Save the raw data.
– Ensure that raw data are backed up in more than one location.
– Create the data you wish to see in the world.
– Create analysis-friendly data.
– Record all the steps used to process data.
– Anticipate the need to use multiple tables, and use a unique identifier for every record.
– Submit data to a reputable DOI-issuing repository so that others can access and cite it.
2. Software
– Place a brief explanatory comment at the start of every program.
– Decompose programs into functions.
– Be ruthless about eliminating duplication.
– Always search for well-maintained software libraries that do what you need.
– Test libraries before relying on them.
– Give functions and variables meaningful names.
– Make dependencies and requirements explicit.
– Do not comment and uncomment sections of code to control a program’s behavior.
– Provide a simple example or test dataset.
– Submit code to a reputable DOI-issuing repository.
3. Collaboration
– Create an overview of your project.
– Create a shared “to-do” list for the project.
– Decide on communication strategies.
– Make the license explicit.
– Make the project citable.
4. Project organization
– Put each project in its own directory, which is named after the project.
– Put text documents associated with the project in the doc directory.
– Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
– Put project source code in the src directory.
– Put external scripts or compiled programs in the bin directory.
– Name all files to reflect their content or function.
5. Keeping track of changes
– Back up (almost) everything created by a human being as soon as it is created.
– Keep changes small.
– Share changes frequently.
– Create, maintain, and use a checklist for saving and sharing changes to the project.
– Store each project in a folder that is mirrored off the researcher’s working machine.
– Add a file called CHANGELOG.txt to the project’s docs subfolder.
– Copy the entire project whenever a significant change has been made.
– Use a version control system.
6. Manuscripts
– Write manuscripts using online tools with rich formatting, change tracking, and reference management.
– Write the manuscript in a plain text format that permits version control.
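
The project organization rules (point 4) are mechanical enough to script. Below is a small sketch that creates the five directories named above; the create_project() helper, the project name and the README.txt overview file are my own assumptions, and the sketch writes to a temporary folder so that it does not touch any real project.

```python
import tempfile
from pathlib import Path

# Directory names follow the list above; the descriptions are paraphrased.
LAYOUT = {
    "doc": "text documents associated with the project",
    "data": "raw data and metadata",
    "results": "files generated during cleanup and analysis",
    "src": "project source code",
    "bin": "external scripts or compiled programs",
}


def create_project(parent, name):
    """Create the skeleton for one project and return its root directory."""
    root = Path(parent) / name
    for folder in LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"  # assumed file name for the project overview
    if not readme.exists():
        lines = [f"{folder}/  {purpose}" for folder, purpose in LAYOUT.items()]
        readme.write_text(f"{name}\n\n" + "\n".join(lines) + "\n", encoding="utf-8")
    return root


with tempfile.TemporaryDirectory() as scratch:
    project = create_project(scratch, "turnout_study")  # hypothetical project name
    created = sorted(path.name for path in project.iterdir())

print(created)
```

Running the same function for every new project is one way to get the consistent layout the list asks for without having to remember it.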

Gentzkow and Shapiro (2014): Code and Data for the Social Sciences

1. Automation
– Automate everything that can be automated.
– Write a single script that executes all code from beginning to end.
2. Version Control
– Store code and data under version control.
– Run the whole directory before checking it back in.
3. Directories
– Separate directories by function.
– Separate files into inputs and outputs.
– Make directories portable.
4. Keys
– Store cleaned data in tables with unique, non-missing keys.
– Keep data normalized as far into your code pipeline as you can.
5. Abstraction
– Abstract to eliminate redundancy.
– Abstract to improve clarity.
– Otherwise, don’t abstract.
6. Documentation
– Don’t write documentation you will not maintain.
– Code should be self-documenting.
7. Management
– Manage tasks with a task management system.
– E-mail is not a task management system.
8. Code style
– Keep it short and purposeful.
– Make your functions shy.
– Order your functions for linear reading.
– Use descriptive names.
– Pay special attention to coding algebra.
– Make logical switches intuitive.
– Be consistent.
– Check for errors.
– Write tests.
– Profile slow code relentlessly.
– Separate slow code from fast code.
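
The advice on keys (point 4) translates into a check that can run every time a cleaned table is written. A minimal sketch, assuming tables are held as lists of dictionaries; the check_key() helper and the county-year data are hypothetical.

```python
def check_key(rows, key_columns):
    """Return the number of rows if key_columns form a unique, non-missing key.

    Raises ValueError on the first missing or duplicated key.
    """
    seen = {}
    for index, row in enumerate(rows):
        key = tuple(row.get(column) for column in key_columns)
        if any(part is None or part == "" for part in key):
            raise ValueError(f"row {index}: missing key value in {key}")
        if key in seen:
            raise ValueError(f"rows {seen[key]} and {index} share key {key}")
        seen[key] = index
    return len(seen)


# Hypothetical cleaned table with one row per county and year.
county_year = [
    {"county": "A", "year": 2010, "population": 1000},
    {"county": "A", "year": 2011, "population": 1010},
    {"county": "B", "year": 2010, "population": 2000},
]

print(check_key(county_year, ["county", "year"]))

# "county" alone is not a key for this table, so the check fails loudly.
try:
    check_key(county_year, ["county"])
except ValueError as error:
    print("not a key:", error)
```

Failing loudly is the point: a merge on a key that is not unique tends to duplicate rows without any warning.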

Replace equations with code

Here is a suggestion: In empirical research, academics should move equations from the methods section to the appendix and, if anything, show the few lines of code used to estimate the model(s) in the software being used (ideally with citations to the software and statistical packages). Preferably, it should be possible to understand the estimation strategy without having to read any equations.

Of course, I am talking about the type of work that is not primarily interested in developing a new estimator or a formal theory that can be applied to a few case studies (or shed light on the limitations of empirical models). I am not against the use of equations or abstractions of any kind to communicate clearly and without ambiguity. I am, however, skeptical of how often empirical research includes equations for the sake of … including equations.

I have a theory that academics, and in particular political scientists, put more equations in their research to show off their skills rather than to help the reader understand what is going on. In most cases, equations are not needed and are often there only to impress reviewers and peers, which of course are the same people (hence, peer review). The use of equations excludes readers rather than including them.

I am confident that most researchers spend more time in their favourite statistical IDE than they do writing and reading equations. For that reason, I also believe that most researchers will find it easier to read actual code instead of equations. Take this example from Twitter of the equation and the code for a binomial regression model (estimated with glmer()):

Personally, I find it much easier to understand what is going on when I look at the R code instead of the extracted equation. Not only that, I also find it easier to think of potential alternatives to the regression model, e.g., that I can easily change the functional form and see how such changes will affect the results. This is something I rarely consider when I only look at equations.

The example above is from R, and not all researchers use or understand R. However, I am quite certain that everybody who understands the equation above will also be able to understand the few lines of code. And when people use Stata, it is often even easier to read the code (even if you are not an avid Stata user). SPSS syntax is much more difficult to read, but that says more about why you should not use SPSS in the first place.

I am not against the use of equations in research papers. However, I do believe empirical research would be much better off showing and citing code instead of equations. Accordingly, please replace equations with code.