Data Sharing and Code Commenting: Best Practices for Graduate Students

Scientific computing has become increasingly important in psychological science research. However, graduate students are rarely taught proper data management and analytic workflow techniques directly. This short piece highlights a few best practices for data sharing and code commenting that graduate students can incorporate to facilitate reproducibility and replicability.

Why Should I Share My Data and Code?

Amid concerns about poor self-correction in psychological science (Klein et al., 2018; Open Science Collaboration, 2015), much attention has been drawn to the replicability crisis, also known as the “credibility revolution” (Vazire, 2018). Accordingly, the open science movement has strengthened the scientific community’s expectation of access to key components of research (e.g., protocols, resources, data, and analysis code) in order to assess, validate, and replicate prior research (Ioannidis, 2012). Data sharing is one of several key practices that research organizations and major funders have begun to mandate (Houtkoop et al., 2018). Despite this, data sharing remains quite rare in the psychological sciences, for two primary reasons:

  • 1) a lack of knowledge on how to get started
  • 2) a lack of awareness of the benefits of data sharing, or a lack of confidence in the quality of one’s data

Here, we focus on the latter concern; for specific guidelines on how to prepare and share your data, see Klein et al. (2018).

Benefits of Data and Code Sharing

When researchers make their data and code widely available, they are, in effect, enabling:

  • analytic reproducibility (i.e., statistical analyses that can be re-run to detect unintended errors or bias and to verify the logic and sequence of data analysis steps; Hardwicke et al., 2018; Wilson et al., 2017)
  • analytic robustness (i.e., alternative analytic decisions that may be used to verify results)
  • analytic replication (i.e., replication of the same analytic steps with new data to investigate generalizability; Houtkoop et al., 2018)

Importance of Code Commenting

Data sharing, however, should be the bare minimum. Above and beyond this, making your data management and analysis code publicly available requires that the code be readable (i.e., understandable) and reproducible. Some basic scientific computing practices ensure that research is not just reproducible, but also efficient, transparent, and accessible in the future (Crüwell et al., 2019). For example, one way to ensure analytic code can be easily understood is to provide a detailed, commented version of the code. When you come back to modify or review code you wrote weeks, months, or even years ago, will you be able to remember what you did and what that code means? Even more important, will other people be able to understand what you did? Within an open science framework, it is essential that other people can easily interpret your code for data quality checks and reproducibility. Although incorporating code comments may seem tedious at the time, the long-run benefits to your future self, your peers, your co-authors, and other researchers in the field cannot be overstated.

How to Comment Your Code

Now that we have covered the why, let’s talk about the how. Here are some concrete steps to take when commenting your code (adapted from Wilson et al., 2017). First, create a commented-out section, i.e., the “header”, at the top of your code. Here, create an overview of your project to self-reference: list the project title, the filename, the co-authors, a description of the purpose (e.g., initialization, data cleaning, analysis), and any dependencies, including required input data files, software version, and calendar date.
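As a minimal sketch, a header in SAS might look like the following (the project details, filename, and authors here are hypothetical placeholders):

/**************************************************************
 Project:      APAGS Data Sharing Tutorial (hypothetical)
 File:         01_data_cleaning.sas
 Authors:      A. Student, B. Advisor
 Purpose:      Data cleaning for the APAGS survey project
 Dependencies: input data file APAGS.sas7bdat; SAS 9.4
 Date:         2018-06-01
**************************************************************/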

Then insert a table of contents that describes the sections of the code. An example table of contents can include:

  • 1) Loading in Data and Libraries
  • 2) Descriptive Statistics
  • 3) Preliminary Analysis
  • 4) Main Analysis for Aim 1
  • 5) Main Analysis for Aim 2
  • 6) Sensitivity Analysis
  • 7) Tables

Adjust this based on your project, your aims, and your workflow. It might also be helpful to include, at the top of the code, a list of all the variable names in the dataset with a brief description of each; a minimal sketch of both appears below.
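For instance, the table of contents above and a variable list, written as SAS comments, might look like this (the variable names and descriptions are hypothetical):

/* TABLE OF CONTENTS
   1) Loading in Data and Libraries
   2) Descriptive Statistics
   3) Preliminary Analysis
   4) Main Analysis for Aim 1
   5) Main Analysis for Aim 2
   6) Sensitivity Analysis
   7) Tables
*/

/* VARIABLES (names and descriptions are hypothetical)
   id     = participant identifier
   age    = age in years at enrollment
   stress = total score on the stress questionnaire
*/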

Next, it is important to record all the steps used to process the data. Find out how to write a comment in the software you are using (e.g., SAS, R), and place a brief explanatory comment at the start of each data step or analytic step.

An example of commented code in SAS:

/* Load in Library */
libname gradPSYCH 'N/project/APAGS/data';

/* Designate data */
data APAGS; set gradPSYCH.APAGS; run;

/* See Contents of Data File */
proc contents data=APAGS; run;

If you carefully comment chunks of functionally related code, writing out what you are doing and why, other researchers, and your future self, will be able to easily reproduce your data steps.
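For instance, a commented chunk of analysis code in SAS might look like the following sketch (the variables age and stress are hypothetical):

/* Descriptive Statistics: inspect sample size, means, and
   ranges of key study variables before modeling
   (variable names are hypothetical) */
proc means data=APAGS n mean std min max;
    var age stress;
run;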

Final Thoughts

As a scientist, being committed to open science means engaging in responsible data management, embracing transparency, and preparing your data for a reproducible research workflow. Publicly sharing your data, along with well-organized, carefully commented data management and analysis code, enables other researchers to engage in analytic reproducibility, analytic robustness, and analytic replication, and is a good start. Overall, these practices will also improve your personal research efficiency and external credibility.


References

  • Crüwell, S., van Doorn, J., Etz, A., Makel, M. C., Moshontz, H., Niebaum, J. C., Orben, A., Parsons, S., & Schulte-Mecklenbeck, M. (2019). Seven easy steps to open science: An annotated reading list. Zeitschrift für Psychologie, 227(4), 237-248. https://doi.org/10.1027/2151-2604/a000387
  • Hardwicke, T. E., Mathur, M. B., MacDonald, K. E., Nilsonne, G., Banks, G. C., Kidwell, M., … Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science, 5(8), 180448. https://doi.org/10.1098/rsos.180448
  • Houtkoop, B. L., Chambers, C., Macleod, M., Bishop, D. V. M., Nichols, T. E., & Wagenmakers, E.-J. (2018). Data sharing in psychology: A survey on barriers and preconditions. Advances in Methods and Practices in Psychological Science, 1(1), 70–85. https://doi.org/10.1177/2515245917751886
  • Ioannidis, J. P. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7(6), 645–654.
  • Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., Mohr, A. H., Ijzerman, H., Nilsonne, G., Vanpaemel, W., & Frank, M. C. (2018). A Practical Guide for Transparency in Psychological Science. Collabra: Psychology, 4(1), 20. https://doi.org/10.1525/collabra.158
  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
  • Vazire, S. (2018). Implications of the credibility revolution for productivity, creativity, and progress. Perspectives on Psychological Science, 13(4), 411–417. https://doi.org/10.1177/1745691617751884
  • Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510