Data Organization Best Practices

How should I organize my research project, data, and files effectively?

Below are some resources to help you learn about organizing projects, data, and files:

Collaborative folders from one of the cloud vendors that have negotiated an agreement with Harvard, such as Microsoft OneDrive, Google Drive, or Dropbox, can help you share your research files with your group before the data can be shared widely in a repository.

Consider also using Electronic Lab Notebooks (not only for labs!) to organize and store all your research inputs and outputs while working on a project.

What are my options for documenting my data analysis workflow?

When performing data analysis, it is essential to stay organized and document all data analysis steps and the contents of the resulting data files.

  • Document data analysis pipelines, including all steps and parameters. Remember to document consistently even as you optimize analysis methods/algorithms to ensure that you do not repeat your efforts later.
  • Document tool versions used during the analysis so that others can reproduce the work.
  • Metadata. Consistent and descriptive metadata entries make it easier to document both the pipeline steps and the tool versions described in the previous two bullet points.
  • Tracking data sources. It is important to keep track of the data provenance for all analysis steps, i.e., the origin of each input file and the destination of each output file. All of this will help you and your collaborators find and understand the data/analysis later. The data analysis software may track this information, or it could be provided in readme files associated with the data. See these best practices for creating readme files. Use of appropriate and consistent file naming conventions can facilitate tracking input and output files along an analysis pipeline. For more information see this guide on file naming conventions.
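
One way to put the points above into practice is to write a small provenance record alongside each output file. The following is a minimal sketch in Python, not a prescribed method: the function names (sha256_of, package_versions, build_record, write_record) and the JSON field names are hypothetical, chosen only for illustration. It records the parameters of an analysis step, the Python and package versions in use, and a checksum for every input and output file.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names):
    """Look up installed package versions; missing packages map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_record(step, inputs, outputs, parameters, packages=()):
    """Assemble one provenance record for a single analysis step.

    The field names below are illustrative assumptions, not a standard schema.
    """
    return {
        "step": step,
        "run_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "package_versions": package_versions(packages),
        "parameters": parameters,
        "inputs": [{"path": str(p), "sha256": sha256_of(p)} for p in inputs],
        "outputs": [{"path": str(p), "sha256": sha256_of(p)} for p in outputs],
    }


def write_record(record, destination):
    """Write the record as indented JSON next to the data it describes."""
    Path(destination).write_text(json.dumps(record, indent=2), encoding="utf-8")
```

For example, after a cleaning step that reads a hypothetical raw.csv and writes clean.csv, calling build_record("clean", ["raw.csv"], ["clean.csv"], {"threshold": 0.5}, packages=["pandas"]) and passing the result to write_record with a destination such as clean.provenance.json would leave a machine-readable note of where the output came from. The checksums let collaborators confirm later that they are working with the same files.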

The following tools can help you document your computational analysis:

  • R Markdown (used with the R programming language)
  • R Markdown Notebooks: R Markdown documents with chunks that can be executed independently and interactively, with output visible immediately beneath the input. Very similar to Jupyter Notebooks.
  • Jupyter Notebooks (support for many programming languages)
  • Workflow/Pipeline Tools (usually used in Biology): See this curated list of pipeline tools.