Data Management Planning

Life science projects are becoming increasingly data-intensive, as both the volume and the complexity of the data grow. Life scientists therefore need to construct a proper data management plan (DMP) before they start a research project and update it regularly as the project proceeds. Having a DMP is also rapidly becoming a condition for obtaining research funds.

A research data management plan can be compared with the checklist that pilots use before each flight: it makes sure that you do not forget essential steps and that you make important decisions on a well-informed basis.

Nowadays, research data sets are often very large. In addition, the data are so rich that researchers can use them for various types of analyses, each of which may involve multiple steps. Special tools are required to keep an overview of the data sets and the analysis pipelines.

At the same time, scientists are increasingly willing to make their research data available for research by others, and doing so is rapidly becoming a condition for obtaining research funds. For instance, X-ray images of the spine that have been collected for a study of the backbone may be re-used to study the state of the aorta. Such data re-use requires well-documented data that can be located by other scientists. In addition, it often requires extensive documentation of the data collection process, with proper registration of all operations. All of this makes it important to carefully manage the processes of data acquisition and analysis (i.e., ‘data management’).

Good research data management requires professional and careful treatment of data throughout all stages of a research project, from study design to long-term preservation and sharing of data.

It is rarely possible to write a complete data management plan before the start of a research project. You should update your DMP during the project to add the latest decisions and procedural changes. This process is sometimes called ‘active data management planning’. At the end of the project, the final data management plan will be part of the documentation of the publishable data sets.

Data stewardship versus data management
There is no consensus on the proper use of the terms ‘data stewardship’ and ‘data management’. People often assume that data management finishes and data stewardship starts when the project ends. Regardless of the terminology, creating FAIR data requires attention from the planning phase of a scientific experiment to the life-long maintenance of the data. DTL uses the terms ‘data management’ and ‘data stewardship’ interchangeably.

Discussions about efficient use of funding for data-intensive research have resulted in the definition of the FAIR principles: scientific data should become Findable, Accessible, Interoperable, and Reusable, for both humans and computers. You can also use the FAIR principles to guide your data management planning.

Data re-use obviously requires that the data can be found and accessed by others. In addition, both your own research and that of others who re-use your data will benefit if the data can easily be coupled to related data. This means you should make your data interoperable. For example, two data sets that both use a list of diseases should use the same vocabulary for these diseases. Similarly, two data sets that describe events at a specific location should use the same method to describe that location.
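The disease-vocabulary example above can be sketched in a few lines of Python. This is only an illustration: the mapping table, the codes in it (such as DISEASE:0001) and the function names are hypothetical placeholders, not terms from any real vocabulary. In practice you would use the identifiers of an established, commonly used vocabulary.

```python
# Hypothetical mapping from local disease labels to codes in a shared vocabulary.
# The codes are placeholders for illustration only.
SHARED_VOCAB = {
    "heart attack": "DISEASE:0001",
    "myocardial infarction": "DISEASE:0001",
    "type 2 diabetes": "DISEASE:0002",
    "t2dm": "DISEASE:0002",
}


def to_shared_code(label: str) -> str:
    """Translate a local disease label into its shared vocabulary code."""
    key = label.strip().lower()
    if key not in SHARED_VOCAB:
        # Fail loudly so that unmapped terms are noticed and added to the mapping.
        raise KeyError(f"no shared term for {label!r}")
    return SHARED_VOCAB[key]


def shared_diseases(labels_a, labels_b):
    """Return the sorted codes of diseases that occur in both data sets."""
    codes_a = {to_shared_code(label) for label in labels_a}
    codes_b = {to_shared_code(label) for label in labels_b}
    return sorted(codes_a & codes_b)


# Two data sets that describe the same disease with different local labels.
overlap = shared_diseases(
    ["Heart attack"],
    ["myocardial infarction", "T2DM"],
)
print(overlap)
```

Once both data sets are expressed in the same codes, they can be coupled on those codes even though the original labels differ.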

The FAIR principles offer practical guidance for data management planning:

To ensure Findability,

  • select a data repository at an early stage and check out its data format and metadata requirements;
  • make sure the data can get a persistent identifier so that it can be cited;
  • consider selecting a catalogue to make your data more findable, especially if the repository is more generic in nature.

To ensure Accessibility,

  • guarantee longevity of the data (e.g., by submitting it to a repository that has a certification like the Data Seal of Approval or an ISO certification);
  • check and describe the legal conditions under which the data can be made available (this is generally easier to do before you have collected and interpreted the data);
  • establish an embargo period if necessary;
  • make sure your ICT infrastructure will keep the data available even in case of equipment failure or human error.

To ensure Interoperability,

  • select commonly used data formats;
  • select commonly used vocabularies for data items.

To ensure Reusability,

  • make sure you keep proper provenance information (i.e., details about how and where the data was generated, including machine settings, and details about all processing steps, such as the software tools with their versions and parameters);
  • select the right minimal metadata standard and collect the necessary metadata (many minimal metadata standards are included in ELIXIR’s biosharing.org repository);
  • select a license for the data (preferably an open license) and for the associated software tools;
  • make sure the important conclusions of your study will not only be available in a paper in narrated form, but also in a digital file (e.g., a nanopublication).
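As a minimal sketch of what keeping provenance information could look like in practice, the Python snippet below stores the tool, its version and its parameters alongside each processing step. The field names, the function `provenance_record`, the file path and the tool name `example-aligner` are all hypothetical; a real project would follow the metadata standard it has selected.

```python
import json
import platform
from datetime import datetime, timezone


def provenance_record(input_file, tool, tool_version, parameters):
    """Build a plain dictionary describing one processing step."""
    return {
        "input_file": input_file,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": dict(parameters),
        # Details of the environment in which the step was run.
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical file name, tool name and parameter, for illustration only.
record = provenance_record(
    input_file="scans/spine_001.dcm",
    tool="example-aligner",
    tool_version="1.2.0",
    parameters={"threshold": 0.8},
)

# Serialise the record so that it can be stored next to the data it describes.
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing such a record at every processing step, rather than reconstructing it afterwards, keeps the provenance complete.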

At the DTL office, we are often asked for examples of good data management plans (DMPs) that can be reused for new projects. However, a DMP is tailored to a specific project and cannot simply be transferred to another project. Creating a good DMP requires serious thought. The following suggestions may be helpful:

  • The FAIR data principles can provide guidance.
  • Many funders and other organisations provide questions that must be answered in the DMP; these can serve as a template for your DMP (e.g., DMPonline).
  • Some DMP sections (e.g., those describing the set-up of your organisation’s infrastructure) may be the same for all your projects, sometimes even for all projects at your institute. Your ICT department, for instance, may be able to help you with a standardised description of its methods.
  • A data management policy document (i.e., a page describing how the DMP will be created) can sometimes be transferred from one project to another. A good example of such a policy document was written by Rob van Nieuwpoort of the Netherlands eScience Center; you can download it here.