The FAIR Data Principles explained

These webpages provide an actionable list of the FAIR Data Principles as a simple guide to publishing data. For each principle, we give a basic definition, examples, and links to useful resources. We hope that by working through the list, anyone wishing to maximise the reusability of their data can prioritise their efforts and make more informed choices about a suitable repository. We hope that this list will also focus the growing public discourse around FAIR: what exactly is FAIR, and what is it not?

Findable: Data and metadata are easy to find by both humans and computers. Machine-readable metadata are essential for the automatic discovery of relevant datasets and services, and are therefore a key component of the FAIRification process.

Accessible: Limitations on the use of data, and protocols for querying or copying data are made explicit for both humans and machines.

Interoperable: The computer can interpret the data, so that they can be automatically combined with other data. There is a historical trend in computer science toward increased interoperation (for example, between different hardware designs, operating systems, programming languages, and communication protocols). Data interoperability can be seen as the ragged edge of this long-term trend. However, data interoperation is a non-trivial problem and the “I” will require the most creative effort in making FAIR Data.

Reusable: Data and metadata are sufficiently well described for both humans and computers, so that they can be replicated or combined in future research.

What does this mean?
Principle F1 is arguably the most important, because without ‘globally unique and persistent identifiers’ it will be hard to achieve the other elements of FAIR. Hence, compliance with F1 will already take you a long way toward your goal of publishing FAIR data (see 10 ways identifiers can help with data integration).

Globally unique and persistent identifiers remove ambiguity from the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. An identifier in this case means a link on the internet (for example, a URL that resolves to a web page defining the concept, such as a particular human protein: http://www.uniprot.org/uniprot/P98161). Many data repositories will automatically assign globally unique and persistent identifiers to deposited datasets. Not only do identifiers help others to know exactly what you mean, they also allow your data to be interpreted meaningfully by computers that are either searching for, or trying to automatically integrate, your data. Identifiers are essential to the human-machine interoperation that plays such a vital role in the vision of Open Science. Finally, identifiers will also help others to properly cite your work when they reuse your data.

Of course, identifiers are one thing, but their meaning is another (see principles I1-I3). F1 stipulates two conditions for your identifier: (1) it must be globally unique, meaning that no one else could reuse or reassign the same identifier without thereby referring to your data. You can obtain globally unique identifiers from a registry service that uses algorithms guaranteeing each newly minted identifier is unique. (2) It must be persistent: it takes time and money to keep links active on the web, and over time links tend to get ‘broken’. Registry services guarantee (to some degree) resolvability of that link into the future.
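The uniqueness half of F1 can be illustrated with a short sketch. A UUID is one identifier that is unique by construction, embedded here in a resolvable link; the `example.org` base domain is a placeholder, not a real registry service.

```python
import uuid

# Mint an identifier that is globally unique by construction:
# UUID4 collisions are astronomically unlikely, which is the kind of
# guarantee a registry service provides when minting identifiers.
unique_id = uuid.uuid4()

# Embed it in a resolvable link. The base domain is a placeholder;
# a real registry (e.g. a DOI provider) supplies its own.
identifier = f"https://example.org/dataset/{unique_id}"
print(identifier)
```

Note that uniqueness is the easy half; persistence still requires a registry that commits to keeping the link resolvable over time.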

Examples of globally unique and persistent identifiers

Example services that supply globally unique and persistent identifiers

What does this mean?
In creating FAIR digital resources, metadata can (and should) be generous and extensive, including descriptive information about the context, quality, condition, and characteristics of the data. Rich metadata allow computers to automatically accomplish routine and tedious sorting and prioritizing tasks that currently demand a lot of attention from researchers. As such, compliance with F2 helps people to locate your data, and increases its reuse and citation. What does “rich metadata” imply? That you should not presume to know who will want to use your data, or for what purpose. So, as a rule of thumb, never say “this metadata isn’t useful”: be generous, and provide it anyway!

Examples

  • This includes “intrinsic” metadata, for example the data captured automatically by the machines that generate data (such as DICOM information for image files), as well as “contextual” metadata, for example the protocol used (using both keywords and links to a formal protocol document), the measurement devices used (again, both keywords and links to manufacturers), the units of the captured data, the species involved (explicitly by taxon ID, for example http://www.uniprot.org/taxonomy/9606), the genes/proteins/other entities that are the focus of the study (such as GO terms), the physical parameter space of observed or simulated astronomical datasets, questions and concepts linked to longitudinal data, calculations of the properties of materials, or any other details about the experiment. See User Controlled-Metadata.
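A minimal sketch of such a record, combining intrinsic and contextual metadata; the field names and values below are illustrative, not a prescribed schema (only the taxonomy URL is a real identifier, taken from the example above).

```python
# Illustrative metadata record mixing intrinsic and contextual fields.
# The keys are hypothetical, not a formal metadata standard.
metadata = {
    # intrinsic: captured automatically by the measuring device
    "file_format": "DICOM",
    "acquisition_device": "MRI scanner (with a link to the manufacturer)",
    # contextual: supplied by the researcher
    "protocol": "https://example.org/protocols/imaging-v3",  # placeholder
    "species": "http://www.uniprot.org/taxonomy/9606",  # Homo sapiens taxon ID
    "units": "mm",
}
print(sorted(metadata))
```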

Example frameworks

What does this mean?
This is a simple and obvious principle, but of critical importance to FAIR. The metadata and the dataset they describe are usually separate files. The association between a metadata file and the dataset should be made explicit by mentioning a dataset’s globally unique and persistent identifier in the metadata. As stated in F1, many repositories will generate globally unique and persistent identifiers for deposited datasets that can be used for this purpose.

Example 
The connection should be annotated in a formal manner – for example, using the foaf:primaryTopic predicate in the case of RDF metadata.
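As a sketch, the metadata file can carry the dataset’s identifier in exactly this way. The snippet below assembles a minimal RDF/XML fragment by plain string templating; the dataset URL is a placeholder, not a real identifier.

```python
# Build a minimal RDF/XML metadata fragment whose foaf:primaryTopic
# points at the dataset's globally unique and persistent identifier.
# The dataset URL below is a placeholder.
dataset_id = "https://example.org/dataset/1234"

rdf_xml = f"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="{dataset_id}/metadata">
    <foaf:primaryTopic rdf:resource="{dataset_id}"/>
  </rdf:Description>
</rdf:RDF>"""
print(rdf_xml)
```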

Links to Resources
The DTL FAIRifier tool guarantees F3.

What does this mean?
Identifiers and rich metadata descriptions alone will not ensure ‘findability’ on the internet. Perfectly good data resources may go unused simply because no one knows they exist. If the availability of a digital resource such as a dataset, service, or repository is not known, then nobody (and no machine) can discover it. There are many ways in which digital resources can be made discoverable, including indexing. For example, Google sends out spiders that ‘read’ web pages and automatically index them, so that they become findable in the Google search box. This is great for most ordinary searches, but for scholarly research data we need to be more explicit about indexing. Principles F1-F3 provide the core elements for fine-grained indexing by some current repositories and future services.

Examples

  • The metadata of FAIR Datasets that are published on FAIR Data Points can be used for indexing by the DTL Search Engine.
  • It may be that registries of FAIR datasets emerge over time by repositories or groups that have interest in specialized topical domains.

Links to Resources

What does this mean?
Most users of the internet retrieve data by ‘clicking on a link’. This is a high-level interface to a low-level protocol called TCP, which the computer executes to load data into the web browser. HTTP(S) and FTP, which form the backbone of the modern internet, are built on TCP and make requesting and providing digital resources substantially easier than other communication protocols. Principle A1 states that FAIR data retrieval should be possible without specialised tools or communication methods. So clearly define who can access the actual data, and specify how.
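As a sketch, retrieving a FAIR record over HTTP needs nothing beyond the standard library; the request below targets the UniProt record cited earlier in this guide, with the actual fetch left commented out so the example does not depend on network access.

```python
from urllib.request import Request, urlopen

# A standard HTTP GET is all that principle A1 asks for: no special
# tooling, just a universally implemented protocol.
req = Request("http://www.uniprot.org/uniprot/P98161", method="GET")
print(req.full_url, req.get_method())

# To actually fetch the record (requires network access):
# with urlopen(req) as response:
#     body = response.read()
```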

Examples

  • Most data producers will use HTTP(S) or FTP.
  • Barriers to access that should be avoided include: protocols that have limited implementations, poor documentation, or components involving manual human intervention. Note, however, that for highly sensitive data, for example, it may not be possible to provide secure access through a fully mechanised protocol. In such cases, it is perfectly FAIR to provide an email address, telephone number, or Skype name of a contact person who can discuss access to the data. To be FAIR, however, this contact “protocol” must be clear and explicit in the metadata.
  • FAIR Accessor (see Interoperability and FAIRness through a novel combination of Web technologies) http://linkeddata.systems/Accessors/UniProtAccessor/C8V1L6
<rdf:Description xmlns:ns1="http://purl.org/dc/elements/1.1/" xmlns:ns2="http://www.w3.org/ns/dcat#" rdf:about="http://linkeddata.systems/Accessors/UniProtAccessor//C8V1L6#Distribution98F7E238-0976-11E7-BCC8-A5FB5C07C3DD">
  <ns1:format>application/rdf+xml</ns1:format>
  <rdf:type rdf:resource="http://purl.org/dc/elements/1.1/Dataset"/>
  <rdf:type rdf:resource="http://rdfs.org/ns/void#Dataset"/>
  <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Distribution"/>
  <ns2:downloadURL rdf:resource="http://www.uniprot.org/uniprot/C8V1L6.rdf"/>
</rdf:Description>

What does this mean?
To maximise data reuse, the protocol should be free (no cost) and open (open-sourced), and thus globally implementable to facilitate data retrieval. Anyone with a computer and an internet connection should be able to access at least the metadata. Hence, this criterion will impact your choice of the repository where you will share your data.

Examples

  • HTTP, FTP, SMTP, …
  • Telephone (arguably not universally implementable, but close enough)
  • A counter-example would be Skype, which is not universally implementable because it is proprietary
  • Microsoft Exchange Server protocol is also proprietary

Links to Resources

What does this mean?
This is a key, but often misunderstood, part of FAIR. The ‘A’ in FAIR does not necessarily mean ‘open’ or ‘free’; rather, it means that the exact conditions under which the data are accessible are given. So even heavily protected and private data can be FAIR. Ideally, accessibility is specified so transparently that a machine can automatically understand the requirements, and then either execute them automatically or alert the user to them. It often makes sense to ask users to create a user account on a repository. This makes it possible to authenticate the owner (or contributor) of each dataset, and potentially to set user-specific rights. Hence, this criterion will also affect your choice of the repository where you will share your data.

Examples

  • HMAC authentication
  • HTTPS
  • FTPS
  • Telephone
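Of the mechanisms above, HMAC authentication can be sketched in a few lines: client and server share a secret key, the client signs each request, and the server verifies the signature. The key and request line below are illustrative only.

```python
import hashlib
import hmac

# Illustrative shared secret and request line; real deployments
# exchange the key out of band and sign more of the request.
secret = b"shared-secret-key"
message = b"GET /datasets/42"

# The client computes the signature and sends it with the request.
signature = hmac.new(secret, message, hashlib.sha256).hexdigest()

# The server recomputes the HMAC and compares in constant time.
is_authentic = hmac.compare_digest(
    signature, hmac.new(secret, message, hashlib.sha256).hexdigest()
)
print(is_authentic)
```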

Links to Resources

What does this mean?
As there is a cost to maintaining an online presence for data resources, datasets will tend to degrade or disappear over time. When this happens, links break and users waste time hunting for data that may no longer be there. Storing the metadata, however, is generally much easier and cheaper. Principle A2 states that metadata should persist even when the data are no longer sustained. A2 is related to the registration and indexing issues described in F4.

Examples

  • Metadata are valuable in and of themselves when planning research, especially replication studies. Even if the original data are missing, tracking down the people, institutions, or publications associated with the original research can be extremely useful.

What does this mean?
Humans should be able to exchange and interpret each other’s data (so preferably do not use dead languages). The same applies to computers: data should be readable by machines without the need for specialised or ad hoc algorithms, translators, or mappings. Interoperability typically means that each computer system has at least some knowledge of the formats in which the other system exchanges data. For this to happen, and to ensure automatic findability and interoperability of datasets, it is critical to use (1) commonly used controlled vocabularies, ontologies, and thesauri (with resolvable globally unique and persistent identifiers, see F1) and (2) a good data model (a well-defined framework to describe and structure (meta)data).
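As a sketch, “using a controlled vocabulary” amounts to replacing free-text labels with resolvable term identifiers. The taxonomy URL below is the real identifier cited elsewhere in this guide; the tissue term URL is a placeholder standing in for an ontology term.

```python
# Replace ambiguous free-text labels with resolvable vocabulary terms.
# The taxonomy URL is real (human, taxon 9606); the tissue term URL
# is a placeholder standing in for an ontology such as UBERON.
free_text = {"species": "human", "tissue": "liver"}

controlled = {
    "species": "http://www.uniprot.org/taxonomy/9606",
    "tissue": "https://example.org/ontology/liver",  # placeholder term
}

# Keep the human-readable label alongside the machine-resolvable term.
annotated = {key: {"label": free_text[key], "term": controlled[key]}
             for key in free_text}
print(annotated["species"]["term"])
```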

Examples

Links to Resources

What does this mean?
The controlled vocabulary used to describe data sets needs to be documented and resolvable using globally unique and persistent identifiers. This documentation needs to be easily findable and accessible by anyone who uses the data set.

Examples

  • Using the FAIR Data Point ensures I2

Links to resources

What does this mean?
A “qualified reference” is a cross-reference that explains its “intent”. For example, X “is regulator of” Y is a much more qualified reference than X “is associated with” Y or X “see also” Y. The goal, therefore, is to create as many meaningful linkages as possible between (meta)data resources to enrich the contextual knowledge about the data, balanced against the time/energy involved in making a good data model. To be more concrete, if one dataset builds on another data set, if additional data sets are needed to complete the data, or if complementary information is stored in a different data set, this needs to be specified. In particular, the scientific links between the datasets need to be described. Furthermore, all datasets need to be properly cited (i.e. including their globally unique and persistent identifiers).

Examples

What does this mean?
Principle R1 is not unrelated to F2. By giving data many ‘labels’, it will be much easier to find and reuse the data. The objective of R1, however, differs from F2 in that here we focus on the ability of a user (machine or human) to decide if the data just found (F2) is actually USEFUL in their particular context. To make this decision, the data publisher should provide not just metadata that allows discovery, but also metadata that richly describes the context under which that data was generated. This may include the experimental protocols, the manufacturer and brand of the machine or sensor that created the data, the species used, the drug regime, etc. Moreover, implicit in R1 is the idea that the data publisher should not attempt to predict who the data consumer is, or what their needs will be. As such, the term “plurality” was used to indicate that the metadata author should be as generous as possible with their metadata, even to the point of providing information that may seem irrelevant.

Some points to take into consideration (non-exhaustive list):

  • Describe the scope of your data: for what purpose was it generated/collected?
  • Mention any particularities or limitations about the data that other users should be aware of.
  • Specify the date of the data set generation/collection, the lab conditions, who prepared the data, the parameter settings, the name and version of the software used.
  • Is it raw or processed data?
  • Ensure that all variable names are explained or self-explanatory (i.e. defined in the research field’s controlled vocabulary).
  • Clearly specify and document the version of the archived and/or reused data.

Links to Resources

What does this mean?
Under ‘I’ we covered elements of technical interoperability; R1.1 is about legal interoperability. What usage rights do you attach to your data? These should be described clearly. Ambiguity could severely limit the reuse of your data by organizations that struggle to comply with licensing restrictions. The more that automated searches take licensing into account, the more important the clarity of your data’s licensing status becomes. The conditions under which the data can be used should be clear to both machines and humans.

Examples

  • Commonly used licenses such as MIT or Creative Commons can be linked to your data. Methods for marking up this metadata are given in the DTL FAIRifier.

Links to Resources

What does this mean?
You should know where the data came from, with a clear story of its origin and history (see R1), but you also need to know whom to cite if you reuse the data, and/or how the author wishes to be acknowledged. Include a description of the workflow that led to your data: Who generated or collected it? How has it been processed? Has it been published before? Does it contain data from someone else, possibly transformed or completed by you? Ideally, this workflow is described in a machine-readable format.
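A minimal machine-readable sketch of such a workflow description, modelled loosely on the W3C PROV vocabulary; all names, identifiers, and dates below are illustrative placeholders.

```python
# Loosely PROV-style provenance record; every value is a placeholder.
provenance = {
    "wasDerivedFrom": ["https://example.org/dataset/raw-measurements"],
    "wasGeneratedBy": "normalisation-pipeline v2.1",
    "wasAttributedTo": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    "generatedAtTime": "2017-06-01T12:00:00Z",
    "previouslyPublishedAs": None,  # not published before
}
print(sorted(provenance))
```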

Examples

Links to Resources

What does this mean? 
It is easier to reuse data sets if they are similar: the same type of data, data organized in a standardized way, well-established and sustainable file formats, and documentation (metadata) following a common template and using a common vocabulary. If community standards or best practices for data archiving and sharing exist, they should be followed. Many communities have, for example, “Minimal Information” standards (MIAME, MIAPE, etc.), and FAIR data should at least meet those standards. Other community standards may be less formal, but nevertheless, publishing (meta)data in a manner that increases its use(ability) by the community is the primary objective of FAIRness. There might be situations where good practice exists for the type of data to be submitted, but the submitter has valid and specified reasons to deviate from the standard practice. This needs to be addressed in the metadata. Note that quality issues are not addressed by the FAIR principles: how reliable data are lies in the eye of the beholder and depends on the intended application.

Examples

Links to Resources