What is metadata?
Data are units of information; metadata, simply put, is data about other data.
For example, a simple experiment in which a patient's genome is sequenced has many different metadata fields, ranging from the biological sex of the patient to the sequencing platform used. Another way to understand metadata, following our genomic experiment example, is to go through the different layers of (meta)data:
- One of the basic units of information (data) would be the sequence of nucleotides (e.g. GCTGG and GCCGG).
- This DNA sequence can be contextualised within a chromosome by providing its locus (e.g. β-globin gene). The locus is data about other data (the sequence of nucleotides), which makes it metadata.
- We can go one step further, and provide more context, by giving the associated phenotype of each of these sequences: GCTGG corresponds to a healthy individual, while GCCGG corresponds to a person with β-thalassaemia. Once again, this data is contextualising the data we have provided up to this point, which makes it metadata.
In this simple example, we are adding layers of data, each describing the ones below it. Had we given the sequences alone, a researcher could do little with them. Now that the data has its accompanying metadata, a researcher can compare both sequences and their phenotypes to infer a possible cause of the disorder. Each layer of information broadens the possible applications of the data.
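The layering described above can be sketched as a small Python record. This is only an illustration: `SequenceRecord` and `differing_positions` are hypothetical names chosen for this example and are not part of any EGA schema or tool. The values are the ones used in the example above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical record: one data field plus two metadata layers."""
    sequence: str   # data: the nucleotides themselves
    locus: str      # metadata layer 1: where the sequence sits
    phenotype: str  # metadata layer 2: phenotype associated with the sequence


records = [
    SequenceRecord("GCTGG", "β-globin gene", "healthy"),
    SequenceRecord("GCCGG", "β-globin gene", "β-thalassaemia"),
]


def differing_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


healthy, affected = records
# Comparing sequences is only meaningful because the metadata tells us
# that both come from the same locus and have different phenotypes.
diff = differing_positions(healthy.sequence, affected.sequence)
```

Without the `locus` and `phenotype` fields, the comparison would still run, but there would be no way to relate the differing position to the disorder.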
Now that we know what metadata is, it is time to emphasise the importance of submitting good-quality metadata. The quality of the metadata determines whether the data is usable, discoverable, and interpretable by researchers and clinicians. One example that highlights this is the description of a patient's biological sex: the terms "female," "girl," and "woman" all represent the same concept, but they are not harmonised. Such differing representations of a single concept can cause confusion and errors in data analysis. By applying standard definitions and terminologies to biological sex metadata, we can avoid misunderstandings and facilitate the integration of different datasets.
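As a minimal sketch of what harmonisation could look like, the snippet below maps free-text terms onto one standard term each. The mapping table and the `harmonise_sex` function are assumptions made for illustration; the "male" synonyms are not taken from the text above, and this is not EGA's actual vocabulary or implementation.

```python
# Hypothetical synonym table: free-text term -> harmonised term.
SEX_SYNONYMS = {
    "female": "female",
    "girl": "female",
    "woman": "female",
    "male": "male",
    "boy": "male",   # assumed synonym, for illustration only
    "man": "male",   # assumed synonym, for illustration only
}


def harmonise_sex(raw: str) -> str:
    """Map a free-text biological sex term to its harmonised form.

    Raises ValueError for terms that are not in the synonym table,
    rather than guessing.
    """
    key = raw.strip().lower()
    try:
        return SEX_SYNONYMS[key]
    except KeyError:
        raise ValueError(f"unrecognised biological sex term: {raw!r}") from None


submitted = ["female", "Girl", " woman "]
harmonised = [harmonise_sex(term) for term in submitted]
```

After harmonisation, the three submitted values collapse to a single term, so records from different datasets can be grouped or compared directly.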
To comply with the FAIR principles and, more specifically, to make the (meta)data interoperable (i.e. usable universally and unambiguously), we need to apply some standards to the (meta)data that EGA archives. These can be as simple as checking that a patient's biological sex is one of the accepted terms (e.g. female or male), or as complex as multiple layers and combinations of validation steps against existing records. You can find further details on how EGA manages these standards on our metadata model page.
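The two kinds of check mentioned above, a simple controlled-vocabulary check and a check against existing records, might be sketched as follows. Everything here is hypothetical: the field names `biological_sex` and `subject_id`, the `validate_sample` function, and the allowed-term set (which contains only the two terms given as examples above) are assumptions for illustration and do not describe EGA's real validation rules.

```python
# Assumed controlled vocabulary, limited to the terms named in the text.
ALLOWED_SEX_TERMS = {"female", "male"}


def validate_sample(sample: dict, known_subject_ids: set) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []

    # Simple check: the value must be one of the accepted terms.
    sex = sample.get("biological_sex")
    if sex not in ALLOWED_SEX_TERMS:
        errors.append(
            f"biological_sex must be one of {sorted(ALLOWED_SEX_TERMS)}, got {sex!r}"
        )

    # Cross-record check: the sample must refer to an already known subject.
    subject = sample.get("subject_id")
    if subject not in known_subject_ids:
        errors.append(f"subject_id {subject!r} does not match an existing record")

    return errors


existing_subjects = {"S1"}
ok = validate_sample({"biological_sex": "female", "subject_id": "S1"}, existing_subjects)
bad = validate_sample({"biological_sex": "woman", "subject_id": "S2"}, existing_subjects)
```

Returning a list of errors instead of raising on the first failure lets a submitter see every problem with a record at once.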