CINECA synthetic cohort EUROPE UK1 referencing fake samples

Dataset ID        Technology            Samples
EGAD00001006673   Illumina HiSeq 2000   448

Dataset Description

Please note: This synthetic dataset (with cohort “participants” / “subjects” marked with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or results. The purpose of this dataset is to aid development of technical implementations for cohort data discovery, harmonization, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the CINECA project (funding: EC H2020 grant 825775) is acknowledged. For any questions please contact or

This dataset (CINECA_synthetic_cohort_EUROPE_UK1) consists of 2521 samples which have genetic data based on 1000 Genomes data, and synthetic subject attributes and phenotypic data derived from UK Biobank. These data were initially derived using the TOFU tool, which generates random values based on the UK Biobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated based on the distribution of values reported by the UK Biobank showcase, and date / time values were random. Additionally, we split the phenotypes and attributes into 4 main classes: general, cancer, diabetes mellitus, and cardiac. We assigned the general attributes to all the samples, and the cardiac / diabetes mellitus / cancer attributes to a proportion of the total samples. Once the initial ...
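The generation scheme described above can be illustrated with a minimal Python sketch. This is not the TOFU tool and does not reproduce its behaviour. The field names, category values, the normal distribution with its mean and standard deviation, the date range, the class proportions, and the "FAKE" identifier format are all hypothetical stand-ins chosen for illustration. Only the overall scheme comes from the description: random categorical values, distribution-based continuous values, random dates, general attributes for every sample, and the other three classes for a proportion of samples.

```python
import random
from datetime import date, timedelta

# Hypothetical data-dictionary entries (illustrative only, not UK Biobank fields).
CATEGORICAL = {"sex": ["Female", "Male"]}
CONTINUOUS = {"height_cm": (168.0, 9.0)}  # assumed (mean, standard deviation)
DATE_FIELDS = ["date_of_assessment"]

# Assumed proportions; the source only says "a proportion of the total samples".
CLASS_PROPORTIONS = {"cancer": 0.3, "diabetes mellitus": 0.3, "cardiac": 0.3}


def random_date(rng, start=date(2006, 1, 1), end=date(2010, 12, 31)):
    """Pick a uniformly random date in [start, end] (range is an assumption)."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def make_subject(rng, index):
    """Build one synthetic subject record."""
    # Every subject receives the general class.
    record = {"subject_id": f"FAKE{index}", "classes": ["general"]}
    for field, choices in CATEGORICAL.items():
        record[field] = rng.choice(choices)  # random categorical value
    for field, (mean, sd) in CONTINUOUS.items():
        record[field] = round(rng.gauss(mean, sd), 1)  # drawn from assumed distribution
    for field in DATE_FIELDS:
        record[field] = random_date(rng).isoformat()  # random date
    # The remaining classes are assigned to a proportion of subjects.
    for cls, proportion in CLASS_PROPORTIONS.items():
        if rng.random() < proportion:
            record["classes"].append(cls)
    return record


def make_cohort(n, seed=0):
    """Generate n synthetic subjects reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_subject(rng, i) for i in range(1, n + 1)]


cohort = make_cohort(2521)
```

Seeding a dedicated `random.Random` instance keeps the output reproducible without touching the global random state, which is convenient when testing code against a synthetic cohort.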

Data Use Conditions


Label                             Code          Version      Modifier
publication required              DUO:0000019   2019-01-07
general research use              DUO:0000042   2019-01-07
user specific restriction         DUO:0000026   2019-01-07
institution specific restriction  DUO:0000028   2019-01-07
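
For programmatic access checks, the conditions listed above could be held as plain records. The following is a minimal sketch: the labels, codes, and versions are copied from the table, while the structure and the `has_condition` helper are hypothetical and not part of any EGA or DUO API.

```python
# Data use conditions as listed for this dataset (values taken from the table).
DATA_USE_CONDITIONS = [
    {"label": "publication required", "code": "DUO:0000019", "version": "2019-01-07"},
    {"label": "general research use", "code": "DUO:0000042", "version": "2019-01-07"},
    {"label": "user specific restriction", "code": "DUO:0000026", "version": "2019-01-07"},
    {"label": "institution specific restriction", "code": "DUO:0000028", "version": "2019-01-07"},
]


def has_condition(conditions, code):
    """Return True if the given DUO code appears in the list (hypothetical helper)."""
    return any(entry["code"] == code for entry in conditions)


requires_publication = has_condition(DATA_USE_CONDITIONS, "DUO:0000019")
```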