Synthetic Data

One limitation in genomics research is that human genomics data is not openly available; access must be controlled according to participant consent agreements and data protection regulations such as GDPR. Obtaining authorization to access such data can take a long time, delaying important research. In this context, synthetic genomic and phenotypic data can help researchers avoid these delays.

Synthetic data are artificially generated datasets, often created with algorithms, which can be used without the need for authorization to test new products and tools, build technical demonstrators, validate data models, and train AI models. The EGA provides access to synthetic cohort datasets, augmented with rich synthetic metadata, that overcome the usage restrictions on real data. Whilst synthetic datasets are not included in the general EGA mandate and services, we can consider such submissions and will evaluate them for acceptance based on whether they serve unique use cases not already covered by existing synthetic datasets. All synthetic data studies are open and permission is granted automatically; however, access must still be requested. Requests are handled by the responsible DAC at Central EGA or the relevant Federated EGA Node.

Study ID	Title	Located in
EGAS00001002472	CINECA synthetic cohort EUROPE UK1 referencing fake samples	Central EGA
EGAS00001005591	Synthetic data - Genome in a Bottle	Central EGA
EGAS00001005042	Test Study for EGA using data from 1000 Genomes Project - Phase 3	Central EGA
EGAS00001005702	Human genomic and phenotypic synthetic data for the study of rare diseases	Central EGA
EGAS50000000190	EOSC4Cancer Synthetic Colorectal Cancer Genomic data	Central EGA
EGAS50000001932	Synthetic longitudinal breast cancer whole-genome sequencing dataset	Central EGA
EGAS50000000086	Synthetic - FEGA Sweden Heilsa synthetic dataset December 2023	Federated EGA Sweden
EGAS50000000678	Synthetic - GDI synthetic data	Federated EGA Spain