Access to public datasets in the EGA

Contact Information

The EGA Helpdesk
helpdesk@ega-archive.org

Request Access

This DAC controls 38 datasets

Dataset ID	Description	Technology	Samples
EGAD00000000028	Aggregate results from a GWAS study on 3352 cases and 3145 controls		6497
EGAD00000000029	Aggregate results from a case-control study on stroke and ischemic stroke.		19602
EGAD00000000058	Aggregate results from 22 Carbamazepine-induced hypersensitivity syndrome patients and 2691 UK National Blood Service (NBS) control samples		2713
EGAD00000000059	Aggregate results from 43 Carbamazepine-induced hypersensitivity syndrome patients and 1296 control samples from the 1958 British Birth Cohort		1
EGAD00000000115	Summary data from GWAS analysis on 856 cases and 2836 controls		3719
EGAD00001001626	RNA-Seq Illumina GAII dataset for the TraIT cell-line use case (added reverse and forward reads).	Illumina Genome Analyzer II	6
EGAD00001002069	Complete Genomics data for VCaP and PC346c.		2
EGAD00001002071	qDNAseq shallow sequencing dataset of the cell line use case.		5
EGAD00001002109	TSACP TruSeq Amplicon Panel dataset for the TraIT cell line use case		5
EGAD00001002250	mRNA-Seq, HiSeq 2000 dataset of the Cell-line use case	Illumina HiSeq 2000	1
EGAD00001003338	This is a test dataset derived from public data of the 1000 Genomes Project. Its purpose is not to allow for any inference about cohort data or results, but to aid bioinformaticians in the technical development and testing of tools, as well as data consumers in learning how to access information. This dataset consists of 2508 samples from the 1000 Genomes Project (https://www.nature.com/articles/nature15393). Data for each sample (e.g. NA18534) can be accessed through the IGSR portal (e.g. https://www.internationalgenome.org/data-portal/sample/NA18534) or the corresponding folder at the 1000 Genomes FTP site (e.g. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHB/NA18534/exome_alignment/). The dataset encompasses several types of data: Variant Call Format (VCF, or its binary counterpart BCF) files, both joint (e.g. ALL_chr22_20130502_2504Individuals.vcf.gz) and split (e.g. HG01775.chrY.vcf.gz); exome sequencing CRAM files (e.g. NA18534.GRCh38DH.exome.cram); whole genome sequencing CRAM/BAM files (e.g. NA19239.cram). Additionally, multiple files were sliced into shorter files for quick download, with names formatted as "{FILE-INFO}__{NUMBER-OF-READS}r__{CHR}.{START-COORDINATE}-{END-COORDINATE}.{FILETYPE}" (e.g. "HG01500.GRCh38DH__90r__3.10000-10500__4.10000-10500.cram"). These files can be downloaded directly through the EGA-download-client PyEGA3 (https://github.com/EGA-archive/ega-download-client).	AB SOLiD 4 System unspecified	6
EGAD00001003971	ICGC-TCGA DREAM Somatic Mutation Calling - Tumour Heterogeneity Challenge - WGS mapped reads		59
EGAD00001005747	RNAseq sample used in study titled "Immune-awakening revealed by peripheral T cell dynamics after one cycle of immunotherapy".	Illumina HiSeq 2500	1
EGAD00001006673	Please note: This synthetic data set (with cohort “participants” / “subjects” marked with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or results. The purpose of this dataset is to aid development of technical implementations for cohort data discovery, harmonization, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the CINECA project (funding: EC H2020 grant 825775) is acknowledged. For any questions please contact isuru@ebi.ac.uk or cthomas@ebi.ac.uk. This dataset (CINECA_synthetic_cohort_EUROPE_UK1) consists of 2521 samples which have genetic data based on 1000 Genomes data (https://www.nature.com/articles/nature15393), and synthetic subject attributes and phenotypic data derived from UKBiobank (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779). These data were initially derived using the TOFU tool (https://github.com/spiros/tofu), which generates random values based on the UKBiobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated based on the distribution of values reported by the UK Biobank showcase, and date / time values were random. Additionally, we split the phenotypes and attributes into 4 main classes: general, cancer, diabetes mellitus, and cardiac. We assigned the general attributes to all the samples, and the cardiac / diabetes mellitus / cancer attributes to a proportion of the total samples. Once the initial set of phenotypes and attributes was generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight data, and age at death was derived from date of death and date of birth. These data were then loaded to the development instance of Biosamples (https://www.ebi.ac.uk/biosamples/), which accessioned each of the samples. The genetic data are derived from the 1000 Genomes Phase 3 release (https://www.internationalgenome.org/category/phase-3/). The genotype data consist of a single joint-call VCF file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via plink for these samples and genotypes. The genotype data have had a variety of errors introduced to mimic real data and to serve as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling and low call rates for a randomly chosen subset of sample data, as well as deviations from Hardy-Weinberg equilibrium and low call rates for a random selection of variants. Additionally, 40 samples have raw genetic data available in the form of both BAM and CRAM files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. The genetic data were then linked to the synthetic data in BioSamples, and submitted to EGA.	Illumina HiSeq 2000	448
EGAD00001008095	This dataset contains whole genome sequencing data for the three members of a trio, provided as BAM files. These BAM files cover chromosomes 21, X, Y and the mitochondrial genome.		3
EGAD00001008096	This dataset contains whole genome sequencing data for the three members of a trio, provided as paired-end FASTQ files.	Illumina HiSeq 2500	3
EGAD00001008097	This dataset contains whole genome sequencing data for the three members of a trio, provided as VCF.		3
EGAD00001008392	The purpose of this project is to provide public human datasets for the study of rare diseases. Combining a public human genomic background with the in-silico insertion of real disease-causing variants yields a representative dataset for testing purposes, without the ethical and legal issues associated with the use of sensitive human data. This project aims to help the development of technical implementations for rare disease data integration, analysis, discovery, and federated access.	Illumina HiSeq 2000	18
EGAD00001009826	This is a test dataset derived from public data of the 1000 Genomes Project. Its purpose is not to allow for any inference about cohort data or results, but to aid bioinformaticians in the technical development and testing of tools, as well as data consumers in learning how to access information. This dataset consists of 3 pairs of light-weight (sliced) files: BAM + BAI, CRAM + CRAI and VCF + TBI. These files can be downloaded directly through the EGA-download-client PyEGA3 (https://github.com/EGA-archive/ega-download-client). For any further questions, please contact the DAC (Helpdesk - email: helpdesk [at] ega-archive [dot] org).	unspecified	1
EGAD00010000300	Summary statistics from Haemgen RBC GWAS	Affymetrix Illumina Perlegen	1
EGAD00010000434	Normalised mRNA expression	Illumina HT 12	1302
EGAD00010000438	Normalized miRNA expression data	Agilent ncRNA 60k	1480
EGAD00010000440	Segmented copy number data	Affymetrix_SNP6_raw	1302
EGAD00010000444	Agilent ncRNA 60k txt files	Agilent ncRNA 60k	1480
EGAD00010000528	Illumina HumanHT-12 v4 array		-
EGAD00010000934	Agilent miRNA dataset	Agilent SurePrint Human miRNA Microarray	2
EGAD00010000935	ACGH 244K dataset	Agilent 244K	10
EGAD00010000936	Affymetrix Exon Array dataset	Affymetrix GeneChip Human Exon 1.0 ST	2
EGAD00010000937	ACGH 180K dataset	Agilent 180K	5
EGAD00010000938	mRNA Array Agilent 44K dataset	Agilent 44K	16
EGAD00010000939	Illumina 1M SNP Array dataset	Illumina 1M SNP Array	2
EGAD00010001006	Proteomics LC-MS MS dataset	Liquid chromatography–mass spectrometry	8
EGAD00010001029	Summary statistics for a multi-cohort epigenome-wide association study. This includes summary statistics (effect-size, standard error, p-value) for 470,000 methylation markers.		-
EGAD00010002453	Synthetic dataset containing genome-wide genotypes of 500,000 individuals, generated using a hybrid approach combining coalescent and resampling-based methods		500000
EGAD50000000276	The synthetic genomes were created to mimic real cancer data of 4 patients (named 185, 186, 187 and 188). Mutations are based on real CRC patients from the PCAWG dataset. For each patient, two tumor samples at different time points and one healthy sample have been simulated. The intra-tumor heterogeneity and evolution of the cancer in each patient are depicted by simulating reads from tumor subclones separately and then mixing them according to their clonal proportions in each sample. For rapid use and transfer, only selected chromosomes have been generated for each patient. Chromosomes per patient: 185: chr4, chr5, chr7, chr17; 186: chr1, chr7, chr12, chr17; 187: chr1, chr2, chr5, chr12, chr17; 188: chr2, chr5, chr12, chr13, chr17. Workflows used to create BAM/BAI, VCF and MAF files from FASTQ (alignment with GRCh38): https://usegalaxy.eu/published/workflow?id=2c3d05023c02113e and https://usegalaxy.eu/published/workflow?id=1da86d74f8535f4e	unspecified	8
EGAD50000000564	This dataset contains synthetic WGS data for 10 tumor-normal pairs of colorectal cancer, simulated in the standard format of Illumina paired-end reads. The NEAT read simulator (version 3.0, https://github.com/zstephens/neat-genreads) was used to synthesize these 10 pairs of tumor and normal WGS data. During data generation, the simulation parameters (i.e., sequencing error statistics, read fragment length distribution and GC% coverage bias) were learned from data models provided by NEAT. The target average sequencing depth was around 110X for tumor samples and 60X for normal samples. To generate the synthetic normal WGS data for each sample, a germline variant profile from a real patient was randomly down-sampled to 50% of that patient's germline variants. These were mixed with an equal share of in silico germline variants modelled randomly using an average mutation rate (0.001), together constituting the full germline profile for the normal synthetic WGS data. To generate the synthetic tumor WGS data for each sample, a pre-defined somatic short variant profile (SNVs+Indels) learnt from a real CRC patient was added to the germline variant profile used for the normal synthetic WGS data of the same patient, together constituting the variants for the tumor sample. Neither a copy number profile nor a structural variation profile was introduced into the tumor synthetic WGS data. Tumor content and ploidy were assumed to be 100% and 2, respectively. For mapping/variant detection, the Sarek pipeline v3.1.2 (https://nf-co.re/sarek/3.1.2) was used, specifically: 1. BWA v0.7.17-r1188 for read mapping; 2. GATK v4.3.0.0 for pre-processing BAM files (including markduplicates and recalibration); 3. Mutect2 (GATK v4.3.0.0) for somatic variant calling; 4. Strelka2 v2.9.10 for germline and somatic variant calling. Metadata of the 10 CRC patients used for the generation of synthetic normal and tumor WGS data (columns: Patient_id Tumor_barcode Normal_barcode Age Sex Tissue Cancer): SIM007 SIM007_T SIM007_N 71 F Rectal Primary CRC; SIM008 SIM008_T SIM008_N 45 F Colon Neuroendocrine Metastasis CRC; SIM010 SIM010_T SIM010_N 62 M Colon Metastasis CRC; SIM011 SIM011_T SIM011_N 55 M Colon Neuroendocrine Metastasis CRC; SIM012 SIM012_T SIM012_N 57 M Rectal Metastasis CRC; SIM013 SIM013_T SIM013_N 69 M Colon Metastasis CRC; SIM014 SIM014_T SIM014_N 68 M Colon Neuroendocrine primary CRC; SIM015 SIM015_T SIM015_N 58 F Colon Primary CRC; SIM016 SIM016_T SIM016_N 49 M Colon/Rectal Primary CRC; SIM017 SIM017_T SIM017_N 78 M Colon Neuroendocrine primary CRC	unspecified	20
EGAD50000000955	Synthetic - This dataset contains the pheno-clinical and genomic information of 42046 individuals from COVID Population 11 Finland, Subgroup 2. 2010 are affected by Phenotype 1, 2010 are affected by Phenotype 2, 188 are affected by Phenotype 1 and 2, and 37838 are control. The dataset also contains the information about the smoking habits of each individual.		42046
EGAD50000002083	This dataset contains the standardized clinical information related to Use case 1 - Liver Cancer (HCC) from EuCanImage. The data is synthetic.		100
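
The description of EGAD00001003338 above gives a naming pattern for its sliced files: "{FILE-INFO}__{NUMBER-OF-READS}r__{CHR}.{START-COORDINATE}-{END-COORDINATE}.{FILETYPE}". The following is a minimal sketch of how such a name might be split into its fields. The function and class names are hypothetical and not part of any EGA tool. It also assumes, based only on the single example name given in that description, that a file covering several regions repeats the region field separated by "__".

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


class SlicedName(NamedTuple):
    file_info: str
    n_reads: int
    regions: List[Region]
    filetype: str


# One region field, e.g. "3.10000-10500".
_REGION = re.compile(r"(?P<chrom>[^.]+)\.(?P<start>\d+)-(?P<end>\d+)")
# The last region field also carries the file type, e.g. "4.10000-10500.cram".
_LAST = re.compile(_REGION.pattern + r"\.(?P<filetype>.+)")


def parse_sliced_name(name: str) -> SlicedName:
    """Split a sliced-file name into file info, read count, regions and type."""
    parts = name.split("__")
    if len(parts) < 3:
        raise ValueError(f"not a sliced-file name: {name!r}")
    file_info, reads, *region_fields = parts

    reads_match = re.fullmatch(r"(\d+)r", reads)
    if reads_match is None:
        raise ValueError(f"bad read-count field: {reads!r}")

    last = _LAST.fullmatch(region_fields[-1])
    if last is None:
        raise ValueError(f"bad final region field: {region_fields[-1]!r}")

    regions = []
    for field in region_fields[:-1]:
        m = _REGION.fullmatch(field)
        if m is None:
            raise ValueError(f"bad region field: {field!r}")
        regions.append(Region(m["chrom"], int(m["start"]), int(m["end"])))
    regions.append(Region(last["chrom"], int(last["start"]), int(last["end"])))

    return SlicedName(file_info, int(reads_match.group(1)), regions, last["filetype"])


if __name__ == "__main__":
    # Example name taken from the dataset description.
    print(parse_sliced_name("HG01500.GRCh38DH__90r__3.10000-10500__4.10000-10500.cram"))
```

Names that do not follow the pattern raise ValueError rather than being guessed at, since the description does not say how other files in the dataset are named.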