Single cell RNAseq dataset for the study "Composition and functional status of T and NK cells in Extramedullary myeloma tumor microenvironment"
The dataset contains all newly generated single cell RNAseq data for the study. It includes 5 EMM soft tissue samples, 5 EMM bone marrow samples (EMM_BM) and 6 RRMM BM samples without EMM (RRMM_BM). Samples that were sequenced twice are marked by _1 or _2 in the sample name. Sample names correspond to those used in the publication. The remaining EMM single cell RNAseq samples mentioned in the publication (EMM_03_1, EMM_11, EMM_14_1, EMM_14_2, EMM_16, and EMM_15_1) are already available in the EGA under dataset ID EGAD50000000053.
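The _1/_2 naming convention can be parsed mechanically. The following is a minimal sketch, assuming the run marker is always the final "_1" or "_2" of the sample name; the function names are hypothetical and not part of any EGA tool.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple

# Assumption: a trailing "_1" or "_2" is the sequencing-run marker.
_RUN_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<run>[12])$")


def split_sample_name(name: str) -> Tuple[str, Optional[int]]:
    """Return (base sample name, run number), or (name, None) if unmarked."""
    match = _RUN_SUFFIX.match(name)
    if match is None:
        return name, None
    return match.group("base"), int(match.group("run"))


def resequenced(names: Iterable[str]) -> List[str]:
    """List base names for which both run 1 and run 2 appear in `names`."""
    runs = defaultdict(set)
    for name in names:
        base, run = split_sample_name(name)
        if run is not None:
            runs[base].add(run)
    return sorted(base for base, seen in runs.items() if seen == {1, 2})
```

Applied to the sample names listed above, `resequenced` would report only the samples for which both runs are present in the list it is given.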
Dataset
EGAD50000001511
Providing safe access to sensitive human data across borders: Federated EGA becomes a reality
Enabling discovery and access to sensitive data across national boundaries is
vital for improving human health. It enables more powerful and efficient
research by increasing the volume and diversity of data available for
analysis. It allows us to better understand the causes of diseases – cancer,
rare diseases, infectious diseases like COVID-19 – and develop new medicines
and treatments.
Sensitive human omics data are typically generated by research initiatives and
shared using specialist repositories which provide services for data
submission, discovery, and access. The EGA is one such repository. Established
in 2008 at EMBL’s European Bioinformatics Institute (EMBL-EBI) in the UK, since 2012 the EGA (“Central EGA”) has been jointly managed by
EMBL-EBI and the
Centre for Genomic Regulation (CRG)
in Spain.
Many countries have emerging personalised medicine programmes which generate
data from national initiatives. These programmes are driving a transition in
human genomics from being research-driven to receiving funding through
healthcare. Data generated in a clinical context are subject to stricter
governance than research data and must follow national data protection
legislation.
To solve these challenges, the Federated EGA provides a network of connected
resources to enable transnational discovery of and access to human omics data
for research, while also respecting jurisdictional data protection
regulations. In this way, the Federated EGA infrastructure supports the goals
of European initiatives such as the
1+ Million Genomes initiative
(1+MG), the
European Health Data Space, and a number of
EU-funded 1+MG implementation projects including Beyond 1 Million
Genomes.
The Federated EGA is made up of “Nodes” – typically nationally funded and
operated – which store and manage data locally while allowing global discovery
within the Federated EGA network. Since 2016, multiple parallel efforts,
supported by ELIXIR and other transnational and national initiatives, have
created technical and legal frameworks for establishing the Federated EGA. In
2022, the first five Nodes – Finland, Germany, Norway, Spain, and Sweden –
officially joined the Federated EGA by signing Federated EGA Collaboration
Agreements with Central EGA.
The Finnish FEGA Node is operated by CSC - IT Center for Science and provides
data management services according to national laws and the requirements of
the EU General Data Protection Regulation (GDPR). These services provide tools
and support for the whole life-cycle of sensitive research data from collating
to analysis, publication, and authorised re-use. The development of the
services has been a joint effort with the other Nordic nodes within NeIC's
Tryggve and Heilsa projects, and funded by Finnish Ministry of Education and
Culture and projects coordinated by ELIXIR Finland.
Read more about Finnish FEGA signing.
The
German Human Genome-Phenome Archive (GHGA)
strives to provide a national infrastructure as well as an ethical and legal
framework that balances FAIR omics data usage and data protection needs for
Germany. As a Germany-wide consortium funded by the German Research Foundation
under the umbrella of the NFDI association, GHGA combines the expertise of 21
universities and research institutions to form a federated national
infrastructure.
Read more about GHGA signing.
In Norway, a key component of the infrastructure is the services of The
Sensitive Data Service (TSD) offered by USIT at University of Oslo. The
Federated EGA Norway Node
is developed by ELIXIR Norway and operated by the University of Oslo as the
responsible legal entity. Core software modules are developed jointly with the
other Nordic nodes in the NeIC Tryggve and Heilsa projects.
Read more about FEGA Norway signing.
The
Spanish FEGA (es-FEGA)
is a national service for storing sensitive biomedical data in Spain.
Supported by the Spanish Institute of Bioinformatics (INB) in collaboration
with Central EGA, sensitive research datasets are primarily hosted at the
Barcelona Supercomputing Centre facilities.
The
Swedish Sensitive Data Archive
is a secure data archive and sharing platform for sensitive datasets. It was
developed by the National Bioinformatics Infrastructure Sweden (NBIS) in
collaboration with other Nordic ELIXIR Nodes through the Tryggve and Heilsa
projects funded by NeIC and coordinated with Central EGA through ELIXIR. Read
more about the Swedish Node signing.
By providing a solution for secure and efficient management of human omics
data, the Federated EGA aims to foster data reuse, enable reproducibility,
accelerate biomedical research, and improve human health.
Find out more
Interested in setting up your own Federated EGA Node? Check out the
FEGA Onboarding Knowledge Base
for more information.
The
ELIXIR Federated Human Data Community
is a great entry point for anyone interested in learning more about the
Federated EGA. You can:
Join the
ELIXIR Federated Human Data Community mailing list
(select “Human Data”)
Attend the ELIXIR Federated Human Data Community calls
Blog
safe-access-to-sensitive-human-data-federated-ega
How are we funded?
The EGA receives infrastructure, administrative support, advisory input, and other necessary resources from:
CRG
The EGA at the CRG is funded and supported by “La Caixa Foundation”, The Generalitat de Catalunya (Catalan Government), the Spanish Ministerio de Asuntos Económicos y Transformación Digital (Spanish Government) and the Instituto de Salud Carlos III.
The Generalitat de Catalunya contributes with funding to the CRG and to the Barcelona Supercomputing Centre (BSC).
The Barcelona Supercomputing Center is decisively contributing to EGA by providing the required compute, storage and networking resources, as well as key personnel for shared operations of EGA infrastructure.
EMBL-EBI
As part of the European Molecular Biology Laboratory (EMBL), the EGA at EMBL’s European Bioinformatics Institute receives funding from the governments of EMBL’s member states. The UK government, via UK Research and Innovation, also continues to support EMBL-EBI technical infrastructure, with EMBL supporting operational costs.
A major source of funding for the EGA, both at CRG and EMBL-EBI, is competitive projects. The EGA resource continues to be funded through strategic engagement in collaborative partnerships and projects. Please see our projects page for more information. Other funders include the European Commission, the US National Institutes of Health, the Wellcome Trust, UK Research Councils, and our Industry Programme partners.
Last updated in October 2024.
Documentation
about/projects-and-funders/funders
European Genome-phenome Archive 15th Anniversary Celebration
2023 marks the 15th Anniversary of the EGA, jointly managed by the European Bioinformatics Institute (EMBL-EBI) and the Centre for Genomic Regulation (CRG). To mark the occasion, a simultaneous celebration was held in both institutions on the 13th of December 2023. The teams gathered online to play a quiz game and celebrate all the achievements and milestones with two wonderful anniversary cakes.
In 2008, the European Genome-phenome Archive was created at EMBL-EBI in Cambridge to ensure that human genome and phenome data were available to the international scientific community while preserving data privacy.
The six-person staff of the project's early days has grown into a team of 35 members. Since 2013, the European Bioinformatics Institute and the Centre for Genomic Regulation have shared responsibility for the European Genome-phenome Archive (EGA). At that time, the EGA held about 0.5 petabytes of data. Currently, the Archive contains more than 12 petabytes.
2023 comes to an end with several pieces of good news. The EGA has been renewed as an ELIXIR Core Data Resource. This was announced during the GA4GH 11th Plenary held in San Francisco last September, when ELIXIR-Beacon was also confirmed as a GA4GH Driver Project, a title the Beacon API has held since 2018. What’s more, the first Federated EGA dataset is now live on our website.
In September this year we also launched new services for EGA users. In numbers, 2023 brought 2.5 PB archived, 371 studies published, 208 new submitters and 19 active projects in which the team participates.
We look forward to continuing to support your research in 2024!
Blog
15-anniversary
PREDO_EGA_methylation_data_and_gestation_ages
This data set contains two data files. The first data file (file name: PREDO_GA_EGA_methylation_data.csv) includes methylation data from 485512 sites across the human genome from 96 individuals, acquired with the Illumina 450K chip. The other data file (file name: PREDO_GA_EGA_phenotypes.csv) contains the gestational ages and the genders of the 96 samples.
Dataset
EGAD00010001003
Sequencing_probands_and_families_with_severe_insulin_resistance_syndromes
This is an ongoing project and a continuation of all the sequencing we have been doing over the last few years. We have some additional families and probands with syndromes of insulin resistance not previously sequenced within uk10k or other core funded projects. We would like to complete the sequencing in all of the good quality families and probands we have. This would require another ~50 samples to be WES sequenced. This cohort has already proven to be a rich source of interesting findings, with papers in Science and Nature Genetics.
Study
EGAS00001000488
Archive growth Statistics
EGA Statistics
In this section, we present the overall volume of data available to download and the different archived file types.
EGA Archive growth in size and number of files
The figure below shows the EGA archive growth in size (TB) and number of files.
Figure: EGA Archive Growth per year (interactive chart, volume in TB; data from https://stats.ega-archive.org/growth/archive).
Archive files
The figure below shows, at the first level, the percentage of archived files by data technology. The second level, accessible by clicking on the extension of interest, displays the number of files archived in the EGA by extension. Click on an extension to learn more about the files.
Figure: EGA Archived Files (interactive chart; data from https://stats.ega-archive.org/extensions).
Documentation
about/statistics/archive
Cognitively Affected DMD Patients have Unique Methylation Signatures Compared to Cognitively Normal DMD Patients
The overarching goal of this project was to identify changes in methylation patterns in Duchenne muscular dystrophy (DMD) patients with discordant symptoms using whole genome bisulfite sequencing (WGBS) from genomic DNA isolated from whole serum. DMD siblings (biological brothers) had the same genetic mutation in the dystrophin gene but with discordance in symptoms such as ambulation, cardiopulmonary function, and cognition. Trios (DMD sibling brothers and biological mother or father) and quartets (DMD sibling brothers and biological mother and father) were recruited for this study for intra- and inter-familial comparisons of the methylation of gene bodies.
Study
phs003118
The Atherosclerosis Risk in Communities (ARIC) Study
The Atherosclerosis Risk in Communities (ARIC) Study, sponsored by the National Heart, Lung and Blood Institute (NHLBI), is a prospective epidemiologic study conducted in four U.S. communities. The four communities are Forsyth County, NC; Jackson, MS; the northwest suburbs of Minneapolis, MN; and Washington County, MD. ARIC is designed to investigate the etiology and natural history of atherosclerosis, the etiology of clinical atherosclerotic diseases, and variation in cardiovascular risk factors, medical care and disease by race, gender, location, and date. ARIC includes two parts: the Cohort Component and the Community Surveillance Component. The Cohort Component began in 1987, and each ARIC field center randomly selected and recruited a cohort sample of approximately 4,000 individuals aged 45-64 from a defined population in their community. A total of 15,792 participants received an extensive examination, including medical, social, and demographic data. These participants were reexamined every three years with the first screen (baseline) occurring in 1987-89, the second in 1990-92, the third in 1993-95, and the fourth and last exam was in 1996-98. Follow-up occurs yearly by telephone to maintain contact with participants and to assess health status of the cohort. In the Community Surveillance Component, currently ongoing, these four communities are investigated to determine the community-wide occurrence of hospitalized myocardial infarction and coronary heart disease deaths in men and women aged 35-84 years. Hospitalized stroke is investigated in cohort participants only. Starting in 2006, the study conducts community surveillance of inpatient (ages 55 years and older) and outpatient heart failure (ages 65 years and older) for heart failure events beginning in 2005. ARIC is currently funded through January 31, 2012. 
This study is part of the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI). The overarching goal is to identify novel genetic factors that contribute to atherosclerosis and cardiovascular disease through large-scale genome-wide association studies of well-characterized cohorts of adults in four defined populations. Genotyping was performed at the Broad Institute of MIT and Harvard, a GENEVA genotyping center. Data cleaning and harmonization were done at the GEI-funded GENEVA Coordinating Center at the University of Washington.
Study
phs000090
Sequencing probands and families with severe insulin resistance syndromes
This is an ongoing project and a continuation of all the sequencing we have been doing over the last few years. We have some additional families and probands with syndromes of insulin resistance not previously sequenced within uk10k or other core funded projects. We would like to complete the sequencing in all of the good quality families and probands we have. This would require another ~50 samples to be WES sequenced. This cohort has already proven to be a rich source of interesting findings, with papers in Science and Nature Genetics.
Dataset
EGAD00001000694
Ovarian cancer sample size analysis
The dataset contains exome sequencing fastq from 5 ovarian cancer patients, paired with tumor normal blood samples. Three tumor samples were sequenced from each patient: a biopsy sample ("-1" suffix in the file name), a local sample (multiple regions around the biopsy pooled together, with the "-2" suffix in the file name), and a global sample (multiple regions from the tumor pooled together, with a "-3" suffix in the file name).
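The suffix convention above lends itself to a small lookup. This is a sketch only: the mapping comes from the description, but the exact file-name layout is an assumption (a "-1", "-2" or "-3" followed by "_", "." or the end of the name), and `tumor_sample_type` is a hypothetical helper.

```python
import re
from typing import Optional

# Suffix meanings as stated in the dataset description.
SAMPLE_TYPES = {"1": "biopsy", "2": "local", "3": "global"}

# Assumed layout: the suffix is followed by "_", "." or the end of the name.
_SUFFIX = re.compile(r"-([123])(?=[_.]|$)")


def tumor_sample_type(file_name: str) -> Optional[str]:
    """Return the tumor sample type encoded in the file name, if any."""
    match = _SUFFIX.search(file_name)
    return SAMPLE_TYPES[match.group(1)] if match else None
```

Files without one of the three suffixes, such as the paired normal blood samples, return `None`.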
Dataset
EGAD00001005947
The EGA at the International Congress of Human Genetics
Today is the start of the International Congress of Human Genetics (ICHG).
This event, hosted by the African Society of Human Genetics (AfSHG) and the
Southern African Society for Human Genetics (SASHG), will bring together
international experts to highlight how genomic technologies are being used
to address current challenges in human health and
genomics.
The
European Genome-phenome Archive
(EGA) will be at the Congress participating in workshops, meetings and talks
where topics such as data access and discovery as well as the
Federated EGA
will be discussed.
Data access and discovery
The “Federated data analysis workshop”, organised within the
CINECA project, will be held on the 26th. It includes a
hands-on session on
Beacon v2
Discovery led by Mauricio Moldes. Another session on authorisation and
data access will be presented by Mallory Freeberg and Coline Thomas, with an
introduction by Thomas Keane. During the workshop, an end-to-end federated
data analysis use case will be presented.
Over the last four years, CINECA has been identifying gaps in data sharing
between Africa, Canada and Europe in order to build a federated solution for
data access. Thomas Keane will present the project in the sessions
taking place on the 24th.
The Federated EGA at the ICHG
A
Federated EGA poster
will be shown during Poster Session 1, which takes place on the 23rd. The
objective is to raise awareness of the
European Genome-phenome Archive
as a resource for permanent archive and secure data sharing. In this context,
the network generated by the Federated EGA is key for cases where research
data cannot be shared outside local jurisdictions.
The Federated EGA was officially launched in 2022, with research institutes
from five countries becoming the first nodes of this network. Today, with
more than 20 nodes preparing to join, it aims to become the largest human
omics data sharing initiative for understanding human health and disease.
The H3Africa Initiative and the African research represented in the EGA
1.1% of EGA studies include samples from African populations. Countries
such as South Africa, Kenya, Uganda and Egypt, among others, are represented in the
72,577 samples from African individuals archived at the EGA.
23 EGA datasets are managed by the
Human Heredity and Health in Africa Initiative
(H3Africa). This Initiative will hold its
Consortium Meeting
on the 27th and 28th, back-to-back with the ICHG. H3ABioNet, the
Pan African Bioinformatics Network for the H3Africa consortium, will also take this opportunity to hold its
10-Year Symposium
during the Congress. Members of the EGA team will attend both, as
H3ABioNet works with H3Africa projects to submit data to the
European Genome-phenome Archive.
Blog
the-ega-at-the-international-congress-of-human-genetics
EGA file encryption types
EGA file encryption
The European Genome-Phenome Archive (EGA) stands as a significant initiative in the field of genomics, facilitating the secure storage and sharing of vast amounts of genomic data. Jointly hosted by the Centre for Genomic Regulation (CRG) and the European Bioinformatics Institute (EMBL-EBI), the EGA offers a robust infrastructure for managing encrypted data, ensuring the privacy and integrity of sensitive genomic information.
Both CRG and EMBL-EBI play pivotal roles in the EGA, and each institution uses a different encryption system. EMBL-EBI uses the EGACryptor encryption system, while CRG uses Crypt4GH, resulting in two distinct encrypted file types: cip and c4gh, respectively.
How can I check the file encryption?
To access your dataset of interest on the European Genome-Phenome Archive (EGA) and determine the corresponding encryption file extension, please follow these step-by-step instructions:
Utilise the search or navigation features provided on the EGA website to locate the dataset you are interested in.
Once you find the dataset webpage, click on the "Files" tab, typically located on the dataset's webpage.
Upon reaching the "Files" tab, you will be presented with a table containing comprehensive information about the files associated with the dataset. This table will include various columns, such as "File Name," "Size," and "Location." The "Location" column is particularly relevant for determining the encryption file extension.
Figure 1: File table showing all the information available for the files in a dataset. This table is specific for each dataset.
To identify the encryption file extension corresponding to a specific file, locate the "Location" column in the table. If the file is archived in Spain, the file will have the .c4gh extension, indicating that it is encrypted using Crypt4GH. Conversely, if the file is archived in the UK, the encrypted file extension will be .cip, indicating that EGACryptor was used for encryption.
It is worth noting that a file can be available in both locations, meaning it will be accessible with both file extensions (.c4gh and .cip). This allows users to choose the encryption method that aligns with their preferred platform or analysis tools.
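Once files have been downloaded, the same distinction can be made from the file name alone. The sketch below is illustrative: the extension-to-tool mapping comes from the text above, while `encryption_tool` is a hypothetical helper and not part of any EGA client.

```python
from pathlib import Path
from typing import Optional

# Extension-to-tool mapping as described in this page.
ENCRYPTION_BY_EXTENSION = {
    ".c4gh": "Crypt4GH",   # files archived in Spain (CRG)
    ".cip": "EGACryptor",  # files archived in the UK (EMBL-EBI)
}


def encryption_tool(file_name: str) -> Optional[str]:
    """Return the encryption tool implied by the final file extension."""
    return ENCRYPTION_BY_EXTENSION.get(Path(file_name).suffix.lower())
```

A name with neither extension returns `None`, which a caller could treat as "not an EGA-encrypted file".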
Documentation
check-encryption-type
How to use EGA Webin?
EGA Webin
EGA Webin serves as a platform for registering metadata for array-based
submissions, large-scale sequence submissions, and legacy EGA
submission accounts (ega-box-XXXX). Large-scale submitters of sequence
data also have the option to submit metadata via
programmatic XML submission, while new submitters are advised to use the
Submitter Portal for their
submissions.
You can request a legacy EGA submission account (ega-box-XXXX) by populating this form. Please, allow two business days for our Helpdesk team to contact you after populating this form.
WEBIN actions:
Register metadata for a sequence submission
Register study, samples, experiments, runs, DAC, policy and dataset/s
after file upload.
Register components for your array-based metadata submission
Register study, samples, DAC or policy before uploading files.
Edit existing submission metadata
Change or update previously submitted metadata.
Register metadata for a sequence submission
Ensure that all sequence files have been
encrypted using the
EGACryptor before
uploading
them to your submission account.
Go to the
EGA Webin
and log in using your submission account name (ega-box-XXXX) and password.
Register components of your metadata submission
Study
Samples
Data Access Committee (DAC)
Data access policy
Dataset
For
array-based submissions: Study, Samples, Data Access Committee (DAC) and Data access policy may
all be registered BEFORE file upload and dataset registration through the
array-based template.
Register your Study
Go to the “Studies (Projects)” box
Click on “Register Study” and fill in the information related to your study.
Click on “Submit”. This will save the information and generate an EGA study
ID.
To use the study accession number in a publication, the study must first be
released on the EGA website. We suggest the following format:
"Sequence data has been deposited at the European Genome-phenome Archive
(EGA), which is hosted by the EBI and the CRG, under accession number
EGASXXXXXXXXXXX. Further information about EGA can be found on
https://ega-archive.org and “The European Genome-phenome Archive in 2021” (10.1093/nar/gkab1059)."
Register your Samples
Go to the “Samples” box
Click on “Register Samples”
Select “Download spreadsheet to register samples” and customise your
template, there is a default EGA template (EGA default checklist) but
more attributes can be added if required.
For the EGA default checklist, there are mandatory, recommended and optional
attributes. Custom fields can also be added if required.
Mandatory attributes
Field Name: Description
tax_id: Taxonomy ID of the organism as in the NCBI Taxonomy database. Entries in the NCBI Taxonomy database have integer taxon IDs. See our tips for sample taxonomy here
scientific_name: Scientific name of the organism as in the NCBI Taxonomy database. Scientific names typically follow the binomial nomenclature. For example, the scientific name for humans is Homo sapiens.
sample_alias: Unique name of the sample. If not provided, the system will auto-generate a unique alias
sample_title: Title of the sample
sample_description: Description of the sample
phenotype ***: Where possible, please use the Experimental Factor Ontology (EFO) to describe your phenotypes.
Recommended attributes
Field Name: Description
subject_id: Identifier for the subject from which the sample has been derived
gender *: Sex
Optional attributes
Field Name: Description
sex: Sex of the organism from which the sample was obtained
disease_site: Affected organ
sample type: Affected organ
donor_id **: Identifier of the donor from which the sample has been derived
*Gender should be described as 'male', 'female' or 'unknown'. If 'unknown' due
to a known sex chromosome aneuploidy, please create a user defined attribute
called 'Sex chromosome karyotype' and add the appropriate value, for example,
'XXY'.
**Donor id (Subject id) should be a de-identified subject handle. If unknown,
please add 'unknown' to the field.
***Phenotypes should, where possible, be an
Experimental Factor Ontology
accession. If a term cannot be found to describe your phenotype, please use
free text. All sample phenotypes considered important for further analysis of
the data should be provided (for example, tumour type). Additional phenotype
attributes can be created by defining your own attributes; use the notation
'phenotype2', 'phenotype3', etc.
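Before filling in the spreadsheet, the rules above can be checked locally. The following is a minimal sketch, not the validation Webin itself performs: `validate_sample` is a hypothetical helper, rows are assumed to be plain dictionaries keyed by field name, and sample_alias is deliberately not required because the system can generate it.

```python
from typing import Dict, List, Optional

# Mandatory fields from the EGA default checklist, minus sample_alias,
# which the system auto-generates when it is left out.
MANDATORY_FIELDS = (
    "tax_id",
    "scientific_name",
    "sample_title",
    "sample_description",
    "phenotype",
)
ALLOWED_GENDERS = {"male", "female", "unknown"}


def validate_sample(row: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of human-readable problems; empty means the row passed."""
    problems: List[str] = []
    for name in MANDATORY_FIELDS:
        value = row.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing mandatory field: {name}")
    # NCBI taxon IDs are integers.
    tax_id = str(row.get("tax_id") or "").strip()
    if tax_id and not tax_id.isdigit():
        problems.append(f"tax_id is not an integer: {tax_id}")
    # Gender is recommended, so only check it when present.
    gender = row.get("gender")
    if gender is not None and gender not in ALLOWED_GENDERS:
        problems.append(f"gender must be male, female or unknown: {gender}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter fix a whole spreadsheet in one pass.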
After you have customised the fields for the sample submission, download the
template and fill in the information.
Example of the sample template:
Finally upload the sample template to get the EGA accession IDs for the
samples.
Register your Data Access Committee (DAC)
Further information on the role of your DAC
Go to the “Data Access” box
Click on “Register Dacs”
Input the information about the DAC and register at least one main DAC
contact.
Register your Data Access Policy
Your Data Access Policy provides the terms and conditions of data use. It is
also referred to as the Data Access Agreement (DAA).
Completion of a DAA by the applicant/s should form part of the application
process to the
Data Access Committee (DAC).
Go to the “Data Access” box
Click on “Register Policies”
Select the DAC to which this policy will be linked and fill in the policy
information.
Submitting your Runs and Analyses
This section applies only to sequence data submissions and can be skipped for
array-based submissions. For those, please refer to our
Submitting array based metadata guide.
Runs Registration
Go to the “Raw Reads (Experiments and Runs)” box
Click on “Submit Reads”
Select “Download spreadsheet template for Read submission”
Select the template corresponding to your submission type
The templates let you customise the optional fields. To
check their description, click on “Show Description”.
Download the template and fill in the required information.
Example of the runs template:
We recommend that Fastq, BAM, and CRAM read files are submitted using
Webin-CLI
When using this interface instead of Webin-CLI, raw sequences must be
uploaded in one of the supported data formats before they can be submitted. The files
can be uploaded using FTP or Aspera.
The study and the sequenced samples must be pre-registered before the raw
reads are submitted. Please note that each individual study and sample should
be registered only once. You will be asked to provide information about the
sequencing libraries and instruments.
Submitting your Dataset
This section applies only to sequence data submissions and can be skipped for
array-based submissions. For those, please refer to our
Submitting array based metadata guide.
The dataset describes the data files, defined by the run (EGARXXXXXXXXXXX) and
analysis (EGAZXXXXXXXXXXX) accessions that make up the dataset and links the
collection of data files to a specified Data Access Committee and Data Access
Policy.
As a result, you must have registered your
Reads and experiments,
Data Access Committee (DAC) and
Data access policy before submitting your Dataset.
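The relationship described above can be pictured as a small data structure. This is only a sketch under stated assumptions: `DatasetDraft` is a hypothetical class, the DAC and policy are treated as opaque identifiers, and the pattern of a prefix plus 11 digits is inferred from the length of the placeholders shown above.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: EGAR (run) or EGAZ (analysis) followed by 11 digits.
_FILE_ACCESSION = re.compile(r"EGA[RZ]\d{11}")


@dataclass
class DatasetDraft:
    """Links run/analysis accessions to one DAC and one policy."""
    title: str
    dac: str
    policy: str
    accessions: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return what is still missing before the dataset can be registered."""
        found: List[str] = []
        if not self.dac:
            found.append("no Data Access Committee selected")
        if not self.policy:
            found.append("no Data access policy selected")
        if not self.accessions:
            found.append("no run or analysis accessions listed")
        for accession in self.accessions:
            if not _FILE_ACCESSION.fullmatch(accession):
                found.append(f"not a run or analysis accession: {accession}")
        return found
```

An empty list from `problems()` means the prerequisites listed above are in place; it says nothing about whether the accessions actually exist in your account.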
Please consider the number of datasets that your submission consists of, for
example, a case control study is likely to consist of at least two datasets.
In addition, we suggest that multiple datasets should be described for studies
using the same samples but different sequence technologies. Please contact
EGA Helpdesk for further assistance.
Go to the “Data Access” box
Click on “Register Dataset”
Select the Data Access Committee (DAC) and
Data access policy
Register your dataset
After submitting your dataset you should contact the
EGA Helpdesk
to provide a release date for your dataset.
Datasets are automatically held (i.e. not released) unless they are affiliated
to a study that has already been released.
Edit/update existing submission metadata
Go to the “Report” section of the object you would like to edit.
Locate the object and click on the arrow under action. An option menu will be
displayed. Objects can be edited through their XML or with the WEBIN menu.
After an object has been edited, changes will not be visible on the website
until the submission is released again. Please contact the
EGA Helpdesk if you require further assistance.
Documentation
submission/metadata/submission/EGA_webin
CADD/GADD centers on Antisocial Drug Dependence
CADD (Center for Antisocial Drug Dependence): Funded through NIDA 011015 to study genetic influences on, and treatment of, antisocial drug dependence, studying both clinical probands and their families, and community samples of matched controls, twins, and participants in an ongoing longitudinal adoption study. A collaboration between three organizations at two campuses of the University of Colorado. Longitudinal with three waves of data collection completed. GADD (Genetics of Adolescent Antisocial Drug Dependence): Funded originally through NIDA 012845, a multisite collaboration including adolescent subjects at high risk for antisocial drug dependence and their siblings, recruited in Denver, CO and San Diego, CA. Longitudinal with two waves of data collection completed, one in progress as of May 2018.
Study
phs001841
Single cell RNAseq and TCRseq data from tumor and blood samples from 4 patients with muscle invasive bladder cancer
For this dataset we performed single cell RNAseq paired with single cell TCR-seq on tumor and blood samples from 4 patients. This dataset contains 4 tumor samples as well as 4 blood samples. Each sample is made up of 2 sets of paired fastq files. The first pair contains reads corresponding to RNA transcripts (_Transcripts in file name), while the second pair contains reads corresponding to TCRs (_VDJ in file name). Sequenced on the Illumina NovaSeq6000 platform in a paired-end run using an SP flow cell (v1.5, 300 cycles).
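The two file-name markers make it straightforward to separate the RNA and TCR libraries after download. A minimal sketch, assuming the markers appear verbatim in each file name; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# File-name markers as given in the dataset description.
LIBRARY_MARKERS = {"_Transcripts": "RNA", "_VDJ": "TCR"}


def library_type(file_name: str) -> Optional[str]:
    """Return "RNA" or "TCR" according to the marker in the file name."""
    for marker, library in LIBRARY_MARKERS.items():
        if marker in file_name:
            return library
    return None


def group_by_library(file_names: List[str]) -> Dict[str, List[str]]:
    """Group fastq file names by library type, skipping unmarked files."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        library = library_type(name)
        if library is not None:
            groups[library].append(name)
    return dict(groups)
```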
Dataset
EGAD50000001381
From research to data sharing: exploring EGA user's experiences
The Cell Plasticity and Regeneration Group at the Bellvitge Biomedical Research Institute-IDIBELL focuses on the process of recruitment of macrophages that takes place in the small intestine during injury and healing. They recently published a paper titled “Mucosal Macrophages Govern Intestinal Regeneration in Response to Injury” in the journal Gastroenterology.
As part of the research, some experiments were conducted using human intestinal organoid lines. These cells were processed for RNA sequencing, and the sequencing data were deposited at the EGA to be made available to the scientific community.
When dealing with human genomic information, repositories must ensure the availability of the datasets while ensuring data protection. In this context, the European Genome-phenome Archive stands as a service for secure archiving and sharing of genetic, phenotypic and clinical data resulting from biomedical research.
Following the recent publication of their paper, we took the opportunity to talk to Ilias Moraitis, first author, and Jordi Guiu, group leader, to find out about their experience with data sharing.
Could you explain the focus of your research?
In the lab, we study intestinal regeneration and how immune cells participate in this process. We use several techniques: engineered mouse models, image tracing, and mouse and human intestinal organoids.
What challenges do you face regarding data management?
We didn’t have a lot of problems. Always the informatic part, the data analysis, can give problems. But everything was smooth and working.
We think it’s very important to deposit the data in repositories. For the mouse data it is very straightforward, but for us it was the first time depositing human data, which is sensitive because it comes from patients, and this is legally regulated. That’s why we thought about the EGA, and it was our first time.
Why do you think it is important to submit data to repositories such as EGA?
Because we think data is important for science. Nowadays, we are sequencing a lot everywhere worldwide, and having these resources shared with the scientific community is not only good for science, but it also saves money and reduces costs. And there is so much data in there that can be used for other projects and for other questions. And it saves time!
How was the process of submitting the data to the EGA?
The communication worked very well but it was slower than we would like. But I think this is something we learn on the way. People usually wait until the last minute to do this before publication but if you know it in advance, you can start the process earlier.
If you had to repeat the process, would you do anything differently?
It’s like everything, isn’t it? For the first time you don’t know, you aren’t sure, but for the second time you know the steps, you know what to do and it’s faster.
Of course, we would do it earlier. Also, there were some things related to the controlled access that we didn’t know how to manage, but it was something more internal to us. Who is going to receive the communications and when, is our ethical committee going to evaluate this? All these technicalities that we didn’t know before starting this process, so we had to think about all these things. Next time it will be faster on our side.
Do you have any suggestions for us to improve the submission process?
Maybe we’ll need to ask our bioinformatician who did it! He was in charge of this part. Besides that, we think if the process were faster, it would be better for everyone. Also, in science a lot of times you have to do a lot of things at the very last minute, because of the nature of the experiments, or the need to accomplish for submitting a paper.
Being faster in the process would be a plus.
On your side, we went through the process and were able to find some information, but there were aspects we weren't aware of and weren't sure how to handle. I'm not sure if other institutes have different protocols for managing data requests or how they handle these internally. We believe the main issue lies in how the forms are filled out.
Do you have any recommendations for other submitters?
The main advice is to do this with enough time. Submit the data when you have the sequences, and don’t wait until the last minute. And then it can be under embargo, so you don’t need to make it public at that moment. As soon as you sequence, upload it and then it will be there for whenever you need to publish.
How do you think we could encourage other researchers to submit their data?
There are different layers here. One is that it is mandatory. You have to do this. The other is that we belong to a research community, to the same community and sharing this is making our research community stronger and more efficient. And then, also because this gives you visibility. The data is there, other people can analyse your data, and this also will bring citations to your papers. It has many advantages.
Blog
from-research-to-data-sharing
Genetic Epidemiology Network of Arteriopathy (GENOA)
The Genetic Epidemiology Network of Arteriopathy (GENOA): GENOA is one of four research networks that form the NHLBI Family Blood Pressure Program (FBPP). From its inception in 1995, GENOA's long-term objective was to elucidate the genetics of hypertension and its arteriosclerotic target-organ damage, including both atherosclerotic (macrovascular) and arteriolosclerotic (microvascular) complications involving the heart, brain, kidneys, and peripheral arteries. Two GENOA cohorts were originally ascertained (1995-2000) through sibships in which at least 2 siblings had essential hypertension diagnosed prior to age 60 years. All siblings in the sibship were invited to participate, both normotensive and hypertensive. These include non-Hispanic White Americans from Rochester, MN (n =1583 at the 1st exam) and African Americans from Jackson, MS (N=1854 at the 1st exam). During the second exam (2000-2005), approximately 80% of participants were re-recruited. The GENOA data consists of biological samples (DNA, serum, urine) as well as demographic, anthropometric, environmental, clinical, biochemical, physiological, and genetic data for understanding the genetic predictors of diseases of the heart, brain, kidney, and peripheral arteries. Family Blood Pressure Program (FBPP): GENOA's parent program, the FBPP, is an unprecedented collaboration to identify genes influencing blood pressure (BP) levels, hypertension, and its target-organ damage. This program has conducted over 21,000 physical examinations, assembled a shared database of several hundred BP and hypertension-related phenotypic measurements, completed genome-wide linkage analyses for BP, hypertension, and hypertension associated risk factors and complications, and published over 130 manuscripts on program findings. 
The FBPP emerged from what was initially funded as four independent networks of investigators (HyperGEN, GenNet, SAPPHIRe and GENOA) competing to identify genetic determinants of hypertension in multiple ethnic groups. Realizing the greater likelihood of success through collaboration, the investigators began working together during the first funding cycle (1995-2000) and formalized this arrangement in the second cycle (2000-2005), creating a single confederation with program-wide and network-specific goals.
Study
phs000379
FBSeq: RNA sequencing of human fetal brain.
RNA sequencing of 120 human fetal brains (aged 12-19 post-conception weeks) was performed as part of a Medical Research Council (U.K.) funded project (MR/L010674/2) to investigate genetic effects on gene expression in the developing human brain and their role in neuropsychiatric disorders. Human fetal brain tissue from elective terminations of pregnancy was provided by the Human Developmental Biology Resource (HDBR) (http://www.hdbr.org), with ethical approval (#08/H0712/34), and with consent for use of fetal material for medical research provided by female donors. Total RNA was extracted from half of the available brain tissue from each fetus using Tri-Reagent and treated with TURBO DNase. RNA-Seq libraries were prepared using 1µg of purified total RNA and the Illumina TruSeq Stranded Total RNA Library Prep kit, following ribosomal RNA depletion. Libraries were sequenced on the Illumina HiSeq 2500 or HiSeq 4000 systems, generating at least 50 million read pairs per sample.
Study
EGAS00001003214
Exome_sequencing_of_short_SGA_children_with_IGF_I_and_insulin_resistance
Exome sequencing of short SGA children with IGF-I and insulin resistance. Collaboration with Professor David Dunger, University of Cambridge. Funded by NIHR.
Study
EGAS00001001086
Type 1 Diabetes Genetics Consortium (T1DGC): Genome-Wide Association Study in Type 1 Diabetes, 2008
Cases with Type 1 Diabetes (T1D) in the UK, were part of the Wellcome Trust Case Control Consortium (WTCCC) - http://www.wtccc.org.uk - that first reported in 2007: Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661-678. [PubMed: 17554300] In this genome-wide association study (GWAS), funded by the NIH and JDRF, and sponsored by the Type 1 Diabetes Genetics Consortium (T1DGC), we were able to extend the case and control groups used in the WTCCC, with the intention of performing a well-powered meta-analysis. The study is written up as: Barrett, J.C., Clayton, D.G., Concannon, P., Akolkar, B., Cooper, J.D., Erlich, H.A., Julier, C., Morahan, G., Nerup, J., Nierras, C., Plagnol, V., Pociot, F., Schuilenburg, H., Smyth, D.J., Stevens, H., Todd, J.A., Walker, N.M., Rich, S.S. and The Type 1 Diabetes Genetics Consortium Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nature Genetics, PMID: 19430480. Resources - data: * Case data from this study is deposited here (i.e. in dbGaP) * Control data from this experiment - with subjects from the 1958 British Birth Cohort - is deposited with the European Genotype Archive (EGA): http://www.ebi.ac.uk/ega/ from where the WTCCC data is also available. 
* A complete description of how to request all components of the meta-analysis is available at: http://www.t1dbase.org/page/PosterView/display/poster_id/324 * Additional genetic data on the same case subjects, including some HLA types, are available from the Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory (JDRF/WT-DIL): http://www-gene.cimr.cam.ac.uk/todd/ Resources - samples: Case and control DNA samples are also available: * Case DNA samples are available from the JDRF/WT-DIL, as above, and will be available from the NIDDK Genetics Repository at Rutgers: https://www.niddkrepository.org * Control DNA samples are available from the 1958 British Birth Cohort (aka National Child Development Study): http://www.cls.ioe.ac.uk/studies.asp?section=000100020003 Use restrictions: There are access limitations to both data and samples, in order to keep their use in line with subjects' consent.
Study
phs000180
Comprehensive Transcriptional Analysis of Early Stage Urothelial Carcinoma using whole transcriptome sequencing
The expression profile and sequence variants of 476 early stage urothelial carcinomas were studied using whole transcriptome sequencing. RNA-Seq libraries were prepared by Ribo-Zero treatment of total RNA followed by library preparation using ScriptSeq (both Epicentre/Illumina). RNA-Seq libraries were paired-end sequenced (2x 101 bp) on Illumina HiSeq 2000 and the resulting fastq files were processed using tools from the Genome Analysis Toolkit (GATK) and from the Tuxedo suite.
Access to the sequence data (mapped and un-mapped bam and vcf files), which contain person-identifying information, requires a signed controlled access form; the data can be accessed at The European Genome-phenome Archive (EGA) on request. An expression matrix of FPKM values is available without restriction at ArrayExpress (E-MTAB-4321).
Study
EGAS00001001236
Organ_maturation_in_preparation_for_birth__Peds_RFA__to_develop_a_tissue__resource_and_a_single_cell_atlas_of_organ_development_and_maturation_for__dissemination_among_the_scientific_and_clinical_community__RNA
Knowledge about abnormal organ development is important to understand pathology and to develop novel treatment approaches for individuals with congenital and acquired disease. Most of our current understanding is based on examination of tissues from the embryo and early foetus, collected from women undergoing termination of pregnancy in the first trimester (third) of pregnancy. There is very little known about normal and abnormal organ development from a developmental perspective during the crucial last two-thirds of pregnancy when much remodelling of foetal tissues occurs. This study will generate a single-cell atlas of late-foetal lungs, blood, heart, bone and immune organs.
Study
EGAS00001008256
FEGA Sweden Helpdesk
FEGA Sweden is an archive for storing and sharing all kinds of data resulting from biomedical research projects. As a national node of the Federated European Genome-phenome Archive (FEGA), we enable researchers to store their data in Sweden in a way that meets the requirements of the General Data Protection Regulation (GDPR). Any data submitted to the archive is subject to controlled access, meaning access to the data will only be granted after a formal application procedure. Using FEGA Sweden for storing and sharing data is free of charge for academic users affiliated with a Swedish university. However, the university must have a signed Data Processing Agreement (DPA) with Uppsala University. The current list of universities with a signed DPA can be found here: https://fega.nbis.se/submission/complying-with-gdpr.html.
FEGA Sweden is hosted by the National Bioinformatics Infrastructure Sweden (NBIS), which legally is part of Uppsala University. NBIS forms the bioinformatics platform at SciLifeLab and constitutes the Swedish node of the European organisation ELIXIR.
Contact the FEGA Sweden Helpdesk at fega-sweden@nbis.se for further questions.
Dac
EGAC50000000077
SNP array data for the Milieu Intérieur cohort
The Milieu Intérieur Consortium aims at identifying the environmental and genetic factors driving variation in innate and adaptive immune systems among healthy individuals. In the last four years, we combined standardized flow cytometric analysis of blood leukocytes, transcriptional profiles of whole blood after immune stimulation and genome-wide DNA genotyping in 1,000 healthy, unrelated donors of western European ancestry, to show that age, sex, as well as persistent infections and smoking, are the most important drivers of inter-individual differences in blood cell composition and transcriptional response to pathogens. The EGAS00001002460 study includes genotype data for 5,699,237 genotyped and imputed SNPs in the 816 healthy donors of the Milieu Intérieur cohort. Note that 184 of the 1,000 initial donors did not agree to share their data online.
Study
EGAS00001002460
IMPRESS_all
Dataset containing tumor, normal and blood sample data, for various tissue and tumor types. Data is targeted methylation data, using smMIP probes to target informative CpG sites in the genome. Sample type can be inferred from the name, with 'W' being normal samples and 'TC' tumor samples. Dataset contains 259 samples in duplicate, with one half of the runs being cut by MSREs, and the other not. Paired fastq files are present for every sample.
Dataset
EGAD50000000882
A Genome-Wide Association Study of Heroin Dependence
This collaboration of Australian and American investigators aims to identify genes associated with liability for heroin dependence. The project uses a case-control design in which cases met lifetime DSM-IV criteria for heroin dependence. Controls included assessed individuals who did not meet DSM-IV heroin dependence criteria and unassessed general population controls. Cases and controls were obtained from several large investigations, including the Comorbidity and Trauma Study, Heroin Dependence in Western Australia, the OZ-ALC Study, a Twin Study of Mole Development in Adolescence, and ongoing genetic studies of substance dependence conducted by investigators at Yale and collaborating institutions. These projects are briefly described below. The Comorbidity and Trauma Study (PI: Elliot Nelson) is a retrospective case-control study examining genetic and environmental factors contributing to heroin dependence liability. The study was funded by the National Institute on Drug Abuse (NIDA), and was run in collaboration with Washington University, the Queensland Institute of Medical Research (QIMR), and the National Drug and Alcohol Research Centre (NDARC), University of New South Wales. Case participants were recruited from maintenance clinics in the greater Sydney area. Control participants were recruited from employment centres and community centres, open street malls, and local press servicing the same geographical area as the opioid maintenance treatment clinics, and either denied recreational use of opioids or had used these drugs recreationally fewer than 11 times in their lifetime. The prevalence in these individuals of non-opioid licit drug dependence and illicit drug dependence as well as childhood trauma exposure and other psychiatric disorders is elevated considerably versus estimates of similar measures in Australian general population samples.
Participants provided blood samples as a source of DNA and completed a comprehensive psychiatric diagnostic interview based on the Semi-Structured Assessment of the Genetics of Alcoholism - Australia (SSAGA-OZ) augmented with sections drawn from other instruments assessing childhood trauma exposure, family history, and screening for borderline personality disorder. Heroin Dependence in Western Australia (PI: Sybille Schwab) is a study focusing both on genetic contributions to heroin dependence and response to naltrexone treatment of the disorder. Participants completed a clinical assessment and provided blood samples during their treatment at the Perth Naltrexone Clinic, now named the Fresh Start Recovery Programme. Funding for the project was provided by the Australian Government's National Health and Medical Research Council (Grant # 513862; PI: Sybille Schwab). Affected subjects from ongoing genetic studies of substance dependence conducted by investigators at Yale (PI: Joel Gelernter) and collaborating institutions were collected in the course of several NIDA-funded studies. Those included in the current set were assessed by means of the SSADDA (Semi-Structured Assessment for Drug Dependence and Alcoholism). All are opioid dependent European-Americans, and all list heroin as the opioid most used. Most were collected at Yale University School of Medicine or University of Connecticut School of Medicine under the supervision of Drs Joel Gelernter and Henry Kranzler. Control subjects were also collected in the course of several NIDA- and NIAAA-funded studies. Those included in the current set were assessed by means of the SSADDA (Semi-Structured Assessment for Drug Dependence and Alcoholism). Most were collected at Yale University School of Medicine or University of Connecticut School of Medicine under the supervision of Drs Joel Gelernter and Henry Kranzler.
The OZ-ALC Study (PI: Andrew Heath) consists of a large group of twins and their family members ascertained from the general population Australian Twin Registry who have participated in ongoing research projects. For the control investigation, we have selected individuals who do not meet criteria for illicit drug dependence who have had GWAS genotyping with the Illumina Human CNV370-Quad. Inclusion of individuals with alcohol dependence or nicotine dependence was minimized. For a more detailed description of the study, please see: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000181.v1.p1 The Twin Study of Mole Development in Adolescence (PI: Nick Martin) is an ongoing investigation of melanocytic naevi funded by the Australian Government's National Health and Medical Research Council (Grant # 389891; PI: Nick Martin). For the current project, unassessed parents of these twins will serve as a control group. These individuals will either have been previously genotyped with the Illumina Human 610-Quad BeadChip or will be genotyped as part of the current project. Parents have largely survived the period of risk for heroin dependence and, by virtue of their participation in this research, are very likely to have a prevalence of heroin dependence lower than that in the general population (i.e., <0.7%). In order to understand the immune modulatory effect of opioids across peripheral blood immune populations, a second study (PI: Christine S. Cheng) analyzed single cell RNA-seq profiles from the PBMCs of a subset of case participants and control participants from the Comorbidity and Trauma Study (PI: Elliot Nelson). The study assessed and identified transcriptional changes between opioid dependent and control samples across naive and LPS stimulated immune populations.
Study
phs000277
RNA-seq on neuroblastoma PDX model COG-N-519 treated with control miR-1283 and test miR-99b-5p mimics
COG-N-519 primary neuroblastoma cells were inoculated by subcutaneous injection into NSG mice. When tumours reached a size of 100-200 mm3, miRNAs conjugated to nanoparticles were injected intratumourally every 2 days for 1 week. 24 hr after the last injection, snap frozen tumour tissue was harvested for RNA extraction and RNA-seq analysis.
Study
EGAS00001005581
Catalogue Statistics
What is in the EGA?
The European Genome-phenome Archive (EGA) overview
The EGA archives and distributes the results of several types of studies. These include genome-wide association studies (GWAS), sequencing for a range of purposes, and molecular diagnosis assays, among others. Moreover, these studies target many different kinds of diseases, from cancer to neurodegenerative alterations.
We summarised all this data in the charts below. The statistics are computed daily based on an updated list of studies. Each pie chart depicts a different aspect of the studies: which disease type was studied, which type of sampling method was used and what was the scope of the genomic analysis.
Number of studies per disease type
Number of studies per technology type
Number of studies per sample type
Documentation
about/statistics/catalogue
Genetics_and_Networks_of_Congenital_Heart_Defects
Exome sequencing of families with Congenital Heart Defects of diverse sub-phenotypes. Comprises both parent-offspring trios for sporadic cases and multiplex families. Collaboration with David Brook, University of Nottingham. Funded by the British Heart Foundation.
Study
EGAS00001000762
RNA Sequencing Analysis of Patient-Derived Xenograft Tissue PIM-084 Treated with L-NMMA+Alpelisib vs Vehicle Control
This dataset contains RNA-seq analysis of vehicle control compared to L-NMMA+alpelisib treated metaplastic breast cancer PDX PIM084 tumors (RNA sequencing and analysis performed by Novogene). The submitted data contains the gene ID, gene name, normalized read counts from all samples from control group and combination treatment group, log2-fold change derived from DESeq2 R/EdgeR package, p-value, adjusted p-value, read counts from each group, and gene FPKM from each sample in the group. Due to patient privacy, we are submitting only the processed and desensitized RNA sequencing analysis to this repository.
Study
phs003814
Molecular patterns of response and treatment failure after frontline venetoclax combinations in older patients with AML
This study assessed molecular determinants of response in a cohort of patients with AML who were treated with venetoclax in combination with either DNA methyltransferase inhibitors or low dose cytarabine. RNA sequencing was performed on 31 patients from three different response classes [Group A - Durable remission (n=10), Group B - Relapsed (n=10) and Group C - Refractory (n=11)]. Library preparation and sequencing were performed at the Australian Genome Research Facility, using the Truseq Stranded mRNA library kit. Technical and batch replicate samples are included, and these replicates are designated in the sample name. Gene count data are provided with the original publication. The use of the sequencing data is subject to a data transfer agreement and is restricted to ethically approved research into blood cell malignancies and cannot be used to assess germline variants.
Study
EGAS00001003820
Risk Assessment of Cerebrovascular Events (RACE) Study
This study includes 1,220 cases with young onset stroke (stroke before age 60 years) who are participants of the larger RACE study. Risk Assessment of Cerebrovascular Events (RACE) is an on-going existing case-control study of stroke now involving over 5000 imaging confirmed cases of stroke and 5000 controls, recruited from seven centers in Pakistan. The study is aimed to investigate the genetic, biomarker and lifestyle determinants of stroke and its subtypes. Cases are eligible for inclusion in the study if they: (i) are aged at least 18 years; (ii) present with a sudden onset of neurological deficit respecting a vascular territory with sustained deficit at 24 hours verified by medical attention within 72 hours after onset (onset is defined by when the patient was last seen normal and not when found with deficit); and (iii) the diagnosis is supported by CT/MRI; and (iv) present with a Modified Rankin Score < 2 prior to the stroke. Findings from patient's history, 12-lead ECG and CT or MRI of the brain. The mandatory procedures for inclusion in this investigation are: (i) clinical verification of cerebrovascular event within 72 hours of onset; (ii) neuroimaging CT (non-contrast) or MRI (MRI is not a mandatory investigation but recorded whenever ordered by the attending physician); and (iii) 12-lead ECG. All other ancillary investigations ordered by the attending physician are recorded as well. The TOAST classification method is used to classify ischemic stroke based on aetiology whereas the Oxfordshire classification is used to classify stroke neuro-anatomically. Control participants for this subset of young onset stroke were individuals enrolled in the Pakistan Risk of Myocardial Infarction Study (PROMIS), a case-control study of acute MI based in Pakistan. RACE capitalizes on the genetic data (including information on GWAS) that has already been collected from the healthy participants enrolled in PROMIS. RACE and PROMIS share similar methodology of recruitment. 
Participants from both these investigations are derived from similar catchment areas, hence providing an attractive opportunity for RACE to utilize PROMIS controls as common controls for genetic investigations. Controls in PROMIS were recruited following procedures and inclusion criteria as adopted for RACE cases. In order to minimize any potential selection biases, PROMIS controls selected for this stroke substudy were frequency matched to RACE cases based on age and gender and were recruited in the following order of priority: (1) non-blood related or blood related visitors of patients of the out-patient department; (2) non-blood related visitors of stroke patients; (3) patients of the out-patient department presenting with minor complaints (e.g. back pain, minor gastric complaints). Control subjects from the PROMIS study were genotyped at the Wellcome Trust Sanger Institute on the Illumina 660W Quad array. The Center for Non-Communicable Diseases, Pakistan, serves as the coordinating center for both RACE and PROMIS. More information on these research investigations can be found at www.cncdpk.com. This young onset stroke component to the RACE study was funded through the Gene Environment Association Studies initiative (GENEVA, www.genevastudy.org) as one of three studies designed to assess the genetics of young onset stroke and modification of genetic effects by smoking. GENEVA is part of the trans-NIH Genes, Environment, and Health Initiative (GEI). Genotyping of 1,220 young onset stroke cases was performed at the Johns Hopkins University Center for Inherited Disease Research (CIDR). Data cleaning and harmonization were done at the GEI-funded GENEVA Coordinating Center at the University of Washington. This study is part of the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI).
The overarching goal is to identify novel genetic factors that contribute to stroke through large-scale genome-wide association studies of cases and controls recruited within Pakistan.
Study
phs000456
Genomewide Association Study of Alcohol Use and Alcohol Use Disorder in Australian Twin-Families (OZ-ALC GWAS)
The Australian twin-family study of alcohol use disorder (OZALC study) derives from telephone diagnostic interview studies of two general population volunteer cohorts of Australian twins (cohort 1, mostly born 1940-1964; cohort 2, born 1964-71) and the spouses of the former cohort - a total of over 11,000 families. Three coordinated studies, using a shared assessment protocol and with a shared goal of gene-discovery, were conducted - one funded by the National Institute on Drug Abuse, the others by the National Institute on Alcohol Abuse and Alcoholism - by investigators associated with the Midwest Alcoholism Research Center at Washington University in St. Louis, and investigators at Queensland Institute of Medical Research, Brisbane, Australia (led by Professor Nicholas Martin), using informative families identified from these cohorts. The first of these (NIDA Nicotine Addiction Genetics [NAG] project, PI Pamela Madden) identified index cases from the 3 cohorts with a history of heavy smoking (smoked 20 or more cigarettes daily, or 40 or more cigarettes on 1 or more occasions) and with additional available full siblings who were smokers, and interviewed and obtained blood samples from twins, and cooperative full siblings and parents, in order to identify families that would be informative for linkage analysis of a quantitative heaviness of smoking trait. The second identified additional families with an index case who either reported a history of alcohol dependence (DSM-IV), or scored above the 85th percentile on a quantitative measure of heaviness of alcohol use (alcohol factor score), derived from measures of frequency of heavy drinking, frequency of drinking to intoxication, and typical weekly consumption in standard drinks (all referenced to the respondent's heaviest drinking period) and of lifetime maximum 1-day alcohol consumption and maximum tolerance to alcohol (drinks before getting drunk or before feeling effects of alcohol).
Interview and DNA were obtained from index cases and siblings, and DNA only from available parents. The goal of this second study (NIAAA OZ-ALCOHOL EDAC study, PI Andrew Heath) was to identify sibships including pairs who were either extreme concordant for the quantitative consumption measure (both scoring above the 85th percentile) or extreme discordant (one scoring above the 85th percentile and one scoring below the 30th percentile) that would be informative for linkage analysis. The third identified additional sibships solely on the basis of large sibship size, regardless of alcohol or tobacco use phenotypes (NIAAA OZ-BIGSIB study, PIs the late Richard Todd, Andrew Heath). From these coordinated studies a case-control series of alcohol dependent individuals and unaffected controls was constructed for a family-based Genomewide Association Study (OZALC-GWAS) of heaviness of alcohol use and alcohol dependence, funded by the National Institute on Alcohol Abuse and Alcoholism. These data are made available here for all investigators studying outcomes related to alcohol or tobacco use (including major depressive disorder).
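The extreme concordant and extreme discordant selection rule described above can be sketched as a small classifier. The 85th and 30th percentile thresholds come from the study description. The function name, the labels, and the treatment of scores that fall exactly on a threshold are assumptions made for illustration, not the study's actual procedure.

```python
def classify_sib_pair(percentile_a, percentile_b, high=85.0, low=30.0):
    """Classify a sibling pair by percentile on the alcohol factor score.

    Thresholds follow the description (above the 85th, below the 30th);
    scores exactly on a threshold are treated as not meeting it (assumption).
    """
    a_high = percentile_a > high
    b_high = percentile_b > high
    if a_high and b_high:
        return "extreme concordant"
    if (a_high and percentile_b < low) or (b_high and percentile_a < low):
        return "extreme discordant"
    return "not selected"
```

For example, a pair at the 90th and 20th percentiles would be labelled extreme discordant, while a pair at the 90th and 50th percentiles would not be selected under this rule.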
Study
phs000181
ST dataset from subcortical white matter MS lesions (CA & CI) and controls
This dataset contains the FASTQ files, the corresponding H&E images with the fiducial frames, and the JSON files that were used for the spatial transcriptomic analysis in our paper (n=19). The name of each JSON file includes the slide name (e.g. V11M111-111) and the capture area (e.g. A1) for that sample.
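As a rough illustration of how the slide name and capture area could be recovered from a JSON file name, the sketch below uses two regular expressions. The exact file-name layout is not stated in the description, so the example names, the function name and the patterns (a slide serial shaped like V11M111-111, a capture area from A1 to D1) are assumptions.

```python
import re
from pathlib import Path

# Assumed shape of a slide serial such as "V11M111-111".
SLIDE_RE = re.compile(r"(?<![A-Za-z0-9])(V\d{2}[A-Z]\d{2,3}-\d{3})(?![0-9])")
# Assumed capture-area labels A1, B1, C1 or D1, standing alone in the name.
AREA_RE = re.compile(r"(?<![A-Za-z0-9])([A-D]1)(?![A-Za-z0-9])")


def parse_alignment_json_name(filename):
    """Return (slide_name, capture_area) found in a JSON file name."""
    stem = Path(filename).stem
    slide = SLIDE_RE.search(stem)
    area = AREA_RE.search(stem)
    if slide is None or area is None:
        raise ValueError(f"no slide name / capture area found in {filename!r}")
    return slide.group(1), area.group(1)
```

For a hypothetical file called `sample01_V11M111-111_A1.json`, the function returns the pair `("V11M111-111", "A1")`.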
Dataset
EGAD50000000520
Subtype specific studies of breast cancer progression. Milan cohort.
The aim of this study is to compare DCIS and IBC in a subtype-stratified manner at several genomic levels (gene expression, copy number, methylation). The data in this archive are supplementary to previously published data.
Study
EGAS00001004390
Carcinoma of the oral tongue (OTSCC) genomic landscape characterisation
Carcinoma of the oral tongue (OTSCC) is the most common malignancy of the oral cavity, characterized by frequent recurrence and poor survival. The last three decades have witnessed a change in the OTSCC epidemiological profile, with increasing incidence in younger patients, females and never-smokers. Here, we sought to characterize the OTSCC genomic landscape and to determine factors that may delineate the genetic basis of this disease, inform prognosis and identify targets for therapeutic intervention.
Study
EGAS00001001329
NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE)
This study consists of 338 VTE cases from an inception cohort of Olmsted County, MN residents (OC) with a first lifetime objectively-diagnosed idiopathic VTE during the 40-year study period, 1966-2005. All living study subjects were invited to provide a whole blood sample at the Mayo Clinical Research Unit for leukocyte genomic DNA and plasma collection. For living study subjects who did not provide a blood sample, we retrieved any leftover blood ("waste" blood) from samples collected as part of routine clinical diagnostic testing and used this to extract DNA after obtaining patient consent. For deceased cases, with IRB approval, we extracted DNA from any available stored tissue within the Mayo Tissue Archive. This "tissue" DNA has been successfully genotyped in prior studies. Three trained and experienced study nurse abstractors reviewed the complete medical records in the community of all potential cases. Note: WGS sample IDs for the previous GENEVA study cases (phs000289) are included in this dataset. The phenotypes for the GENEVA study are located under the above phs number.
Study
phs001402
Nasal epithelial cells of PCD and non-PCD patients grown at air-liquid interface for RNAseq analysis
Raw RNAseq data files of genetically unsolved PCD patients used for transcriptomic analysis aiming to raise the diagnostic rate. The non-PCD patients were used as a clinical comparator to the PCD patients. The majority of the nasal epithelial cells used for RNAseq were cultured at an air-liquid interface for 21 days, unless the data file name indicates a different air-liquid-culture time-point.
Study
EGAS00001006632
OncoArray: Follow-up of Ovarian Cancer Genetic Association and Interaction Studies (FOCI)
The Follow-up of Ovarian Cancer Genetic Association and Interaction Studies (FOCI) was one of five projects funded in 2010 as part of the NCI's Genetic Associations and Mechanisms in Oncology (GAME-ON) initiative (http://epi.grants.cancer.gov/gameon/). FOCI represents a collective effort that builds upon the strengths and history of collaboration inherent in the Ovarian Cancer Association Consortium (OCAC), a multidisciplinary group composed of epidemiologists, genetic epidemiologists, statistical geneticists, molecular and cell biologists and clinicians that was formed in 2005. The other four funded GAME-ON projects were: the ColoRectal Transdisciplinary Study (CORECT), Elucidating Loci Involved in Prostate Cancer Susceptibility (ELLIPSE), Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE), and Transdisciplinary Research in Cancer of the Lung (TRICL). As part of our aim to discover, expand, and replicate ovarian cancer susceptibility loci, the GAME-ON projects and other consortia formed the OncoArray network (http://epi.grants.cancer.gov/oncoarray/) to develop and genotype a new custom genotyping array in large numbers of cancer cases and controls (over 400,000 samples) across multiple cancer types. The FOCI data include over 50,000 ovarian cancer cases and controls genotyped with the OncoArray at the Center for Inherited Disease Research (CIDR). Genotype calling and quality control procedures were performed under a standardized protocol across the OncoArray consortium, and over 490,000 SNPs passed QC and are included under this dbGaP submission.
Study
phs001882
RNA-sequencing of 21 inflammatory hepatocellular adenomas
The French ICGC project on liver tumors is coordinated by Pr Jessica Zucman-Rossi and funded by INCa (French Institute for Cancer). The aim of the present project is to identify the catalog of somatic and germline mutations in liver tumors. The present series corresponds to 21 RNA-seq of IHCA samples.
Aligned BAM files correspond to the hg38 human reference.
Study
EGAS00001003685
EuCANCan: EUropean-CANadian Cancer network
The EUropean-CANadian Cancer network (EuCANCan) is a project aiming to develop
a federated network of aligned and interoperable infrastructures for the
homogeneous analysis, management and sharing of genomic oncology data for
Personalized Medicine. This initiative, funded by the European Commission and
the Canadian Institutes of Health and coordinated by the Barcelona
Supercomputing Center (BSC), is expected to process and provide the scientific
community with around 30-35 thousand patient samples from different types of
cancer, coming from different nodes involved in this project, during the four
years of the project.
Data generated as part of primary studies, often in small hospitals or
research centers, are rarely reused by other projects or groups. EuCANCan aims
to foster the homogenisation and discoverability of cancer samples from very
different sources for the whole community. These efforts will help pave the
way to future standardised and global genomics projects, especially for those
participant countries that eventually integrate the methodology and
infrastructure generated during the project.
“The more data we have access to the more and better discoveries. We attempt
to overcome the legal and technical limitations in order to generate, gather
and exploit bigger datasets as well as answer more ambitious questions”. These
were some of the words Dr. David Torrents, main coordinator of the project,
shared with a regional radio station during the kick-off meeting of the
project, on February 11th, 2019.
The EuCANCan network will be composed of reference nodes in Amsterdam,
Barcelona, Berlin, Heidelberg, Paris and Toronto, which have established
strong research and clinical programs in the field of genomic oncology. The
Centre for Genomic Regulation (CRG), specifically the European Genome-phenome
Archive (EGA), is among the leading institutions. Its extensive experience in
biomedical data archival and secure distribution is considered key for crucial
developments within the project. Also notable is the recent inclusion of
EuCANCan in the list of GA4GH driver projects.
Further information and details can be found in the following links:
InsideHPC
BSC official description
Techweek (Spanish)
Blog
eucancan
Genome-Wide Association Studies of Prematurity and Its Complications (African American)
Preterm delivery resulting in the birth of a premature infant is a complex problem with a devastating impact on individuals, families and society. The prevalence of preterm birth has increased steadily in developed countries over the last 20 years and more than three million children die of preterm birth worldwide each year. Despite the importance of the problem and its disproportionate occurrence in poor and minority populations, the underlying causes have been difficult to identify. Spontaneous preterm labor has as its suspected triggers infection, stress, poor nutrition and inherited factors. The single best predictor for preterm delivery is a previous preterm birth. Studies of twins and of recurrences within families provide evidence that genetic factors underlie a substantive component of the risk for prematurity. One major challenge in studying genetic factors in prematurity is that the risk case is not truly established. The genetic risk could reside either in the mother and her uterus or in the infant/placenta. Identification of genetic factors in the mother and/or infant could provide insights into identifying relevant environmental covariates that may be more amenable to rapid interventions but difficult to find using standard epidemiology alone. A comprehensive genome-wide association study (GWAS) is the ideal way to identify those genes that would not be suspected based on our current understanding of the biology of parturition. We are using 2200 African American samples with term or preterm labor. A subset of these (~800) are infant samples recruited by the Neonatal Research Network as part of a study on cytokines and infection in extremely low birth weight infants (Schelonka RL, et al., 2011. PMID: 21145756). Therefore, this group consists of infants <1,000 grams with clinical outcome data for the infant allowing study of the genetic contributors for not only preterm birth but also the complications that often accompany preterm birth. 
The result will enable a better understanding of the biology of parturition and suggest environmental modifications that can prolong gestations to improve neonatal and adult outcomes. This study is part of the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI). The overarching goal is to identify novel genetic factors that contribute to preterm birth through large-scale genome-wide association studies of African-American cases and controls from multiple sites in the United States. Genotyping was performed at the Johns Hopkins University Center for Inherited Disease Research (CIDR). Data cleaning and harmonization were done at the GEI-funded GENEVA Coordinating Center at the University of Washington.
Study
phs000353
SEAsia.Oceania.Australia
The tar archive contains unfiltered genotype data from the Reich et al. AJHG 2011 study in PLINK format
Dataset
EGAD00010002302
SolomonIslands.Dataset
The tar archive contains unfiltered genotype data from Pugach et al. 2018 in PLINK format
Dataset
EGAD00010002306
DAC Dr. PORTEU – Gustave Roussy
Targeting Heterochromatin Eliminates Malignant Stem Cells in Chronic Myelomonocytic Leukemia Through Reactivation of Retroelements and Innate Immune pathways
Data Access Committee Members (Name, Email, Job Title):
Principal Investigator
PORTEU Françoise (FRANCOISE.PORTEU@gustaveroussy.fr)
Head of Bioinformatics platform
DELOGER Marc (Marc.DELOGER@gustaveroussy.fr)
Data Protection Officer
BECHET Clara (Clara.BECHET@gustaveroussy.fr)
Dac
EGAC50000000252
BrainSpan Atlas of the Human Brain
The NIH-funded BrainSpan (www.brainspan.org) and PsychENCODE (www.psychencode.org) Consortia sought to generate and analyze multi-dimensional genomics data from the developing and adult human brain in healthy and disease states. One of the main goals has been to perform large-scale and integrated analysis of the genome, transcriptome, and epigenome of the human brain to broaden our understanding of human neurodevelopment. This dataset consists of sixteen regions, including eleven neocortical areas, of human donors of both sexes and various ethnic groups. In the first stage of this project, we provide the genome-wide exon-level transcriptome data generated using the Affymetrix GeneChip Human Exon 1.0 ST Arrays, and the genome-wide genotyping data for 2.5 million markers using the Illumina Human Omni 2.5-Quad Bead Chips. In the second stage of this project, we provide whole-genome sequencing data, transcriptome data by mRNA-Seq, small RNA data by smRNA-seq, DNA cytosine methylation by Infinium HumanMethylation450 BeadChip, and epigenomic/epigenetic data by ChIP-Seq for H3K4me3, H3K27me3, H3K27ac and CTCF.
Study
phs000755
NHLBI TOPMed: The Jackson Heart Study (JHS)
Since there is a greater prevalence of cardiovascular disease among African Americans, the purpose of the Jackson Heart Study (JHS) is to explore the reasons for this disparity and to uncover new approaches to reduce it. The JHS is a large, community-based, observational study whose 5306 participants were recruited from among the non-institutionalized African-American adults from urban and rural areas of the three counties (Hinds, Madison, and Rankin) that make up the Jackson, MS, metropolitan statistical area (MSA). Jackson is the capital of Mississippi, the state with the largest percentage (36.3%) of African Americans in the United States. The JHS design included participants from the Jackson ARIC study who had originally been recruited through random selection from a drivers' license registry. Approximately six months before the JHS was to begin, an amendment to the federal Driver's Privacy Protection Act was passed that changed the level of consent for public release of personal information from driver's license lists from an "opt out" to an "opt in" basis. The Mississippi Highway Patrol was no longer able to release a complete listing of all persons with driver's licenses or state identification cards, which prevented its use in the JHS. New JHS participants were chosen randomly from the Accudata America commercial listing, which provides householder name, address, zip code, phone number (if available), age group in decades, and family components. The Accudata list was deemed to provide the most complete count of households for individuals aged 55 years and older in the Jackson MSA. A structured volunteer sample was also included in which demographic cells for recruitment were designed to mirror the eligible population. Enrollment was opened to volunteers who met census-derived age, sex, and socioeconomic status (SES) eligibility criteria for the Jackson MSA. In addition, a family component was included in the JHS. 
The sampling frame for the family study was a participant in any one of the ARIC, random, or volunteer samples whose family size met eligibility requirements. Eligibility included having at least two full siblings and four first degree relatives (parents, siblings, children over the age of 21) who lived in the Jackson MSA and who were willing to participate in the study. No upper age limit was placed on the family sample. Known contact information was obtained during the baseline clinic examination from the index family member with a verbal pedigree format to identify name(s), age(s), address(es), and telephone number(s). Recruitment was limited to persons 35-84 years old except in the family cohort, where those 21 years old and above were eligible. Only persons who otherwise met study criteria but were deemed to be physically or mentally incompetent by trained recruiters were excluded from study eligibility (1). (1) Wyatt SB, Diekelmann N, Henderson F, Andrew ME, Billingsley G, Felder SH et al. A community-driven model of research participation: the Jackson Heart Study Participant Recruitment and Retention Study. Ethn Dis 2003; 13(4):438-455 (PMID: 14632263).
Study
phs000964
HELIUS cohort
The gut microbiota composition is unique to every individual but is shaped by common factors including diet, lifestyle, medication use, early-life determinants, living environment or genetics. Most of these factors may be influenced by ethnicity. This study explored variations in fecal microbiota composition in 6048 individuals with different ethnic backgrounds living in the same geographical area (Amsterdam, the Netherlands).
The HELIUS data are owned by the Amsterdam University Medical Centers, location AMC in Amsterdam, The Netherlands. To allow sharing of microbiome data collected in HELIUS with (inter)national researchers, the 16S rRNA sequencing data have been stored at the European Genome-phenome Archive (EGA; accession code EGAD00001004106). Access to these data must be granted before use, also because the HELIUS data are stored with relevant phenotypical variables. Access is granted to all researchers affiliated with an internationally recognized research institution who request to use the HELIUS data within the EGA context, after having signed the data transfer agreement. Any researcher can request the data by submitting a proposal to the HELIUS Executive Board as outlined at http://www.heliusstudy.nl/en/researchers/collaboration, by email: heliuscoordinator at amsterdamumc dot nl. The HELIUS Executive Board will check that proposals do not conflict with the ethical approvals and informed consent forms of the HELIUS study.
Study
EGAS00001002969
European Prospective Investigation into Cancer and Nutrition BAMs
516 DNA samples were collected from individuals upon enrollment into the European Prospective Investigation into Cancer and Nutrition study between 1993 and 1998 across 17 different centers. 126 bp paired-end sequencing reads from the Illumina platform were converted to FASTQ format; the 2 bp molecular barcode at each read of the pair was trimmed and written into the read name. The thymine nucleotide required for ligation was removed from the sequences. The Burrows-Wheeler Aligner (BWA-MEM) was used to align the processed FASTQ files to the hg19 reference genome, followed by indel realignment using GATK. An in-house algorithm was written to collapse read families that share the same molecular barcode sequence.
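The barcode handling described above can be illustrated with a small sketch. It is not the in-house algorithm, which is not published here: the per-base majority vote is a generic stand-in for "collapsing" a read family, and the assumptions that the barcode sits at the start of each read, that the ligation thymine directly follows it, and that a family is keyed by chromosome, position and barcode are all hypothetical.

```python
from collections import Counter, defaultdict

BARCODE_LEN = 2  # "2bp molecular barcode" per read, from the description


def trim_read(name, seq, qual):
    """Move the leading barcode into the read name and drop the ligation T."""
    barcode = seq[:BARCODE_LEN]
    rest_seq, rest_qual = seq[BARCODE_LEN:], qual[BARCODE_LEN:]
    # Assumption: the ligation thymine directly follows the barcode.
    if rest_seq.startswith("T"):
        rest_seq, rest_qual = rest_seq[1:], rest_qual[1:]
    return f"{name}_{barcode}", rest_seq, rest_qual


def collapse_families(reads):
    """Group (chrom, pos, barcode, seq) records and take a per-base majority vote."""
    families = defaultdict(list)
    for chrom, pos, barcode, seq in reads:
        families[(chrom, pos, barcode)].append(seq)
    consensus = {}
    for key, seqs in families.items():
        length = min(len(s) for s in seqs)  # vote only over shared positions
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
        )
    return consensus
```

Keeping the barcode in the read name lets it survive alignment, so families can be formed afterwards from the aligned records.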
Dataset
EGAD00001003583
Anthropological dataset 1 for The admixture histories of Cabo Verde
Datasets used in the article "The genetic and linguistic admixture histories of the islands of Cabo Verde" by Laurent R et al. eLife 2023 (DOI: https://doi.org/10.7554/eLife.79827 - URL: https://elifesciences.org/articles/79827)
File name "eGAdeposit_233CaboVerde_SampleInfo_FINAL_01062022.txt"
Column 1 corresponds to individual alphanumeric codes as in the "eGAdeposit_233CaboVerde_GenotypeFile_FINAL_01062022.vcf" genotype file
Column 2 corresponds to individual's biological sex as per genetic inference
Column 3 corresponds to individual's self-reported age in years
Column 4 corresponds to individual's self-reported cumulated number of years spent in academic or professional education
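A minimal sketch of reading the four-column sample information file follows. The column meanings are taken from the list above; the tab delimiter, the absence of a header row, the treatment of non-numeric entries as missing, and the sample codes in the demo are assumptions that should be checked against the actual file.

```python
import csv
import io
from typing import NamedTuple, Optional


class SampleInfo(NamedTuple):
    sample_id: str                    # column 1: code used in the VCF
    genetic_sex: str                  # column 2: sex from genetic inference
    age_years: Optional[float]        # column 3: self-reported age
    education_years: Optional[float]  # column 4: years in education


def _to_number(text):
    """Convert a field to float, returning None for non-numeric entries."""
    try:
        return float(text)
    except ValueError:
        return None


def read_sample_info(handle, delimiter="\t", has_header=False):
    """Parse rows with at least four columns into SampleInfo records."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)
    return [
        SampleInfo(row[0], row[1], _to_number(row[2]), _to_number(row[3]))
        for row in reader
        if len(row) >= 4
    ]


# Hypothetical two-row stand-in for the real sample information file.
demo_rows = read_sample_info(io.StringIO("CV001\tF\t34\t12\nCV002\tM\tNA\t8\n"))
```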
Dataset
EGAD00001008976
Single-cell bam files and RNA sequencing of viral RNA stocks
244 infected single-cell alveolar bam files, 48 empty well bam files, and 52 RNA sequencing of amplicons (4 SARS-CoV-2 variants with 12 batches and 4 viral variants pool samples).
244 alveolar single cells were captured over 12 experimental batches; the experimental condition of each cell is recorded in the metadata file "infected_cells_final_revision.csv", available on GitHub (https://github.com/twkim-0510/SARS-CoV-2_viral_competition). Each BAM file name corresponds to the sample_name column of the metadata.
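The link between BAM files and metadata rows could be checked with a sketch like the one below. Only the `sample_name` column is named in the description; the assumption that a BAM file name minus its `.bam` extension equals the `sample_name` value, the `batch` column and the demo file names are hypothetical.

```python
import csv
import io
from pathlib import Path


def match_bams_to_metadata(bam_paths, metadata_rows):
    """Pair each BAM path with the metadata row whose sample_name it carries."""
    by_sample = {row["sample_name"]: row for row in metadata_rows}
    matched, unmatched = {}, []
    for bam in bam_paths:
        # Assumption: file name minus ".bam" equals the sample_name value.
        sample = Path(bam).stem
        if sample in by_sample:
            matched[bam] = by_sample[sample]
        else:
            unmatched.append(bam)
    return matched, unmatched


# Hypothetical two-row stand-in for infected_cells_final_revision.csv.
demo_csv = io.StringIO("sample_name,batch\ncellA,1\ncellB,2\n")
demo_rows = list(csv.DictReader(demo_csv))
demo_matched, demo_unmatched = match_bams_to_metadata(
    ["cellA.bam", "bams/cellB.bam", "empty_well_01.bam"], demo_rows
)
```

Files left in the unmatched list, such as the empty-well BAMs, have no row in the metadata under this naming assumption.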
Dataset
EGAD00001009711
Federated EGA
Overview
Federated EGA is a global network of repositories enabling secure discovery and access to sensitive human data.
Our vision is to accelerate scientific discovery and healthcare breakthroughs by creating the go-to worldwide sensitive human data resource.
Our values are:
Privacy & Trust: Ensuring privacy of data subjects, security of services and acting with integrity to build community trust.
FAIRness: Driving reuse of data through standardisation of metadata and clear access policies.
Diversity: Expanding our network and representing global diversity in the data we host to bring benefits worldwide.
Federated EGA - European by name, global by nature
Federated EGA collaborates with European and global initiatives (GA4GH, ELIXIR, 1+ Million Genomes Framework, GDI), demonstrating that a network for transnational discovery of and access to human data is possible. By providing a solution to emerging challenges around secure and efficient management of human omics and associated data, the FEGA Network fosters data reuse, enables reproducibility, and accelerates biomedical research.
For general enquiries or to request more information about Federated EGA, please send an email to fega-info [at] lists [dot] ega-archive [dot] org.
Read more about the Federated EGA in our commentary in Nature Genetics!
Learn more about setting up a Federated EGA Node in our onboarding documentation.
Map of FEGA status across the world
Current status of the FEGA Network
Node | Description | Joined
CEGA | Central EGA is managed by the Centre for Genomic Regulation (CRG) and EMBL-EBI. The EGA has been a trusted repository for sensitive data since 2008. CEGA coordinates the FEGA network, hosts central services and the EGA catalogue. | 2022
Finland | FEGA Finland is hosted by CSC - IT Center for Science Ltd., which is the national ELIXIR Node in Finland. CSC is a state-owned company specialising in providing high-quality IT infrastructure and services for Finnish higher education and research. | 2022
Germany | The German Human Genome-Phenome archive (GHGA) is part of the German National Research Data Infrastructure (NFDI). It is coordinated by the German Cancer Research Center (DKFZ) and includes over 20 academic partners. | 2022
Norway | FEGA Norway is a service by ELIXIR Norway, the Norwegian node of the European organisation ELIXIR. The University of Oslo, as a partner institution in ELIXIR Norway, hosts the service, relying on the TSD infrastructure for sensitive data. ELIXIR Norway is a consortium of 5 Norwegian universities and is coordinated by the University of Bergen. | 2022
Spain | FEGA Spain is co-hosted by the Barcelona Supercomputing Center (BSC) and the Centre for Genomic Regulation (CRG), both part of ELIXIR Spain and the Spanish National Infrastructure for Personalised Medicine associated with Science and Technology (IMPaCT). | 2022
Sweden | FEGA Sweden is hosted by the National Bioinformatics Infrastructure Sweden (NBIS), which legally is part of Uppsala University. NBIS forms the bioinformatics platform at SciLifeLab and constitutes the Swedish node of the European organisation ELIXIR. | 2022
Poland | FEGA Poland is hosted by Biobank Lodz, which is part of University of Lodz. Biobank Lodz is an element of the infrastructure of the Regional Digital Medicine Centers established by the Medical Research Agency. | 2022
Portugal | FEGA Portugal is managed by BioData.pt, the distributed infrastructure for Life and Health data for Portugal. This entity is a non-profit association of 15 life sciences R&I organisations spread across the country, and the home of ELIXIR Node in Portugal. | 2023
Canada | Canadian Genome-phenome Archive is part of the Pan Canadian Genome Library, with the mandate to allow the secure and responsible sharing and analysis of Canadian genomic information. | 2024
Switzerland | Swiss FEGA is hosted by SIB Swiss Institute of Bioinformatics, which is the national ELIXIR Node in Switzerland, on behalf of the Swiss FEGA Partnership which includes 4 other key partners (ETH Zurich, Health 2030 Genome Center, Swiss Data Science Center and Switch). | 2025
Documentation
Title
Version
Description
Mission, vision, values
Federated EGA Mission, vision, values
1.0
The official Mission-Vision-Values statement for FEGA, produced and formally approved by the FEGA Strategic Committee in September 2025.
Structure and Organisation
Federated EGA Structure and Organisation
2.0
The structure of the FEGA network and service expectations, responsibilities and commitments. FEGA is organised into three tiers: Central EGA, FEGA Nodes and FEGA Affiliates.
Strategic Committee
Federated EGA Strategic Committee
1.1
Terms of reference for the FEGA Strategic Committee describe the purpose and objectives of the committee, which is to provide direction and strategic planning for the FEGA network. The committee receives input from, and provides feedback for the EGA Strategic Committee.
Operations Committee
Federated EGA Operations Committee
1.1
Terms of reference for the FEGA Operations Committee describe the purpose and objectives of the committee, which is to review operational performance and coordinate technical roadmaps of FEGA Nodes and Central EGA. The committee receives advice from, and provides operational reporting to the FEGA Strategic Committee.
Guidelines
Node Operations guidelines
2.1
An overview of the operational areas which require resources in order to create a FEGA Node. The document is based on more than 15 years' experience of EMBL-EBI and CRG operating the EGA and the initial experiences of the inaugural FEGA Nodes. The operational areas of responsibility are in line with the FEGA Maturity Model, which guides establishment and operation of FEGA Nodes.
Collaboration
Federated EGA Node Collaboration Agreement
1.3
Nodes are welcome to make a copy of this current version of the CA to start its review with their legal teams and understand the responsibilities of joining FEGA. However, this version (the one with a watermark) must not be signed: the official version needs to be obtained from FEGA through its official channels prior to signing.
Documentation
about/projects-and-funders/federated-ega
Neuromics / RD-Connect - Huntington's disease
This study contains omics datasets from the Neuromics project (www.rd-neuromics.eu) on rare neuromuscular and neurodegenerative disorders. Data includes BAM and VCF files from whole-exome sequencing and standardised phenotypic data mapped to the human phenotype ontology (HPO). In some cases proteomic, transcriptomic and metabolomic data may also be available. This study groups together datasets from individuals with a Huntington's disease phenotype and also includes some unaffected family members. Search under the Neuromics name to find related studies for other neuromuscular and neurodegenerative disorders.
Study
EGAS00001000698
Targeted panel DNA sequencing of melanomas, nevi and melanocytic tumors
This dataset was derived from 360 formalin-fixed paraffin-embedded (FFPE) samples distributed across (i) 118 primary melanomas, (ii) 132 nevi and (iii) 110 melanocytic tumors. Next generation sequencing (NGS) libraries were prepared following the FFPE DNA Archer VariantPlex Somatic Protocol for Illumina. Input DNA was determined by the PreSeq DNA QC assay, with 100 ng of DNA used for samples with high quality scores, 200 ng for low quality scores and 300 ng for samples with bad quality scores as defined by the PreSeq DNA QC assay. NGS libraries were quantified using the KAPA Library Quantification Kit for Illumina (KK4824) and sequencing was performed on the NextSeq 500/2000 without custom primers using paired-end sequencing (151 bp for Read 1 and Read 2) with index reads (8 bp for Index Read 1 and Index Read 2). For each sample, the resulting paired-end FASTQ files are available as part of this dataset. The prefix of each sample name indicates the type of sample: NAE: nevi samples. MEL: melanoma samples. BL: melanocytic tumor.
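The prefix convention and the DNA input amounts stated above can be captured in two small lookup tables. This is a sketch only: the three prefixes and the 100/200/300 ng amounts come from the description, while the function names, the quality-category keys and the example sample names are hypothetical.

```python
# Sample-name prefix -> sample type, as listed in the description.
PREFIX_TO_TYPE = {
    "NAE": "nevus",
    "MEL": "melanoma",
    "BL": "melanocytic tumor",
}

# PreSeq quality category -> DNA input in ng, as listed in the description.
# The category keys themselves are hypothetical labels.
INPUT_NG_BY_QUALITY = {"high": 100, "low": 200, "bad": 300}


def sample_type(sample_name):
    """Return the sample type implied by the prefix of a sample name."""
    for prefix, kind in PREFIX_TO_TYPE.items():
        if sample_name.startswith(prefix):
            return kind
    raise ValueError(f"unrecognised sample prefix in {sample_name!r}")


def dna_input_ng(quality):
    """Return the DNA input (ng) used for a given PreSeq quality category."""
    return INPUT_NG_BY_QUALITY[quality]
```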
Dataset
EGAD50000001297
Genes and Blood Clotting Study (GABC)
Objectives: Use genome-wide approaches to identify genetic variants that influence common thrombosis and hemostasis factors, as well as selected common human traits. Design/Methods: The GABC study was a prospective sibling cohort design. Siblings were recruited by targeted email to the undergraduate and graduate student email lists at the University of Michigan. Healthy persons between 14 and 35 years old who had healthy siblings within the same age restriction were able to participate. Study participants agreed to an online informed consent and subsequently completed a 52-question online survey describing their specific bleeding traits as well as many common human traits. Fifty milliliters of blood was collected into a citrate-dextrose solution (ACD) from each participant. An aliquot of whole blood was used for an automated complete blood count analysis and the remainder was processed into platelet poor plasma and buffy coat portions. Plasma and buffy coat aliquots were snap frozen and stored in liquid nitrogen for future studies. 1189 individuals representing 507 sibships were collected between 06/26/2006 and 01/30/2009. Phenotyping Survey Details: To characterize individual bruising and bleeding history, the online survey recorded answers to questions based on a modified von Willebrand Disease (VWD) screening questionnaire. 
To characterize a collection of each participant's common human traits, the survey recorded answers to questions about height, weight, presence of skin tags, history of acne, eye color, hair color, hair line characteristics, skin sunburn sensitivity, skin tanning ability, natural skin color, freckling, cheek dimpling, earlobe shape, shoe size, foot arch characteristics, hand fifth digit morphology, history of dyslexia, history of migraine headaches, history of seasonal allergies, history of aphthous ulcers, tendency to sneeze while walking into a bright sunny place, history of dental caries, need for corrective eye lenses, handedness and like or dislike of strongly flavored foods. Biochemical phenotyping: Assays for plasma Von Willebrand Factor (VWF) antigen were performed using ELISA and "Alphalisa" techniques. Automated complete blood count analysis was performed on a Bayer Advia 120 on all participants (including WBC differential, RBC indices, and platelet count). For the dbGaP v2 update, new biochemical phenotypes have been submitted and include von Willebrand Factor, von Willebrand Factor propeptide, plasminogen, gamma prime fibrinogen, ADAMTS 13, antithrombin III, protein C, and protein S. All new phenotypes were obtained using "Alphalisa" techniques. Genotyping Details: SNP genotyping was performed using genomic DNA extracted from peripheral blood at the Broad Institute (MIT/Harvard). Genotyping was performed on the Illumina Omni-1 quad chip at the Broad Institute. For the dbGaP v2 update, genotyping data from the Illumina Human Exome was deposited. This study is part of the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI). The overarching goal is to identify novel genetic factors that contribute to blood clotting through large-scale genome-wide association studies of siblings. 
Genotyping was performed at the Broad Institute of MIT and Harvard, a GENEVA genotyping center. Data cleaning and harmonization was performed by the primary investigators at the University of Michigan, Ann Arbor, and at the GEI-funded GENEVA Coordinating Center at the University of Washington. This study serves as a resource for investigators who are interested in the genetic determinants of specific plasma proteins in a healthy population. The sibling cohort design allows for linkage analysis in addition to association studies. Analysis of thrombosis and hemostasis related traits should help elucidate specific biochemical and genetic networks that maintain hemostasis. We hope to identify specific genetic determinants of VWF levels in order to better understand the factors that influence the development of VWD.
Study
phs000304
Wisconsin Longitudinal Study on Aging
The Wisconsin Longitudinal Study (WLS) is a long-term study of a random sample of men and women, who graduated from Wisconsin high schools in 1957, and their siblings. The WLS panel started out with a panel of 10,317 members from the class of 1957. Over time a second panel of 8,734 randomly selected siblings of the original graduate panel were recruited for the study. Of these combined panel members, 9,027 survived and contributed saliva for genetic analysis. Survey data were collected from the original respondents or their parents in 1957, 1964, 1975, 1992, 2004, and 2011, and from a selected sibling in 1977, 1994, 2005, and 2011. WLS data provide a detailed record of educational, social, psychological, economic, mental and physical health characteristics of a relatively homogeneous population that is almost entirely of Northern and Western European ancestry. Saliva was first collected in 2007-2008 by mail. Additional samples were collected in the course of home interviews that began in March 2010.
Study
phs001157
Test dataset with light-weight files
This is a test dataset derived from public data of the 1000 Genomes Project. Its purpose is not to allow for any inference about cohort data or results, but to aid bioinformaticians in the technical development and testing of tools, as well as data consumers in learning how to access information.
This dataset consists of 3 pairs of light-weight (sliced) files: BAM + BAI, CRAM + CRAI and VCF + TBI. These files can be downloaded directly through the EGA-download-client PyEGA3 (https://github.com/EGA-archive/ega-download-client).
For any further questions, please contact the DAC (Helpdesk - email: helpdesk [at] ega-archive [dot] org).
Dataset
EGAD00001009826
Recurrent somatic JAK-STAT mutations within a novel RUNX1-mutated pedigree
The acquisition of somatic mutations is an emerging field of investigation in familial leukemia. Currently, genetic profiles in familial MDS/AML are considered analogous to sporadic disease, although the patterns of clonal evolution within families are poorly defined. We performed whole exome profiling of tumour samples from a novel RUNX1 mutated family, to determine the stepwise evolution of MDS/AML across 4 young siblings. Three siblings developed monocytic AML/RAEB2 at 5 years of age, with hepatosplenomegaly and somatic mutations upregulating JAK-STAT signalling; such mutations are typically detected in <5% of sporadic MDS/AML. Two siblings acquired the canonical JAK2 V617F mutation, while another acquired a unique missense mutation of SH2B3, a negative regulator of JAK2. Notably, 2/3 siblings demonstrated dosage amplification of these mutations due to acquired uniparental disomy of chromosomes 9p and 12q (encompassing JAK2 and SH2B3, respectively). All 4 siblings were heterozygous for the 46/1 JAK2 haplotype associated with predisposition to sporadic V617F myeloproliferative disorders, which likely influenced the acquisition of JAK2 mutations. Our findings provide further evidence that relatives with shared germline mutations may acquire somatic mutations in a non-random manner leading to convergent patterns of disease evolution.
Study
EGAS00001001862
snRNA-seq in white matter post-mortem tissue from MS and controls
This Dataset is currently hosted by the European Nucleotide Archive. To access the data contained within the Dataset please follow the link below:
https://www.ebi.ac.uk/ena/browser/view/PRJEB39323
Dataset consists of 20 snRNA-seq BAM files from 10X v2: 5 samples from postmortem white matter tissue from non-neurological controls and 15 samples from different MS lesions from the white matter tissue of 4 postmortem progressive MS patients.
Dataset
EGAD00001004544
The genomic echoes of the last Green Sahara on the Fulani and Sahelian people
Study
EGAS00001007499
NHLBI TOPMed: The Cleveland Family Study (CFS)
The Cleveland Family Study (CFS) is one cohort involved in the WGS project. The CFS was designed to provide fundamental epidemiological data on genetic and non-genetic risk factors for sleep disordered breathing (SDB). In brief, the CFS is a family-based study that enrolled a total of 2284 individuals from 361 families between 1990 and 2006. The sample was selected by identifying affected probands who had laboratory diagnosed obstructive sleep apnea. All first degree relatives, spouses and available second degree relatives of affected probands were studied. In addition, during the first 5 study years, neighborhood control families were identified through a neighborhood proband, and his/her spouses and first degree relatives. Each exam, occurring at approximately 4 year intervals, included new enrollment as well as follow up exams for previously enrolled subjects. For the first three visits, data, including an overnight sleep study, were collected in the participants' homes while the last visit occurred in a general clinical research center (GCRC). Phenotypic characterization of the entire cohort included overnight sleep apnea studies, blood pressure, spirometry, anthropometry and questionnaires. The GCRC exam (n=735 selected individuals) included more comprehensive phenotype data on a focused subsample of the larger cohort, to permit linking SDB phenotypes with cardio-metabolic phenotypes, with an interest in identifying genetic loci that are associated with these related phenotypes. In this last round of data collection, a subset of 735 individuals was selected based on expected genetic informativity by choosing pedigrees where siblings had extremes of the apnea hypopnea index (AHI). Participants underwent detailed phenotyping including laboratory polysomnography (PSG), ECG, spirometry, nasal and oral acoustic reflectometry, vigilance testing, and blood and urine collection before and after sleep and after an oral glucose tolerance test. 
A wide range of biochemical measures of inflammation and metabolism were assayed by a Core Laboratory at the University of Vermont. 994 individuals were sequenced as part of TOPMed Phase 1, including 507 African-Americans and 487 European-Americans. Among the sequenced individuals, 156 were probands with diagnosed sleep apnea, an additional 706 were members of families with probands, and 132 were from neighborhood control families. 298 individuals were sequenced as part of TOPMed Phase 3.5, including 169 African-Americans and 129 European-Americans. Among the newly sequenced individuals, 33 were probands with diagnosed sleep apnea, an additional 214 were members of families with probands, and 51 were from neighborhood control families. Please note: Phenotype and pedigree data are available through "NHLBI Cleveland Family Study (CFS) Candidate Gene Association Resource (CARe)", phs000284.
Study
phs000954
DNA Methylation Analysis of Peripheral Blood Cells from Siblings Discordant for ASD
Autism is a common disorder whose causes are poorly understood. Both genetic and environmental influences are thought to act together to cause autism. Epigenetics is an area that bridges genetic and environmental influences and is commonly studied by examining dynamic methylation changes to the DNA. We have screened a total of 78 autistic boys and their fathers by this method in 807 genes and find 116 methylation changes that together can correctly distinguish nearly 80% of the affected boys from their fathers. We propose to confirm these exciting data in a much larger set of genes in 1,200 autistic boys and their families, including unaffected brothers. These data not only could provide new insight into the causes of autism but could also result in a screening test for autism.
Study
phs000619
Gut microbiome dynamics unravelled with metagenomics sequencing
It has already been several years since we discovered that microbial
therapeutics like Faecal Microbiota Transplantation can be used to treat
Clostridium difficile infection (CDI) and Inflammatory bowel disease (IBD)
with great success. The concept has challenged us to test the limits of our
knowledge, with the prospect of improving health in conditions as diverse
as Parkinson's disease, depression, autism and obesity, just to
mention a few.
Note: the figure belongs to the original paper and is owned by the authors.
Allyson L. Bird and colleagues from Genentech Inc. (San Francisco) recently
published an interesting study building on the key concept that “As microbial
therapeutics are increasingly being tested in diverse patient populations, it is essential to
understand the host and environmental factors influencing the microbiome.”
The authors of this work analysed 1359 gut microbiome samples using shotgun
metagenomics sequencing. The results of the comparison done between
microbiota composition across the different groups showed that biological sex
is the strongest driver of diversity, while aging (age range studied: 20–69)
appears to be the second factor. As smartly outlined in the authors’ own
graphical abstract (figure to the left), Bacteroidota species consistently
increased with donors' age while Actinobacteriota species,
including Bifidobacterium, decreased.
This study also integrated data from oncological patients to draw
cancer-related evidence. After correcting for technical and
geographic differences, the authors found that cancer patients show significantly
altered gut bacterial communities compared with their healthy counterparts.
For example, E. coli was found to be increased in cancer patients, together with
Bacteroidota/Firmicutes ratios. By comparing sequential samples taken from the
same donors (413) within short-distance time points, the authors found that in
the short term, intra-individual differences were smaller than
inter-individual ones, indicating that a certain degree of community stability is the
norm.
One of the pivotal points of this interesting study is that the data
come from extremely well-defined and selected donors. Specifically, the
data for 949 selected healthy donors came from the Milieu Intérieur (MI)
Consortium, and extensive metadata, including demographic variables,
serological measures, dietary information, and systemic immune profiles, are
available. This precise stratification allowed for the robust detection of
smaller changes than in earlier work. All sequencing data have been
deposited in the EGA under the accession number
EGAS00001004437.
Enjoy this amazing
open access paper
published in the Journal of Experimental Medicine in January 2021.
Blog
gut-microbiome-dynamics-unravelled
H3Africa - Consortium WGS
To address gaps in sequence data for some of the African populations being studied in H3Africa projects, additional sequencing has been carried out through various projects and the data are being made available. The first dataset includes a high coverage whole genome sequence dataset generated at the Baylor College of Medicine with samples provided by H3Africa PIs and collaborators, funded by the National Institutes of Health. The sequence data files are accompanied by minimal metadata, including country, ethnic group (where available) and sex.
Study
EGAS00001005972
Exome sequencing of short SGA children with IGF-I and insulin resistance
Exome sequencing of short SGA children with IGF-I and insulin resistance. Collaboration with Professor David Dunger, University of Cambridge. Funded by NIHR.
Dataset
EGAD00001002208
Population Architecture using Genomics and Epidemiology (PAGE): Causal Variants Across the Life Course (CALiCo): Atherosclerosis Risk in Communities (ARIC)
CALiCo ARIC: The Atherosclerosis Risk in Communities Study (ARIC), sponsored by the National Heart, Lung and Blood Institute (NHLBI), is a prospective epidemiologic study conducted in four U.S. communities. ARIC is designed to investigate the etiology and natural history of atherosclerosis, the etiology of clinical atherosclerotic diseases, and variation in cardiovascular risk factors, medical care and disease by race, gender, location, and date. ARIC includes a Cohort Component and a Community Surveillance Component. Cohort enrollment began in 1987. Each ARIC field center randomly selected and recruited a sample of approximately 4,000 individuals aged 45-64 from a defined population in their community. A total of 15,792 participants received an extensive examination, including medical, social, and demographic data. These participants were reexamined every three years with the first screen (baseline) occurring in 1987-89, the second in 1990-92, the third in 1993-95, and the fourth in 1996-98. A fifth cohort examination is underway (2011-2013). Yearly telephone interviews maintain contact with participants and assess the health status of the cohort. In the Community Surveillance Component, currently ongoing, these four communities are investigated to determine the community-wide occurrence of hospitalized myocardial infarction and coronary heart disease deaths in men and women aged 35-84 years. Hospitalized stroke is investigated in cohort participants only. The study conducts community surveillance of inpatient heart failure (ages 55 years and older) and cohort surveillance of outpatient heart failure events beginning in 2005.
Study
phs000223
International Consortium on the Genetics of Systemic Lupus Erythematosus (SLEGEN)
The genetic makeup of an individual strongly influences the risk of developing systemic lupus erythematosus (SLE). The identification of genes that predispose an individual to SLE will lead to earlier and better diagnosis, better treatments, and possibly prevention. To this end, the International Consortium on the Genetics of Systemic Lupus Erythematosus (SLEGEN) was formed in 2005 and is composed of lupus researchers who agreed to pool their knowledge and resources to search for genes that predispose to lupus. Eight laboratories contributed DNA samples for genotyping at the Broad Institute and association with SLE was performed by the Data Coordinating Center (Wake Forest University), as part of a four stage study design. Stages one and two of this design were graciously funded by the Alliance for Lupus Research (www.lupusresearch.org). In this stage of the study, approximately 767 SLE patients (cases) were compared to approximately 383 non-SLE patients (controls) for differences among the Illumina HumanHap300. The affected individuals are all females of European descent. 82% of the cases are the index case from multiplex pedigrees for SLE and the remaining 18% have self-reported first degree relatives with SLE. A detailed summary of the methods and results can be found in the manuscript in Nature Genetics February 2008 by SLEGEN "Genome-wide association scan in women with systemic lupus erythematosus identifies susceptibility variants ITGAM, PXK, KIAA1542 and other loci". (Please see also Study Accession: phs000202.v1.p1)
Study
phs000216
Growth Statistics
This section shows the growth of registered objects such as studies, datasets and DACs in the EGA.
Available Studies, Datasets and DACs by year.
The figure below represents the distribution of available Studies, Datasets and DACs per year.
Available EGA Studies, Datasets and DACs.
[Bar chart: number of objects (DAC, Dataset, Study) per year; data source: https://stats.ega-archive.org/growth/objects/released]
Created Studies, Datasets and DACs by year.
The figure below represents the distribution of created Studies, Datasets and DACs per year.
Created EGA Studies, Datasets and DACs.
[Bar chart: number of objects (DAC, Dataset, Study) per year; data source: https://stats.ega-archive.org/growth/objects]
Documentation
about/statistics/growth
WES Breast Patient-derived Tumor Organoid
This dataset includes 18 Whole Exome Sequencing (WES) samples from 5 subjects. WES is performed on a matched pair of case/control samples, e.g., tumor/control or organoid/control. For each patient, the same control sample is used for the analysis of tumor and organoid samples.
The sample name structure identifies the type of sample: <subjectId>_[TON][12]_[Case|Ctrl]_EX2; where T, O, and N refer to Tumor tissue, Organoid, and control sample, respectively. The number 1 or 2 refers to the specific tissue, and Case and Ctrl indicate a tumor tissue (or organoid) and a control sample, respectively.
For example, ICSBCS007_T1_Case_EX2 is the first tumor tissue of subject ICSBCS007; ICSBCS007_O1_Case_EX2 is the tumor organoid derived from that tumor sample, and ICSBCS007_N1_Ctrl_EX2 is the matching control sample.
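The naming scheme above can be parsed mechanically. The following is a minimal Python sketch, not part of the dataset's own tooling: the regular expression is transcribed from the pattern in the description, and the names `SampleName` and `parse_sample_name` are hypothetical. It assumes that subject identifiers contain no underscore, which holds for the example given but is not stated in the description.

```python
import re
from typing import NamedTuple

# Transcribed from the documented pattern:
#   <subjectId>_[TON][12]_[Case|Ctrl]_EX2
# Assumption: the subject identifier itself contains no underscore.
SAMPLE_NAME = re.compile(
    r"^(?P<subject>[^_]+)_(?P<kind>[TON])(?P<index>[12])_(?P<role>Case|Ctrl)_EX2$"
)

# Meaning of the single-letter sample type, as given in the description.
KIND_LABELS = {"T": "tumor tissue", "O": "organoid", "N": "control sample"}


class SampleName(NamedTuple):
    subject: str  # e.g. "ICSBCS007"
    kind: str     # "tumor tissue", "organoid" or "control sample"
    index: int    # 1 or 2, the specific tissue
    role: str     # "Case" or "Ctrl"


def parse_sample_name(name: str) -> SampleName:
    """Split a sample name into its documented components."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised sample name: {name!r}")
    return SampleName(
        subject=match["subject"],
        kind=KIND_LABELS[match["kind"]],
        index=int(match["index"]),
        role=match["role"],
    )


print(parse_sample_name("ICSBCS007_O1_Case_EX2"))
# SampleName(subject='ICSBCS007', kind='organoid', index=1, role='Case')
```

The sketch checks only the form of the name; it does not check that the letter and the Case/Ctrl suffix agree (for instance, that N is always paired with Ctrl).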
More information can be found in the sample information table.
Dataset
EGAD50000000961
The Mexican-American Coronary Artery Disease Study (MACAD)
The MACAD Study, funded by NHLBI, was designed to explore genetic contributions to coronary artery disease and glucose homeostasis traits among Hispanics using a family-based design. The baseline examination of the cohort included the euglycemic hyperinsulinemic clamp test from which the two key phenotypes were obtained: insulin sensitivity (M) and metabolic clearance rate of insulin (MCRI). Genome-wide genotyping was obtained under separate funding by NIDDK as a part of the GUARDIAN (Genetics Underlying Diabetes in Hispanics) Consortium.
Study
phs001397
The Hypertension-Insulin Resistance Family Study (HTN-IR)
The HTN-IR Study, funded by NHLBI, was designed to explore genetic contributions to hypertension and glucose homeostasis traits among Hispanics using a family-based design. The baseline examination of the cohort included the euglycemic hyperinsulinemic clamp test from which the two key phenotypes were obtained: insulin sensitivity (M) and metabolic clearance rate of insulin (MCRI). Genome-wide genotyping was obtained under separate funding by NIDDK as a part of the GUARDIAN (Genetics Underlying Diabetes in Hispanics) Consortium.
Study
phs001394
Whole Exome Sequencing of 15 Tumor/Normal pairs of inflammatory hepatocellular adenomas
The French ICGC project on liver tumors is coordinated by Pr Jessica Zucman-Rossi and funded by Inca (French Institute for Cancer). The aim of the present project is to identify the catalog of somatic and germline mutations in liver tumors. The present series corresponds to 15 Tumor/Normal pairs of Whole Exome Sequencing (WES) of IHCA samples.
Aligned BAM files are in either hg19 (CHC750T/CHC750N; CHC2189T/CHC2189N; CHC2615T/CHC2614N) or hg38 (other samples).
Study
EGAS00001003686
Dac for "Combined single-cell transcriptomics and T-cell receptor sequencing reveal heterogeneity of mycosis fungoides between and within patients and identify a CD4+ cytotoxic subtype"
Dac for "Combined single-cell transcriptomics and T-cell receptor sequencing reveal heterogeneity of mycosis fungoides between and within patients and identify a CD4+ cytotoxic subtype" with
Name: Menghong Yin
Email address: menghong.yin@dkfz-heidelberg.de
Associated affiliation: Translational Skin Cancer Research Lab (DKFZ)
Dac
EGAC50000000168
PacBio Revio WGS on 10 carriers of ring and marker chromosomes
BAM files containing PacBio HiFi reads from carriers of ring and marker chromosomes. The reads were generated using the PacBio Revio platform. Each individual was sequenced to roughly 30X coverage on one flow cell. The chromosome of interest is indicated in the file name.
Dataset
EGAD50000002111
Deciphering Developmental Disorders (DDD)
The Deciphering Developmental Disorders (DDD) study is a research collaboration between the Wellcome Trust Sanger Institute, the NHS clinical genetics services and families across the UK and Ireland. The project aims to improve the diagnosis of children with developmental disorders by using high-resolution microarray and massively parallel sequencing technologies on 12,000 children and their parents. Genetic changes that explain the child's symptoms will be displayed in the DECIPHER database (https://decipher.sanger.ac.uk). Extended datasets generated by the DDD project will be available in the European Genome-Phenome Archive with access carefully managed by a Data Access Committee.
Study
EGAS00001000775
The Human Pancreas Analysis Program (HPAP)
The past decade has seen a dramatic improvement in our ability to phenotype and molecularly profile human tissues with unprecedented resolution at the genomic, epigenomic, protein, and functional levels. The Human Pancreas Analysis Program (HPAP), part of the Human Islet Research Network and supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) through multiple NIH grants, is performing deep phenotyping of the human endocrine pancreas to better understand the cellular and molecular events that precede and lead to beta-cell loss and/or dysfunction in type 1 diabetes (T1D) and type 2 diabetes (T2D) as well as to accumulate, analyze, and distribute high-value data sets to the diabetes research community through the HPAP PANC-DB database (and additional details provided at that site). To this end, HPAP employs state-of-the-art technologies to perform comprehensive analyses of pancreas biology as it pertains to organ donors with T1D, autoantibody-positive donors without diabetes, donors with T2D, and control donors. Pancreas procurement and analyses take advantage of the expertise and extensive network of Organ Procurement Organizations (OPOs) and autoantibody screening centers established through the JDRF/The Leona M. and Harry B. Helmsley Charitable Trust (HCT) – funded nPOD program. In contrast to nPOD, the major product provided by HPAP is not archived biomaterial subject to broad distribution, but rather, the delivery of extensive and high-quality molecular data sets to the diabetes research community in order to facilitate further discovery. Together, nPOD and HPAP are complementary programs that assist the diabetes community and afford the maximal opportunity for advancing knowledge about the pathogenesis of T1D and T2D.
Study
phs002465
Single individual whole genome sequencing of Jakun, Indigenous Peoples of the Peninsular Malaysia
This study dissected the genetic structure of an individual from the Jakun, a sub-tribe of the Indigenous Proto-Malay of Peninsular Malaysia. The Jakun are postulated to have settled in Peninsular Malaysia approximately 4-6 KYA, during the last agricultural expansion. However, their genetic structure is still not well understood.
Study
EGAS50000000740
Follow_up_for_second_tier_signals_from_the_arcOGEN_GWAS
The arcOGEN UK nationwide consortium has been funded by the arc to undertake a genome-wide association scan in osteoarthritis. GWA genotyping has been carried out at the Sanger on the Illumina 610k array. This replication study was designed to follow up second-tier signals from the arcOGEN GWAS in 25 SNPs (1 Sequenom iPLEX) with p<5x10^-6 on 6000 additional UK OA and control samples.
Study
EGAS00001001017
Novel manifestations of immune dysregulation and granule defects in gray platelet syndrome
Gray Platelet Syndrome (GPS) is a rare recessive bleeding disorder resulting from biallelic variants in NBEAL2. As part of a comprehensive evaluation of the phenotype and genotype in 47 patients with GPS, four different blood cell-types (platelets, neutrophils, monocytes, and CD4-lymphocytes) were evaluated using bulk RNA-seq in five patients and five controls. These data are deposited in this archive in FASTQ format.
Study
EGAS00001004216
Cardiovascular Health Study (CHS) Cohort: an NHLBI-funded observational study of risk factors for cardiovascular disease in adults 65 years or older
The Cardiovascular Health Study (CHS) is a prospective study of risk factors for development and progression of CHD and stroke in people aged 65 years and older. The 5,888 study participants were recruited from four U.S. communities and have undergone extensive clinic examinations for evaluation of markers of subclinical cardiovascular disease. The original cohort, enrolled in 1989-90, totaled 5,201 participants. A supplemental cohort of 687 predominantly African-American participants was enrolled in 1992-93. Clinic examinations were performed at study baseline and at annual visits through 1998-1999, and again in 2005-2006. Examination components included medical and personal history, medication inventory, ECG, blood pressure, anthropometry, assessment of physical and cognitive function, and depression screening. Other components done less frequently included phlebotomy, spirometry, echocardiography, carotid ultrasound, cerebral magnetic resonance imaging, measurement of ankle-brachial index and retinal exam. Participants were contacted by telephone annually between exams to collect information about hospitalizations and potential cardiovascular events. Since 1999, participants have been contacted every six months by phone, primarily to identify cardiovascular events and to assess physical and cognitive health. Standard protocols for the identification and adjudication of events were implemented during follow-up. The adjudicated events are myocardial infarction, angina, heart failure (HF), stroke, transient ischemic attack (TIA), claudication and mortality. The Cardiovascular Health Study Cohort is utilized in the following dbGaP substudies. 
To view genotypes, analysis, expression data, other molecular data, and derived variables collected in these substudies, see the substudies listed below or the "Substudies" section of the top-level study page, phs000287 Cardiovascular Health Study (CHS) Cohort: an NHLBI-funded observational study of risk factors for cardiovascular disease in adults 65 years or older.
phs000226 STAMPEED: Cardiovascular Health Study (CHS)
phs000301 PAGE: CaLiCo: Cardiovascular Health Study (CHS)
phs000377 CARe: Candidate Gene Association Resource (CARe)
phs000400 GO-ESP: Heart Cohorts Exome Sequencing Project (CHS)
phs000667 CHARGE: Cardiovascular Health Study (CHS)
Study
phs000287
Peripheral blood RNA sequencing of samples for a healthy cohort and a cohort with cancer patients
RNA sequencing (fastq files) of white blood cells (WBCs) from healthy donors (n=376) and cancer patients (n=421) with different diagnoses, stages of disease and previously administered treatments, was performed. Samples from cancer patients were collected from the BostonGene clinical program; all patients provided written consent per IRB-approved protocols. Blood samples from healthy donors were purchased from multiple collection centers throughout the United States.
Whole blood samples (3 ml) in K2-EDTA tubes received within 24 hours of collection at RT underwent red blood cell (RBC) lysis to isolate WBCs. Isolated WBCs for RNA sequencing were centrifuged at 300 x g for 5 minutes with a maximum of 10^6 cells per vial. The supernatant was removed, and the cells were resuspended in cold Homogenization Buffer (2% 1-Thioglycerol, Promega). Samples were then frozen at -80°C until extraction. RNA extraction was performed from frozen samples with Maxwell RSC simplyRNA Cells Kit (Promega) using the benchtop automated Maxwell RSC Instrument (Promega).
Libraries were prepared with Illumina TruSeq® Stranded mRNA Library Prep (Poly-A mRNA; stranded). Libraries were sequenced on NovaSeq 6000 as paired-end reads (2x150) with targeted coverage of 50 million reads.
Dataset
EGAD50000000414
Novel manifestations of immune dysregulation and granule defects in gray platelet syndrome
Gray Platelet Syndrome (GPS) is a rare recessive bleeding disorder resulting from biallelic variants in NBEAL2. As part of a comprehensive evaluation of the phenotype and genotype in 47 patients with GPS, four different blood cell-types (platelets, neutrophils, monocytes, and CD4-lymphocytes) were evaluated using bulk RNA-seq in five patients and five controls. These data are deposited in this archive in FASTQ format.
Dataset
EGAD00001005950
Genetics and Networks of Congenital Heart Defects
Exome sequencing of families with Congenital Heart Defects of diverse sub-phenotypes. Comprises both parent-offspring trios for sporadic cases and multiplex families. Collaboration with David Brook, University of Nottingham. Funded by the British Heart Foundation.
Dataset
EGAD00001002251
NIDDM-Atherosclerosis Study (NIDDM-Athero)
The NIDDM-Atherosclerosis Study, funded by NHLBI, was designed as a family study to examine the genetic basis of subclinical atherosclerosis and diabetes in Hispanic families. Family members of probands with T2D were recruited in the Los Angeles area. The baseline examination of the cohort included the euglycemic hyperinsulinemic clamp test from which the two key phenotypes were obtained: insulin sensitivity (M) and metabolic clearance rate of insulin (MCRI). Genome-wide genotyping was obtained under separate funding by NIDDK as a part of the GUARDIAN (Genetics Underlying Diabetes in Hispanics) Consortium.
Study
phs001130
Combination Therapies for Personalised Cancer Medicine in drug resistant EGFR mutant lung cancer
This study involves mutagenizing a range of different cell lines with ENU to identify those mutations which engender resistance to targeted treatment.
In the last decade we have begun to move towards the use of targeted therapeutics in the clinic, often resulting in dramatic response rates in cancer patients. However, clinical experience has demonstrated time and time again that such responses are invariably followed by the development of drug resistance. Identification of the underlying mechanisms may enable us to reverse such resistance. Traditional methods of modelling this resistance in vitro can lead to a plethora of potential resistance mechanisms that can be difficult to decipher.
Study
EGAS00001002509
Combination Therapies for Personalised Cancer Medicine in 11-18
This study involves mutagenizing 11-18 with ENU to identify those mutations which engender resistance to targeted treatment.
In the last decade we have begun to move towards the use of targeted therapeutics in the clinic, often resulting in dramatic response rates in cancer patients. However, clinical experience has demonstrated time and time again that such responses are invariably followed by the development of drug resistance. Identification of the underlying mechanisms may enable us to reverse such resistance. Traditional methods of modelling this resistance in vitro can lead to a plethora of potential resistance mechanisms that can be difficult to decipher.
Study
EGAS00001002579
Whole-genome sequencing data of human hematopoietic stem and progenitor cells in post-transplant clonal hematopoiesis
This dataset contains BAM files of whole-genome sequencing data of single human HSPCs after colony expansion, for three individuals who harbor a DNMT3A mutant clonal hematopoiesis clone in their blood system after hematopoietic cell transplantation. 29 samples are present in this dataset: 10 samples for LTHIT005, 10 samples for LTHIT131, and 9 samples for LTHIT069. In the sample name, samples are annotated as DNMT3A wildtype (wt) or mutant (mut), except for LTHIT069 (wt samples: F13, H4, K2, L16, N12; mut samples: C8, E10, L9, N5). Whole-genome sequencing was performed after standard Illumina WGS library preparation and sequenced on a NovaSeq 6000 using 150 bp paired-end sequencing. The sequencing depth is 15x genome coverage, or 50 Gbases per sample.
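Because the LTHIT069 sample names do not carry the wt/mut annotation, the status has to be looked up from the list in the description. A minimal Python sketch of such a lookup follows; the colony identifiers are transcribed from the text above, while the names `LTHIT069_STATUS` and `lthit069_status` are hypothetical and how the colony identifier appears inside the actual file names is not specified here.

```python
# DNMT3A status of the LTHIT069 colonies, transcribed from the description.
LTHIT069_WT = ("F13", "H4", "K2", "L16", "N12")
LTHIT069_MUT = ("C8", "E10", "L9", "N5")

LTHIT069_STATUS = {
    **{colony: "wt" for colony in LTHIT069_WT},
    **{colony: "mut" for colony in LTHIT069_MUT},
}


def lthit069_status(colony: str) -> str:
    """Return 'wt' or 'mut' for an LTHIT069 colony identifier."""
    try:
        return LTHIT069_STATUS[colony]
    except KeyError:
        raise KeyError(f"unknown LTHIT069 colony: {colony!r}") from None


# Consistency checks against the counts stated in the description:
# 9 LTHIT069 samples, and 10 + 10 + 9 = 29 samples in total.
assert len(LTHIT069_STATUS) == 9
assert 10 + 10 + len(LTHIT069_STATUS) == 29
```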
Dataset
EGAD50000001347
Therapeutic Trial of Potassium and Acetazolamide in Andersen-Tawil Syndrome
Andersen-Tawil syndrome (ATS) is an ion channel disorder that causes episodes of muscle weakness and potentially life-threatening heart arrhythmias. The majority of ATS cases are caused by a mutation in the KCNJ2 gene, which is linked to potassium channels in the heart, brain, and skeletal muscle; other cases are presumed to be caused by an undetermined gene lesion. To date, the treatment for ATS has been largely anecdotal, and no treatments have been formally assessed in a controlled clinical trial. This study will determine whether potassium supplements and/or acetazolamide, which is a diuretic medication, affect the duration of muscle weakness and heart rhythm abnormalities in people with ATS. Participation in this study will last about 11 months. Participants will first attend a 3-day inpatient visit that will include a medical history, physical examination, blood work, heart rhythm testing by an electrocardiogram (ECG) and Holter monitor, strength testing, a health questionnaire, and daily potassium supplementation. Participants will also track the number and length of weakness episodes that they experience while in the hospital. On the last day of the inpatient visit, participants will be provided with multiple bottles containing either potassium or placebo. Participants will then return home for an 18-week treatment period that will consist of six 3-week-long treatments of either potassium or placebo, with the treatment schedule being randomly determined. Upon completing the first 18-week treatment period, participants will attend a second 3-day inpatient visit that will include the same tests and procedures as the first. The only difference will be that participants will receive acetazolamide along with potassium. This will be followed by a second 18-week treatment period that will consist of six 3-week-long treatments of either acetazolamide or placebo. At the end of the second treatment period, participants will fill out another health questionnaire. 
Throughout both 18-week treatment periods, participants will phone in daily to track any muscle or heart problems. They will also provide blood samples on a weekly basis. At Weeks 2, 5, 8, 11, 14, and 17 of both treatment periods, participants will wear a Holter monitor for 24 hours and then mail it in. A final outpatient visit will occur 8 weeks after the end of the second treatment period and will include heart rhythm testing, muscle strength testing, and blood work.
Study
phs001316
Solve-RD Solving the Unsolved Rare Diseases
Solve-RD – solving the unsolved rare diseases is a research project funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779257 from 1 January 2018 to 31 March 2024. Six European Reference Networks (ERNs; ERN-RND, -ITHACA, -EuroNMD, -GENTURIS, -RITA and -EpiCare) contributed data and samples to one or more of the four cohorts for data re-analysis and novel omics. For more information see https://solve-rd.eu/results/solve-rd-data/.
Study
EGAS00001003851
Siberia.Pakendorf
The tar archive contains a) the txt file with the genotypes, b) the Illumina annotation file with info on SNPs, and c) the sample info file. Unfiltered Illumina data, autosomes only. Data from Pugach et al., MBE 2016, "The Complex Admixture History and Recent Southern Origins of Siberian Populations".
Dataset
EGAD00010002304
Human Tumor Atlas Network (HTAN)
An NCI-funded Cancer Moonshot initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease.
Study
phs002371
Wistar PDX Development and Trial Center
Overall, the outcomes of patients with metastatic melanoma have improved dramatically over the last decade due to an improved understanding of the molecular drivers of this disease. In particular, multiple targeted therapy regimens have been approved for patients with a BRAFV600E/K mutation, which is present in ~50% of cutaneous melanomas. These treatments achieve clinical responses in ~80% of patients with a BRAFV600E/K mutation, thus providing proof-of-concept of the therapeutic potential for personalized therapeutic strategies. However, most of the patients will progress within 2 years of starting those therapies. Further, currently there are no targeted therapies that have been shown to be effective in patients with wild-type BRAF. Thus, there are unmet clinical needs to develop treatments that prevent or overcome resistance to existing therapies for patients with a BRAFV600E/K mutation, and that are effective in patients without a BRAF mutation. In order to facilitate the development of new therapeutic strategies, over the last 5 years we have led a major effort to develop a broad collection of PDX models to reflect the clinical, histological, and genetic heterogeneity of this disease. Our collection of PDX models represents one of the largest collections for any human malignancy, and our initial testing demonstrates that the collection accurately recapitulates the oncogenic drivers and molecular heterogeneity that are observed in patients. This collection also includes a subset of PDX models established from patients with acquired resistance to targeted therapies that have been maintained on those agents in vivo to sustain their resistant phenotype. Together these efforts have generated a robust resource to develop, refine, and prioritize new personalized combinatorial therapies for patients.
Thus, we propose to establish a multi-disciplinary and multi-institutional PDTC Program focused on the use and continued expansion of our robust melanoma PDX collection to identify new therapeutic approaches that fill important clinical gaps in this disease.
Study
phs002432
Carcinoma of Unknown Primary (CUP): A comparison across tissue and liquid biomarkers (CUP-COMP) study
The Carcinoma of Unknown Primary (CUP): A comparison across tissue and liquid biomarkers (CUP-COMP) study (NCT: NCT04750109) was sponsored by The Christie NHS Foundation Trust and funded by Innovate UK. Data generated include: A) Whole genome sequencing* of paired germline blood and tumour tissue; B) Whole genome sequencing* of germline blood only; C) Next-generation sequencing of circulating tumour DNA (FoundationOne Liquid CDx); D) Next-generation sequencing of tumour tissue (FoundationOne CDx). *Whole genome sequencing: germline sample at 30x coverage and somatic sample at 75x coverage, performed on an Illumina NovaSeq 6000 instrument in 150 bp PE read mode.
Study
EGAS00001008239
Center for Education and Drug Abuse Research (CEDAR)
Under its programmatic name, the Center for Education and Drug Abuse Research (CEDAR), this Center of Excellence project was completed in 2015. The CEDAR sample consists of 775 nuclear families comprising an adult male (proband), his spouse/mate, and their biological child who is 10-12 years of age at recruitment. This child is designated as the index child (IC), and is followed through age 30. Minimal exclusionary criteria were applied to maximize the sample's representativeness. Follow-up evaluations on ICs were conducted at ages 14, 16, 19, 22, 25, 27 and 30. Interim questionnaires were mailed annually beginning at age 20. CEDAR utilizes the family/high-risk paradigm. The offspring thus comprise high average risk (HAR) and low average risk (LAR) groups, constituting the majority of the CEDAR children sample: 250 HAR males, 100 HAR females, 250 LAR males and 94 LAR females. The remainder of the IC sample, 50 males and 31 females, are the offspring of fathers who have a lifetime non-SUD psychiatric disorder. The follow-up rate from baseline to the final assessment is ~65%. The cohort of children is 76% White and 22% Black. The representation of females among ICs is lower than that of males, due to the later start of recruitment (2nd funding cycle; the study originally focused on males). The CEDAR genetic study has been augmented by the inclusion of its sample into the whole-genome genotyping effort conducted under the Genes, Environment and Development Initiative (GEDI). DNA from the blood samples of target children (index cases) and their siblings submitted to the NIDA Genetics Depository at Rutgers University was obtained and genotyped on Illumina Human660W-Quad Beadchips, with a high average genotyping rate of 99.8%. The sample comprised 158 females and 271 males (37% and 63%), ~70:30 European- and African-American, with an approximately 1:1 ratio of children of affected fathers and normal controls.
Study
phs001649
HELIUS cohort gut microbiome batch2
The gut microbiota composition is unique to every individual but is shaped by common factors including diet, lifestyle, medication use, early-life determinants, living environment or genetics. Most of these factors may be influenced by ethnicity. This study explored variations in fecal microbiota composition in 6048 individuals with different ethnic backgrounds living in the same geographical area (Amsterdam, the Netherlands).
The HELIUS data are owned by the Amsterdam University Medical Centers, location AMC in Amsterdam, The Netherlands. To allow sharing of microbiome data collected in HELIUS with (inter)national researchers, 16S rRNA sequence analysis has been stored at the European Genome-phenome Archive (EGA; accession code EGAD00001004106). Access must be granted before the data can be used, in part because the HELIUS data are stored with relevant phenotypic variables. Access is granted to all researchers affiliated with an internationally recognized research institution who request to use the HELIUS data within the EGA context, after having signed the data transfer agreement. Any researcher can request the data by submitting a proposal to the HELIUS Executive Board as outlined at http://www.heliusstudy.nl/en/researchers/collaboration, by email: heliuscoordinator at amsterdamumc dot nl. The HELIUS Executive Board will check that proposals do not conflict with the ethical approvals and informed consent forms of the HELIUS study.
Dataset
EGAD00001009732
HELIUS cohort gut microbiome
The gut microbiota composition is unique to every individual but is shaped by common factors including diet, lifestyle, medication use, early-life determinants, living environment or genetics. Most of these factors may be influenced by ethnicity. This study explored variations in fecal microbiota composition in 6048 individuals with different ethnic backgrounds living in the same geographical area (Amsterdam, the Netherlands).
The HELIUS data are owned by the Amsterdam University Medical Centers, location AMC in Amsterdam, The Netherlands. To allow sharing of microbiome data collected in HELIUS with (inter)national researchers, 16S rRNA sequence analysis has been stored at the European Genome-phenome Archive (EGA; accession code EGAD00001004106). Access must be granted before the data can be used, in part because the HELIUS data are stored with relevant phenotypic variables. Access is granted to all researchers affiliated with an internationally recognized research institution who request to use the HELIUS data within the EGA context, after having signed the data transfer agreement. Any researcher can request the data by submitting a proposal to the HELIUS Executive Board as outlined at http://www.heliusstudy.nl/en/researchers/collaboration, by email: heliuscoordinator at amsterdamumc dot nl. The HELIUS Executive Board will check that proposals do not conflict with the ethical approvals and informed consent forms of the HELIUS study.
Dataset
EGAD00001004106
Circulating RNAs in Acute Heart Failure (CRUCIAL)
The purpose of this American Heart Association-funded and NIH-funded study is to examine circulating RNAs in the acute congestive heart failure (CHF) setting, and how they change with decongestive therapy, and their function in vitro and in vivo. The investigators are testing the hypothesis that ex-RNA levels change significantly during decongestion therapy and can be used as a marker of those individuals who respond to CHF therapy (in terms of cardiac structure or outcome). Additionally, the translational research design allows the investigators to assay the effects of these RNAs on tissue phenotypes in vitro.
Study
phs003403
IBD dataset
We performed multi-omics profiling of 38 Crohn's disease and ulcerative colitis patients across several stimulations (RPMI, LPS, Salmonella; 80 samples in total). Nuclei were profiled using the 10X Multiome protocol, which offers paired RNA+ATAC from the same nucleus (i.e. shared barcodes). Per library, the ATAC (I1, R1, R2, R3 reads) and RNA (I1, I2, R1, R2 reads) are provided.
Pools are genetically multiplexed across donors. Genotype files are provided to allow genetic demultiplexing.
Dataset
EGAD50000000198
Indonesian sea-nomads genomic history
This study investigates the origin of the last living sea nomad group in Indonesia using genome-wide SNP data, and tracks their dispersal in the archipelago. The study includes new samples not only from the sea nomad group, but also from surrounding populations where the sea nomads settled. We reveal the scenario of their origin and unique admixture patterns during their dispersal and re-settlement.
Study
EGAS00001002246