Segmented (HMM) copy number aberrations (CNA); discovery set
In 2012, a collaborative effort headed by the teams of Carlos Caldas in Cambridge, UK, and Samuel Aparicio in Canada produced a pivotal study aimed at defining the mutational landscape of breast cancer. They gathered a collection of over 2000 breast cancer samples (METABRIC) with long-term clinical follow-up. They delineated an array of inherited and somatically acquired genetic variants and associated them with gene expression. This led to the identification of putative cancer-driving events and improved stratification of breast cancer patients. The study attracted huge attention and was regarded as a herculean effort by fellow researchers. Christina Curtis and her colleagues deposited the data in the EGA archive (EGAS00000000083). What they probably could not foresee is that their data would be reused more than 100 times, becoming one of the most downloaded datasets at EGA. In 2015, Frederik S. Varn and colleagues accessed the data from New Hampshire, USA, to correlate it with transcriptional programs of hematopoietic cells infiltrating breast tumours, and found that these programs can influence patients' prognosis. Their findings, published in Nature Communications, improved the understanding of the interplay between the immune system and cancer. In 2017, the dataset was downloaded again, this time by an Italian/Polish collaboration, and used to corroborate a mechanistic study demonstrating how a long non-coding RNA (lncRNA) is aberrantly localized to chromatin, where it modulates the expression of oncogenic isoforms, thus contributing to breast cancer. To date, the METABRIC dataset has contributed to 147 publications. Its impact on breast cancer research is incalculable, and it has even spread beyond that field, with recent contributions to the understanding of pan-cancer mechanisms and features. Some of these publications are translating into improvements in the diagnosis and treatment of different types of breast cancer.
This is just an example of how data upcycling can amplify the potential of any dataset, well past the scope of its creation, the imagination of its owners and any geographical border. At EGA, we are proud to empower such fruitful worldwide cycles of knowledge, providing a platform that enables safe sharing of sensitive genetic data.
In 2025, as in previous years, EGA has facilitated the reuse of numerous datasets spanning a broad range of research and healthcare topics. To reflect on this activity, we would like to highlight the top 10 studies behind the most requested datasets of last year. Notably, the vast majority of these highly requested datasets are cancer-focused, particularly involving the study of immune checkpoint inhibitor therapies.

Immune checkpoint inhibitors (ICIs) have become one of the most transformative advances in modern cancer therapy. These monoclonal antibodies enhance antitumor immune responses by targeting inhibitory immune checkpoint molecules. While ICIs have led to remarkably favorable and durable responses in some patients, treatment outcomes vary widely. This variability has fueled intense research efforts aimed at understanding resistance mechanisms, exploring combination therapies, and discovering predictive biomarkers to better stratify patients and guide treatment decisions.

Rank | Study | Title | Disease
1 | EGAS00001005503 | Integrated genomic analyses reveal molecular correlates of clinical response and resistance to atezolizumab in combination with bevacizumab in advanced hepatocellular carcinoma | Liver Cancer
2 | EGAS00001005013 | Intratumoral plasma cells predict outcomes to PD-L1 blockade in non-small cell lung cancer | Lung Cancer
3 | EGAS00001004809 | BIOKEY: A single-cell catalogue of the dynamic changes underlying Checkpoint Immunotherapy response in Early Breast Cancer | Breast Cancer
4 | EGAS50000000138 | BIOKEY: Immune heterogeneity in small cell lung cancer and vulnerability to immune checkpoint blockade | Lung Cancer
5 | EGAS50000000497 | Bladder cancer subtyping study across 4 atezo clinical trials | Bladder Cancer
6 | EGAS50000000689 | MOSAIC - Multi-Omics Spatial Atlas In Cancer | Cancer Atlas
7 | EGAS00000000083 | METABRIC (Breast cancer - genome and transcriptome) | Breast Cancer
8 | EGAS00001005702 | Human genomic and phenotypic synthetic data for the study of rare diseases | Synthetic data
9 | EGAS00001002556 | TGF-β attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells | Bladder Cancer
10 | EGAS00001004353 | Molecular subsets in renal cancer determine outcome to checkpoint and angiogenesis blockade | Renal Cancer

The most requested dataset of the year was EGAD00001008128. It belongs to a study that analyzed 358 liver cancer patients enrolled in clinical trials and treated with the combination of atezolizumab (anti-PD-L1) and bevacizumab (anti-VEGF), atezolizumab alone, or sorafenib. The study highlights how bevacizumab enhances the anti-PD-L1 effect and identifies candidate biomarkers to predict treatment response.

Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer and the third leading cause of cancer-related death worldwide [1]. Because symptoms often appear late, many patients are diagnosed at advanced stages, when curative options are no longer viable. For many years, sorafenib, a multikinase inhibitor approved in 2007, was the only systemic treatment available. Although numerous clinical trials followed, it was not until a decade later that additional therapies showed meaningful clinical benefit. Among these advances, immunotherapy emerged as a promising first-line option [2]. In particular, the combination of atezolizumab and bevacizumab demonstrated improved survival in patients with advanced HCC, reshaping the treatment landscape. However, treatment efficacy remains limited by the complex and diverse hepatic microenvironments that influence immune responses, as well as by high intratumoral heterogeneity. Identifying robust predictors of treatment response is critical to enable effective patient stratification and maximize the benefits of these therapies [2].

The continued demand for these datasets highlights the interest in data reuse and its critical role in deepening our understanding of cancer, accelerating the development of more effective therapies, and improving patient outcomes.
By providing secure access to high-quality genomic and clinical data, EGA acts as a catalyst for biomedical progress.

References
[1] Sung, Hyuna, et al. "Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries." CA: A Cancer Journal for Clinicians 71.3 (2021): 209-249.
[2] Ladd, Alexandra D., et al. "Mechanisms of drug resistance in HCC." Hepatology 79.4 (2024): 926-940.
The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (CNVs, SNPs) and acquired somatic copy number aberrations (CNAs) were associated with expression in 40% of genes, although the landscape was dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP, and MAP2K4. Unsupervised analysis of paired DNA/RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, ER-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the CNA-devoid subgroup and a Basal-specific chromosome 5 deletion-driven mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic copy number aberrations on the transcriptome.