Study Overview
The Environmental Determinants of Diabetes in the Young (TEDDY) Study is a longitudinal study that investigates genetic and genetic-environmental interactions, including gestational events, childhood infections, dietary exposures, and other environmental factors after birth, in relation to the development of islet autoimmunity and type 1 diabetes (T1D). A consortium of six clinical centers was assembled to develop and implement the study, with the goal of identifying environmental triggers for the development of islet autoimmunity and T1D in genetically susceptible individuals. Beginning in 2004, the TEDDY study screened over 400,000 newborns for high-risk HLA-DR, DQ genotypes from both the general population and families already affected by T1D. The TEDDY study enrolled 8,676 participants across six clinical centers worldwide (Finland, Germany, Sweden and three in the United States) in the 15-year prospective follow-up.
Participants are followed every three months for islet autoantibody (IA) measurements with blood sampling until four years of age and then at least every six months until the age of 15. After the age of four, autoantibody-positive participants continue to be followed at three-month intervals and autoantibody-negative participants are followed at six-month intervals. In addition to the analysis of autoantibodies, additional data and sample collection are performed at each visit. Parents collect monthly stool samples in early childhood. The parents also fill out questionnaires at regular intervals in connection with study visits and record information about diet and health status in the child's TEDDY Book between visits. Continued long-term follow-up of the currently active TEDDY participants will provide important scientific information on early childhood diet, reported and measured infections, vaccinations, and psychosocial stressors that may contribute to the development of type 1 diabetes and islet autoimmunity.
Additional information on the TEDDY study is available in the following articles: Rewers et al., 2008, PMID: 19120261 and Hagopian et al., 2006, PMID: 17130573. Details of the TEDDY protocol can be found in Hagopian et al., 2011, PMID: 21564455. TEDDY data currently available in dbGaP include: gene expression, SNPs, exome, microbiome (gut, nasal, and plasma), RNA sequencing, and whole genome sequencing. For more information on the TEDDY Study version history, please refer to the TEDDY Study dbGaP README file.
ImmunoChip SNP
DNA from whole blood samples of study participants and their family members (mothers, fathers, and siblings) was obtained and used for SNP genotyping. Genotyping was performed by the Center for Public Health Genomics at the University of Virginia using the Illumina ImmunoChip SNP array, which contains around 196,000 SNPs from 186 regions associated with 12 autoimmune diseases (Hadley et al., 2015, PMID: 26010309). Data cleaning and validation included the removal of subjects with a low call rate (> 5% of SNPs missing) or with discordance between the array data and reported sex or prior genotyping at the TEDDY HLA laboratory. Additionally, SNPs with a low call rate or a Hardy-Weinberg equilibrium P value < 10^-6, except for chromosome 6 due to HLA eligibility requirements, were removed from the final dataset (Törn et al., 2015, PMID: 25422107).
TEDDY-T1DExome Array
DNA from whole blood samples of study participants and their family members (mothers, fathers, and siblings) was obtained and used for genotyping. Genotyping was performed by the University of Virginia using the Illumina TEDDY-T1DExome array.
The TEDDY-T1DExome array is a custom chip that contains 550,601 markers from the Infinium CoreExome-24 v1.1 BeadChip and an additional 90,214 tagSNPs specifically selected by the TEDDY investigators based on their associations with nutrients, vitamins, type 2 diabetes, autoimmune diseases, body-mass index, or other exposures and phenotypes measured by the TEDDY study. The Illumina GenTrain2 algorithm was used for genotype calling. Sample quality control metrics included sample call rate, heterozygosity rate, and concordance between the reported and genotyped gender.
Gene Expression
The TEDDY study collected peripheral blood for the extraction of total RNA from enrolled children starting at 3 months of age, and then at 3-month intervals up to 48 months and then biannually. Total RNA was extracted using a high-throughput (96-well format) extraction protocol using magnetic (MagMax) bead technology at the TEDDY RNA Laboratory, Jinfiniti Biosciences in Augusta, GA. Purified RNA (200 ng) was further used for cRNA amplification and labeling with biotin using the Target Amp cDNA synthesis kit (Epicenter catalog no. TAB1R6924). Labeled cRNA was hybridized to the Illumina HumanHT-12 Expression BeadChips based on the manufacturer's instructions. The HumanHT-12 Expression BeadChip provides coverage for more than 47,000 transcripts and known splice variants across the human transcriptome.
Microbiome
The TEDDY microbiome study aimed to characterize the longitudinal development of the microbiome, including bacteria, viruses and other microorganisms in the gut, plasma, and nasal cavity of prediabetic and diabetic subjects compared to autoantibody-negative non-diabetic subjects. Stool samples were collected monthly from 3 to 48 months, after which stool samples were collected every 3 months. Nasal swab samples were collected every 3 months starting at 9 months of age until 48 months, after which nasal swabs were collected every 6 months.
Plasma samples were collected every 3 months starting at 3 months of age until 48 months, after which plasma samples were collected every 6 months. If the subject was autoantibody positive at 48 months, they remained on the 3-month collection interval for nasal swab and plasma samples. Samples underwent 16S rRNA gene sequencing, DNA and viral RNA metagenomic shotgun sequencing, and sequencing of the internal transcribed spacer (ITS) regions. Additional information on the TEDDY microbiome data is available in the following articles: Vatanen et al., 2018, PMID: 30356183, Stewart et al., 2018, PMID: 30356187, and Vehik et al., 2020, PMID: 31792456.
RNA Sequencing
The TEDDY study aimed to characterize the transcriptome in subjects with islet autoimmunity and type 1 diabetes compared to matched control subjects. Peripheral blood was collected to extract total RNA from enrolled children starting at 3 months of age, and then at 3-month intervals up to 48 months and then biannually. Total RNA was extracted using a high-throughput (96-well format) extraction protocol using magnetic (MagMax) bead technology at the TEDDY RNA Laboratory, Jinfiniti Biosciences in Augusta, GA. Purified RNA was then sent to the Broad Institute for the generation of the TEDDY RNA sequencing (RNA-Seq) data. The RNA samples were prepped using Superscript III reverse transcriptase and Illumina's TruSeq Stranded mRNA Sample Prep Kit. The TruSeq libraries were run on the Illumina HiSeq2500 platform.
Whole Genome Sequencing
The TEDDY study aimed to conduct deep whole genome sequencing and examine the genomic variations in subjects with islet autoimmunity and type 1 diabetes compared to matched autoantibody-negative and non-diabetic children. DNA from whole blood was obtained from TEDDY children for whole genome sequencing. The WGS data were generated on the Illumina HiSeq X Ten system.
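The SNP-level cleaning step described above for the ImmunoChip data (removal of SNPs with a low call rate or a Hardy-Weinberg equilibrium P value below 10^-6, with chromosome 6 treated as an exception) can be illustrated with a minimal Python sketch. This is not the TEDDY pipeline. The function names, the 0/1/2 genotype coding, the 95% call-rate cutoff, the use of a chi-square test rather than an exact test, and the reading that chromosome 6 is exempt only from the Hardy-Weinberg filter are all assumptions made for illustration.

```python
import math

def call_rate(genotypes):
    """Fraction of non-missing calls; None marks a missing genotype."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg proportions.

    Assumption: a chi-square approximation stands in for whatever
    test the study actually used.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def keep_snp(genotypes, chromosome, min_call_rate=0.95, hwe_threshold=1e-6):
    """Return True if the SNP passes both filters.

    genotypes: list of 0, 1, 2 (copies of allele B) or None (missing).
    Assumption: chromosome 6 skips only the HWE filter, not the
    call-rate filter; min_call_rate is a hypothetical cutoff.
    """
    if call_rate(genotypes) < min_call_rate:
        return False
    if str(chromosome) == "6":
        return True
    called = [g for g in genotypes if g is not None]
    p_value = hwe_p_value(called.count(0), called.count(1), called.count(2))
    return p_value >= hwe_threshold
```

For example, a SNP with 50 homozygotes of each type and no heterozygotes fails the Hardy-Weinberg filter on chromosome 1 but is retained on chromosome 6 under these assumptions.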
The purpose of this project is to identify the genetic factors contributing to dental caries in children and adults. The datasets come from the Center for Oral Health Research in Appalachia (COHRA), which has the long-term goal of determining the sources of oral health disparities in a high-risk, Northern Appalachian population so that effective preventive interventions can be designed and targeted. The Specific Aims of this project are to perform genome-wide association scans of dental caries of the (1) primary dentition in children, and (2) permanent dentition in adults, to identify novel risk variants and to replicate previously nominated risk variants. This project brings together samples from three cohorts: COHRA1 is a cross-sectional cohort comprising members of 862 northern Appalachian families; approximately 80% of the cohort has been previously genotyped by the Center for Inherited Disease Research with support from NIDCR using the Illumina Quad W array (see dbGaP Study Accession: phs000095.v3.p1). Dental SCORE is a cross-sectional cohort comprising approximately 550 unrelated individuals who underwent the same data collection protocol as COHRA1. COHRA2 is an ongoing longitudinal cohort that recruited approximately 1100 northern Appalachian women during pregnancy, and followed them and their children through their children's early childhood; the current project period will continue data collection through age 6 of the child. Phenotypes for this project were derived from intra-oral examinations performed by trained and calibrated research hygienists. In brief, each tooth was recorded as present or absent, and each surface of each present tooth was scored for evidence of decay. From these data, dental caries indices were generated.
This project contains two phases of genotyping: (1) collection of exome SNP Chip data for the previously genotyped COHRA1 samples, and (2) collection of whole-genome SNP Chip data for the remaining COHRA1 samples and all Dental SCORE and COHRA2 samples. These data will support efforts to test hypotheses regarding the causal relationships of risk factors contributing to the unusually high rates of caries formation in the Appalachian population. Ultimately, these data may inform the development of an integrated model of caries risk, in which the effects of genetics, oral ecology, diet, and other environmental/psychosocial factors and behaviors are modeled in concert to explain the disparities, including the high rate of caries onset before age 6. The gene-mapping Aims of this project, which seek to identify the genetic factors that contribute to caries risk, are a requisite step in realizing this integrated model.
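The examination data described above (each tooth recorded as present or absent, each surface of a present tooth scored for decay) lend themselves to simple count-based summaries. The Python sketch below is illustrative only: the description does not state which caries indices were generated, so the two counts, the function names, and the dictionary layout of an examination record are assumptions and not the project's actual definitions.

```python
def decayed_surface_count(exam):
    """Count decayed surfaces across all present teeth.

    exam maps a tooth identifier to None (tooth absent) or to a dict
    mapping surface name -> True/False for evidence of decay.
    This layout is hypothetical.
    """
    return sum(
        sum(1 for decayed in surfaces.values() if decayed)
        for surfaces in exam.values()
        if surfaces is not None
    )

def decayed_tooth_count(exam):
    """Count present teeth that have at least one decayed surface."""
    return sum(
        1
        for surfaces in exam.values()
        if surfaces is not None and any(surfaces.values())
    )

# Hypothetical examination record for one participant.
example_exam = {
    "16": {"occlusal": True, "mesial": False, "distal": True},
    "26": None,  # tooth absent
    "36": {"occlusal": False},
    "46": {"buccal": True},
}
```

With the hypothetical record above, three surfaces on two teeth are counted as decayed; an absent tooth contributes to neither count.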
Accessing Data: Please refer to “Authorized Access” below regarding accessing data through the BioData Catalyst ecosystem. The data from this accession are not available for download through dbGaP.
Related Studies: Other Jackson Heart Study data available include: Imaging studies (JHS-Imaging), Genetics and genomics (The Jackson Heart Study, phs000286.v7.p2), and Collaborative Cohort of Cohorts for COVID-19 Research: Jackson Heart Study (C4R-JHS, phs002907.v1.p1).
Available Data: Data available for request include Jackson Heart Study visit 1-3 examination cycles, collated annual follow-up communication data through 2016, and follow-up for mortality, heart disease, and stroke events through 2014.
Objectives: The objectives of the Jackson Heart Study are to: 1) investigate the associations of biological, psychosocial, and behavioral factors with the incidence of atherosclerotic events and health outcomes in an African American cohort; and 2) increase access to and the participation of African American populations and scientists in biomedical research and professions.
Background: It has long been recognized that African Americans share a disproportionate burden of deleterious health outcomes including diabetes, hypertension, kidney disease and early onset of cardiovascular disease. The Jackson Heart Study was initiated in 2000 to explore potential mechanisms and mediators of health outcomes in a large African American cohort. In addition, the JHS conducts a variety of community education and outreach activities to promote healthy lifestyles to reduce disease risk burden and student training programs to promote and support public health research.
Participants: African American men and women, age 35-84 at entry.
Of the 5,306 cohort members enrolled in the study, the data repository contains data from 3,883 who provided informed consent to share their data with investigators not affiliated with the study.
Design: Participants were enrolled in the study from 2000-2004 from urban and rural areas of three counties (Hinds, Madison and Rankin) in the Jackson, MS metropolitan statistical area. Participants were enrolled from each of 4 recruitment pools: a random sample component (17%), a volunteer component (30%), participants currently enrolled in the Atherosclerosis Risk in Communities (ARIC) Study (31%), and secondary family members (22%). Recruitment was limited to non-institutionalized adult African Americans age 35-84 years, except in the family cohort where those age ≥21 years were eligible. The final cohort of 5,306 participants includes 6.59% of all African American residents aged 35-84 (N=76,426, US Census 2000). Data collection at the baseline exam included a medical history, physical examination, blood/urine analytes and interview questions on areas such as: physical activity; stress, coping and spirituality; racism and discrimination; socioeconomic status; and health care access.
The current release of the Jackson Heart Study includes data collected at baseline, exam 2 (2005-2008), and exam 3 (2009-2013). Annual follow-up and surveillance of clinical cardiovascular events is ongoing.
This dataset contains two sets of samples. The reference sample set consists of a total of 669 samples that had been reported previously to be euploid by the NIPTIFY screening test. The validation sample set is based on a previously published validation study by Zilina et al. (1), consisting of 423 samples, of which 259 were high-risk pregnancies that had undergone diagnostic invasive prenatal analysis (1). All samples were sequenced with Illumina NextSeq 500 platform, producing 85 bp single-end reads with an average per-sample coverage of 0.32× at the University of Tartu, Institute of Genomics Core Facility, according to the manufacturer’s standard protocols, as described previously (1). This study was performed with the approval of the Research Ethics Committee of the University of Tartu (#315/T-13). 1. Zilina O, Rekker K, Kaplinski L, Sauk M, Paluoja P, Teder H, et al. Creating basis for introducing non‐invasive prenatal testing in the Estonian public health setting. Prenat Diagn [Internet]. 2019 Dec 6;39(13):1262-8. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/pd.5578
The Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) is a collaborative effort comprised of a coordinating center and scientific researchers from well-characterized cohort and case-control studies conducted in North America and Europe. This international consortium aims to accelerate the discovery of common and rare genetic risk variants for colorectal cancer by conducting large-scale meta-analyses of existing and newly generated genome-wide association study (GWAS) data, replicating and fine-mapping of GWAS discoveries, and investigating how genetic risk variants are modified by environmental risk factors. To expand these efforts, we assembled case-control sets or nested case-control sets from 20 different North American or European studies. Summary descriptions and study participant inclusions/exclusion criteria for each of these studies are detailed below. The Black Women's Health Study (BWHS): Is the largest follow-up study of the health of African-American women (Cozier et al., 2004; Rosenberg et al., 1995) [PMID: 15018884; PMID: 7722208]. The purpose is to identify and evaluate causes and preventives of cancers and other serious illnesses in African-American women. Among the diseases being studied are breast cancer, colorectal cancer, type 2 diabetes, uterine fibroids, systemic lupus erythematosus, and cardiovascular disease. The study began in 1995, when 59,000 black women from all parts of the United States enrolled through postal questionnaires. The women provided demographic and health data on the 1995 baseline questionnaire, including information on weight, height, smoking, drinking, contraceptive use, use of other selected medications, illnesses, reproductive history, physical activity, diet, use of health care, and other factors. The participants are followed through biennial questionnaires to determine the occurrence of cancers and other illnesses and to update information on risk factors. 
Self-reports of cancer are confirmed through medical records and state cancer registry records. Mouthwash-swish samples, as a source of DNA, were obtained from ~26,000 BWHS participants in 2002-2007. DNA was isolated from the mouthwash-swish samples at the Boston University Molecular Core Genetics Laboratory using the QIAAMP DNA Mini Kit (Qiagen). All incident colorectal cancer cases with a DNA sample were included in the present analysis. Two controls per case, selected from among BWHS participants free of colorectal cancer at end of follow-up, were matched to cases on year of birth (+/- 2 years) and geographical region of residence (Northeast, South, Midwest, and West). A total of 209 colorectal cancer cases and 423 controls were sent for genotyping. Campaign Against Cancer and Heart Disease (CLUE II): The Campaign Against Cancer and Heart Disease is a prospective cohort designed to identify biomarkers and other factors associated with risk of cancer, heart disease, and other conditions (Kakourou et al., 2015) [PMID: 26220152]. A total of 32,894 participants were recruited from May through October 1989 from Washington County, Maryland and surrounding communities. Colorectal cancer cases (n = 297) and matched controls (n = 296) were identified between 1989 and 2000 among participants in the CLUE II cohort of Washington County, Maryland. Colorectal Cancer Study of Austria (CORSA): In the ongoing colorectal cancer study of Austria (CORSA), more than 13,000 Caucasian participants have been recruited within the province-wide screening project "Burgenland Prevention Trial of Colorectal Disease with Immunological Testing" (B-PREDICT) since 2003 (Hofer et al., 2011) [PMID: 21422235]. All inhabitants of the Austrian province Burgenland aged between 40 and 80 years are annually invited to participate in fecal immunochemical testing, and haemoccult-positive screening participants are invited for colonoscopy.
CORSA includes genomic DNA and plasma of colorectal cancer cases, low-risk and high-risk adenomas, and colonoscopy-negative controls. Controls received a complete colonoscopy and were free of colorectal cancer or polyps. CORSA participants have been recruited in the four KRAGES hospitals in Burgenland, Austria, and additionally, at the Medical University of Vienna (Department of Surgery), the Viennese hospitals "Rudolfstiftung" and the "Sozialmedizinisches Zentrum Sud", and at the Medical University of Graz (Department of Internal Medicine). 1403 colorectal cancer and advanced colorectal adenoma cases, and 1404 matched controls were selected for the study. Distribution of factors sex and age (5 year strata) were evenly matched between cases and controls. Cancer Prevention Study II (CPS II): The CPS II Nutrition cohort is a prospective study of cancer incidence and mortality in the United States, established in 1992 and described in detail elsewhere (Calle et al., 2002; Campbell et al., 2014) [PMID: 12015775; PMID: 25472679]. At enrollment, participants completed a mailed self-administered questionnaire including information on demographic, medical, diet, and lifestyle factors. Follow-up questionnaires to update exposure information and to ascertain newly diagnosed cancers were sent biennially starting in 1997. Reported cancers were verified through medical records, state cancer registry linkage, or death certificates. The Emory University Institutional Review Board approves all aspects of the CPS II Nutrition Cohort. A total of 360 cases and 359 controls were selected for this study. Czech Republic Colorectal Cancer Study (Czech Republic CCS): Cases with positive colonoscopy results for malignancy, confirmed by histology as colon or rectal carcinomas, were recruited between September 2003 and May 2012 in several oncological departments in the Czech Republic (Prague, Pilsen, Benesov, Brno, Liberec, Ples, Pribram, Usti and Labem, and Zlin). 
Two control groups, sampled during the same period as case recruitment, were included in the study. The first group consisted of hospital-based individuals with a negative colonoscopy result for malignancy or idiopathic bowel diseases. The reasons for the colonoscopy were: i) positive fecal occult blood test, ii) hemorrhoids, iii) abdominal pain of unknown origin, and iv) macroscopic bleeding. The second control group consisted of healthy blood donor volunteers from a blood donor center in Prague. All individuals were subjected to standard examinations to verify the health status for blood donation and were cancer-free at the time of the sampling. Details of CRC cases and controls have been reported previously (Vymetalkova et al., 2014; Naccarati et al., 2016; Vymetalkova et al., 2016) [PMID: 24755277; PMID: 26735576; PMID: 27803053]. All subjects were informed and provided written consent to participate in the study. They approved the use of their biological samples for genetic analyses, according to the Declaration of Helsinki. The design of the study was approved by the Ethics Committee of the Institute of Experimental Medicine, Prague, Czech Republic. All subjects included in the study were Caucasians and comprised 1792 cases and 1764 matched controls. Controls were matched to CRC cases in a 1:1 ratio. Matching was done on age and sex. Age was matched within +/- 5 years, whereas sex was matched exactly. For the cases without matched controls, matching was done only on sex. Early Detection Research Network (EDRN): The aim of the EDRN initiative is to develop and sustain a biorepository for support of translational research (Amin et al., 2010) [PMID: 21031013]. High-quality biospecimens were accrued and annotated with pertinent clinical, epidemiologic, molecular and genomic information. A user-friendly annotation tool and query tool was developed for this purpose.
The components of this annotation tool include common data elements (CDEs) developed from the College of American Pathologists (CAP) Cancer Checklists and North American Association of Central Cancer Registries (NAACCR) standards. The CDEs provide semantic and syntactic interoperability of the data sets by describing them in the form of metadata or data descriptors. A total of 352 colorectal case samples and 399 controls were selected for this study. Controls were matched to CRC cases based on age and sex. The EPICOLON Consortium (EPICOLON): The EPICOLON Consortium comprises a prospective, multicentre and population-based epidemiology survey of the incidence and features of CRC in the Spanish population (Fernandez-Rozadilla et al., 2013) [PMID: 23350875]. Cases were selected as patients with de novo histologically confirmed diagnosis of colorectal adenocarcinoma. Patients with familial adenomatous polyposis, Lynch syndrome or inflammatory bowel disease-related CRC, and cases where patients or family refused to participate in the study were excluded. Hospital-based controls were recruited through the blood collection unit of each hospital, together with cases. All of the controls were confirmed to have no history of cancer or other neoplasm and no reported family history of CRC. Controls were randomly selected and matched with cases for hospital, sex and age (+/- 5 years). A total of 370 cases and 370 controls were selected for genotyping. Hawaii Adenoma Study: For this adenoma study, two flexible-sigmoidoscopy screening clinics were first used to recruit participants on Oahu, Hawaii. Adenoma cases were identified either from the baseline examination at the Hawaii site of the Prostate Lung Colorectal and Ovarian cancer screening trial during 1996-2000 or at the Kaiser Permanente Hawaii's Gastroenterology Screening Clinic during 1995-2007.
In addition, starting in 2002 and up to 2007, we also approached for recruitment all eligible patients who underwent a colonoscopy in the Kaiser Permanente Hawaii Gastroenterology Department. Cases were patients with histologically confirmed first-time adenoma(s) of the colorectum and were of Japanese, Caucasian or Hawaiian race/ethnicity. Controls were selected among patients with a normal colorectum and were individually matched to the cases on age at exam, sex, race/ethnicity, screening date (+/- 3 months) and clinic and type of examination (colonoscopy or flexible sigmoidoscopy). We recruited 1016 adenoma cases (67.8% of all eligible) and 1355 controls (69.2% of all eligible); 889 cases and 1169 controls agreed to give a blood sample, and 29 cases and 34 controls a mouthwash sample. A total of 989 cases and 1185 controls were genotyped for this study. Columbus-area HNPCC Study (HNPCC, OSUMC): Patients with colorectal adenocarcinoma diagnosed at six participating hospitals were eligible for this study, regardless of age at diagnosis or family history of cancer. Patients with a clinical diagnosis of familial adenomatous polyposis were not eligible for this study. These six hospitals perform the vast majority of all operations for CRC in the Columbus metropolitan area (population 1.7 million). The institutional review board at all participating hospitals approved the research protocol and consent form in accordance with assurances filed with and approved by the United States Department of Health and Human Services. Briefly, during the period of January 1999 through August 2004, 1,566 eligible patients with CRC were accrued to the study (Hampel et al., 2008) [PMID: 18809606]. A total of 1472 colorectal cancer samples had enough blood DNA remaining to be sent for genotyping. Control samples were provided by the Ohio State University Medical Center's (OSUMC) Human Genetics Sample Bank.
The Columbus Area Controls Sample Bank is a collection of control samples for use in human genetics research that includes both donors' anonymized biological specimens and linked phenotypic data. The data and samples are collected under the protocol "Collection and Storage of Controls for Genetics Research Studies", which is approved by the Biomedical Sciences Institutional Review Board at OSUMC. Recruitment takes place in OSUMC primary care and internal medicine clinics. If individuals agree to participate, they provide written informed consent, complete a questionnaire that includes demographic, medical and family history information, and donate a blood sample. 4-7 ml of blood is drawn into each of 3 ACD Solution A tubes and is used for genomic DNA extraction and the establishment of an EBV-transformed lymphoblastoid cell culture, cell pellet in Trizol, and plasma. Controls were matched to CRC cases in a 1:1 ratio. Matching was done on age at reference time (age_ref), race, and sex. Age_ref was matched within +/- 5 years. Sex and race were matched exactly. For the cases without matched controls, matching was done only on sex and race in a 1:1 ratio. Since controls are fewer than cases, each control is matched to at most 2 cases. Health Professionals Follow-up Study (HPFS): The HPFS is a prospective study parallel to the Nurses' Health Study (NHS). The HPFS cohort comprised 51,529 men aged 40-75 who, in 1986, responded to a mailed questionnaire (Rimm et al., 1990) [PMID: 2090285]. Participants provided information on health-related exposures, including current and past smoking history, age, weight, height, diet, physical activity, aspirin use, and family history of colorectal cancer. Colorectal cancer and other outcomes were reported by participants or next-of-kin and were followed up through review of the medical and pathology record by physicians. Overall, more than 97% of self-reported colorectal cancers were confirmed by medical record review.
Information was abstracted on histology and primary location. Incident cases were defined as those occurring after the subject provided the blood sample. Prevalent cases were defined as those occurring after enrollment in the study but before the subject provided the blood sample. Follow-up evaluation has been excellent, with 94% of the men responding to date. Colorectal cancer cases were ascertained through January 1, 2008. In 1993-1995, 18,825 men in the HPFS mailed blood samples by overnight courier, which were aliquoted into buffy coat and stored in liquid nitrogen. In 2001-2004, 13,956 men in the HPFS who had not previously provided a blood sample mailed in a swish-and-spit sample of buccal cells. Incident cases were defined as those occurring after the subject provided a blood or buccal sample. Prevalent cases were defined as those occurring after enrollment in the study in 1986, but before the subject provided either a blood or buccal sample. After excluding participants with histories of cancer (except nonmelanoma skin cancer), ulcerative colitis, or familial polyposis, case-control sets were previously constructed. In addition to colorectal cancer cases and controls, a set of adenoma cases and matched controls with available DNA from buffy coat were selected for genotyping. Over the follow-up period, data were collected on endoscopic screening practices and, if individuals had been diagnosed with a polyp, the polyps were confirmed to be adenomatous by medical record review. Adenoma cases were ascertained through January 1, 2008. A separate case-control set was constructed of participants diagnosed with advanced adenoma matched to control participants who underwent a lower endoscopy in the same time period and did not have an adenoma. Advanced adenoma was defined as an adenoma 1 cm or larger in diameter and/or with tubulovillous, villous, or high-grade dysplasia/carcinoma-in-situ histology.
Matching criteria included year of birth (within 1 year) and month/year of blood sampling (within 6 months), the reason for their lower endoscopy (screening, family history, or symptoms), and the time period of any prior endoscopy (within 2 years). Controls matched to cases with a distal adenoma either had a negative sigmoidoscopy or colonoscopy examination, and controls matched to cases with proximal adenoma all had a negative colonoscopy. In total, 159 advanced adenoma cases and 109 controls were selected for genotyping. Leeds Colorectal Cancer Study (LCCS): Following local ethical approval, colorectal cancer cases were recruited from 1997 until 2012 in Leeds, UK through surgical clinics. Initially, funding was provided by the UK Ministry of Agriculture, Farming and Fisheries (subsequently the Food Standards Agency) and Imperial Cancer Research Fund (subsequently Cancer Research UK). Recruitment also occurred similarly in Dundee, Perth and York between 1997 and 2001 using the same protocol, and the data and samples were combined. Pathologically confirmed cases were consented at outpatient clinics, providing information on known and postulated risk factors for colorectal cancer (diet, lifestyle and family history) as well as providing a blood sample for DNA. Exclusion criteria included pre-existing diverticular disease and an inability to complete the questionnaire. The General Practitioners (GPs) of cases were asked to send letters to other persons on their patient lists who were of the same gender as the case and born within 5 years of the case (all UK residents have a nominated General Practitioner to whom they refer initial medical queries). Subsequently, to enhance the number of controls, we systematically invited patients from selected GP practices. Diet was assessed in cases and controls using an extensive dietary and lifestyle questionnaire modified from that produced by the European Prospective Investigation in Cancer (EPIC).
The frequency with which each specific food item was eaten was recorded, and we also obtained average fruit and vegetable consumption as a cross-check. In total, 1591 cases and 739 controls provided a DNA sample. The North Carolina Colon Cancer Studies (NCCCS I/II): The North Carolina Colon Cancer Studies (NCCCS I, colon, and NCCCS II, rectal) were population-based case-control studies conducted in 33 counties of North Carolina. Cases were identified using the rapid case ascertainment system of the North Carolina Central Cancer Registry. Patients with a first diagnosis of histologically confirmed invasive adenocarcinoma of the colon (cecum through sigmoid colon) between October 1996 and September 2000 were classified as potential cases in the NCCCS I. The NCCCS II included patients with a first diagnosis of histologically confirmed invasive adenocarcinoma of the sigmoid colon, rectosigmoid, or rectum (hereafter collectively referred to as rectal cancer) between May 2001 and September 2006. Additional eligibility requirements were: age 40-80 years, residence in one of the 33 counties, ability to give informed consent and complete an interview, possession of a driver's license or identification card issued by the North Carolina Department of Motor Vehicles (if under the age of 65), and no objection from the primary physician to contacting the individual. Controls, identified and sampled during the respective study dates, were selected from two sources. Potential controls under the age of 65 were identified using the North Carolina Department of Motor Vehicles records. For those 65 years and older, records from the Center for Medicare and Medicaid Services were used. Controls were matched to cases using randomized recruitment strategies. Recruitment probabilities were assigned within strata of 5-year age, sex, and race groups.
Dietary information was collected using a modified version of the semiquantitative food frequency questionnaire developed at the National Cancer Institute. In addition, participants were asked about vitamin and mineral supplementation, special diets, restaurant eating, sodium use, and fats used in cooking. In NCCCS I, 515 colorectal cases and 687 matched controls were sent for genotyping. In NCCCS II, 796 colorectal cases and 823 controls were sent for genotyping. Controls were matched to CRC cases in a 1:1 ratio. Matching was done on age, race, and sex. Age was matched within ±5 years. Race and sex were matched exactly. For the cases without matched controls, matching was done only on sex and race. Nurses' Health Study (NHS): The NHS cohort began in 1976 when 121,700 married female registered nurses aged 30-55 years returned the initial questionnaire that ascertained a variety of important health-related exposures (Belanger et al., 1978) [PMID: 248266]. Since 1976, follow-up questionnaires have been mailed every 2 years. Colorectal cancer and other outcomes were reported by participants or next-of-kin and followed up through review of the medical and pathology record by physicians. Overall, more than 97% of self-reported colorectal cancers were confirmed by medical-record review. Information was abstracted on histology and primary location. The rate of follow-up evaluation has been high: as a proportion of the total possible follow-up time, follow-up evaluation has been more than 92%. Colorectal cancer cases were ascertained through June 1, 2008. In 1989-1990, 32,826 women in NHS I mailed blood samples by overnight courier, which were aliquoted into buffy coat and stored in liquid nitrogen. In 2001-2004, 29,684 women in NHS I who did not previously provide a blood sample mailed a swish-and-spit sample of buccal cells. Incident cases were defined as those occurring after the subject provided a blood or buccal sample. 
Prevalent cases were defined as those occurring after enrollment in the study in 1976 but before the subject provided either a blood or buccal sample. After excluding participants with histories of cancer (except nonmelanoma skin cancer), ulcerative colitis, or familial polyposis, case-control sets were previously constructed from which DNA was isolated from either buffy coat or buccal cells for genotyping. In addition to colorectal cancer cases and controls, a set of advanced adenoma cases and matched controls with available DNA from buffy coat were selected for genotyping. Over the follow-up period, data were collected on endoscopic screening practices and, if individuals had been diagnosed with a polyp, the polyps were confirmed to be adenomatous by medical record review. Adenoma cases were ascertained through June 1, 2011. A separate case-control set was constructed of participants diagnosed with advanced adenoma matched to control participants who underwent a lower endoscopy in the same time period and did not have an adenoma. Advanced adenoma was defined as an adenoma more than 1 cm in diameter and/or with tubulovillous, villous, or high-grade dysplasia/carcinoma-in-situ histology. Matching criteria included year of birth (within 1 year) and month/year of blood sampling (within 6 months), the reason for their lower endoscopy (screening, family history, or symptoms), and the time period of any prior endoscopy (within 2 years). Controls matched to cases with a distal adenoma either had a negative sigmoidoscopy or colonoscopy examination, and controls matched to cases with proximal adenoma all had a negative colonoscopy. A total of 272 cases and 236 matched controls were sent to CIDR for the advanced adenoma case-control set. 
Northern Swedish Health and Disease Study (NSHDS): Comprises over 110,000 participants, including approximately one third with repeated sampling occasions, from three population-based cohorts (Dahlin et al., 2010; Myte et al., 2016) [PMID: 20197478; PMID: 27367522]. The largest is the ongoing Vasterbotten Intervention Programme, in which all residents of Vasterbotten County are invited to a health examination upon turning 30 (some years), 40, 50 and 60 years of age. Extensive measured and self-reported health and lifestyle data, as well as blood samples for central biobanking in Umea, Sweden, are collected at the health exam. Leucocyte DNA samples for 1:1-matched CRC case-control sets from the NSHDS, of which 878 samples are included in this study, have been selected for genotyping. This is in addition to 354 samples from the NSHDS previously analyzed as part of the multicenter EPIC cohort. Cancer-specific and overall survival data are available for all patients. For at least 425 patients, archival tumor tissue has been analyzed for the BRAF V600E mutation and by sequencing codon 12 and 13 for KRAS mutations, as well as for MSI screening status by immunohistochemistry and for an eight-gene CIMP panel using quantitative real-time PCR (MethyLight). Ohio Colorectal Cancer Prevention Initiative (OCCPI, OSUMC): OCCPI (ClinicalTrials.gov identifier: NCT01850654) is a population-based study of colorectal cancer patients diagnosed in one of 51 hospitals throughout the state of Ohio from January 1, 2013 through December 31, 2016. The OCCPI was created to decrease CRC incidence in Ohio by identifying patients with hereditary predisposition (statewide universal tumor screening for newly diagnosed CRC patients), increase colonoscopy compliance for first-degree relatives of CRC patients, and encourage future research through the creation of a biorepository. 
The 51 Ohio hospitals participating in the OCCPI were selected to represent a cross-section of clinical centers in the state based on high reported volume of CRC patients, affiliation with a high-volume hospital, or interest in participation. Institutional Review Board (IRB) approval was obtained by the individual hospitals, Community Oncology Programs, or by ceding review to the OSU IRB. Written informed consent was obtained. A total of 2139 colorectal cases were genotyped. Patients were considered eligible for this study if they were age 18 or older at the time of enrollment and had a surgical resection (or biopsy if unresectable) in the state of Ohio demonstrating an adenocarcinoma of the colorectum between January 1, 2013 and December 31, 2016. Matched control samples were selected from the Ohio State University Medical Center's (OSUMC) Human Genetics Sample Bank in an identical way to the selection for the Columbus-area HNPCC Study (please refer to the description for the Columbus-area HNPCC Study). Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO): PLCO enrolled 154,934 participants (men and women, aged between 55 and 74 years) at ten centers into a large, randomized, two-arm trial to determine the effectiveness of screening to reduce cancer mortality. Sequential blood samples were collected from participants assigned to the screening arm. Participation was 93% at the baseline blood draw. In the observational (control) arm, buccal cells were collected via mail using the "swish-and-spit" protocol and the participation rate was 65%. Details of this study have been previously described (Huang et al., 2016) [PMID: 27673363] and are available online (http://dcp.cancer.gov/plco). For this study, 1651 advanced adenoma cases and 1392 controls were selected for genotyping. 
Selenium and Vitamin E Prevention Trial (SELECT): The Selenium and Vitamin E Cancer Prevention Trial (SELECT) was a double-blind, placebo-controlled clinical trial which explored using selenium and vitamin E alone and in combination to prevent prostate cancer in healthy men (Lippman et al., 2009) [PMID: 19066370]. Secondary endpoints included the prevention of colorectal and lung cancers. SELECT was conducted at 427 sites and centers in the United States, Canada and Puerto Rico; 35,533 men 55 years and older (50 or older if African American) were randomized beginning August 22, 2001. Supplementation was discontinued on October 23, 2008 due to futility. A total of 308 colorectal cancer cases and 308 matched controls were selected from the SELECT population and sent for genotyping. Screening Markers For Colorectal Disease Study and Colonoscopy and Health Study (SMS-REACH): Details on this study population were previously reported (Burnett-Hartman et al., 2014) [PMID: 24875374]. Participants were enrollees in an integrated health-care delivery system in western Washington State (Group Health Cooperative, Seattle, Washington) aged 24-79 years who underwent an index colonoscopy for any indication between 1998 and 2007 and donated a buccal-cell or blood sample for genotyping analysis. Study recruitment took place in 2 phases, with phase 1 occurring in 1998-2003 and phase 2 occurring in 2004-2007. Persons who had undergone a colonoscopy less than 1 year prior to the index colonoscopy, persons with inadequate bowel preparation for the index colonoscopy, and persons with a prior or new diagnosis of colorectal cancer, a familial colorectal cancer syndrome (such as familial adenomatous polyposis), or another colorectal disease were ineligible. Patients diagnosed with adenomas or serrated polyps and persons who were polyp-free at the index colonoscopy (controls) were systematically recruited during both phases of recruitment. 
Approximately 75% agreed to participate and provided written informed consent. Based on medical records, persons who agreed to participate and those who refused study participation were similar with respect to age, sex, and colorectal polyp status. Study protocols were approved by the institutional review boards of the Group Health Cooperative and the Fred Hutchinson Cancer Research Center (Seattle, Washington). A total of 575 cases and 508 matched controls were selected for the study. Controls were matched to CRC cases in a 1:1 ratio. Matching was done on age_ref, race, and sex. Age_ref was matched within ±5 years. The Women's Health Initiative (WHI): WHI is a long-term national health study that has focused on strategies for preventing heart disease, breast and colorectal cancer, and osteoporotic fractures in postmenopausal women. The original WHI study included 161,808 postmenopausal women enrolled between 1993 and 1998. The Fred Hutchinson Cancer Research Center in Seattle, WA serves as the WHI Clinical Coordinating Center for data collection, management, and analysis of the WHI. The WHI has two major parts: a partial factorial randomized Clinical Trial (CT) and an Observational Study (OS); both were conducted at 40 Clinical Centers nationwide. The CT enrolled 68,132 postmenopausal women between the ages of 50 and 79 into trials testing three prevention strategies. If eligible, women could choose to enroll in one, two, or all three of the trial components. The components are: Hormone Therapy Trials (HT): This double-blind component examined the effects of combined hormones or estrogen alone on the prevention of coronary heart disease and osteoporotic fractures, and associated risk for breast cancer. Women participating in this component with an intact uterus were randomized to estrogen plus progestin (conjugated equine estrogens [CEE], 0.625 mg/d plus medroxyprogesterone acetate [MPA], 2.5 mg/d) or a matching placebo. 
Women with prior hysterectomy were randomized to CEE or placebo. Both trials were stopped early, in July 2002 and March 2004, respectively, based on adverse effects. All HT participants continued to be followed without intervention until close-out. Dietary Modification Trial (DM): The Dietary Modification component evaluated the effect of a low-fat and high fruit, vegetable and grain diet on the prevention of breast and colorectal cancers and coronary heart disease. Study participants were randomized to either their usual eating pattern or a low-fat dietary pattern. Calcium/Vitamin D Trial (CaD): This double-blind component began 1 to 2 years after a woman joined one or both of the other clinical trial components. It evaluated the effect of calcium and vitamin D supplementation on the prevention of osteoporotic fractures and colorectal cancer. Women in this component were randomized to calcium (1000 mg/d) and vitamin D (400 IU/d) supplements or a matching placebo. The Observational Study (OS) examines the relationship between lifestyle, environmental, medical and molecular risk factors and specific measures of health or disease outcomes. This component involves tracking the medical history and health habits of 93,676 women not participating in the CT. Recruitment for the observational study was completed in 1998 and participants were followed annually for 8 to 12 years. All centrally confirmed cases of invasive colorectal cancer, or deaths from colorectal cancer, were selected as potential cases from the September 30, 2015 database. Controls were participants free of colorectal cancer (invasive or in situ) as of September 30, 2015. Potential cases and controls were excluded if they (1) were non-White; (2) had a history of colorectal cancer at baseline; (3) were lost to follow-up after enrollment; (4) were dbGaP ineligible; (5) had <1.25 ug of DNA; (6) were selected for WHI study M26 Phase I or II; or (7) were selected for WHI study AS224 and also included in the imputation project. 
A total of 578 cases and 104,429 controls met the eligibility criteria. Each case was matched with 1 control (1:1) that exactly met the following matching criteria: age (±5 years), 40 randomization centers (exact), WHI date (±3 years), CaD date (±3 years), OS flag (exact), HRT assignments (exact), DM assignments (exact), and CaD assignments (exact). Control selection was done in a time-forward manner, selecting one control for each case from the risk set at the time of the case's event. The matching algorithm was allowed to select the closest match based on a criterion that minimizes an overall distance measure (Bergstralh EJ, Kosanke JL. Computerized matching of cases to controls. Technical Report #56, Department of Health Sciences Research, Mayo Clinic, Rochester MN. April 1995). Each matching factor was given the same weight. When exact matches could not be found, the matching criteria were gradually relaxed among unmatched cases and controls until all cases had found matched controls. Using the matching criteria specified above, 559 of the 578 eligible cases found exact matches. The matching criteria were then relaxed to: age ±5 years, randomization centers, WHI date ±3 years, CaD date ±3 years, OS flag, HRT flag, DM flag, CaD flag. With these criteria, 17 of the remaining 19 unmatched cases found matched controls. By matching on age ±5 years, randomization centers, WHI date ±3 years, CaD date ±3 years, OS flag, and HRT flag, the remaining 2 unmatched cases found their matches.
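The matching procedure described above (calipers on continuous factors, exact agreement on categorical factors, an equally weighted distance, and gradual relaxation for unmatched cases) can be illustrated with a minimal Python sketch. This is not the Bergstralh and Kosanke program or the WHI implementation: the `Person` fields, the `CALIPERS` values, the greedy nearest-control selection, and the `rounds` structure are assumptions chosen for illustration, and the time-forward risk-set sampling is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout; field names are illustrative only.
    pid: str
    age: float
    enroll_year: float
    center: int
    os_flag: bool


# Assumed calipers for the continuous matching factors.
CALIPERS = {"age": 5.0, "enroll_year": 3.0}


def distance(case, control):
    """Equal-weight distance: each factor's difference scaled by its caliper."""
    return sum(
        abs(getattr(case, f) - getattr(control, f)) / width
        for f, width in CALIPERS.items()
    )


def eligible(case, control, exact_fields):
    """True if the control is within every caliper and agrees on exact_fields."""
    within = all(
        abs(getattr(case, f) - getattr(control, f)) <= width
        for f, width in CALIPERS.items()
    )
    exact = all(getattr(case, f) == getattr(control, f) for f in exact_fields)
    return within and exact


def match_cases(cases, controls, rounds):
    """Greedy 1:1 matching without replacement.

    rounds lists the exact-match field sets from strictest to most relaxed;
    cases left unmatched in one round are retried in the next.
    """
    available = {c.pid: c for c in controls}
    matches = {}
    unmatched = list(cases)
    for exact_fields in rounds:
        still_unmatched = []
        for case in unmatched:
            candidates = [
                c for c in available.values() if eligible(case, c, exact_fields)
            ]
            if not candidates:
                still_unmatched.append(case)
                continue
            # Closest control wins; ties are broken by identifier.
            best = min(candidates, key=lambda c: (distance(case, c), c.pid))
            matches[case.pid] = best.pid
            del available[best.pid]
        unmatched = still_unmatched
    return matches, [c.pid for c in unmatched]
```

In this sketch a case that finds no control under the strict field set is carried into the next round, where fewer exact-match fields are required, mirroring the stepwise relaxation described above. A greedy pass does not guarantee a globally optimal pairing; it only picks the closest remaining control for each case in turn.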
The purpose of this project is to make clinical measurements from the PREDICT-HD consortium available through the dbGaP mechanism. The phenotype data will first be converted into a community open standard and subsequently exported to dbGaP for archival and open access distribution of the results of the studies. This will give members of the scientific community a permanent resource for investigating the interactions of phenotypes in an international cohort of early Huntington Disease. In the version 2 cut of the data, we provided HD CAG repeat lengths for both alleles as well as the enrollment age of all participants. We have also generated unique identifiers prospectively compatible with the larger initiative GWAS in Huntington's Disease project (also on dbGaP). As such, the version 1 cut of the data was mainly proof of concept and should be deprecated. Going forward, all updates will add on to the version 2 cut of the data. In the version 3 cut of the data, we provided baseline or the first usable MRI T1-weighted imaging analysis subcortical and cortical segmentations and cortical parcellations based on a customized Freesurfer 5.2 pipeline developed at The University of Iowa. The customizations to the standard pipeline were mainly to improve bias field correction and image normalization such that segmentation of gray, white, internal CSF, dura and surface CSF are optimized for the Freesurfer pipeline. There are 1111 subjects with results in this data release. In the version 4 cut of the data, we provided all longitudinal clinical measurements for all subjects (total of 1476) assessed or enrolled through the end of 2013. Additionally, we are providing measurements on 39 baseline FDG PET images spatially normalized by SPM5 into MNI space, relative regional metabolic values computed in 120 volumes of interest (VOI) defined in the Automated Anatomical Labeling (AAL) Atlas (Tzourio-Mazoyer et al., 2002), and global metabolic values calculated by SPM standard mean voxel value (within per image fullmean/8 mask). 
This project is a funded ancillary study of PREDICT-HD. In the version 5 cut of the data, we provided the first of many forthcoming results from ancillary studies of PREDICT-HD. In this data cut, we provide individual subject results derived from structural MRI data. The earliest MRI session for each subject was used. The results summarized represent source-based morphometry loading coefficients for 23 components (see: "Patterns of Co-Occurring Gray Matter Concentration Loss across the Huntington Disease Prodrome", Ciarochi et al., 2016, Front Neurol. 2016; 7: 147, Published online 2016 Sep 21. doi: 10.3389/fneur.2016.00147). In this version 6 cut of the data, we provide a full set of derived data, more than 10,000 raw MRI images, and ancillary study data sets. For sample information please link to: PREDICT-HD Biospecimen Resources
We sought to characterize cellular heterogeneity in the human cerebral cortex at a molecular level during cortical neurogenesis. We captured single cells and generated sequencing libraries using the C1™ Single-Cell Auto Prep System (Fluidigm), the SMARTer Ultra Low RNA Kit (Clontech), and the Nextera XT DNA Sample Preparation Kit (Illumina). We performed unbiased clustering of the single cells and further examined transcriptional variation among cell groups interpreted as radial glia. Within this population, the major sources of variation related to cell cycle progression and the stem cell niche from which radial glia were captured. We found that outer subventricular zone radial glia (oRG cells) preferentially express genes related to extracellular matrix formation, migration, and stemness, including TNC, PTPRZ1, FAM107A, HOPX, and LIFR, and related this transcriptional state to the position, morphology, and cell behaviors previously used to classify the cell type. Our results suggest that oRG cells maintain the subventricular niche through local production of growth factors, potentiation of growth factor signals by extracellular matrix proteins, and activation of self-renewal pathways, thereby contributing to the developmental and evolutionary expansion of the human neocortex. For study version 2, we have updated this data set to include additional primary cells that we infer to represent microglia, endothelial cells, and immature astrocytes, as well as additional cells from the developing neural retina, and from iPS-cell derived cerebral organoids. The genes distinguishing these cell populations may reveal biological processes supporting the diverse functions of these cell types as well as vulnerabilities of specific cell types in human genetic diseases and in viral infections. For study version 3, we have updated the data set to include additional primary cells, including those published in Nowakowski et al., 2017 (PMID: 29217575). 
For study version 4, we have additionally performed parallel analyses of transcriptomes and physiological responses of 476 single cells isolated from developing human cortex. As a result, we were able to identify physiological response profiles of specific progenitor and neuronal cell types during human cortical development. For study version 5, we have additionally performed bulk RNA sequencing of organotypic primary cultures from the developing human cortex with or without exposure to SARS-CoV-2 to understand the infectability and transcriptomic effects of SARS-CoV-2 on the developing human cortex. For study version 6, we investigated the impact of LIFR signaling on neural proliferation and differentiation of human oRG cells. Specifically, we isolated oRG cells from the primary developing neocortex and cultured them for four weeks with and without LIF treatment. We then performed single-cell RNA sequencing using the Chromium Single Cell 3’ Reagent Kits (v3.1, Dual index) from 10x Genomics to understand the molecular and cellular changes of oRG differentiation upon LIF treatment.
The Gabriella Miller Kids First Pediatric Research Program (Kids First) is a trans-NIH effort initiated in response to the 2014 Gabriella Miller Kids First Research Act and supported by the NIH Common Fund. This program focuses on gene discovery in pediatric cancers and structural birth defects and the development of the Gabriella Miller Kids First Pediatric Data Resource (Kids First Data Resource). All of the WGS and phenotypic data from this study are accessible through dbGaP and kidsfirstdrc.org, where other Kids First datasets can also be accessed. This project aims to sequence an unparalleled number of cases of de novo Acute Myeloid Leukemia (AML) and Down Syndrome AML (DS-AML), to establish a database comprised of genomic and transcriptome information which can be interrogated for both somatic and germline variants. Identification of the somatic variants will provide valuable data on the potential genes and pathways that can be targeted for therapeutic purposes. In addition, interrogation of the host’s constitutional genome may yield valuable information about potential germline variants that, in combination with the somatic data, might provide a more informed approach to patient care. For those patients with predisposition mutations, chemotherapy alone might not be adequate for cure, and stem cell transplantation might be required. Also, those who might be at high risk of adverse secondary events (cardiac complications, secondary malignancies, etc.) can be identified early and their therapy can be tailored to minimize anticipated complications. Thus, we propose that the optimum outcome can only be obtained through comprehensive interrogation of the somatic and germline genomes to fully annotate the genomic makeup of the leukemia and its host. 
Knowing the genomic and transcriptomic makeup of these patients, along with a full complement of clinical characteristics for this cohort, will be critical for making strong correlations which may aid in therapeutic development for future patients. The de novo AML, DS-AML, and Acute Promyelocytic Leukemia (APL) cases were all collected through clinical protocols conducted by the Children’s Oncology Group (COG). In addition to funding from the Gabriella Miller Kids First Pediatric Research Program, the DS-AML cohort was specifically funded by the INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE (INCLUDE) Project.
The Gabriella Miller Kids First Pediatric Research Program (Kids First) is a trans-NIH effort initiated in response to the 2014 Gabriella Miller Kids First Research Act and supported by the NIH Common Fund. This program focuses on gene discovery in pediatric cancers and structural birth defects and the development of the Gabriella Miller Kids First Pediatric Data Resource (Kids First Data Resource). All of the genomic and phenotypic data from this study are accessible through dbGaP. The data are also available at the Kids First Portal, where other Kids First datasets can also be accessed in the cloud for data analysis, data visualization, collaboration and interoperability, open to all researchers and developers. Pediatric malignant germ cell tumors (GCTs) represent approximately 6% of childhood cancers, including 3% of tumors in children aged 0-14 years and 15% of tumors in adolescents. GCTs are heterogeneous and grouped together due to the presumed common cell of origin, the primordial germ cell (PGC). GCTs typically occur in the testes or ovaries; however, extragonadal GCTs can occur and are likely a result of abnormal germ cell migration during development. Evidence suggests that GCTs, including those diagnosed in adults, are initiated in utero. Thus, alterations in normal embryonic development are likely to be especially relevant to GCT etiology. Germline susceptibility has not been evaluated in an agnostic fashion in pediatric GCT, mainly due to a lack of an adequate number of samples; however, emerging evidence supports a genetic etiology. In the two Kids First GCT projects, we will evaluate genetic susceptibility to intracranial and extracranial GCT by sequencing probands and their unaffected parents. 
The goals of the project are to: 1) evaluate the contribution of rare genetic variants in GCT through the use of aggregate burden tests, focusing on genes and established regulatory regions; 2) identify de novo SNVs and CNVs in pediatric GCT using a case-parent triad design; and 3) identify molecular signatures in GCT tumor specimens, overall and by age group and tumor characteristics. Whole Genome Sequencing (WGS) data generated through the Gabriella Miller Kids First Pediatric Research Program will provide an opportunity to investigate the genetic origins of GCT in a diverse set of samples. Given the limited knowledge of GCT etiology and biology, the results of the proposed analyses are likely to have a big impact on the field.
The incidence of acute myeloid leukemia (AML) increases with age and mortality exceeds 90% when diagnosed after age 60. Only 10-15% of cases evolve from a pre-existing myeloproliferative or myelodysplastic disorder; the remaining cases arise de novo without a detectable prodrome and are diagnosed upon development of bone marrow failure. Analysis of diagnostic blood samples has demonstrated that de novo AML is preceded by the accumulation of somatic mutations in pre-leukemic hematopoietic stem and progenitor cells (preL-HSPCs) that subsequently undergo clonal expansion. If individuals in this pre-leukemic phase could be identified, methods for determination of risk and monitoring for progression to overt AML could be developed. However, recurrent AML mutations also accumulate during aging in healthy individuals who never develop AML, a phenomenon referred to as age-related clonal hematopoiesis (ARCH). To distinguish individuals with preL-HSPCs at high risk of developing AML from those with ARCH, we undertook deep targeted sequencing of genes recurrently mutated in AML in blood samples from 133 individuals in the European Prospective Investigation into Cancer and Nutrition (EPIC) study taken on average 6 years before they developed AML (pre-AML group), together with 683 matched healthy individuals (Control group). Pre-AML cases displayed accelerated age-correlated accumulation of somatic mutations. The identity, number and variant allele frequency (VAF) of mutations differed between the two groups, and were incorporated into a computational model of AML risk prediction that accurately distinguished pre-AML cases from controls on average 7 years prior to AML development. Our findings provide proof of concept that early prediction of AML development is feasible in high-risk populations, paving the way for early disease detection, monitoring, and potentially prevention.