Mutations that drive oncogenesis in cancer can generate neoantigens that may be recognized by the immune system. Identification of these neoantigens remains challenging due to the complexity of the major histocompatibility complex (MHC) antigen and T-cell receptor interaction. Here we describe the development of a systematic approach to efficiently identify and validate immunogenic neoantigens. Whole exome sequencing of tissue from a melanoma patient was used to identify nonsynonymous mutations, followed by MHC binding prediction and identification of tumor clonal architecture. The top 18 putative class I neoantigens were selected for immunogenicity testing via a novel in vitro pipeline in HLA-A201 healthy donor blood. Naive CD8 T cells from donors were stimulated with allogeneic dendritic cells pulsed with peptide pools and then with individual peptides. The presence of antigen-specific T cells was determined via functional assays. We identified one putative neoantigen that expanded T cells specific to the mutant form of the peptide and validated this pipeline in a subset of patients with bladder tumors treated with durvalumab (N = 5). Within this cohort, the top predicted neoantigens from all patients were immunogenic in vitro. Finally, we looked at overall survival in the whole durvalumab-treated bladder cohort (N = 37) by stratifying patients by tertile measure of tumor mutation burden (TMB) or neoantigen load. Patients with higher neoantigen and TMB load tended to show better overall survival.
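The tertile stratification mentioned above can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not say how tertile boundaries or ties were handled, so the rank-based split, the function name `assign_tertiles`, and the example values are all assumptions.

```python
def assign_tertiles(values):
    """Label each value 1 (lowest third) to 3 (highest third) by rank.

    Assumption: a simple rank-based split; tied values are ordered by
    their position in the input, which a real analysis may not want.
    """
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 3 // n + 1
    return labels


# Made-up tumor mutation burden values for six hypothetical patients.
example_tmb = [2.1, 15.0, 7.4, 0.9, 22.3, 5.5]
example_tertiles = assign_tertiles(example_tmb)
```

The resulting labels could then serve as the grouping variable in a survival comparison; the same function applies unchanged to neoantigen load.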
The Resource for Genetic Epidemiology Research on Aging (GERA) Cohort was created by an RC2 "Grand Opportunity" grant that was awarded to the Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH) and the UCSF Institute for Human Genetics (AG036607; Schaefer/Risch, PIs). The RC2 project enabled genome-wide SNP genotyping (GWAS) to be conducted on a cohort of over 100,000 adults who are members of the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC), and participating in its RPGEH. The purpose of the RPGEH is to facilitate research on the genetic and environmental factors that affect health and disease by linking together clinical data from electronic health records, survey data on demographic and behavioral factors, and environmental data from various sources, with genetic data derived from biospecimens collected from participants. At the time of the award of the RC2 project in late 2009, the RPGEH had established a cohort of about 140,000 individuals who had answered a detailed survey, provided saliva samples for extraction of DNA, and given broad consent for the use of their data in studies of health and disease. To maximize the diversity of the resulting sample, the GERA cohort was formed by including all racial and ethnic minority participants with saliva samples (N = 20,925; 19%); the remaining participants were drawn sequentially and randomly from white non-Hispanic participants (N = 89,341; 81%). A total of 110,266 participant samples were included to ensure that at least 100,000 were successfully assayed. The resulting GERA cohort is 42% male and 58% female, and ranges in age from 18 to over 100 years, with an average age of 63 years at the time of the RPGEH survey (2007). The sample is ethnically diverse and generally well educated, with above-average income. Approximately 69% of the participants are married or living with a partner. Length of membership in KPNC averages 23.5 years. 
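The sampling design described above (all minority participants included, the remainder drawn at random from non-Hispanic white participants) can be sketched as follows. This is an illustration under stated assumptions, not the RPGEH procedure: the record layout (`id`, `is_minority`), the function name, and the use of a single seeded random draw are hypothetical, and the "sequential" aspect of the original draw is not modeled.

```python
import random


def select_stratified_sample(participants, target_size, seed=0):
    """Include every minority participant, then fill up to target_size
    with a random draw from the remaining participants.

    participants: list of dicts with hypothetical keys 'id' and 'is_minority'.
    """
    minority = [p for p in participants if p["is_minority"]]
    others = [p for p in participants if not p["is_minority"]]
    # Number of non-minority participants still needed to reach the target.
    n_needed = max(0, target_size - len(minority))
    rng = random.Random(seed)
    drawn = rng.sample(others, min(n_needed, len(others)))
    return minority + drawn


# Toy pool: 3 minority and 7 non-minority participants (made-up data).
example_pool = [{"id": i, "is_minority": i < 3} for i in range(10)]
example_sample = select_stratified_sample(example_pool, target_size=6)
```

With the figures in the text, the same logic corresponds to 20,925 minority participants plus 89,341 randomly drawn non-Hispanic white participants, for 110,266 in total.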
UCSF and RPGEH investigators worked with the genomics company Affymetrix to design four custom microarrays for genotyping each of the four major race-ethnicity groups included in the GERA Cohort, described in detail in Hoffmann et al., 2011a and 2011b. Following genotyping and quality control procedures, and after removal of invalid, discordant, or withdrawn samples, about 103,000 participants were successfully genotyped. The resulting genotypic data were linked to survey data and data abstracted from the electronic medical records. As described below, all RPGEH participants were mailed new consent forms with explicit discussion of the placement of data in the NIH-maintained dbGaP. About 77% of participants returned completed consent forms, resulting in a final sample size of 78,486 participants in the GERA Cohort with data for deposit into dbGaP.
Origins of the RPGEH GERA Cohort
The goal in creating the RPGEH GERA cohort was to create a large, multiethnic, and comprehensive population-based resource for research into the genetic and environmental basis of common age-related diseases and their treatment, and factors influencing healthy aging and longevity. The GERA Cohort consists of a diverse cohort of more than 100,000 adults who are members of the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC), and participating in its Research Program on Genes, Environment and Health (RPGEH). KPNC is an integrated health care delivery system with a population of about 3.3 million people in northern California. The membership of KPNC is representative of the general population in the 14 county area in which facilities are located, although the membership is underrepresented for the extremes of income at both ends of the spectrum. 
The RPGEH utilizes the longitudinal electronic health records (EHR) of KPNC to obtain clinical, laboratory, imaging and pharmacy information on all cohort members, to which personal demographic, behavioral and health characteristics have been added through member surveys. The GERA Cohort comprises a subsample of the RPGEH participant cohort, and was created through the RC2 award from the NIA, NIMH, and NIH Common Fund as described above.
GERA Study Design
The GERA Cohort is a subsample, as described above, of the longitudinal cohort enrolled in the Kaiser Permanente RPGEH. The RPGEH cohort includes about 400,000 survey participants of whom about 200,000 have provided broad consent and a sample of saliva or blood for use in studies of genetic and environmental factors in health and disease. The GERA Cohort was developed from a mailed survey sent to all adult members of KPNC who had been members for two years or more in 2007. All survey respondents were contacted and asked to complete a consent form; those who completed consent forms were asked to provide a saliva sample. Additional male participants were added to the RPGEH through inclusion of the Northern California sample of the California Men's Health Study (CMHS) cohort of about 40,000 men from KPNC, ages 45-69 years old at the time of the CMHS survey in 2002-2003. The CMHS participants contributed about 15,400 saliva samples to the RPGEH and were eligible for inclusion in the GERA Cohort. CMHS participants were included according to the same sampling design as for the RPGEH cohort as a whole. Specifically, all minority participants were selected for inclusion in order to maximize representation of minorities in the GERA Cohort, and Non-Hispanic White participants were selected at random to complete the sample of 110,266 GERA Cohort participants.
GERA Genotypic Data
High-density genotyping was conducted at UCSF using custom designed Affymetrix Axiom arrays, as described in Hoffmann et al. (2011a; 2011b). 
To maximize genome-wide coverage of common and less common variants, four specific arrays were designed for individuals of Non-Hispanic White (EUR), East Asian (EAS), African-American (AFR), and Latino (LAT) race/ethnicity. There was broad overlap among the SNPs on the arrays, which were designed using a hybrid greedy imputation algorithm (Hoffmann et al., 2011b) applied to genotype information validated by Affymetrix from the 1000 Genomes Project. However, in order to capture low frequency variants specific to particular race-ethnicity groups, SNP content varies between arrays. A more detailed description of the process of genotyping and results is included in Genotyping of DNA Samples. Description of the analyses of population structure and development of principal components for adjustment of population structure is included in Population Structure Analysis.
GERA Phenotypic Data
RPGEH and CMHS Survey Data.
The sources of data on demographic and behavioral factors deposited in dbGaP for the GERA Cohort are the RPGEH and CMHS surveys. Data on common demographic factors such as gender, race/ethnicity, marital status, and education and on behavioral factors such as smoking, alcohol consumption, and body mass index, have been cleaned, edited, reconciled between the two surveys, and compiled into summary indices, where appropriate, for deposition into dbGaP. A more complete description of the survey variables is included in Survey Variables Documentation. Please note that the terms of use of the GERA Cohort Data, as specified in the Data Use Certification (DUC), prohibit the use of survey variables as outcomes in analyses. For example, a genome-wide association study (GWAS) of education or smoking is not permitted as specified by the DUC. Only health conditions can be used as outcome variables in analyses.
Health Conditions derived from Kaiser Permanente Electronic Medical Records. 
Data on the occurrence of health conditions in participants in the GERA Cohort have been derived from summarizing ICD-9 coded diagnoses in Kaiser Permanente's electronic medical records. An algorithm that aggregates specific ICD-9 codes into appropriate diagnostic groups for selected conditions is applied to outpatient and inpatient databases; see Disease and Conditions Definitions Documentation for details. The criterion for including a condition as "present" for a participant is the occurrence of two or more diagnoses within a diagnostic category on separate days. A threshold of two or more is used to reduce false positives due to mistakes or rule-out diagnoses. When compared with validated disease registries, the criterion of 2+ diagnoses yields high specificity and good sensitivity. ICD-9 codes in the electronic records are specified in several ways. For outpatient visits occurring during the period 1995 to 2006, diagnoses were assigned by the treating physician, who endorsed specific diagnoses on an optically scanned list that varied by specialty. Beginning in 2006 with the advent of an integrated, fully electronic medical record, outpatient diagnoses are made by physicians/providers using a pull-down menu. Discharge diagnoses from inpatient stays are specified by physicians and coded by specially trained coders. Databases of ICD-9 codes for diagnoses assigned at outpatient visits, or as one of the discharge diagnoses following inpatient stays, are complete and available for all KPNC members dating back to 1995. Although the average length of KPNC membership among GERA cohort members was 23.5 years in 2007, not all have been members since 1995, so the history for some conditions, such as those that are not chronic or recurrent, may not be complete for all cohort members. 
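The "two or more diagnoses on separate days" criterion described above can be sketched as follows. This is a minimal illustration, not the RPGEH algorithm: the function name, the input layout, and the code-to-group mapping are assumptions, and the actual groupings are those given in the Disease and Conditions Definitions Documentation.

```python
from collections import defaultdict
from datetime import date


def conditions_present(diagnoses, code_to_group, min_days=2):
    """Return the diagnostic groups with diagnoses on >= min_days distinct days.

    diagnoses: iterable of (icd9_code, diagnosis_date) for one participant.
    code_to_group: hypothetical mapping from ICD-9 code to diagnostic group.
    """
    days_by_group = defaultdict(set)
    for code, day in diagnoses:
        group = code_to_group.get(code)
        if group is not None:
            # A set keeps one entry per calendar day, so repeated
            # diagnoses on the same day count only once.
            days_by_group[group].add(day)
    return {g for g, days in days_by_group.items() if len(days) >= min_days}


# Illustrative mapping and records (made up for this sketch).
example_groups = {"250.00": "diabetes", "250.02": "diabetes", "401.9": "hypertension"}
example_records = [
    ("250.00", date(2001, 3, 1)),
    ("250.02", date(2001, 3, 1)),  # same day as above: still one day
    ("401.9", date(2002, 5, 5)),
    ("401.9", date(2003, 6, 6)),
]
example_present = conditions_present(example_records, example_groups)
```

In this toy input, the two diabetes codes fall on the same day and so do not meet the criterion, while hypertension is recorded on two separate days and does.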
The year of first membership in KPNC is included as a variable in the list of survey variables, enabling investigators to estimate the number of years of observation of each Cohort member.
RPGEH Access and Collaborations Website and Procedures
The RPGEH maintains a web portal for inquiries and applications for collaboration and access to data. The URL is: https://rpgehportal.kaiser.org/. RPGEH has an application process and an Access Review Committee that reviews applications for collaboration and use. For more details, please contact RPGEH through the website.
The Gabriella Miller Kids First Pediatric Research Program (Kids First) is a trans-NIH effort initiated in response to the 2014 Gabriella Miller Kids First Research Act and supported by the NIH Common Fund. This program focuses on gene discovery in pediatric cancers and structural birth defects and the development of the Gabriella Miller Kids First Pediatric Data Resource (Kids First Data Resource). All of the WGS and phenotypic data from this study are accessible through dbGaP and kidsfirstdrc.org, where other Kids First datasets can also be accessed. Children with disseminated neuroblastoma have a very high risk of treatment failure and death despite receiving intensified chemotherapy, radiation therapy and immunotherapy. The long-term goal of our research program is to improve neuroblastoma cure rates by first comprehensively defining the genetic basis of the disease. The central hypothesis to be tested here is that neuroblastoma arises largely due to the epistatic interaction of common and rare heritable DNA variation. Here we will perform comprehensive whole genome sequencing of 563 quartets of neuroblastoma patient germline and diagnostic tumor DNAs and germline DNAs from both parents. The case series was recently collected through a Children's Oncology Group epidemiology clinical trial and is robustly annotated with complete demographic (age, sex, race, ethnicity), clinical (e.g. age at diagnosis, stage, risk group), epidemiologic (parental dietary and exposure questionnaire) and biological (e.g. tumor MYCN status and multiple other tumor genomic measures) covariates. Subjects were consented for genetic research and DNA is immediately available for shipment for sequencing. 
We propose Illumina-based whole genome sequencing in the 593 "trio" germline samples (Aim 1; due to missing parent: 487 full neuroblastoma triads, 106 child-single parent dyads = 1673 whole genome sequences) and matched diagnostic tumor DNA (Aim 2; N=366) at 30x sequencing depth (N=2039 whole genome sequences). Also in Aim 2 we will perform whole exome (100x) and RNA sequencing on the 366 tumor DNA and 228 tumor RNA samples from this cohort. Finally, we propose a pilot study of structural variation using long-range sequencing in 10 non-overlapping tumor samples chosen based on potentially relevant chromosomal alterations discovered with conventional NGS. Thus, a total of 2277 individual samples and 2655 sequences will be generated. We will use our established analytic pipeline that is currently being used to study the germline genomes of all cases sequenced through the NCI supported Therapeutically Applicable Research to Generate Effective Treatments program. We plan a three-stage analytic approach, first focusing on classic de novo and inherited Mendelian damaging alterations. We will next integrate our extensive epigenomic data from human neuroblastoma cell lines and genome-wide association study data (N=5,703 neuroblastoma cases to date) to guide a comprehensive assessment of noncoding variants that influence tumor initiation with a recently established analytic pipeline. Finally, we will utilize the tumor DNA analyses to inform relevance via somatic gain or loss of function effects at the sequence and/or copy number levels. All data generated in this project will be immediately placed into the Genomic Data Commons (GDC) and we will compute within this environment by importing our analytic pipelines into the GDC. These data will be fully integrated into the Kids First Data Resource and freely shared with all academically qualified petitioners. 
This comprehensive data set derived from a large and richly phenotyped series of neuroblastoma DNA quartets will be integrated with existing germline and/or tumor genomic data from over 6,000 neuroblastoma subjects (but none with matched patient-parent germline sequencing data) to provide an unparalleled opportunity to comprehensively discover the genetic basis of neuroblastoma.
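The sample and sequence counts quoted in the proposal above can be checked with a short tally. This is only an arithmetic sketch: the variable names are ours, and the breakdown of the 2277 individual samples is one reading that reproduces the stated figure, not a breakdown given in the text.

```python
# Family structure stated in Aim 1.
full_triads = 487   # child plus both parents
dyads = 106         # child plus a single parent

families = full_triads + dyads              # the "trio" germline sample sets
germline_wgs = full_triads * 3 + dyads * 2  # one genome per family member

# Tumor material stated in Aim 2 and the long-range pilot.
tumor_dna = 366
tumor_rna = 228
long_range_tumors = 10

# All whole genome sequences at 30x (germline plus tumor).
total_wgs = germline_wgs + tumor_dna

# Assumed reading: individual samples = germline DNAs + tumor DNAs
# + tumor RNAs + long-range pilot tumors.
total_samples = germline_wgs + tumor_dna + tumor_rna + long_range_tumors

print(families, germline_wgs, total_wgs, total_samples)
```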
We have sequenced the whole transcriptomes of 18 ovarian clear-cell carcinomas and 1 ovarian clear-cell carcinoma cell line and found somatic mutations in ARID1A (the AT-rich interactive domain 1A [SWI-like] gene) in 6 of the samples. ARID1A encodes BAF250a, a key component of the SWI–SNF chromatin remodeling complex. We sequenced ARID1A in an additional 210 ovarian carcinomas and a second ovarian clear-cell carcinoma cell line and measured BAF250a expression by means of immunohistochemical analysis in an additional 455 ovarian carcinomas.
Small intestine neuroendocrine tumor (SI-NET), the most common cancer of the small bowel, often displays a curious multifocal phenotype with several intestinal tumors centered around a regional lymph node metastasis, yet the typical path of evolution of these lesions remains unclear. Here, we determined the complete genome sequences of 24 tumor and 3 adjacent normal tissue samples with their paired normal blood samples (33 whole genomes in total) from 6 patients with multifocal SI-NETs, allowing elucidation of phylogenetic relationships between multiple intestinal tumors and metastases in individual patients.
Childhood cancer remains one of the leading causes of death in pediatric patients in Europe. Pediatric sarcomas, comprising soft tissue sarcomas and malignant bone tumors, are a heterogeneous group of malignancies, with more than 50 subtypes (WHO classification). Due to low case numbers, studying pediatric sarcomas requires accurate and reliable preclinical models. Here, we established 18 soft tissue sarcoma PDX models, including Ewing Sarcoma, Rhabdomyosarcoma and Osteosarcoma. We characterized these models by Whole Exome Sequencing and assessed the response to a wide range of drugs.
The purpose of our study was to assess the influence of oral microbiota on the development of esophageal cancer. Our preliminary case-control studies reported a global alteration of the foregut microbiome in esophageal adenocarcinoma, with the strongest changes found in the oral microbiome. We hypothesise that commensal oral bacteria are capable of activating or degrading carcinogens in cigarette smoke and therefore may contribute to esophageal carcinogenesis. We conducted a prospective study nested in two large US cohorts to determine whether oral microbiota are associated with subsequent esophageal adenocarcinoma.
We generated a collection of patient-derived pancreatic normal and cancer organoids. We performed whole genome sequencing, targeted exome sequencing, and RNA sequencing on organoids as well as matched tumor and normal tissue if available. This dataset is a valuable resource for pancreas cancer researchers, and those looking to compare primary tissue to organoid culture. In our linked publication, we show that pancreatic cancer organoids recapitulate the mutational spectrum of pancreatic cancer. Furthermore, RNA sequencing of organoids demonstrates the presence of both transcriptional subtypes of pancreas cancer.
In this study, we hypothesize that shallow long insert whole genome sequencing (LI-WGS) increases our power for detecting breakpoints compared to shallow short insert WGS. We performed a priori analyses to demonstrate the benefits of LI-WGS, developed a long insert library preparation protocol based on Illumina's protocol, and compared LI-WGS against short insert WGS on test samples. We then used LI-WGS to identify translocations and copy number changes in tumor and germline samples collected from cancer patients with different malignancies.
The NIDDM-Atherosclerosis Study, funded by NHLBI, was designed as a family study to examine the genetic basis of subclinical atherosclerosis and diabetes in Hispanic families. Family members of probands with T2D were recruited in the Los Angeles area. The baseline examination of the cohort included the euglycemic hyperinsulinemic clamp test from which the two key phenotypes were obtained: insulin sensitivity (M) and metabolic clearance rate of insulin (MCRI). Genome-wide genotyping was obtained under separate funding by NIDDK as a part of the GUARDIAN (Genetics Underlying Diabetes in Hispanics) Consortium.