Important links to apply for individual-level data:
- Genetic Analysis Workshop
- Instructions to Request Authorized Access
- Data Use Certification Requirements (DUC)
- Apply here for controlled access to individual-level data
- Research Use Statement

Questions regarding GAW16 should be directed to Vanessa Olmo at vanessa@business-endeavors.com.

Problem 2: Description of the Framingham Heart Study

In GAW16, we use data drawn from the Framingham Heart Study. The Framingham Heart Study, under the direction of the National Heart, Lung, and Blood Institute (NHLBI), began in 1948 with the recruitment of adults from the town of Framingham, Massachusetts. At the time, little was known about the general causes of heart disease and stroke, but death rates for cardiovascular disease (CVD) had been increasing steadily since the beginning of the 20th century and had become an American epidemic. The Framingham Heart Study is now conducted in collaboration with Boston University.

The objective of the Framingham Heart Study was to identify the common factors or characteristics that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke. Between 1948 and 1953 the researchers recruited 5,209 subjects (2,336 men and 2,873 women) between the ages of 29 and 62 from the town of Framingham, Massachusetts, and began the first round of extensive physical examinations and lifestyle interviews that they would later analyze for common patterns related to CVD development. Subjects were recruited from lists of addresses recorded for the town, and two out of every three households were approached for participation in the study. While there was no intention to recruit families for family studies, the plan was to recruit all household members aged 30-60 within each house selected for study. Hence, many biologically related individuals were recruited, including 1,644 spouse pairs. Since 1948, these participants have returned to the study every two years for a detailed medical history, physical examination, and laboratory tests. Now in 2008, at 60 years of follow-up, about 500 participants from this cohort remain.

Between 1971 and 1975 the study enrolled a second-generation group of 5,124 subjects (the original participants' children and the spouses of these children) to participate in similar examinations. Of these, 2,616 subjects are offspring of the original spouse pairs and 34 are stepchildren; 898 offspring are children of cohort members for whom only one parent was a study participant, and 1,576 are spouses of the offspring. The Offspring Cohort has been followed every four years through 2001 (except for an intervening eight years between Exams 1 and 2), using protocols similar to those used for the study of the Original Cohort.

Between 2002 and 2005 the study enrolled the third generation (Gen3) of the Framingham Heart Study: 4,095 offspring of the second generation. None of their spouses were recruited. An additional 103 parents of this third generation, who were not recruited between 1971 and 1975, were also recruited at this time; this latter group is not included in the GAW16 data. With the recruitment of this third generation, the study has increasingly focused on genetic factors associated with the development of cardiovascular disease and its associated risk factors. To date, there is only one examination of this generation of participants.
A description of the recruitment of this third generation and a comparison with the earlier generations at their initial recruitment is presented in Splansky GL et al., 2007. Further information on the Study can be found at http://www.nhlbi.nih.gov/about/framingham/index.html.

Genome-wide Dense SNP Scan in the Framingham Heart Study

Genetic studies did not begin in the FHS until the 1990s. In the late 1980s and through the 1990s, DNA was extracted from blood samples of surviving FHS participants. In 2007, the FHS entered a new phase with genotyping for the FHS SHARe (SNP Health Association Resource) project, for which dense SNP genotyping was performed using approximately 550,000 SNPs (GeneChip® Human Mapping 500K Array Set and the 50K Human Gene Focused Panel) in 10,775 samples (some duplicates) from the three generations of subjects (including over 900 pedigrees). Affymetrix conducted all genotyping for the FHS SHARe project, using the 250K Sty, 250K Nsp, and the supplemental 50K platforms. Eighty-nine percent of the DNA samples were collected during the 1990s.

To maximize the power of the study, we also extracted DNA from 1,133 blood samples, drawn from subjects who had no DNA, to include in the SHARe project. These samples had been sitting in our refrigerators for some time, a few dating as far back as the 1970s; we refer to them as the legacy samples. The legacy samples had a higher failure rate in the genotyping process (40%) than the other samples (3%). Affymetrix applied its own criteria for a sample to succeed in genotyping: non-legacy samples had to succeed on all three platforms, while legacy samples needed to pass on at least one platform. When a sample failed, additional attempts were made; samples that failed repeatedly (2-4 attempts) were declared failures. Other samples failed because the genotyped sex did not match our records, because of low concordance among SNPs common across arrays, or because of contamination. Eighty-nine percent of the legacy samples for which genotyping results are available passed all three platforms. The genotyping data from the 10,043 samples (from 9,354 subjects) that passed the Affymetrix criteria were additionally checked for gender consistency and consistency with family structure, resulting in genotyping data for 9,274 participants in FHS SHARe. Genotype calls were made with the BRLMM algorithm. The SHARe database is housed at the National Center for Biotechnology Information database of genotypes and phenotypes (NCBI dbGaP) and contains all ~550,000 SNPs.

This genome-wide dense SNP scan and a subset of phenotypes from the Framingham Heart Study are the focus of Genetic Analysis Workshop 16. Further information on the specific variables in the Problem 2 dataset can be found by clicking on the Documents tab at the top of the page.

Problem 3: Description of the Simulated Data Set

The focus of this simulation is gene discovery in genome-wide association scans (GWAS). The Framingham Heart Study data set (distributed as "Problem 2") is the basis for the FHS* simulated data. The pedigree structures are derived from the data distributed for Problem 2, and we distribute an accompanying triplet file (triplet_sim) containing person ID, father ID, and mother ID, to ensure that identical subjects, pedigrees, and singletons are used in the simulated data analysis. Consistent with standard practice, founders and singletons are designated as subjects with both fshare and mshare equal to zero (missing).
The simulated data include a total of 6,479 subjects with both phenotype and genotype data, in 942 pedigrees distributed among 3 generations, and 188 singletons. Data inclusion is consistent with the subjects' consent for use by both for-profit and not-for-profit researchers. The genotypes for all Problem 3 replicates are fixed as measured and distributed for Problem 2, for both the genome-wide scan and the additional candidate gene SNPs, for a total of approximately 550,000 SNPs (GeneChip® Human Mapping 500K Array Set and the 50K Human Gene Focused Panel). Thus, to analyze the Problem 3 simulated data you will also need to download the Problem 2 genotypes. Note that there are slight discrepancies in counts between Problems 2 and 3 due to a change in consents between the two datasets.

Several phenotypes that contribute to coronary heart disease (CHD) were simulated for all individuals with genotypes, across three different time points 10 years apart. All genotyped individuals have complete data; the effects of missing values can be investigated by user-specified missing value patterns. There are 200 longitudinal datasets created from the generating model, and each replication is provided in a separate dataset. We suggest that if only one replication is to be analyzed, it be replication 1, to enable more precise comparisons among analytical approaches. The 'shareid' variable allows you to merge the simulated phenotype data with the Problem 2 genotype data and to reconstruct the pedigrees using the distributed 'triplet_sim' file or, for larger families' relationships, the triplet file distributed with Problem 2. The simulated data problem is further described in the associated readme file, and a data dictionary is provided defining all the variables. For disclosure of the generating model for these data, please contact Jean MacCluer at jean@sfbrgenetics.org.
This postmortem study, conducted by the DIRP NIMH Human Brain Collection Core (HBCC), examines molecular, genetic, and epigenetic signatures in the brains of hundreds of subjects with or without mental disorders. The brain tissues are obtained under protocols approved by the CNS IRB (NCT00001260), with the permission of the next-of-kin (NOK), through the Offices of the Chief Medical Examiners (MEOs) in the District of Columbia, Northern Virginia, and Central Virginia. Additional samples were obtained from the University of Maryland Brain and Tissue Bank (contracts NO1-HD-4-3368 and NO1-HD-4-3383; http://www.medschool.umaryland.edu/btbank/) and the Stanley Medical Research Institute (http://www.stanleyresearch.org/brain-research/). Clinical characterization, neuropathological screening, toxicological analyses, and dissections of various brain regions were performed as previously described (Lipska et al. 2006; PMID: 16997002). All patients met DSM-IV criteria for a lifetime Axis I diagnosis of psychiatric disorders, including schizophrenia or schizoaffective disorder, bipolar disorder, and major depression. Controls had no history of psychiatric diagnoses or addictions.

SNP array: Array-based genotyping was performed on most samples published in this collection. The number of SNPs assayed via Illumina chips varied between 650,000 and 5 million. Cerebellar tissue was generally used for genotyping studies.

#   Diagnosis                                SNP Array
1   Anxiety Disorder                         1
2   Autism Spectrum Disorder                 13
3   Bipolar Disorder                         114
4   Control                                  387
5   Eating Disorder (ED)                     2
6   Major Depressive Disorder (MDD)          186
7   Obsessive Compulsive Disorder (OCD)      5
8   Post-Traumatic Stress Disorder (PTSD)    0
9   Schizophrenia                            220
10  Other                                    7
11  Tic Disorder                             3
12  Undetermined                             1
13  Williams Syndrome                        2
Table: Numbers of samples in each diagnostic category.

DNA extraction: 45-80 mg of cerebellar tissue was pulverized for DNA extraction. The QIAamp DNA Mini Kit (Qiagen) was used for tissue DNA extraction. The tissue was initially lysed using a TissueLyser (Qiagen), and extractions were performed according to the manufacturer's protocol. The DNA was captured in 500 uL of elution buffer. Concentrations were measured using a Thermo Scientific NanoDrop 1000/NanoDrop ONE. The mean yield was 128.85 ug (+/- 79.48), the mean 260/280 ratio was 1.87 (+/- 0.105), and the mean 260/230 ratio was 2.48 (+/- 1.75).

Genotyping methods: Three types of Illumina BeadArray chips were used: HumanHap650Y, Human1M-Duo, and HumanOmni5M-Quad (San Diego, California). Genotyping was done according to the manufacturer's protocol (Illumina Proprietary, Catalog # WG-901-5003, Part # 15025910 Rev. A, June 2011). Approximately 400 ng of DNA was used, and each DNA sample was QC tested for its 260/280 ratio by NanoDrop and for DNA band intactness on a 2% agarose gel. Briefly, the samples were whole-genome amplified, fragmented, precipitated, and resuspended in the appropriate hybridization buffer. Denatured samples were hybridized on prepared BeadArray chips. After hybridization, the BeadChip oligonucleotides were extended by a single fluorescently labeled base, which was detected by fluorescence imaging with an Illumina BeadArray Reader (iScan). Normalized bead intensity data obtained for each sample were loaded into Illumina GenomeStudio (v.2.0.3) with cluster position files provided by Illumina, and fluorescence intensities were converted into SNP genotypes.
Microarray: We generated RNA expression data using array technology for psychiatric subjects compared to non-psychiatric subjects as controls. We used tissue from three different brain regions, i.e., hippocampus, dorsolateral prefrontal cortex (DLPFC), and dura mater, for a large cohort of individuals (552 subjects for hippocampus, 800 for DLPFC, and 146 for dura). Total RNA was extracted from ~100 mg of tissue using the RNeasy kit (Qiagen) according to the manufacturer's protocol. RNA quality and quantity were examined using the Bioanalyzer (Agilent, Inc.) and NanoDrop (Thermo Scientific, Inc.), respectively. Samples with an acceptable RNA integrity number (RIN) were included.

#   Diagnosis                                DLPFC   Hippo   Dura
1   Anxiety Disorder                         1       0       0
2   Autism Spectrum Disorder                 14      6       0
3   Bipolar Disorder                         90      49      0
4   Control                                  336     270     75
5   Eating Disorder (ED)                     2       1       0
6   Major Depressive Disorder (MDD)          144     87      0
7   Obsessive Compulsive Disorder (OCD)      5       3       0
8   Post-Traumatic Stress Disorder (PTSD)    6       0       0
9   Schizophrenia                            192     125     71
10  Other                                    5       6       0
11  Tic Disorder                             3       3       0
12  Undetermined                             1       1       0
13  Williams Syndrome                        2       1       0
Table: Numbers of samples in each diagnostic category.

RNA-Seq of dorsolateral prefrontal cortex: All brains were collected and the dorsolateral prefrontal cortex (DLPFC) samples dissected at the HBCC, DIRP, NIMH. DLPFC specimens were dissected from the right or left hemisphere of frozen coronal slabs. The study was funded by the DIRP, NIMH under contract #HHSN271201400099C with the Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 3500, New York, NY 10029-6574. RNA extraction, library preparation, and sequencing were performed under contract at the Icahn School of Medicine. The CommonMind Consortium (CMC) provided project management support.

RNA isolation: Total RNA from 468 HBCC samples was isolated from approximately 100 mg of homogenized tissue per sample by TRIzol/chloroform extraction and purification with the Qiagen RNeasy kit (Cat# 74106) according to the manufacturer's protocol. Samples were processed in randomized batches of 12. The order of extraction for schizophrenia, bipolar disorder, and MDD cases and control samples was assigned randomly with respect to diagnosis and all other sample characteristics. The mean total RNA yield was 24.2 ug (+/- 9.0). The RNA Integrity Number (RIN) was determined with the 4200 Agilent TapeStation System, and samples with an adequate RIN were retained. DLPFC RNA-Seq quantified expression data are provided for 364 samples. Data were generated, QC'd, processed, and quantified as follows.

RNA library preparation and sequencing: All samples submitted to the New York Genome Center for RNA-Seq were prepared for sequencing in randomized batches of 94. The sequencing libraries were prepared using the KAPA Stranded RNA-Seq Kit with RiboErase (KAPA Biosystems). rRNA was depleted from 1 ug of RNA using the KAPA RiboErase protocol that is integrated into the KAPA Stranded RNA-Seq Kit. The insert size and DNA concentration of the sequencing library were determined on a Fragment Analyzer Automated CE System (Advanced Analytical) and with Quant-iT PicoGreen (ThermoFisher), respectively.

Schizophrenia   Bipolar   Control
89              65        210
Table: Numbers of samples in each diagnostic category.

RNA-Seq of subgenual anterior cingulate cortex (sgACC): All 200 post-mortem brain samples (61 controls; 39 bipolar disorder; 46 schizophrenia; 54 major depressive disorder) were collected by the HBCC, DIRP, NIMH.

RNA extraction and quality assessment: Tissue from the sgACC was pulverized and stored at -80°C.
Total RNA was extracted from 50-80 mg of the tissue using the QIAGEN RNeasy Lipid Tissue Mini Kit (QIAGEN, Cat. # 74804) with DNase treatment (QIAGEN, Cat. # 79254). The RNA Integrity Number (RIN) for each sample was assessed with high-resolution capillary electrophoresis on the Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, California). The RNA concentration and 260/280 ratio (2.1 +/- 0.032 SD) were determined with a NanoDrop (Thermo Scientific).

RNA sequencing: Stranded RNA-Seq libraries were constructed after rRNA depletion using Ribo-Zero Gold (Illumina). RNA sequencing was performed at the National Institutes of Health Intramural Sequencing Center (NISC).

Schizophrenia   Bipolar   Control   MDD
46              39        61        54
Table: Numbers of samples in each diagnostic category.

Whole Genome Sequencing: All brains were collected and dissected at the HBCC, DIRP, NIMH. This study generates whole genome sequencing data from DNA from the dorsolateral prefrontal cortex (DLPFC), anterior cingulate cortex (ACC), or cerebellum of 443 individuals with schizophrenia, bipolar disorder, or major depressive disorder and non-psychiatric controls. The study was funded by the DIRP, NIMH under contract #HHSN271201400099C with the Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 3500, New York, NY 10029-6574. DNA extraction, library preparation, and sequencing were performed under contract at the Icahn School of Medicine. The CommonMind Consortium (CMC) provided project management support. All specimens were dissected from the right or left hemisphere of frozen coronal slabs.

DNA library preparation and sequencing: All samples submitted to the New York Genome Center for WGS were prepared for sequencing in randomized batches of 95. The sequencing libraries were prepared using the Illumina PCR-free DNA sample preparation kit. The insert size and DNA concentration of the sequencing library were determined on a Fragment Analyzer Automated CE System (Advanced Analytical) and with Quant-iT PicoGreen (ThermoFisher), respectively. A quantitative PCR assay (KAPA), with primers specific to the adapter sequence, was used to determine the yield and efficiency of the adaptor ligation process. Sequencing was performed on the Illumina HiSeq X at 30X coverage.

Schizophrenia   Bipolar   Control
115             78        230
Table: Numbers of samples in each diagnostic category.

ChIP-Seq: All brains were collected and the dorsolateral prefrontal cortex (DLPFC) samples dissected at the HBCC, DIRP, NIMH. This study generates epigenetic data using sequencing of DNA after chromatin immunoprecipitation (ChIP-Seq) for the marks H3K4me3 and H3K27ac in the DLPFC. DLPFC specimens were dissected from the right or left hemisphere of frozen coronal slabs. The study was funded by the DIRP, NIMH under contract #HHSN271201400099C with the Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 3500, New York, NY 10029-6574. Chromatin immunoprecipitation, library preparation, and sequencing were performed under contract at the Icahn School of Medicine. The CommonMind Consortium (CMC) provided project management support. Chromatin immunoprecipitation (ChIP) assays for the histone marks H3K4me3 and H3K27ac were carried out using native ChIP. Micrococcal nuclease (MNase) (Sigma, N3755) treatment was used to digest chromatin into mononucleosomes.
The following antibodies were used for chromatin pull-down: anti-H3K4me3 (Cell Signaling, Cat# 9751BC, lot 7) and anti-H3K27ac (Active Motif, Cat# 39133, Lot# 31814008). Histone modification-enriched genomic DNA fragments were recovered using Protein A/G magnetic beads (Thermo Scientific, 88803-88938 or Millipore 16-663), then washed, eluted, and treated with RNase A and proteinase K. Final ChIP DNA products were isolated using phenol-chloroform extraction followed by ethanol precipitation. The efficiency of each ChIP assay was validated using Qubit concentration measurement and qPCR for positive (GRIN2B, DARPP32) and negative (HBB) control genomic regions. Only ChIP assays that passed quality control were further processed for library preparation and sequencing; this included ChIP DNA that was not detectable on Qubit but showed a good signal and the expected enrichment pattern in qPCR.

HISTONE_MARK    H3K27ac   H3K4me3   Input
Bipolar         56        4         7
Control         158       11        24
Schizophrenia   79        11        12
Table: Numbers of individuals in each assay, grouped by histone mark or input.

Long-Read Whole-Genome Sequencing (WGS)

Cohort description: Brain specimens were obtained from the Human Brain Collection Core (HBCC), part of the NIH NeuroBioBank. Samples were collected under protocols approved by the NIH CNS Institutional Review Board (IRB) (NCT03092687), with informed consent from next-of-kin (NOK). Collection was coordinated through the Offices of the Chief Medical Examiners (MEOs) in Washington, D.C., Northern Virginia, and Central Virginia. Clinical metadata and documentation are publicly available via the NIMH Data Archive (NDA) (Collection #3151): https://nda.nih.gov/edit_collection.html?id=3151

Eligibility criteria:
- No clinical diagnosis of major neuropsychiatric or neurodegenerative disease
- No diagnosis of cognitive impairment during life
- All individuals were confirmed to be neurologically normal at time of death

Demographics:
- Initial cohort size: 155 individuals
- Ancestry: all individuals self-identified as African or African-admixed
- Mean age at death: 44.2 years (range: 18-85 years)
- Sex distribution: 36.4% female

Sample processing: Frozen frontal cortex tissue was dissected and processed according to the public protocol: https://www.protocols.io/view/processing-human-frontal-cortex-brain-tissue-for-p-kxygxzmmov8j/v2. High-molecular-weight DNA was extracted, and libraries were prepared using the Oxford Nanopore Technologies (ONT) LSK-114 kit. Sequencing was performed on ONT PromethION flow cells (R10.4.1 chemistry).

Data processing and quality control:
- Basecalling: conducted using Guppy v6.38
- Read alignment: reads were aligned to the GRCh38 reference genome using minimap2
- Sample identity verification: sample identity was validated by comparing ONT-derived SNP calls with matched short-read WGS genotypes to ensure concordance and prevent sample swaps

Variant calling and phasing: The napu pipeline (https://github.com/nanoporegenomics/napu_wf) produced haplotype-resolved assemblies, joint small-variant (SNV/indel) calls, and multi-caller structural-variant sets, all reported on GRCh38 and phased where possible. Raw signal data were basecalled to obtain 5-methyl-cytosine (5mC) status, and methylation tags were added to the phased BAM files.
Genome-wide methylation summaries are provided in BED format.

Dataset filtering and exclusions:
- All 155 samples underwent sequencing and SNP-based ancestry inference
- 8 samples were excluded due to ancestry inconsistent with an African or African-admixed background
- 1 sample was excluded due to insufficient sequencing quality
- Final sample set: 146 high-quality samples from individuals of African or African-admixed ancestry were retained for downstream analyses

See PMID: 39764002 for further analysis details.

Diagnosis   # Samples
Control     155
Table: Diagnostic summary.

Note: The data derived from HBCC resources were removed from dbGaP and are now available in the NIMH Data Archive (NDA). They include genotypes, short-read whole genome sequencing (WGS), epigenetics (DNA methylation, ChIP-seq for histones), and RNA expression (qPCR, microarray, RNA-seq, single-nucleus RNA-seq) of various brain regions in cases with schizophrenia, bipolar disorder, major depression, and substance use disorders, and in normative controls. Please access our NDA collection (https://nda.nih.gov/edit_collection.html?id=3151) for further detail.
Programmatic submissions (XML based)

For further information please check our Submission FAQs and submission quickguide, as well as the submission terms!

Introduction

Besides the Submitter Portal tool, the EGA supports programmatic submissions of sequence and clinical metadata. If you are not sure what this means, you may want to explore our brief metadata introduction. Programmatic submissions are recommended for array-based submissions. Moreover, they may be of help if your submission is recurrent or difficult to manage manually due to its sheer size. Otherwise, we highly recommend using the Submitter Portal to perform submissions.

In this page we will guide you through the required steps to programmatically submit data to the EGA. Programmatic submissions require your metadata to be structured for easy and straightforward validation and archival. In essence, this consists in formatting your metadata as Extensible Markup Language (XML) files and submitting them to the EGA using WEBIN. Before submitting metadata to the EGA, it is important to ensure that the information in your XML files is compliant with our standards. You can see further details on how these standards are maintained at the EGA at our EGA Schemas documentation page. Using WEBIN, you can validate your XML files against EGA's schemas to ensure that your metadata is compliant before submission.

WEBIN services

- WEBIN production service
- WEBIN test service

We advise you to submit your metadata to the test service before submitting to the production service for the first time. The test service is identical to the production service except that all submissions are discarded within the following 24 hours. This allows you to learn about the submission process without having to worry about data being submitted.

Authentication

Authentication is required each time a submission is made. The submission service uses the HTTPS protocol for metadata encryption and identification, providing a secure submission environment.

Data file upload

The files referenced by both Runs and Analyses (e.g. FASTQ files) need to be uploaded to the EGA before these metadata objects are submitted. In other words, if you submit a Run that references a file that we cannot find associated with your account, the metadata submission will fail. See further details on how to upload your files in our File Upload documentation.

Metadata model of the EGA

Our metadata model is formed by multiple metadata objects. Check further details in our documentation at our EGA Schemas page.

Working with EGA XML files

Now that the basic concepts of the EGA metadata have been described, you can start preparing your programmatic submission through XML. Here you will find guidance on how to prepare the XML files.

Programmatic Submission Tutorial Video

Take a look at the Programmatic Submission Tutorial Video, which explains the workflow of a programmatic submission and goes over an example metadata submission.

When building your XML files, we recommend using text editors (e.g. Sublime Text or Visual Studio) that allow you to visualise the structure of the XML with ease. Furthermore, these editors constantly check the consistency of the XML structure. Alternatively, if the submission consists of a large number of objects (especially analyses), you may find the tool star2xml handy. This tool allows for a direct conversion of metadata in a tabular format (e.g. a spreadsheet) into XMLs.
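For orientation, the sketch below shows roughly what one such object XML (here, a study) can look like, following the ENA/SRA schema layout that the EGA relies on. Every value shown (alias, center name, title, study type, abstract) is a placeholder of our own, and the descriptive and true-value examples linked later on this page remain the authoritative reference:

<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET>
    <!-- alias: your own unique name for this object within your submission account -->
    <STUDY alias="my_study_alias" center_name="MY_CENTER">
        <DESCRIPTOR>
            <STUDY_TITLE>Example study title (placeholder)</STUDY_TITLE>
            <!-- existing_study_type takes a value from the controlled vocabulary of study types -->
            <STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
            <STUDY_ABSTRACT>Short abstract describing the study (placeholder).</STUDY_ABSTRACT>
        </DESCRIPTOR>
    </STUDY>
</STUDY_SET>

Opening a file like this in one of the editors mentioned above makes it easy to see whether every element is properly nested and closed before you attempt validation.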
Identifying objects: aliases and center names

Every EGA object must be uniquely identified within the submission account using its alias attribute. Aliases can be used in submissions to make references between EGA objects. Let us dig into EGA's use of aliases and center names:

- alias: every object should have a name that is unique within your submission account. Once submitted successfully, every alias will be assigned a unique and permanent accession (EGA ID).
- refname: when an object references another by its alias, the alias of the referenced object goes into the "refname" attribute of the referencing object. For example, if a sample has the alias "sample1", and an experiment uses this sample, then the experiment's "EXPERIMENT/SAMPLE/refname" attribute should be "sample1" (a dataset sketch using refname in this way is shown at the end of this section).
- center_name: the "center_name" attribute is required within the submission XML and, if not provided when the object is submitted, it will be automatically filled using your default EGA account center_name. This element is the "controlled vocabulary acronym or abbreviation that is provided to the account holder when the account is first generated". If the submitter is brokering a submission for another institute, the submitter should use their special broker account name in broker_name, while the data centre acronym remains in center_name. Log-in details should have been provided when you requested a submission account. Please contact our Helpdesk team if you have any questions.
- run_center: many submitting centers contract out the actual sample sequencing to another center. In these cases, the sequencing center should be acknowledged in the run_center attribute. Again, this is controlled vocabulary, and the acronym should be sought from the EGA Helpdesk before submitting. Please contact our Helpdesk team if you have any questions.

Prepare your XMLs

The goal of this section is to provide sufficient information to create the metadata XML documents required for programmatic submissions. Please note that the EGA utilises the XML schemas maintained at the European Nucleotide Archive (ENA). Because both archives use the same system, some pieces of documentation from the ENA's programmatic submission can also help you with your programmatic submission to the EGA. For example, you can submit programmatically without using a Submission XML by following the steps at Submission actions without submission XML.

A submission does not have to contain all the different types of XMLs. For example, it is possible to submit only a few samples, or a study that is later to be referenced. You can submit each object one by one, or submit all in a batch: choose whichever method of submission works best for you. We do recommend, nevertheless, that you submit the objects to be referenced (e.g. samples or studies) first, and the objects that reference them (e.g. experiments or datasets) afterwards. You can see a graphical view of these objects and their relationships on our EGA Schemas page.

Independently of the submission scenario, you will always require a Dataset XML. The dataset entity is what is used to control access to the given data, in the form of runs or analyses. In other words, when a requester is granted access, it is through the dataset: access to the objects (e.g. runs or analyses) that the dataset contains is granted in one go. Given the nature of the EGA, a dataset XML will always be required for data access.
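To tie together the alias/refname mechanism and the dataset requirement described above, here is a minimal, hedged sketch of a dataset XML. The element names follow the EGA dataset schema as we understand it, every alias, title, and description is a placeholder, and the descriptive and true-value dataset examples later on this page remain the authoritative reference:

<?xml version="1.0" encoding="UTF-8"?>
<DATASETS>
    <!-- alias: unique within your submission account; a permanent EGA accession is assigned on submission -->
    <DATASET alias="my_dataset_alias" center_name="MY_CENTER">
        <TITLE>Example dataset title (placeholder)</TITLE>
        <DESCRIPTION>Short description of the data files grouped by this dataset (placeholder).</DESCRIPTION>
        <!-- refname points at the alias of a previously defined object; an accession can be used instead -->
        <RUN_REF refname="my_run_alias"/>
        <ANALYSIS_REF refname="my_analysis_alias"/>
        <!-- the referenced policy governs how requesters apply for access to this dataset -->
        <POLICY_REF refname="my_policy_alias"/>
    </DATASET>
</DATASETS>

Because access is granted at the dataset level, grouping runs and analyses into datasets that match the access decisions you expect your DAC to make is worth deciding before you start submitting the referenced objects.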
First, we will differentiate between submissions of "raw" and "processed" data: Runs and Analyses, respectively.

Run data submissions

Raw data derives from instruments "as is". For example, a plain sequence file (e.g. a FASTQ or unaligned BAM file) would be considered raw data. A typical raw (unaligned) sequence read submission consists of 8 XMLs:

- Submission
- Study
- Sample
- Experiment
- Run
- DAC
- Policy
- Dataset

When technical reads (e.g. barcodes, adaptors or linkers) are included in the submitted raw sequences, a spot descriptor must be submitted to describe the position of the technical reads so that they can be removed. The following data files can be submitted without providing spot descriptor information in the experiment/run XML:

- BAM files (single reads)
- SFF files (single reads without barcodes)
- FASTQ files (single reads without any technical reads)
- Complete Genomics files

Analysis data submissions

Processed data is, in some way, refined raw data. This includes raw data that has been processed by some form of analysis method (e.g. alignment, noise reduction, etc.). For example, an aligned sequence file (e.g. a BAM file) created from raw FASTQ files would be a processed file. This category includes most types of data: sequence alignment files (e.g. BAM or CRAM), clinical data (e.g. phenopackets), sequence variation files (e.g. VCF), sequence annotation, etc. A typical EGA analysis data submission consists of 7 XMLs:

- Submission
- Study
- Sample
- Analysis
- DAC
- Policy
- Dataset

We accept three different types of analysis data submissions:

- BAM files (for multiple read alignments)
- VCF files (for sequence variations)
- Phenotype files (in any format)

In any case, keep in mind that samples must be created in order to be referenced in the analyses; in other words, the provenance of the information within the BAM, VCF, and phenotype files must be traceable to registered EGA samples.

Example XMLs

Below you can find a non-exhaustive list of example XMLs with descriptive fields (i.e. explaining what to provide in each field). Furthermore, you can also find real examples (i.e. the true values of the provided fields) in our GitHub repository.

Submission XML

The submission XML is used to validate, submit or update any number of other objects. The submission XML refers to the other XMLs. New submissions use the ADD action to submit new objects, object updates are done using the MODIFY action, and objects can be validated using the VALIDATE action.

Descriptive submission XML example
True values submission XML example

Study XML

The study XML is used to describe the study, containing a title, a study type, and an abstract as they would appear in a publication.

Descriptive study XML example
True values study XML example

Please use the following notation within the "STUDY_LINKS" property when including PubMed citations in the study XML:

<STUDY_LINKS>
  <STUDY_LINK>
    <XREF_LINK>
      <DB>PUBMED</DB>
      <ID>18987735</ID>
    </XREF_LINK>
  </STUDY_LINK>
</STUDY_LINKS>

Sample XML

The sample XML is used to describe the samples used to obtain the data, whether they were sequenced, measured in any other way, or have an associated phenotype. The mandatory fields include information about the taxonomy of the sample, sex, subject ID and phenotype.
For example, the mandatory attribute fields for each sample would look like these, within the array of "SAMPLE_ATTRIBUTES":

<SAMPLE_ATTRIBUTES>
  <SAMPLE_ATTRIBUTE>
    <TAG>subject_id</TAG>
    <VALUE>free text!</VALUE>
  </SAMPLE_ATTRIBUTE>
  <SAMPLE_ATTRIBUTE>
    <TAG>sex</TAG>
    <VALUE>female/male/unknown</VALUE>
  </SAMPLE_ATTRIBUTE>
  <SAMPLE_ATTRIBUTE>
    <TAG>phenotype</TAG>
    <VALUE>Free text, EFO terms (e.g. EFO:0000574) are recommended</VALUE>
  </SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>

The sample is one of the most important objects to describe biologically, so it is highly recommended that "TAG-VALUE" pairs are provided as SAMPLE_ATTRIBUTES to describe the sample in as much detail as possible. For example, to give the population ancestry of the sample, we could add a new attribute to the array indicating that the sample derives from an individual of "Mende in Sierra Leone" (MSL), of African ancestry:

<SAMPLE_ATTRIBUTE>
  <TAG>Population</TAG>
  <VALUE>MSL</VALUE>
</SAMPLE_ATTRIBUTE>

Given that TAG and VALUE are free text, the combinations are limitless, giving you full flexibility over the information you want to provide. We recommend you use the Experimental Factor Ontology (EFO) to describe the phenotypes of your samples. You can provide more than one phenotype by adding more items to the array of SAMPLE_ATTRIBUTES. Phenotypes considered essential for understanding the data submission should be provided. Each phenotype should be listed as a separate sample attribute (<SAMPLE_ATTRIBUTE> … </SAMPLE_ATTRIBUTE>); there is no limit to the number of phenotypes that can be submitted. If a suitable EFO accession cannot be found for your phenotype attribute, please consider using another controlled ontology (e.g. HPO, MONDO, etc.) before using free text.

Descriptive sample XML example
True values sample XML example

Experiment XML

The experiment XML is used to describe the experimental setup, including instrument platform and model details, library preparation details, and any additional information required to correctly interpret the submitted data. Where any of these values differ between runs, a new experiment object must exist, since runs are grouped by experiments. Each experiment references a study and a sample by alias or, if previously submitted, by accession. Pooled data must be demultiplexed by barcode for submission.

Descriptive experiment (Illumina paired read) XML example
True values experiment (Illumina paired read) XML example

Run XML

The run XML is used to associate data files with experiments and typically comprises a single data file (e.g. a FASTQ file). Please note that pooled samples should be de-multiplexed prior to submission and submitted as different runs.

Descriptive run XML example
True values run XML example

Analysis XML

Given that an analysis can be used to submit any type of processed data to the EGA, we list below an example of each of the three most common types of analysis XMLs submitted to the EGA: sequence alignments (e.g. BAM files), sequence variation (e.g. VCF files), and clinical metadata or phenotypes (e.g. phenopackets). Regardless of the type of processed data submitted in the analysis, the analysis must be associated with a Study and can reference multiple types of other objects, from samples to experiments, if they are available at the EGA.
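For orientation, here is a minimal, hedged sketch of a BAM-alignment analysis XML following the ENA/SRA analysis schema; the aliases, file name, checksum, and descriptive text are placeholders, and the descriptive and true-value analysis examples below remain the authoritative reference. Note how the analysis points at an existing study and sample by refname and declares its file together with an MD5 checksum:

<?xml version="1.0" encoding="UTF-8"?>
<ANALYSIS_SET>
    <ANALYSIS alias="my_analysis_alias" center_name="MY_CENTER">
        <TITLE>Example alignment analysis (placeholder)</TITLE>
        <DESCRIPTION>Reads aligned to a reference assembly (placeholder description).</DESCRIPTION>
        <!-- references to previously submitted objects, by alias (refname) or by accession -->
        <STUDY_REF refname="my_study_alias"/>
        <SAMPLE_REF refname="sample1"/>
        <ANALYSIS_TYPE>
            <REFERENCE_ALIGNMENT>
                <!-- reference assembly metadata: an INSDC accession or a common label such as GRCh38 -->
                <ASSEMBLY>
                    <STANDARD refname="GRCh38"/>
                </ASSEMBLY>
            </REFERENCE_ALIGNMENT>
        </ANALYSIS_TYPE>
        <FILES>
            <!-- one BAM file per analysis; the MD5 checksum lets the EGA verify file integrity -->
            <FILE filename="sample1_alignment.bam" filetype="bam"
                  checksum_method="MD5" checksum="0123456789abcdef0123456789abcdef"/>
        </FILES>
    </ANALYSIS>
</ANALYSIS_SET>

For VCF or phenotype submissions the ANALYSIS_TYPE and filetype change accordingly; see the dedicated examples below.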
Just like with Runs, whenever a file is submitted to the EGA through an analysis object, the file's MD5 checksum must be present so that the EGA can validate file integrity upon transfer. This also includes index files when applicable (e.g. .bai.md5 files). Ideally, any analysis that uses a reference sequence for some kind of alignment (e.g. BAM, CRAM or VCF files) should contain metadata about the alignment, such as the INSDC reference assemblies and sequences, given either as accessions (e.g. CM000663.1) or common labels (e.g. GRCh37).

Read alignment (BAM) Analysis XML

The Analysis XML can be used to submit BAM alignments to the EGA. Only one BAM file can be submitted in each analysis, and the samples used within the BAM read groups must be associated with EGA Samples.

Descriptive bam alignments XML example
True values bam alignments XML example

Sequence variation (VCF) Analysis XML

The Analysis XML can be used to submit VCF files to the EGA. Only one VCF file can be submitted in each analysis, and the samples used within the VCF files must be associated with EGA Samples.

Download analysis XML (VCF)

Phenotype files

The Analysis XML can be used to submit phenotype files to the EGA. Only one phenotype file can be submitted in each analysis, and the samples used within the phenotype files must be associated with EGA Samples.

Download analysis XML (Phenotype)

DAC XML

The DAC XML describes the Data Access Committee (DAC) affiliated with the data submission. The DAC may consist of a group or a single individual and is responsible for data access decisions, based on the application procedure described in the Policy XML. As with any other object, if the DAC was already submitted to the EGA, there is no need to submit it again: you can reference the existing object within the EGA. Hence, a DAC XML does not need to be provided if your submission is affiliated with an existing EGA DAC. Further information on DACs can be found here, and you can always contact our Helpdesk team if you have further inquiries.

Descriptive dac XML example
True values dac XML example

Policy XML

The Policy XML describes the Data Access Agreement (DAA) affiliated with the named Data Access Committee.

Descriptive policy XML example
True values policy XML example

Dataset XML

The dataset XML describes the data files, defined by the Run XML and Analysis XML, that make up the dataset, and links the collection of data files to a specified Policy. The dataset XML is commonly the last metadata object to be submitted, since it references multiple other entities. Please consider the number of datasets that your submission consists of. For example, a case-control study is likely to consist of at least two datasets. In addition, we suggest that separate datasets be described for studies using the same samples but different sequencing technologies.

Descriptive dataset XML example
True values dataset XML example

Validating and submitting your EGA XMLs

After you have ensured that the XMLs are properly formatted and contain all the required information, you can proceed to validate and submit your data.

Validating EGA's XMLs through Webin

Once you have prepared your XML files and confirmed that you have access to Webin, you can validate them programmatically against EGA's schemas using the curl command. There are multiple ways in which you can validate your XMLs.
This variety has to do with the fact that: (1) there are two instances of Webin (test and production); and (2) validation is a default step during submission. In other words, any time you submit your data through Webin, it will be validated automatically before being accepted. This allows for four possible routes of validation, all producing the same validation result: validating or submitting to either the production service or the test service of Webin. For example, directly validating a "study" object XML in the test service (wwwdev…) would look like the following:

curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"

In this command, you would need to replace <USERNAME> and <PASSWORD> with your EGA account username and password, respectively. You would also replace study.xml with the path to your XML file. A mock example would look like the following:

curl -u ega-test-data@ebi.ac.uk:egarocks -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"

The validation attempt can have different results depending on the given arguments.

If your XML file is valid according to EGA's schemas, you will see a message indicating that your XML file is compliant. For example, see below for our mock example, where "success" was "true" (i.e. no validation errors were found). Nevertheless, notice how the "<STUDY accession=" attribute is empty: because we were only validating, the study did not receive an accession or ID.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2023-04-11T15:19:28.850+01:00" submissionFile="submission-EBI-TEST_1681222768850.xml" success="true">
  <STUDY accession="" alias="Mock example" status="PRIVATE"/>
  <SUBMISSION accession="" alias="SUBMISSION-11-04-2023-15:19:28:840"/>
  <MESSAGES>
    <INFO>VALIDATE action has been specified.</INFO>
    <INFO>Submission has been rolled back.</INFO>
    <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
  </MESSAGES>
  <ACTIONS>VALIDATE</ACTIONS>
  <ACTIONS>PROTECT</ACTIONS>
</RECEIPT>

If there are any errors or warnings, the tool will display them, allowing you to correct them before submitting your data to the EGA. For example, the following response reports that the object we were trying to submit already exists, and therefore "success" was "false":

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2023-04-11T15:12:35.609+01:00" submissionFile="submission-EBI-TEST_1681222355609.xml" success="false">
  <STUDY alias="Example!_Human Microbiome Project SP56J" status="PRIVATE" holdUntilDate="2023-03-11Z"/>
  <SUBMISSION alias="SUBMISSION-11-04-2023-15:12:35:576"/>
  <MESSAGES>
    <ERROR>In study, alias: "Example!_Human Microbiome Project SP56J". The object being added already exists in the submission account with accession: "ERP127584".</ERROR>
    <INFO>VALIDATE action has been specified.</INFO>
    <INFO>Submission has been rolled back.</INFO>
    <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
  </MESSAGES>
  <ACTIONS>VALIDATE</ACTIONS>
  <ACTIONS>PROTECT</ACTIONS>
</RECEIPT>

If the curl command retrieves no response at all, please double-check that your username and password are provided correctly. Also notice the "ACTION=..." argument passed to the curl command.
This specifies the action to take during the call to Webin, so we do not need a "Submission" XML just for a validation attempt. See more at Submission actions without submission XML. Furthermore, validation of multiple files or objects (e.g. sample, experiment, study…) can be done in a single command by adding more arguments (i.e. '-F'). For example:

curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml" -F "SAMPLE=@sample.xml" -F "DATASET=@dataset.xml"

As mentioned above, besides the "validate" action in the test environment, you can also validate your metadata by three other routes:

"Validate" in the production server. From our example above, you simply need to take the "dev" out of the URL:

curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"

"Add" in the test server. From our example above, you simply need to replace the action "validate" with "add". Whatever is submitted to this service will be discarded within 24 hours, so whether something actually gets submitted does not matter in the long run:

curl -u <USERNAME>:<PASSWORD> -F "ACTION=ADD" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"

"Add" in the production server. A combination of the previous two methods, which turns the attempt into a real submission. This route should only be taken when you are sure your metadata is compliant and is what you want to submit:

curl -u <USERNAME>:<PASSWORD> -F "ACTION=ADD" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"

What happens after the submission of a dataset XML?

Once you have completed the registration of your dataset(s), please contact our Helpdesk team to provide a release date for your study. Please note that all datasets affiliated with unreleased studies are automatically placed on hold until the authorised submitter or DAC contact contacts the EGA Helpdesk for the study to be released. We strongly advise you not to delete your data until the EGA Helpdesk confirms that your data has been successfully archived.