Click on a Dataset ID in the table below to learn more and to find
out who to contact about access to these data.
Dataset ID
Description
Technology
Samples
EGAD50000000276
The synthetic genomes were created to mimic real cancer data from 4 patients (named 185, 186, 187 and 188). Mutations are based on real CRC patients from the PCAWG dataset. For each patient, two tumor samples at different time points and one healthy sample were simulated. Intra-tumor heterogeneity and tumor evolution are represented by simulating reads from each tumor subclone separately and then mixing them according to the clonal proportions in each sample. For rapid use and transfer, only selected chromosomes were generated for each patient.
Chromosomes per patient:
-185: chr4, chr5, chr7, chr17
-186: chr1, chr7, chr12, chr17
-187: chr1, chr2, chr5, chr12, chr17
-188: chr2, chr5, chr12, chr13, chr17
Workflows used to create BAM/BAI, VCF and MAF files from FASTQ (alignment to GRCh38):
- https://usegalaxy.eu/published/workflow?id=2c3d05023c02113e
- https://usegalaxy.eu/published/workflow?id=1da86d74f8535f4e
Technology: unspecified
Samples: 8
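For EGAD50000000276, the step of mixing subclone reads according to clonal proportions can be illustrated with a minimal Python sketch. It is not the workflow used to build the dataset; the function name, the subclone names and the proportions below are hypothetical, and the rounding strategy (largest remainder) is an assumption chosen so that the read counts add up exactly to the requested total.

```python
from typing import Dict


def reads_per_subclone(total_reads: int, proportions: Dict[str, float]) -> Dict[str, int]:
    """Split total_reads among subclones in proportion to their clonal fractions."""
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("clonal proportions must sum to a positive value")
    # Normalise so the input may be given as fractions or as percentages.
    exact = {name: total_reads * p / total for name, p in proportions.items()}
    counts = {name: int(x) for name, x in exact.items()}
    # Reads lost by rounding down go to the largest fractional remainders first.
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical tumor sample made of three subclones at 50%, 30% and 20%.
mix = reads_per_subclone(1_000_000, {"clone_A": 0.5, "clone_B": 0.3, "clone_C": 0.2})
print(mix)
```

Reads simulated separately for each subclone would then be subsampled to these counts and concatenated to form one tumor sample.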
EGAD50000000564
This dataset contains synthetic WGS data for 10 tumor-normal pairs of colorectal cancer, simulated in the standard format of Illumina paired-end reads. The NEAT read simulator (version 3.0, https://github.com/zstephens/neat-genreads) was used to synthesize these 10 pairs of tumor and normal WGS data. During data generation, the simulation parameters (i.e., sequencing error statistics, read fragment length distribution and GC% coverage bias) were taken from the data models provided by NEAT. The target average sequencing depth was around 110X for tumor samples and 60X for normal samples.
To generate the synthetic normal WGS data for each sample, the germline variant profile of a real patient was randomly down-sampled to 50% of that patient's germline variants. These were mixed with in silico germline variants, modelled randomly using an average mutation rate of 0.001, which made up the other 50%. Together they constitute the full germline profile for the normal synthetic WGS data.
To generate the synthetic tumor WGS data for each sample, a pre-defined somatic short variant profile (SNVs and indels) learned from a real CRC patient was added to the germline variant profile used for the normal synthetic WGS data of the same patient; together these make up the variants of the tumor sample. No copy number profile or structural variation profile was introduced into the tumor synthetic WGS data. Tumor content and ploidy were assumed to be 100% and 2, respectively.
For mapping/variant detection, the Sarek pipeline v3.1.2 (https://nf-co.re/sarek/3.1.2) was used, specifically:
1. BWA v0.7.17-r1188 for read mapping
2. GATK v4.3.0.0 for pre-processing BAM files (including MarkDuplicates and recalibration)
3. Mutect2 (GATK v4.3.0.0) for somatic variant calling
4. Strelka2 v2.9.10 for germline and somatic variant calling
Metadata for the 10 CRC patients used to generate the synthetic normal and tumor WGS data:
Patient_id  Tumor_barcode  Normal_barcode  Age  Sex  Tissue        Cancer
SIM007      SIM007_T       SIM007_N        71   F    Rectal        Primary CRC
SIM008      SIM008_T       SIM008_N        45   F    Colon         Neuroendocrine Metastasis CRC
SIM010      SIM010_T       SIM010_N        62   M    Colon         Metastasis CRC
SIM011      SIM011_T       SIM011_N        55   M    Colon         Neuroendocrine Metastasis CRC
SIM012      SIM012_T       SIM012_N        57   M    Rectal        Metastasis CRC
SIM013      SIM013_T       SIM013_N        69   M    Colon         Metastasis CRC
SIM014      SIM014_T       SIM014_N        68   M    Colon         Neuroendocrine primary CRC
SIM015      SIM015_T       SIM015_N        58   F    Colon         Primary CRC
SIM016      SIM016_T       SIM016_N        49   M    Colon/Rectal  Primary CRC
SIM017      SIM017_T       SIM017_N        78   M    Colon         Neuroendocrine primary CRC
Technology: unspecified
Samples: 20
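For EGAD50000000564, the way the normal and tumor variant profiles are assembled can be illustrated with a toy Python sketch. It does not use NEAT and is not the authors' code: every function name and all variant records are hypothetical. It assumes the mutation rate of 0.001 is a per-base rate, so the number of in silico variants is the rate times the region length, and it draws reference bases at random because the toy has no reference genome.

```python
import random
from typing import List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); toy representation


def downsample(variants: List[Variant], fraction: float, rng: random.Random) -> List[Variant]:
    """Keep a random subset holding `fraction` of the variants."""
    return rng.sample(variants, round(len(variants) * fraction))


def in_silico_snvs(chrom: str, length: int, rate: float,
                   occupied: Set[int], rng: random.Random) -> List[Variant]:
    """Draw round(length * rate) random SNVs at positions not already in use."""
    free = [p for p in range(1, length + 1) if p not in occupied]
    snvs = []
    for pos in rng.sample(free, round(length * rate)):
        ref = rng.choice("ACGT")  # assumption: random ref base, no genome here
        alt = rng.choice([b for b in "ACGT" if b != ref])
        snvs.append((chrom, pos, ref, alt))
    return snvs


def build_profiles(real_germline: List[Variant], somatic: List[Variant],
                   chrom: str, length: int, rate: float = 0.001, seed: int = 0):
    """Return (normal, tumor) variant profiles for one synthetic patient."""
    rng = random.Random(seed)
    kept = downsample(real_germline, 0.5, rng)  # 50% of the real germline variants
    occupied = {v[1] for v in kept} | {v[1] for v in somatic}
    simulated = in_silico_snvs(chrom, length, rate, occupied, rng)
    normal = sorted(kept + simulated, key=lambda v: (v[0], v[1]))
    # Tumor profile = germline profile of the normal sample + somatic variants.
    tumor = sorted(normal + somatic, key=lambda v: (v[0], v[1]))
    return normal, tumor


# Hypothetical input: 20 real germline SNVs and 2 somatic variants on a 10 kb region.
real = [("chr1", 100 * i, "A", "G") for i in range(1, 21)]
somatic = [("chr1", 5555, "C", "T"), ("chr1", 7777, "G", "GA")]
normal, tumor = build_profiles(real, somatic, "chr1", 10_000)
print(len(normal), len(tumor))
```

In this toy setup the region length was chosen so that the in silico variants happen to match the retained real variants in number; in the dataset itself the balance between the two halves is as described above.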