Blood Somatic Mutations in TCGA Donors Suffering from Solid Tumors
This study comprises blood somatic mutations identified across 8,530 solid tumor patients in the TCGA cohort. We reasoned that low-coverage whole-genome sequencing of blood samples routinely carried out in cancer genomics projects may be repurposed to detect clonal hematopoiesis (CH). Inspired by a previous approach to identify early mutations in the development of the hematopoietic system, we implemented a pipeline that systematically carries out 'reverse' somatic mutation calling on paired blood/tumor samples. In this approach, the tumor sample taken from each patient serves as the control for their germline genome, so blood somatic mutations are detected as variants in the blood (with respect to the human reference genome) that are absent from the tumor sample.
Variant calling was carried out on our in-house cluster on downloaded blood/tumor BAM files. The matched blood and tumor BAM files (masked and deduplicated using GATK) of 8,530 whole-exome patients were obtained as described above. Variants were called with Strelka2 using default parameters, with the blood sample as the tumor input and the tumor sample as the control (reverse calling).
Variants that passed the caller's PASS filter, had two or more supporting reads and had a variant allele frequency (VAF) below 0.5 were kept. Mutations in regions of low mappability, as defined by the DUST algorithm (k=30) and UMAP (36-mers), were excluded. Contiguous variants were merged into double-base substitutions. Variants more frequent across the cohorts than the DNMT3A R882H or JAK2 V617F hotspots in a cohort-specific Panel of Normals (PoN) and in gnomAD v2.1 were removed. This was equivalent to discarding variants present in these datasets with an allele frequency greater than 0.008 in the TCGA PoN and 0.0003 in gnomAD v2.1. Additionally, common SNPs as defined by the dbSNP 151 Common UCSC track and dbSNP were excluded.
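The per-variant filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the `Variant` record, its field names and both helper functions are hypothetical, and only the thresholds (two supporting reads, VAF below 0.5, allele frequency cutoffs of 0.008 and 0.0003) come from the description. How read counts are combined when two adjacent variants are merged is not stated in the description, so taking the minimum is an assumption made here.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the study description.
MIN_ALT_READS = 2        # two or more supporting reads
MAX_VAF = 0.5            # VAF must be strictly below 0.5
MAX_PON_AF = 0.008       # variants above this in the TCGA PoN are discarded
MAX_GNOMAD_AF = 0.0003   # variants above this in gnomAD v2.1 are discarded


@dataclass
class Variant:
    # Hypothetical record layout; field names are illustrative only.
    chrom: str
    pos: int              # 1-based position
    ref: str
    alt: str
    alt_reads: int        # reads supporting the alternate allele in blood
    depth: int            # total read depth in blood
    filter_status: str    # caller filter field, e.g. "PASS"
    pon_af: float = 0.0
    gnomad_af: float = 0.0

    @property
    def vaf(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_filters(v: Variant) -> bool:
    """Apply the read-support, VAF and population-frequency filters."""
    return (
        v.filter_status == "PASS"
        and v.alt_reads >= MIN_ALT_READS
        and v.vaf < MAX_VAF
        and v.pon_af <= MAX_PON_AF
        and v.gnomad_af <= MAX_GNOMAD_AF
    )


def merge_adjacent_snvs(variants: List[Variant]) -> List[Variant]:
    """Merge pairs of single-base variants at consecutive positions
    into one double-base substitution."""
    ordered = sorted(variants, key=lambda v: (v.chrom, v.pos))
    merged: List[Variant] = []
    i = 0
    while i < len(ordered):
        cur = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        is_pair = (
            nxt is not None
            and nxt.chrom == cur.chrom
            and nxt.pos == cur.pos + 1
            and len(cur.ref) == len(cur.alt) == 1
            and len(nxt.ref) == len(nxt.alt) == 1
        )
        if is_pair:
            merged.append(Variant(
                cur.chrom, cur.pos, cur.ref + nxt.ref, cur.alt + nxt.alt,
                # Assumption: keep the weaker read support of the two calls.
                min(cur.alt_reads, nxt.alt_reads),
                min(cur.depth, nxt.depth),
                cur.filter_status,
                max(cur.pon_af, nxt.pon_af),
                max(cur.gnomad_af, nxt.gnomad_af),
            ))
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged
```

In this sketch, filtering would be applied first (`[v for v in calls if passes_filters(v)]`) and the surviving variants then passed to `merge_adjacent_snvs`. The order of these two steps in the actual pipeline is not specified in the description.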
Mutations within segmental duplications, simple repeats and masked regions as defined in UCSC tracks were also removed. Finally, samples with a mutation count above the 97.5th percentile of the mutation burden across the cohort were deemed unreliable and excluded from further analyses. We call the set of variants obtained after applying these filters the full set.
A more conservative subset of somatic mutations was then generated from the full set. To this end, we applied MosaicForecast, a tool that phases mutations to polymorphisms in order to identify somatic mutations present in a small number of cells, and that uses a random forest classifier to predict mosaicism for mutations that cannot be phased. As a result, we obtained a subset of mosaic-phased mutations and a subset of mutations likely to be somatic (mosaic set).
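The burden-based sample exclusion used to define the full set can be sketched as follows. This is an illustrative Python example only: the function names and the sample identifiers are hypothetical, and the description does not say which percentile convention was used, so linear interpolation between ranks is assumed here.

```python
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between ranks (q in [0, 100])."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def reliable_samples(burden: Dict[str, int], q: float = 97.5) -> Dict[str, int]:
    """Keep samples whose mutation count does not exceed the q-th percentile
    of the mutation burden across the cohort."""
    cutoff = percentile(list(burden.values()), q)
    return {sample: count for sample, count in burden.items() if count <= cutoff}


# Hypothetical cohort of 100 samples with 1 to 100 mutations each.
burden = {f"S{i}": i for i in range(1, 101)}
kept = reliable_samples(burden)
```

With this toy input, the three samples with the highest mutation counts fall above the cutoff and are dropped.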
To obtain access to this study, please first obtain access to the TCGA study (phs000178).
- Type: Case Set
- Archiver: The database of Genotypes and Phenotypes (dbGaP)