Background

Massively parallel sequencing technology has transformed cancer genomics. It is now feasible, within a clinically relevant time frame and at a clinically manageable cost, to screen DNA from patient tumours for mutations essentially genome-wide. The challenge for personalised medicine will be to increase the sample size to thousands or tens of thousands of well-characterised cases, in order to attain sufficient statistical power to stratify patients accurately across the complexity and genomic heterogeneity expected for most of the common tumour types. Currently, whole genome sequencing on this scale is not feasible, and targeted sequencing of relevant portions of the genome will be required.

Pilot data

We have developed protocols for large-scale, multiplexed sequencing of 100-200 genes in thousands of samples. Using robotic technology, genomic DNA from the cancer specimen is processed into sequencing libraries with unique DNA barcodes, allowing each sequencing read to be attributed to the sample it derives from. Currently, these sequencing libraries can be generated in a 96-well format using fully automated protocols, and we are exploring methods to expand this to a 384-well format. The sequencing libraries are pooled and hybridized to custom sets of RNA baits representing the genomic regions of interest. The pulled-down libraries are sequenced in pools of 48-96 samples per lane of an Illumina HiSeq. This protocol is already implemented at the Sanger Institute.

We have published proof that somatic mutations in novel cancer genes can be identified from exome-wide sequencing. In unpublished pilot data, we have established the feasibility of robotic library production, custom pull-down, and multiplexed sequencing of barcoded libraries for 100 known myeloid cancer genes across 760 myelodysplasia samples.
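To illustrate how barcoded reads from a pooled lane can be attributed back to their sample of origin, here is a minimal Python sketch. It is not the pipeline used at the Sanger Institute: the barcode sequences, sample names, the `max_mismatches` tolerance, and the function names are all hypothetical, and real demultiplexing tools work on sequencer output files rather than in-memory tuples.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, barcode_to_sample, max_mismatches=1):
    """Return the sample whose barcode is the unique closest match, else None.

    A read is left unassigned when the best match exceeds max_mismatches
    or when two barcodes tie for the best match.
    """
    best_distance = None
    best_samples = []
    for barcode, sample in barcode_to_sample.items():
        if len(barcode) != len(read_barcode):
            continue
        distance = hamming(barcode, read_barcode)
        if best_distance is None or distance < best_distance:
            best_distance, best_samples = distance, [sample]
        elif distance == best_distance:
            best_samples.append(sample)
    if (best_distance is not None
            and best_distance <= max_mismatches
            and len(best_samples) == 1):
        return best_samples[0]
    return None


def demultiplex(reads, barcode_to_sample, max_mismatches=1):
    """Group (barcode, sequence) pairs by sample; unmatched reads go to 'undetermined'."""
    bins = defaultdict(list)
    for barcode, sequence in reads:
        sample = assign_sample(barcode, barcode_to_sample, max_mismatches)
        bins[sample if sample is not None else "undetermined"].append(sequence)
    return dict(bins)


# Hypothetical barcodes and toy reads, for illustration only.
BARCODES = {"ACGTAC": "sample_01", "TGCATG": "sample_02"}
READS = [
    ("ACGTAC", "AAAA"),  # exact match to sample_01
    ("ACGTAA", "CCCC"),  # one mismatch from sample_01
    ("GGGGGG", "TTTT"),  # too far from either barcode
    ("TGCATG", "GGGG"),  # exact match to sample_02
]

if __name__ == "__main__":
    for sample, sequences in sorted(demultiplex(READS, BARCODES).items()):
        print(sample, len(sequences))
```

Allowing a small number of mismatches tolerates sequencing errors in the barcode itself, at the cost of requiring barcodes that are well separated from one another.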
Highlights of the data analysed thus far are that coverage is remarkably even between samples; when 96 samples are run, average coverage per lane of sequencing is ~250-fold, with 90-95% of targeted exons covered by >25 reads; known mutations can be discovered in the data set; and the protocol is amenable to whole genome amplified DNA. The bioinformatic algorithms for identifying substitutions and indels in pull-down data are well established; we also have pilot data showing that copy number changes, LOH and genomic rearrangements in specific regions of interest can be identified by tiling baits across the relevant loci.

Proposal

We propose to apply this methodology to 10,000 samples from patients with AML enrolled in clinical trials over the last 10-20 years. Oncogenic point mutations, and potentially genomic rearrangements, will be identified and linked to clinical outcome data, with a view to undertaking the following sorts of analyses:

• Identification of co-occurrence, mutual exclusivity and clusters of driver mutations
• Correlation of prognosis with driver mutations and potentially gene-gene interactions
• Exploration of genomic markers of drug response

Ultimately, we would like to be in a position to release the mutation data together with matched clinical outcome data to genuine medical researchers via a controlled access approach, possibly within the COSMIC framework. The vision here is to generate a portal whereby a clinician faced with an AML patient and his or her mutational profile can obtain a ‘personalised’ prediction of outcome, together with a fair assessment of the uncertainty of the estimate. With a sufficient sample size, there would also be the potential to develop decision support algorithms for therapeutic choices based on such data.
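One way the co-occurrence and mutual exclusivity analysis mentioned above could be approached is a pairwise Fisher's exact test on a 2x2 table of mutated versus unmutated samples. The sketch below is an assumption about method, not the analysis plan of this study: the gene names (`GENE_A`, `GENE_B`, `GENE_C`), the toy sample data, and all function names are hypothetical, and it uses only the Python standard library.

```python
from itertools import combinations
from math import comb


def contingency(samples, gene_a, gene_b):
    """Count samples with both genes mutated, only A, only B, or neither."""
    both = a_only = b_only = neither = 0
    for mutated in samples.values():
        has_a, has_b = gene_a in mutated, gene_b in mutated
        if has_a and has_b:
            both += 1
        elif has_a:
            a_only += 1
        elif has_b:
            b_only += 1
        else:
            neither += 1
    return both, a_only, b_only, neither


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x samples carrying both mutations.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_observed = prob(a)
    total = 0.0
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = prob(x)
        # Sum every table at least as unlikely as the observed one.
        if p <= p_observed * (1 + 1e-9):
            total += p
    return min(1.0, total)


def pairwise_tests(samples):
    """Test every gene pair; return (gene_a, gene_b, direction, p_value) tuples."""
    genes = sorted(set().union(*samples.values()))
    results = []
    for gene_a, gene_b in combinations(genes, 2):
        a, b, c, d = contingency(samples, gene_a, gene_b)
        if a * d > b * c:
            direction = "co-occurrence"
        elif a * d < b * c:
            direction = "mutual exclusivity"
        else:
            direction = "none"
        results.append((gene_a, gene_b, direction, fisher_two_sided(a, b, c, d)))
    return results


# Toy data: sample ID -> set of mutated genes (hypothetical, illustration only).
TOY = {
    "s1": {"GENE_A", "GENE_B"},
    "s2": {"GENE_A", "GENE_B"},
    "s3": {"GENE_A", "GENE_B"},
    "s4": {"GENE_C"},
    "s5": {"GENE_C"},
    "s6": set(),
}

if __name__ == "__main__":
    for gene_a, gene_b, direction, p_value in pairwise_tests(TOY):
        print(gene_a, gene_b, direction, round(p_value, 3))
```

With 100-200 genes there are thousands of gene pairs, so in practice the resulting p-values would need correction for multiple testing before any pair is reported.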

Click on a Dataset ID in the table below to learn more, and to find out who to contact about access to these data

Dataset ID        Description   Technology       Samples
EGAD00001000606   -             Illumina MiSeq   38