Dataset

Targeted de novo phasing and long-range assembly by template mutagenesis

Dataset ID        Technology       Samples
EGAD00001008444   Illumina MiSeq   4

Dataset Description

Long-range sequencing with a low error rate has been challenging. Sequence assembly and phasing usually require a high-quality reference genome for mapping, so highly variable genomic regions and regions with no reference genome information are difficult to work with. In this study, we describe novel bench protocols and algorithms that solve both problems at once: they yield ultra-low-error-rate, haplotype-phased sequence assemblies of regions 10 kb in length using a short-read sequencing platform. We accomplish this by imprinting each template strand from a target region with a dense and unique mutation pattern. The mutation process randomly and independently converts ~50% of cytosines to uracils. Short-read sequencing libraries are made from both mutated and unmutated templates. A conservative de Bruijn graph approach seeds an assembly of the mutated templates, which we then extend by mapping paired-end reads. After using the unmutated sequence library to recover almost all of the mutated bases, we partition the template assemblies into two or more haplotypes. The final haplotype is assembled and corrected for residual template mutations and PCR errors. We obtain per-base error rates below 10^-9. We apply this method to a human family, correctly assembling and phasing three genomic intervals, including the highly polymorphic HLA-B gene.
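
The imprinting step described above can be illustrated with a toy simulation. This is a minimal sketch and not the study's pipeline: the function names (`mutate_template`, `mutation_pattern`, `recover_bases`), the example sequence, and the assumption that mutated and unmutated sequences are already aligned base for base are all hypothetical. It only shows why converting each cytosine independently with probability ~0.5 gives each template copy its own pattern, and how an unmutated sequence could in principle restore the converted bases.

```python
import random


def mutate_template(seq, rate=0.5, rng=None):
    """Convert each cytosine to uracil independently with probability `rate`."""
    rng = rng or random.Random()
    return "".join(
        "U" if base == "C" and rng.random() < rate else base
        for base in seq
    )


def mutation_pattern(original, mutated):
    """Return the positions where a cytosine was converted to uracil."""
    return tuple(
        i for i, (a, b) in enumerate(zip(original, mutated))
        if a == "C" and b == "U"
    )


def recover_bases(mutated, unmutated):
    """Restore converted positions from an aligned unmutated sequence.

    Assumes both strings have the same length and coordinates
    (a simplification made for this illustration).
    """
    return "".join(u if m == "U" else m for m, u in zip(mutated, unmutated))


# Hypothetical target region; a real region would be about 10 kb long.
target = "ACCGTCCATGCCCTAGCACC" * 5
rng = random.Random(0)

# Each template copy receives an independent mutation pattern.
templates = [mutate_template(target, rng=rng) for _ in range(4)]
patterns = [mutation_pattern(target, t) for t in templates]

for pattern in patterns:
    print(len(pattern), "of", target.count("C"), "cytosines converted")

# The unmutated sequence restores every converted base in this toy setting.
print(all(recover_bases(t, target) == target for t in templates))
```

With n cytosines in a region, there are 2^n possible patterns, so two independently mutated copies of a long region are very unlikely to share one. In real sequencing data a uracil is read as thymine, so converted positions would appear as C-to-T differences rather than as a literal `U`.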

Data Use Conditions

See further information on Data Use Conditions

Label                  Code          Version      Modifier
general research use   DUO:0000042   2021-02-23