Contemporary electrophoresis-based sequencing machines produce curves registering the amount of each of the four nucleotide bases as a function of sequence position. For homogeneous DNA samples, the largest peak at each position defines the underlying sequence. However, more careful analysis of sequence trace data holds promise for determining the presence and frequency of mutations in inhomogeneous samples.
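As a rough illustration of the distinction drawn above, the following Python sketch calls the largest peak at each position and flags positions where a second base rises above a noise threshold. The trace representation (one dictionary of peak heights per position), the function names, and the `noise_floor` value are assumptions made for this example only; they are not taken from any particular instrument or from the method developed in this paper.

```python
def call_position(peaks, noise_floor=0.10):
    """Return the dominant base and any minor bases above the noise floor.

    peaks: dict mapping base -> peak height at one position (hypothetical format)
    noise_floor: assumed minimum fraction for a minor base to count as real
    """
    total = sum(peaks.values())
    if total <= 0:
        raise ValueError("position has no signal")
    # Convert raw peak heights to relative frequencies.
    fractions = {base: height / total for base, height in peaks.items()}
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    major = ranked[0]
    minors = {b: fractions[b] for b in ranked[1:] if fractions[b] > noise_floor}
    return major, minors


def call_sequence(trace, noise_floor=0.10):
    """Call a consensus sequence and collect candidate mixed positions."""
    calls = [call_position(peaks, noise_floor) for peaks in trace]
    consensus = "".join(major for major, _ in calls)
    variants = {i: minors for i, (_, minors) in enumerate(calls) if minors}
    return consensus, variants


# Toy trace: position 1 carries a secondary G peak at 35% of the signal.
trace = [
    {"A": 95, "C": 2, "G": 2, "T": 1},
    {"A": 3, "C": 60, "G": 35, "T": 2},
    {"A": 1, "C": 1, "G": 1, "T": 97},
]
consensus, variants = call_sequence(trace)
print(consensus)  # ACT
print(variants)   # {1: {'G': 0.35}}
```

For a homogeneous sample the `variants` dictionary would be empty and only the consensus matters; in a mixed sample the secondary peaks carry the information of interest.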
In this paper, we look at the problem of using sequence trace data to identify sequence variants in mixed DNA populations. Our work is motivated by a new line of capillary electrophoresis sequencing machines being developed by BioPhotonics Corporation. By using advanced single-photon detectors and other technologies, BioPhotonics can not only detect each base but also determine its relative frequency at each position to within 10%, and expects to reduce this error to 1% in the near future.
This motivates a variety of questions concerning how accurately we can sequence mixed populations from a single sample using relative frequency information. Possible applications of this technology include:
Such microarray studies will continue to help develop our understanding of gene expression and disease. However, the technologies used for widespread diagnostic tests may well be different, to minimize costs and increase robustness. Indeed, a major goal of BioPhotonics' efforts is developing smaller, cheaper DNA sequencing machines with the vision of placing them in doctors' offices for diagnostic applications.
Particularly important for many medical applications is the need to analyze sequence from heterogeneous genomic samples. Such mixed populations naturally arise from acquired mutations, say, in cancer, where various mutations to oncogenes such as p53 can lead to dramatically different disease courses. Extensive databases of p53 mutations are being constructed.
In this paper, we provide simulation results demonstrating our ability to identify p53 mutations as a function of mutation frequency and sequencing accuracy.
In this paper, we study the potential of this approach both theoretically and through simulation. We demonstrate that, under reasonable assumptions of polymorphism rates and error probabilities, pool sizes of over 100 people can be analyzed on a single sequencing run.
In this paper, we demonstrate that accurate determination of the relative frequencies of four distinct strains can be made even in the face of base-frequency error rates up to 25%.
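To make the strain-frequency problem concrete, here is a minimal Python sketch of one simple estimator. It is not the method evaluated in this paper. It assumes that the candidate strain sequences are known in advance and that every strain has at least one position where it alone carries a particular base; the measured frequency of that base then estimates the strain's share directly. The strain names, sequences, and function names are hypothetical.

```python
from collections import Counter


def mixture_profile(strains, freqs):
    """Expected base frequencies at each position for a known mixture."""
    length = len(next(iter(strains.values())))
    profile = []
    for pos in range(length):
        column = {}
        for name, seq in strains.items():
            column[seq[pos]] = column.get(seq[pos], 0.0) + freqs[name]
        profile.append(column)
    return profile


def estimate_strain_frequencies(strains, observed):
    """Estimate strain shares from per-position base frequencies.

    Uses only positions where a single strain carries a given base, averages
    the measured frequency of that base over those positions, and normalizes
    the estimates so they sum to 1.
    """
    names = list(strains)
    sums = {name: 0.0 for name in names}
    counts = {name: 0 for name in names}
    for pos, column in enumerate(observed):
        carriers = Counter(strains[name][pos] for name in names)
        for name in names:
            base = strains[name][pos]
            if carriers[base] == 1:  # only this strain has this base here
                sums[name] += column.get(base, 0.0)
                counts[name] += 1
    missing = [name for name in names if counts[name] == 0]
    if missing:
        raise ValueError(f"no distinguishing position for: {missing}")
    raw = {name: sums[name] / counts[name] for name in names}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("no signal at any distinguishing position")
    return {name: raw[name] / total for name in names}


# Four hypothetical strains, each with one distinguishing position.
strains = {"s1": "ACGTT", "s2": "CCGTA", "s3": "ACTTA", "s4": "ACGGA"}
true_freqs = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}
observed = mixture_profile(strains, true_freqs)  # noise-free measurements
estimate = estimate_strain_frequencies(strains, observed)
print(estimate)
```

With noise-free input the estimate recovers the true shares; perturbing the entries of `observed` gives a quick way to explore how measurement error propagates into the estimated frequencies.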