
Statistica Sinica 12(2002), 179-202


Lei Li

University of Southern California and Florida State University

Abstract: One of the key practices of the Human Genome Project is Sanger DNA sequencing. Its data analysis part, called base-calling, attempts to reconstruct target DNA sequences from the fluorescence intensities generated by sequencing machines. In this paper, we present a modeling framework for DNA sequencing in which a base-calling scheme arises naturally. A large portion of DNA sequencing errors comes from the diffusion effect in electrophoresis, and deconvolution is the tool for addressing this problem. We present a new version of parametric deconvolution, motivated by the spike-convolution model and by some recently obtained results on its asymptotics. One application of the asymptotics is to examine the resolution issue from the perspective of confidence intervals. We also report an empirical study of the progressiveness of electrophoretic diffusion, carried out by estimating the slowly changing width parameter in the spike-convolution model. Finally, we include an example of complete preprocessing of DNA sequencing data.

Key words and phrases: Base-calling, color-correction, DNA sequencing, electrophoresis, parametric deconvolution, resolution, spike-convolution model, width.
