Jun Liu and Bin Yu (2002). INTRODUCTION. Vol.12, No.1.

Statistica Sinica 12(2002), 3-5

INTRODUCTION

When the Human Genome Project (HGP) announced the early completion of sequencing, a new era of bioinformatics began. The pace of sequence refinement is expected to accelerate for chromosomes and other genetic organelles, for both human subpopulations and other organisms. While the massive biopolymer sequence and structure information and genetic marker data (e.g., single-nucleotide polymorphisms) will continue to grow, functional genomics, which studies all of the genes in a genome simultaneously within a functional framework, is also gearing up rapidly. High-throughput microarray technologies, which allow for the simultaneous measurement of expression levels of thousands of genes, have shown great promise in studying intricate problems that range from gene regulation and protein interaction to pathways for genetic diseases and the discovery of target subpopulations for drug or other therapies. To explore the rich information in both the sequence and microarray data, statistical analysis is indispensable, because all biological processes are inherently non-deterministic, or stochastic, and all biological experiments contribute noise.

In this special issue, we are very pleased to present sixteen papers that highlight the exciting opportunities and challenges in bioinformatics. Among them, ten are on microarray/gene expression data analyses, and six are on a wide range of other topics such as pedigree/linkage analysis, DNA sequence base calling, RNA secondary structure simulations, and protein structure prediction. There are of course many other areas that are left out, for example, protein sequence alignment and classification, gene regulatory motif discovery, study of evolution, biopolymer database search methods, and comparative genome analysis.

The ten microarray papers reflect current statistical research in the area. Since cDNA microarray images are noisy, replicated experiments are needed to ensure the accuracy of gene expression levels derived from these images. For replicated microarray experiments, Kerr, Afshari, Bennett, Bushel, Martinez, Walker and Churchill propose the use of ANOVA as a natural framework for log intensities of both red and green channels, although the final parameters of interest are still very much log ratios. Lonnstedt and Speed use an empirical Bayes model to deal with replicated data (log intensity ratios), with the parameters of the prior estimated from the data. Often clustering is the first step toward understanding the genes under study, and it has been the topic in microarray analysis with much statistical activity. Tibshirani, Hastie, Narasimhan, Eisen, Sherlock, Brown and Botstein propose an exploratory cluster scoring technique based on an earlier method, SAM (significance analysis of microarrays); Lazzeroni and Owen introduce a novel plaid model which allows partially overlapping clusters. They argue that this leads to more interpretable structures and demonstrate the applicability of this model beyond gene expression data. However, there are many clustering methods available and they often produce quite different clusters. Goldstein, Ghosh and Conlon discuss the dependence of the clustering on the clustering algorithm and the choice of genes and samples. Chen, Jaradat, Banerjee, Tanaka, Ko and Zhang compare and evaluate different clustering algorithms in a specific study of embryonic stem cell gene expression, and give biological interpretations of the common clusters. Applying their own earlier methodology for gene subset selection and clustering, Bryan, Pollard and van der Laan present simulation results on new paired and unpaired data to demonstrate the impact of the sampling variability in covariance estimation.
Dudoit, Yang, Callow and Speed address the important question of identifying genes that are differentially expressed under two conditions (e.g., cancer or not) based on replicated cDNA microarray data. Their approach is based on multiple hypothesis testing with strong control of the family-wise type I error rate. They also cover the pre-processing of microarray images and normalization. The two papers by Li, Yan and Yuan and by Shedden study multiple gene expression slides, which may come from a time course or from different experimental conditions for the same organism. In Li, Yan and Yuan, mRNA measurements are taken consecutively over about 3 cell cycles for a strain of yeast. The authors employ principal component curves, nested models, and a graphical method to discover an interesting oscillatory pattern for over 500 genes that went unnoticed in earlier works. Shedden's experimental conditions are the different disease stages, and he represents the gene expression of a tumor sample as a linear combination of some low-dimensional time series corresponding to alternative pathways (plus noise). Such a model can be fitted easily by least squares, as illustrated for two data sets in the paper.

The remaining six papers address a variety of important problems in other areas of bioinformatics. Xu, Xu and Olman bring us to the crown jewel of biopolymer research -- the protein folding problem, i.e., the prediction of the three-dimensional fold of a protein molecule based on its sequence information. Despite its fundamental importance in molecular biology, and many years of vigorous attacks from leading scientists, this problem is still largely unsolved. For practical structure prediction, it has been shown that the statistically-based threading method is a more effective alternative to the classical molecular simulation method. This method "threads" the given protein sequence into a set of known structural templates (constructed based on proteins with known structures) and finds the most suitable sequence-template fit. Xu, Xu and Olman describe an enlightening neural-network approach for assessing the reliability of the prediction results from such threading methods. Ye presents a Bayesian algorithm for predicting secondary structures of an RNA molecule. It is demonstrated that statistical thinking is key to a proper view of these molecules, which are more flexible than proteins -- the ensemble of posterior samples of the structure reveals more important information than the optimal (single) structure prediction with the lowest energy. It is also shown that these posterior draws can be effectively used in designing antisense oligonucleotides for blocking the expression of a target gene. Li introduces a set of sophisticated statistical models for analyzing DNA sequencing data, which promise to reduce the base-calling error by orders of magnitude, an important step toward genome-based individualized medicine.

Using data on multiple polymorphic markers from sib-pairs has become very popular in genetic linkage analyses because the statistical analyses involved are relatively straightforward (with few confounding issues). In order to further improve the analysis of such data, Chang, Chen, Hsiao and Hsiung present an efficient nonparametric method to study the sib-pairs' Identity-By-Descent process. Their new method can handle both quantitative and qualitative traits in a coherent framework. Another popular type of genetic data is that of pedigrees, from which one can identify genetic markers whose alleles tend to be co-inherited with the disease status within families. However, such data can lead to erroneous results if the relationship among the individuals in the pedigrees is misspecified. McPeek proposes a mathematical framework and a likelihood-based method to identify relationship errors from genome screening data. A major drawback of both the sib-pair and the pedigree data is that they are generally not powerful enough to support the fine mapping of the disease gene, for which a population-based case-control study is often used. An important issue in these case-control studies is population admixture, i.e., mutations that differentiate populations (e.g., Asians versus Caucasians) can be confounded with disease-causing mutations when the rates of the latter mutations differ between populations. To cope with this difficulty, Zhang, Kidd and Zhao propose to first cluster the sampled individuals according to a Gaussian mixture model, based on their genotypes at a series of independent markers, and then construct the test statistic as a weighted average of allele frequency differences between cases and controls within each cluster. They show by simulations that their new method has the correct type I error and greater power than some existing methods.

We hope you enjoy reading these papers as much as we did.

Jun Liu, Cambridge, MA

Bin Yu, Berkeley, CA

Guest Editors