
Statistica Sinica 25 (2015),

SPARSE QUADRATIC DISCRIMINANT ANALYSIS
FOR HIGH DIMENSIONAL DATA
Quefeng Li and Jun Shao
University of Wisconsin-Madison and East China Normal University

Abstract: Many contemporary studies involve the classification of a subject into two classes based on n observations of the p variables associated with the subject. Under the assumption that the variables are normally distributed, the well-known linear discriminant analysis (LDA) assumes a common covariance matrix over the two classes, while the quadratic discriminant analysis (QDA) allows different covariance matrices. When p is much smaller than n, even if they both diverge, the LDA and QDA have the smallest asymptotic misclassification rates for the cases of equal and unequal covariance matrices, respectively. However, modern statistical studies often face classification problems in which the number of variables p is much larger than the sample size n, and the classical LDA and QDA can perform poorly. In fact, we give an example in which the QDA performs as poorly as random guessing even if we know the true covariances. Under some sparsity conditions on the unknown means and covariance matrices of the two classes, we propose a sparse QDA based on thresholding that has the smallest asymptotic misclassification rate conditional on the training data. We discuss an example of classifying normal and tumor colon tissues based on a set of p = 2,000 genes and a sample of size n = 62, and another example of a cardiovascular study for n = 222 subjects with p = 2,434 genes. A simulation is also conducted to check the performance of the proposed method.

Key words and phrases: Classification, high dimensionality, normality, smallest asymptotic misclassification rate, sparsity estimates, unequal covariance matrices.
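
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' estimator: a generic hard-thresholding step applied to a covariance estimate, and the textbook two-class QDA score under equal priors. The function names `hard_threshold` and `qda_score`, the threshold `t`, and the equal-prior assumption are all illustrative choices that do not come from the paper.

```python
import numpy as np

def hard_threshold(S, t):
    """Zero out off-diagonal entries of S with |entry| <= t; keep the diagonal.

    `t` is a hypothetical tuning parameter; the paper's own thresholding
    rule is not reproduced here.
    """
    S = np.asarray(S, dtype=float)
    T = np.where(np.abs(S) > t, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

def qda_score(x, mu1, mu2, S1, S2):
    """Textbook QDA score under equal priors (an assumption).

    Returns (x-mu2)' S2^{-1} (x-mu2) - (x-mu1)' S1^{-1} (x-mu1)
            + log det S2 - log det S1.
    Classify x to class 1 when the score is >= 0, otherwise to class 2.
    """
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.solve(S1, d1)
    q2 = d2 @ np.linalg.solve(S2, d2)
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return (q2 - q1) + (logdet2 - logdet1)

# Small illustration: threshold a noisy covariance estimate, then score a point.
S1_hat = hard_threshold(np.array([[1.0, 0.05], [0.05, 1.0]]), t=0.1)
S2_hat = np.eye(2)
score = qda_score(np.zeros(2), np.zeros(2), np.array([1.0, 0.0]), S1_hat, S2_hat)
label = 1 if score >= 0 else 2
```

A thresholded covariance estimate is not guaranteed to be positive definite, so a practical version would need to check or enforce that before solving the linear systems.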
