Subhajit Dutta, Probal Chaudhuri and Anil K. Ghosh (2014). Linear discriminant analysis of character sequences using occurrences of words. Statistica Sinica, Vol. 24, No. 1, 493-514.

Statistica Sinica 24 (2014), 493-514

LINEAR DISCRIMINANT ANALYSIS OF CHARACTER SEQUENCES USING OCCURRENCES OF WORDS

Subhajit Dutta, Probal Chaudhuri and Anil K. Ghosh

King Abdullah University of Science and Technology and Indian Statistical Institute

Abstract: Classification of character sequences, where the characters come from a finite set, arises in disciplines such as molecular biology and computer science. For discriminant analysis of such character sequences, the Bayes classifier based on Markov models turns out to have class boundaries defined by linear functions of occurrences of words in the sequences. It is shown that for such classifiers based on Markov models with unknown orders, if the orders are estimated from the data using cross-validation, the resulting classifier has Bayes risk consistency under suitable conditions. Even when Markov models are not valid for the data, we develop methods for constructing classifiers based on linear functions of occurrences of words, where the word length is chosen by cross-validation. Such linear classifiers are constructed using ideas of support vector machines, regression depth, and distance weighted discrimination. We show that classifiers with linear class boundaries have certain optimal properties in terms of their asymptotic misclassification probabilities. The performance of these classifiers is demonstrated in various simulated and benchmark data sets.

Key words and phrases: Bayes classifier, Markov and hidden Markov models, misclassification probability, order of a Markov model, -fold cross-validation, word frequency.
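
The abstract's claim that the Bayes classifier under Markov models has class boundaries that are linear in word occurrences can be illustrated with a small sketch. The code below is not taken from the paper: it assumes two classes with equal priors, first-order Markov chains on a two-letter alphabet, and made-up transition probabilities, and all function names are hypothetical. It checks that the log-likelihood ratio computed directly from the sequence equals a weighted sum of counts of words of length two, plus a term for the first character.

```python
from collections import Counter
import math


def word_counts(seq, length):
    """Count overlapping occurrences of every word of the given length."""
    return Counter(seq[i:i + length] for i in range(len(seq) - length + 1))


def log_likelihood(seq, init, trans):
    """Log-likelihood of seq under a first-order Markov chain."""
    ll = math.log(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        ll += math.log(trans[a][b])
    return ll


def linear_discriminant(seq, init1, trans1, init2, trans2):
    """Log-likelihood ratio written as a linear function of 2-word counts."""
    # Contribution of the first character (initial distributions).
    score = math.log(init1[seq[0]] / init2[seq[0]])
    # Each word "ab" contributes its count times a fixed weight.
    for word, n in word_counts(seq, 2).items():
        a, b = word
        score += n * math.log(trans1[a][b] / trans2[a][b])
    return score


# Illustrative (assumed) parameters for two classes on the alphabet {A, B}.
init = {"A": 0.5, "B": 0.5}
trans1 = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
trans2 = {"A": {"A": 0.4, "B": 0.6}, "B": {"A": 0.7, "B": 0.3}}

seq = "AABABBBA"
direct = log_likelihood(seq, init, trans1) - log_likelihood(seq, init, trans2)
linear = linear_discriminant(seq, init, trans1, init, trans2)

# With equal priors, a positive score assigns the sequence to class 1.
predicted_class = 1 if linear > 0 else 2
```

In this sketch the weights `log(trans1[a][b] / trans2[a][b])` play the role of the coefficients of the linear function; for a Markov model of order k the same argument would use counts of words of length k + 1.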