
Statistica Sinica 26 (2016), 1-34

Jiashun Jin and Zheng Tracy Ke
Carnegie Mellon University and University of Chicago

Abstract: Often when we deal with ‘Big Data’, the true effects we are interested in are Rare and Weak (RW). Researchers measure a large number of features, expecting that perhaps only a small fraction of them are relevant to the research in question; the effect sizes of the relevant features are individually small, so the true effects are not strong enough to stand out on their own.

Higher Criticism (HC) and Graphlet Screening (GS) are two classes of methods that are specifically designed for the Rare/Weak settings. HC was introduced to determine whether there are any relevant effects in all the measured features. More recently, HC was applied to classification, where it provides a method for selecting useful predictive features for trained classification rules. GS was introduced as a graph-guided multivariate screening procedure, and was used for variable selection.
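To make the detection use of HC concrete, here is a minimal sketch of one common form of the HC statistic computed from p-values. It is an illustration under stated assumptions, not necessarily the exact variant studied in the paper: the standardization by sqrt(p(1 - p)), the search range `alpha0 = 0.5`, and the helper names `higher_criticism`, `simulate_z_scores`, and `one_sided_p_values` are all choices made here for illustration.

```python
import math
import random
from statistics import NormalDist


def higher_criticism(p_values, alpha0=0.5):
    """Return (hc_star, k) for one common form of the HC statistic.

    hc_star is the maximum over the smallest alpha0 * n sorted p-values of
        sqrt(n) * (i / n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    and k is the index i at which the maximum is attained.
    The search range alpha0 is an assumption of this sketch.
    """
    n = len(p_values)
    sorted_p = sorted(p_values)
    best_score, best_k = -math.inf, 0
    max_i = max(1, int(alpha0 * n))
    for i in range(1, max_i + 1):
        p = sorted_p[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # the standardized score is undefined at 0 and 1
        score = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        if score > best_score:
            best_score, best_k = score, i
    return best_score, best_k


def simulate_z_scores(n, n_signals, tau, rng):
    """Toy rare/weak data: n standard normal z-scores, the first n_signals shifted by tau."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for j in range(n_signals):
        z[j] += tau
    return z


def one_sided_p_values(z_scores):
    """Upper-tail p-values under the N(0, 1) null."""
    std_normal = NormalDist()
    return [std_normal.cdf(-z) for z in z_scores]


if __name__ == "__main__":
    rng = random.Random(0)
    null_z = simulate_z_scores(10_000, 0, 0.0, rng)
    mixed_z = simulate_z_scores(10_000, 100, 3.0, rng)
    print("HC, pure null:   ", higher_criticism(one_sided_p_values(null_z)))
    print("HC, rare signals:", higher_criticism(one_sided_p_values(mixed_z)))
```

A large value of the statistic suggests that some features carry signal. In the classification use mentioned above, the maximizing index `k` can serve as a data-driven cutoff for how many top-ranked features to keep.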

We develop a theoretical framework based on an Asymptotic Rare and Weak (ARW) model that simultaneously controls the size and prevalence of useful/significant features among the useless/null bulk. At the heart of the ARW model is the so-called phase diagram, which clearly visualizes the class of ARW settings where the relevant effects are so rare or weak that the desired goals (signal detection, variable selection, etc.) are simply impossible to achieve. We show that HC and GS have important advantages over better-known procedures and achieve the optimal phase diagrams in a variety of ARW settings.
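As an illustration of how such a model ties rarity and weakness to a single phase diagram, the following shows one standard calibration from the higher criticism literature for the signal detection problem. The symbols $p$, $\beta$, $r$, $\epsilon_p$, $\tau_p$, and $\rho^*$ are notation assumed here, and the ARW settings treated in the paper may be more general.

```latex
% Rare/weak normal mixture for p test statistics (notation assumed for illustration)
X_j \overset{\text{iid}}{\sim} (1 - \epsilon_p)\, N(0, 1) + \epsilon_p\, N(\tau_p, 1),
\qquad j = 1, \dots, p,

% Rarity and weakness are tied to p through two parameters
\epsilon_p = p^{-\beta}, \quad 0 < \beta < 1,
\qquad
\tau_p = \sqrt{2 r \log p}, \quad r > 0.

% Detection boundary in the (\beta, r) plane for 1/2 < \beta < 1
\rho^*(\beta) =
\begin{cases}
  \beta - \tfrac{1}{2}, & \tfrac{1}{2} < \beta \le \tfrac{3}{4}, \\[4pt]
  \bigl(1 - \sqrt{1 - \beta}\bigr)^2, & \tfrac{3}{4} < \beta < 1.
\end{cases}
```

In this calibration, when $r < \rho^*(\beta)$ no test can reliably separate the mixture from the pure null, while for $r > \rho^*(\beta)$ reliable detection is possible; the curve $r = \rho^*(\beta)$ is what splits the phase diagram into these regions.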

HC and GS are flexible ideas that adapt easily to many interesting situations. We review the basics of these ideas and some of the recent extensions, discuss their connections to existing literature, and suggest some new applications of these ideas.

Key words and phrases: Classification, control of FDR, feature ranking, feature selection, graphlet screening, Hamming distance, higher criticism, large-scale inference, rare and weak effects, phase diagram, sparse precision matrix, sparse signal detection, variable selection.
