Abstract: A DNA microarray experiment simultaneously measures the expression levels of thousands of genes. An important problem is to identify genes that are differentially expressed between two types of tissue or under different experimental conditions. Because large numbers of genes are compared simultaneously, naive use of significance tests can easily lead to false positive findings. We propose a sequential procedure for estimating the empirical null distribution in multiple hypothesis testing and apply it to identify differentially expressed genes in microarray experiments. Our procedure can be viewed as a new method for estimating the q-value proposed by Storey (2002). The key idea is to obtain an estimate of the null distribution that is robust to observations from the alternative distribution. Technically, we borrow strength from the missing data literature, so that we avoid nonparametric estimation of the density function corresponding to differentially expressed genes and can focus on estimating the null density. We compare our method with Storey's original method numerically on simulated and real data examples. The numerical results show that our procedure outperforms Storey's original q-value estimates in almost all scenarios.
Key words and phrases: False discovery rate, Markov chain Monte Carlo, microarray data analysis, missing data, multiple hypothesis testing.