Wei Pan, Kyeong S. Jeong, Yang Xie and Arkady Khodursky (2008). A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data. Statistica Sinica, Vol. 18, No. 2, 709-729.

Statistica Sinica 18(2008), 709-729

A NONPARAMETRIC EMPIRICAL BAYES APPROACH TO JOINT MODELING OF MULTIPLE SOURCES OF GENOMIC DATA

Wei Pan, Kyeong S. Jeong, Yang Xie and Arkady Khodursky

University of Minnesota, University of California, Los Angeles and University of Texas Southwestern Medical Center

Abstract: With the rapid accumulation of high-throughput genomic and proteomic data, there is a need for new statistical methods that can take advantage of multiple existing sources of data. In our motivating example, a chromatin-immunoprecipitation (ChIP) microarray experiment was conducted to detect the binding target genes of a broad transcription regulator, leucine-responsive regulatory protein (Lrp), in E. coli. In addition, a cDNA microarray dataset is available that compares gene expression in the wild type with that in a mutant in which the Lrp gene is deleted. It is biologically reasonable to assume that genes with altered expression are more likely to be regulated by Lrp than those with no expression change. Hence we aim to borrow information from the gene expression data to increase the statistical power to detect the binding targets of Lrp. We propose a novel joint model for protein-DNA binding data and gene expression data; under mild modeling assumptions, the method is shown to be optimal, being equivalent to a joint likelihood ratio test. We compare the joint modeling with two existing methods of combining separate analyses. We adopt a nonparametric empirical Bayes (EB) method to draw statistical inference in the joint model; in particular, we propose a new method, maximum likelihood conditional on the binding data, to estimate two prior probabilities for the expression data, which are non-identifiable from the expression data alone. We use simulated data to demonstrate the improved performance of the joint modeling over the other approaches. Application to the Lrp data also shows that the joint modeling performs better than analyzing the binding data alone.

Key words and phrases: ChIP-chip, computational biology, false discovery rate, gene expression, Lrp, microarray.