Jian Huang, Yuling Jiao, Jin Liu and Can Yang (2021). REMI: REGRESSION WITH MARGINAL INFORMATION AND ITS APPLICATION IN GENOME-WIDE ASSOCIATION STUDIES. Vol 31 No. 4, 1985-2004.

Abstract: We consider the problem of variable selection and estimation in high-dimensional linear regression models when complete data are not accessible, but certain marginal information or summary statistics are available. This problem is motivated by genome-wide association studies (GWASs) with millions of genotyped single nucleotide polymorphisms (SNPs), which have been widely used to identify risk variants for complex human traits and diseases. With the large number of completed GWASs, statistical methods using summary statistics have become increasingly important because of the inaccessibility of individual-level data. In this study, we propose the regression with marginal information (REMI) method, an ℓ₁-penalized approach that uses estimated marginal effects together with a covariance matrix of the predictors estimated from external reference samples. The proposed method is highly scalable and capable of analyzing multiple GWAS data sets from hundreds of thousands of individuals and a large number of SNPs. We also establish an upper bound on the error of the REMI estimator, which has the same order as the minimax error bound of the Lasso with complete individual-level data. We conduct simulation studies to evaluate the performance of the proposed method. An interesting finding is that when a large number of marginal estimates is available together with a small number of reference samples, as in a GWAS, the proposed method yields good estimation and prediction results, outperforming the Lasso applied to complete data with a relatively small sample size. We apply the proposed method to GWAS data on 10 traits from the Northern Finland Birth Cohorts program. In particular, the real-data analysis results indicate that a summary-level-based analysis using the REMI outperforms an individual-level-based analysis when the sample size of the summary-level data is larger than that of the individual-level data.
In summary, our theoretical and real-data results provide solid support for a summary-level-based analysis. As a result, polygenic risk scores for a wide variety of complex diseases can be obtained using summary statistics with theoretically guaranteed performance. The developed R package and the code to reproduce the results are available at https://github.com/gordonliu810822/REMI.
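
To illustrate the idea of an ℓ₁-penalized fit that needs only marginal information and a reference covariance matrix, the following is a minimal sketch, not the REMI package's implementation. It assumes standardized predictors, a vector r of marginal covariances between each predictor and the response, and a matrix sigma estimated from reference samples, and it minimizes (1/2) βᵀ sigma β − βᵀ r + λ‖β‖₁ by coordinate descent. The function names, the exact form of the objective, and the tuning values are assumptions made for illustration only.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator used in Lasso coordinate descent."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def l1_from_summary(r, sigma, lam, n_iter=200, tol=1e-8):
    """Hypothetical helper: l1-penalized regression from summary statistics.

    r     : (p,) marginal covariances x_j'y / n (assumed input form)
    sigma : (p, p) predictor covariance estimated from reference samples,
            assumed to have a strictly positive diagonal
    lam   : l1 penalty level
    """
    p = r.shape[0]
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Marginal signal for j after removing the contribution
            # of all other coordinates through the reference covariance.
            rho = r[j] - sigma[j] @ beta + sigma[j, j] * beta[j]
            new = soft_threshold(rho, lam) / sigma[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            # Stop once a full sweep changes no coefficient noticeably.
            break
    return beta
```

With an identity reference covariance the update reduces to soft-thresholding each marginal effect, which makes the role of the penalty easy to check by hand; with correlated predictors the off-diagonal entries of sigma adjust each marginal effect for its neighbours.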

Key words and phrases: Genome-wide association studies, high dimensional regression, marginal information, polygenic risk score.