Abstract
We derive simple, closed-form estimators of variance components of genetic interest, including the heritability and the variance of environmental errors, in genome-wide association studies (GWAS) involving biobank-size numbers of individuals. Consistency and asymptotic normality of the proposed estimators are established. Inferential analyses, including confidence intervals and hypothesis testing, for the variance components of genetic interest are developed based on the asymptotic distribution. The method has a significant advantage over the existing BOLT-REML method, designed for such Big GWAS data, in that the latter is not capable of carrying out the inferential analyses. The new method also has a potential computational advantage. Finite-sample performance of the proposed method, in terms of both the statistical properties of the proposed estimators and confidence intervals and computational efficiency, is studied empirically and compared with the BOLT-REML method. Empirical comparison also shows that our method has a significant computational advantage over a moment-matching method called mmhe, with statistical performance similar to that of mmhe. While most of the theoretical results are established under the assumption of independent single nucleotide polymorphisms (SNPs), we demonstrate how to extend the results to the case of C-dependent SNPs. A real-data example using the UK Biobank data is discussed.
Jiming Jiang’s ORCID ID: 0000-0001-6364-4717.
Key words and phrases: ANOVA, Big GWAS Data, heritability, proportion of causal SNPs, random matrix theory, variance components
Information
| Preprint No. | SS-2025-0022 |
|---|---|
| Manuscript ID | SS-2025-0022 |
| Complete Authors | Jiming Jiang, Leqi Xu, Jiangshan Zhang, Hongyu Zhao |
| Corresponding Authors | Jiming Jiang |
| Emails | jimjiang@ucdavis.edu |
References
- [1] Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., et al. (2018), The UK Biobank resource with deep phenotyping and genomic data, Nature 562, 203–209.
- [2] Dao, C., Jiang, J., Paul, D. and Zhao, H. (2022), Variance estimation and confidence intervals from genome-wide association studies through high-dimensional misspecified mixed model analysis, J. Stat. Plan. Infer. 220, 15–23.
- [3] Diggle, P. J., Heagerty, P., Liang, K. Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data, 2nd ed., Oxford Univ. Press.
- [4] Ge, T., Chen, C. Y., Neale, B. M., Sabuncu, M. R., and Smoller, J. W. (2017), Phenome-wide heritability analysis of the UK Biobank, PLOS Genetics 13, e1006711.
- [5] Hall, P. and Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, New York.
- [6] Hansen, L. P. (1982), Large sample properties of generalized method of moments estimators, Econometrica 50, 1029–1054.
- [7] Haseman, J. and Elston, R. (1972), The investigation of linkage between a quantitative trait and a marker locus, Behav. Genet. 2, 3–19.
- [8] Hou, K., Burch, K. S., Majumdar, A., Shi, H., Mancuso, N., Wu, Y., Sankararaman, S., and Pasaniuc, B. (2019), Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture, Nature Genetics 51, 1244–1251.
- [9] Jiang, J. (2003), Empirical method of moments and its applications, J. Statist. Planning Inference 115, 69–84.
- [10] Jiang, J. (2022), Large Sample Techniques for Statistics, 2nd ed., Springer, New York.
- [11] Jiang, J. and Nguyen, T. (2021), Linear and Generalized Linear Mixed Models and Their Applications, 2nd ed., Springer, New York.
- [12] Jiang, J., Li, C., Paul, D., Yang, C., and Zhao, H. (2016), On high-dimensional misspecified mixed model analysis in genome-wide association study, Ann. Statist. 44, 2127–2160.
- [13] Jiang, J., Jiang, W., Paul, D., Zhang, Y., and Zhao, H. (2023), High-dimensional asymptotic behavior of inference based on GWAS summary statistics, Stat. Sin. 33, 1555–1576.
- [14] Lin, Z., et al. (2022), Estimating SNP heritability in presence of population substructure in biobank-scale datasets, Genetics 220(4), iyac015.
- [15] Loh, P. R., Bhatia, G., Gusev, A., Finucane, H. K., Bulik-Sullivan, B. K. et al. (2015), Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance components analysis, Nature Genetics 47, 1385–1392.
- [16] McFadden, D. (1989), A method of simulated moments for estimation of discrete response models without numerical integration, Econometrica 57, 995–1026.
- [17] Pazokitoroudi, A., Wu, Y., Burch, K.S. et al. (2020), Efficient variance components analysis across millions of genomes, Nat. Commun. 11, 4020. doi: 10.1038/s41467-020-17576-9.
- [18] Searle, S. R., Casella, G., and McCulloch, C. E. (1992), Variance Components, Wiley, New York.
- [19] Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., and Yang, J. (2017), 10 Years of GWAS Discovery: Biology, Function, and Translation, Amer. J. Hum. Genet. 101, 5–22.
- [20] Wu, Y. and Sankararaman, S. (2018), A scalable estimator of SNP heritability for biobank-scale data, Bioinformatics 34, i187–i194.
- [21] Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., et al. (2015), UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age, PLOS Medicine 12, e1001779.
Acknowledgments
Jiang’s research was partially supported by the NSF grants DMS-1713120
and DMS-1914465. Zhao’s research was partially supported by the NSF grant
DMS-1713120 and NIH grants R01 GM134005 and R01 HG012735.
Supplementary Materials
The Supplementary Material contains proofs of the theoretical results.