Abstract

High-dimensional compositional data are increasingly prevalent across diverse fields of modern scientific research. Regression analysis involving compositional data presents unique challenges, particularly when covariate measurement errors are present. These errors can propagate across composition components due to their inherent dependency structure, complicating the application of conventional error-in-variables regression techniques. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose the Error-in-Composition (Eric) Lasso, a novel method for regression analysis with high-dimensional compositional covariates subject to measurement error. We establish theoretical guarantees for Eric Lasso, including estimation error bounds and asymptotic sign-consistent variable selection properties. The finite-sample performance of the method is demonstrated through simulation studies and a real-world application.

The authors are listed in alphabetical order. Correspondence should be addressed

Information

Preprint No.: SS-2025-0223
Manuscript ID: SS-2025-0223
Complete Authors: Wenxi Tan, Lingzhou Xue, Songshan Yang, Xiang Zhan
Corresponding Authors: Lingzhou Xue
Emails: lzxue@psu.edu

Acknowledgments

The authors would like to thank the Co-Editor, the Associate Editor, and the anonymous referees for their helpful suggestions and constructive comments. The research of Tan and Xue was supported by the U.S. National Science Foundation (NSF) grant DMS-2210775 and the U.S. National Institutes of Health (NIH) grant 1R01GM152812. The research of Yang was supported by the National Key R&D Program of China 2023YFA1008702 and the National Natural Science Foundation of China (NSFC) 12301389. The research of Zhan was supported by the National Natural Science Foundation of China (grant no. 12371287).

Supplementary Materials

The online supplementary materials consist of technical proofs of theorems and additional numerical results.


Supplementary materials are available for download.