Abstract
High-dimensional compositional regression is common in microbiome studies, where covariates are
relative abundances and the number of taxa often exceeds the sample size. Standard regression methods
may be invalid for such data. While the centered log-ratio transformed linear model offers a principled
framework, it can yield unstable estimates when target samples are limited.
We develop transfer learning
procedures that borrow information from auxiliary source studies for high-dimensional compositional
regression with subcomposition structures and additional non-compositional covariates. The proposed
methods incorporate the compositional linear constraint through constrained ℓ1-regularized estimation
and allow both model and covariate shifts across studies. We propose two estimators:
Oracle-Trans-sub-Coda-Lasso, for settings where the informative sources are known, and
Trans-sub-Coda-Lasso, which detects informative sources using marginal screening statistics.
Under suitable regularity and similarity conditions, we establish the
ℓ2-norm error convergence rate of the oracle estimator and the consistency of the source-detection
procedure. Simulations and an application to ulcerative colitis gut microbiome data for body mass
index prediction demonstrate the improved performance of the proposed methods.
Key words and phrases: Transfer learning, Multisource data, Compositional data, Centered log-ratio model, High-dimensional regression
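As a brief illustration of the centered log-ratio transformation and the compositional linear constraint mentioned in the abstract, the following minimal Python sketch may help. It is not the authors' implementation: the helper `clr`, the simulated data, and the coefficient vector `beta` are hypothetical and chosen only for illustration. It shows that each centered log-ratio transformed row sums to zero and that, when the coefficients sum to zero, a linear predictor built on log-compositions coincides with one built on the transformed covariates.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of each row of a strictly positive matrix."""
    log_x = np.log(x)
    # Subtract the row-wise mean of the logs (the log of the geometric mean).
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical data: 5 samples, 8 taxa, strictly positive counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 100, size=(5, 8)).astype(float)

# Convert counts to relative abundances so that each row sums to 1.
comp = counts / counts.sum(axis=1, keepdims=True)
z = clr(comp)

# Hypothetical coefficient vector, centered so that it satisfies the
# zero-sum (compositional linear) constraint.
beta = rng.normal(size=8)
beta -= beta.mean()

# Linear predictors from log-compositions and from CLR covariates.
pred_log = np.log(comp) @ beta
pred_clr = z @ beta
```

Under the zero-sum constraint, `pred_log` and `pred_clr` agree up to floating-point error, which is one way to see why the constraint makes the fit invariant to the scale of each sample.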
Information
| Preprint No. | SS-2025-0314 |
|---|---|
| Manuscript ID | SS-2025-0314 |
| Complete Authors | Qinqin Hu, Xiaojing Luo, Chencheng Ma, Wang Zhou |
| Corresponding Authors | Qinqin Hu |
| Emails | qqhu@sdu.edu.cn |
Acknowledgments
The authors wish to thank the Co-Editor, the Associate Editor and three reviewers for
their many helpful and insightful comments and suggestions that greatly improved the paper. This work was supported in part by Shandong Provincial Natural Science Foundation
Supplementary Materials
The Supplementary Material includes the following sections: S1, the details of Algorithm 1;
S2, the proof of Theorem 1; S3, the proof of Theorem 2; S4, an additional comparison with
other constrained methods; S5, additional results from our numerical studies and real-data studies.