Abstract

Receiver Operating Characteristic (ROC) curves are commonly used to evaluate the performance of classification or prediction algorithms. However, the uncertainty assessment of such algorithms is addressed less rigorously in the literature. In this article, we examine the limitations of the popular bootstrap method for ROC uncertainty quantification, illustrated through a simple yet essential model-based classification approach. We show that our proposed approach based on conformal prediction provides a valid solution for quantifying the uncertainty of the ROC curve and the Youden index. Both theoretical and numerical results corroborate the improved uncertainty quantification of conformal inference over the bootstrap method.
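
For reference, the two quantities named in the abstract have the following standard definitions. The notation (a score S, a binary label Y, a threshold c) is assumed here for illustration and is not taken from the paper; FPR^{-1} denotes a generalized inverse.

```latex
\begin{align*}
\mathrm{TPR}(c) &= P(S > c \mid Y = 1), \qquad
\mathrm{FPR}(c) = P(S > c \mid Y = 0), \\
\mathrm{ROC}(t) &= \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(t)\bigr), \qquad t \in [0, 1], \\
J &= \max_{c}\,\bigl\{\mathrm{TPR}(c) - \mathrm{FPR}(c)\bigr\}
   = \max_{c}\,\bigl\{\text{sensitivity}(c) + \text{specificity}(c) - 1\bigr\}.
\end{align*}
```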

Information

Preprint No.: SS-2025-0127
Manuscript ID: SS-2025-0127
Complete Authors: Zheshi Zheng, Bo Yang, Peter Song
Corresponding Authors: Peter Song
Emails: pxsong@umich.edu

Acknowledgments

We thank Yahui Zhang for her help with the DNA methylation data used in the analysis. We thank the Associate Editor and two anonymous reviewers for their constructive comments and insightful suggestions, which have helped us significantly improve the paper. This research was supported by U.S. National Institutes of Health grant R01ES033656.

A. Bootstrap and conformal algorithms for ROC confidence bands
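
The appendix algorithms themselves are not reproduced on this page. As a rough illustration of the kind of bootstrap procedure the abstract refers to, the sketch below computes an empirical ROC curve, the Youden index, and a pointwise percentile bootstrap band. It is a generic textbook construction, not the authors' algorithm: the function names (`roc_points`, `tpr_at`, `youden_index`, `bootstrap_roc_band`), the class-stratified resampling, and the assumption of untied scores are all choices made here for illustration.

```python
import numpy as np


def roc_points(scores, labels):
    """Empirical ROC points; assumes higher score = more likely positive, no ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # sort by decreasing score
    y = labels[order]
    tps = np.cumsum(y == 1)                      # true positives as threshold drops
    fps = np.cumsum(y == 0)                      # false positives as threshold drops
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr


def tpr_at(fpr, tpr, fpr_grid):
    """Largest empirical TPR achievable with FPR <= t, for each t in fpr_grid."""
    idx = np.searchsorted(fpr, np.asarray(fpr_grid), side="right") - 1
    return tpr[idx]


def youden_index(fpr, tpr):
    """Youden index J = max over thresholds of TPR - FPR."""
    return float(np.max(tpr - fpr))


def bootstrap_roc_band(scores, labels, fpr_grid, n_boot=500, alpha=0.05, seed=0):
    """Pointwise percentile bootstrap band for the ROC curve and interval for J.

    Resampling is stratified by class (an assumption of this sketch) so that
    every bootstrap sample contains both positives and negatives.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    curves = np.empty((n_boot, len(fpr_grid)))
    youden = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size),  # sample with replacement
            rng.choice(neg_idx, size=neg_idx.size),
        ])
        fpr, tpr = roc_points(scores[idx], labels[idx])
        curves[b] = tpr_at(fpr, tpr, fpr_grid)
        youden[b] = youden_index(fpr, tpr)
    lower, upper = np.quantile(curves, [alpha / 2, 1 - alpha / 2], axis=0)
    j_lo, j_hi = np.quantile(youden, [alpha / 2, 1 - alpha / 2])
    return lower, upper, (float(j_lo), float(j_hi))
```

Calling `bootstrap_roc_band(scores, labels, np.linspace(0, 1, 21))` returns the lower and upper band evaluated on the grid together with a percentile interval for the Youden index. The band is pointwise only, so it makes no claim of simultaneous coverage over the whole curve.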

Supplementary Materials

The Supplementary Material contains additional technical details concerning the proofs of Proposition 1 in Example 1, Proposition 2, and Proposition 3. It also gives additional numerical results, including coverage rates for both bootstrap and conformal confidence bands at different confidence levels using simulated data, and supplementary figures on performance in the sexual maturation prediction analysis. The R code used to generate the numerical results in this paper has been made publicly available at the GitHub repository:


Supplementary materials are available for download.