The Population and Personalized Areas Under the Receiving Operating Characteristic Curve

Haben Michael and Lu Tian

doi:10.5705/ss.202024.0100

Abstract

We consider two generalizations of the area under the receiver operating characteristic curve (“AUC”), a popular measure of discrimination, to accommodate clustered data. We describe situations in which the two cluster AUCs diverge and other situations in which they coincide. Differences are described using concrete models and visualizations, while quantitative results are used to relate the two generalizations. Procedures for joint estimation and inference are also presented, along with a simulation study. We apply the results to data collected on urban policing behavior.

Key words and phrases: AUC, Confounding, Clustered data, Simpson’s paradox

1 Introduction

The AUC is a widely used measure of how well a scalar predictor discriminates between two outcomes. As a population parameter, the AUC is the probability that the value of a randomly sampled predictor from one of the outcome classes is less than that of an independently sampled predictor from the other outcome class. There are several ways to generalize the AUC to accommodate clustered data. What we refer to as the “population AUC” appears to be the most commonly studied. The population AUC evaluates the predictor’s typical effect on an entire population, as further discussed below.
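To make the probabilistic definition of the AUC concrete, the following is a minimal sketch of its empirical counterpart for unclustered data: the fraction of (control, case) pairs in which the control's predictor value is less than the case's. The function name `empirical_auc`, the sample values, and the convention of counting ties as one half are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement either of the clustered generalizations.

```python
from itertools import product


def empirical_auc(controls, cases):
    """Estimate P(X < Y) for X drawn from `controls` and Y from `cases`.

    Every (control, case) pair is compared once. Ties contribute 1/2,
    a common convention that is assumed here rather than taken from the paper.
    """
    if not controls or not cases:
        raise ValueError("both outcome classes need at least one observation")
    total = 0.0
    for x, y in product(controls, cases):
        if x < y:
            total += 1.0
        elif x == y:
            total += 0.5
    return total / (len(controls) * len(cases))


# Hypothetical predictor values for the two outcome classes.
controls = [0.1, 0.4, 0.35]
cases = [0.8, 0.4, 0.9]
auc = empirical_auc(controls, cases)
print(round(auc, 4))
```

A value near 1 indicates that predictor values in the second class tend to exceed those in the first, while a value near 0.5 indicates little discrimination. The pairwise loop is quadratic in the sample sizes, which is acceptable for an illustration but would likely be replaced by a rank-based computation for large samples.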

Information

Preprint No.	SS-2024-0100
Manuscript ID	SS-2024-0100
Complete Authors	Haben Michael, Lu Tian
Corresponding Authors	Haben Michael
Emails	hmichael@math.umass.edu

References

Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of mathematical psychology, 12(4):387–415.
Benhin, E., Rao, J., and Scott, A. (2005). Mean estimating equation approach to analysing cluster-correlated data with nonignorable cluster sizes. Biometrika, 92(2):435–450.
Bugni, F., Canay, I., Shaikh, A., and Tabord-Meehan, M. (2022). Inference for cluster randomized experiments with non-ignorable cluster sizes. arXiv preprint arXiv:2204.08356.
Dorfman, D. D. and Alf Jr, E. (1969). Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals—rating-method data. Journal of mathematical psychology, 6(3):487–496.
Emir, B., Wieand, S., Jung, S.-H., and Ying, Z. (2000). Comparison of diagnostic markers with repeated measurements: a non-parametric ROC curve approach. Statistics in Medicine, 19(4):511–523.
Goel, S., Rao, J. M., and Shroff, R. (2016). Precinct or prejudice? Understanding racial disparities in New York City’s stop-and-frisk policy. The Annals of Applied Statistics, 10(1):365–394.
Hanley, J. A. (1988). The robustness of the “binormal” assumptions used in fitting ROC curves. Medical decision making, 8(3):197–203.
Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
Lee, M.-L. T. and Dehling, H. G. (2005). Generalized two-sample U-statistics for clustered data. Statistica Neerlandica, 59(3):313–323.
Lindley, D. V. and Novick, M. R. (1981). The role of exchangeability in inference. The annals of statistics, pages 45–58.
Liu, H., Li, G., Cumberland, W. G., Wu, T., et al. (2005). Testing statistical significance of the area under a receiving operating characteristics curve for repeated measures design with bootstrapping. Journal of Data Science, 3(3):257–278.
Michael, H., Tian, L., and Ghebremichael, M. (2019). The ROC curve for regularly measured longitudinal biomarkers. Biostatistics, 20(3):433–451.
Obuchowski, N. A. (1997). Nonparametric analysis of clustered ROC curve data. Biometrics, 53:567–578.
Pearl, J. (2014). Comment: Understanding Simpson’s paradox. The American Statistician, 68(1):8–13.
Ridgeway, G. (2006). Assessing the effect of race bias in post-traffic stop outcomes using propensity scores. Journal of Quantitative Criminology, 22(1):1–29.
Ridgeway, G. and MacDonald, J. M. (2009). Doubly robust internal benchmarking and false discovery rates for detecting racial bias in police stops. Journal of the American Statistical Association, 104(486):661–668.
Rosner, B. and Grove, D. (1999). Use of the Mann–Whitney U-test for clustered data. Statistics in medicine, 18(11):1387–1400.
Sen, P. K. (1960). On Some Convergence Properties of U-statistics. Calcutta Statistical Association Bulletin, 10(1-2):1–18.
Toledano, A. Y. (2003). Three methods for analysing correlated ROC curves: a comparison in real data sets from multi-reader, multi-case studies with a factorial design. Statistics in medicine, 22(18):2919–2933.
Wu, Y. and Wang, X. (2011). Optimal weight in estimating and comparing areas under the receiver operating characteristic curve using longitudinal data. Biometrical journal, 53(5):764–778.

Acknowledgments

The authors wish to thank Prof. Maria Cuellar for helpful consultation regarding the data analysis, and an anonymous reviewer for contributing the substance of Prop. 1(2).
