Statistica Sinica 34 (2024), 2277-2300

DISTRIBUTED LOGISTIC REGRESSION FOR
MASSIVE DATA WITH RARE EVENTS

Xuetong Li¹, Xuening Zhu*² and Hansheng Wang¹

¹Peking University and ²Fudan University

Abstract: Large-scale rare events data are commonly encountered in practice. To handle such data, we propose a novel distributed estimation method for logistic regression. A distributed framework faces two challenges. The first is how to distribute the data across the system; we investigate two distribution strategies, namely the RANDOM strategy and the COPY strategy. The second is how to select an objective function so that the best asymptotic efficiency can be achieved; we consider the under-sampled (US) and inverse probability weighted (IPW) types of objective functions. Our results suggest that the COPY strategy combined with the IPW objective function is the best solution for distributed logistic regression with rare events. We demonstrate the finite sample performance of the distributed methods using simulation studies and a real-world Swedish Traffic Sign dataset.

Key words and phrases: Distributed system, logistic regression, massive rare events data.
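
The abstract names the strategies without defining them, so the following Python sketch rests on stated assumptions rather than on the paper's exact estimators. It assumes that RANDOM deals all observations out to workers at random, that COPY places every positive (rare) case on every worker and splits only the negatives, and that the IPW objective reweights retained negatives by the inverse of their overall retention probability. All function names, the `keep_prob` parameter, the simple averaging of local estimates, and the small `ridge` term (a numerical safeguard only) are hypothetical choices made for illustration.

```python
import numpy as np

def split_random(n, n_workers, rng):
    """RANDOM (assumed): shuffle all row indices and deal them out evenly."""
    return np.array_split(rng.permutation(n), n_workers)

def split_copy(y, n_workers, rng):
    """COPY (assumed): each worker gets all positives plus a disjoint share of negatives."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    return [np.concatenate([pos, part]) for part in np.array_split(neg, n_workers)]

def fit_weighted_logistic(X, y, w, n_iter=25, ridge=1e-8):
    """Newton iterations for a weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        hess += ridge * np.eye(X.shape[1])  # numerical safeguard, not part of the method
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def local_ipw_fit(X, y, keep_prob, neg_weight, rng):
    """Keep all positives, keep each negative with probability keep_prob, then reweight."""
    keep = (y == 1) | (rng.random(y.shape[0]) < keep_prob)
    w = np.where(y[keep] == 1, 1.0, neg_weight)
    return fit_weighted_logistic(X[keep], y[keep], w)

def distributed_ipw_copy(X, y, n_workers=4, keep_prob=0.2, seed=0):
    """COPY split, local IPW fits, then a plain average of the local estimates."""
    rng = np.random.default_rng(seed)
    parts = split_copy(y, n_workers, rng)
    # A negative reaches a given worker's subsample with probability keep_prob / n_workers.
    neg_weight = n_workers / keep_prob
    betas = [local_ipw_fit(X[idx], y[idx], keep_prob, neg_weight, rng) for idx in parts]
    return np.mean(betas, axis=0)
```

Under these assumptions, the COPY split keeps the scarce positive cases available to every local fit, whereas a RANDOM split may leave some workers with very few of them. The inverse probability weights are meant to undo the distortion in the intercept that discarding negatives would otherwise cause.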
