Abstract: Large-scale rare events data are commonly encountered in practice. To handle massive rare events data, we propose a novel distributed estimation method for logistic regression. A distributed framework faces two challenges. The first is how to distribute the data. Here, we investigate two distribution strategies, namely, the RANDOM strategy and the COPY strategy. The second is how to select an appropriate type of objective function so that the best asymptotic efficiency can be achieved. Here, we consider the under-sampled (US) and inverse probability weighted (IPW) types of objective functions. Our results suggest that the COPY strategy with the IPW objective function is the best solution for distributed logistic regression with rare events. We demonstrate the finite-sample performance of the distributed methods using simulation studies and a real-world Swedish Traffic Sign dataset.
Key words and phrases: Distributed system, logistic regression, massive rare events data.
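
As a rough illustration of the combination favoured above, the following Python sketch assumes one plausible reading of the two ingredients. Under the assumed COPY layout, every worker receives all positive instances plus a disjoint random 1/K share of the negatives. Under the assumed IPW objective, each negative is weighted by K, the inverse of its probability of landing on a given worker. These definitions, the helper names (`copy_shards`, `ipw_fit`, `distributed_ipw`), the simulated single-covariate data, and the simple averaging of worker estimates are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_logistic(xs, ys, ws, n_iter=100, tol=1e-10):
    """Newton-Raphson for a weighted logistic model P(y=1|x) = sigmoid(a + b*x)."""
    ybar = sum(w * y for w, y in zip(ws, ys)) / sum(ws)
    a, b = math.log(ybar / (1.0 - ybar)), 0.0  # start at the weighted log-odds
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, ws):
            p = sigmoid(a + b * x)
            r = w * (y - p)        # weighted score contribution
            v = w * p * (1.0 - p)  # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b


def simulate(n, a, b, rng):
    # Hypothetical rare-events data: a strongly negative intercept keeps positives scarce.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(a + b * x) else 0 for x in xs]
    return xs, ys


def copy_shards(xs, ys, k, rng):
    """Assumed COPY layout: each worker gets all positives and a disjoint 1/k slice of negatives."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    rng.shuffle(neg)
    return [(pos, neg[j::k]) for j in range(k)]


def ipw_fit(pos, neg, k):
    # Assumed IPW objective: positives have weight 1, negatives weight k (inverse of 1/k).
    xs = pos + neg
    ys = [1] * len(pos) + [0] * len(neg)
    ws = [1.0] * len(pos) + [float(k)] * len(neg)
    return fit_weighted_logistic(xs, ys, ws)


def distributed_ipw(xs, ys, k, rng):
    # Fit on each worker independently, then average the worker estimates.
    fits = [ipw_fit(pos, neg, k) for pos, neg in copy_shards(xs, ys, k, rng)]
    return tuple(sum(f[i] for f in fits) / k for i in (0, 1))


rng = random.Random(0)
xs, ys = simulate(20000, -4.0, 1.0, rng)
full = fit_weighted_logistic(xs, ys, [1.0] * len(xs))  # single-machine benchmark
dist = distributed_ipw(xs, ys, 4, rng)
print("full-data fit:   ", full)
print("COPY + IPW (k=4):", dist)
```

In this toy setting the averaged worker estimate can be compared directly with the single-machine fit. A RANDOM layout, or an unweighted US-style fit with an intercept correction, could be swapped in through the same `fit_weighted_logistic` routine for a side-by-side comparison.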