Statistica Sinica 34 (2024), 611-636
Li He, William Li*, Difan Song and Min Yang
Abstract: The need to analyze large amounts of data without losing information is reflected in the recent growth of interest in the information-based optimal subdata selection (IBOSS) approach. However, this framework has not been explored systematically, including a characterization of the optimal subset when the model is more complex than a first-order linear model. Motivated by a real finance case study on the effect of corporate attributes on firm value, we systematically explore the framework and the steps required to use IBOSS for data reduction. In the context of second-order models, we develop a novel algorithm for selecting informative subdata. We also evaluate the performance of the proposed algorithm in terms of prediction and variable selection; the latter is important for complex models but has not received sufficient attention in the IBOSS literature. Empirical studies demonstrate that the proposed algorithm adequately addresses the trade-off between computation complexity and statistical efficiency, one of six core research directions for theoretical data science research proposed by the US National Science Foundation. The case study demonstrates the potential impact of the IBOSS strategy in scientific fields beyond statistics, particularly finance.
Key words and phrases: Algorithm, computation complexity, IBOSS, statistical efficiency.