
Statistica Sinica 33 (2023), 2041-2064

SUBSAMPLING AND JACKKNIFING: A PRACTICALLY
CONVENIENT SOLUTION FOR LARGE DATA ANALYSIS
WITH LIMITED COMPUTATIONAL RESOURCES

Shuyuan Wu¹, Xuening Zhu² and Hansheng Wang¹

¹Peking University and ²Fudan University

Abstract: Modern statistical analysis often involves large data sets, for which conventional estimation methods are unsuitable owing to limited computational resources. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. We then obtain multiple subsamples of greatly reduced size using simple random sampling with replacement. We do not recommend sampling without replacement, because it incurs a significant data-processing cost when the data reside on a hard drive; such a cost does not arise if the data can be processed in memory. Because the subsampled data sets are relatively small, they can be comfortably read into computer memory and processed. From each subsampled data set, a jackknife-debiased estimator of the target parameter is obtained. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal. Furthermore, its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is easily implemented on most computer systems, and thus is widely applicable.
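To make the recipe described in the abstract concrete, the following is a minimal Python sketch of one possible implementation: draw subsamples with replacement, apply the classical jackknife bias correction on each subsample, and average the debiased estimates. The estimator, the subsample size n_sub, the number of subsamples k, and the toy target (E[X])² are illustrative assumptions for this sketch, not the authors' settings.

```python
# Minimal sketch of subsample-then-jackknife averaging (illustrative only).
import numpy as np

def jackknife_debias(subsample, estimator):
    """Jackknife bias-corrected estimate on one subsample."""
    n = len(subsample)
    theta_full = estimator(subsample)
    # Leave-one-out estimates.
    loo = np.array([estimator(np.delete(subsample, i, axis=0)) for i in range(n)])
    # Classical jackknife correction: n * theta - (n - 1) * mean(leave-one-out).
    return n * theta_full - (n - 1) * loo.mean(axis=0)

def subsample_jackknife_estimate(data, estimator, n_sub=500, k=20, seed=None):
    """Average jackknife-debiased estimates over k subsamples drawn
    with replacement from the full data (treated as the population)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(k):
        idx = rng.integers(0, len(data), size=n_sub)  # sampling with replacement
        estimates.append(jackknife_debias(data[idx], estimator))
    return np.mean(estimates, axis=0)

# Toy usage: estimate theta = (E[X])^2, whose plug-in estimator is biased.
if __name__ == "__main__":
    x = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=1_000_000)
    theta_hat = subsample_jackknife_estimate(x, lambda s: s.mean() ** 2, seed=1)
    print(theta_hat)  # close to the true value 4.0
```

In practice, each subsample would be drawn directly from data stored on disk rather than from an in-memory array; the sketch only illustrates the debias-then-average structure of the estimator.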

Key words and phrases: GPU, jackknife, large dataset, subsampling.
