

Statistica Sinica 16(2006), 323-351





OBSERVATIONS ON BAGGING


Andreas Buja and Werner Stuetzle


University of Pennsylvania and University of Washington


Abstract: Bagging is a device intended for reducing the prediction error of learning algorithms. In its simplest form, bagging draws bootstrap samples from the training sample, applies the learning algorithm to each bootstrap sample, and then averages the resulting prediction rules. More generally, the resample size $M$ may be different from the original sample size $N$, and resampling can be done with or without replacement. We investigate bagging in a simplified situation: the prediction rule produced by a learning algorithm is replaced by a simple real-valued U-statistic of i.i.d. data. U-statistics of high order can describe complex dependencies, and yet they admit a rigorous asymptotic analysis. We show that bagging U-statistics often but not always decreases variance, whereas it always increases bias. The most striking finding, however, is an equivalence between bagging based on resampling with and without replacement: the respective resample sizes $M_{with} = \alpha_{with} N$ and $M_{w/o} = \alpha_{w/o} N$ produce very similar bagged statistics if $\alpha_{with} = \alpha_{w/o} / (1-\alpha_{w/o})$. While our derivation is limited to U-statistics, the equivalence seems to be universal. We illustrate this point in simulations where bagging is applied to CART trees.
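The bagging procedure and the with/without-replacement equivalence described in the abstract can be illustrated numerically. The following is a minimal sketch (not taken from the paper): it bags the sample variance, a U-statistic of order two, averaging it over $B$ resamples of size $M$, and compares resampling with replacement at $\alpha_{with} = 1$ against resampling without replacement at $\alpha_{w/o} = 1/2$, a pair related by $\alpha_{with} = \alpha_{w/o}/(1-\alpha_{w/o})$. The function and parameter names are illustrative choices, not notation from the paper.

```python
import numpy as np

def bagged_statistic(x, stat, M, replace=True, B=2000, seed=None):
    """Bagged estimate: average `stat` over B resamples of size M,
    drawn with or without replacement from the training sample x."""
    rng = np.random.default_rng(seed)
    vals = [stat(rng.choice(x, size=M, replace=replace)) for _ in range(B)]
    return float(np.mean(vals))

# The sample variance with divisor (M - 1) is a U-statistic of order 2.
var_u = lambda s: s.var(ddof=1)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
N = len(x)

# Equivalent pair per the paper: alpha_with = alpha_wo / (1 - alpha_wo).
alpha_wo = 0.5
alpha_with = alpha_wo / (1 - alpha_wo)   # = 1.0, the classical bootstrap

b_with = bagged_statistic(x, var_u, M=int(alpha_with * N), replace=True, seed=1)
b_wo = bagged_statistic(x, var_u, M=int(alpha_wo * N), replace=False, seed=2)

# The two bagged statistics agree up to Monte Carlo error and an O(1/N)
# bias term, in line with the equivalence stated above.
print(b_with, b_wo)
```

With replacement and $M = N$, bagging the variance shrinks it by the factor $(N-1)/N$ toward the plug-in estimate, a small instance of the bias increase the abstract mentions; the without-replacement subsample variance remains unbiased, so the two bagged values differ only at order $1/N$.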



Key words and phrases: Bagging, CART, U-statistics.
