Wilson W. Lu and Randy R. Sitter (2008). Disclosure risk and replication-based variance estimation. Vol. 18, No. 4, 1669-1687.

Statistica Sinica 18(2008), 1669-1687

DISCLOSURE RISK AND REPLICATION-BASED VARIANCE ESTIMATION

Wilson W. Lu and Randy R. Sitter

Acadia University and Simon Fraser University

Abstract: Protecting respondents from disclosure of their identity in publicly released survey data is of practical concern to many government agencies. Methods for doing so include suppressing cluster and stratum identifiers and altering or swapping record values between respondents. Unfortunately, stratum and cluster identifiers are usually needed for variance estimation using linearization or replication methods. One might feel that releasing a set of replicate weights, with stratum and cluster identifiers suppressed, would circumvent this problem to some extent, especially when a random resampling method such as the bootstrap is used. In this article, we first demonstrate that, by viewing the replicate weights as observations in a high-dimensional space, one can easily use clustering algorithms to reconstruct the cluster identifiers, irrespective of the resampling method and even if the replicate weights are randomly altered. We then propose a fast algorithm for swapping the cluster and stratum identifiers of ultimate units before creating replicate weights, without significantly affecting the resulting variance estimates for characteristics of interest. The methods are illustrated by application to publicly released data from the National Health and Nutrition Examination Surveys, where such disclosure issues are extremely important.

Key words and phrases: Balanced repeated replication, bootstrap, confidentiality, jackknife.
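
The disclosure risk described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a delete-one-cluster jackknife with unaltered replicate weights, and it uses exact matching of weight-ratio vectors as the simplest possible stand-in for a clustering algorithm. The function names, the toy design, and the `digits` rounding parameter are all hypothetical choices made for this illustration.

```python
from collections import defaultdict


def jackknife_replicate_weights(base_weights, strata, psus):
    """Delete-one-cluster jackknife: one replicate per (stratum, cluster).

    Assumes cluster labels are unique across strata and every stratum
    contains at least two clusters.
    """
    psus_in_stratum = defaultdict(set)
    for h, j in zip(strata, psus):
        psus_in_stratum[h].add(j)

    replicates = []
    for h in sorted(psus_in_stratum):
        n_h = len(psus_in_stratum[h])
        for j in sorted(psus_in_stratum[h]):
            rep = []
            for w, h_i, j_i in zip(base_weights, strata, psus):
                if h_i != h:
                    rep.append(w)                    # other strata unchanged
                elif j_i == j:
                    rep.append(0.0)                  # deleted cluster
                else:
                    rep.append(w * n_h / (n_h - 1))  # rescale the rest
            replicates.append(rep)
    return replicates


def recover_groups(base_weights, replicates, digits=8):
    """Group units whose replicate-to-base weight ratios coincide.

    Uses only the released weights, never the stratum or cluster labels.
    Assumes all base weights are nonzero.
    """
    groups = defaultdict(list)
    for i, w in enumerate(base_weights):
        signature = tuple(round(rep[i] / w, digits) for rep in replicates)
        groups[signature].append(i)
    return sorted(groups.values())


# Toy design (hypothetical data): 2 strata, 5 clusters, 10 units.
strata = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
psus = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
base_weights = [10.0, 12.5, 8.0, 9.0, 20.0, 21.0, 15.0, 17.5, 30.0, 11.0]

replicates = jackknife_replicate_weights(base_weights, strata, psus)
recovered = recover_groups(base_weights, replicates)
print(recovered)
```

In this toy design, every unit in a cluster shares the same vector of replicate-to-base weight ratios, so the five clusters are recovered as index groups even though the identifiers were never released. With randomly altered weights, exact matching would no longer work and a distance-based clustering method would be needed instead.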