Statistica Sinica: Volume 28, Number 4, October 2018
http://www3.stat.sinica.edu.tw/statistica/
Thu, 4 October 2018 00:01:00 +0000
/statistica/J28N4/J28N41/J28N41.html
A MIXED-EFFECTS ESTIMATING EQUATION APPROACH TO NONIGNORABLE MISSING LONGITUDINAL DATA WITH REFRESHMENT SAMPLES Xuan Bi and Annie Qu 1653-1675<span style='font-size=12pt;'><center>Abstract</center> Nonignorable missing data occur frequently in longitudinal studies and can cause biased estimation. Refreshment samples, which draw new subjects randomly from the original population in subsequent waves, can mitigate this bias. In this paper, we introduce a mixed-effects estimating equation approach that enables one to incorporate refreshment samples and recover informative missing information from the measurement process. We show that the proposed method achieves consistency and asymptotic normality for fixed-effect estimation under shared-parameter models, and we extend it to a more general nonignorable-missing framework. Our finite sample simulation studies show the effectiveness and robustness of the proposed method under different missing mechanisms. In addition, we apply our method to election poll longitudinal survey data with refreshment samples from the 2007-2008 Associated Press Yahoo! News poll.<p>Key words and phrases: Missing not at random, non-monotone missing pattern, quadratic inference function, shared-parameter model, survey data.</span>
/statistica/J28N4/J28N410/J28N410.html
FUNCTIONAL LINEAR REGRESSION MODELS FOR NONIGNORABLE MISSING SCALAR RESPONSES Tengfei Li, Fengchang Xie, Xiangnan Feng, Joseph G. Ibrahim, Hongtu Zhu and the Alzheimer's Disease Neuroimaging Initiative 1867-1886<span style='font-size=12pt;'><center>Abstract</center> As an important part of modern health care, medical imaging data, which can be regarded as densely sampled functional data, have been widely used for diagnosis, screening, treatment, and prognosis, such as for finding breast cancer through mammograms. The aim of this paper is to propose a functional linear regression model for using functional (or imaging) predictors to predict clinical outcomes (e.g., disease status), while addressing missing clinical outcomes. We introduce an exponential tilting semiparametric model to account for the nonignorable missing data mechanism. We develop a set of estimating equations and the associated computational methods for both parameter estimation and the selection of the tuning parameters. We also propose a bootstrap resampling procedure for carrying out statistical inference. We systematically establish the asymptotic properties (e.g., consistency and convergence rate) of the estimates calculated from the proposed estimating equations. Simulation studies and a data analysis are used to illustrate the finite sample performance of the proposed methods. <p>Key words and phrases: Estimating equation, exponential tilting, functional data, imaging data, nonignorable missing data, tuning parameters.</span>
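For intuition only (a minimal sketch under assumed notation, not the authors' estimating-equation implementation), the exponential tilting idea is that the log-odds of observing the response are shifted by a term proportional to the possibly unobserved outcome; the tilting parameter `gamma` below is a hypothetical name for that shift:

```python
import math

def tilted_response_prob(x_score, y, gamma):
    """Exponential-tilting response model sketch:
    logit P(R = 1 | x, y) = x_score + gamma * y.
    With gamma != 0, missingness depends on the outcome itself (MNAR);
    gamma = 0 reduces the model to missing at random given x."""
    return 1.0 / (1.0 + math.exp(-(x_score + gamma * y)))

# With gamma = 0, the outcome value is irrelevant to missingness.
print(tilted_response_prob(0.0, 5.0, 0.0))  # prints 0.5
```

The design choice worth noting is that the single parameter `gamma` indexes the departure from missing at random, which is why the abstracts in this issue treat its estimation (or sensitivity analysis over it) as the central difficulty.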
/statistica/J28N4/J28N411/J28N411.html
ASSESSMENT OF NONIGNORABLE LOG-LINEAR MODELS FOR AN INCOMPLETE CONTINGENCY TABLE Seongyong Kim and Daeyoung Kim 1887-1905<span style='font-size=12pt;'><center>Abstract</center> A challenging problem in the analysis of an incomplete contingency table is that the use of nonignorable nonresponse models requires explicit specification of the missing data mechanism. In this paper we propose a data analytic approach to aid in distinguishing between plausible nonignorable log-linear models for an incomplete contingency table. The proposed method involves the computation of a set of response odds and nonresponse odds that are directly connected with the magnitude of the parameters representing the types of nonignorable mechanism assumed in the log-linear models. These odds can be easily estimated from the observed counts. We illustrate the performance of the proposed method with simulations and a data example. We also discuss the generalizability of the proposed method in two directions: its applicability to a three-way incomplete contingency table and its applicability to nonignorable nonresponse models other than the log-linear models.<p>Key words and phrases: Contingency table, log-linear model, nonignorable nonresponse.</span>
/statistica/J28N4/J28N412/J28N412.html
A ROBUST CALIBRATION-ASSISTED METHOD FOR LINEAR MIXED EFFECTS MODEL UNDER CLUSTER-SPECIFIC NONIGNORABLE MISSINGNESS Yongchan Kwon, Jae Kwang Kim, Myunghee Cho Paik and Hongsoo Kim 1907-1928<span style='font-size=12pt;'><center>Abstract</center> We propose a method for linear mixed effects models when the covariates are completely observed but the outcome of interest is subject to cluster-specific nonignorable (CSNI) missingness. Our strategy is to replace missing quantities in the full-data objective function with unbiased predictors derived from inverse probability weighting and a calibration technique. The proposed approach can be applied to estimating equations or to likelihood functions with a modified E-step, and does not require numerical integration as previous methods do. Unlike standard inverse probability weighting, the proposed method does not require correct specification of the response model as long as the CSNI assumption holds, and renders inference under CSNI without a full distributional assumption. Consistency and asymptotic normality are established, together with a consistent variance estimator. Simulation results and a data example are presented.<p>Key words and phrases: Calibration method, cluster-specific nonignorable missingness, inverse probability weighting, nonignorable missingness.</span>
/statistica/J28N4/J28N413/J28N413.html
A ROBUST ALIBRATION-ASSISTED METHOD FOR LINEAR MIXED EFFECTS MODEL UNDER CLUSTER-SPECIFIC NONIGNORABLE MISSINGNESS Yongchan Kwon, Jae Kwang Kim, Myunghee Cho Paik and Hongsoo Kim 1907-1928<span style='font-size=12pt;'><center>Abstract</center> Missing data are frequently encountered in longitudinal clinical trials. To better monitor and understand the progress over time, one must handle the missing data appropriately and examine whether the missing data mechanism is ignorable or nonignorable. In this article, we develop a new probit model for longitudinal binary response data. It resolves a challenging issue for estimating the variance of the random effects, and substantially improves the convergence and mixing of the Gibbs sampling algorithm. We show that when improper uniform priors are specified for the regression coefficients of the joint multinomial model via a sequence of one-dimensional conditional distributions for the missing data indicators under nonignorable missingness, the joint posterior distribution is improper. A variation of Jeffreys prior is thus established as a remedy for the improper posterior distribution. In addition, an efficient Gibbs sampling algorithm is developed using a collapsing technique. Two model assessment criteria, the deviance information criterion (DIC) and the logarithm of the pseudomarginal likelihood (LPML), are used to guide the choices of prior specifications and to compare the models under different missing data mechanisms. We report on extensive simulations conducted to investigate the empirical performance of the proposed methods. The proposed methodology is further illustrated using data from an HIV prevention clinical trial. <p>Key words and phrases: Collapsed Gibbs sampler, DIC, identifiability, Jeffreys prior, latent variable, LPML, probit model. </span>
/statistica/J28N4/J28N414/J28N414.html
SEMIPARAMETRIC ESTIMATION WITH DATA MISSING NOT AT RANDOM USING AN INSTRUMENTAL VARIABLE BaoLuo Sun, Lan Liu, Wang Miao, Kathleen Wirth, James Robins and Eric J. Tchetgen Tchetgen 1965-1983<span style='font-size=12pt;'><center>Abstract</center> Missing data occur frequently in empirical studies in the health and social sciences, and can compromise our ability to obtain valid inference. An outcome is said to be missing not at random (MNAR) if, conditional on the observed variables, the missing data mechanism still depends on the unobserved outcome. In such settings, identification is generally not possible without imposing additional assumptions. Identification is sometimes possible, however, if an instrumental variable (IV), one that affects the missingness process without directly influencing the outcome (the exclusion restriction), is observed for all subjects. In this paper, we provide necessary and sufficient conditions for nonparametric identification of the full data distribution under MNAR with the aid of an IV. In addition, we give sufficient identification conditions that are more straightforward to verify in practice. For inference, we focus on estimation of a population outcome mean, for which we develop a suite of semiparametric estimators that extend methods previously developed for data missing at random. Specifically, we propose a novel doubly robust estimator of the mean of an outcome subject to MNAR. For illustration, the methods are used to account for selection bias induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer characteristics such as gender, age and years of experience as IVs. <p>Key words and phrases: Doubly robust, instrumental variable, inverse probability weighting, missing not at random.</span>
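As a toy illustration only (not the authors' doubly robust estimator), the inverse-probability-weighted mean that such methods build on can be sketched as follows; here the response probabilities `pi` are treated as known, whereas in the paper they would be identified with the aid of the IV and estimated from data:

```python
# Horvitz-Thompson / IPW sketch for a population mean under missingness.
# outcomes: outcome values Y_i (placeholder values where unobserved),
# observed: response indicators R_i, pi: response probabilities P(R_i = 1).

def ipw_mean(outcomes, observed, pi):
    """IPW estimate of E[Y]: (1/n) * sum of R_i * Y_i / pi_i."""
    n = len(outcomes)
    total = 0.0
    for y, r, p in zip(outcomes, observed, pi):
        if r:  # only observed outcomes contribute, upweighted by 1/pi_i
            total += y / p
    return total / n

# Fully observed with unit weights, this reduces to the plain mean.
print(ipw_mean([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], [1.0] * 4))  # prints 2.5
```

When half the units are missing with `pi = 0.5`, each observed unit stands in for itself and one missing unit, so `ipw_mean([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0], [0.5] * 4)` returns 2.0.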
/statistica/J28N4/J28N415/J28N415.html
A MEAN SCORE METHOD FOR SENSITIVITY ANALYSIS TO DEPARTURES FROM THE MISSING AT RANDOM ASSUMPTION IN RANDOMISED TRIALS Ian R. White, James Carpenter and Nicholas J. Horton 1985-2003<span style='font-size=12pt;'><center>Abstract</center> Most analyses of randomised trials with incomplete outcomes make untestable assumptions and should therefore be subjected to sensitivity analyses. However, methods for sensitivity analyses are not widely used. We propose a mean score approach for exploring global sensitivity to departures from missing at random or other assumptions about incomplete outcome data in a randomised trial. We assume a single outcome analysed under a generalised linear model. One or more sensitivity parameters, specified by the user, measure the degree of departure from missing at random in a pattern mixture model. Advantages of our method are that its sensitivity parameters are relatively easy to interpret and so can be elicited from subject matter experts; it is fast and non-stochastic; and its point estimate, standard error and confidence interval agree perfectly with standard methods when particular values of the sensitivity parameters make those standard methods appropriate. We illustrate the method using data from a mental health trial. <p>Key words and phrases: Intention-to-treat analysis, longitudinal data analysis, mean score, missing data, randomised trials, sensitivity analysis.</span>
/statistica/J28N4/J28N416/J28N416.html
PROPENSITY SCORE MATCHING ANALYSIS FOR CAUSAL EFFECTS WITH MNAR COVARIATES Bo Lu and Robert Ashmead 2005-2025<span style='font-size=12pt;'><center>Abstract</center> In observational studies, propensity score methods are popular for estimating causal effects. With completely observed data, this approach is valid under several assumptions; however, in practice data are often missing, which can have a substantial impact on the estimation. Current remedies to deal with missing covariates in propensity score methods generally fall into two categories. Some authors propose to account for the missing data patterns in propensity score estimation. Others propose to first impute the missing data, then utilize conventional propensity score adjustment methods. Both approaches assume that the data are missing at random (MAR), and there is little discussion regarding the impact on treatment effect estimation if covariates are missing not at random (MNAR). In this paper, we first examine the implication of the MAR assumption under the potential outcome framework. We then propose a sensitivity analysis method for assessing the impact of an MNAR covariate on treatment effect estimation with a matching estimator, with varying magnitudes of unmeasured confounding effect due to the missing covariate. Our method takes full advantage of the information contained in the partially missing covariate by matching on the observed portion and identifying a bounding distribution for the missing portion. It can be interpreted similarly to Rosenbaum's sensitivity analysis, and the results are robust because we make few parametric assumptions. We illustrate the application of the method using the 2012 Ohio Medicaid Assessment Survey (OMAS) to investigate the effect of health insurance on health outcomes, where an important covariate, household income, is partially missing. <p>Key words and phrases: Matching, not missing at random, propensity score, sensitivity analysis.</span>
/statistica/J28N4/J28N417/J28N417.html
PROPENSITY SCORE MATCHING ANALYSIS FOR CAUSAL EFFECTS WITH MNAR COVARIATES Bo Lu and Robert Ashmead 2005-2025<span style='font-size=12pt;'><center>Abstract</center> We consider nonrandomized pretest-posttest designs with complex survey data for observational studies. We show that two-sample pseudo empirical likelihood methods provide efficient inferences on the treatment effect, with a missing-by-design feature used for forming the two samples and the baseline information incorporated through suitable constraints. The proposed maximum pseudo empirical likelihood estimators of the treatment effect are consistent, and pseudo empirical likelihood ratio confidence intervals are constructed through bootstrap calibration methods. The proposed methods require estimation of propensity scores which depend on the underlying missing-by-design mechanism. A simulation study was conducted to examine finite sample performances of the proposed methods under different scenarios of nonignorable and ignorable missing patterns. An application to the International Tobacco Control Policy Evaluation Project Four Country Surveys is also presented to demonstrate the use of the proposed methods for examining the mode effect in survey data collection.<p>Key words and phrases: Auxiliary information, complex survey, confidence interval, empirical likelihood, missing-by-design, pretest-posttest study, propensity scores, treatment and control.</span>
/statistica/J28N4/J28N418/J28N418.html
IDENTIFICATION AND INFERENCE WITH NONIGNORABLE MISSING COVARIATE DATA Wang Miao and Eric Tchetgen Tchetgen 2049-2067<span style='font-size=12pt;'><center>Abstract</center> We study identification of parametric and semiparametric models with missing covariate data. When covariate data are missing not at random, identification is not guaranteed even under fairly restrictive parametric assumptions, a fact that is illustrated with several examples. We propose a general approach to establish identification of parametric and semiparametric models when a covariate is missing not at random. Without auxiliary information about the missingness process, identification of parametric models is strongly dependent on model specification. However, in the presence of a fully observed shadow variable that is correlated with the missing covariate but otherwise independent of the missingness conditional on the covariate, identification is more broadly achievable, including in fairly large semiparametric models. Special consideration is given to generalized linear models with the missingness process unrestricted. Under such a setting, the outcome model is identified for a number of familiar generalized linear models, and we provide counterexamples when identification fails. For estimation, we describe an inverse probability weighted estimator that incorporates the shadow variable to estimate the propensity score model, and we evaluate its performance via simulations. We further illustrate the shadow variable approach with a data example about home prices in China. <p>Key words and phrases: Identification, missing covariate data, missing not at random, shadow variable.</span>
/statistica/J28N4/J28N419/J28N419.html
DISCRETE CHOICE MODELS FOR NONMONOTONE NONIGNORABLE MISSING DATA: IDENTIFICATION AND INFERENCE Eric J. Tchetgen Tchetgen, Linbo Wang and BaoLuo Sun 2069-2088<span style='font-size=12pt;'><center>Abstract</center> Nonmonotone missing data arise routinely in empirical studies of the social and health sciences and, when ignored, can induce selection bias and loss of efficiency. It is common to account for nonresponse under a missing-at-random assumption which, although convenient, is rarely appropriate when nonresponse is nonmonotone. Likelihood and Bayesian missing data methodologies often require specification of a parametric model for the full data law, thus a priori ruling out any prospect for semiparametric inference. In this paper, we propose an all-purpose approach which delivers semiparametric inferences when missing data are nonmonotone and not at random. The approach is based on a discrete choice model (DCM) as a means to generate a large class of nonmonotone nonresponse mechanisms that are nonignorable. Sufficient conditions for nonparametric identification are given, and a general framework for fully parametric and semiparametric inference under an arbitrary DCM is proposed. Special consideration is given to the case of the logit discrete choice nonresponse model (LDCM), for which we describe generalizations of inverse-probability weighting, pattern-mixture estimation, doubly robust estimation, and multiply robust estimation. <p>Key words and phrases: Doubly robust, inverse-probability-weighting, missing not at random, nonmonotone missing data, pattern mixture.</span>
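For readers unfamiliar with discrete choice models, the basic building block is a multinomial-logit assignment of probabilities over the possible missingness patterns; the sketch below is a generic softmax under assumed linear scores, not the authors' LDCM specification:

```python
import math

def pattern_probs(scores):
    """Multinomial-logit (discrete choice) probabilities over missingness
    patterns, given one linear score per pattern. Subtracting the max
    score before exponentiating is a standard numerical-stability trick."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Two equally scored patterns split probability evenly.
print(pattern_probs([0.0, 0.0]))  # prints [0.5, 0.5]
```

In the paper's setting the scores for each nonresponse pattern would depend on (possibly unobserved) data, which is what makes the mechanism nonignorable.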
/statistica/J28N4/J28N42/J28N42.html
DISCRETE CHOICE MODELS FOR NONMONOTONE NONIGNORABLE MISSING DATA: IDENTIFICATION AND INFERENCE Eric J. Tchetgen Tchetgen, Linbo Wang and BaoLuo Sun 2069-2088<span style='font-size=12pt;'><center>Abstract</center> We consider the estimation of unknown parameters in a generalized linear model when some covariates have nonignorable missing values. When an instrument, a covariate that helps identify parameters under nonignorable missingness, is appropriately specified, a pseudo likelihood approach similar to that in Tang, Little and Raghunathan (2003) or Zhao and Shao (2015) can be applied. However, this approach does not work well when the instrument is a weak predictor of the response given other covariates. We show that the asymptotic variances of the pseudo likelihood estimators for the regression coefficients of covariates other than the instrument diverge to infinity as the regression coefficient of the instrument goes to 0. By an imputation-based adjustment for the score equations, we propose a new estimator for the regression coefficients of the covariates other than the instrument. This works well even if the instrument is a weak predictor. It is semiparametric since the propensity of missing covariate data is completely unspecified. To solve the adjusted score equation, we develop an iterative algorithm that can be applied using standard software at each iteration. We establish some theoretical results on the convergence of the proposed iterative algorithm and asymptotic normality of the resulting estimators. A variance estimation formula is also derived. Some simulation results and a data example are presented for illustration. <p>Key words and phrases: Adjusted likelihood, identifiability, instruments, nonignorable missing covariate data, pseudo-likelihood, semiparametric. </span>
/statistica/J28N4/J28N420/J28N420.html
STRATEGIC BINARY CHOICE MODELS WITH PARTIAL OBSERVABILITY Mark David Nieman 2089-2105<span style='font-size=12pt;'><center>Abstract</center> Strategic interactions among rational, self-interested actors are commonly theorized in the behavioral, economic, and social sciences. The theorized strategic processes have traditionally been modeled with multi-stage structural estimators, which improve parameter estimates at one stage by using the information from other stages. Multi-stage approaches, however, impose rather strict demands on data availability: data must be available for the actions of each strategic actor at every stage of the interaction. Observational data are not always structured in a manner that is conducive to these approaches. Moreover, the theorized strategic process implies that these data are missing not at random. In this paper, I derive a strategic logistic regression model with partial observability that probabilistically estimates unobserved actor choices related to earlier stages of strategic interactions. I compare the estimator to traditional logit and split-population logit estimators using Monte Carlo simulations and a substantive example of the strategic firm-regulator interaction associated with pollution and environmental sanctions. <p>Key words and phrases: Data missing not at random, partial observability, strategic choice models.</span>
/statistica/J28N4/J28N421/J28N421.html
GENERALIZED METHOD OF MOMENTS FOR NONIGNORABLE MISSING DATA Li Zhang, Cunjie Lin and Yong Zhou 2107-2124<span style='font-size=12pt;'><center>Abstract</center> In this study, we consider the problem of nonignorable missingness in the framework of the generalized method of moments. To model the missing propensity, a semiparametric logistic regression model is adopted, and we modify this model with nonresponse instrumental variables to overcome the identifiability issue. Under the identifiability conditions, we mitigate the effects of nonignorable missing data through reformulated estimating equations, imputed via a kernel regression method; the generalized method of moments is then applied to estimate the parameters of interest and the tilting parameter in the propensity simultaneously. Moreover, the consistency and asymptotic normality of the proposed estimators are established. We find that the price paid for estimating an unknown tilting parameter is an increased variance for the estimator of the population parameters, which is quite acceptable compared with requiring a validation sample, especially in practical problems. The proposed method is evaluated through simulation studies and demonstrated on a data example. <p>Key words and phrases: Estimating equations, exponential tilting, generalized method of moments, kernel regression, nonignorable missing, nonresponse instrument.</span>
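The kernel regression step mentioned above can be illustrated in isolation (a sketch only, not the authors' full GMM procedure); the Gaussian kernel and bandwidth `h` here are arbitrary choices for the demo:

```python
import math

def nw_estimate(x0, xs, ys, h=1.0):
    """Nadaraya-Watson kernel regression estimate of E[Y | X = x0],
    computed from observed (x, y) pairs with a Gaussian kernel:
    a weighted average of the ys, weights decaying with |x - x0| / h."""
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
```

In the imputation context, an estimate like `nw_estimate` would replace the missing contribution of a nonrespondent in the estimating equations before the GMM step is applied.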
/statistica/J28N4/J28N422/J28N422.html
GENERALIZED METHOD OF MOMENTS FOR NONIGNORABLE MISSING DATA Li Zhang, Cunjie Lin and Yong Zhou 2107-2124<span style='font-size=12pt;'><center>Abstract</center> The regularization approach for variable selection has been well developed for completely observed data sets over the past two decades. In the presence of missing values, this approach needs to be tailored to different missing data mechanisms. In this paper, we focus on a flexible and generally applicable missing data mechanism that contains both ignorable and nonignorable missing data mechanism assumptions. We show how the regularization approach for variable selection can be adapted to the situation under this missing data mechanism. The computational and theoretical properties for variable selection consistency are established. The proposed method is further illustrated by comprehensive simulation studies and data analyses. <p>Key words and phrases: Missing data mechanism, nonignorable missing data, penalized pairwise pseudo likelihood, regularization, selection consistency, variable selection.</span>
/statistica/J28N4/J28N423/J28N423.html
ESTIMATION OF AREA UNDER THE ROC CURVE UNDER NONIGNORABLE VERIFICATION BIAS Wenbao Yu, Jae Kwang Kim and Taesung Park 2149-2166<span style='font-size=12pt;'><center>Abstract</center> The Area Under the Receiver Operating Characteristic Curve (AUC) is frequently used for assessing the overall accuracy of a diagnostic marker. However, estimation of the AUC relies on knowledge of the true outcomes of subjects: diseased or non-diseased. Because disease verification based on a gold standard is often expensive and/or invasive, only a limited number of patients are sent for verification at the doctors' discretion. Estimation of the AUC is generally biased if only the small verified sample is used, and it is thus necessary to correct for this lack of information. Correction based on the ignorable missingness assumption (or missing at random) is also biased if the missing mechanism depends on the unknown disease outcome, which is called nonignorable missingness. In this paper, we propose a propensity-score-adjustment method for estimating the AUC based on the instrumental variable assumption when the missingness of disease status is nonignorable. The new method makes parametric assumptions on the verification probability and on the probability of being diseased for verified samples, rather than for the whole sample. The proposed parametric assumption on the observed sample is easier to verify than a parametric assumption on the full sample. We establish the asymptotic properties of the proposed estimators. A simulation study was performed to compare the proposed method with existing methods. The proposed method is applied to Alzheimer's disease data collected by the National Alzheimer's Coordinating Center. <p>Key words and phrases: Instrumental variable, missing data, not missing at random, ROC curve.</span>
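For orientation, the quantity being corrected is the empirical AUC, which equals the probability that a diseased subject's marker exceeds a non-diseased subject's (the Mann-Whitney statistic). The naive version below uses only verified subjects, which is exactly the estimator the paper describes as biased under nonignorable verification:

```python
def auc(diseased_scores, nondiseased_scores):
    """Empirical AUC: fraction of (diseased, non-diseased) pairs in which
    the diseased subject's marker is higher; ties count as half."""
    pairs = 0.0
    for d in diseased_scores:
        for h in nondiseased_scores:
            if d > h:
                pairs += 1.0
            elif d == h:
                pairs += 0.5
    return pairs / (len(diseased_scores) * len(nondiseased_scores))

# 3 of 4 pairs favor the diseased subject, plus one tie counted as half.
print(auc([2.0, 3.0], [1.0, 2.0]))  # prints 0.875
```

A verification-bias correction would, roughly speaking, reweight or model these pairwise comparisons using each subject's verification propensity rather than restricting to verified subjects.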
/statistica/J28N4/J28N424/J28N424.html
BAYESIAN INFERENCE FOR NONRESPONSE TWO-PHASE SAMPLING Yue Zhang, Henian Chen and Nanhua Zhang 2167-2187<span style='font-size=12pt;'><center>Abstract</center> Nonresponse is an important practical problem in epidemiological surveys and clinical trials. Common methods for dealing with missing data rely on untestable assumptions. In particular, non-ignorable modeling, which derives inference from the likelihood function based on a joint distribution of the variables and the missingness indicators, can be sensitive to misspecification of this distribution and may also have problems with identifying the parameters. Nonresponse two-phase sampling (NTS), which re-contacts and collects data from a subsample of the initial nonrespondents, has been used to reduce nonresponse bias. The additional data collected in phase II provide important information for identifying the parameters in the non-ignorable models. We propose a Bayesian selection model which utilizes the additional data from phase II and develop an efficient Markov chain Monte Carlo algorithm for the posterior computation. We illustrate the proposed model on simulation studies and a Quality of Life (QOL) dataset. <p>Key words and phrases: Bayesian selection model, Markov chain Monte Carlo, missing not at random, quality of life, two-phase sampling.</span>
/statistica/J28N4/J28N425/J28N425.html
BAYESIAN INFERENCE FOR NONRESPONSE TWO-PHASE SAMPLING Yue Zhang, Henian Chen and Nanhua Zhang 2167-2187<span style='font-size=12pt;'><center>Abstract</center> Let (Y<sub>𝒾</sub> , θ<sub>𝒾</sub>) , 𝒾 = 1 ,..., 𝓃 , be independent random vectors distributed as (Y , θ ) ~ G<sup>*</sup> , where the marginal distribution of θ is completely unknown, and the conditional distribution of Y conditional on θ is known. It is desired to estimate G<sup>*</sup> , as well as E<sub>G*</sub> ℎ (Y , θ) for a given ℎ , based on the observed Y <sub>1</sub> ,..., Y <sub>n</sub> . In this paper we suggest a method for these problems and discuss some of its applications. The method involves a quadratic programming step. It is computationally efficient and may handle large data sets, where the popular method that uses the EM-algorithm is impractical. The general approach of empirical Bayes, together with our computational method, is demonstrated and applied to problems of treating non-response. Our approach is nonstandard and does not involve missing at random type assumptions. We present simulations, as well as an analysis of a data set from the Labor Force Survey in Israel. We also suggest a method, involving convex optimization, for constructing confidence intervals for E<sub>G*</sub> ℎ under the above setup.<p>Key words and phrases: Non-Response, NPMLE.</span>
/statistica/J28N4/J28N43/J28N43.html
SENSITIVITY ANALYSIS FOR UNMEASURED CONFOUNDING IN COARSE STRUCTURAL NESTED MEAN MODELS Shu Yang and Judith J. Lok 1703-1723<span style='font-size=12pt;'><center>Abstract</center> Coarse Structural Nested Mean Models (SNMMs, Robins (2000)) and G-estimation can be used to estimate the causal effect of a time-varying treatment from longitudinal observational studies. However, they rely on an untestable assumption of no unmeasured confounding. In the presence of unmeasured confounders, the unobserved potential outcomes are not missing at random, and standard G-estimation leads to biased effect estimates. To remedy this, we investigate the sensitivity of G-estimators of coarse SNMMs to unmeasured confounding, assuming a nonidentifiable bias function which quantifies the impact of unmeasured confounding on the average potential outcome. We present adjusted G-estimators of coarse SNMM parameters and prove their consistency, under the bias modeling for unmeasured confounding. We present a sensitivity analysis for the effect of the ART initiation time on the mean CD4 count at year 2 after infection in HIV-positive patients, based on the prospective Acute Infection and Early Disease Research Program. <p>Key words and phrases: Censoring, confounding by indication, estimating equations, HIV/AIDS research, non-ignorable, sequential randomization. </span>
/statistica/J28N4/J28N44/J28N44.html
CALIBRATION AND MULTIPLE ROBUSTNESS WHEN DATA ARE MISSING NOT AT RANDOM Peisong Han 1725-1740<span style='font-size=12pt;'><center>Abstract</center> In missing data analysis, multiple robustness is a desirable property resulting from the calibration technique. A multiply robust estimator is consistent if any one of the multiple data distribution models and missingness mechanism models is correctly specified. So far in the literature, multiple robustness has only been established when data are missing at random (MAR). We study how to carry out calibration to construct a multiply robust estimator when data are missing not at random (MNAR). With multiple models available, where each model consists of two components, one for data distribution for complete cases and one for missingness mechanism, our proposed estimator is consistent if any one pair of models are correctly specified. <p>Key words and phrases: Calibration, empirical likelihood, missing not at random (MNAR), multiple robustness, nonignorable nonresponse.</span>
/statistica/J28N4/J28N45/J28N45.html
SEQUENTIAL IDENTIFICATION OF NONIGNORABLE MISSING DATA MECHANISMS Mauricio Sadinle and Jerome P. Reiter 1741-1759<span style='font-size=12pt;'><center>Abstract</center> With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated. <p>Key words and phrases: Identification, missing not at random, non-parametric saturation, partial ignorability, sensitivity analysis.</span>
/statistica/J28N4/J28N46/J28N46.html
CALIBRATION AND MULTIPLE ROBUSTNESS WHEN DATA ARE MISSING NOT AT RANDOM Peisong Han 1725-1740<span style='font-size=12pt;'><center>Abstract</center> Call-back of nonrespondents is common in surveys involving telephone or mail interviews. In general, these call-backs gather information on unobserved responses, so incorporating them can improve estimation accuracy and efficiency. Call-back studies mainly focus on Alho's (1990) selection model or the pattern mixture model formulation. In this paper, we generalize the Heckman selection model to nonignorable nonresponse using call-back information. The unknown parameters are then estimated by the maximum likelihood method. The proposed formulation is simpler than Alho's selection model or the pattern mixture model formulation. It can reduce the bias caused by the nonignorable missingness mechanism and improve the estimation efficiency by incorporating the call-back information. Further, it provides a marginal interpretation of a covariate effect. Moreover, the regression coefficient of interest is robust to misspecification of the distribution. Simulation studies are conducted to evaluate the performance of the proposed method. For illustration, we apply the approach to National Health Interview Survey data. <p>Key words and phrases: Call-back, Heckman model, maximum likelihood estimate, nonignorable, nonresponse.</span>
/statistica/J28N4/J28N47/J28N47.html
CALIBRATION AND MULTIPLE ROBUSTNESS WHEN DATA ARE MISSING NOT AT RANDOM Peisong Han 1725-1740<span style='font-size=12pt;'><center>Abstract</center> In this paper, a general regression model with responses missing not at random is considered. From a rank-based estimating equation, a rank-based estimator of the regression parameter is derived. Based on this estimator's asymptotic normality property, a consistent sandwich estimator of its corresponding asymptotic covariance matrix is obtained. In order to overcome the over-coverage issue of the normal approximation procedure, the empirical likelihood based on the rank-based gradient function is defined, and its asymptotic distribution is established. Extensive simulation experiments under different settings of error distributions with different response probabilities are considered, and the simulation results show that the proposed empirical likelihood approach has better performance in terms of coverage probability and average length of confidence intervals for the regression parameters compared with the normal approximation approach and its least-squares counterpart. A data example is provided to illustrate the proposed methods. <p>Key words and phrases: Empirical likelihood, imputation, non-ignorable missing, rank-based estimator.</span>
/statistica/J28N4/J28N48/J28N48.html
CALIBRATION AND MULTIPLE ROBUSTNESS WHEN DATA ARE MISSING NOT AT RANDOM Peisong Han 1725-1740<span style='font-size=12pt;'><center>Abstract</center> The presence of missing values complicates statistical analyses. In the design of experiments, missing values are particularly problematic when constructing optimal designs, as it is not known which values are missing at the design stage. When data are missing at random it is possible to incorporate this information into the optimality criterion that is used to find designs; Imhof, Song and Wong (2002) develop such a framework. However, when data are not missing at random this framework can lead to inefficient designs. We investigate and address the specific challenges that not-missing-at-random values present when finding optimal designs for linear regression models. We show that the optimality criteria depend on model parameters that traditionally do not affect the design, such as regression coefficients and the residual variance. We also develop a framework that improves efficiency of designs over those found when values are missing at random.<p>Key words and phrases: Covariance matrix, information matrix, linear regression model, missing observations, not missing at random, optimal design.</span>
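The abstract notes that, under missing-at-random data, observation probabilities can be folded into the optimality criterion (the framework of Imhof, Song and Wong (2002)). As a rough illustration of that idea, not the paper's own construction, here is a numpy sketch comparing D-optimality for simple linear regression when the expected information matrix is weighted by hypothetical per-point response probabilities:

```python
import numpy as np

def weighted_information(design_points, p_observe):
    """Expected information matrix for y = b0 + b1*x when the response
    at x_i is observed with probability p_i (MAR-style weighting)."""
    X = np.column_stack([np.ones_like(design_points), design_points])
    return X.T @ np.diag(p_observe) @ X

# Toy comparison: responses near x = 1 are often missing (p = 0.3).
pts = np.array([-1.0, 0.0, 1.0])
p = np.array([1.0, 1.0, 0.3])

det_with_missing = np.linalg.det(weighted_information(pts, p))
det_complete = np.linalg.det(weighted_information(pts, np.ones(3)))
print(det_with_missing < det_complete)  # → True: missingness reduces D-efficiency
```

A design optimized under the weighted criterion would shift mass toward points likely to be observed, which is the intuition the abstract builds on.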
/statistica/J28N4/J28N49/J28N49.html
CALIBRATION AND MULTIPLE ROBUSTNESS WHEN DATA ARE MISSING NOT AT RANDOM Peisong Han 1725-1740<span style='font-size=12pt;'><center>Abstract</center> We show how to use Bayesian uncertainty analysis to study several three-way contingency tables, each obtained from a single area, when one, two or three categories are missing. This is an extension of Nandram and Woo (2015) to cover small areas. One approach to analyze these data is to construct several tables (one complete and the others incomplete) with each table corresponding to one or more missing categories. When tables are incomplete and nonignorable nonresponse models are used, there are nonidentifiable parameters. To deal with these parameters, we describe five hierarchical Bayesian models, which are an ignorable nonresponse model and four nonignorable nonresponse models. Rather than performing a sensitivity analysis, we perform the Bayesian uncertainty analysis by placing priors on the nonidentifiable parameters. This is done to reduce the effects of the nonidentifiable parameters, which is accomplished by projecting the parameters to a lower-dimensional space and allowing the reduced set of parameters to share a common distribution. Also, this procedure allows a "borrowing of strength" from larger areas to improve estimation in smaller areas. We use the griddy Gibbs sampler to fit our models and we use goodness-of-fit procedures to assess model fit. We use an illustrative example and a simulation study to compare our models when inference is made about finite population proportions of the cells of the three-way tables.<p>Key words and phrases: Bayesian uncertainty analysis, griddy Gibbs sampler, model diagnostics, nonidentifiable parameters, nonignorable nonresponse model.</span>
/statistica/J28N5/J28N51/J28N51.html
PETER GAVIN HALL Terry Speed 2215-2235<span style='font-size=12pt;'><center>Abstract</center> This paper is about Peter Hall, documenting and commenting on how Peter's contemporaries, be they peers, colleagues, friends or students, or even the older generation of the statistical community, perceived Peter, in their own words. To some extent, I also do the same with Peter's own perceptions and thoughts.<p>Key words and phrases: Credo, gentle, kind, life, mind, passions, Peter Gavin Hall, soul. </span>
/statistica/J28N5/J28N510/J28N510.html
HYBRID COMBINATIONS OF PARAMETRIC AND EMPIRICAL LIKELIHOODS Nils Lid Hjort, Ian W. McKeague and Ingrid Van Keilegom 2389-2407
/statistica/J28N5/J28N511/J28N511.html
EMPIRICAL LIKELIHOOD RATIO TESTS FOR COEFFICIENTS IN HIGH-DIMENSIONAL HETEROSCEDASTIC LINEAR MODELS Honglang Wang, Ping-Shou Zhong and Yuehua Cui 2409-2433<span style='font-size=12pt;'><center>Abstract</center> This paper considers hypothesis testing problems for a low-dimensional coefficient vector in a high-dimensional linear model with heteroscedastic variance. Heteroscedasticity is a commonly observed phenomenon in many applications, including finance and genomic studies. Several statistical inference procedures have been proposed for low-dimensional coefficients in a high-dimensional linear model with homoscedastic variance, which are not applicable for models with heteroscedastic variance. The heteroscedasticity issue has rarely been investigated. We propose a simple inference procedure based on empirical likelihood to overcome the heteroscedasticity issue. The proposed method is able to make valid inference even when the conditional variance of random error is an unknown function of high-dimensional predictors. We apply our inference procedure to three recently proposed estimating equations and establish the asymptotic distributions of the proposed methods. Simulation studies and real data applications are conducted to demonstrate the proposed methods.<p>Key words and phrases: Empirical likelihood, heteroscedastic linear models, high-dimensional data, low-dimensional coefficients.</span>
/statistica/J28N5/J28N512/J28N512.html
AN OUTLYINGNESS MATRIX FOR MULTIVARIATE FUNCTIONAL DATA CLASSIFICATION Wenlin Dai and Marc G. Genton 2435-2454<span style='font-size=12pt;'><center>Abstract</center> The classification of multivariate functional data is an important task in scientific research. Unlike point-wise data, functional data are usually classified by their shapes rather than by their scales. We define an outlyingness matrix by extending directional outlyingness, an effective measure of the shape variation of curves that combines the direction of outlyingness with conventional statistical depth. We propose classifiers based on directional outlyingness and the outlyingness matrix. Our classifiers provide better performance compared with existing depth-based classifiers when applied on both univariate and multivariate functional data from simulation studies. We also test our methods on two data problems: speech recognition and gesture classification, and obtain results that are consistent with the findings from the simulated data.<p>Key words and phrases: Directional outlyingness, functional data classification, multivariate functional data, outlyingness matrix, statistical depth.</span>
/statistica/J28N5/J28N513/J28N513.html
ADAPTIVE FUNCTIONAL LINEAR REGRESSION VIA FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS AND BLOCK THRESHOLDING T. Tony Cai, Linjun Zhang and Harrison H. Zhou 2455-2468<span style='font-size=12pt;'><center>Abstract</center> Theoretical results in the functional linear regression literature have so far focused on minimax estimation where smoothness parameters are assumed to be known and the estimators typically depend on these smoothness parameters. In this paper we consider adaptive estimation in functional linear regression. The goal is to construct a single data-driven procedure that achieves optimality results simultaneously over a collection of parameter spaces. Such an adaptive procedure automatically adjusts to the smoothness properties of the underlying slope and covariance functions. The main technical tools for the construction of the adaptive procedure are functional principal component analysis and block thresholding. The estimator of the slope function is shown to adaptively attain the optimal rate of convergence over a large collection of function spaces. <p>Key words and phrases: Adaptive estimation, block thresholding, eigenfunction, eigenvalue, functional data analysis, functional principal component analysis, minimax estimation, rate of convergence, slope function, smoothing, spectral decomposition. </span>
/statistica/J28N5/J28N514/J28N514.html
FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS FOR DERIVATIVES OF MULTIVARIATE CURVES Maria Grith, Heiko Wagner, Wolfgang K. Härdle and Alois Kneip 2469-2496<span style='font-size=12pt;'><center>Abstract</center> We propose two methods based on the functional principal component analysis (FPCA) to estimate smooth derivatives for a sample of observed curves with a multidimensional domain. We apply the eigendecomposition to a) the dual covariance matrix of the derivatives; b) the dual covariance matrix of the observed curves, and take derivatives of their eigenfunctions. To handle noisy and discrete observations, we rely on local polynomial regression. We show that if the curves are contained in a finite-dimensional function space, the second method performs better asymptotically. We apply our methodology in simulations and an empirical study of option implied state price density surfaces. Using call data for the DAX 30 stock index between 2002 and 2011, we identify three components that are interpreted as volatility, skewness and tail factors, and we find evidence of term structure variation. <p>Key words and phrases: Derivatives, dual method, functional principal component analysis, multivariate functions, option prices, state price densities.</span>
/statistica/J28N5/J28N515/J28N515.html
SINGULAR ADDITIVE MODELS FOR FUNCTION TO FUNCTION REGRESSION Byeong U. Park, Chun-Jui Chen, Wenwen Tao and Hans-Georg Müller 2497-2520<span style='font-size=12pt;'><center>Abstract</center> In various functional regression settings one observes i.i.d. samples of paired stochastic processes (X, Y) and aims at predicting the trajectory of Y, given the trajectory X. For example, one may wish to predict the future segment of a process from observing an initial segment of its trajectory. Commonly used functional regression models are based on representations that are obtained separately for X and Y. In contrast to these established methods, often implemented with functional principal components, we base our approach on a singular expansion of the paired processes X, Y with singular functions that are derived from the cross-covariance surface between X and Y. The motivation for this approach is that the resulting singular components may better reflect the association between X and Y. The regression relationship is then based on the assumption that each singular component of Y follows an additive regression model with the singular components of X as predictors. To handle the inherent dependency of these predictors, we develop singular additive models with smooth backfitting. We discuss asymptotic properties of the estimates as well as their practical behavior in simulations and data analysis. <p>Key words and phrases: Additive model, cross-covariance operator, functional data analysis, singular decomposition, smooth backfitting.</span>
/statistica/J28N5/J28N516/J28N516.html
METHODOLOGY AND CONVERGENCE RATES FOR FUNCTIONAL TIME SERIES REGRESSION Tung Pham and Victor M. Panaretos 2521-2539<span style='font-size=12pt;'><center>Abstract</center> The functional linear model extends the notion of linear regression to the case where the response and covariates are i.i.d. elements of an infinite-dimensional Hilbert space. The unknown to be estimated is a Hilbert-Schmidt operator, whose inverse is by definition unbounded, rendering the problem of inference ill-posed. In this paper, we consider the more general context where the sample of response/covariate pairs forms a weakly dependent stationary process in the respective product Hilbert space: simply stated, the case where we have a regression between functional time series. We consider a general framework of potentially nonlinear processes, exploiting recent advances in the spectral analysis of functional time series. This allows us to quantify the inherent ill-posedness, and to motivate a Tikhonov regularisation technique in the frequency domain. Our main result is the rate of convergence for the corresponding estimators of the regression coefficients, the latter forming a summable sequence in the space of Hilbert-Schmidt operators. In a sense, our main result can be seen as a generalisation of the classical functional linear model rates to the case of time series, and rests only upon Brillinger-type mixing conditions. It is seen that, just as the covariance operator eigenstructure plays a central role in the independent case, so does the spectral density operator's eigenstructure in the dependent case. While the analysis becomes considerably more involved in the dependent case, the rates are strikingly comparable to those of the i.i.d. case, but at the expense of an additional factor caused by the necessity to estimate the spectral density operator at a nonparametric rate, as opposed to the parametric rate for covariance operator estimation. 
<p>Key words and phrases: Frequency analysis, functional linear model, spectral density operator, system identification, Tikhonov regularisation.</span>
/statistica/J28N5/J28N517/J28N517.html
EDGEWORTH CORRECTION FOR THE LARGEST EIGENVALUE IN A SPIKED PCA MODEL Jeha Yang and Iain M. Johnstone 2541-2564
/statistica/J28N5/J28N518/J28N518.html
CALIBRATED PERCENTILE DOUBLE BOOTSTRAP FOR ROBUST LINEAR REGRESSION INFERENCE Daniel McCarthy, Kai Zhang, Lawrence D. Brown, Richard Berk, Andreas Buja, Edward I. George and Linda Zhao 2565-2589<span style='font-size=12pt;'><center>Abstract</center> We consider inference for the parameters of a linear model when the covariates are random and the relationship between response and covariates is possibly non-linear. Conventional inference methods such as z intervals perform poorly in these cases. We propose a double bootstrap-based calibrated percentile method, perc-cal, as a general-purpose CI method which performs very well relative to alternative methods in challenging situations such as these. The superior performance of perc-cal is demonstrated by a thorough, full-factorial design synthetic data study as well as a data example involving the length of criminal sentences. We also provide theoretical justification for the perc-cal method under mild conditions. The method is implemented in the R package "perccal", available through CRAN and coded primarily in C++, to make it easier for practitioners to use. <p>Key words and phrases: Confidence intervals, Edgeworth expansion, resampling, second-order correctness.</span>
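perc-cal calibrates percentile intervals for linear-model parameters via a double bootstrap; the package itself is the authoritative implementation. A stripped-down sketch of the calibration idea for a sample mean (the candidate levels, bootstrap sizes and coverage target below are illustrative, not the package's defaults):

```python
import numpy as np

def percentile_ci(sample, stat, level, B, rng):
    """Plain percentile bootstrap interval for stat(sample)."""
    boot = np.array([stat(rng.choice(sample, size=len(sample)))
                     for _ in range(B)])
    alpha = (1 - level) / 2
    return np.quantile(boot, alpha), np.quantile(boot, 1 - alpha)

def perc_cal_ci(sample, stat, nominal=0.95, levels=(0.90, 0.95, 0.99),
                B_outer=200, B_inner=100, seed=0):
    """Double bootstrap: estimate each candidate level's coverage of the
    full-sample statistic, then report the best-calibrated interval."""
    rng = np.random.default_rng(seed)
    theta_hat = stat(sample)
    coverage = []
    for lam in levels:
        hits = 0
        for _ in range(B_outer):
            resample = rng.choice(sample, size=len(sample))
            lo, hi = percentile_ci(resample, stat, lam, B_inner, rng)
            hits += lo <= theta_hat <= hi
        coverage.append(hits / B_outer)
    best = levels[int(np.argmin([abs(c - nominal) for c in coverage]))]
    return percentile_ci(sample, stat, best, B_outer, rng)

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=80)
lo, hi = perc_cal_ci(data, np.mean)
```

The inner loop is what makes the method "double": each outer resample is itself bootstrapped, which is the source of both the calibration and the computational cost that motivates the C++ backing of the package.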
/statistica/J28N5/J28N519/J28N519.html
EDGEWORTH EXPANSIONS FOR A CLASS OF SPECTRAL DENSITY ESTIMATORS AND THEIR APPLICATIONS TO INTERVAL ESTIMATION Arindam Chatterjee and Soumendra N. Lahiri 2591-2608<span style='font-size=12pt;'><center>Abstract</center> In this paper we obtain valid Edgeworth expansions (EEs) for a class of spectral density estimators of a stationary time series. The spectral estimators are based on tapered periodograms of overlapping blocks of observations. We give conditions for the validity of a general order EE under an approximate strong mixing condition on the random variables. We use the EE results to study higher order coverage accuracy of confidence intervals (CIs) based on Studentization and on Variance Stabilizing transformation. It is shown that the accuracy of the CIs critically depends on the length of the blocks employed. We use the EE results to determine the optimal orders of the block lengths for one- and two-sided CIs under both methods. Theoretical results are illustrated with a moderately large simulation study. We dedicate this paper to the memory of Professor Peter Hall who made fundamental contributions to asymptotic theory of Statistics and extensively used EEs to study higher order coverage properties of CIs. <p>Key words and phrases: Confidence intervals, frequency domain, stationary, studentization, taper, variance stabilizing transformation.</span>
/statistica/J28N5/J28N52/J28N52.html
PETER GAVIN HALL A BRIEF REMEMBRANCE OF THE MAN AND HIS WORK Francisco J. Samaniego 2237-2248
/statistica/J28N5/J28N520/J28N520.html
A BOOTSTRAP METHOD FOR CONSTRUCTING POINTWISE AND UNIFORM CONFIDENCE BANDS FOR CONDITIONAL QUANTILE FUNCTIONS Joel L. Horowitz and Anand Krishnamurthy 2609-2632<span style='font-size=12pt;'><center>Abstract</center> This paper is concerned with inference about the conditional quantile function in a nonparametric quantile regression model. Any method for constructing a confidence interval or band for this function must deal with the asymptotic bias of nonparametric estimators of the function. In estimation methods such as local polynomial estimation, this is usually done through undersmoothing or explicit bias correction. The latter usually requires oversmoothing. However, there are no satisfactory empirical methods for selecting bandwidths that under- or oversmooth. This paper extends the bootstrap method of Hall and Horowitz (2013) for conditional mean functions to conditional quantile functions. The paper also shows how the bootstrap method can be used to obtain uniform confidence bands. The bootstrap method uses only bandwidths that are selected by standard methods such as cross validation and plug-in. It does not use under- or oversmoothing. The results of Monte Carlo experiments illustrate the numerical performance of the bootstrap method. <p>Key words and phrases: Bias, bootstrap, confidence band, nonparametric estimation, quantile estimation.</span>
/statistica/J28N5/J28N521/J28N521.html
PARTIAL CONSISTENCY WITH SPARSE INCIDENTAL PARAMETERS Jianqing Fan, Runlong Tang and Xiaofeng Shi 2633-2655<span style='font-size=12pt;'><center>Abstract</center> The penalized estimation principle is fundamental to high-dimensional problems. In the literature, it has been extensively and successfully applied to various models with only structural parameters. In this paper, we apply this penalization principle to a linear regression model with not only structural parameters but also sparse incidental parameters. For the estimators of the structural parameters, we derive their consistency and asymptotic normality, which reveals an oracle property. However, the penalized estimators for the incidental parameters possess only partial selection consistency, not consistency. This is an interesting partial consistency phenomenon: the structural parameters are consistently estimated while the incidental ones are not. For the structural parameters, also considered is an alternative two-step penalized estimator, which has fewer possible asymptotic distributions and thus is more suitable for statistical inferences. A data-driven approach for selecting a penalty regularization parameter is provided. The finite-sample performance of the penalized estimators for the structural parameters is evaluated by simulations and a data set is analyzed. We also extend the methods and results to the case where the number of structural parameters diverges more slowly than the sample size. <p>Key words and phrases: Oracle property, partial consistency, penalized estimation, sparse incidental parameter, structural parameter, two-step estimation.</span>
/statistica/J28N5/J28N522/J28N522.html
PARTIAL CONSISTENCY WITH SPARSE INCIDENTAL PARAMETERS Jianqing Fan, Runlong Tang and Xiaofeng Shi 2633-2655<span style='font-size=12pt;'><center>Abstract</center> Martingale limit theory is increasingly important in modern probability theory and mathematical statistics. In this article, we give a selected overview of Peter Hall's contributions to both the theoretical foundations and the wide applicability of martingales. We highlight his celebrated coauthored book, Hall and Heyde (1980) and his ground-breaking paper, Hall (1984). To illustrate the power of his martingale limit theory, we present two contemporary applications to estimating and testing high dimensional covariance matrices. In the first, we use the martingale central limit theorem in Hall and Heyde (1980) to obtain the simultaneous risk optimality and consistency of Stein's unbiased risk estimation (SURE) information criterion for large covariance matrix estimation. In the second application, we use the central limit theorem for degenerate U-statistics in Hall (1984) to establish the consistent asymptotic size and power against more general alternatives when testing high-dimensional covariance matrices. <p>Key words and phrases: Degenerate U-statistics, hypothesis testing, large covariance matrix, martingale limit theory, Stein's unbiased risk estimation.</span>
/statistica/J28N5/J28N523/J28N523.html
HIGH-DIMENSIONAL TWO-SAMPLE COVARIANCE MATRIX TESTING VIA SUPER-DIAGONALS Jing He and Song Xi Chen 2671-2696<span style='font-size=12pt;'><center>Abstract</center> This paper considers testing for two-sample covariance matrices of high-dimensional populations. We formulate a multiple test procedure by comparing the super-diagonals of the covariance matrices. The asymptotic distributions of the test statistics are derived and the powers of individual tests are studied. The test statistics, by focusing on the super-diagonals, have smaller variation than existing tests that target the entire covariance matrix. The advantage of the proposed test is demonstrated by simulation studies, as well as an empirical study on a prostate cancer dataset. <p>Key words and phrases: High dimensional test, multiple test, sparse alternative, two-sample test for covariance matrices.</span>
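The super-diagonal idea can be illustrated with a toy max-type statistic; the paper's actual standardisation and multiple-testing correction are more refined than this sketch, and the normalisation below is an invented placeholder:

```python
import numpy as np

def super_diagonal(S, k):
    """k-th super-diagonal of a square matrix (k = 0 is the main diagonal)."""
    return np.array([S[i, i + k] for i in range(S.shape[0] - k)])

def superdiag_stats(X, Y, max_band=2):
    """Crudely standardised max differences between the first few
    super-diagonals of two sample covariance matrices."""
    Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    scale = np.sqrt(2.0 / min(len(X), len(Y)))  # placeholder normalisation
    return [np.max(np.abs(super_diagonal(Sx, k) - super_diagonal(Sy, k))) / scale
            for k in range(max_band + 1)]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))   # two samples from the same population
Y = rng.normal(size=(500, 20))
stats = superdiag_stats(X, Y)    # one statistic per band
```

Restricting attention to a few bands near the diagonal is what keeps the comparison low-variance relative to whole-matrix tests.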
/statistica/J28N5/J28N524/J28N524.html
ESTIMATING A DISCRETE LOG-CONCAVE DISTRIBUTION IN HIGHER DIMENSIONS Hanna Jankowski and Yan Hua Tian 2697-2712<span style='font-size=12pt;'><center>Abstract</center> We define a new class of log-concave distributions on the discrete lattice ℤ<sup>d</sup>, and study its properties. We show how to compute the maximum likelihood estimator of this class of probability mass functions from an independent and identically distributed sample, and establish consistency of the estimator, even if the class has been incorrectly specified. For finite sample sizes, in our simulations, the proposed estimator outperforms a purely nonparametric approach (the empirical distribution) while remaining comparable to the correct parametric approach. Notably, the new class of distributions has a natural relationship with log-concave densities. <p>Key words and phrases: Log-concave, maximum likelihood estimation, multivariate data, probability mass function estimation, shape-constrained methods.</span>
/statistica/J28N5/J28N525/J28N525.html
ESTIMATING A DISCRETE LOG-CONCAVE DISTRIBUTION IN HIGHER DIMENSIONS Hanna Jankowski and Yan Hua Tian 2697-2712<span style='font-size=12pt;'><center>Abstract</center> Motivated by the theoretical properties of variable selection procedures for Cox's model, we study the asymptotic behavior of the partial likelihood for the Cox model. We find that the partial likelihood does not behave like an ordinary likelihood, whose sample average typically tends to its expected value, a finite number, in probability. Under some mild conditions, we prove that the sample average of partial likelihood tends to infinity at the rate of the logarithm of the sample size, in probability. We apply the asymptotic results on the partial likelihood to study tuning parameter selection for penalized partial likelihood. We find that the penalized partial likelihood with the generalized cross-validation (GCV) tuning parameter proposed in Fan and Li (2002) enjoys the model selection consistency property, despite the fact that GCV, AIC and C<sub>p</sub>, equivalent in the context of linear regression models, are not model selection consistent. Our empirical studies via Monte Carlo simulation and a data example confirm our theoretical findings.<p>Key words and phrases: Akaike information criterion, Bayesian information criterion, LASSO penalized partial likelihood, SCAD, variable selection.</span>
/statistica/J28N5/J28N526/J28N526.html
ESTIMATING A DISCRETE LOG-CONCAVE DISTRIBUTION IN HIGHER DIMENSIONS Hanna Jankowski and Yan Hua Tian 2697-2712<span style='font-size=12pt;'><center>Abstract</center> Data sharpening for kernel regression and density estimation was introduced by the late Peter Hall. We review briefly his enormous contribution to the literature in this area and then propose a data sharpening procedure arising from imposition of a soft global functional constraint in local regression analysis. Instead of enforcing the constraint everywhere, the procedure guides the data in directions which enable satisfaction or near-satisfaction of the given property globally through the use of a penalty. It results in a modified local regression estimator which possesses a closed functional form and which includes a conventional local regression estimator as a special case. The approach can accommodate various constraints, most of which in practice are motivated by expert prior knowledge. We demonstrate theoretically and numerically that the proposed estimator is an improved variant of the corresponding local regression estimator. It achieves a reduction in variance while maintaining the bias at the same level. Although the focus in the paper is on local polynomial regression, the technique can be applied, in principle, to any linear nonparametric estimator, including regression splines, smoothing and penalized splines and other recently proposed kernel estimators. We exhibit the usefulness of the proposed approach with an analysis of a collection of temperatures at the airport of Vancouver. The analysis reveals a possible monotonic trend underlying the conventional supposition of a periodic (seasonal) temporal structure. <p>Key words and phrases: Bias-variance trade-off, functional constraint, kernel smoothing, quadratic penalty.</span>
/statistica/J28N5/J28N527/J28N527.html
BIAS REDUCTION FOR NONPARAMETRIC AND SEMIPARAMETRIC REGRESSION MODELS Ming-Yen Cheng, Tao Huang, Peng Liu, and Heng Peng 2749-2770<span style='font-size=12pt;'><center>Abstract</center> Nonparametric and semiparametric regression models are useful for discovering nonlinear relationships between the response variable and predictor variables. However, optimal efficient estimators for the nonparametric components in these models are biased, which hinders the development of methods for further statistical inference. In this paper, based on the local linear fitting, we propose a simple bias reduction approach for the estimation of the nonparametric regression model. Our approach does not need to use higher-order local polynomial regression to estimate the bias, and hence avoids the double bandwidth selection and design sparsity problems suffered by higher-order local polynomial fitting. It also does not inflate the variance. Hence it can be easily applied to complex statistical inference problems. We extend our approach to varying coefficient models, to estimate the variance function, and to construct simultaneous confidence bands for the nonparametric regression function. Simulations are carried out for comparisons with existing methods, and a data example is used to investigate the performance of the proposed method.<p>Key words and phrases: Simultaneous confidence band, undersmoothing, variance function estimation.</span>
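The paper builds on local linear fitting. As background (not the paper's bias-reduction step itself), a minimal local linear smoother with a Gaussian kernel and a hand-picked bandwidth:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate m_hat(x0): weighted least squares of y on
    (1, x - x0) with Gaussian kernel weights; the intercept is the fit."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 400))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=400)
fit = local_linear(0.25, x, y, h=0.05)   # true value sin(pi/2) = 1
```

The leading bias of this baseline estimator is of order h², proportional to the curvature of the regression function; reducing that term without inflating the variance or requiring a second bandwidth is the contribution the abstract describes.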
/statistica/J28N5/J28N528/J28N528.html
NONLINEAR REGRESSION ESTIMATION USING SUBSET-BASED KERNEL PRINCIPAL COMPONENTS Yuan Ke, Degui Li and Qiwei Yao 2771-2794<span style='font-size=12pt;'><center>Abstract</center> We study the estimation of conditional mean regression functions through the so-called subset-based kernel principal component analysis (KPCA). Instead of using one global kernel feature space, we project a target function into different localized kernel feature spaces at different parts of the sample space. Each localized kernel feature space reflects the relationship on a subset between the response and covariates more parsimoniously. When the observations are collected from a strictly stationary and weakly dependent process, the orthonormal eigenfunctions which span the kernel feature space are consistently estimated by implementing an eigenanalysis on the subset-based kernel Gram matrix, and the estimated eigenfunctions are then used to construct the estimation of the mean regression function. Under some regularity conditions, the developed estimator is shown to be uniformly consistent over the subset with a convergence rate faster than those of some well-known nonparametric estimation methods. In addition, we discuss some generalizations of the KPCA approach, and consider using the same subset-based KPCA approach to estimate the conditional distribution function. The numerical studies including three simulated examples and two data sets illustrate the reliable performance of the proposed method. In particular, the improvement over the global KPCA method is evident. <p>Key words and phrases: Conditional distribution function, eigenanalysis, kernel Gram matrix, KPCA, mean regression function, nonparametric regression.</span>
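The eigenanalysis of a kernel Gram matrix at the core of KPCA can be sketched as follows; this is a global version with a Gaussian kernel as one arbitrary choice, whereas the paper works with subset-based Gram matrices on localized parts of the sample space:

```python
import numpy as np

def kernel_gram(X, bandwidth):
    """Gaussian-kernel Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * bandwidth**2))

def kpca_scores(X, bandwidth, n_components):
    """Eigendecomposition of the double-centred Gram matrix: the leading
    eigenvectors give estimated eigenfunction values at the sample points."""
    K = kernel_gram(X, bandwidth)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                       # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
vals, vecs = kpca_scores(X, bandwidth=1.0, n_components=3)
```

Running the same decomposition on a subset of the sample, as the paper proposes, yields eigenfunctions tailored to that subset, which is the source of the parsimony the abstract claims.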
/statistica/J28N5/J28N529/J28N529.html
OPTIMAL MODEL AVERAGING OF VARYING COEFFICIENT MODELS Cong Li, Qi Li, Jeffrey S. Racine and Daiqiang Zhang 2795-2809<span style='font-size=12pt;'><center>Abstract</center> We consider the problem of model averaging over a set of semiparametric varying coefficient models where the varying coefficients can be functions of continuous and categorical variables. We propose a Mallows model averaging procedure that is capable of delivering model averaging estimators with solid finite-sample performance. Theoretical underpinnings are provided, finite-sample performance is assessed via Monte Carlo simulation, and an illustrative application is presented. The approach is very simple to implement in practice and R code is provided as supplementary material. <p>Key words and phrases: Candidate models, kernel smoothing, semiparametric.</span>
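The abstract does not spell out the criterion, so as background only: the classical Mallows model-averaging criterion (Hansen (2007)) penalises the squared residual of the weighted fit by twice the error variance times the weighted effective dimension. A toy grid search over two candidate linear fits, not the authors' varying-coefficient implementation:

```python
import numpy as np

def mallows_weight(y, fit_a, fit_b, k_a, k_b, sigma2, grid=201):
    """Search w in [0, 1] minimising the Mallows criterion
    ||y - w*fit_a - (1-w)*fit_b||^2 + 2*sigma2*(w*k_a + (1-w)*k_b)."""
    best_w, best_c = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, grid):
        mu = w * fit_a + (1 - w) * fit_b
        c = np.sum((y - mu) ** 2) + 2.0 * sigma2 * (w * k_a + (1 - w) * k_b)
        if c < best_c:
            best_w, best_c = w, c
    return best_w

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 1.0 * x1 + 0.2 * x2 + rng.normal(size=n)

Xa = np.column_stack([np.ones(n), x1])        # small candidate model
Xb = np.column_stack([np.ones(n), x1, x2])    # large candidate model
fit_a = Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
fit_b = Xb @ np.linalg.lstsq(Xb, y, rcond=None)[0]
w = mallows_weight(y, fit_a, fit_b, k_a=2, k_b=3, sigma2=1.0)
```

With more candidates the grid search becomes a quadratic program over the weight simplex, which is what makes the procedure cheap to implement in practice.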
/statistica/J28N5/J28N53/J28N53.html
PETER HALL: MY MENTOR, COLLABORATOR AND FRIEND Peihua Qiu 2249-2259<span style='font-size=12pt;'><center>Abstract</center> Peter Hall left us about two years ago. His passing was an irreplaceable loss to the statistical community. I lost a long-time mentor, collaborator, and friend. In this article, I share with readers certain stages in my career during which Peter provided me much help, things that I learned from him about research and research attitude, our research collaborations, and more. I am only one of many statisticians who benefited from Peter's generosity in helping others, especially young researchers. My example demonstrates the importance and influence of Peter and his generosity on our growth and career development. <p>Key words and phrases: Collaboration, density deconvolution, fond memories, image processing, inverse problems, jump regression analysis, photography, steam trains. </span>
/statistica/J28N5/J28N530/J28N530.html
EMPIRICAL FOURIER METHODS FOR INTERVAL CENSORED DATA Peter G. Hall, W. John Braun and Thierry Duchesne 2811-2822<span style='font-size=12pt;'><center>Abstract</center> Methods for estimating the probability density function are considered under the circumstance that the underlying measurements are interval-censored. Density and distribution function estimators are proposed under parametric and nonparametric assumptions on the censoring mechanism. Conditions for identifiability and consistency of the estimates are established theoretically, and it is shown that under such conditions, the estimates converge to the truth at a polynomial rate in the inverse sample size. An online supplement contains the technical arguments as well as practical guidelines for numerical implementation of the proposed methods. The core of the theory in this paper was originally drafted by Peter Hall in early 2010, following discussions at a workshop on mismeasured data held in Canada in December 2009, at which Peter was the keynote speaker. The co-authors are grateful for the follow-up conversations held with Peter by long distance over the years prior to his much-regretted passing. <p>Key words and phrases: Characteristic functions, density estimation, kernel methods.</span>
/statistica/J28N5/J28N531/J28N531.html
ON P-VALUES Laurie Davies 2823-2840<span style='font-size=12pt;'><center>Abstract</center> In statistics, P-values are mostly used in the context of hypothesis testing. Software for linear regression assigns a P-value to every covariate, corresponding to a test of the hypothesis that the "true" value of the regression coefficient is zero. In this paper, several different uses and interpretations of P-values are discussed, ranging from the use of P-values as measures of approximation for parametric models and for location-scale M-functionals, to Jeffreys' criticism of P-values, and to the choice of covariates in linear regression without an error term. The approach is neither frequentist nor Bayesian. It is not frequentist, as the P-values are calculated and interpreted for the data at hand, and simply being a P-value makes it non-Bayesian. <p>Key words and phrases: Approximate models, approximation regions, choice of covariates, functionals, prediction, P-values and approximation.</span>
/statistica/J28N5/J28N532/J28N532.html
ON P-VALUES Laurie Davies 2823-2840<span style='font-size=12pt;'><center>Abstract</center> Covariate balance among different treatment arms is critical in clinical trials, as confounding effects can be effectively eliminated when patients in different arms are alike. To balance the prognostic factors across different arms, we propose a new dynamic scheme for patient allocation. Our approach does not require discretizing continuous covariates into multiple categories, and can handle both continuous and discrete covariates naturally. This is achieved by devising a statistical measure to characterize the similarity between a new patient and all the existing patients in the trial. Under the similarity weighting scheme, we develop a covariate-adaptive biased coin design and establish its theoretical properties, thus improving the original Pocock-Simon design. We conduct extensive simulation studies to examine the design's operating characteristics, and we illustrate our method with a data example. The new approach is thereby demonstrated to be superior to existing methods in terms of performance. <p>Key words and phrases: Biased coin design, clinical trial, covariate-adaptive randomization, covariate balance, Pocock and Simon design, similarity measure, stratification.</span>
/statistica/J28N5/J28N533/J28N533.html
TESTS FOR TAR MODELS VS STAR MODELS-A SEPARATE FAMILY OF HYPOTHESES APPROACH Zhaoxing Gao, Shiqing Ling and Howell Tong 2857-2883<span style='font-size=12pt;'><center>Abstract</center> The threshold autoregressive (TAR) model and the smooth threshold autoregressive (STAR) model have been popular parametric nonlinear time series models for the past three decades or so. As yet there is no formal statistical test in the literature for one against the other. The two models are fundamentally different in their autoregressive functions, the TAR model being generally discontinuous while the STAR model is smooth (except in the limit of infinitely fast switching for some cases). Following the approach initiated by Cox (1961, 1962), we treat the test problem as one of separate families of hypotheses. The test statistic under a STAR model is shown to follow asymptotically a chi-squared distribution, and the one under a TAR model can be expressed as a functional of a chi-squared process. We present numerical results with both simulated and real data to assess the performance of our procedure. <p>Key words and phrases: Non-nested test, separate family of hypotheses, STAR model, TAR model.</span>
/statistica/J28N5/J28N534/J28N534.html
LIKELIHOOD RATIO HAAR VARIANCE STABILIZATION AND NORMALIZATION FOR POISSON AND OTHER NON-GAUSSIAN NOISE REMOVAL Piotr Fryzlewicz 2885-2901<span style='font-size=12pt;'><center>Abstract</center> We propose a methodology for denoising, variance-stabilizing, and normalizing signals whose varying mean and variance are linked via a single parameter, such as Poisson or scaled chi-squared. Our key observation is that the signed and square-rooted generalized log-likelihood ratio test for the equality of the local means is approximately distributed as standard normal under the null. We use these test statistics within the Haar wavelet transform at each scale and location, referring to them as the likelihood ratio Haar (LRH) coefficients of the data. In the denoising algorithm, the LRH coefficients are used as thresholding decision statistics, which enables the use of thresholds suitable for i.i.d. Gaussian noise. In the variance-stabilizing and normalizing algorithm, the LRH coefficients replace the standard Haar coefficients in the Haar basis expansion. We prove the consistency of our LRH smoother for Poisson counts with a near-parametric rate, and various numerical experiments demonstrate the good practical performance of our methodology. <p>Key words and phrases: Anscombe transform, Box-Cox transform, Gaussianization, Haar-Fisz, log transform, variance-stabilizing transform.</span>
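The key observation in the abstract above can be sketched for the Poisson case: the signed, square-rooted generalized log-likelihood ratio statistic for the equality of two neighbouring local Poisson means is approximately standard normal under the null, so it can stand in for a Haar detail coefficient and be thresholded as if the noise were i.i.d. Gaussian. The following is a minimal illustrative sketch, not the authors' implementation; the function names and the zero-count convention are our own assumptions.

```python
from math import log, sqrt, copysign

def xlogx(x):
    # convention: 0 * log(0) = 0
    return x * log(x) if x > 0 else 0.0

def lrh_coefficient(x, y):
    """Signed square-rooted generalized log-likelihood ratio for testing
    equality of the means of two local Poisson counts x and y.
    Approximately N(0, 1) under the null of equal means, so it can be
    thresholded like a Haar detail coefficient under Gaussian noise."""
    if x == y:
        return 0.0
    # 2 * log-LR = 2 * [x log x + y log y - (x + y) log((x + y) / 2)]
    llr2 = 2.0 * (xlogx(x) + xlogx(y) - xlogx(x + y) + (x + y) * log(2.0))
    return copysign(sqrt(max(llr2, 0.0)), x - y)
```

For equal counts the coefficient is exactly zero, and swapping the two arguments flips its sign, mirroring the antisymmetry of an ordinary Haar detail coefficient.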
/statistica/J28N5/J28N535/J28N535.html
RFMS METHOD FOR CREDIT SCORING BASED ON BANK CARD TRANSACTION DATA Danyang Huang, Jing Zhou and Hansheng Wang 2903-2919<span style='font-size=12pt;'><center>Abstract</center> Microcredit refers to small loans to borrowers who typically lack collateral, steady employment, or a verifiable credit history. It is designed not only for start-ups but also for individuals. The microcredit industry is experiencing fast growth in China. In contrast with traditional loans, microcredit typically lacks collateral, which makes credit scoring important. Due to the fast development of online microcredit platforms, there are various sources of data that could be used for credit evaluation. Among them, bank card transaction records play an important role. How to conduct credit scoring based on this type of data is thus an important problem. The key issue to be solved is feature construction: how to construct meaningful and useful features based on bank card transaction data. To this end, we propose the RFMS method. Here "R" stands for recency, "F" stands for frequency, and "M" stands for monetary value. Our method can be viewed as a natural extension of the classical RFM model in marketing research. However, we make a further extension by taking "S" (standard deviation) into consideration. The performance of the method is empirically tested on a data example from a Chinese microcredit company.<p>Key words and phrases: Credit scoring, frequency, logistic regression, microcredit, monetary value, recency, standard deviation.</span>
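The RFMS feature construction described above can be sketched in a few lines: for each borrower, recency, frequency, monetary value, and the standard deviation of transaction amounts are computed from the raw (date, amount) records, and the resulting features could then be fed into, for example, a logistic regression. This is a minimal sketch under our own assumptions; the helper name `rfms_features` and the exact feature definitions are illustrative, not the authors' specification.

```python
from datetime import date
from statistics import mean, pstdev

def rfms_features(transactions, as_of):
    """Construct RFMS features from a list of (date, amount) records:
    R = days since the most recent transaction (recency),
    F = number of transactions (frequency),
    M = average transaction amount (monetary value),
    S = standard deviation of transaction amounts."""
    dates = [d for d, _ in transactions]
    amounts = [a for _, a in transactions]
    return {
        "R": (as_of - max(dates)).days,
        "F": len(transactions),
        "M": mean(amounts),
        "S": pstdev(amounts),
    }

# hypothetical transaction history for one borrower
txns = [(date(2018, 1, 5), 120.0),
        (date(2018, 2, 1), 80.0),
        (date(2018, 3, 10), 100.0)]
feats = rfms_features(txns, as_of=date(2018, 4, 1))
```

The resulting feature dictionary (one row per borrower) would form the design matrix for the downstream scoring model.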
/statistica/J28N5/J28N54/J28N54.html
PETER HALL ON EXTREMES: RESEARCH, TEACHING AND SUPERVISION A. H. Welsh 2261-2287<span style='font-size=12pt;'><center>Abstract</center> We examine Peter Hall's early research, undergraduate teaching, and PhD supervision, using the theme of extreme order statistics to highlight interesting aspects of these activities. Focusing on this period allows us to see Peter when, like all academics in the early part of their careers, he was becoming an academic and still establishing himself. That he succeeded so greatly, and so rapidly began to make the many remarkable contributions that adorn his distinguished career, makes this early, formative stage particularly interesting to explore. <p>Key words and phrases: Endpoint of a distribution, order statistics, rates of convergence, regular variation, threshold selection.</span>
/statistica/J28N5/J28N55/J28N55.html
WAVELET METHODS FOR ERRATIC REGRESSION MEANS IN THE PRESENCE OF MEASUREMENT ERROR Peter Hall, Spiridon Penev and Jason Tran 2289-2307<span style='font-size=12pt;'><center>Abstract</center> In nonparametric regression with errors in the explanatory variable, the regression function is typically assumed to be smooth, and in particular not to have a rapidly changing derivative. Not all data applications have this property. When the property fails, conventional techniques, usually based on kernel methods, have unsatisfactory performance. We suggest an adaptive, wavelet-based approach, founded on the concept of explained sum of squares, and using matrix regularisation to reduce noise. This non-standard technique is used because conventional wavelet methods fail to estimate wavelet coefficients consistently in the presence of measurement error. We assume that the measurement error distribution is known. Our approach enjoys very good performance, especially when the regression function is erratic. Pronounced maxima and minima are recovered more accurately than when using conventional methods that tend to flatten peaks and troughs. We also show that wavelet techniques have advantages when estimating conventional, smooth functions since they require less sophisticated smoothing parameter choice. That problem is particularly challenging in the setting of measurement error. A data example is discussed and a simulation study is presented. <p>Key words and phrases: Chirp, cross-validation, deconvolution, discontinuity, errors in variables, error sum of squares, explained sum of squares, kernel methods.</span>
/statistica/J28N5/J28N56/J28N56.html
SEMI-PARAMETRIC PREDICTION INTERVALS IN SMALL AREAS WHEN AUXILIARY DATA ARE MEASURED WITH ERROR Gauri Datta, Aurore Delaigle, Peter Hall and Li Wang 2309-2335<span style='font-size=12pt;'><center>Abstract</center> In recent years, demand for reliable small area statistics has considerably increased, but the size of samples obtained in small areas is too often small to produce accurate predictors of quantities of interest. To overcome this difficulty, a common approach is to use auxiliary data from other areas or other sources, and produce estimators that combine them with direct data. A popular model for combining direct and indirect data sources is the Fay-Herriot model, which assumes that the auxiliary variables are observed accurately. However, these variables are often subject to measurement errors, and not taking this into account can lead to estimators that are even worse than those based exclusively on the direct data. We consider structural measurement error models and a semi-parametric approach based on the Fay-Herriot model to produce reliable prediction intervals for small area characteristics of interest. Our theoretical study reveals the surprising fact that the properties of the prediction interval are not the same for all values of the noisy covariate. Indeed, the convergence rates are slower when the contaminated covariate takes the value zero than in other cases. Our procedure is illustrated with an application and simulation studies. <p>Key words and phrases: Deconvolution, density estimation, Fay-Herriot model, Fourier transform, Laplace distribution.</span>
/statistica/J28N5/J28N57/J28N57.html
CLUSTERING IN GENERAL MEASUREMENT ERROR MODELS Ya Su, Jill Reedy and Raymond J. Carroll 2337-2351<span style='font-size=12pt;'><center>Abstract</center> This paper is dedicated to the memory of Peter G. Hall. It concerns a deceptively simple question: if one observes variables corrupted with measurement error of possibly very complex form, can one recreate asymptotically the clusters that would have been found had there been no measurement error? We show that the answer is yes, and that the solution is surprisingly simple and general. The method itself is to simulate, by computer, realizations with the same distribution as that of the true variables, and then to apply clustering to these realizations. Technically, we show that if one uses K-means clustering or any other risk minimizing clustering, and a multivariate deconvolution device with certain smoothness and convergence properties, then, in the limit, the cluster means based on our method converge to the same cluster means as if there were no measurement error. Along with the method and its technical justification, we analyze two important nutrition data sets, finding patterns that make sense nutritionally. <p>Key words and phrases: Clustering, deconvolution, k-means, measurement error, mixtures of distributions.</span>
/statistica/J28N5/J28N58/J28N58.html
ESTIMATION OF ERRORS-IN-VARIABLES PARTIALLY LINEAR ADDITIVE MODELS Eun Ryung Lee, Kyunghee Han and Byeong U. Park 2353-2373<span style='font-size=12pt;'><center>Abstract</center> In this paper we consider partially linear additive models where the predictors in the parametric and in the nonparametric parts are contaminated by measurement errors. We propose an estimator of the parametric part and show that it achieves √n-consistency in a certain range of the smoothness of the measurement errors in the nonparametric part. We also derive the convergence rate of the parametric estimator when the smoothness of the measurement errors falls outside this range. Furthermore, we suggest an estimator of the additive function in the nonparametric part that achieves the optimal one-dimensional convergence rate in nonparametric deconvolution problems. We conducted a simulation study that confirms our theoretical findings.<p>Key words and phrases: Attenuation, deconvolution, errors in variables, kernel smoothing, rate of convergence, smooth backfitting.</span>
/statistica/J28N5/J28N59/J28N59.html
PETER HALL'S CONTRIBUTION TO EMPIRICAL LIKELIHOOD Jinyuan Chang, Jianjun Guo and Cheng Yong Tang 2375-2387<span style='font-size=12pt;'><center>Abstract</center> We deeply mourn the loss of Peter Hall. Peter was the premier mathematical statistician of his era. His work illuminated many aspects of statistical thought. While his body of work on bootstrap and nonparametric smoothing is widely known and appreciated, less well known is his work in many other areas. In this article, we review Peter's contribution to empirical likelihood (EL). Peter has done fundamental work on studying the coverage accuracy of confidence regions constructed with EL.<p>Key words and phrases: Balanced incomplete block design, Hadamard matrix, nearly balanced incomplete block design, orthogonal array.</span>