Statistica Sinica: Volume 27, Number 4, October 2017
http://www3.stat.sinica.edu.tw/statistica/
Fri, 17 March 2017 00:01:00 +0000
/statistica/J27N4/J27N41/J27N41-10.html
DISSECTING MULTIPLE IMPUTATION FROM A MULTI-PHASE INFERENCE PERSPECTIVE: WHAT HAPPENS WHEN GOD, IMPUTER AND ANALYST MODELS ARE UNCONGENIAL? Xianchao Xie and Xiao-Li Meng 1485-1594<span style='font-size=12pt;'><center>Abstract</center> Real-life data are almost never really real. By the time the data arrive at an investigator’s desk or disk, the raw data, however defined, have most likely gone through at least one “cleaning” process, such as standardization, re-calibration, imputation, or de-sensitization. Dealing with such a reality scientifically requires a more holistic multi-phase perspective than is permitted by the usual framework of “God’s model versus my model.” This article provides an in-depth look, from this broader perspective, into multiple-imputation (MI) inference (Rubin (1987)) under uncongeniality (Meng (1994)). We present a general estimating-equation decomposition theorem, resulting in an analytic (asymptotic) description of MI inference as an integration of the knowledge of the imputer and the analyst, and establish a characterization of self-efficiency (Meng (1994)) for regulating estimation procedures. These results help to reveal how the quality of and relationship between the imputer’s model and analyst’s procedure affect MI inference, including how a seemingly perfect procedure under the “God-versus-me” paradigm is actually inadmissible when God’s, imputer’s, and analyst’s models are uncongenial to each other. Our theoretical investigation also leads to useful procedures that are as trivially implementable as Rubin’s combining rules, yet with confidence coverage guaranteed to be minimally the nominal level, under any degree of uncongeniality. We reveal that the relationship is very complex between the validity of approaches taken for individual phases and the validity of the final multi-phase inference, and indeed that it is a nontrivial matter to quantify or even qualify the meaning of validity itself in such settings. 
These results and many open problems are presented to raise the general awareness that the multi-phase inference paradigm is an uncongenial forest populated by thorns, as well as some fruits, many of which are still low-hanging. <p>Key words and phrases: Confidence validity, data cleaning, estimating equation decomposition, incomplete data, multi-phase inference, pre-processing, self-efficiency, strong efficiency, uncongeniality.</span>
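For readers unfamiliar with the combining rules this abstract refers to, here is a minimal numpy sketch of Rubin's rules for pooling M complete-data analyses (toy inputs, not code from the paper):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool M complete-data analyses via Rubin's combining rules.

    estimates, variances: length-M sequences of point estimates and
    their within-imputation variances from the M imputed data sets.
    Returns the pooled estimate, total variance, and degrees of freedom.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                     # pooled point estimate
    u_bar = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # Rubin's total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df

q_bar, t, df = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The uncongeniality results above concern precisely when intervals built from this total variance keep, or lose, their nominal coverage.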
/statistica/J27N4/J27N411/J27N411.html
LEARNING SUMMARY STATISTIC FOR APPROXIMATE BAYESIAN COMPUTATION VIA DEEP NEURAL NETWORK Bai Jiang, Tung-Yu Wu, Charles Zheng and Wing H. Wong 1595-1618<span style='font-size=12pt;'><center>Abstract</center> Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors. <p>Key words and phrases: Approximate Bayesian computation, deep learning, summary statistic.</span>
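The idea of a learned summary statistic can be sketched in a few lines. The following toy example substitutes a linear least-squares fit for the deep network (the model, sample sizes, and tolerance are all illustrative assumptions, not the paper's setup); the learned map approximates the posterior mean and is then used as the summary in rejection ABC:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (ours, not the paper's): theta ~ Uniform(0, 1),
# data = 20 draws from N(theta, 1).
def simulate(theta, n=20):
    return rng.normal(theta, 1.0, size=n)

# Training pairs (data, theta) for learning the summary statistic.
thetas = rng.uniform(0, 1, size=2000)
X = np.stack([simulate(t) for t in thetas])

# Stand-in for the deep network: linear least squares from data to
# theta; its prediction approximates the posterior mean E[theta | data].
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, thetas, rcond=None)

def summary(x):
    return np.append(x, 1.0) @ w

# Rejection ABC using the learned one-dimensional summary.
obs = simulate(0.5)                       # "observed" data, theta = 0.5
s_obs = summary(obs)
cand = rng.uniform(0, 1, size=5000)
sims = np.array([summary(simulate(t)) for t in cand])
accepted = cand[np.abs(sims - s_obs) < 0.05]
posterior_mean = accepted.mean()
```

In the paper the regression step is a deep network, which matters exactly when the posterior mean is not a linear function of the data, as in the Ising and moving-average examples.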
/statistica/J27N4/J27N412/J27N412.html
DANTZIG-TYPE PENALIZATION FOR MULTIPLE QUANTILE REGRESSION WITH HIGH DIMENSIONAL COVARIATES Seyoung Park, Xuming He and Shuheng Zhou 1619-1638<span style='font-size=12pt;'><center>Abstract</center> We study joint quantile regression at multiple quantile levels with high-dimensional covariates. Variable selection performed at individual quantile levels may lack stability across neighboring quantiles, making it difficult to understand and to interpret the impact of a given covariate on conditional quantile functions. We propose a Dantzig–type penalization method for sparse model selection at each quantile level which, at the same time, aims to shrink differences of the selected models across neighboring quantiles. We show model selection consistency, and investigate the stability of the selected models across quantiles. We also provide asymptotic normality of post–model–selection parameter estimation in the multiple quantile framework. We use numerical examples and data analysis to demonstrate that the proposed Dantzig–type quantile regression model selection method provides stable models for both homogeneous and heterogeneous cases. <p>Key words and phrases: Fused lasso, high dimensional data, model selection, quantile regression, stability.</span>
/statistica/J27N4/J27N413/J27N413.html
GENERIC SAMPLE SPLITTING FOR REFINED COMMUNITY RECOVERY IN DEGREE CORRECTED STOCHASTIC BLOCK MODELS Jing Lei and Lingxue Zhu 1639-1659<span style='font-size=12pt;'><center>Abstract</center> We study the problem of community recovery in stochastic block models and degree corrected block models. We show that a simple sample splitting trick can refine almost any approximately correct community recovery method to achieve exactly correct community recovery when the expected node degrees are of order log n or higher. Our results simplify and extend some of the previous work on exact community recovery using sample splitting, and provide better theoretical guarantees for degree corrected stochastic block models. <p>Key words and phrases: Block models, clustering, community detection, network data, sample splitting.</span>
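The splitting trick can be illustrated concretely: split the observed edges into two halves, run a rough method on one half, then refine each node's label using the held-out half. A minimal two-block sketch (block sizes, edge probabilities, and the spectral initializer are illustrative choices, not the paper's generic scheme):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-block SBM (toy sizes/probabilities): within 0.6, between 0.05.
n, half = 120, 60
truth = np.array([0] * half + [1] * half)
P = np.where(truth[:, None] == truth[None, :], 0.6, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, no self-loops

# Sample splitting: each observed edge is assigned to A1 or A2.
mask = np.triu(rng.random((n, n)) < 0.5, 1)
mask = mask | mask.T
A1, A2 = A * mask, A * ~mask

# Rough recovery on A1: sign of the second-largest eigenvector.
vals, vecs = np.linalg.eigh(A1)
rough = (vecs[:, -2] > 0).astype(int)

# Refinement on the held-out half A2: each node joins the block
# holding more of its A2-neighbors under the rough labels.
links0 = A2 @ (rough == 0)
links1 = A2 @ (rough == 1)
refined = (links1 > links0).astype(int)

# Labels are recovered only up to a global swap.
accuracy = max(np.mean(refined == truth), np.mean(refined != truth))
```

The point of the split is independence: the refinement counts use edges the rough clustering never saw, which is what makes exact-recovery arguments go through.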
/statistica/J27N4/J27N414/J27N414.html
ESTIMATION OF QUANTILES FROM DATA WITH ADDITIONAL MEASUREMENT ERRORS Matthias Hansmann and Michael Kohler 1661-1673<span style='font-size=12pt;'><center>Abstract</center> In this paper we study the problem of estimating quantiles from data that contain additional measurement errors. The only assumption on these errors is that the average absolute measurement error converges to zero for sample size tending to infinity with probability one. In particular we do not assume that the measurement errors are independent with expectation zero. We show that the empirical measure based on the data with measurement errors leads to an estimator which approaches the quantile set asymptotically. Provided the quantile is uniquely determined, this implies that this quantile estimate is strongly consistent for the true quantile. If this assumption does not hold, we also show that we can construct estimators for the limits of the quantile set if the average absolute measurement error is bounded by a given sequence, that tends to zero for sample size tending to infinity with probability one. But if such a sequence, which upper bounds the measurement errors, is not given, we show that there exists no estimator that is consistent for every distribution of the underlying random variable and all data containing the measurement errors. We derive the rate of convergence of our estimator and show that the derived rate of convergence is optimal. The results are applied in simulations and in the context of experimental fatigue tests.<p>Key words and phrases: Consistency, experimental fatigue tests, quantile estimation, rate of convergence.</span>
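The central claim, that the plain empirical quantile of the contaminated data already works once the average absolute error vanishes, can be checked numerically. A small sketch (the Exponential example, the dependent sign-based errors, and the 1/sqrt(n) rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_quantile(y, alpha):
    """alpha-quantile of the empirical measure of y."""
    return np.sort(y)[int(np.ceil(alpha * len(y))) - 1]

# True variable: Exponential(1), whose median is ln 2.
n = 200000
x = rng.exponential(1.0, size=n)

# Additional measurement errors: not independent of x and not mean
# zero, but with average absolute size shrinking as n grows -- the
# paper's only assumption on the errors.
errors = np.sign(x - 1.0) / np.sqrt(n)
y = x + errors

est = empirical_quantile(y, 0.5)
```

Note the errors here are deliberately dependent on x with nonzero mean; consistency survives because only their average absolute size matters.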
/statistica/J27N4/J27N415/J27N415.html
COHERENCE FOR MULTIVARIATE RANDOM FIELDS William Kleiber 1675-1697
/statistica/J27N4/J27N416/J27N416.html
REGRESSION ANALYSIS WITH RESPONSE-SELECTIVE SAMPLING Kani Chen, Yuanyuan Lin, Yuan Yao and Chaoxu Zhou 1699-1714<span style='font-size=12pt;'><center>Abstract</center> Response-selective sampling, in which samples are drawn from a population according to the values of the response variable, is common in biomedical, epidemiological, economic and social studies. This paper proposes to use transformation models, the generalized accelerated failure time models in econometrics, for regression analysis with response-selective sampling. With unknown error distribution, the transformation models are broad enough to cover linear regression models, Cox model, and the proportional odds model as special cases. To the best of our knowledge, except for the case-control logistic regression, there is presently no prospective estimation approach that can work for biased sampling without modification. We prove that the maximum rank correlation estimation is valid for response-selective sampling and establish its consistency and asymptotic normality. Unlike inverse probability methods, the proposed method of estimation does not involve sampling probabilities, which are often difficult to obtain in practice. Without the need of estimating the unknown transformation function or the error distribution, the proposed method is numerically easy to implement with the Nelder-Mead simplex algorithm that does not require convexity or continuity. We propose an inference procedure using random weighting to avoid the complication of density estimation when using the plug-in rule for variance estimation. Numerical studies with supportive evidence are presented. Application is illustrated with the Forbes Global 2000 data.<p>Key words and phrases: General transformation model, maximum rank correlation, random weighting, response-selective sampling.</span>
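The maximum rank correlation objective is a step function of the coefficients, which is why a derivative-free search is natural. A toy sketch for a two-covariate transformation model (the grid search over directions stands in for the Nelder-Mead simplex the paper uses; all data-generating choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Transformation model: y = exp(x'beta + eps); beta is identified
# only up to scale.
n = 300
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0])
y = np.exp(X @ beta_true + rng.normal(size=n))

gt_y = y[:, None] > y[None, :]            # concordance pattern of y

def rank_corr(beta):
    """Fraction of pairs on which y and the index x'beta agree in
    order -- a step function of beta, hence derivative-free search."""
    idx = X @ beta
    return np.mean(gt_y == (idx[:, None] > idx[None, :]))

# Scale is not identified, so search directions on the half circle.
angles = np.linspace(0.0, np.pi, 721)
scores = [rank_corr(np.array([np.cos(t), np.sin(t)])) for t in angles]
t_hat = angles[int(np.argmax(scores))]
beta_hat = np.array([np.cos(t_hat), np.sin(t_hat)])
```

Because the objective depends on the data only through ranks, it is unchanged by response-selective sampling of y, which is the intuition behind the paper's validity result.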
/statistica/J27N4/J27N417/J27N417.html
CIRCULANT PARTIAL HADAMARD MATRICES: CONSTRUCTION VIA GENERAL DIFFERENCE SETS AND ITS APPLICATION TO fMRI EXPERIMENTS Yuan-Lung Lin, Frederick Kin Hing Phoa and Ming-Hung Kao 1715-1724
/statistica/J27N4/J27N418/J27N418.html
THE ORDERING OF SHANNON ENTROPIES Ming-Tien Tsai 1725-1729<span style='font-size=12pt;'><center>Abstract</center> Via the transformation of the convex ordering of distributions to the Lorenz ordering of new distributions, the information ordering of Shannon entropies is established. The measure of the difference between two Shannon entropies enjoys some merits.<p>Key words and phrases: Convex ordering, Lorenz ordering.</span>
/statistica/J27N4/J27N419/J27N419.html
ADAPTIVE FALSE DISCOVERY RATE CONTROL FOR HETEROGENEOUS DATA Joshua D. Habiger 1731-1756<span style='font-size=12pt;'><center>Abstract</center> Efforts to develop more efficient multiple hypothesis testing procedures for false discovery rate (FDR) control have focused on incorporating an estimate of the proportion of true null hypotheses (such procedures are called adaptive) or exploiting heterogeneity across tests via some optimal weighting scheme. This paper combines these approaches using a weighted adaptive multiple decision function (WAMDF) framework. Optimal weights for a flexible random effects model are derived and a WAMDF that controls the FDR for arbitrary weighting schemes when test statistics are independent under the null hypotheses is given. Asymptotic and numerical assessment reveals that, under weak dependence, the proposed WAMDFs provide more efficient FDR control even if optimal weights are misspecified. The robustness and flexibility of the proposed methodology facilitates the development of more efficient, yet practical, FDR procedures for heterogeneous data. To illustrate, two different weighted adaptive FDR methods for heterogeneous sample sizes are developed and applied to data. <p>Key words and phrases: Decision function, multiple testing, p-value, weighted p-value.</span>
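The two ingredients being combined, weighting and adaptivity, fit in one short procedure. A generic sketch (weighted p-values plus a Storey-type null-proportion estimate feeding Benjamini-Hochberg; this is an illustration of the ingredients, not the paper's WAMDF):

```python
import numpy as np

def weighted_adaptive_bh(pvals, weights, alpha=0.05, lam=0.5):
    """Adaptive Benjamini-Hochberg on weighted p-values.

    Weights should average to one; the null proportion is estimated
    by Storey's method and used to inflate the BH threshold (the
    "adaptive" step).
    """
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    pw = np.minimum(p / w, 1.0)               # weighted p-values
    m = len(p)
    pi0 = min(1.0, (1 + np.sum(pw > lam)) / (m * (1 - lam)))
    order = np.argsort(pw)
    thresh = alpha * np.arange(1, m + 1) / (m * pi0)
    below = pw[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

p = [0.001, 0.002, 0.8, 0.9, 0.04, 0.5]
flat = weighted_adaptive_bh(p, [1, 1, 1, 1, 1, 1])
# Up-weighting the fifth test (say, a larger sample size) rescues it.
boosted = weighted_adaptive_bh(p, [1, 1, 0.5, 1, 2, 0.5])
```

In the uniform-weight run only the two smallest p-values are rejected; shifting weight toward the fifth test additionally rejects p = 0.04, which is the heterogeneity effect the paper optimizes.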
/statistica/J27N4/J27N420/J27N420.html
ADAPTIVE BASIS SELECTION FOR EXPONENTIAL FAMILY SMOOTHING SPLINES WITH APPLICATION IN JOINT MODELING OF MULTIPLE SEQUENCING SAMPLES Ping Ma, Nan Zhang, Jianhua Z. Huang and Wenxuan Zhong 1757-1777<span style='font-size=12pt;'><center>Abstract</center> Second-generation sequencing technologies have replaced array-based technologies and become the default method for genomics and epigenomics analysis. Second-generation sequencing technologies sequence tens of millions of DNA/cDNA fragments in parallel. After the resulting sequences (short reads) are mapped to the genome, one gets a sequence of short read counts along the genome. Effective extraction of signals in these short read counts is the key to the success of sequencing technologies. Nonparametric methods, in particular smoothing splines, have been used extensively for modeling and processing single sequencing samples. However, nonparametric joint modeling of multiple second-generation sequencing samples is still lacking due to computational cost. In this article, we develop an adaptive basis selection method for efficient computation of exponential family smoothing splines for modeling multiple second-generation sequencing samples. Our adaptive basis selection gives a sparse approximation of smoothing splines, yielding a lower-dimensional effective model space for a more scalable computation. The asymptotic analysis shows that the effective model space is rich enough to retain essential features of the data. Moreover, exponential family smoothing spline models computed via adaptive basis selection are shown to have good statistical properties, e.g., convergence at the same rate as that of full basis exponential family smoothing splines. The empirical performance is demonstrated through simulation studies and two second-generation sequencing data examples. <p>Key words and phrases: Bisulfite sequencing, generalized linear model, nonparametric regression, penalized likelihood, RNA-seq, sampling.</span>
/statistica/J27N4/J27N421/J27N421.html
ON PARAMETER ESTIMATION OF TWO-DIMENSIONAL POLYNOMIAL PHASE SIGNAL MODEL Ananya Lahiri and Debasis Kundu 1779-1792<span style='font-size=12pt;'><center>Abstract</center> Two-dimensional (2-D) polynomial phase signals occur in different areas of image processing. When the degree of the polynomial is two they are called chirp signals. In this paper, we consider the least squares estimators of the unknown parameters of the 2-D polynomial phase signal model in the presence of stationary noise, and derive their properties. The proposed least squares estimators are strongly consistent and we obtained their asymptotic distributions. It is observed that asymptotically the least squares estimators are normally distributed. We perform some simulation experiments to observe their behavior. <p>Key words and phrases: Asymptotic distribution, least squares estimators, linear processes, polynomial phase signals, strong consistency.</span>
/statistica/J27N4/J27N422/J27N422.html
ROBUST HYPOTHESIS TESTING VIA Lq-LIKELIHOOD Yichen Qin and Carey E. Priebe 1793-1813<span style='font-size=12pt;'><center>Abstract</center> This article introduces a robust hypothesis testing procedure: the Lq-likelihood-ratio-type test (LqRT). By deriving the asymptotic distribution of the test statistic, we demonstrate its robustness analytically and numerically, and investigate the properties of its influence function and breakdown point. A proposed method to select the tuning parameter q offers a good efficiency/robustness trade-off compared with the traditional likelihood ratio test (LRT) and other robust tests. Simulation and a real data analysis provide further evidence of the advantages of the proposed LqRT method. In particular, for the special case of testing the location parameter in the presence of gross error contamination, the LqRT dominates the Wilcoxon-Mann-Whitney test and the sign test at various levels of contamination.<p>Key words and phrases: Gross error model, relative efficiency, robustness.</span>
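The Lq-likelihood-ratio statistic replaces log u by the bounded transform (u^(1-q) - 1)/(1-q), which caps each observation's influence. A minimal normal-location sketch (grid maximization stands in for a proper optimizer; the contamination scenario and q = 0.9 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def lq(u, q):
    """Lq transform of the density; recovers log(u) as q -> 1."""
    return np.log(u) if q == 1 else (u ** (1 - q) - 1) / (1 - q)

def lq_loglik(x, mu, q):
    """Lq-likelihood of a N(mu, 1) location model."""
    dens = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
    return lq(dens, q).sum()

def lq_ratio_stat(x, mu0=0.0, q=0.9):
    """2 * (sup_mu Lq-lik - Lq-lik at mu0), sup taken over a grid."""
    grid = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 2001)
    best = max(lq_loglik(x, m, q) for m in grid)
    return 2.0 * (best - lq_loglik(x, mu0, q))

# Near the null: the statistic stays small.
stat_null = lq_ratio_stat(rng.normal(0.0, 1.0, size=200))
# Shifted mean with 10% gross-error contamination: the statistic is
# large, while each outlier's Lq contribution stays bounded.
x_alt = np.concatenate([rng.normal(1.0, 1.0, size=180),
                        rng.normal(1.0, 10.0, size=20)])
stat_alt = lq_ratio_stat(x_alt)
```

For q < 1 the transform lq(u) is bounded below as u -> 0, so a gross outlier contributes at most a fixed amount, unlike the ordinary log-likelihood.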
/statistica/J27N4/J27N423/J27N423.html
THE INDEPENDENCE PROCESS IN CONDITIONAL QUANTILE LOCATION-SCALE MODELS AND AN APPLICATION TO TESTING FOR MONOTONICITY Melanie Birke, Natalie Neumeyer and Stanislav Volgushev 1815-1839<span style='font-size=12pt;'><center>Abstract</center> In this paper the nonparametric quantile regression model is considered in a location-scale context. The asymptotic properties of the empirical independence process based on covariates and estimated residuals are investigated. In particular an asymptotic expansion and weak convergence to a Gaussian process are proved. The results can be applied to test for validity of the location-scale model, and they allow one to derive various specification tests in conditional quantile location-scale models. A test for monotonicity of the conditional quantile curve is investigated. For the test for validity of the location-scale model, as well as for the monotonicity test, smooth residual bootstrap versions of Kolmogorov-Smirnov and Cramér-von Mises type test statistics are suggested. We give proofs for bootstrap versions of the weak convergence results. The performance of the tests is demonstrated in a simulation study.<p>Key words and phrases: Bootstrap, empirical independence process, Kolmogorov-Smirnov test, model test, monotone rearrangements, nonparametric quantile regression, residual processes, sequential empirical process.</span>
/statistica/J27N4/J27N424/J27N424.html
OPTIMAL DESIGNS FOR REGRESSION MODELS USING THE SECOND-ORDER LEAST SQUARES ESTIMATOR Yue Yin and Julie Zhou 1841-1856<span style='font-size=12pt;'><center>Abstract</center> We investigate properties and numerical algorithms for A- and D-optimal regression designs based on the second-order least squares estimator (SLSE). Several results are derived, including a characterization of the A-optimality criterion. We can formulate the optimal design problems under SLSE as semidefinite programming or convex optimization problems and we show that the resulting algorithms can be faster than more conventional multiplicative algorithms, especially in nonlinear models. Our results also indicate that the optimal designs based on the SLSE are more efficient than those based on the ordinary least squares estimator, provided the error distribution is highly skewed.<p>Key words and phrases: A-optimal design, convex optimization, D-optimal design, multiplicative algorithm, nonlinear model, SeDuMi, transformation invariance.</span>
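As context for the comparison with multiplicative algorithms, here is the classical multiplicative algorithm for an ordinary D-optimal design on a toy quadratic model (this is the OLS baseline, not the paper's SLSE criterion or its semidefinite-programming formulation):

```python
import numpy as np

# Candidate design points for quadratic regression f(x) = (1, x, x^2).
xs = np.linspace(-1.0, 1.0, 21)
F = np.stack([np.ones_like(xs), xs, xs ** 2], axis=1)
p = F.shape[1]

# Titterington-style multiplicative update: w_i <- w_i * d_i / p,
# where d_i is the normalized prediction variance at point i.
w = np.full(len(xs), 1.0 / len(xs))
for _ in range(2000):
    M = F.T @ (w[:, None] * F)                     # information matrix
    d = np.einsum('ij,jk,ik->i', F, np.linalg.inv(M), F)
    w *= d / p

support = xs[w > 1e-3]
```

The weights concentrate on {-1, 0, 1} with mass 1/3 each, the known D-optimal design for a quadratic on [-1, 1]; the paper's point is that convex-optimization solvers can reach such solutions faster than this fixed-point iteration, especially for nonlinear models.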
/statistica/J27N4/J27N425/J27N425.html
PREDICTING DISEASE RISK BY TRANSFORMATION MODELS IN THE PRESENCE OF UNSPECIFIED SUBGROUP MEMBERSHIP Qianqian Wang, Yanyuan Ma and Yuanjia Wang 1857-1878<span style='font-size=12pt;'><center>Abstract</center> Some biomedical studies lead to mixture data. When a subgroup membership is missing for some of the subjects in a study, the distribution of the outcome is a mixture of the subgroup-specific distributions. Taking into account the uncertain distribution of the group membership and the covariates, we model the relation between the disease onset time and the covariates through transformation models in each sub-population, and develop a nonparametric maximum likelihood-based estimation implemented through the EM algorithm along with its inference procedure. We propose methods to identify the covariates that have different effects or common effects in distinct populations, which enables parsimonious modeling and better understanding of the differences across populations. The methods are illustrated through extensive simulation studies and a data example.<p>Key words and phrases: Censored data, EM algorithm, Laplace transformation, mixed populations, semiparametric models, transformation models, uncertain population identifier.</span>
/statistica/J27N4/J27N426/J27N426.html
D-OPTIMAL DESIGNS WITH ORDERED CATEGORICAL DATA Jie Yang, Liping Tong and Abhyuday Mandal 1879-1902<span style='font-size=12pt;'><center>Abstract</center> Cumulative link models have been widely used for ordered categorical responses. Uniform allocation of experimental units is commonly used in practice, but often suffers from a lack of efficiency. We consider D-optimal designs with ordered categorical responses and cumulative link models. For a predetermined set of design points, we derive the necessary and sufficient conditions for an allocation to be locally D-optimal and develop efficient algorithms for obtaining approximate and exact designs. We prove that the number of support points in a minimally supported design only depends on the number of predictors, which can be much less than the number of parameters in the model. We show that a D-optimal minimally supported allocation in this case is usually not uniform on its support points. In addition, we provide EW D-optimal designs as a highly efficient surrogate to Bayesian D-optimal designs. Both of them can be much more robust than uniform designs. <p>Key words and phrases: Approximate design, cumulative link model, exact design, minimally supported design, multinomial response, ordinal data.</span>
/statistica/J27N4/J27N427/J27N427.html
ORACLE INEQUALITIES AND SELECTION CONSISTENCY FOR WEIGHTED LASSO IN HIGH-DIMENSIONAL ADDITIVE HAZARDS MODEL Haixiang Zhang, Liuquan Sun, Yong Zhou and Jian Huang 1903-1920<span style='font-size=12pt;'><center>Abstract</center> The additive hazards model has many applications in high-throughput genomic data analysis and clinical studies. In this article, we study the weighted Lasso estimator for the additive hazards model in sparse, high-dimensional settings where the number of time-dependent covariates is much larger than the sample size. Based on compatibility, cone invertibility factors, and restricted eigenvalues of the Hessian matrix, we establish some non-asymptotic oracle inequalities for the weighted Lasso. Under mild conditions, we show that these quantities are bounded from below by positive constants, thus the compatibility and cone invertibility factors can be treated as positive constants in the oracle inequalities. A multistage adaptive method with weights recursively generated from a concave penalty is presented. We prove a selection consistency theorem and establish an upper bound for dimension of the weighted Lasso estimator.<p>Key words and phrases: High-dimensional covariates, oracle inequalities, sign consistency, survival analysis, variable selection.</span>
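The weighted-Lasso and recursive-weighting ideas can be sketched in a linear model, where the weighted soft-thresholding step is easy to see (the additive-hazards estimating equations of the paper are replaced here by least squares; the pilot-based weights show one stage of adaptive weighting, and all tuning values are illustrative):

```python
import numpy as np

def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual
            rho = X[:, j] @ r / n
            t = lam * weights[j]                  # per-coordinate threshold
            b[j] = np.sign(rho) * max(abs(rho) - t, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(7)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.normal(size=n)

# Adaptive-style weights from a crude pilot: larger pilot signal,
# smaller penalty.
pilot = np.abs(X.T @ y / n)
b_hat = weighted_lasso(X, y, lam=0.1, weights=1.0 / (pilot + 0.1))
```

Coordinates with strong pilot evidence get a light penalty and survive; the rest are thresholded to zero, which is the sign-consistency mechanism the paper analyzes for concave-penalty-generated weights.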
/statistica/J27N4/J27N428/J27N428.html
ASYMPTOTIC THEORY FOR ESTIMATING THE SINGULAR VECTORS AND VALUES OF A PARTIALLY-OBSERVED LOW RANK MATRIX WITH NOISE Juhee Cho, Donggyu Kim and Karl Rohe 1921-1948<span style='font-size=12pt;'><center>Abstract</center> Matrix completion algorithms recover a low rank matrix from a small fraction of the entries, each entry contaminated with additive errors. In practice, the singular vectors and singular values of the low rank matrix play a pivotal role for statistical analyses and inferences. This paper proposes estimators of these quantities and studies their asymptotic behavior. Under the setting where the dimensions of the matrix increase to infinity and the probability of observing each entry is identical, Theorem 1 gives the rate of convergence for the estimated singular vectors; Theorem 3 gives a multivariate central limit theorem for the estimated singular values. Even though the estimators use only a partially observed matrix, they achieve the same rates of convergence as the fully observed case. These estimators combine to form a consistent estimator of the full low rank matrix that is computed with a non-iterative algorithm. In the cases studied in this paper, this estimator achieves the minimax lower bound in Koltchinskii, Lounici and Tsybakov (2011). The numerical experiments corroborate our theoretical results.<p>Key words and phrases: Low rank matrices, matrix completion, matrix estimation, singular value decomposition.</span>
/statistica/J27N4/J27N429/J27N429.html
SOME INSIGHTS ABOUT THE SMALL BALL PROBABILITY FACTORIZATION FOR HILBERT RANDOM ELEMENTS Enea G. Bongiorno and Aldo Goia 1949-1965
/statistica/J27N4/J27N430/J27N430.html
CONCORDANCE MEASURE-BASED FEATURE SCREENING AND VARIABLE SELECTION Yunbei Ma, Yi Li, Huazhen Lin and Yi Li 1967-1985<span style='font-size=12pt;'><center>Abstract</center> The C-statistic, measuring the rank concordance between predictors and outcomes, has become a standard metric of predictive accuracy and is therefore a natural criterion for variable screening and selection. However, as the C-statistic is a step function, its optimization requires brute-force search, prohibiting its direct usage in the presence of high-dimensional predictors. We propose a smoothed C-statistic sure screening (C-SS) method for screening ultrahigh-dimensional data, and a penalized C-statistic (PSC) variable selection method for regularized modeling based on the screening results. We show that these procedures form an integrated framework for screening and variable selection: the C-SS possesses the sure screening property, and the PSC possesses the oracle property. Our simulations reveal that, compared to existing procedures, our proposal is more robust and efficient. Our procedure has been applied to analyze a multiple myeloma study, and has identified several novel genes that can predict patients' response to treatment.<p>Key words and phrases: C-statistic, false positive rates, sparsity, ultra-high dimensional predictors, variable selection, variable screening.</span>
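The smoothing step that makes the C-statistic optimizable is simple to illustrate: replace the pairwise indicator with a sigmoid. A toy sketch (the sigmoid kernel and bandwidth are illustrative choices, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy single-predictor data: larger x tends to give larger y.
n = 400
x = rng.normal(size=n)
y = x + rng.normal(size=n)
gt_y = y[:, None] > y[None, :]

def c_statistic(score):
    """Step-function C-statistic: P(score_i > score_j | y_i > y_j)."""
    return (score[:, None] > score[None, :])[gt_y].mean()

def smoothed_c(score, h=0.1):
    """Sigmoid-smoothed surrogate: differentiable in the score, so it
    can be optimized or penalized directly."""
    diff = (score[:, None] - score[None, :]) / h
    return (1.0 / (1.0 + np.exp(-diff)))[gt_y].mean()

c_step = c_statistic(x)
c_smooth = smoothed_c(x)
```

As the bandwidth h shrinks, the smoothed version approaches the step-function C-statistic while remaining differentiable, which is what removes the need for brute-force search.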
/statistica/J27N4/J27N431/J27N431.html
COMPUTERIZED ADAPTIVE TESTING THAT ALLOWS FOR RESPONSE REVISION: DESIGN AND ASYMPTOTIC THEORY Shiyu Wang, Georgios Fellouris and Hua-Hua Chang 1987-2010<span style='font-size=12pt;'><center>Abstract</center> In Computerized Adaptive Testing (CAT), items are selected in real time and are adjusted to the test-taker's ability. While CAT has become popular for many measurement tasks, such as educational testing and patient reported outcomes, it has been criticized for not allowing examinees to review and revise their answers. In this work, we propose a novel CAT design that preserves the efficiency of a conventional CAT, but allows test-takers to revise their previous answers at any time during the test. The proposed method relies on a polytomous Item Response model that describes the first response to each item, as well as any subsequent responses to it. Each item is selected in order to maximize the Fisher information of the model at the current ability estimate, which is given by the maximizer of a partial likelihood function. We establish the strong consistency and asymptotic normality of the final ability estimator under minimal conditions on the test-taker revision behavior. We present the findings of two simulation studies that illustrate our theoretical results, as well as the behavior of the proposed design in a realistic item pool. <p>Key words and phrases: Asymptotic normality, computerized adaptive testing, consistency, item response theory, martingale limit theory, nominal response model, sequential design.</span>
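The conventional CAT loop that the proposed design generalizes, select the item with maximal Fisher information at the current ability estimate, then re-estimate, can be sketched with a 2PL item pool (pool size, parameter ranges, and the grid-search MLE are illustrative; the revision-allowing design and its partial likelihood are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(8)

# 2PL item pool: discriminations a, difficulties b (illustrative).
n_items = 300
a = rng.uniform(0.8, 2.0, size=n_items)
b = rng.normal(0.0, 1.0, size=n_items)

def prob(theta, j):
    """Probability of a correct response to item j at ability theta."""
    return 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))

def fisher_info(theta, j):
    pj = prob(theta, j)
    return a[j] ** 2 * pj * (1.0 - pj)

true_theta, theta_hat = 0.7, 0.0
used, responses = [], []
grid = np.linspace(-4.0, 4.0, 801)
for _ in range(30):
    # Most informative unused item at the current ability estimate.
    avail = [j for j in range(n_items) if j not in used]
    j = max(avail, key=lambda k: fisher_info(theta_hat, k))
    resp = rng.random() < prob(true_theta, j)
    used.append(j)
    responses.append(resp)
    # Update the ability estimate by grid-search maximum likelihood.
    ll = sum(np.log(prob(grid, k)) if r else np.log(1.0 - prob(grid, k))
             for k, r in zip(used, responses))
    theta_hat = grid[int(np.argmax(ll))]
```

In the paper, the binary response model is replaced by a polytomous model covering first and revised responses, and the MLE above by a partial-likelihood maximizer, but the select-respond-update cycle is the same.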