Robust Score Tests for Censored Outcomes and Incomplete Covariates Leveraging High-Dimensional Auxiliary Variables

Jiahui Feng and Kin Yau Wong

doi:10.5705/ss.202024.0391

Abstract

In many cancer genomic studies, investigators are interested in testing the presence of association between a time-to-event outcome and covariates of interest. Such analyses are often complicated by missing data. When covariates of interest are missing for some subjects, it is desirable to leverage information from observed auxiliary variables, which are sometimes high-dimensional, to improve statistical power. In this paper, we consider a class of semiparametric transformation models for a potentially right-censored survival outcome and develop an association test between the outcome and a partially observed covariate. We impute the missing covariate values using high-dimensional auxiliary variables. To accommodate potential model misspecification, we combine results from multiple plausible models for the survival time to improve power. We establish the validity of the test under misspecification of the outcome model and an adaptively-selected model for the incomplete covariate. We demonstrate the validity of the proposed methods and the superiority over existing methods through extensive simulation studies and applications to major cancer genomic studies.

Key words and phrases: Imputation; Missing data; Post-selection inference; Survival analysis; Variable selection

Information

Preprint No.	SS-2024-0391
Manuscript ID	SS-2024-0391
Complete Authors	Jiahui Feng, Kin Yau Wong
Corresponding Authors	Kin Yau Wong
Emails	kin-yau.wong@polyu.edu.hk

References

Higgins et al. (2007) Src 1.48E−04 8.20E−04 4.86E−04 8.33E−04 4.62E−04
Xu et al. (2021) TAZ 1.80E−04 1.38E−03 5.51E−04 1.09E−03 4.32E−04
Gao et al. (2014)

... them using k-nearest neighbor imputation with k = 10. We perform the supremum test with q = 2 and the two transformation functions corresponding to the PH and PO models. The working model of S is selected in two steps: first, select 1000 gene expressions by the correlation-based marginal screening procedure, and then perform lasso on the selected gene expressions; the tuning parameter in lasso is selected by BIC. For comparison, we also perform the complete-case analysis and the covariate-only method described in the simulation studies under the PH and PO models.

Under a (family-wise) significance level of 0.05 and the Bonferroni correction, i.e., an individual significance level of 0.05/208 = 0.00024, three proteins are identified to be significantly associated with progression-free survival time under at least one of the five tests. All three protein expressions are more significant under the proposed method than under other methods with either outcome model. Also, the three proteins have been identified to be related to the progression of BLCA in previous studies. The p-values under all methods of the significant protein expressions and some relevant references are given in Table 1.

5.2 METABRIC

We also apply the proposed method to analyze data from the Molecular Taxonomy Of Breast Cancer International Consortium (METABRIC) study (Curtis et al., 2012) to investigate the association between gene expressions and the relapse-free survival time of breast cancer patients. The data are available through the cBioPortal for Cancer Genomics (https://www.cbioportal.org/study/summary?id=brca_metabric). The study contains data on clinical variables, gene expressions and copy number alterations (CNAs). For the analysis, we select patients with subtypes Luminal A and Luminal B as study subjects. Also, we select 1500 genes with the largest variances as the study variables. After removing subjects with missing clinical data, the sample size is 1119.
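The variance-based gene filter described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `top_variance_genes`, the dictionary data layout, and the toy values are assumptions made for the example, and the sample variance is used only because the text does not say which variance estimator was applied.

```python
from statistics import variance


def top_variance_genes(expr, k):
    """Return the names of the k genes with the largest sample variance.

    `expr` maps a gene name to its list of expression values across
    subjects (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    return ranked[:k]


# Toy data: three genes measured on four subjects.
toy = {
    "geneA": [1.0, 1.1, 0.9, 1.0],  # nearly constant
    "geneB": [0.0, 2.0, 4.0, 6.0],  # most variable
    "geneC": [1.0, 2.0, 1.0, 2.0],
}
selected = top_variance_genes(toy, k=2)
print(selected)
```

In the analysis described here, k would be 1500.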
The median follow-up time was about 119 months, and 35% of the patients were lost to follow-up before tumor progression or death. To demonstrate the proposed method, we artificially introduce 50% missingness in the gene expressions under the MAR mechanism described in the simulation studies. The covariates in X include age at diagnosis, Her2 status, and indicators of chemotherapy, hormone therapy, and radiotherapy. Her2 status is classified into loss (6.08%), neutral (77.57%) and gain (16.35%) and is represented by a single variable with values 0, 1 and 2, respectively.

In each analysis, we set the covariate of interest S to be a single gene expression and use the CNAs as auxiliary variables. For each CNA, if there exists another CNA with which it shares more than 95% of its values, then we delete it from the analysis. After deletion, the dimension of CNA is 385. We perform the supremum test with q = 2 and the two transformation functions corresponding to the PH and PO models. The auxiliary CNA variables are selected by lasso, and the tuning parameter of lasso is selected using BIC.

For comparison, we include the results under the complete-case analysis and the covariate-only method described in the simulation studies with the PH and PO models. We also perform score tests using all available gene expressions under the PH and PO models, and we refer to this as the complete-data analysis. The results of the complete-data analysis can be viewed as the gold standard since it uses all values of S.

Seven gene expressions are identified to be significantly associated with progression-free survival time at the (Bonferroni-corrected) significance level of 0.05/1500 = 3.33 × 10^−5 under the complete-data analysis with either outcome model.
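The Bonferroni thresholds quoted for the two analyses follow from dividing the family-wise level by the number of tests. The short check below reproduces both figures; the helper names `bonferroni_threshold` and `is_significant` are introduced here for illustration only and do not come from the paper.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


protein_level = bonferroni_threshold(0.05, 208)  # 208 tests in the first analysis
gene_level = bonferroni_threshold(0.05, 1500)    # 1500 genes in the METABRIC analysis

print(f"{protein_level:.5f}")  # 0.00024
print(f"{gene_level:.2e}")     # 3.33e-05
```

For instance, the Src p-value of 1.48E−04 reported above falls below the first threshold, whereas 8.20E−04 does not.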
All seven gene expressions are most significant under the complete-data analysis with the PO model, and five are more significant under the proposed method than under the complete-case analysis and the covariate-only method with either outcome model. This suggests that the proposed method is more powerful than the other two methods. The p-values under all methods of the significant gene expressions are given in Table 2. As the genes tend to show higher significance under the PO model, the results highlight the advantage of considering multiple outcome models. If we relied solely on the “default choice” of the Cox model, then we might have missed some identified associations.

6. Discussion

In this paper, we develop a score test for the presence of association between a potentially right-censored survival outcome and an incomplete covariate, where the missing values of the incomplete covariate can be imputed using high-dimensional auxiliary variables.

Acknowledgments

This research was partially supported by the Guangdong Basic and Applied Basic Research Foundation (Project No. 2021A1515110048) and the Hong Kong Research Grants Council (Grant No. 15303422).

Supplementary Materials

The online Supplementary Material provides proofs of technical results and additional simulation results.

Supplementary materials are available for download.
