Abstract
When relating genomic data to survival outcomes, three main challenges arise: censoring of the survival outcomes, the high dimensionality of the genomic data, and non-normality of the data. We propose a method that tackles these challenges simultaneously and yields robust detection of genes significantly related to survival outcomes, based on the Accelerated Failure Time (AFT) model. Specifically, we incorporate a general loss function into the AFT model, adopt regularization and shrinkage techniques, address parameter tuning and model selection, and develop an algorithm based on a unified Expectation–Maximization approach for easy implementation. Simulation results demonstrate the advantages of the proposed method over existing methods when the data have heavy-tailed errors and correlated covariates. Two real case studies on patients illustrate the application of the proposed method.
Change history
27 February 2020
The authors have retracted this article [1] because they found a fundamental mistake in the methodology that is not correctable at this time. This mistake is found in the methodology and the derivation of the model with Tukey and Huber's losses. Because of the error, the findings in the article are not reliable. All authors agree to this retraction.
Acknowledgements
The first author's research was supported by the Fundamental Research Funds for the Central Universities (No. BLX201609).
Appendix
1. The E–M algorithm iteration process:
where \(\epsilon_i = T_i - \alpha^{(m-1)} - X'_i\beta^{(m-1)}\); \(m_i = \frac{\delta_i}{n}\prod_{j<i}\left(\frac{n-j+1}{n-j}\right)^{1-\delta_j}\) is the jump of the Kaplan–Meier-type estimator of the CDF at the \(i\)th sorted \(\epsilon\); \(e_i = Y_i - \alpha^{(m-1)} - X'_i\beta^{(m-1)}\); and \(w_{ij} = \frac{m_j}{\sum_{k>i} m_k}\) for \(j>i\). By construction, \(\sum_{j>i} w_{ij} = 1\). We update \(\theta^{(m)}\) as \(\theta^{(m)} = \arg\min_\theta Q^{(m)}(\theta)\). Taking the derivatives of \(Q^{(m)}(\theta)\) with respect to \(\alpha\) and \(\beta\), we have
where \(Y^*_i = \delta_i T_i + (1-\delta_i) Y^{**}_i\), \(Y^{**}_i = \alpha^{(m-1)} + X'_i\beta^{(m-1)} + \sum_{j>i} w_{ij} e_j\) for \(i \in C\), and \(\overline{Y}^* = \frac{\sum_{i \in C} Y^{**}_i + \sum_{i \in D} T_i}{n}\). Here \(C\) and \(D\) index the censored and uncensored observations, respectively.
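The quantities \(m_i\), \(w_{ij}\), and \(Y^{**}_i\) defined above can be illustrated with a minimal Python sketch. The function names `km_masses` and `impute_censored` are hypothetical and not taken from the article. The sketch assumes the observations are already ordered by increasing residual, and it adopts one extra convention that the text does not specify: a censored response with no mass to its right is left unchanged.

```python
import numpy as np

def km_masses(delta):
    """Masses m_i for residuals sorted in increasing order.

    delta[i] is 1 for an observed event and 0 for a censored observation.
    """
    delta = np.asarray(delta, dtype=float)
    n = delta.size
    masses = np.zeros(n)
    factor = 1.0  # running product over j < i of ((n-j+1)/(n-j))^(1-delta_j)
    for i in range(n):
        masses[i] = delta[i] / n * factor
        j = i + 1  # 1-based index used in the formula
        if delta[i] == 0 and j < n:
            factor *= (n - j + 1) / (n - j)
    return masses

def impute_censored(fitted, resid, delta):
    """Return Y*: observed responses, with censored ones replaced by Y**.

    All inputs must be ordered by increasing residual.
    """
    fitted = np.asarray(fitted, dtype=float)
    resid = np.asarray(resid, dtype=float)
    delta = np.asarray(delta, dtype=int)
    m = km_masses(delta)
    y_star = fitted + resid  # equals the observed response
    for i in np.flatnonzero(delta == 0):
        tail = m[i + 1:].sum()  # denominator of w_ij
        if tail > 0:  # assumption: otherwise keep the censored value as is
            y_star[i] = fitted[i] + np.dot(m[i + 1:], resid[i + 1:]) / tail
    return y_star
```

For example, with `delta = [1, 0, 1, 1]` the masses are `[0.25, 0, 0.375, 0.375]`: the mass of the censored second observation is redistributed equally to the two later events, and the masses sum to one.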
2. Proof of Proposition 1:
where \(\epsilon_i = T_i - \hat{\alpha} - X'_i\hat{\beta}\); \(m_i = \frac{\delta_i}{n}\prod_{j<i}\left(\frac{n-j+1}{n-j}\right)^{1-\delta_j}\) is the jump of the Kaplan–Meier-type estimator of the CDF at the \(i\)th sorted \(\epsilon\); \(e_i = Y_i - \hat{\alpha} - X'_i\hat{\beta}\); and \(w_{ij} = \frac{m_j}{\sum_{k>i} m_k}\) for \(j>i\). By construction, \(\sum_{j>i} w_{ij} = 1\). The estimate of \(\theta\) is \(\arg\min_\theta Q(\theta)\) with a given penalty. When there is no penalty, \(\theta = (\alpha, \beta)'\) satisfies
where \(Y^*_i = \delta_i T_i + (1-\delta_i) Y^{**}_i\), \(Y^{**}_i = \hat{\alpha} + X'_i\hat{\beta} + \sum_{j>i} w_{ij} e_j\) for \(i \in C\), and \(\overline{Y}^* = \frac{\sum_{i \in C} Y^{**}_i + \sum_{i \in D} T_i}{n}\). Here \(C\) and \(D\) index the censored and uncensored observations, respectively.
End of proof.
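As a rough illustration of the unpenalized case, the sketch below performs one fixed-point step: it imputes the censored responses from the current residuals and then refits \(\alpha\) and \(\beta\). It assumes squared-error loss, so the refit is ordinary least squares on \(Y^*_i\); this is an assumption made for illustration, and the sketch does not cover the Huber or Tukey losses. The name `bj_step` is hypothetical.

```python
import numpy as np

def bj_step(X, y, delta, alpha, beta):
    """One unpenalized update under squared-error loss (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = y.size

    fitted = alpha + X @ beta
    resid = y - fitted
    order = np.argsort(resid, kind="stable")  # sort by current residual
    d, e = delta[order], resid[order]

    # Masses m_i of the Kaplan-Meier-type estimator on the sorted residuals.
    m = np.zeros(n)
    factor = 1.0
    for i in range(n):
        m[i] = d[i] / n * factor
        if d[i] == 0 and i + 1 < n:
            factor *= (n - i) / (n - i - 1)

    # Replace each censored residual by the weighted mean of later residuals.
    e_star = e.copy()
    for i in np.flatnonzero(d == 0):
        tail = m[i + 1:].sum()
        if tail > 0:  # assumption: keep the residual if no mass lies beyond it
            e_star[i] = m[i + 1:] @ e[i + 1:] / tail

    y_star = np.empty(n)
    y_star[order] = fitted[order] + e_star

    # Least-squares refit on centred data; the intercept follows from the means.
    x_bar = X.mean(axis=0)
    beta_new, *_ = np.linalg.lstsq(X - x_bar, y_star - y_star.mean(), rcond=None)
    alpha_new = y_star.mean() - x_bar @ beta_new
    return alpha_new, beta_new
```

Iterating `bj_step` until the coefficients stop changing gives a fixed point of this update. With no censored observations, a single step reduces to an ordinary least-squares fit.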
3. Estimated coefficients for the ovarian carcinoma data (see Table 8).
4. Estimated coefficients for the cervical squamous cell carcinoma data (see Table 9).
About this article
Cite this article
Chen, G., Wang, S., Sun, G. et al. RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates. Acta Biotheor 67, 225–251 (2019). https://doi.org/10.1007/s10441-019-09349-9