RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates

Chen, Guorong; Wang, Sijian; Sun, Guannan; Pan, Huanxue

doi:10.1007/s10441-019-09349-9

RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates

Regular Article
Published: 28 May 2019

Volume 67, pages 225–251 (2019)

Acta Biotheoretica

This article was retracted on 27 February 2020

This article has been updated

Abstract

Relating genomic data to survival outcomes poses three main challenges: the survival outcomes are censored, the genomic data are high-dimensional, and the data are non-normal. We propose a method that tackles these challenges simultaneously and yields robust estimates for detecting genes significantly related to survival outcomes, based on the Accelerated Failure Time (AFT) model. Specifically, we incorporate a general loss function into the AFT model, adopt model regularization and shrinkage techniques, address parameter tuning and model selection, and develop an algorithm based on a unified Expectation–Maximization approach for easy implementation. Simulation results demonstrate the advantages of the proposed method over existing methods when the data have heavy-tailed errors and correlated covariates. Two real case studies on patients illustrate the application of the proposed method.

Change history

27 February 2020
The authors have retracted this article [1] because they found a fundamental mistake in the methodology that is not correctable at this time. This mistake is found in the methodology and the derivation of the model with Tukey and Huber’s losses. Because of the error, the findings in the article are not reliable. All authors agree to this retraction.

Acknowledgements

The first author’s research was supported by the Fundamental Research Funds for the Central Universities (No. BLX201609).

Author information

Authors and Affiliations

Department of Finance, Beijing Forestry University, Beijing, China
Guorong Chen & Huanxue Pan
Department of Statistics and Biostatistics, Rutgers University, New Brunswick, NJ, USA
Sijian Wang
Department of Biostatistics and Programming, Sanofi China, Beijing, China
Guannan Sun

Corresponding author

Correspondence to Huanxue Pan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1. The E–M algorithm iteration process:

$$\begin{aligned} Q^{(m)}(\theta )&= \sum _{i \in C} E_{T_i}\left( (T_i-\alpha -X'_i\beta )^2 \mid \theta ^{(m-1)}, T_i> Y_i\right) + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} E_{\epsilon _i}\left( (\alpha ^{(m-1)}+ X'_i\beta ^{(m-1)}+\epsilon _i-\alpha -X'_i\beta )^2 \mid \theta ^{(m-1)}, \epsilon _i> e_i\right) \\&\quad +\sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \frac{\int _{e_i}^\infty (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\epsilon _i-\alpha -X'_i\beta )^2 f(\epsilon _i)d\epsilon _i}{1-{\hat{F}}(e_i)} \\&\quad +\sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \frac{\sum _{j>i} m_j (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2 }{\sum _{j>i} m_j }\\&\quad + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \\&=\sum _{i \in C} \sum _{j>i} w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2 \\&\quad + \sum _{i \in D} (T_i-\alpha -X'_i\beta )^2 \end{aligned}$$

(36)

where $\epsilon _i=T_i-\alpha ^{(m-1)}-X'_i\beta ^{(m-1)}$, $e_i=Y_i-\alpha ^{(m-1)}-X'_i\beta ^{(m-1)}$, $m_i=\frac{\delta _i}{n}\prod _{j<i}\left( \frac{n-j+1}{n-j}\right) ^{1-\delta _j}$ is the point mass that the Kaplan–Meier type estimator of the CDF assigns to the $i$th of the sorted residuals, and $w_{ij}=\frac{m_j}{\sum _{k>i}m_k}$ for $j>i$, so that $\sum _{j>i}w_{ij}=1$. We update $\theta$ by $\theta ^{(m)}=\arg \min _\theta Q^{(m)}(\theta )$. Setting the derivatives of $Q^{(m)}(\theta )$ with respect to $\alpha$ and $\beta$ to zero, we have

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2}{\partial \alpha }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \alpha }=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )^2}{\partial \beta }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \beta }=0 \end{array}\right. } \end{aligned}$$

(37)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta ) \\ \quad \quad \quad \quad +\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} (\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+e_j-\alpha -X'_i\beta )X_i \\ \quad \quad \quad \quad +\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(38)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )\\ \quad \quad \quad \quad +\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )X_i\\ \quad \quad \quad \quad +\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(39)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(Y^{**}_i-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(X_i Y^{**}_i-X_i\alpha -X^2_i\beta )+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(40)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i=1}^n(Y^*_i-\alpha -X'_i\beta )=0\\ \sum _{i=1}^n(X_i Y^*_i-X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(41)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \alpha =\overline{Y}^*-\beta \overline{X}\\ \beta =\frac{\sum _{i=1}^n(X_i-\overline{X})(Y^*_i-\overline{Y}^*)}{\sum _{i=1}^n(X_i-\overline{X})^2} \end{array}\right. } \end{aligned}$$

(42)

where $Y^*_i=\delta _i T_i+(1-\delta _i)Y^{**}_i$, $Y^{**}_i=\alpha ^{(m-1)}+X'_i\beta ^{(m-1)}+\sum _{j>i}w_{ij}e_j$, for $i \in C$, and $\overline{Y}^*=\frac{\sum _{i \in C}Y^{**}_i+\sum _{i \in D}T_i}{n}$.
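
The closed-form update in (42) can be sketched numerically. The Python snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it covers only the squared-loss, single-covariate case written out above; the function names `km_masses` and `em_step` are hypothetical; ties among residuals are broken by a stable sort; and when no probability mass lies above a censored residual (so that $w_{ij}$ is undefined) the residual itself is used, a convention chosen here purely for illustration.

```python
import numpy as np


def km_masses(delta):
    """Point masses m_i for residuals already sorted in increasing order.

    delta[i] is 1 for an observed event (set D) and 0 for a censored one (set C).
    With 1-based j: m_i = delta_i / n * prod_{j<i} ((n-j+1)/(n-j))^(1-delta_j).
    """
    n = len(delta)
    m = np.zeros(n)
    factor = 1.0  # running product over all j < i
    for i in range(n):  # i is 0-based, so the 1-based index is i + 1
        m[i] = delta[i] / n * factor
        if delta[i] == 0 and i + 1 < n:
            factor *= (n - i) / (n - i - 1)
    return m


def em_step(y, x, delta, alpha_prev, beta_prev):
    """One squared-loss update (alpha, beta) for a scalar covariate, as in (42)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)

    e = y - alpha_prev - beta_prev * x           # residuals e_i
    order = np.argsort(e, kind="stable")         # sort by residual
    e_s, d_s = e[order], delta[order]
    m = km_masses(d_s)

    # tail_m[i] = sum_{j>i} m_j and tail_me[i] = sum_{j>i} m_j e_j
    tail_m = np.concatenate((np.cumsum(m[::-1])[::-1][1:], [0.0]))
    tail_me = np.concatenate((np.cumsum((m * e_s)[::-1])[::-1][1:], [0.0]))

    # sum_{j>i} w_ij e_j; fall back to e_i when no mass lies above (assumption)
    has_mass = tail_m > 0
    cond = np.where(has_mass, tail_me / np.where(has_mass, tail_m, 1.0), e_s)

    # Y*_i: observed value for events, imputed value Y**_i for censored cases
    fitted = alpha_prev + beta_prev * x[order]
    y_star_sorted = np.where(d_s == 1, y[order], fitted + cond)
    y_star = np.empty_like(y_star_sorted)
    y_star[order] = y_star_sorted                # undo the sort

    x_bar, y_bar = x.mean(), y_star.mean()
    beta = np.sum((x - x_bar) * (y_star - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha = y_bar - beta * x_bar
    return alpha, beta, y_star
```

For example, with `y = [1, 3, 4, 7]`, `x = [0, 1, 2, 3]`, `delta = [1, 1, 0, 1]` and starting values `alpha_prev = 1`, `beta_prev = 2`, the censored third response is imputed as 5 and the update returns `alpha = 1`, `beta = 2`.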

2. Proof of Proposition 1:

$$\begin{aligned} Q(\theta )&=\sum _{i \in C} E_{T_i}\left( L(T_i, \theta ) \mid \theta , T_i> Y_i\right) + \sum _{i \in D} L(T_i, \theta ) \\&= \sum _{i \in C} E_{T_i}\left( (T_i-\alpha -X'_i\beta )^2 \mid \theta , T_i> Y_i\right) + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} E_{\epsilon _i}\left( ({\hat{\alpha }}+X'_i{\hat{\beta }}+\epsilon _i-\alpha -X'_i\beta )^2 \mid \theta , \epsilon _i> e_i\right) + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \frac{\int _{e_i}^\infty ({\hat{\alpha }}+X'_i{\hat{\beta }}+\epsilon _i-\alpha -X'_i\beta )^2 f(\epsilon _i)d\epsilon _i}{1-{\hat{F}}(e_i)} + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \frac{\sum _{j>i} m_j ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2 }{\sum _{j>i} m_j } + \sum _{i \in D} L(T_i, \theta ) \\&=\sum _{i \in C} \sum _{j>i} w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2 + \sum _{i \in D} L(T_i, \theta ) \end{aligned}$$

(43)

where $\epsilon _i=T_i-{\hat{\alpha }}-X'_i{\hat{\beta }}$, $e_i=Y_i-{\hat{\alpha }}-X'_i{\hat{\beta }}$, $m_i=\frac{\delta _i}{n}\prod _{j<i}\left( \frac{n-j+1}{n-j}\right) ^{1-\delta _j}$ is the point mass that the Kaplan–Meier type estimator of the CDF assigns to the $i$th of the sorted residuals, and $w_{ij}=\frac{m_j}{\sum _{k>i}m_k}$ for $j>i$, so that $\sum _{j>i}w_{ij}=1$. The estimate of $\theta$ is $\arg \min _\theta Q(\theta )$ subject to a chosen penalty. When there is no penalty, $\theta =(\alpha , \beta )'$ satisfies

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2}{\partial \alpha }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \alpha }=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} \frac{\partial ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )^2}{\partial \beta }+\sum _{i \in D}\frac{\partial (T_i-\alpha -X'_i\beta )^2}{\partial \beta }=0 \end{array}\right. } \end{aligned}$$

(44)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}\sum _{j>i}w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}\sum _{j>i}w_{ij} ({\hat{\alpha }}+X'_i{\hat{\beta }}+e_j-\alpha -X'_i\beta )X_i+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(45)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}({\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}({\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j-\alpha -X'_i\beta )X_i+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(46)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i \in C}(Y^{**}_i-\alpha -X'_i\beta )+\sum _{i \in D}(T_i-\alpha -X'_i\beta )=0\\ \sum _{i \in C}(X_i Y^{**}_i-X_i\alpha -X^2_i\beta )+\sum _{i \in D}( X_i T_i - X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(47)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \sum _{i=1}^n(Y^*_i-\alpha -X'_i\beta )=0\\ \sum _{i=1}^n(X_i Y^*_i-X_i\alpha -X^2_i\beta )=0 \end{array}\right. } \end{aligned}$$

(48)

$$\begin{aligned}&{\left\{ \begin{array}{ll} \alpha =\overline{Y}^*-\beta \overline{X}\\ \beta =\frac{\sum _{i=1}^n(X_i-\overline{X})(Y^*_i-\overline{Y}^*)}{\sum _{i=1}^n(X_i-\overline{X})^2} \end{array}\right. } \end{aligned}$$

(49)

where $Y^*_i=\delta _i T_i+(1-\delta _i)Y^{**}_i$, $Y^{**}_i={\hat{\alpha }}+X'_i{\hat{\beta }}+\sum _{j>i}w_{ij}e_j$, for $i \in C$, and $\overline{Y}^*=\frac{\sum _{i \in C}Y^{**}_i+\sum _{i \in D}T_i}{n}$.

End of proof.
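
The chain (43)–(49) states that, without a penalty, minimizing the weighted quadratic $Q(\theta )$ gives the same solution as ordinary least squares on the imputed responses $Y^*_i$. The Python check below is a sketch on simulated data, assuming a scalar covariate and arbitrary illustrative values for $\hat{\alpha }$, $\hat{\beta }$ and the censoring pattern; `km_masses` is a hypothetical helper name. It verifies only the algebra of this step.

```python
import numpy as np


def km_masses(delta):
    """Kaplan-Meier type point masses m_i for residuals sorted in increasing order."""
    n = len(delta)
    m, factor = np.zeros(n), 1.0
    for i in range(n):
        m[i] = delta[i] / n * factor
        if delta[i] == 0 and i + 1 < n:
            factor *= (n - i) / (n - i - 1)
    return m


rng = np.random.default_rng(0)
n, alpha_hat, beta_hat = 8, 0.5, -1.0          # arbitrary current estimates
x = rng.normal(size=n)
y = 0.3 + 0.7 * x + rng.normal(size=n)

# Sort observations by residual so that index order matches the derivation.
order = np.argsort(y - alpha_hat - beta_hat * x)
x, y = x[order], y[order]
e = y - alpha_hat - beta_hat * x
delta = np.array([1, 0, 1, 0, 0, 1, 1, 1])     # arbitrary pattern, last one observed
m = km_masses(delta)

# Build Q(theta) as an explicit weighted least-squares problem.
rows, targets, weights = [], [], []
y_star = y.copy()
for i in range(n):
    if delta[i] == 1:                          # i in D: one row with weight 1
        rows.append([1.0, x[i]])
        targets.append(y[i])
        weights.append(1.0)
        continue
    w = m[i + 1:] / m[i + 1:].sum()            # w_ij for j > i, sums to 1
    fitted = alpha_hat + beta_hat * x[i]
    for j, w_ij in zip(range(i + 1, n), w):    # i in C: one row per j > i
        rows.append([1.0, x[i]])
        targets.append(fitted + e[j])
        weights.append(w_ij)
    y_star[i] = fitted + w @ e[i + 1:]         # imputed response Y**_i

A, b, sw = np.array(rows), np.array(targets), np.sqrt(weights)
theta_q = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

# Ordinary least squares on the imputed responses Y*.
design = np.column_stack([np.ones(n), x])
theta_ols = np.linalg.lstsq(design, y_star, rcond=None)[0]

print(np.allclose(theta_q, theta_ols))
```

The two solutions agree because, for each censored $i$, the weights $w_{ij}$ sum to one, so the weighted normal equations depend on the terms $e_j$ only through $\sum _{j>i}w_{ij}e_j$.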

3. Estimated coefficients for the ovarian carcinoma data (see Table 8).

4. Estimated coefficients for the cervical squamous cell carcinoma data (see Table 9).

Table 8 Estimated coefficients with four types of loss function for screened 100 genes based on 134 training data

Table 9 Estimated coefficients with four types of loss function for screened 20 genes based on 59 training data

About this article

Cite this article

Chen, G., Wang, S., Sun, G. et al. RETRACTED ARTICLE: Robust Model Selection and Estimation for Censored Survival Data with High Dimensional Genomic Covariates. Acta Biotheor 67, 225–251 (2019). https://doi.org/10.1007/s10441-019-09349-9

Received: 01 August 2018
Accepted: 16 May 2019
Published: 28 May 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10441-019-09349-9
