Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data

  • Original Paper
  • Published in: Statistics and Computing

Abstract

Massive survival data are increasingly common in many research fields, and subsampling is a practical strategy for analyzing such data. Although optimal subsampling strategies have been developed for Cox models, little has been done for semi-parametric accelerated failure time (AFT) models because the estimating functions for the regression coefficients are non-smooth. We develop optimal subsampling algorithms for fitting semi-parametric AFT models using the least-squares approach. By efficiently estimating the slope matrix of the non-smooth estimating functions with a resampling approach, we construct optimal subsampling probabilities for the observations. For feasible point and interval estimation of the unknown coefficients, we propose a two-step method that draws multiple subsamples in the second stage to correct for overestimation of the variance in scenarios with higher censoring rates. We validate the performance of our estimators through a simulation study that compares single and multiple subsampling methods, and we apply the methods to analyze the survival time of lymphoma patients in the Surveillance, Epidemiology, and End Results program.
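
The two-step idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes uncensored responses and a single covariate, so it uses ordinary weighted least squares rather than the censored-data least-squares estimator, and it draws only one second-stage subsample. The score used for the sampling probabilities (absolute pilot residual times the norm of the design row), the defensive mixing weight `alpha`, and the function names `weighted_line_fit` and `two_step_subsample_fit` are all assumptions made for illustration.

```python
import math
import random


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def two_step_subsample_fit(xs, ys, r0, r, alpha=0.2, seed=0):
    """Illustrative two-step subsampling fit (uncensored data only).

    Step 1: fit a pilot estimate on a uniform subsample of size r0.
    Step 2: draw r points with probabilities driven by the pilot fit and
            refit with inverse-probability weights.
    Returns ((a, b), probs).
    """
    rng = random.Random(seed)
    n = len(xs)

    # Step 1: uniform pilot subsample, drawn with replacement.
    pilot = [rng.randrange(n) for _ in range(r0)]
    a0, b0 = weighted_line_fit(
        [xs[i] for i in pilot], [ys[i] for i in pilot], [1.0] * r0
    )

    # Assumed score: |pilot residual| * norm of the design row (1, x).
    scores = [
        abs(ys[i] - a0 - b0 * xs[i]) * math.sqrt(1.0 + xs[i] ** 2)
        for i in range(n)
    ]
    total = sum(scores)
    if total == 0.0:
        # Pilot fit is exact: fall back to uniform probabilities.
        probs = [1.0 / n] * n
    else:
        # Mix with the uniform distribution so no probability is near zero.
        probs = [(1.0 - alpha) * s / total + alpha / n for s in scores]

    # Step 2: sample with replacement and weight each draw by 1 / probability.
    idx = rng.choices(range(n), weights=probs, k=r)
    ws = [1.0 / probs[i] for i in idx]
    fit = weighted_line_fit([xs[i] for i in idx], [ys[i] for i in idx], ws)
    return fit, probs
```

Calling `two_step_subsample_fit(xs, ys, r0=500, r=2000)` on two equal-length lists returns the fitted intercept and slope together with the sampling probabilities. The inverse-probability weights are what keep the subsample estimate targeted at the full-data least-squares fit even though points are drawn unevenly.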

Acknowledgements

Wang’s research was supported by NSF grant CCF 2105571 and UConn CLAS Research Funding in Academic Themes.

Author information

Corresponding authors

Correspondence to Zehan Yang or HaiYing Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Z., Wang, H. & Yan, J. Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Stat Comput 34, 77 (2024). https://doi.org/10.1007/s11222-024-10391-y
