Abstract
Massive survival data are increasingly common in many research fields, and subsampling is a practical strategy for analyzing such data. Although optimal subsampling strategies have been developed for Cox models, little has been done for semi-parametric accelerated failure time (AFT) models because the estimating functions for the regression coefficients are non-smooth. We develop optimal subsampling algorithms for fitting semi-parametric AFT models using the least-squares approach. By efficiently estimating the slope matrix of the non-smooth estimating functions with a resampling approach, we construct optimal subsampling probabilities for the observations. For feasible point and interval estimation of the unknown coefficients, we propose a two-step method that draws multiple subsamples in the second stage to correct for overestimation of the variance when the censoring rate is high. We validate the performance of our estimators through a simulation study that compares single and multiple subsampling methods, and we apply the methods to analyze the survival time of lymphoma patients in the Surveillance, Epidemiology, and End Results program.
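The two-step idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes uncensored responses and swaps in ordinary least squares for the semi-parametric AFT least-squares estimator, and the function name `two_step_subsample_ols`, the sampling probabilities (proportional to the norm of each observation's pilot score contribution), and the averaging of repeated second-stage estimates are all assumptions made for illustration.

```python
import numpy as np


def two_step_subsample_ols(X, y, r0, r, n_rep, rng):
    """Hypothetical two-step subsampling sketch for ordinary least squares.

    Stand-in for the AFT setting: no censoring is handled here.
    """
    n = X.shape[0]

    # Step 1: uniform pilot subsample (with replacement) and pilot estimate.
    pilot_idx = rng.integers(0, n, size=r0)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)

    # Illustrative sampling probabilities: proportional to |residual| * ||x_i||,
    # the norm of each observation's least-squares score at the pilot estimate.
    resid = y - X @ beta_pilot
    score_norm = np.abs(resid) * np.linalg.norm(X, axis=1)
    probs = score_norm / score_norm.sum()

    # Step 2: repeat the second-stage draw n_rep times; each fit is weighted
    # by the inverse sampling probabilities to offset the unequal sampling.
    estimates = []
    for _ in range(n_rep):
        idx = rng.choice(n, size=r, replace=True, p=probs)
        sw = np.sqrt(1.0 / probs[idx])
        beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
        estimates.append(beta)

    # Assumption: combine the repeated subsample estimates by simple averaging.
    return np.mean(estimates, axis=0)


# Small synthetic demonstration (simulated data, not from the paper).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
beta_hat = two_step_subsample_ols(X, y, r0=500, r=2000, n_rep=5, rng=rng)
print(beta_hat)
```

Weighting each sampled observation by the inverse of its sampling probability is the usual device for offsetting non-uniform sampling; a censored-data version would replace both least-squares fits with the AFT estimating equations.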
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11222-024-10391-y/MediaObjects/11222_2024_10391_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11222-024-10391-y/MediaObjects/11222_2024_10391_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11222-024-10391-y/MediaObjects/11222_2024_10391_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11222-024-10391-y/MediaObjects/11222_2024_10391_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11222-024-10391-y/MediaObjects/11222_2024_10391_Fig5_HTML.png)
Acknowledgements
Wang’s research was supported by NSF grant CCF 2105571 and UConn CLAS Research Funding in Academic Themes.
About this article
Cite this article
Yang, Z., Wang, H. & Yan, J. Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Stat Comput 34, 77 (2024). https://doi.org/10.1007/s11222-024-10391-y
DOI: https://doi.org/10.1007/s11222-024-10391-y