
Quantile forward regression for high-dimensional survival data

Published in: Lifetime Data Analysis

Abstract

Despite the urgent need for an effective prediction model tailored to individual interests, existing models have mainly been developed for the mean outcome, targeting average people. Additionally, the direction and magnitude of covariates’ effects on the mean outcome may not hold across different quantiles of the outcome distribution. To accommodate the heterogeneous characteristics of covariates and provide a flexible risk model, we propose a quantile forward regression model for high-dimensional survival data. Our method selects variables by maximizing the likelihood of the asymmetric Laplace distribution (ALD) and derives the final model based on the extended Bayesian Information Criterion (EBIC). We demonstrate that the proposed method enjoys a sure screening property and selection consistency. We apply it to the national health survey dataset to show the advantages of a quantile-specific prediction model. Finally, we discuss potential extensions of our approach, including the nonlinear model and the globally concerned quantile regression coefficients model.
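
To make the procedure concrete, the following is a minimal, self-contained sketch (in Python) of the kind of forward selection described above: observations are weighted by \(\varDelta _i/\hat{G}(Y_i)\), where \(\hat{G}\) is the Kaplan-Meier estimate of the censoring survival function, each step adds the covariate that most reduces the weighted check loss, and selection stops when an EBIC-type criterion stops improving. This is not the authors' implementation: the function names, the simplified Kaplan-Meier step, and the exact form of the stopping criterion (log-loss plus a \((\log n)(\log p)/(2n)\) per-variable penalty) are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' implementation): forward selection
# for censored quantile regression with IPCW weights and an assumed EBIC-type
# stopping rule. Requires numpy and scipy.
import numpy as np
from scipy.optimize import linprog


def km_censoring_survival(y, delta):
    """Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t),
    evaluated at each observed time y_i (censoring treated as the event;
    ties are ignored for simplicity)."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = len(y)
    order = np.argsort(y)
    cens = 1.0 - delta[order]                 # 1 if the i-th ordered time is censored
    at_risk = n - np.arange(n)                # risk-set sizes n, n-1, ..., 1
    surv = np.cumprod(1.0 - cens / at_risk)   # product-limit estimate at ordered times
    G = np.empty(n)
    G[order] = surv
    return np.clip(G, 1e-3, None)             # guard against near-zero weights


def weighted_check_loss_fit(X, y, w, tau):
    """Minimize (1/n) * sum_i w_i * rho_tau(y_i - x_i'beta) via its LP formulation."""
    n, p = X.shape
    # decision variables: beta_plus (p), beta_minus (p), u_plus (n), u_minus (n)
    c = np.concatenate([np.zeros(2 * p), tau * w, (1.0 - tau) * w])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    beta = res.x[:p] - res.x[p:2 * p]
    return beta, res.fun / n


def quantile_forward_regression(X, y, delta, tau, max_steps=None):
    """Greedy forward selection: at each step add the covariate that most reduces
    the IPCW-weighted check loss; stop when an EBIC-type criterion (assumed form:
    log-loss + |S| * (log n)(log p)/(2n)) no longer decreases.
    Returns the indices of the selected columns of X."""
    X, y, delta = (np.asarray(a, dtype=float) for a in (X, y, delta))
    n, p = X.shape
    w = delta / km_censoring_survival(y, delta)     # weights Delta_i / G_hat(Y_i)
    Z = np.hstack([np.ones((n, 1)), X])             # column 0: intercept, always kept
    penalty = np.log(n) * np.log(p) / (2.0 * n)

    def crit(loss, size):
        return np.log(max(loss, 1e-12)) + size * penalty

    active = [0]
    _, loss = weighted_check_loss_fit(Z[:, active], y, w, tau)
    best = crit(loss, len(active))
    max_steps = max_steps or min(p, n // 2)
    for _ in range(max_steps):
        candidates = [j for j in range(1, p + 1) if j not in active]
        if not candidates:
            break
        losses = {j: weighted_check_loss_fit(Z[:, active + [j]], y, w, tau)[1]
                  for j in candidates}
        j_star = min(losses, key=losses.get)        # largest drop in check loss
        new_crit = crit(losses[j_star], len(active) + 1)
        if new_crit >= best:                        # criterion stops improving
            break
        active.append(j_star)
        best = new_crit
    return sorted(j - 1 for j in active if j > 0)
```

For instance, with X an n-by-p numpy array, y the observed follow-up times, and delta the event indicators, quantile_forward_regression(X, y, delta, tau=0.5) returns the indices of the covariates selected at the median.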

References

  • Bang H, Tsiatis AA (2002) Median regression with censored cost data. Biometrics 58(3):643–649

  • Barut E, Fan J, Verhasselt A (2016) Conditional sure independence screening. J Am Stat Assoc 111(515):1266–1277

  • Belloni A, Chernozhukov V (2011) \(\ell _1\)-penalized quantile regression in high-dimensional sparse models. Ann Stat 39(1):82–130

  • Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771

  • Cheng MY, Honda T, Zhang JT (2016) Forward variable selection for sparse ultra-high dimensional varying coefficient models. J Am Stat Assoc 111(515):1209–1221

  • Eli S, Tangvik RJ, Nymo LS, Harthug S, Lassen K, Viste A (2020) Weight loss and BMI criteria in GLIM’s definition of malnutrition is associated with postoperative complications following abdominal resections - results from a national quality registry. Clin Nutr 39(5):1593–1599

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J Royal Stat Soc: Series B (Stat Methodol) 70(5):849–911

  • Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101–148

  • Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604

  • Fard NA, Morales GDF, Mejova Y, Schifanella R (2021) On the interplay between educational attainment and nutrition: a spatially-aware perspective. EPJ Data Sci 10(1):18

  • Flegal KM, Kit BK, Orpana H, Graubard BI (2013) Association of all-cause mortality with overweight and obesity using standard body mass index categories: a systematic review and meta-analysis. JAMA 309(1):71–82

  • Gearhardt AN, Corbin WR (2009) Body mass index and alcohol consumption: family history of alcoholism as a moderator. Psychol Addict Behav 23(2):216–225

  • He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369

  • Honda T, Lin C (2022) Forward variable selection for ultra-high dimensional quantile regression models. Ann Instit Stat Math 1–32

  • Hong HG, Kang J, Li Y (2018) Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal 24(1):45–71

  • Hong HG, Christiani DC, Li Y (2019) Quantile regression for survival data in modern cancer research: expanding statistical tools for precision medicine. Precis Clin Med 2(2):90–99

  • Hong HG, Zheng Q, Li Y (2019) Forward regression for Cox models with high-dimensional covariates. J Multivar Anal 173:268–290

  • Hwang WY, Zhang HH, Ghosal S (2009) First: combining forward iterative selection and shrinkage in high dimensional sparse linear regression. Stat Interface 2:341–348

  • Karavasiloglou N, Pestoni G, Wanner M, Faeh D, Rohrmann S (2019) Healthy lifestyle is inversely associated with mortality in cancer survivors: results from the third national health and nutrition examination survey (NHANES III). PLOS ONE 14(6):1–11

  • Kleiner KD, Gold MS, Frostpineda K, Lenzbrunsman B, Perri MG, Jacobs WS (2004) Body mass index and alcohol use. J Addict Dis 23(3):105–118

  • Knight K (1998) Limiting distributions for \(l_1\) regression estimators under general conditions. Ann Stat 26(2):755–770

  • Koenker R, Machado JAF (1999) Goodness of fit and related inference processes for quantile regression. J Am Stat Assoc 94:1296–1310

  • Kong Y, Li Y, Zerom D (2019) Screening and selection for quantile regression using an alternative measure of variable importance. J Multiv Anal 173:435–455

  • Ledoux M, Talagrand M (1991) Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York

  • Lee ER, Noh H, Park BU (2014) Model selection via Bayesian information criterion for quantile regression models. J Am Stat Assoc 109:216–229

  • Leng C, Tong X (2013) A quantile regression estimator for censored data. Bernoulli 19(1):344–361, http://www.jstor.org/stable/23525643

  • Liu J, Zhong W, Li R (2015) A selective overview of feature screening for ultrahigh-dimensional data. Sci China Math 58:20–33

  • Luo S, Chen Z (2014) Sequential lasso cum EBIC for feature selection with ultra-high dimensional feature space. J Am Stat Assoc 109:1229–1240

  • Ma S, Li R, Tsai CL (2017) Variable screening via quantile partial correlation. J Am Stat Assoc 112:650–663

  • Must A, Spadano J, Coakley EH, Field AE, Colditz G, Dietz WH (1999) The disease burden associated with overweight and obesity. JAMA 282(16):1523–1529

  • Park S, He X (2017) Hypothesis testing for regional quantiles. J Stat Plan Inference 191:13–24

  • Park S, Lee ER (2021) Hypothesis testing of varying coefficients for regional quantiles. Comput Stat Data Anal 159:107204

  • Park S, Lee ER, Zhao H (2022) Low-rank regression models for multiple binary responses and their applications to cancer cell-line encyclopedia data. J Am Stat Assoc. https://doi.org/10.1080/01621459.2022.2105704

  • Peng L (2021) Quantile regression for survival data. Annu Rev Stat Appl 8(1):413–437

  • Pijyan A, Zheng Q, Hong HG, Li Y (2020) Consistent estimation of generalized linear models with high dimensional predictors via stepwise regression. Entropy 22(9):965

  • Radchenko P, James GM (2011) Improved variable selection with forward-lasso adaptive shrinkage. Ann Appl Stat 5:427–448

  • Sherwood B, Wang L (2016) Partially linear additive quantile regression in ultra-high dimension. Ann Stat 44:288–317

  • Sluik D, Brouwer-Brolsma EM, Berendsen AAM, Mikkilä V, Poppitt SD, Silvestre MP, Tremblay A, Pérusse L, Bouchard C, Raben A, Feskens EJM (2019) Protein intake and the incidence of pre-diabetes and diabetes in 4 population-based studies: the preview project. Am J Clin Nutrit 109(5):1310–1318

  • Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 28:385–395

  • van der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics, Springer, New York

  • Wang H (2009) Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc 104(488):1512–1524

  • Wang L, Wu Y, Li R (2012) Quantile regression for analyzing heterogeneity in ultra-high dimension. J Am Stat Assoc 107:214–222

  • Yu K, Moyeed RA (2001) Bayesian quantile regression. Stat Prob Lett 54:437–447

  • Zhang CH, Huang J (2008) The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat 36:1567–1594

  • Zheng Q, Peng L, He X (2015) Globally adaptive quantile regression with ultra-high dimensional data. Ann Stat 43:2225–2258

  • Zheng Q, Hong HG, Li Y (2020) Building generalized linear models with ultrahigh dimensional features: a sequentially conditional approach. Biometrics 76(1):47–60

Author information

Corresponding author

Correspondence to Hyokyoung G. Hong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 146 KB)

Appendix 1 (Theoretical Details)

Throughout the proof, for any set \(S \subseteq \{1,\ldots ,p\}\), let

$$\begin{aligned} \ell _S(\beta _S) := \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{\hat{G}_i} \rho _\tau (Y_i^C-{{\varvec{X}}}_{iS}^\top \beta _S), \quad \ell ^o_S(\beta _S) := \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} \rho _\tau (Y_i^C-{{\varvec{X}}}_{iS}^\top \beta _S), \end{aligned}$$

where \(G^o_i := G(Y_i^C)\) is the survival probability of the censoring variable evaluated at \(Y_i^C\), and \(\beta ^*_S := \mathop {\mathrm {arg\,min}}\limits _{\beta _S} E[\ell _S(\beta _S)]\). Given a random sample \(Z_1,\ldots , Z_n\) and a function \(f(\cdot )\), let \(G_n(f) := n^{-1/2} \sum _{i=1}^n (f(Z_i)-E[f(Z_i)])\).
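
For readers' convenience, we also recall the standard quantile-regression notation used repeatedly below (these are the usual definitions, consistent with how \(\rho _\tau\) and \(\psi _\tau\) appear in the displays that follow):

$$\begin{aligned} \rho _\tau (u) := u\left( \tau - 1\{u < 0\}\right) = \tau u_+ + (1-\tau ) u_-, \qquad \psi _\tau (u) := \tau - 1\{u < 0\}, \end{aligned}$$

where \(u_+ := \max (u,0)\) and \(u_- := \max (-u,0)\), so that \(\psi _\tau\) is a subgradient of \(\rho _\tau\).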

We first introduce several useful lemmas and then prove the main theorems.

1.1 Lemmas and proofs

The following Lemma 1 is given in Lemma 8.4 of He et al. (2013).

Lemma 1

Suppose Assumptions 5 and 6 hold. Then the Kaplan-Meier estimator \({\hat{G}}(t)\) satisfies

$$\begin{aligned} \sup _{0 \le t \le T} |{\hat{G}}(t)- G(t)|= & {} O\left( n^{-1/2} (\log n)^{1/2}\right) \quad \text{ almost } \text{ surely }.\\ \sup _{0 \le t \le T} |{\hat{G}}(t)^{-1}- G(t)^{-1}|= & {} O\left( n^{-1/2} (\log n)^{1/2}\right) \quad \text{ almost } \text{ surely }. \end{aligned}$$

The following lemmas will be used to prove the main theorems.

Lemma 2

Suppose Assumptions 3, 5, and 6 hold. Given an index set S, let \(B_S^o(d_1) := \{\beta _S:\ \Vert \beta _S-\tilde{\beta }_S\Vert \le d_1/\sqrt{s}\}\). Then, we have

$$\begin{aligned} \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \left| \ell ^o_S(\beta _S) - \ell ^o_S(\tilde{\beta }_S) - \left( \ell _S(\beta _S) - \ell _S(\tilde{\beta }_S)\right) \right| = O\left( d_1 s^{-1/2} n^{-1/2} (\log n)^{1/2}\right) . \end{aligned}$$

Proof

We have

$$\begin{aligned}&\sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \left| \ell ^o_S(\beta _S) - \ell ^o_S(\tilde{\beta }_S) - \left( \ell _S(\beta _S) - \ell _S(\tilde{\beta }_S)\right) \right| \\&=\sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \left| \frac{1}{n} \sum _{i=1}^n \left( \frac{\varDelta _i}{G^o_i} - \frac{\varDelta _i}{{\hat{G}}_i} \right) \left\{ \rho _\tau (Y_i^C-{{\varvec{X}}}_{iS}^\top \beta _S) - \rho _\tau (Y_i^C-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S)\right\} \right| \\&\le O\left( n^{-1/2} (\log n)^{1/2}\right) \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \frac{1}{n} \sum _{i=1}^n \left| {{\varvec{X}}}_{iS}^\top (\beta _S - \tilde{\beta }_S)\right| \\&\le O\left( n^{-1/2} (\log n)^{1/2}\right) \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \sqrt{ \sum _{i=1}^n \left| {{\varvec{X}}}_{iS}^\top (\beta _S - \tilde{\beta }_S)\right| ^2/n } \\&\le O\left( n^{-1/2} (\log n)^{1/2}\right) \lambda _{M}^{1/2} \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } \Vert \beta _S - \tilde{\beta }_S\Vert \\&= O\left( d_1 s^{-1/2} n^{-1/2} (\log n)^{1/2}\right) , \end{aligned}$$

where the first inequality follows from Lemma 1 and the fact that \(|\rho _\tau (a)-\rho _\tau (b)| \le |a-b|\) for any numbers a and b, the second inequality follows from the Cauchy-Schwarz inequality, and the third inequality follows from Assumption 3. This completes the proof. \(\square\)
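
For completeness, the Lipschitz bound on the check function used in the first inequality is a standard fact: since \(\rho _\tau\) is absolutely continuous with derivative \(\psi _\tau (u) = \tau - 1\{u < 0\}\) almost everywhere,

$$\begin{aligned} |\rho _\tau (a)-\rho _\tau (b)| = \left| \int _b^a \psi _\tau (u)\, du \right| \le \max (\tau , 1-\tau )\, |a-b| \le |a-b| \end{aligned}$$

for any numbers a and b.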

Lemma 3

Suppose the conditions of Theorem 1 hold. Given an index set S, let \(B_S^o(d_1) := \{\beta _S:\ \Vert \beta _S-\tilde{\beta }_S\Vert \le d_1/\sqrt{s}\}\). Then, with probability at least \(1-6\exp (-6\rho \log p)\),

$$\begin{aligned} \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } |\ell _S(\beta _S) - \ell _S(\tilde{\beta }_S) - E [\ell _S(\beta _S)] + E[\ell _S(\tilde{\beta }_S)]| \le 2A_3 d_1 \sqrt{\rho \log p/n}, \end{aligned}$$

where \(A_3= 4\sqrt{\lambda _{M}} \tau _0^{-1}\).

Proof

Note that

$$\begin{aligned} \ell ^o_S(\beta _S) - \ell ^o_S(\tilde{\beta }_S) - E [\ell ^o_S(\beta _S)] + E[\ell ^o_S(\tilde{\beta }_S)] = \frac{1}{\sqrt{n}} G_n\left( \frac{\varDelta _i}{G^o_i}\rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) -\frac{\varDelta _i}{G^o_i} \rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S)\right) . \end{aligned}$$

Fix S with \(s=|S| \le \rho\). Let \(t:= d_1/\sqrt{s}\). Let

$$\begin{aligned} A(t) := \sup _{\beta _S \in B^o_S(d_1)} \left| G_n\left( \frac{\varDelta _i}{G^o_i}\rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) -\frac{\varDelta _i}{G^o_i} \rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S)\right) \right| . \end{aligned}$$

For any \(\beta _S \in B^o_S(d_1)\), we have by Assumption 5,

$$\begin{aligned}&\text{ var }\left( G_n\left( \frac{\varDelta _i}{G^o_i}\rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) -\frac{\varDelta _i}{G^o_i} \rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S)\right) \right) \\&\le \tau _0^{-2} E[|{{\varvec{X}}}_{iS}^\top (\beta _S-\tilde{\beta }_S)|^2] \le \lambda _{M} \tau _0^{-2} t^2. \end{aligned}$$

By the symmetrization lemma for probabilities, we obtain that, for \(M > 2\sqrt{\lambda _{M}} \tau _0^{-1} t\),

$$\begin{aligned} P(A(t) \ge M) \le {2P(A^o(t) \ge M/4)}/ \left\{ 1- 4\lambda _{M} \tau _0^{-2} t^2 / M^2\right\} , \end{aligned}$$

where \(A^o(t)\) is the symmetrized version of A(t), i.e.,

$$\begin{aligned} A^o(t) := \sup _{\beta _S \in B^0_S(t)}\left| G_n\left( \frac{V_i\varDelta _i}{G^o_i}\rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) -\frac{V_i\varDelta _i}{G^o_i} \rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S)\right) \right| , \end{aligned}$$

where \(V_1,\ldots , V_n\) are i.i.d. Rademacher random variables. We can write

$$\begin{aligned} \frac{\varDelta _i}{G^o_i}\rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) -\frac{\varDelta _i}{G^o_i} \rho _\tau (Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S) = -\frac{\varDelta _i}{G^o_i}\tau {{\varvec{X}}}_{iS}^\top (\beta _S-\tilde{\beta }_S) + \frac{\varDelta _i}{G^o_i} W_i(\tau , \delta _S), \end{aligned}$$

where \(W_i(\tau , \delta _S) := [Y_i - {{\varvec{X}}}_{iS}^\top (\tilde{\beta }_S + \delta _S)]_- - [Y_i-{{\varvec{X}}}_{iS}^\top \tilde{\beta }_S]_-\), \(\delta _S := \beta _S- \tilde{\beta }_S\), and \(u_- := 1\{u < 0\} |u|\). Then we have \(A^o(t) \le B^o(t) + C^o(t)\), where

$$\begin{aligned} B^o(t) = \sup _{\Vert \delta _S\Vert \le t} \,\left| G_n \left[ \frac{V_i\varDelta _i}{G^o_i} {{\varvec{X}}}_{iS}^\top \delta _S\right] \right| ,\quad C^o(t) = \sup _{\Vert \delta _S\Vert \le t}\, \left| G_n \left[ \frac{V_i\varDelta _i}{G^o_i} W_i(\tau , \delta _S)\right] \right| . \end{aligned}$$

We have for \(\lambda >0\),

$$\begin{aligned} E[\exp (\lambda B^o(t))]= & {} E\left[ E\left[ \exp \left( \lambda \sup _{\Vert \delta _S\Vert \le t} \, \left| G_n \left[ \frac{V_i \varDelta _i}{G^o_i} {{\varvec{X}}}_{iS}^\top \delta _S\right] \right| \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] \right] \nonumber \\\le & {} E\left[ E\left[ \exp \left( \lambda \sqrt{s} t \max _{j \in S} \left| G_n \left[ \frac{V_i \varDelta _i}{G^o_i} x_{ij}\right] \right| \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] \right] \nonumber \\\le & {} 2 E\left[ E\left[ \exp \left( \lambda \sqrt{s} t \max _{j \in S} G_n \left[ \frac{V_i \varDelta _i}{G^o_i} x_{ij}\right] \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] \right] \nonumber \\\le & {} 2s E\left[ \exp \left( \lambda ^2 s t^2 \frac{\sum _i x_{ij}^2}{n} \frac{ \varDelta _i^2}{(G^o_i)^2}\right) \mid X, \{\varDelta _i\}\right] \nonumber \\\le & {} 2s \exp \left( \lambda ^2 s \tau _0^{-2} t^2\right) , \end{aligned}$$
(7)

where the first inequality is elementary, the second inequality follows from the fact that \(E[\exp (|Z|)] \le 2E[\exp (Z)]\) for any symmetric random variable Z, the third inequality follows from the intermediate step in the proof of Hoeffding’s inequality (page 100 in van der Vaart and Wellner (1996)), and the last inequality follows from Assumption 5 and \(\varDelta _i^2 \le 1\). Therefore, we have

$$\begin{aligned} P(B^o(t)> K_1) \le \min _{\lambda>0} \exp (-\lambda K_1) E[\exp (\lambda B^o(t))] \le \min _{\lambda >0}2s \exp (-\lambda K_1+\lambda ^2 s \tau _0^{-2} t^2). \end{aligned}$$

The minimum is achieved at \(\lambda = K_1 \tau _0^{2} / (2st^2)\). Thus, we have

$$\begin{aligned} P(B^o(t) > K_1) \le 2s\exp (-K_1^2 \tau _0^{2} / (4st^2)). \end{aligned}$$

Next, for \(\lambda >0\), we have

$$\begin{aligned} E[\exp (\lambda C^o(t))]= & {} E\left[ E\left[ \exp \left( \lambda \sup _{\Vert \delta _S\Vert \le t} \, \left| G_n \left[ \frac{V_i \varDelta _i}{G^o_i} W_i(\tau , \delta _S)\right] \right| \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] \right] \\\le & {} E\left[ E\left[ \exp \left( 2\lambda \sup _{\Vert \delta _S\Vert \le t} \, \left| G_n \left[ \frac{V_i \varDelta _i}{G^o_i} {{\varvec{X}}}_{is}^\top \delta _S\right] \right| \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] \right] \\\le & {} 2s \exp \left( 4\lambda ^2 s \tau _0^{-2} t^2\right) , \end{aligned}$$

where the first inequality follows from the contractive property \(|W_i(\tau , u)-W_i(\tau ,v)| < |{{\varvec{X}}}_{iS}^\top (u-v)|\), and from Theorem 4.12 in Ledoux and Talagrand (1991) with the convex function \(\exp (\cdot )\), and the second inequality follows from the same arguments used in (7).

Therefore, we have

$$\begin{aligned} P(C^o(t)> K_2) \le \min _{\lambda>0} \exp (-\lambda K_2) E[\exp (\lambda C^o(t))] \le \min _{\lambda >0}2s \exp (-\lambda K_2+4\lambda ^2 s \tau _0^{-2} t^2). \end{aligned}$$

The minimum is achieved at \(\lambda = K_2 \tau _0^{2} / (8st^2)\). Thus, we have

$$\begin{aligned} P(C^o(t) > K_2) \le 2s\exp (-K_2^2 \tau _0^{2} / (16st^2)). \end{aligned}$$

Thus, we have

$$\begin{aligned} P(A^o(t) \ge M/4)\le & {} P(B^o(t) \ge M/8) + P(C^o(t) \ge M/8) \\\le & {} 4s\exp (-M^2 \tau _0^2 / (1024st^2)) \end{aligned}$$

Hence, since \(st^2 = d_1^2\), \(P(A(t) \ge M) \le Cs\exp (-M^2 \tau _0^2 / (1024st^2)) = Cs\exp (-M^2 \tau _0^2/ (1024d_1^2))\) for some \(C>0\), provided that \(M > 2\sqrt{\lambda _{M}} \tau _0^{-1} t = 2\sqrt{\lambda _{M}} \tau _0^{-1} d_1/\sqrt{s}\). Setting \(M=A_3 d_1 \sqrt{\rho \log p}\), where \(A_3:= 4\sqrt{\lambda _{M}} \tau _0^{-1}\), we have \(P(A(t) \ge A_3 d_1 \sqrt{\rho \log p}) \lesssim s \exp (-C \rho \log p)\). Thus, we obtain

$$\begin{aligned} P(\max _{S: |S| \le \rho } A(t) \ge A_3 d_1 \sqrt{\rho \log p})\le & {} \sum _{s=1}^{\rho } {p \atopwithdelims ()s} P(A(t) \ge A_3 d_1 \sqrt{\rho \log p}) \\\lesssim & {} \sum _{s=1}^{\rho } (ep/s)^s s \exp (-C \rho \log p) \\\lesssim & {} \exp (-C \rho \log p). \end{aligned}$$

Combining this with Lemma 2 and the fact that \(\log n = o(s\rho \log p)\), which holds by Assumption 2, we obtain

$$\begin{aligned} \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } |\ell _S(\beta _S) - \ell _S(\tilde{\beta }_S) - E [\ell _S(\beta _S)] + E[\ell _S(\tilde{\beta }_S)]| \le 2A_3 d_1 \sqrt{\rho \log p/n}. \end{aligned}$$

This completes the proof. \(\square\)
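
For completeness, the intermediate step of Hoeffding's inequality invoked for the third inequality in (7) is the standard sub-Gaussian moment bound for Rademacher averages: for a Rademacher variable V and any real number t,

$$\begin{aligned} E[\exp (tV)] = \cosh (t) = \sum _{k \ge 0} \frac{t^{2k}}{(2k)!} \le \sum _{k \ge 0} \frac{(t^2/2)^k}{k!} = \exp (t^2/2), \end{aligned}$$

so that, conditionally on \({{\varvec{X}}}\) and \(\{\varDelta _i\}\) and for a fixed \(j \in S\), writing \(a_i := \varDelta _i x_{ij}/G^o_i\),

$$\begin{aligned} E\left[ \exp \left( \lambda \sqrt{s} t\, G_n\left[ \frac{V_i \varDelta _i}{G^o_i} x_{ij}\right] \right) \mid {{\varvec{X}}}, \{\varDelta _i\} \right] = \prod _{i=1}^n E\left[ \exp \left( \frac{\lambda \sqrt{s} t}{\sqrt{n}} V_i a_i\right) \right] \le \exp \left( \frac{\lambda ^2 s t^2}{2n} \sum _{i=1}^n a_i^2\right) . \end{aligned}$$

Replacing the maximum over \(j \in S\) by the sum over the s coordinates and bounding \(a_i^2 = \varDelta _i^2 x_{ij}^2/(G^o_i)^2 \le x_{ij}^2 \tau _0^{-2}\) then leads to the form displayed in (7).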

Lemma 4

Suppose the conditions of Theorem 1 hold. Then, given a model S, for any \(\beta _S\) satisfying \(\Vert \beta _S - \beta ^*_S\Vert \le 4q \lambda _{M}^{-1/2}\),

$$\begin{aligned}&\frac{1}{8}\lambda_{m}\tau_0\underline{f} \Vert \beta _S - \beta ^*_S\Vert ^2 - C_3 n^{-1} \log n \le E[\ell _S(\beta _S)] - E[\ell _S(\beta ^*_S)]\\&\quad \le \frac{3\lambda _{M} \bar{f}}{2\tau _0} \Vert \beta _S - \beta ^*_S\Vert ^2 + C_3 n^{-1} \log n. \end{aligned}$$

for some positive constant \(C_3\). For any \(\beta _S\) satisfying \(\Vert \beta _S - \beta ^*_S\Vert > 4q \lambda _{M}^{-1/2}\),

$$\begin{aligned} \frac{1}{4} q \lambda _{M}^{-1/2} \lambda _{m}\tau_0 \underline{f} \Vert \beta _S - \beta ^*_S\Vert \le E[\ell _S(\beta _S)] - E[\ell _S(\beta ^*_S)] \le 2\tau _0^{-1} \lambda _{M}^{1/2}\Vert \beta _S - \beta ^*_S\Vert . \end{aligned}$$

Proof

We first consider the following Case 1.

Case 1: When \(\Vert \beta _S - \beta ^*_S\Vert \le 4q \lambda _{M}^{-1/2}\)

Let \(\delta _S := \beta _S - \beta ^*_S\). By the definition of \(\beta ^*_S\), we have

$$\begin{aligned} E\left[ \frac{1}{n}\sum _{i=1}^n \frac{\varDelta _i}{G^o_i} {{\varvec{X}}}_{iS}\left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S < 0\}\right) \right] =0. \end{aligned}$$

Hence, we have

$$\begin{aligned}&\left| E\left[ \frac{1}{n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} {{\varvec{X}}}_{iS}^\top \delta _S \left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S< 0\}\right) \right] \right| \nonumber \\&= \left| E\left[ \frac{1}{n}\sum _{i=1}^n \varDelta _i \left( \frac{1}{{\hat{G}}_i}- \frac{1}{G^o_i} \right) {{\varvec{X}}}_{iS}^\top \delta _S \left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S < 0\}\right) \right] \right| \nonumber \\&\le \frac{1}{n} \sum _{i=1}^n E\left[ |{{\varvec{X}}}_{iS}^\top \delta _S|\right] O\left( n^{-1/2} (\log n)^{1/2}\right) \nonumber \\&\le E\left[ \left( \frac{1}{n}\sum _{i=1}^n |{{\varvec{X}}}_{iS}^\top \delta _S|^2\right) ^{1/2} \right] O\left( n^{-1/2} (\log n)^{1/2}\right) \nonumber \\&\le E\left[ \frac{1}{n}\sum _{i=1}^n |{{\varvec{X}}}_{iS}^\top \delta _S|^2 \right] ^{1/2} O\left( n^{-1/2} (\log n)^{1/2}\right) \\&\le \lambda _{M}^{1/2} \Vert \delta _S\Vert O\left( n^{-1/2} (\log n)^{1/2}\right) , \nonumber \end{aligned}$$
(8)

where the first inequality follows from Lemma 1, the second inequality follows from the Cauchy-Schwarz inequality, the third inequality follows from the fact that \(EX^2 \ge [E|X|]^2\) for a random variable X, and the last inequality follows from Assumption 3.
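
For reference, the identity of Knight (1998), in the form applied in the next display, states that for any real numbers u and v,

$$\begin{aligned} \rho _\tau (u-v) - \rho _\tau (u) = -v\left( \tau - 1\{u < 0\}\right) + \int _0^v \left( 1\{u \le z\} - 1\{u \le 0\}\right) dz. \end{aligned}$$

It is applied below with \(u = Y_i - {{\varvec{X}}}_{iS}^\top \beta ^*_S\) and \(v = {{\varvec{X}}}_{iS}^\top \delta _S\); taking conditional expectations turns the integrand into \(F({{\varvec{X}}}_{iS}^\top \beta ^*_S + z \mid {{\varvec{X}}}_{i}, \varDelta _i) - F({{\varvec{X}}}_{iS}^\top \beta ^*_S \mid {{\varvec{X}}}_{i}, \varDelta _i)\).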

Hence, by the identity of Knight (1998), we have, for some \(\tilde{z} \in [0,z]\),

$$\begin{aligned}&E\left[ \frac{1}{n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) - \frac{\varDelta _i}{{\hat{G}}_i} \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S) \right] \\= & {} E\left[ \frac{1}{n} \sum _{i=1}^n - \frac{\varDelta _i}{{\hat{G}}_i}{{\varvec{X}}}_{iS}^\top \delta _S \left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S < 0\}\right) + \frac{\varDelta _i}{{\hat{G}}_i} \int _0^{{{\varvec{X}}}_{iS}^\top \delta _S} F({{\varvec{X}}}_{iS}^\top \beta ^*_S\right. \nonumber \\&\left. + z \mid {{\varvec{X}}}_{i}, \varDelta _i) - F({{\varvec{X}}}_{iS}^\top \beta ^*_S \mid {{\varvec{X}}}_{i}, \varDelta _i) dz \right] \nonumber \\= & {} E\left[ \frac{1}{n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \int _0^{{{\varvec{X}}}_{iS}^\top \delta _S} F({{\varvec{X}}}_{iS}^\top \beta ^*_S + z \mid {{\varvec{X}}}_{i}, \varDelta _i) - F({{\varvec{X}}}_{iS}^\top \beta ^*_S \mid {{\varvec{X}}}_{i}, \varDelta _i) dz \right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \nonumber \\= & {} E\left[ \frac{1}{n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \int _0^{{{\varvec{X}}}_{iS}^\top \delta _S} z f({{\varvec{X}}}_{iS}^\top \beta ^*_S \mid {{\varvec{X}}}_{i}, \varDelta _i) + \frac{z^2}{2} f'({{\varvec{X}}}_{iS}^\top \beta ^*_S + \tilde{z} \mid {{\varvec{X}}}_{i}, \varDelta _i) dz\right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \nonumber \\\ge & {} E\left[ \frac{\underline{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{iS}^\top \delta _S|^2 - \frac{\bar{f}'}{6n} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{iS}^\top \delta _S|^3\right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \nonumber \\\ge & {} \underbrace{E\left[ \frac{\underline{f}}{2n}\sum _{i=1}^n \varDelta _i |{{\varvec{X}}}_{iS}^\top \delta _S|^2 - \frac{\bar{f}'}{6n} \sum _{i=1}^n \frac{\varDelta _i}{\tau _0} |{{\varvec{X}}}_{iS}^\top \delta _S|^3\right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert }_{I}, \nonumber \end{aligned}$$
(9)

where the first inequality follows from Assumption 1 and the second inequality follows from Assumption 5. By Assumption 3, \(\Vert \delta _S\Vert \le 4q \lambda _{M}^{-1/2}\), and \(\varDelta _i \in \{0,1\}\), we have

$$\begin{aligned} E[\varDelta _i |{{\varvec{X}}}_{iS}^\top \delta _S|^2]^{1/2} \le \lambda _{M}^{1/2} \Vert \delta _S\Vert \le 4q \le \frac{3\tau_0\underline{f} }{2\bar{f}'} \frac{E[\varDelta _i ({{\varvec{X}}}_{iS}^\top \delta _S)^2]^{3/2}}{E[\varDelta _i |{{\varvec{X}}}_{iS}^\top \delta _S|^3]}, \end{aligned}$$

which gives

$$\begin{aligned} E\left[ \frac{\bar{f}'}{6n} \sum _{i=1}^n \frac{\varDelta _i}{\tau _0} |{{\varvec{X}}}_{iS}^\top \delta _S|^3\right] \le E\left[ \frac{\underline{f}}{4n}\sum _{i=1}^n \varDelta _i |{{\varvec{X}}}_{iS}^\top \delta _S|^2\right] . \end{aligned}$$
(10)

Hence, we have

$$\begin{aligned} I\ge & {} E\left[ \frac{\underline{f}}{4n}\sum _{i=1}^n \varDelta _i |{{\varvec{X}}}_{iS}^\top \delta _S|^2\right] + O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\ge & {} E\left[ \frac{\tau_0\underline{f}}{4n}\sum _{i=1}^n |{{\varvec{X}}}_{iS}^\top \delta _S|^2\right] + O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\ge & {} \frac{\lambda _{m} \tau_0\underline{f}}{4} \Vert \delta _S\Vert ^2 +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\ge & {} \frac{\lambda _{m} \tau_0\underline{f}}{8} \Vert \delta _S\Vert ^2 +O\left( n^{-1} \log n\right) \\\ge & {} \frac{\lambda _{m} \tau_0\underline{f}}{8} \Vert \delta _S\Vert ^2 -C_3 n^{-1} \log n \end{aligned}$$

for some \(C_3>0\), where the second inequality follows from Assumption 5 and the third inequality follows from Assumption 3.

Furthermore, we have

$$\begin{aligned}&E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) - \frac{\varDelta _i}{{\hat{G}}_i} \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S) \right] \\\le & {} E\left[ \frac{\bar{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{iS}^\top \delta _S|^2 \right] + E\left[ \frac{\bar{f}'}{6n} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{iS}^\top \delta _S|^3 \right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\le & {} E\left[ \frac{\bar{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{\tau _0} |{{\varvec{X}}}_{iS}^\top \delta _S|^2 \right] + E\left[ \frac{\bar{f}'}{6n} \sum _{i=1}^n \frac{\varDelta _i}{\tau _0} |{{\varvec{X}}}_{iS}^\top \delta _S|^3 \right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\le & {} E\left[ \frac{3\bar{f}}{4n}\sum _{i=1}^n \frac{\varDelta _i}{\tau _0} |{{\varvec{X}}}_{iS}^\top \delta _S|^2 \right] +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\le & {} \frac{3\lambda _{M} \bar{f}}{4\tau _0} \Vert \delta _S\Vert ^2 +O\left( n^{-1/2} (\log n)^{1/2}\right) \Vert \delta _S\Vert \\\le & {} \frac{3\lambda _{M} \bar{f}}{2\tau _0} \Vert \delta _S\Vert ^2 + C_3 n^{-1} \log n, \end{aligned}$$

for some \(C_3>0\), where the third inequality follows from (10). This completes the proof of the first part of the lemma. Next, we consider the following Case 2.

Case 2: When \(\Vert \beta _S - \beta ^*_S\Vert > 4q \lambda _{M}^{-1/2}\)

Let \(\delta _S = \beta _S - \beta ^*_S\). Then, for \(a := \Vert \delta _S\Vert / (4q \lambda _{M}^{-1/2}) > 1\), it holds that \(\delta _S = a \tilde{\delta }_S\) for some \(\Vert \tilde{\delta }_S\Vert = 4q \lambda _{M}^{-1/2}\). Define

$$\begin{aligned} Q(\delta _S) := E[\ell _S(\beta ^*_S+\delta _S)] - E[\ell _S(\beta ^*_S)] = E[\ell _S(\beta _S)] - E[\ell _S(\beta ^*_S)]. \end{aligned}$$

Because \(Q(\cdot )\) is a convex function, we have

$$\begin{aligned} \frac{1}{a} Q(a \tilde{\delta }_S) + \left( 1-\frac{1}{a}\right) Q(0) \ge Q(\tilde{\delta }_S). \end{aligned}$$

Because \(Q(0)=0\) and \(\Vert \tilde{\delta }_S\Vert = 4q \lambda _{M}^{-1/2}\), we have

$$\begin{aligned} Q(\delta _S)= & {} Q(a \tilde{\delta }_S) \ge a Q(\tilde{\delta }_S) \ge \frac{a\lambda _{m} \tau_0\underline{f}}{8} \Vert \tilde{\delta }_S\Vert ^2 - a C_3 n^{-1} \log n \\= & {} \Vert \delta _S\Vert \left\{ 0.5 q \lambda _{M}^{-1/2} \lambda _{m} \tau_0\underline{f} + O\left( n^{-1}\log n \right) \right\} \\\ge & {} 0.25 q \lambda _{M}^{-1/2} \lambda _{m} \tau_0\underline{f} \Vert \delta _S\Vert , \end{aligned}$$

where the second inequality follows from Case 1.

Note that we have

$$\begin{aligned} E\left[ \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) - \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S) \mid X_i, \varDelta _i\right]\le & {} E[|{{\varvec{X}}}_{iS}^\top (\beta _S -\beta ^*_S)|\mid X_i, \varDelta _i] \\\le & {} E[|{{\varvec{X}}}_{iS}^\top (\beta _S -\beta ^*_S)|^2 \mid X_i, \varDelta _i]^{1/2} \\\le & {} \lambda _{M}^{1/2} \Vert \beta _S -\beta ^*_S\Vert , \end{aligned}$$

where the last inequality follows from Assumption 3. By Lemma 1 and Assumption 5, we also have \({\hat{G}}_i \gtrsim G^o_i -n^{-1/2} (\log n)^{1/2} \ge \tau _0 -n^{-1/2} (\log n)^{1/2} \gtrsim 0.5 \tau _0\). Hence,

$$\begin{aligned} E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i}\rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta _S) - \frac{\varDelta _i}{{\hat{G}}_i} \rho _{\tau }(Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S) \right] \le 2\tau _0^{-1} \lambda _{M}^{1/2} \Vert \delta _S\Vert . \end{aligned}$$

This completes the proof of the second part of the lemma. \(\square\)

Lemma 5

Under the conditions of Theorem 1, there exist universal constants \(A_4, A_5 >0\) such that

$$\begin{aligned} \Vert \hat{\beta }_S - \beta ^*_S\Vert \le A_4 \sqrt{\rho ^2 \log p / n},\quad |\ell _S(\hat{\beta }_S) - \ell _S(\beta ^*_S)| \le A_5 \rho ^2 \log p/n \end{aligned}$$

hold uniformly over \(S: |S| \le \rho\), with probability at least \(1-6\exp (-6\rho \log p)\).

Proof

We first define an event

$$\begin{aligned} E = \left\{ \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } |\ell _S(\beta _S) - \ell _S(\beta ^*_S) - E [\ell _S(\beta _S)] + E[\ell _S(\beta ^*_S)]| \le 2A_3 d_1 \sqrt{\rho \log p/n} \right\} . \end{aligned}$$

By Lemma 3, the event E holds with probability at least \(1-6\exp (-6\rho \log p)\).

If \(\Vert \beta _S-\beta ^*_S\Vert =A_4 \sqrt{\rho ^2 \log p / n}\), then \(\Vert \beta _S-\beta ^*_S\Vert \le A_4 \sqrt{\rho ^3 \log p /n} / \sqrt{s}\) and \(\Vert \beta _S - \beta ^*_S\Vert \le 4q \lambda _{M}^{-1/2}\) because \(s \le \rho\) and \(\rho ^2 \log p = o(n)\). Thus, we have under the event E,

$$\begin{aligned} \ell _S(\beta ^*_S) - \ell _S(\beta _S)= & {} \ell _S(\beta ^*_S) - E[\ell _S(\beta ^*_S)] - \ell _S(\beta _S) + E[\ell _S(\beta _S)] + E[\ell _S(\beta ^*_S)] - E[\ell _S(\beta _S)] \\\le & {} 2A_3 A_4 \sqrt{\rho ^3 \log p /n} \,\sqrt{\rho \log p/n} - \frac{1}{8}\lambda _{m}\tau_0\underline{f} \Vert \beta _S - \beta ^*_S\Vert ^2 +O\left( n^{-1} \log n\right) \\\le & {} 2A_3 A_4 \rho ^2 \log p /n - \frac{1}{8} \lambda _{m} \tau_0\underline{f} A_4^2 {\rho ^2 \log p / n} + O\left( n^{-1} \log n\right) \\< & {} 0, \end{aligned}$$

where the first inequality follows from Lemma 4 and the last inequality holds for sufficiently large constant \(A_4>0\) due to the growth rate condition of \(\rho\) in Assumption 1. Thus

$$\begin{aligned} \min _{|S| \le \rho , \Vert \beta _S-\beta ^*_S\Vert =A_4 \sqrt{\rho ^2 \log p / n}} \ell _S(\beta _S) -\ell _S(\beta ^*_S) > 0. \end{aligned}$$

By the convexity of \(\ell _S(\cdot )\), the minimizer \(\hat{\beta }_S\) cannot lie outside this sphere, so \(\Vert \hat{\beta }_S-\beta ^*_S\Vert \le A_4 \sqrt{\rho ^2 \log p / n}\). This proves the first part.

For any \(\beta _S\) satisfying \(\Vert \beta _S - \beta ^*_S\Vert \le A_4 \sqrt{\rho ^2 \log p / n}\), we have under the event E,

$$\begin{aligned} |\ell _S(\beta ^*_S) - \ell _S(\beta _S)|\le & {} |\ell _S(\beta ^*_S) - E[\ell _S(\beta ^*_S)] - \ell _S(\beta _S) + E[\ell _S(\beta _S)]| + |E[\ell _S(\beta ^*_S)] - E[\ell _S(\beta _S)]| \\\le & {} 2A_3 A_4 \rho ^2 \log p /n + 1.5 \lambda _{M} \bar{f} \tau _0^{-1} \Vert \beta _S - \beta ^*_S\Vert ^2 + O\left( n^{-1} \log n\right) \\\le & {} 2A_3 A_4 \rho ^2 \log p /n + 1.5 \lambda _{M} \bar{f} \tau _0^{-1} A_4^2 {\rho ^2 \log p / n} + O\left( n^{-1} \log n\right) \\\le & {} A_5 \rho ^2 \log p/n \end{aligned}$$

for some constant \(A_5>0\), where the second inequality follows from Lemma 4, and the last inequality follows from the growth rate condition of \(\rho\) in Assumption 1. This completes the proof. \(\square\)

Lemma 6

Given a model S such that \(|S| < \rho\) and \(M_\tau \not \subseteq S\), under Assumption 4 there exists \(j \in M_\tau \setminus S\) such that \([\beta ^*_{S \bigcup \{j\}}]_j \ne 0\). Further, if Assumption 1 also holds, then there exists \(j \in M_\tau \setminus S\) such that \(|[\beta ^*_{S \bigcup \{j\}}]_j| \ge C \bar{f}^{-1} \tau _0 n^{-\alpha }\), where \(C>0\) is defined in Assumption 4.

Proof

Suppose that \([\beta ^*_{S \bigcup \{j\}}]_j = 0\) for all \(j \in M_\tau \setminus S\). Fix \(j \in M_\tau \setminus S\). As \(\beta ^*_{S \bigcup \{j\}}\) is the minimizer of \(E[\ell ^o_{S \bigcup \{j\}}(\beta )]\), by the convexity of \(E[\ell ^o_{S \bigcup \{j\}}(\beta )]\), it holds that

$$\begin{aligned} 0= & {} E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} x_{ij} \psi _\tau \left( Y_i-{{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}\right) \right] \\= & {} E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} x_{ij} \psi _\tau \left( Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S\right) \right] , \end{aligned}$$

where the second equality follows from the fact that \([\beta ^*_{S \bigcup \{j\}}]_j = 0\). This contradicts Assumption 4. Thus, there exists \(j \in M_\tau \setminus S\) such that \([\beta ^*_{S \bigcup \{j\}}]_j \ne 0\).

Further, by Assumption 4, there exists \(j \in M_\tau \setminus S\) such that

$$\begin{aligned} Cn^{-\alpha }\le & {} \left| E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} x_{ij} \psi _\tau \left( Y_i-{{\varvec{X}}}_{iS}^\top \left[ \beta ^*_{S \bigcup \{j\}}\right] _S\right) \right] \right| \\= & {} \left| E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} x_{ij} \psi _\tau \left( Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S\right) \right] \right. \\&\left. - E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{G^o_i} x_{ij} \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}})\right] \right| \\\le & {} \left| E\left[ \frac{1}{n} \sum _{i=1}^n \frac{|x_{ij}|}{G^o_i} P\left\{ Y_i \in \left( {{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S, {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}\right) \mid {{\varvec{X}}}_i \right\} \right] \right| \\\le & {} \left| E\left[ \frac{1}{n} \sum _{i=1}^n {\bar{f}} \frac{x_{ij}^2}{G^o_i} |[\beta ^*_{S \bigcup \{j\}}]_j| \right] \right| \\= & {} \frac{{\bar{f}}}{\tau _0} \left| [\beta ^*_{S \bigcup \{j\}}]_j\right| , \end{aligned}$$

where the first equation holds due to the definition of \(\beta ^*_{S \bigcup \{j\}}\), the third inequality holds due to Assumption 1 and the fact that \(\left| {{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S-{{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}\right| = \left| x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j\right|\), and the last equality follows from the fact that \(E[x_{ij}^2]=1\) and Assumption 6. This completes the second part of Lemma 6. \(\square\)

1.2 Proof of Theorem 1

Proof

We first define an event

$$\begin{aligned} E= & {} \left\{ \sup _{\beta _S \in B^o_S(d_1),\ |S| \le \rho } |\ell _S(\beta _S) - \ell _S(\beta ^*_S) - E [\ell _S(\beta _S)] + E[\ell _S(\beta ^*_S)]| \le 2A_3 d_1 \sqrt{\rho \log p/n} \right\} \\&\bigcap \left\{ \sup _{|S| \le \rho } \Vert \hat{\beta }_S - \beta ^*_S\Vert \le A_4 \sqrt{\rho ^2 \log p / n},\quad \sup _{|S| \le \rho } |\ell _S(\hat{\beta }_S) - \ell _S(\beta ^*_S)| \le A_5 \rho ^2 \log p/n \right\} \end{aligned}$$

By Lemmas 3 and 5, the event E holds with probability at least \(1-6\exp (-6\rho \log p)\). Under the event E, we have \(\hat{\beta }_S \in B_S^o(d_1) = \{\beta _S:\ \Vert \beta _S-\beta ^*_S\Vert \le d_1/\sqrt{s}\}\), where \(d_1 := A_4 \sqrt{s\rho ^2 \log p / n}\). Note that

$$\begin{aligned}&\ell _S(\hat{\beta }_S)-\ell _{S \bigcup \{j\}}(\beta ^*_{S \bigcup \{j\}}) \\&= \ell _S(\hat{\beta }_S) - \ell _S(\beta ^*_S) + \ell _S(\beta ^*_S) - \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}}) \\&= \underbrace{\ell _S(\hat{\beta }_S) - \ell _S(\beta ^*_S) - E \left[ \ell _S(\hat{\beta }_S)\right] + E[\ell _S(\beta ^*_S)]}_{I_1} + \underbrace{E[\ell _S(\hat{\beta }_S)] - E[\ell _S(\beta ^*_S)]}_{I_2} + \underbrace{\ell _S(\beta ^*_S) - \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}})}_{I_3}. \end{aligned}$$

Under the event E, \(I_1 \ge -2A_3 d_1 \sqrt{\rho \log p/n} \ge -2A_3 A_4 \rho ^2 \log p /n\). Also by Lemma 4, we obtain \(I_2 \ge \frac{1}{8}\lambda _{m}\underline{f} \tau _0 \Vert \hat{\beta }_S - \beta ^*_S\Vert ^2 - C_3 n^{-1} \log n \ge - C_3 n^{-1} \log n\). Next, we have

$$\begin{aligned} I_3= & {} \underbrace{\ell _S(\beta ^*_S) -E \ell _S(\beta ^*_S) + E \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}}) - \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}})}_{I_{31}} \\&+ \underbrace{E\ell _S(\beta ^*_S) - E \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}})}_{I_{32}}. \end{aligned}$$

Using arguments similar to those in the proof of Lemma 3, with probability at least \(1-6\exp (-6\rho \log p)\), we have \(I_{31} \ge -2 \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert A_3 \rho \sqrt{\log p/n}\).

Next, we consider \(I_{32}\). Note that by Lemma 6, we can consider \(j \in S^c\) such that \(|[\beta ^*_{S \bigcup \{j\}}]_j| \ge C \bar{f}^{-1} \tau _0 n^{-\alpha }\), and hence \(\Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \ge C \bar{f}^{-1}\tau _0 n^{-\alpha }\). By Lemma 4, we have

$$\begin{aligned} I_{32} \gtrsim \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert ^2 \wedge \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert - C_3 n^{-1} \log n. \end{aligned}$$

Hence, we have with probability at least \(1-6\exp (-6\rho \log p)\),

$$\begin{aligned} I_3 \gtrsim \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert ^2 \wedge \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert - n^{-1} \log n - \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \rho \sqrt{\log p/n}. \end{aligned}$$

Hence, if \(\Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \gtrsim 1\), then

$$\begin{aligned} I_3 \gtrsim \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert - n^{-1} \log n \gtrsim C \bar{f}^{-1} \tau _0 n^{-\alpha } - n^{-1} \log n, \end{aligned}$$

where the inequality follows from \(\Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \ge |[\beta ^*_{S \bigcup \{j\}}]_j| \ge C \bar{f}^{-1} \tau _0 n^{-\alpha }\). Otherwise, if \(\Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \lesssim 1\), a similar argument gives

$$\begin{aligned} I_3\gtrsim & {} \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert ^2 - n^{-1} \log n - \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \rho \sqrt{\log p/n} \\\gtrsim & {} \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert ^2 - n^{-1} \log n - \rho \sqrt{\log p/n} \\\gtrsim & {} C^2 \bar{f}^{-2} \tau _0^2 n^{-2\alpha } - n^{-1} \log n - \rho \sqrt{\log p/n}. \end{aligned}$$

Hence, we have with probability at least \(1-6\exp (-6\rho \log p)\),

$$\begin{aligned} I_3 \gtrsim C^2 \bar{f}^{-2} \tau _0^2 n^{-2\alpha } - n^{-1} \log n - \rho \sqrt{\log p/n}. \end{aligned}$$

Combining these inequalities, we have

$$\begin{aligned} \max _{j \in S^c} \ell _S(\hat{\beta }_S)-\ell _{S \bigcup \{j\}} (\hat{\beta }_{S \bigcup \{j\}})\ge & {} C^2 \bar{f}^{-2} \tau _0^2 n^{-2\alpha } -\rho ^2 \log p /n - \rho \sqrt{\log p/n}- n^{-1} \log n\\\gtrsim & {} 0.5 C^2 \bar{f}^{-2} \tau _0^2 n^{-2\alpha } \end{aligned}$$

uniformly in S satisfying \(|S|<\rho ,\ M_\tau \nsubseteq S\), where the last inequality follows from the growth rate condition of \(\rho\) in Assumption 4. By letting \(C_1 := 0.5 C^2 \bar{f}^{-2} \tau _0^2\), this completes the proof. \(\square\)

1.3 Proof of Theorem 2

Proof

The proof consists of three steps.

Claim 1: It holds that \(|[\beta ^*_{S \bigcup \{j\}}]_j| =o(n^{-\alpha })\) for any \(j \in (M_\tau \bigcup S)^c\).

Proof of Claim 1: By Assumption 1, if \({{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}} -{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S \ge 0\), i.e., \(x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \ge 0\), then

$$\begin{aligned}&E\left[ \psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S) - \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}) \mid \varDelta _i, {{\varvec{X}}}_i\right] \\&= P\left( Y_i \le {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}} \mid \varDelta _i, {{\varvec{X}}}_i\right) - P\left( Y_i \le {{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S \mid \varDelta _i, {{\varvec{X}}}_i\right) \\&\ge \underline{f} \left( {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}} -{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S \right) \\&= \underline{f} x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j. \end{aligned}$$

Else, if \({{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}} -{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S < 0\), i.e., \(x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j < 0\), then

$$\begin{aligned} E\left[ \psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S) - \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}) \mid \varDelta _i, {{\varvec{X}}}_i\right] \le \underline{f} x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j. \end{aligned}$$

Thus, we have

$$\begin{aligned}&E\left[ x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j\psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S) -x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}) \mid \varDelta _i, {{\varvec{X}}}_i\right] \nonumber \\&\ge \underline{f} x^2_{ij} [\beta ^*_{S \bigcup \{j\}}]^2_j. \end{aligned}$$
(11)

By the conditions of Theorem 2 and the fact that \(E[\varDelta _i \mid {{\varvec{X}}}_i] = G^o_i\), we have for any \(j \in (M_\tau \bigcup S)^c\),

$$\begin{aligned}&o(n^{-\alpha }|[\beta ^*_{S \bigcup \{j\}}]_j|) \\= & {} \left| E\left[ \frac{\varDelta _i}{G^o_i} x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S)\right] \right| \\= & {} \left| E\left[ \frac{\varDelta _i}{G^o_i} x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S)\right] - E\left[ \frac{\varDelta _i}{G^o_i} x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}})\right] \right| \\= & {} \left| E\left[ \frac{\varDelta _i}{G^o_i} E\left[ x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i-{{\varvec{X}}}_{iS}^\top [\beta ^*_{S \bigcup \{j\}}]_S) - x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \psi _\tau (Y_i- {{\varvec{X}}}_{i,S \bigcup \{j\}}^\top \beta ^*_{S \bigcup \{j\}}) \mid \varDelta _i, {{\varvec{X}}}_i \right] \right] \right| \\\ge & {} E\left[ \underline{f} x^2_{ij} [\beta ^*_{S \bigcup \{j\}}]^2_j \frac{\varDelta _i}{G^o_i}\right] \\= & {} \underline{f}[\beta ^*_{S \bigcup \{j\}}]^2_j, \end{aligned}$$

where the second equality holds due to the definition of \(\beta ^*_{S \bigcup \{j\}}\), the first inequality follows from (11), and the last equality follows from the fact that \(E[\varDelta _i \mid {{\varvec{X}}}_i] = G^o_i\) and \(E[x_{ij}^2]=1\). Thus, \(|[\beta ^*_{S \bigcup \{j\}}]_j| =o(n^{-\alpha })\) for any \(j \in (M_\tau \bigcup S)^c\).

Claim 2: It holds that \(\max _{|S| < \rho , j \in (M_\tau \bigcup S)^c} \ -\ell _{S \bigcup \{j\}}(\hat{\beta }_{S \bigcup \{j\}}) + \ell _S(\hat{\beta }_S) \le 0.5 C_1 n^{-2\alpha }\) with probability at least \(1-36 \exp (-6\rho \log p)\).

Proof of Claim 2: We can write

$$\begin{aligned}&\max _{|S|< \rho , j \in (M_\tau \bigcup S)^c} \ -\ell _{S \bigcup \{j\}}(\hat{\beta }_{S \bigcup \{j\}}) + \ell _S(\hat{\beta }_S) \\\le & {} \underbrace{\max _{|S|< \rho , j \in (M_\tau \bigcup S)^c} -\ell _{S \bigcup \{j\}}(\hat{\beta }_{S \bigcup \{j\}}) + \ell _{S \bigcup \{j\}}(\beta ^*_{S \bigcup \{j\}})}_{I_1} +\underbrace{\max _{|S| < \rho , j \in (M_\tau \bigcup S)^c} -\ell _{S \bigcup \{j\}}(\beta ^*_{S \bigcup \{j\}}) + \ell _S(\hat{\beta }_S)}_{I_2}. \end{aligned}$$

Using arguments similar to those in Lemma 5, we have, with probability at least \(1-6\exp (-6\rho \log p)\),

$$\begin{aligned} |I_1| \le A_5 \rho ^2 \log p/n. \end{aligned}$$
(12)

Next, we can decompose \(I_2\) as

$$\begin{aligned} I_2 = \underbrace{\ell _S(\hat{\beta }_S) - \ell _S(\beta ^*_S) - E \left[ \ell _S(\hat{\beta }_S)\right] + E[\ell _S(\beta ^*_S)]}_{I_{21}} + \underbrace{E[\ell _S(\hat{\beta }_S)] - E[\ell _S(\beta ^*_S)]}_{I_{22}} +\underbrace{\ell _S(\beta ^*_S)-\ell _{S \bigcup \{j\}}(\beta ^*_{S \bigcup \{j\}})}_{I_{23}}. \end{aligned}$$

Because \(\Vert \hat{\beta }_S - \beta ^*_S\Vert \le A_4 \sqrt{\rho ^2 \log p / n}\) by Lemma 5, we have by Lemma 3,

$$\begin{aligned} I_{21} \le 2A_3 A_4 \rho ^2 \log p/n. \end{aligned}$$
(13)

By Lemma 4, we have

$$\begin{aligned} I_{22} \le 1.5 \lambda _{M} \tau _0^{-1} \bar{f} \Vert \hat{\beta }_S - \beta ^*_S\Vert ^2 + C_3 n^{-1} \log n \le 1.5 \lambda _{M} \tau _0^{-1} \bar{f} A_4^2 \rho ^2 \log p / n + C_3 \log n/n. \end{aligned}$$
(14)

Next, we consider \(I_{23}\). We obtain

$$\begin{aligned} I_{23}= & {} \underbrace{\ell _S(\beta ^*_S) - \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}}) -E \ell _S(\beta ^*_S) + E \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}})}_{I_{231}} \\&+ \underbrace{E\ell _S(\beta ^*_S) - E \ell _{S \bigcup \{j\}}(\beta ^*_{S\bigcup \{j\}})}_{I_{232}}. \end{aligned}$$

By Lemma 2 and Assumption 3, we have with probability at least \(1-6\exp (-6\rho \log p)\),

$$\begin{aligned} I_{231}\le & {} 2A_3 \rho \sqrt{\frac{\log p}{n}} \Vert \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\Vert \\\le & {} 2A_3 \rho \sqrt{\frac{\log p}{n}} \lambda _{m}^{-1/2} {\bar{f}}^{-1/2} E\left[ \frac{{\bar{f}}}{n} \sum _{i=1}^n \left| {{\varvec{X}}}_i^\top \left( \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\right) \right| ^2 \right] ^{1/2} \\\le & {} 4A_3^2 \rho ^2 \lambda _{m}^{-1} {\bar{f}}^{-1} \frac{\log p}{n} + E\left[ \frac{{\bar{f}}}{n} \sum _{i=1}^n \left| {{\varvec{X}}}_i^\top \left( \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\right) \right| ^2 \right] , \end{aligned}$$

where the last inequality follows from \(xy \le x^2 + y^2\) for any non-negative numbers x and y.

Furthermore, by (8) and (9) in Lemma 4,

$$\begin{aligned} I_{232}\le & {} E\left[ \frac{1}{n} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i}\int _0^{{{\varvec{X}}}_i^\top ( \beta ^*_{S \bigcup \{j\}} - \beta ^*_S)} F\left( {{\varvec{X}}}_{i, S \bigcup \{j\}} \beta ^*_{S \bigcup \{j\}} +z\right) - F\left( {{\varvec{X}}}_{i, S \bigcup \{j\}} \beta ^*_{S \bigcup \{j\}} \right) dz\right] \\&+ O\left( n^{-1/2} (\log n)^{1/2}\right) E\left[ \frac{\bar{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \left| {{\varvec{X}}}_{i}^\top \left( \beta ^*_{S \bigcup \{j\}} - \beta ^*_S\right) \right| ^2 \right] ^{1/2} \\\le & {} E\left[ \frac{\bar{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{i}^\top (\beta ^*_{S \bigcup \{j\}} - \beta ^*_S)|^2 \right] + O\left( n^{-1/2} (\log n)^{1/2}\right) E\left[ \frac{\bar{f}}{2n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} \left| {{\varvec{X}}}_{i}^\top \left( \beta ^*_{S \bigcup \{j\}} - \beta ^*_S\right) \right| ^2 \right] ^{1/2} \\\le & {} E\left[ \frac{\bar{f}}{n}\sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} |{{\varvec{X}}}_{i}^\top (\beta ^*_{S \bigcup \{j\}} - \beta ^*_S)|^2 \right] + O\left( n^{-1} \log n \right) \\\le & {} E\left[ \frac{\bar{f}}{n}\sum _{i=1}^n \frac{2\varDelta _i}{G^o_i} |{{\varvec{X}}}_{i}^\top (\beta ^*_{S \bigcup \{j\}} - \beta ^*_S)|^2 \right] + O\left( n^{-1} \log n \right) \\= & {} E\left[ \frac{2\bar{f}}{n}\sum _{i=1}^n |{{\varvec{X}}}_{i}^\top (\beta ^*_{S \bigcup \{j\}} - \beta ^*_S)|^2 \right] + O\left( n^{-1} \log n \right) \\\le & {} E\left[ \frac{2\bar{f}}{n}\sum _{i=1}^n |{{\varvec{X}}}_{i}^\top \delta _S|^2 \right] + E\left[ \frac{2\bar{f}}{n}\sum _{i=1}^n x_{ij}^2 [\beta ^*_{S \bigcup \{j\}}]^2_j \right] + O\left( n^{-1} \log n \right) , \end{aligned}$$

where the second inequality follows from Assumption 1, the third inequality follows from \(xy \le x^2 + y^2\) for any non-negative numbers x and y, the fourth inequality follows from Assumption 5 and Lemma 1, and the last equality follows from the fact that \(E\left[ \varDelta _i \mid {{\varvec{X}}}_i \right] = G^o_i\).

Hence, we have

$$\begin{aligned} I_{23}\le & {} I_{231} + I_{232} \nonumber \\\le & {} 4A_3^2 \rho ^2 \lambda _{m}^{-1} {\bar{f}}^{-1} \frac{\log p}{n}+ E\left[ \frac{3{\bar{f}}}{n} \sum _{i=1}^n \left| {{\varvec{X}}}_i^\top \left( \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\right) \right| ^2 \right] + O\left( \frac{\log n}{n} \right) . \end{aligned}$$
(15)

Let \(\delta _S:= [\beta ^*_{S \bigcup \{j\}}]_S - \beta ^*_S\). By the definition of \(\beta ^*_S\) and \(\beta ^*_{S \bigcup \{j\}}\), we have

$$\begin{aligned} 0\ge & {} E[\ell ^o_{S \bigcup \{j\}}(\beta ^*_{S \bigcup \{j\}})] - E[\ell ^o_S(\beta ^*_S)] \nonumber \\= & {} E\left[ \frac{1}{n} \sum _{i=1}^n -\frac{\varDelta _i}{G^o_i} ({{\varvec{X}}}_{iS}^\top \delta _S + x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j) \left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S< 0\}\right) \right. \nonumber \\&\quad \quad + \left. \frac{\varDelta _i}{G^o_i}\int _0^{{{\varvec{X}}}_{iS}^\top \delta _S + x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j} F({{\varvec{X}}}_{iS}^\top \beta ^*_S + z) - F({{\varvec{X}}}_{iS}^\top \beta ^*_S) dz \right] \nonumber \\= & {} E\left[ \frac{1}{n} \sum _{i=1}^n -\frac{\varDelta _i}{G^o_i} (x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j) \left( \tau - 1\{Y_i-{{\varvec{X}}}_{iS}^\top \beta ^*_S < 0\}\right) \right. \nonumber \\&\quad \quad \left. + \frac{\varDelta _i}{G^o_i} \int _0^{{{\varvec{X}}}_{iS}^\top \delta _S + x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j} F({{\varvec{X}}}_{iS}^\top \beta ^*_S + z) - F({{\varvec{X}}}_{iS}^\top \beta ^*_S) dz \right] \nonumber \\\ge & {} -o(n^{-\alpha })|[\beta ^*_{S \bigcup \{j\}}]_j| + E\left[ \frac{\underline{f}}{2n}\sum _{i=1}^n \left| {{\varvec{X}}}_{i}^\top \delta _S + x_{ij} [\beta ^*_{S \bigcup \{j\}}]_j \right| ^2 \right] \nonumber \\= & {} -o(n^{-\alpha }) |[\beta ^*_{S \bigcup \{j\}}]_j| + E\left[ \frac{\underline{f}}{2n}\sum _{i=1}^n \left| {{\varvec{X}}}_{i}^\top \left( \beta ^*_S - \beta ^*_{S\bigcup \{j\}}\right) \right| ^2 \right] , \end{aligned}$$
(16)

where the first equality holds due to the identity of Knight (1998), and the second inequality follows from the condition of Theorem 2, Assumption 1, and the fact that \(E[\varDelta _i \mid {{\varvec{X}}}_i]=G^o_i\). Combining (15) and (16), we obtain

$$\begin{aligned} I_{23}\le & {} 4A_3^2 \rho ^2 \lambda _{m}^{-1} {\bar{f}}^{-1} \frac{\log p}{n}+ o(n^{-\alpha }) |[\beta ^*_{S \bigcup \{j\}}]_j| + O\left( \frac{\log n}{n} \right) \nonumber \\\le & {} 4A_3^2 \rho ^2 \lambda _{m}^{-1} {\bar{f}}^{-1} \frac{\log p}{n}+ o(n^{-2\alpha }) + O\left( \frac{\log n}{n} \right) . \end{aligned}$$
(17)

Combining (13), (14), and (17), we obtain

$$\begin{aligned} I_2 = I_{21}+I_{22}+I_{23} \le C_4 \rho ^2 \log p/n + C_3 \log n/n+ o(n^{-2\alpha }) + O\left( \frac{\log n}{n} \right) \end{aligned}$$
(18)

for some constant \(C_4>0\). Hence, we have by (12) and (18), with probability at least \(1-36 \exp (-6\rho \log p)\),

$$\begin{aligned}&\max _{|S| < \rho , j \in (M_\tau \bigcup S)^c} \ -\ell _{S \bigcup \{j\}}(\hat{\beta }_{S \bigcup \{j\}}) + \ell _S(\hat{\beta }_S) \\\le & {} A_5 \rho ^2 \log p/n +C_4 \rho ^2 \log p/n + C_3 \log n/n+ o(n^{-2\alpha }) + O\left( \frac{\log n}{n} \right) \\\le & {} 0.5 C_1 n^{-2\alpha }, \end{aligned}$$

where the last inequality follows from the growth rate condition of \(\rho\) in Assumption 4.

Claim 3: It holds that \(M_\tau \subset F_{C_2 |M_\tau |}\) with probability at least \(1-36 \exp (-3\rho \log p)\).

Proof of Claim 3: By Claim 2 and the fact that \(\log n \log p / n = o(n^{-2\alpha })\), if \(M_\tau \not \subseteq S\), CQ-FR would select a noise variable with probability less than \(36 \exp (-6\rho \log p)\). For \(k > |M_\tau |\), \(M_\tau \not \subset F_k\) implies that at least \(k-|M_\tau |\) noise variables are selected within the k steps. Then for \(k=C_2|M_\tau |\) with \(C_2 > 1\),

$$\begin{aligned} P(M_\tau \not \subset F_k)\le & {} \sum _{j=k-|M_\tau |}^k {k \atopwithdelims ()j} \{36 \exp (-6\rho \log p)\}^j \\\le & {} |M_\tau | k^{|M_\tau |} \{36 \exp (-6\rho \log p)\}^{k-|M_\tau |} \\\le & {} 36 \exp (-3\rho \log p). \end{aligned}$$

Therefore, \(M_\tau \subset F_{C_2 |M_\tau |}\) with probability at least \(1-36 \exp (-3\rho \log p)\). This completes the proof. \(\square\)

1.4 Proof of Theorem 3

Proof

By Theorem 2, CQ-FR does not stop at the kth step when \(M_\tau \not \subseteq F_k\), and CQ-FR satisfies \(M_\tau \subseteq F_k\) for some \(k< C_2 |M_{\tau }|\) with probability going to one. Hence, with probability going to one, CQ-FR stops at the kth step if \(\text{ EBIC}^c(F_{k+1}) > \text{ EBIC}^c(F_k)\).

Suppose that \(M_\tau \nsubseteq F_{k-1}\) and \(M_\tau \subseteq F_k\). We can easily see that \(\text{ EBIC}^c(F_{k+1}) > \text{ EBIC}^c(F_k)\) if and only if \(\log \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \log \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \le \frac{\log n}{2n} \log p\).
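
(The precise definition of \(\text{ EBIC}^c\) is given in the main text; purely as an illustration of why the threshold takes this form, note that for a criterion of the assumed form \(\text{ EBIC}^c(S) = 2\log \ell _S(\hat{\beta }_S) + |S| \frac{\log n \, \log p}{n}\), the difference satisfies \(\text{ EBIC}^c(F_{k+1}) - \text{ EBIC}^c(F_k) = 2\{\log \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) - \log \ell _{F_{k}}(\hat{\beta }_{F_{k}})\} + \frac{\log n \, \log p}{n}\), since \(|F_{k+1}| = |F_k| + 1\); this difference is positive exactly when \(\log \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \log \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) < \frac{\log n}{2n} \log p\).)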

Recall that \(\beta (\tau )\) is the underlying quantile coefficient vector. Let

$$\begin{aligned} M_n(\tau ) := n^{-1/2} \sum _{i=1}^n \frac{\varDelta _i}{{\hat{G}}_i} {{\varvec{X}}}_i \psi _{\tau }(Y_i-{{\varvec{X}}}_i^\top \beta (\tau )) \end{aligned}$$

and

$$\begin{aligned} H_{S,\tau } := \begin{bmatrix} E\left[ \frac{\varDelta _i}{{\hat{G}}_i}{{\varvec{X}}}_{iS} {{\varvec{X}}}^\top _{iS} f\left( {{\varvec{X}}}_{iS}^\top \beta (\tau ) \mid {{\varvec{X}}}_{iS}\right) \right] , &{} 0\\ 0, &{} 0 \end{bmatrix}, \quad H_{S,\tau }^{-1} := \begin{bmatrix} E\left[ \frac{\varDelta _i}{{\hat{G}}_i}{{\varvec{X}}}_{iS} {{\varvec{X}}}^\top _{iS} f\left( {{\varvec{X}}}_{iS}^\top \beta (\tau ) \mid {{\varvec{X}}}_{iS}\right) \right] ^{-1}, &{} 0\\ 0, &{} 0 \end{bmatrix}. \end{aligned}$$

Using an argument similar to that in the proof of Lemma 7.10 of Zheng et al. (2015), together with the fact that \(M_\tau \subseteq F_k \subset F_{k+1}\), we can show that, with probability at least \(1-16\exp (-\log p)-12/p\),

$$\begin{aligned}&\ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) - \ell _{F_{k}}(\hat{\beta }_{F_{k}}) \\= & {} \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) - \ell _{F_{k+1}}(\beta (\tau )) - \left[ \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \ell _{F_{k+1}}(\beta (\tau )) \right] \\= & {} -\frac{M_n^\top (\tau ) [H_{F_{k+1},\tau }^{-1} - H_{F_k,\tau }^{-1}] M_n(\tau )}{4n} + o_p(\sqrt{k \log p/n^3}) \\\ge & {} -600 ( {f} \lambda _{m})^{-1} \log p/n + o_p(\sqrt{k \log p/n^3}), \end{aligned}$$

which yields with probability tending to 1,

$$\begin{aligned} \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) = o(\log p \log n/n). \end{aligned}$$

Using a similar argument to that in Lee et al. (2014), we have \(\ell _{F_{k}}(\hat{\beta }_{F_{k}}), \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \ge c_5\) for some constant \(c_5>0\) with probability tending to one. Hence,

$$\begin{aligned} \log \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \log \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \le \frac{\log n}{2n} \log p. \end{aligned}$$
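
Indeed, since \(\ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \le \ell _{F_{k}}(\hat{\beta }_{F_{k}})\) (the minimized loss cannot increase when \(F_k \subset F_{k+1}\)) and \(\log x - \log y \le (x-y)/c_5\) whenever \(x \ge y \ge c_5\), we obtain

$$\begin{aligned} \log \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \log \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \le c_5^{-1} \left\{ \ell _{F_{k}}(\hat{\beta }_{F_{k}}) - \ell _{F_{k+1}}(\hat{\beta }_{F_{k+1}}) \right\} = o(\log p \log n/n), \end{aligned}$$

which is eventually smaller than \(\frac{\log n}{2n} \log p\), as claimed.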

This completes the proof. \(\square\)

1.5 Proofs of Theorems 4 and 5

Proof

Suppose that \(\ell ^* \in M_\tau ^c\). Let \(k:= \arg \max _{j \in M_\tau \setminus F} |[\beta ^*_{F \bigcup \{j\}}]_j|\). By Lemma 6, \(|[\beta ^*_{F \bigcup \{k\}}]_k| \ge C \bar{f}^{-1} n^{-\alpha }\). By the definition of \(\ell ^*\), we also have

$$\begin{aligned} \ell _{F \bigcup \{\ell ^*\}}(\hat{\beta }_{F \bigcup \{\ell ^*\}}) \le \ell _{F \bigcup \{k\}}(\hat{\beta }_{F \bigcup \{k\}}), \end{aligned}$$

which implies

$$\begin{aligned} \ell _{F \bigcup \{\ell ^*\}}(\beta ^*_{F \bigcup \{\ell ^*\}})\le & {} \ell _{F \bigcup \{k\}}(\beta ^*_{F \bigcup \{k\}}) \\&+ \lambda _{M}^{1/2} \Vert \hat{\beta }_{F \bigcup \{\ell ^*\}} - \beta ^*_{F \bigcup \{\ell ^*\}}\Vert + \lambda _{M}^{1/2} \Vert \hat{\beta }_{F \bigcup \{k\}} - \beta ^*_{F \bigcup \{k\}}\Vert \\\le & {} \ell _{F \bigcup \{k\}}(\beta ^*_{F \bigcup \{k\}}) + 2A_4\lambda _{M}^{1/2} \sqrt{\rho ^2 \log p / n}, \end{aligned}$$

where the first inequality follows from the Cauchy-Schwarz inequality, Assumption 3, and the fact that \(\rho _{\tau }(\cdot )\) satisfies \(|\rho _{\tau }(x) -\rho _{\tau }(y)| \le |x-y|\), and the last inequality follows from Lemma 5.
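
For completeness, the Lipschitz property of the check loss invoked here is elementary: writing \(\rho _{\tau }(u) = u\{\tau - 1(u<0)\}\), the function \(\rho _{\tau }\) is piecewise linear with slopes \(\tau\) and \(\tau -1\), both of absolute value at most one, so that

$$\begin{aligned} |\rho _{\tau }(x) -\rho _{\tau }(y)| \le \max (\tau , 1-\tau ) |x-y| \le |x-y|. \end{aligned}$$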

By Lemmas 2 and 3, this also implies

$$\begin{aligned}&E\left[ \ell ^o_{F \bigcup \{\ell ^*\}}(\beta ^*_{F \bigcup \{\ell ^*\}})\right] \lesssim E\left[ \ell ^o_{F \bigcup \{k\}}(\beta ^*_{F \bigcup \{k\}})\right] \nonumber \\&\quad + \sqrt{\rho ^2 \log p / n} + \sqrt{\frac{\log n}{n}} \Vert \beta ^*_{F \bigcup \{\ell ^*\}} - \beta ^*_{F \bigcup \{k\}}\Vert . \end{aligned}$$
(19)

Using similar arguments to those in the proof of Lemma 4, we obtain

$$\begin{aligned}&E\left[ \ell ^o_{F \bigcup \{\ell ^*\}}(\beta ^*_{F \bigcup \{\ell ^*\}})\right] - E\left[ \ell ^o_{F \bigcup \{k\}}(\beta ^*_{F \bigcup \{k\}})\right] \nonumber \\= & {} E\left[ \rho _{\tau }(Y_i-{{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{\ell ^*\}}) - \rho _{\tau }(Y_i-{{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}}) \right] \nonumber \\= & {} E\left[ - {{\varvec{X}}}_{i}^\top \left( \beta ^*_{F \bigcup \{\ell ^*\}}-\beta ^*_{F \bigcup \{k\}}\right) \left( \tau - 1\{Y_i-{{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}}< 0\}\right) \right. \nonumber \\&+ \left. \int _0^{{{\varvec{X}}}_{i}^\top \left( \beta ^*_{F \bigcup \{\ell ^*\}}-\beta ^*_{F \bigcup \{k\}}\right) } F({{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}} + z \mid {{\varvec{X}}}_{i}, \varDelta _i) - F({{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}} \mid {{\varvec{X}}}_{i}, \varDelta _i) dz \right] \nonumber \\= & {} E\left[ - X_{i,\ell ^*} [\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*} \left( \tau - 1\{Y_i-{{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}} < 0\}\right) \right. \nonumber \\&+ \left. \int _0^{{{\varvec{X}}}_{i}^\top \left( \beta ^*_{F \bigcup \{\ell ^*\}}-\beta ^*_{F \bigcup \{k\}}\right) } F({{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}} + z \mid {{\varvec{X}}}_{i}, \varDelta _i) - F({{\varvec{X}}}_{i}^\top \beta ^*_{F \bigcup \{k\}} \mid {{\varvec{X}}}_{i}, \varDelta _i) dz \right] \nonumber \\\ge & {} - o(n^{-\alpha }) [\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*} +\frac{\underline{f}}{4} E\left[ \left\{ {{\varvec{X}}}_{i}^\top \left( \beta ^*_{F \bigcup \{\ell ^*\}}-\beta ^*_{F \bigcup \{k\}}\right) \right\} ^2\right] , \end{aligned}$$
(20)

where the third equality follows from the definition of \(\beta ^*_{F \bigcup \{k\}}\), and the inequality follows from the same arguments as in the proof of Lemma 4, using the condition in Theorem 4. Let \(\delta _{F \bigcup \{\ell ^*\}} := \beta ^*_{F \bigcup \{\ell ^*\}}-[\beta ^*_{F \bigcup \{k\}}]_{F \bigcup \{\ell ^*\}}\). Then, we have

$$\begin{aligned}&E\left[ \left\{ {{\varvec{X}}}_{i}^\top \left( \beta ^*_{F \bigcup \{\ell ^*\}}-\beta ^*_{F \bigcup \{k\}}\right) \right\} ^2\right] \nonumber \\&= E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}} - X_{i,k} [\beta ^*_{F \bigcup \{k\}}]_k \right\} ^2\right] \nonumber \\&= E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}} \right\} ^2\right] + [\beta ^*_{F \bigcup \{k\}}]_k^2 - 2 E\left[ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}} X_{i,k} [\beta ^*_{F \bigcup \{k\}}]_k \right] \nonumber \\&\ge E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}} \right\} ^2\right] + [\beta ^*_{F \bigcup \{k\}}]_k^2\nonumber \\&- 2 \sqrt{E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}}\right\} ^2 X^2_{i,k}\right] [\beta ^*_{F \bigcup \{k\}}]_k^2 }\nonumber \\&\ge E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}} \right\} ^2\right] + [\beta ^*_{F \bigcup \{k\}}]_k^2\nonumber \\&- 2 \sqrt{ E\left[ \left\{ {{\varvec{X}}}_{i, F \bigcup \{\ell ^*\}}^\top \delta _{F \bigcup \{\ell ^*\}}\right\} ^2\right] (1-\epsilon ) [\beta ^*_{F \bigcup \{k\}}]_k^2 } \nonumber \\&\ge \epsilon [\beta ^*_{F \bigcup \{k\}}]_k^2 \end{aligned}$$
(21)

where the first inequality follows from \(E[X]^2 \le E[X^2]\) for any random variable \(X\), the second inequality follows from Assumption 8, and the last inequality follows from the fact that \(x^2 + y^2 \ge 2xy\) for any numbers \(x, y\). Combining (19)-(21), we have

$$\begin{aligned} \sqrt{\rho ^2 \log p / n} + \sqrt{\frac{\log n}{n}}\ge & {} - o(n^{-\alpha }) [\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*} + \epsilon [\beta ^*_{F \bigcup \{k\}}]_k^2 \\\gtrsim & {} - o(n^{-2\alpha }) - [\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}^2 + \epsilon [\beta ^*_{F \bigcup \{k\}}]_k^2. \end{aligned}$$
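
The second line of the preceding display uses only the elementary bound \(2ab \le a^2 + b^2\) with \(a = o(n^{-\alpha })\) and \(b = |[\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}|\), namely

$$\begin{aligned} o(n^{-\alpha }) \, |[\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}| \le \tfrac{1}{2} \left\{ o(n^{-2\alpha }) + [\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}^2 \right\} , \end{aligned}$$

so the cross term is absorbed, with the factor \(1/2\) hidden in the \(\gtrsim\) notation.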

Because \([\beta ^*_{F \bigcup \{k\}}]_k^2 \gtrsim n^{-2\alpha } \gtrsim n^{-2\alpha } + \sqrt{\rho ^2 \log p / n} + \sqrt{\frac{\log n}{n}}\) due to the rate condition of \(\alpha\) in Assumption 4, this implies

$$\begin{aligned} |[\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}| \ge \frac{1}{C_B} |[\beta ^*_{F \bigcup \{k\}}]_k|. \end{aligned}$$

Hence, there exists some constant \(C_L>\frac{1}{C_B}\) satisfying

$$\begin{aligned} \frac{|[\beta ^*_{F \bigcup \{\ell ^*\}}]_{\ell ^*}|}{\max _{k \in M_\tau \setminus F} |[\beta ^*_{F \bigcup \{k\}}]_k|} > C_L \end{aligned}$$

uniformly in \(F\) with \(M_\tau \not \subseteq F\). Thus, we have \(\ell ^* \in M_\tau \setminus F\) for any \(F\) with \(M_\tau \not \subseteq F\). This completes the proof. \(\square\)
