Abstract
In this paper, we propose a non-negative feature selection/feature grouping (nnFSG) method for general sign-constrained high-dimensional regression problems that allows regression coefficients to be disjointly homogeneous, with sparsity as a special case. To solve the resulting non-convex optimization problem, we provide an algorithm that incorporates difference-of-convex programming, the augmented Lagrangian method and coordinate descent. Furthermore, we show that the nnFSG method consistently recovers the oracle estimate and that its mean-squared error is bounded. Finally, we examine the performance of our method through finite-sample simulations and an application to a real protein mass spectrum dataset.
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig3_HTML.png)
References
Arnold, T. B., Tibshirani, R. J. (2016). Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics, 25(1), 1–27.
Esser, E., Lou, Y. F., Xin, J. (2013). A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM Journal on Imaging Sciences, 6(4), 2010–2046.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Frank, L. E., Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135.
Friedman, J., Hastie, T., Simon, N., Tibshirani, R. (2016). Lasso and elastic-net regularized generalized linear models. R package version 2.0-5.
Fu, A., Narasimhan, B., Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv:1711.07582.
Goeman, J. J. (2010). \(L_1\) penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1), 70–84.
Hu, Z., Follmann, D. A., Miura, K. (2015). Vaccine design via nonnegative lasso-based variable selection. Statistics in Medicine, 34(10), 1791–1798.
Huang, J., Ma, S., Xie, H., Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.
Itoh, Y., Duarte, M. F., Parente, M. (2016). Perfect recovery conditions for non-negative sparse modeling. IEEE Transactions on Signal Processing, 65(1), 69–80.
Jang, W., Lim, J., Lazar, N., Loh, J. M., McDowell, J., Yu, D. (2011). Regression shrinkage and equality selection for highly correlated predictors with HORSES. Biometrics, 64, 1–23.
Koike, Y., Tanoue, Y. (2019). Oracle inequalities for sign constrained generalized linear models. Econometrics and Statistics, 11, 145–157.
Luenberger, D. G., Ye, Y. (2015). Linear and nonlinear programming, Vol. 228. New York: Springer.
Mandal, B. N., Ma, J. (2016). \(l_1\) regularized multiplicative iterative path algorithm for non-negative generalized linear models. Computational Statistics and Data Analysis, 101, 289–299.
Meinshausen, N. (2013). Sign-constrained least squares estimation for high-dimensional regression. Electronic Journal of Statistics, 7, 1607–1631.
Mullen, K. M., van Stokkum, I. H. (2012). The Lawson–Hanson algorithm for nonnegative least squares (NNLS). CRAN: R package. https://cran.r-project.org/web/packages/nnls/nnls.pdf.
Rekabdarkolaee, H. M., Boone, E., Wang, Q. (2017). Robust estimation and variable selection in sufficient dimension reduction. Computational Statistics and Data Analysis, 108, 146–157.
Renard, B. Y., Kirchner, M., Steen, H., Steen, J. A., Hamprecht, F. A. (2008). NITPICK: Peak identification for mass spectrometry data. BMC Bioinformatics, 9(1), 355.
Shadmi, Y., Jung, P., Caire, G. (2019). Sparse non-negative recovery from biased sub-Gaussian measurements using NNLS. arXiv:1901.05727.
She, Y. (2010). Sparse regression with exact clustering. Electronic Journal of Statistics, 4, 1055–1096.
Shen, X., Huang, H. C., Pan, W. (2012a). Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99(4), 899–914.
Shen, X., Pan, W., Zhu, Y. (2012b). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497), 223–232.
Shen, X., Pan, W., Zhu, Y., Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807–832.
Slawski, M., Hein, M. (2010). Sparse recovery for protein mass spectrometry data. In NIPS workshop on practical applications of sparse modelling.
Slawski, M., Hein, M. (2013). Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electronic Journal of Statistics, 7, 3004–3056.
Slawski, M., Hussong, R., Tholey, A., Jakoby, T., Gregorius, B., Hildebrandt, A., Hein, M. (2012). Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching. BMC Bioinformatics, 13(1), 291.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tibshirani, R., Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1), 18–29.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J., Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.
Wen, Y. W., Wang, M., Cao, Z., Cheng, X., Ching, W. K., Vassiliadis, V. S. (2015). Sparse solution of nonnegative least squares problems with applications in the construction of probabilistic Boolean networks. Numerical Linear Algebra with Applications, 22(5), 883–899.
Wu, L., Yang, Y. (2014). Nonnegative elastic net and application in index tracking. Applied Mathematics and Computation, 227, 541–552.
Wu, L., Yang, Y., Liu, H. (2014). Nonnegative-lasso and application in index tracking. Computational Statistics and Data Analysis, 70, 116–126.
Xiang, S., Shen, X., Ye, J. (2015). Efficient nonconvex sparse group feature selection via continuous and discrete optimization. Artificial Intelligence, 224, 28–50.
Yang, S., Yuan, L., Lai, Y. C., Shen, X., Wonka, P., Ye, J. (2012). Feature grouping and selection over an undirected graph. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 922–930). New York: ACM.
Yang, Y., Wu, L. (2016). Nonnegative adaptive lasso for ultra-high dimensional regression models and a two-stage method applied in financial modeling. Journal of Statistical Planning and Inference, 174, 52–67.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
Zhu, Y., Shen, X., Pan, W. (2013). Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108(502), 713–725.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2017-05720). Qin also gratefully acknowledges financial support from the China Scholarship Council (Grant No. 201506180073).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Lemma 1
Since \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}=({\hat{\alpha }}_1^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{K^0}^{{\mathrm{ols}}})^{\top }= (Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{y}}}} ={{\varvec{{\alpha }}}}^{0}+(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{\epsilon }}}}\), \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\sim N\left( {{\varvec{{\alpha }}}}^{0}, \sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\right) ,\) namely,
$$\begin{aligned} {\hat{\alpha }}_{k}^{{\mathrm{ols}}}\sim N\left( {\alpha }_{k}^{0}, \sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\right) , \quad k=1,\ldots ,K^0, \end{aligned}$$
where \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\) denotes the k-th diagonal element of matrix \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\).
By assumption (A2), the variance of \({\hat{\alpha }}_{k}^{{\mathrm{ols}}}\) is bounded from above by \(\sigma ^2/(nc_0)\) for all \(k=1,\ldots ,K^0\). In view of assumption (A3), \(\min _{1\le k\le K^0}{\alpha }_{k}^{0}=\min _{j\in {{\mathcal {G}}_0^0}^c}{\beta }_{j}^{0}> c_n\), where \(c_n=[2\sigma ^2\log \{2nK^0/(2\pi )^{1/2}\}/(nc_0)]^{1/2}\). Similar to Meinshausen (2013), by Bonferroni’s inequality, we thus have
$$\begin{aligned} \min _{1\le k\le K^0}{\hat{\alpha }}_{k}^{{\mathrm{ols}}}> \min _{1\le k\le K^0}{\alpha }_{k}^{0}-c_n> 0 \end{aligned}$$
with probability at least
$$\begin{aligned} 1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} . \end{aligned}$$
It implies that with probability at least \(1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} \), \(\min _{1\le k\le K^0}{\hat{\alpha }}_{k}^{{\mathrm{ols}}}> 0\), and thus \(\hat{{{\varvec{{\alpha }}}}}^{ora}=\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\), \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). That is,
Since \(1-\varPhi (x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) for any \(x>0\), it follows that
$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \le \frac{1}{n[2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2}}=O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
□
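The Gaussian tail bound \(1-\varPhi (x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) used in the proof above is of Mill's-ratio type. As a numerical sanity check (not part of the paper; a standard-library Python sketch using \(1-\varPhi (x)=\tfrac{1}{2}{\mathrm{erfc}}(x/\sqrt{2})\)):

```python
import math

def gauss_upper_tail(x):
    # 1 - Phi(x) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mills_bound(x):
    # (2*pi)^{-1/2} * x^{-1} * exp(-x^2 / 2), valid for x > 0
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

# the bound should hold for every x > 0
for x in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    assert gauss_upper_tail(x) <= mills_bound(x)
```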
Proof of Theorem 1
Let \({\mathcal {G}} =({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) be a grouping of the constrained problem in Sect. 2, satisfying that \(0\le {\hat{\beta }}_j^{{\mathrm{cons}}}\le \tau \) if \(j\in {\mathcal {G}}_0\), \(|{\hat{\beta }}_j^{{\mathrm{cons}}}-{\hat{\beta }}_{j'}^{{\mathrm{cons}}}|> \tau \) if \(j\in {{\mathcal {G}}}_k\), \(j'\in {{\mathcal {G}}}_{k'}\), \(j = 1, \ldots , p; 1\le k\ne k'\le K\).
If \({{\mathcal {G}}}={\mathcal {G}}^0\), then \(|{{\mathcal {G}}}_0^c|=s_1^0\). By the first constraint \(\sum _{j=1}^p \min \left\{ \frac{|\beta _j|}{\tau }, 1\right\} \le s_1\), \(\sum _{j\in {{\mathcal {G}}}_0}{\hat{\beta }}_j^{{\mathrm{cons}}}/\tau +s_1^0\le s_1^0\), which implies that \({\hat{\beta }}_j^{{\mathrm{cons}}}=0\), \(j\in {{\mathcal {G}}}_0\). By the second constraint \(\sum _{(j, j') \in \varepsilon } \min \left\{ \frac{|\beta _j - \beta _{j'}|}{\tau }, 1\right\} \le s_2\), similarly, we obtain that \({\hat{\beta }}_j^{{\mathrm{cons}}}={\hat{\beta }}_{j'}^{{\mathrm{cons}}}\), \(j, j'\in {{\mathcal {G}}}_k={\mathcal {G}}_k^0\), \((j, j')\in \varepsilon \), \(k=1,\ldots ,K\). Thus, \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\) if \({{\mathcal {G}}}={\mathcal {G}}^0\), which, together with the fact that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) + {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}={\mathcal {G}}^0)\), yields that
Denote \({{\bar{S}}}({{\varvec{{\beta }}}}) = 2^{-1}\Vert Y-X {{\varvec{{\beta }}}}\Vert ^2 \). Noting that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) ={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)+{{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)\), (20) thus becomes
The second term in (21) has already been bounded in Lemma 2.1. Next, we work on the first term in (21), denoted by \(\varGamma \).
Consider the case where \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) and \({{\mathcal {G}}}\ne {\mathcal {G}}^0\). Define \(\bar{{{\varvec{{\beta }}}}}= ({\bar{\beta }}_1,\ldots ,{\bar{\beta }}_{p})^{\top }\), satisfying
It follows that \(|{\bar{\beta }}_j-{\hat{\beta }}_{j}^{{\mathrm{cons}}}|\le \tau \), \(\Vert \bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2\le \tau ^2p\), and thus
Note that
For any vector \({{\varvec{{u}}}}, {{\varvec{{v}}}}\in {\mathbb {R}}^p\) and \(a>0\), it holds that \(\Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2\ge a^{-1}(a-1)\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2\) (Shen et al. 2012a). We thus have
Substituting (22)–(23) into (24), together with \({{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =2^{-1}\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\Vert ^2\le 2^{-1}{{\varvec{{\epsilon }}}}^{\top }{{\varvec{{\epsilon }}}},\) we obtain that, for any \(a>1\),
where \(L_1=\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}^{\top }(I-P_{Z_{{\mathcal {G}}_0^c}})\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}\), and \(L_1\sigma ^{-2}\) follows a noncentral chi-squared distribution \(\chi _{k,\varLambda }^2\) with degrees of freedom \(k=\max \{n-K({\mathcal {G}}_0^c),0\}\) and noncentrality parameter \(\varLambda =(a-1)^2\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2/\sigma ^2\); \(L_2=a{{\varvec{{\epsilon }}}}^{\top }P_{Z_{{\mathcal {G}}_0^c}}{{\varvec{{\epsilon }}}}\) is independent of \(L_1\), and \(a^{-1}\sigma ^{-2}L_2\) follows a chi-squared distribution \(\chi _{\kappa }^2\) with degrees of freedom \(\kappa =K({\mathcal {G}}_0^c)\); \(L_3=a(a-1)\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2-a(a-1) \lambda _{{\mathrm {max}}}(X^{\top }X)\tau ^2p\). Note that, by the definition of \(C_{{\mathrm {min}}}\), \(\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge nC_{{\mathrm {min}}}\).
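The norm inequality \(\Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2\ge a^{-1}(a-1)\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2\) of Shen et al. (2012a) used above follows from Young's inequality \(2{{\varvec{{u}}}}^{\top }{{\varvec{{v}}}}\ge -a^{-1}\Vert {{\varvec{{u}}}}\Vert ^2-a\Vert {{\varvec{{v}}}}\Vert ^2\). A quick numerical check (not from the paper; a standard-library Python sketch):

```python
import random

def norm_sq(v):
    return sum(x * x for x in v)

def inequality_holds(u, v, a, tol=1e-9):
    # ||u + v||^2 >= ((a - 1) / a) * ||u||^2 - (a - 1) * ||v||^2, for a > 0
    lhs = norm_sq([ui + vi for ui, vi in zip(u, v)])
    rhs = ((a - 1.0) / a) * norm_sq(u) - (a - 1.0) * norm_sq(v)
    return lhs >= rhs - tol

random.seed(0)
for _ in range(1000):
    u = [random.gauss(0.0, 1.0) for _ in range(5)]
    v = [random.gauss(0.0, 1.0) for _ in range(5)]
    a = random.uniform(0.01, 10.0)
    assert inequality_holds(u, v, a)
```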
For \(\varGamma \), by the Markov inequality and the moment-generating function of the chi-squared distribution, it holds that, for any \(0<t<1/(2a)\) with \(1-2at<1-2t<1 ~ (a>1)\), by Shen et al. (2012a),
where \(l_1^*=\exp \left\{ \frac{(a-1)\log p}{4n}-n\frac{t(a-1)iC_{{\mathrm {min}}}}{\sigma ^2}\frac{1-2at}{1-2t}\right\} \), \(l_2^*=(1-2t)/(1-2at)\), \(K_i^*=\max _{\{{\mathcal {G}}\in {\mathcal {T}}, |{\mathcal {G}}_0\backslash {\mathcal {G}}_0^0|=i\}}K({\mathcal {G}}_0^c)\). Note that the last inequality holds true because
for any \(\tau \le \sigma [\log p/\{2np\lambda _{{\mathrm {max}}}(X^{\top }X)\}]^{1/2}\). We choose \(a=4+n/4\), \(t=\{4(a-1)\}^{-1}\), and define \(b={(1-2t)}/{(1-2at)}\). Then \(b={(2a-3)}/{(a-2)}<5/2\), and \((a-1)/(4n)\le 1\). Since \(-\log (1-x)\le x(1-x)^{-1}\) for \(0<x<1\), and \(0<2t=2^{-1}(a-1)^{-1}<1\), it follows that
which jointly with the facts
yields that
Since \((1-z)^{-1}=\sum _{i=0}^{\infty }z^i\) for \(|z|<1\), we thus obtain that, for \(x<0\),
We take \(x =- {10^{-1}\sigma ^{-2}}n\{C_{{\mathrm {min}}}-{10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\}\) if \(C_{{\mathrm {min}}}>{10\sigma ^2}{n}^{-1}(3\log p +{\bar{T}}+{{\bar{K}}}/{2})\). Together with \(\varGamma \le 1\), (25) becomes
Similarly, we can show that (26) still holds for \(C_{{\mathrm {min}}}\le {10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\). By Lemma 2.1 and (26), (21) becomes
1.
If \(C_{{\mathrm {min}}}\ge {10\sigma ^2}{n}^{-1}\left( \log n+2^{-1}\log \log n +3\log p+{\bar{T}}+{{\bar{K}}}/{2}\right) \), by (27),
$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) =O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
2.
We denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\), and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2\}\). It is easy to see that
$$\begin{aligned} \frac{1}{n}E\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2=T_1+T_2. \end{aligned}$$
Now, we work on \(T_1\). By the definition, \(T_1 = \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x){\mathrm{d}}x + 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\). For the first term of \(T_1\),
$$\begin{aligned}&\int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( 4n^{-1}\Vert {{\varvec{{\epsilon }}}}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} \exp \left( -\frac{nx}{12\sigma ^2}\right) {\mathrm{d}}x \nonumber \\&\quad = \int \nolimits _{25\sigma ^2}^{\infty }\exp \left[ -\frac{n}{12\sigma ^2}\{x-6(\log 3) \sigma ^2\}\right] {\mathrm{d}}x \nonumber \\&\quad < \int \nolimits _{25\sigma ^2}^{\infty }\exp \left\{ -\frac{n}{12\sigma ^2}(x-24 \sigma ^2)\right\} {\mathrm{d}}x \nonumber \\&\quad =\frac{12\sigma ^2}{n}\exp \left( -\frac{n}{12}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$(28)
Since \(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\le 2(\Vert Y-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2+\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2)\le 4\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2=4\Vert {{\varvec{{\epsilon }}}}\Vert ^2\), the first ‘\(\le \)’ follows. The second ‘\(\le \)’ is obtained by the Markov inequality. In view of the moment-generating function of the chi-squared distribution, the first ‘\(=\)’ holds. For the second term of \(T_1\),
$$\begin{aligned} 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\le 25\sigma ^2\exp (-{n}/{12})=o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$(29)
By (28) and (29), we thus have \(T_1=o({K^0\sigma ^2}/{n})\).
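To fill in the moment-generating-function step behind the first ‘\(=\)’ of (28): since \(\Vert {{\varvec{{\epsilon }}}}\Vert ^2/\sigma ^2\sim \chi _n^2\), whose moment-generating function is \((1-2t)^{-n/2}\) for \(t<1/2\), taking \(t=1/3\) gives
$$\begin{aligned} E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} =\left( 1-\frac{2}{3}\right) ^{-n/2}=\exp \left( \frac{n\log 3}{2}\right) , \end{aligned}$$
and \(\exp \{n(\log 3)/2\}\exp \{-nx/(12\sigma ^2)\}=\exp [-n\{x-6(\log 3)\sigma ^2\}/(12\sigma ^2)]\); the subsequent strict inequality then uses \(6\log 3<24\).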
On the other hand,
For the first term in (30), it follows that
For the second term in (30),
and
By (30)–(33), \(T_2 = n^{-1}{K^0\sigma ^2}(1+o(1))\). Therefore,
□
Proof of Theorem 3
This proof mimics the proof of Theorem 1 in Shen et al. (2012a); we thus omit the details. □
Proof of Theorem 4
By Sect. 3, there exists a finite \(m^*\) such that \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{(m^*)}\). Denote the grouping of \(\hat{{{\varvec{{\beta }}}}}\) by \({\mathcal {G}}=({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) with \(K<K^*\). Then \(\hat{{{\varvec{{\beta }}}}}\) satisfies that, for grouping \({\mathcal {G}}\),
where
Denote \({\mathcal {J}}={\mathcal {J}}_{11}\cap {\mathcal {J}}_{12}\cap {\mathcal {J}}_{21}\cap {\mathcal {J}}_{22}\), where \({\mathcal {J}}_{11}=\{\min \nolimits _{j\notin {\mathcal {G}}_{0}^0}{\hat{\beta }}_j^{{\mathrm{ols}}}>2\tau \}\), \({\mathcal {J}}_{12}=\{\max \nolimits _{j\in {\mathcal {G}}_{0}^0}|{{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _1}{\tau ^{-1}}\}\), \({\mathcal {J}}_{21}=\{\min \nolimits _{1\le k<l\le K^0}|{\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}|>2\tau \}\), \({\mathcal {J}}_{22}=\cap _{k=1,\ldots ,K^0: |{\mathcal {G}}_k^0|>1}\{\max _{A\subset {\mathcal {G}}_k^0}|(X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _2}{\tau }^{-1}|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|\}\). First, we show that \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) is a solution to (34) on \({\mathcal {J}}\). Note that, \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on the set \({\mathcal {J}}_{11}\cap {\mathcal {J}}_{21}\). By the definition of \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\), \((X_{{\mathcal {G}}_k^0}{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-{{\varvec{{X}}}}\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=0\). Thus, the first equation in (34) holds for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Since \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on \({\mathcal {J}}\), one can easily see that the second and third inequalities also hold for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\).
Next, we show that (34) has a unique solution on \({\mathcal {J}}\), and thus \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). We provide the proof by contradiction. Assume that \(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Let \({\mathcal {H}}=({\mathcal {H}}_1,\ldots ,{\mathcal {H}}_L)={\mathcal {G}}_0^c\vee {\mathcal {G}}_0^{0c}\). Here, we give an example to illustrate the operator ‘\(\vee \)’. Define two sets \(A_1=\{\{1,2,3,4\}, \{5,6\}\}\), and \(A_2=\{\{1,2\}, \{3,4,5,6\},\{7\}\}\). Then \(A_1\vee A_2=\{\{1,2\},\{3,4\},\{5,6\},\{7\}\}\). Denote by \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}} =({\hat{\alpha }}_{{\mathcal {H}}_1}^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L}^{{\mathrm{ols}}})^\top \) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}} =({\hat{\alpha }}_{{\mathcal {H}}_1},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L})^\top \) the coefficients estimated by OLS and by Algorithm 1, respectively. Then \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})=(2n)^{-1}\Vert {{\varvec{{y}}}}- Z_{{\mathcal {H}}}{{\varvec{{\alpha }}}}_{{\mathcal {H}}}\Vert ^2 +J({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\), where
for \({{\varvec{{\alpha }}}}_{{\mathcal {H}}} =({\alpha }_{{\mathcal {H}}_1},\ldots ,{\alpha }_{{\mathcal {H}}_L})^\top \), where \(\varepsilon _{kl}\) is the set of undirected edge between \({\mathcal {H}}_k\) and \({\mathcal {H}}_l\). We thus have
where \({{\varvec{{\varphi }}}}=(\varphi _1,\ldots ,\varphi _L)^{\top }={{\varvec{{\varphi }}}}_{1}+{{\varvec{{\varphi }}}}_{2}\), \({{\varvec{{\varphi }}}}_{1}=(\varphi _{11},\ldots ,\varphi _{L1})^{\top }\), \({{\varvec{{\varphi }}}}_{2}=(\varphi _{12},\ldots ,\varphi _{L2})^{\top }\), \( \varphi _{k1}={\lambda _1}{\tau ^{-1}}|{\mathcal {H}}_k|( a_kI_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}|\le \tau \}}- a_k^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\le \tau \}}) + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{l\ne k}|\varepsilon _{kl}|(b_{kl}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le \tau \}}-b_{kl}^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\le \tau \}}), \varphi _{k2}=2\lambda _3(|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}}),\) where \(k=1,\ldots ,L\), \(a_k={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k})\), if \({\hat{\alpha }}_{{\mathcal {H}}_k}\ne 0\), \(a_k\in [-1,1]\) otherwise; \(b_{kl}={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l})\) if \({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}\ne 0\), \(b_{kl}\in [-1,1]\) otherwise. Similarly, we have \(a_k^{{\mathrm{ols}}}\) and \(b_{kl}^{{\mathrm{ols}}}\). Note that \(\Vert {{\varvec{{\varphi }}}}_{1}\Vert ^2 \le 4\tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\).
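The common-refinement operation ‘\(\vee \)’ illustrated above (the \(A_1\vee A_2\) example) can be sketched programmatically: elements are grouped together exactly when they share a block in every partition covering them, and uncovered elements form their own cells. A small illustration (not part of the paper):

```python
def refine(p1, p2):
    # Common refinement p1 v p2 of two partial partitions (lists of sets).
    universe = set().union(*p1).union(*p2)

    def block_id(part, x):
        for i, blk in enumerate(part):
            if x in blk:
                return i
        return None  # element not covered by this partition

    groups = {}
    for x in universe:
        key = (block_id(p1, x), block_id(p2, x))
        groups.setdefault(key, set()).add(x)
    return sorted(sorted(g) for g in groups.values())

A1 = [{1, 2, 3, 4}, {5, 6}]
A2 = [{1, 2}, {3, 4, 5, 6}, {7}]
# matches the example in the text: {{1,2},{3,4},{5,6},{7}}
assert refine(A1, A2) == [[1, 2], [3, 4], [5, 6], [7]]
```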
Now, we consider two cases: (1) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\) and (2) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). For each case, we show that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are the local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).
1.
\(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\). On the set \({\mathcal {J}}\), \({\hat{\alpha }}_{{\mathcal {H}}_k}\ge {\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\ge 2\tau -\tau /2>\tau \) if \({\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}>2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}|<|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|<\tau /2\) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|=0\); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\ge -|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|-|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge \tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge 2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le |{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|<\tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|=0\). 
It implies that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are the local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).
2.
\(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). By Cauchy–Schwarz inequality,
$$\begin{aligned} \left| {{\varvec{{\varphi }}}}_{1}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\right| \le \frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) \Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert . \end{aligned}$$
It is easy to verify that \( ({\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}})({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}})\ge 0\), which yields
$$\begin{aligned} {{\varvec{{\varphi }}}}_{2}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\ge 0. \end{aligned}$$
By assumption (A4),
$$\begin{aligned}&\left( \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}- \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}\right) ^{\top } \frac{\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}}{\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert }\nonumber \\&\ge \min _{K({\mathcal {H}})\le K^*}\frac{\tau }{2}\lambda _{{\mathrm {min}}}\left( \frac{1}{n} Z_{{\mathcal {H}}}^{\top }Z_{{\mathcal {H}}}\right) -\frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) >0. \end{aligned}$$(35)
On the other hand, \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}} = 0\) and \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}=0\) on \({\mathcal {J}}\) if \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\ne \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\), which contradicts (35). Therefore, the problem (34) has a unique solution on \({\mathcal {J}}\). That is, \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\), which yields that
Next, we bound \({{\mathrm{pr}}}({\mathcal {J}}_{11}^c), {{\mathrm{pr}}}({\mathcal {J}}_{12}^c), {{\mathrm{pr}}}({\mathcal {J}}_{21}^c)\) and \({{\mathrm{pr}}}({\mathcal {J}}_{22}^c)\).
Before proceeding, we record the following inequality: for \(x>0\), \(\varPhi (-x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\). If \(x^2\ge 2\log \{{2na}/{(2\pi )^{1/2}}\}\) with \(a \ge 1\) and \(x>0\), then \(2a\varPhi (-x)\le cn^{-1}(\log n)^{-1/2}\).
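To spell out this implication: the condition \(x^2\ge 2\log \{2na/(2\pi )^{1/2}\}\) gives \(\exp (-x^2/2)\le (2\pi )^{1/2}/(2na)\), so
$$\begin{aligned} 2a\varPhi (-x)\le \frac{2a}{(2\pi )^{1/2}x}\exp \left( -\frac{x^2}{2}\right) \le \frac{2a}{(2\pi )^{1/2}x}\cdot \frac{(2\pi )^{1/2}}{2na}=\frac{1}{nx}\le \frac{c}{n(\log n)^{1/2}}, \end{aligned}$$
where the last step uses \(x\ge [2\log \{2na/(2\pi )^{1/2}\}]^{1/2}\ge c^{-1}(\log n)^{1/2}\) for some constant \(c>0\) and sufficiently large \(n\).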
For \({\mathcal {J}}_{11}^c\), by the assumptions (A1)–(A2), \({\hat{\beta }}_j^{{\mathrm{ols}}}\sim N(\beta _j^0,var({\hat{\beta }}_j^{{\mathrm{ols}}}))\), where \(var({\hat{\beta }}_j^{{\mathrm{ols}}})\le n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\sigma ^{-1}\}^2\ge 2\log \{{2n(p-|{\mathcal {G}}_0^0|)}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{12}^c\), by (A1)–(A2), \({{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X^{\top }\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})={{\varvec{{x}}}}_{(j)}^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2\le \Vert {{\varvec{{x}}}}_{(j)}\Vert ^2\). If \(({n{\lambda _1\tau ^{-1}}\sigma ^{-1}}/{\max \nolimits _{1\le j\le p}\Vert {{\varvec{{x}}}}_{(j)}\Vert })^2\ge 2\log \{{2n|{\mathcal {G}}_0^0|}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{21}^c\), by (A1)–(A2), \({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}\sim N(\alpha _k^0-\alpha _l^0, var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})),\) where \(var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})\le 4n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{2^{-1}\sigma ^{-1}(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\}^2\ge 2\log \{{nK^0(K^0-1)}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{22}^c\), by (A1)–(A2), \((X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X^{\top }\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=(X_A{{\varvec{{1}}}})^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2\le \Vert X_A{{\varvec{{1}}}}\Vert ^2\). Denote \({\mathcal {D}} = \max \nolimits _{k,A\subset {\mathcal {G}}_{k}^0} {\Vert X_A{{\varvec{{1}}}}\Vert }/{|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|}\). If \(({2^{-1}n{\lambda _2}{\tau }^{-1}\sigma ^{-1}}/{\mathcal {D}})^2\ge 2\log \{{2n|{\mathcal {N}}|}/{(2\pi )^{1/2}}\}\), then
By (36)–(40), we thus have \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}) = O\left( \frac{1}{n(\log n)^{1/2}}\right) ,\) which, together with Lemma 2.1, yields that
(2) Note that \(\hat{{{\varvec{{\alpha }}}}}\) satisfies \(-Z_{{\mathcal {G}}_0^c}^{\top }\left( {{\varvec{{y}}}}-Z_{{\mathcal {G}}_0^c}\hat{{{\varvec{{\alpha }}}}}\right) +2n\lambda _3M_0\hat{{{\varvec{{\alpha }}}}}+n\hat{{{\varvec{{\delta }}}}}=0,\) where \(M_0\) is a \(K\times K\) diagonal matrix with diagonal elements \(|{\mathcal {G}}_k|I_{\{\hat{{{\varvec{{\alpha }}}}}_{k}<0\}}\) for \(k = 1,\ldots , K\); \(\hat{{{\varvec{{\delta }}}}}=({\hat{\delta }}_1,\ldots ,{\hat{\delta }}_K)^{\top }\), \({\hat{\delta }}_k=\sum _{j\in {\mathcal {G}}_k}\varUpsilon _j(\hat{{{\varvec{{\beta }}}}})\), and \(\varUpsilon _j({{\varvec{{\beta }}}})={\lambda _1}{\tau ^{-1}}{\text {sign}}(\beta _j)I_{\{|\beta _j|\le \tau \}} + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{j': (j',j)\in \varepsilon }{\text {sign}}(\beta _j-\beta _{j'})I_{\{|\beta _j-\beta _{j'}|\le \tau \}}\). Note that \(\Vert \hat{{{\varvec{{\delta }}}}}\Vert ^2\le \tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\). We obtain that \(\hat{{{\varvec{{\alpha }}}}}=(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}(Z_{{\mathcal {G}}_0^c}^{\top }{{\varvec{{y}}}}-n\hat{{{\varvec{{\delta }}}}}),\) which further yields
Denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\) and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge D\}\). By definition, \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2.\) Next, we bound \(T_1\) and \(T_2\). Let
For \(T_1\), it follows that
By (41) and (42), the first ‘\(\le \)’ follows. In view of the moment generating function of the Chi-squared distribution, taking \(t = 1/3\), the third ‘\(\le \)’ holds. For \(T_2\),
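The moment generating function step can be made explicit: for \(W\sim \chi ^2_k\), \(E\,e^{tW}=(1-2t)^{-k/2}\) for \(t<1/2\), so Markov's inequality with \(t=1/3\) gives the Chernoff bound

```latex
% Chernoff bound for W ~ chi^2_k at t = 1/3.
\mathrm{pr}(W\ge u)
  \le e^{-u/3}\,E\,e^{W/3}
  = e^{-u/3}\,(1-2/3)^{-k/2}
  = 3^{k/2}\,e^{-u/3}, \qquad u>0,
```

which is the form used in the display above.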
For the first term in (44), if \(D= o\{K^0(\log n)^{1/2}\},\) then
For the second term in (44),
By (43), (44)–(47), \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2=n^{-1}K^0\sigma ^2(1+o(1)).\) □
Qin, S., Ding, H., Wu, Y. et al. High-dimensional sign-constrained feature selection and grouping. Ann Inst Stat Math 73, 787–819 (2021). https://doi.org/10.1007/s10463-020-00766-z