Abstract
In this paper, we propose a non-negative feature selection/feature grouping (nnFSG) method for general sign-constrained high-dimensional regression problems that allows regression coefficients to be disjointly homogeneous, with sparsity as a special case. To solve the resulting non-convex optimization problem, we provide an algorithm that incorporates difference-of-convex programming, the augmented Lagrangian method and coordinate descent. Furthermore, we show that the nnFSG method consistently recovers the oracle estimate and that its mean-squared error is bounded. Finally, we examine the performance of our method through finite-sample simulations and an application to a real protein mass spectrum dataset.
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10463-020-00766-z/MediaObjects/10463_2020_766_Fig3_HTML.png)
References
Arnold, T. B., Tibshirani, R. J. (2016). Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics, 25(1), 1–27.
Esser, E., Lou, Y. F., Xin, J. (2013). A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM Journal on Imaging Sciences, 6(4), 2010–2046.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Frank, L. E., Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135.
Friedman, J., Hastie, T., Simon, N., Tibshirani, R. (2016). Lasso and elastic-net regularized generalized linear models. R package version 2.0-5.
Fu, A., Narasimhan, B., Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv:1711.07582.
Goeman, J. J. (2010). \(L_1\) penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1), 70–84.
Hu, Z., Follmann, D. A., Miura, K. (2015). Vaccine design via nonnegative lasso-based variable selection. Statistics in Medicine, 34(10), 1791–1798.
Huang, J., Ma, S., Xie, H., Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.
Itoh, Y., Duarte, M. F., Parente, M. (2016). Perfect recovery conditions for non-negative sparse modeling. IEEE Transactions on Signal Processing, 65(1), 69–80.
Jang, W., Lim, J., Lazar, N., Loh, J. M., McDowell, J., Yu, D. (2011). Regression shrinkage and equality selection for highly correlated predictors with HORSES. Biometrics, 64, 1–23.
Koike, Y., Tanoue, Y. (2019). Oracle inequalities for sign constrained generalized linear models. Econometrics and Statistics, 11, 145–157.
Luenberger, D. G., Ye, Y. (2015). Linear and nonlinear programming, Vol. 228. New York: Springer.
Mandal, B. N., Ma, J. (2016). \(l_1\) regularized multiplicative iterative path algorithm for non-negative generalized linear models. Computational Statistics and Data Analysis, 101, 289–299.
Meinshausen, N. (2013). Sign-constrained least squares estimation for high-dimensional regression. Electronic Journal of Statistics, 7, 1607–1631.
Mullen, K. M., van Stokkum, I. H. (2012). The Lawson–Hanson algorithm for nonnegative least squares (NNLS). CRAN: R package. https://cran.r-project.org/web/packages/nnls/nnls.pdf.
Rekabdarkolaee, H. M., Boone, E., Wang, Q. (2017). Robust estimation and variable selection in sufficient dimension reduction. Computational Statistics and Data Analysis, 108, 146–157.
Renard, B. Y., Kirchner, M., Steen, H., Steen, J. A., Hamprecht, F. A. (2008). NITPICK: Peak identification for mass spectrometry data. BMC Bioinformatics, 9(1), 355.
Shadmi, Y., Jung, P., Caire, G. (2019). Sparse non-negative recovery from biased sub-Gaussian measurements using NNLS. arXiv:1901.05727.
She, Y. (2010). Sparse regression with exact clustering. Electronic Journal of Statistics, 4, 1055–1096.
Shen, X., Huang, H. C., Pan, W. (2012a). Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99(4), 899–914.
Shen, X., Pan, W., Zhu, Y. (2012b). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497), 223–232.
Shen, X., Pan, W., Zhu, Y., Zhou, H. (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807–832.
Slawski, M., Hein, M. (2010). Sparse recovery for protein mass spectrometry data. In NIPS workshop on practical applications of sparse modelling.
Slawski, M., Hein, M. (2013). Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electronic Journal of Statistics, 7, 3004–3056.
Slawski, M., Hussong, R., Tholey, A., Jakoby, T., Gregorius, B., Hildebrandt, A., Hein, M. (2012). Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching. BMC Bioinformatics, 13(1), 291.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tibshirani, R., Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1), 18–29.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J., Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.
Wen, Y. W., Wang, M., Cao, Z., Cheng, X., Ching, W. K., Vassiliadis, V. S. (2015). Sparse solution of nonnegative least squares problems with applications in the construction of probabilistic Boolean networks. Numerical Linear Algebra with Applications, 22(5), 883–899.
Wu, L., Yang, Y. (2014). Nonnegative elastic net and application in index tracking. Applied Mathematics and Computation, 227, 541–552.
Wu, L., Yang, Y., Liu, H. (2014). Nonnegative-lasso and application in index tracking. Computational Statistics and Data Analysis, 70, 116–126.
Xiang, S., Shen, X., Ye, J. (2015). Efficient nonconvex sparse group feature selection via continuous and discrete optimization. Artificial Intelligence, 224, 28–50.
Yang, S., Yuan, L., Lai, Y. C., Shen, X., Wonka, P., Ye, J. (2012). Feature grouping and selection over an undirected graph. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 922–930). New York: ACM.
Yang, Y., Wu, L. (2016). Nonnegative adaptive lasso for ultra-high dimensional regression models and a two-stage method applied in financial modeling. Journal of Statistical Planning and Inference, 174, 52–67.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
Zhu, Y., Shen, X., Pan, W. (2013). Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108(502), 713–725.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2017-05720). Qin also gratefully acknowledges financial support from the China Scholarship Council (Grant No. 201506180073).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Lemma 1
Since \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}=({\hat{\alpha }}_1^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{K^0}^{{\mathrm{ols}}})^{\top }= (Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{y}}}} ={{\varvec{{\alpha }}}}^{0}+(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}Z_{{{\mathcal {G}}_0^0}^c}^{\top }{{\varvec{{\epsilon }}}}\), \(\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\sim N\left( {{\varvec{{\alpha }}}}^{0}, \sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\right) ,\) namely,
$$\begin{aligned} {\hat{\alpha }}_{k}^{{\mathrm{ols}}}\sim N\left( {\alpha }_{k}^{0}, \sigma ^2(Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\right) , \quad k=1,\ldots ,K^0, \end{aligned}$$
where \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})_{kk}^{-1}\) denotes the k-th diagonal element of matrix \((Z_{{{\mathcal {G}}_0^0}^c}^{\top }Z_{{{\mathcal {G}}_0^0}^c})^{-1}\).
By assumption (A2), the variance of \({\hat{\alpha }}_{k}^{{\mathrm{ols}}}\) is bounded from above by \(\sigma ^2/(nc_0)\) for all \(k=1,\ldots ,K^0\). In view of assumption (A3), \(\min _{1\le k\le K^0}{\alpha }_{k}^{0}=\min _{j\in {{\mathcal {G}}_0^0}^c}{\beta }_{j}^{0}> c_n\), where \(c_n=[2\sigma ^2\log \{2nK^0/(2\pi )^{1/2}\}/(nc_0)]^{1/2}\). Similar to Meinshausen (2013), by Bonferroni’s inequality, we thus have
$$\begin{aligned} \min _{1\le k\le K^0}{\hat{\alpha }}_{k}^{{\mathrm{ols}}}> \min _{1\le k\le K^0}{\alpha }_{k}^{0}-c_n> 0 \end{aligned}$$
with probability at least
$$\begin{aligned} 1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} . \end{aligned}$$
It implies that with probability at least \(1-2K^0\left\{ 1-\varPhi \left( [2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2} \right) \right\} \), \(\min _{1\le k\le K^0}{\hat{\alpha }}_{k}^{{\mathrm{ols}}}> 0\), and thus \(\hat{{{\varvec{{\alpha }}}}}^{ora}=\hat{{{\varvec{{\alpha }}}}}^{{\mathrm{ols}}}\), \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). That is,
Since \(1-\varPhi (x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) for any \(x>0\), it follows that
$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) \le \frac{1}{n[2\log \{{2nK^0}/(2\pi )^{1/2}\}]^{1/2}}=O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
□
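The Gaussian tail bound \(1-\varPhi (x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\) used in the proof above is of Mill's-ratio type. As a numerical sanity check (not part of the paper; a standard-library Python sketch using \(1-\varPhi (x)=\tfrac{1}{2}{\mathrm{erfc}}(x/\sqrt{2})\)):

```python
import math

def gauss_upper_tail(x):
    # 1 - Phi(x) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mills_bound(x):
    # (2*pi)^{-1/2} * x^{-1} * exp(-x^2 / 2), valid for x > 0
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

# the bound should hold for every x > 0
for x in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    assert gauss_upper_tail(x) <= mills_bound(x)
```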
Proof of Theorem 1
Let \({\mathcal {G}} =({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) be a grouping of the constrained problem in Sect. 2, satisfying that \(0\le {\hat{\beta }}_j^{{\mathrm{cons}}}\le \tau \) if \(j\in {\mathcal {G}}_0\), \(|{\hat{\beta }}_j^{{\mathrm{cons}}}-{\hat{\beta }}_{j'}^{{\mathrm{cons}}}|> \tau \) if \(j\in {{\mathcal {G}}}_k\), \(j'\in {{\mathcal {G}}}_{k'}\), \(j = 1, \ldots , p; 1\le k\ne k'\le K\).
If \({{\mathcal {G}}}={\mathcal {G}}^0\), then \(|{{\mathcal {G}}}_0^c|=s_1^0\). By the first constraint \(\sum _{j=1}^p \min \left\{ \frac{|\beta _j|}{\tau }, 1\right\} \le s_1\), \(\sum _{j\in {{\mathcal {G}}}_0}{\hat{\beta }}_j^{{\mathrm{cons}}}/\tau +s_1^0\le s_1^0\), which implies that \({\hat{\beta }}_j^{{\mathrm{cons}}}=0\), \(j\in {{\mathcal {G}}}_0\). By the second constraint \(\sum _{(j, j') \in \varepsilon } \min \left\{ \frac{|\beta _j - \beta _{j'}|}{\tau }, 1\right\} \le s_2\), similarly, we obtain that \({\hat{\beta }}_j^{{\mathrm{cons}}}={\hat{\beta }}_{j'}^{{\mathrm{cons}}}\), \(j, j'\in {{\mathcal {G}}}_k={\mathcal {G}}_k^0\), \((j, j')\in \varepsilon \), \(k=1,\ldots ,K\). Thus, \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\) if \({{\mathcal {G}}}={\mathcal {G}}^0\), which, together with the fact that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}})={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) + {{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}={\mathcal {G}}^0)\), yields that
Denote \({{\bar{S}}}({{\varvec{{\beta }}}}) = 2^{-1}\Vert Y-X {{\varvec{{\beta }}}}\Vert ^2 \). Noting that \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0) ={{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)+{{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}, \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}, {{\mathcal {G}}}\ne {\mathcal {G}}^0)\), (20) thus becomes
The second term in (21) has already been bounded in Lemma 2.1. Next, we work on the first term in (21), denoted by \(\varGamma \).
Consider the case where \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) and \({{\mathcal {G}}}\ne {\mathcal {G}}^0\). Define \(\bar{{{\varvec{{\beta }}}}}= ({\bar{\beta }}_1,\ldots ,{\bar{\beta }}_{p})^{\top }\), satisfying
It follows that \(|{\bar{\beta }}_j-{\hat{\beta }}_{j}^{{\mathrm{cons}}}|\le \tau \), \(\Vert \bar{{{\varvec{{\beta }}}}}-\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2\le \tau ^2p\), and thus
Note that
For any vector \({{\varvec{{u}}}}, {{\varvec{{v}}}}\in {\mathbb {R}}^p\) and \(a>0\), it holds that \(\Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2\ge a^{-1}(a-1)\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2\) (Shen et al. 2012a). We thus have
Substituting (22)–(23) into (24), together with \({{\bar{S}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =2^{-1}\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\Vert ^2\le 2^{-1}{{\varvec{{\epsilon }}}}^{\top }{{\varvec{{\epsilon }}}},\) we obtain that, for any \(a>1\),
where \(L_1=\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}^{\top }(I-P_{Z_{{\mathcal {G}}_0^c}})\{{{\varvec{{\epsilon }}}}-(a-1)(I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\}\), and \(L_1\sigma ^{-2}\) follows a noncentral chi-squared distribution \(\chi _{k,\varLambda }^2\) with degrees of freedom \(k=\max \{n-K({\mathcal {G}}_0^c),0\}\) and noncentrality parameter \(\varLambda =(a-1)^2\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2/\sigma ^2\); \(L_2=a{{\varvec{{\epsilon }}}}^{\top }P_{Z_{{\mathcal {G}}_0^c}}{{\varvec{{\epsilon }}}}\) is independent of \(L_1\), and \(a^{-1}\sigma ^{-2}L_2\) follows a chi-squared distribution \(\chi _{\kappa }^2\) with degrees of freedom \(\kappa =K({\mathcal {G}}_0^c)\); \(L_3=a(a-1)\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2-a(a-1) \lambda _{{\mathrm {max}}}(X^{\top }X)\tau ^2p\). Note that, by the definition of \(C_{{\mathrm {min}}}\), \(\Vert (I-P_{Z_{{\mathcal {G}}_0^c}})X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge nC_{{\mathrm {min}}}\).
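The norm inequality \(\Vert {{\varvec{{u}}}}+{{\varvec{{v}}}}\Vert ^2\ge a^{-1}(a-1)\Vert {{\varvec{{u}}}}\Vert ^2-(a-1)\Vert {{\varvec{{v}}}}\Vert ^2\) of Shen et al. (2012a) used above follows from Young's inequality \(2{{\varvec{{u}}}}^{\top }{{\varvec{{v}}}}\ge -a^{-1}\Vert {{\varvec{{u}}}}\Vert ^2-a\Vert {{\varvec{{v}}}}\Vert ^2\). A quick numerical check (not from the paper; a standard-library Python sketch):

```python
import random

def norm_sq(v):
    return sum(x * x for x in v)

def inequality_holds(u, v, a, tol=1e-9):
    # ||u + v||^2 >= ((a - 1) / a) * ||u||^2 - (a - 1) * ||v||^2, for a > 0
    lhs = norm_sq([ui + vi for ui, vi in zip(u, v)])
    rhs = ((a - 1.0) / a) * norm_sq(u) - (a - 1.0) * norm_sq(v)
    return lhs >= rhs - tol

random.seed(0)
for _ in range(1000):
    u = [random.gauss(0.0, 1.0) for _ in range(5)]
    v = [random.gauss(0.0, 1.0) for _ in range(5)]
    a = random.uniform(0.01, 10.0)
    assert inequality_holds(u, v, a)
```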
For \(\varGamma \), by the Markov inequality and the moment-generating function of the chi-squared distribution, it holds that, for any \(0<t<1/(2a)\) with \(1-2at<1-2t<1 ~ (a>1)\), by Shen et al. (2012a),
where \(l_1^*=\exp \left\{ \frac{(a-1)\log p}{4n}-n\frac{t(a-1)iC_{{\mathrm {min}}}}{\sigma ^2}\frac{1-2at}{1-2t}\right\} \), \(l_2^*=(1-2t)/(1-2at)\), \(K_i^*=\max _{\{{\mathcal {G}}\in {\mathcal {T}}, |{\mathcal {G}}_0\backslash {\mathcal {G}}_0^0|=i\}}K({\mathcal {G}}_0^c)\). Note that the last inequality holds true because
for any \(\tau \le \sigma [\log p/\{2np\lambda _{{\mathrm {max}}}(X^{\top }X)\}]^{1/2}\). We choose \(a=4+n/4\), \(t=\{4(a-1)\}^{-1}\), and define \(b={(1-2t)}/{(1-2at)}\). Then \(b={(2a-3)}/{(a-2)}<5/2\), and \((a-1)/(4n)\le 1\). Since \(-\log (1-x)\le x(1-x)^{-1}\) for \(0<x<1\), and \(0<2t=2^{-1}(a-1)^{-1}<1\), it follows that
which jointly with the facts
yields that
Since \((1-z)^{-1}=\sum _{i=0}^{\infty }z^i\) for \(|z|<1\), we thus obtain that, for \(x<0\),
We take \(x =- {10^{-1}\sigma ^{-2}}n\{C_{{\mathrm {min}}}-{10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\}\) if \(C_{{\mathrm {min}}}>{10\sigma ^2}{n}^{-1}(3\log p +{\bar{T}}+{{\bar{K}}}/{2})\). Together with \(\varGamma \le 1\), (25) becomes
Similarly, we can show that (26) still holds for \(C_{{\mathrm {min}}}\le {10\sigma ^2}{n}^{-1}(3\log p+{\bar{T}}+{{\bar{K}}}/{2})\). By Lemma 2.1 and (26), (21) becomes
1.
If \(C_{{\mathrm {min}}}\ge {10\sigma ^2}{n}^{-1}\left( \log n+2^{-1}\log \log n +3\log p+{\bar{T}}+{{\bar{K}}}/{2}\right) \), by (27),
$$\begin{aligned} {{\mathrm{pr}}}\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ora}}}\right) =O\left( \frac{1}{n(\log n)^{1/2}}\right) . \end{aligned}$$
2.
We denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\), and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2\}\). It is easy to see that
$$\begin{aligned} \frac{1}{n}E\left\| X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\right\| ^2=T_1+T_2. \end{aligned}$$
Now, we work on \(T_1\). By the definition, \(T_1 = \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x){\mathrm{d}}x + 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\). For the first term of \(T_1\),
$$\begin{aligned}&\int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }{{\mathrm{pr}}}\left( 4n^{-1}\Vert {{\varvec{{\epsilon }}}}\Vert ^2\ge x\right) {\mathrm{d}}x \nonumber \\&\quad \le \int \nolimits _{25\sigma ^2}^{\infty }E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} \exp \left( -\frac{nx}{12\sigma ^2}\right) {\mathrm{d}}x \nonumber \\&\quad = \int \nolimits _{25\sigma ^2}^{\infty }\exp \left[ -\frac{n}{12\sigma ^2}\{x-6(\log 3) \sigma ^2\}\right] {\mathrm{d}}x \nonumber \\&\quad < \int \nolimits _{25\sigma ^2}^{\infty }\exp \left\{ -\frac{n}{12\sigma ^2}(x-24 \sigma ^2)\right\} {\mathrm{d}}x \nonumber \\&\quad =\frac{12\sigma ^2}{n}\exp \left( -\frac{n}{12}\right) =o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$(28)
Since \(\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\le 2(\Vert Y-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}\Vert ^2+\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2)\le 4\Vert Y-X{{\varvec{{\beta }}}}^{0}\Vert ^2=4\Vert {{\varvec{{\epsilon }}}}\Vert ^2\), the first ‘\(\le \)’ follows. The second ‘\(\le \)’ is obtained by the Markov inequality. In view of the moment-generating function of the chi-squared distribution, the first ‘\(=\)’ holds. For the second term of \(T_1\),
$$\begin{aligned} 25\sigma ^2{{\mathrm{pr}}}(n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{cons}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge 25\sigma ^2)\le 25\sigma ^2\exp (-{n}/{12})=o\left( \frac{K^0\sigma ^2}{n}\right) . \end{aligned}$$(29)
By (28) and (29), we thus have \(T_1=o({K^0\sigma ^2}/{n})\).
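To fill in the moment-generating-function step behind the first ‘\(=\)’ of (28): since \(\Vert {{\varvec{{\epsilon }}}}\Vert ^2/\sigma ^2\sim \chi _n^2\), whose moment-generating function is \((1-2t)^{-n/2}\) for \(t<1/2\), taking \(t=1/3\) gives
$$\begin{aligned} E\left\{ \exp \left( \frac{\Vert {{\varvec{{\epsilon }}}}\Vert ^2}{3\sigma ^2}\right) \right\} =\left( 1-\frac{2}{3}\right) ^{-n/2}=\exp \left( \frac{n\log 3}{2}\right) , \end{aligned}$$
and \(\exp \{n(\log 3)/2\}\exp \{-nx/(12\sigma ^2)\}=\exp [-n\{x-6(\log 3)\sigma ^2\}/(12\sigma ^2)]\); the subsequent strict inequality then uses \(6\log 3<24\).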
On the other hand,
For the first term in (30), it follows that
For the second term in (30),
and
By (30)–(33), \(T_2 = n^{-1}{K^0\sigma ^2}(1+o(1))\). Therefore,
□
Proof of Theorem 3
This proof mimics the proof of Theorem 1 in Shen et al. (2012a); we thus omit the details. □
Proof of Theorem 4
By Sect. 3, there exists a finite \(m^*\) such that \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{(m^*)}\). Denote the grouping of \(\hat{{{\varvec{{\beta }}}}}\) by \({\mathcal {G}}=({\mathcal {G}}_0,{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_K)\) with \(K<K^*\). Then \(\hat{{{\varvec{{\beta }}}}}\) satisfies that, for grouping \({\mathcal {G}}\),
where
Denote \({\mathcal {J}}={\mathcal {J}}_{11}\cap {\mathcal {J}}_{12}\cap {\mathcal {J}}_{21}\cap {\mathcal {J}}_{22}\), where \({\mathcal {J}}_{11}=\{\min \nolimits _{j\notin {\mathcal {G}}_{0}^0}{\hat{\beta }}_j^{{\mathrm{ols}}}>2\tau \}\), \({\mathcal {J}}_{12}=\{\max \nolimits _{j\in {\mathcal {G}}_{0}^0}|{{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _1}{\tau ^{-1}}\}\), \({\mathcal {J}}_{21}=\{\min \nolimits _{1\le k<l\le K^0}|{\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}|>2\tau \}\), \({\mathcal {J}}_{22}=\cap _{k=1,\ldots ,K^0: |{\mathcal {G}}_k^0|>1}\{\max _{A\subset {\mathcal {G}}_k^0}|(X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})|\le n{\lambda _2}{\tau }^{-1}|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|\}\). First, we show that \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) is a solution to (34) on \({\mathcal {J}}\). Note that, \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on the set \({\mathcal {J}}_{11}\cap {\mathcal {J}}_{21}\). By the definition of \(\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\), \((X_{{\mathcal {G}}_k^0}{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-{{\varvec{{X}}}}\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=0\). Thus, the first equation in (34) holds for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Since \(\sum _{j\in {\mathcal {G}}_k^0}\varDelta _j\left( \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\right) =0\) on \({\mathcal {J}}\), one can easily see that the second and third inequalities also hold for \({{\varvec{{\beta }}}}=\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\).
Next, we show that (34) has a unique solution on \({\mathcal {J}}\), and thus \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). We provide the proof by contradiction. Assume that \(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\). Let \({\mathcal {H}}=({\mathcal {H}}_1,\ldots ,{\mathcal {H}}_L)={\mathcal {G}}_0^c\vee {\mathcal {G}}_0^{0c}\). Here, we give an example to illustrate the operator ‘\(\vee \)’. Define two sets \(A_1=\{\{1,2,3,4\}, \{5,6\}\}\), and \(A_2=\{\{1,2\}, \{3,4,5,6\},\{7\}\}\). Then \(A_1\vee A_2=\{\{1,2\},\{3,4\},\{5,6\},\{7\}\}\). Denote by \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}} =({\hat{\alpha }}_{{\mathcal {H}}_1}^{{\mathrm{ols}}},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L}^{{\mathrm{ols}}})^\top \) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}} =({\hat{\alpha }}_{{\mathcal {H}}_1},\ldots ,{\hat{\alpha }}_{{\mathcal {H}}_L})^\top \) the coefficients estimated by OLS and by Algorithm 1, respectively. Then \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})=(2n)^{-1}\Vert {{\varvec{{y}}}}- Z_{{\mathcal {H}}}{{\varvec{{\alpha }}}}_{{\mathcal {H}}}\Vert ^2 +J({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\), where
for \({{\varvec{{\alpha }}}}_{{\mathcal {H}}} =({\alpha }_{{\mathcal {H}}_1},\ldots ,{\alpha }_{{\mathcal {H}}_L})^\top \), where \(\varepsilon _{kl}\) is the set of undirected edge between \({\mathcal {H}}_k\) and \({\mathcal {H}}_l\). We thus have
where \({{\varvec{{\varphi }}}}=(\varphi _1,\ldots ,\varphi _L)^{\top }={{\varvec{{\varphi }}}}_{1}+{{\varvec{{\varphi }}}}_{2}\), \({{\varvec{{\varphi }}}}_{1}=(\varphi _{11},\ldots ,\varphi _{L1})^{\top }\), \({{\varvec{{\varphi }}}}_{2}=(\varphi _{12},\ldots ,\varphi _{L2})^{\top }\), \( \varphi _{k1}={\lambda _1}{\tau ^{-1}}|{\mathcal {H}}_k|( a_kI_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}|\le \tau \}}- a_k^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\le \tau \}}) + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{l\ne k}|\varepsilon _{kl}|(b_{kl}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le \tau \}}-b_{kl}^{{\mathrm{ols}}}I_{\{|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\le \tau \}}), \varphi _{k2}=2\lambda _3(|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -|{\mathcal {H}}_k|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}}),\) where \(k=1,\ldots ,L\), \(a_k={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k})\), if \({\hat{\alpha }}_{{\mathcal {H}}_k}\ne 0\), \(a_k\in [-1,1]\) otherwise; \(b_{kl}={\text {sign}}({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l})\) if \({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}\ne 0\), \(b_{kl}\in [-1,1]\) otherwise. Similarly, we have \(a_k^{{\mathrm{ols}}}\) and \(b_{kl}^{{\mathrm{ols}}}\). Note that \(\Vert {{\varvec{{\varphi }}}}_{1}\Vert ^2 \le 4\tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\).
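The common-refinement operation ‘\(\vee \)’ illustrated above (the \(A_1\vee A_2\) example) can be sketched programmatically: elements are grouped together exactly when they share a block in every partition covering them, and uncovered elements form their own cells. A small illustration (not part of the paper):

```python
def refine(p1, p2):
    # Common refinement p1 v p2 of two partial partitions (lists of sets).
    universe = set().union(*p1).union(*p2)

    def block_id(part, x):
        for i, blk in enumerate(part):
            if x in blk:
                return i
        return None  # element not covered by this partition

    groups = {}
    for x in universe:
        key = (block_id(p1, x), block_id(p2, x))
        groups.setdefault(key, set()).add(x)
    return sorted(sorted(g) for g in groups.values())

A1 = [{1, 2, 3, 4}, {5, 6}]
A2 = [{1, 2}, {3, 4, 5, 6}, {7}]
# matches the example in the text: {{1,2},{3,4},{5,6},{7}}
assert refine(A1, A2) == [[1, 2], [3, 4], [5, 6], [7]]
```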
Now, we consider two cases: (1) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\) and (2) \(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). For each case, we show that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are the local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).
1.
\(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert <\tau /2\). On the set \({\mathcal {J}}\), \({\hat{\alpha }}_{{\mathcal {H}}_k}\ge {\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|\ge 2\tau -\tau /2>\tau \) if \({\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}>2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}|<|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|<\tau /2\) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|=0\); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\ge -|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|-|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge \tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|\ge 2\tau \); \(|{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_l}|\le |{\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}|+|{\hat{\alpha }}_{{\mathcal {H}}_l}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}| +|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|<\tau \) if \(|{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}-{\hat{\alpha }}_{{\mathcal {H}}_l}^{{\mathrm{ols}}}|=0\). 
It implies that both \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) are the local minimizers of \(S({{\varvec{{\alpha }}}}_{{\mathcal {H}}})\) and \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}=\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\).
2.
\(\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert \ge \tau /2\). By Cauchy–Schwarz inequality,
$$\begin{aligned} \left| {{\varvec{{\varphi }}}}_{1}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\right| \le \frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) \Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert . \end{aligned}$$
It is easy to verify that \( ({\hat{\alpha }}_{{\mathcal {H}}_k}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}<0\}} -{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}I_{\{{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}}<0\}})({\hat{\alpha }}_{{\mathcal {H}}_k}-{\hat{\alpha }}_{{\mathcal {H}}_k}^{{\mathrm{ols}}})\ge 0\), which yields
$$\begin{aligned} {{\varvec{{\varphi }}}}_{2}^{\top }(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})\ge 0. \end{aligned}$$
By assumption (A4),
$$\begin{aligned}&\left( \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}- \frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}\right) ^{\top } \frac{\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}}{\Vert \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}-\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\Vert }\nonumber \\&\ge \min _{K({\mathcal {H}})\le K^*}\frac{\tau }{2}\lambda _{{\mathrm {min}}}\left( \frac{1}{n} Z_{{\mathcal {H}}}^{\top }Z_{{\mathcal {H}}}\right) -\frac{2}{\tau }\left( \lambda _1s^{*}+\lambda _2 |{\mathcal {N}}|\right) >0. \end{aligned}$$(35)
On the other hand, \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}} = 0\) and \(\frac{\partial S(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}})}{\partial {{\varvec{{\alpha }}}}_{{\mathcal {H}}}}=0\) on \({\mathcal {J}}\) if \(\hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}\ne \hat{{{\varvec{{\alpha }}}}}_{{\mathcal {H}}}^{{\mathrm{ols}}}\), which contradicts (35). Therefore, the problem (34) has a unique solution on \({\mathcal {J}}\). That is, \(\hat{{{\varvec{{\beta }}}}} = \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}\) on \({\mathcal {J}}\), which yields that
Next, we bound \({{\mathrm{pr}}}({\mathcal {J}}_{11}^c), {{\mathrm{pr}}}({\mathcal {J}}_{12}^c), {{\mathrm{pr}}}({\mathcal {J}}_{21}^c)\) and \({{\mathrm{pr}}}({\mathcal {J}}_{22}^c)\).
Before proceeding, we record the following inequality: for \(x>0\), \(\varPhi (-x)\le (2\pi )^{-1/2}x^{-1}\exp (-x^2/2)\). If \(x^2\ge 2\log \{{2na}/{(2\pi )^{1/2}}\}\) with \(a \ge 1\) and \(x>0\), then \(2a\varPhi (-x)\le cn^{-1}(\log n)^{-1/2}\).
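To spell out this implication: the condition \(x^2\ge 2\log \{2na/(2\pi )^{1/2}\}\) gives \(\exp (-x^2/2)\le (2\pi )^{1/2}/(2na)\), so
$$\begin{aligned} 2a\varPhi (-x)\le \frac{2a}{(2\pi )^{1/2}x}\exp \left( -\frac{x^2}{2}\right) \le \frac{2a}{(2\pi )^{1/2}x}\cdot \frac{(2\pi )^{1/2}}{2na}=\frac{1}{nx}\le \frac{c}{n(\log n)^{1/2}}, \end{aligned}$$
where the last step uses \(x\ge [2\log \{2na/(2\pi )^{1/2}\}]^{1/2}\ge c^{-1}(\log n)^{1/2}\) for some constant \(c>0\) and sufficiently large \(n\).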
For \({\mathcal {J}}_{11}^c\), by the assumptions (A1)–(A2), \({\hat{\beta }}_j^{{\mathrm{ols}}}\sim N(\beta _j^0,var({\hat{\beta }}_j^{{\mathrm{ols}}}))\), where \(var({\hat{\beta }}_j^{{\mathrm{ols}}})\le n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\sigma ^{-1}\}^2\ge 2\log \{{2n(p-|{\mathcal {G}}_0^0|)}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{12}^c\), by (A1)–(A2), \({{\varvec{{x}}}}_{(j)}^{\top }({{\varvec{{y}}}}-X^{\top }\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})={{\varvec{{x}}}}_{(j)}^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{x}}}}_{(j)}\Vert ^2\le \Vert {{\varvec{{x}}}}_{(j)}\Vert ^2\). If \(({n{\lambda _1\tau ^{-1}}\sigma ^{-1}}/{\max \nolimits _{1\le j\le p}\Vert {{\varvec{{x}}}}_{(j)}\Vert })^2\ge 2\log \{{2n|{\mathcal {G}}_0^0|}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{21}^c\), by (A1)–(A2), \({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}}\sim N(\alpha _k^0-\alpha _l^0, var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})),\) where \(var({\hat{\alpha }}_k^{{\mathrm{ols}}}-{\hat{\alpha }}_l^{{\mathrm{ols}}})\le 4n^{-1}\sigma ^{2}\lambda _{{\mathrm {min}}}^{-1}(n^{-1}Z_{{\mathcal {G}}_0^{0c}}^{\top }Z_{{\mathcal {G}}_0^{0c}})\). If \(\gamma _{{\mathrm {min}}}>2\tau \), and \(\{2^{-1}\sigma ^{-1}(\gamma _{{\mathrm {min}}}-2\tau )n^{1/2}\lambda _{{\mathrm {min}}}^{1/2}(n^{-1}Z_{{\mathcal {G}}_{0}^{0c}}^{\top }Z_{{\mathcal {G}}_{0}^{0c}})\}^2\ge 2\log \{{nK^0(K^0-1)}/{(2\pi )^{1/2}}\}\), then
For \({\mathcal {J}}_{22}^c\), by (A1)–(A2), \((X_A{{\varvec{{1}}}})^{\top }({{\varvec{{y}}}}-X^{\top }\hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}})=(X_A{{\varvec{{1}}}})^{\top }(I-P_{Z_{{\mathcal {G}}_0^{0c}}}){{\varvec{{\epsilon }}}}\sim N(0,\sigma ^2\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2),\) and \(\Vert (I-P_{Z_{{\mathcal {G}}_0^{0c}}})X_A{{\varvec{{1}}}}\Vert ^2\le \Vert X_A{{\varvec{{1}}}}\Vert ^2\). Denote \({\mathcal {D}} = \max \nolimits _{k,A\subset {\mathcal {G}}_{k}^0} {\Vert X_A{{\varvec{{1}}}}\Vert }/{|\varepsilon \cap \{A\times ({\mathcal {G}}_k^0{\setminus } A)\}|}\). If \(({2^{-1}n{\lambda _2}{\tau }^{-1}\sigma ^{-1}}/{\mathcal {D}})^2\ge 2\log \{{2n|{\mathcal {N}}|}/{(2\pi )^{1/2}}\}\), then
By (36)–(40), we thus have \({{\mathrm{pr}}}(\hat{{{\varvec{{\beta }}}}}\ne \hat{{{\varvec{{\beta }}}}}^{{\mathrm{ols}}}) = O\left( \frac{1}{n(\log n)^{1/2}}\right) ,\) which, together with Lemma 2.1, yields that
(2) Note that \(\hat{{{\varvec{{\alpha }}}}}\) satisfies \(-Z_{{\mathcal {G}}_0^c}^{\top }\left( {{\varvec{{y}}}}-Z_{{\mathcal {G}}_0^c}\hat{{{\varvec{{\alpha }}}}}\right) +2n\lambda _3M_0\hat{{{\varvec{{\alpha }}}}}+n\hat{{{\varvec{{\delta }}}}}=0,\) where \(M_0\) is a \(K\times K\) diagonal matrix with diagonal elements \(|{\mathcal {G}}_k|I_{\{\hat{{{\varvec{{\alpha }}}}}_{k}<0\}}\) for \(k = 1,\ldots , K\); \(\hat{{{\varvec{{\delta }}}}}=({\hat{\delta }}_1,\ldots ,{\hat{\delta }}_K)^{\top }\), \({\hat{\delta }}_k=\sum _{j\in {\mathcal {G}}_k}\varUpsilon _j(\hat{{{\varvec{{\beta }}}}})\), and \(\varUpsilon _j({{\varvec{{\beta }}}})={\lambda _1}{\tau ^{-1}}{\text {sign}}(\beta _j)I_{\{|\beta _j|\le \tau \}} + {\lambda _2}{\tau ^{-1}}\sum \nolimits _{j': (j',j)\in \varepsilon }{\text {sign}}(\beta _j-\beta _{j'})I_{\{|\beta _j-\beta _{j'}|\le \tau \}}\). Note that \(\Vert \hat{{{\varvec{{\delta }}}}}\Vert ^2\le \tau ^{-2}(\lambda _1s^*+\lambda _2|{\mathcal {N}}|)^2\). We obtain that \(\hat{{{\varvec{{\alpha }}}}}=(Z_{{\mathcal {G}}_0^c}^{\top }Z_{{\mathcal {G}}_0^c}+2n\lambda _3M_0)^{-1}(Z_{{\mathcal {G}}_0^c}^{\top }{{\varvec{{y}}}}-n\hat{{{\varvec{{\delta }}}}}),\) which further yields
Denote \(T_1=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G\}})\) and \(T_2=n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2I_{\{G^c\}})\), where \(G=\{n^{-1}\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2\ge D\}\). By definition, \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2.\) Next, we bound \(T_1\) and \(T_2\). Let
For \(T_1\), it follows that
By (41) and (42), the first ‘\(\le \)’ follows. In view of the moment generating function of the Chi-squared distribution, taking \(t = 1/3\), the third ‘\(\le \)’ holds. For \(T_2\),
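The moment generating function step can be made explicit: for \(W\sim \chi ^2_k\), \(E\,e^{tW}=(1-2t)^{-k/2}\) for \(t<1/2\), so Markov's inequality with \(t=1/3\) gives the Chernoff bound

```latex
% Chernoff bound for W ~ chi^2_k at t = 1/3.
\mathrm{pr}(W\ge u)
  \le e^{-u/3}\,E\,e^{W/3}
  = e^{-u/3}\,(1-2/3)^{-k/2}
  = 3^{k/2}\,e^{-u/3}, \qquad u>0,
```

which is the form used in the display above.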
For the first term in (44), if \(D= o\{K^0(\log n)^{1/2}\},\) then
For the second term in (44),
By (43), (44)–(47), \(n^{-1}E(\Vert X\hat{{{\varvec{{\beta }}}}}-X{{\varvec{{\beta }}}}^{0}\Vert ^2)=T_1+T_2=n^{-1}K^0\sigma ^2(1+o(1)).\) □
Qin, S., Ding, H., Wu, Y. et al. High-dimensional sign-constrained feature selection and grouping. Ann Inst Stat Math 73, 787–819 (2021). https://doi.org/10.1007/s10463-020-00766-z