Abstract
We propose an adaptive smoothing algorithm based on Nesterov’s smoothing technique in Nesterov (Math Prog 103(1):127–152, 2005) for solving “fully” nonsmooth composite convex optimization problems. Our method combines Nesterov’s accelerated proximal gradient scheme with a new homotopy strategy for the smoothness parameter. By an appropriate choice of smoothing functions, we develop a new algorithm that achieves the \(\mathcal {O}\left( \frac{1}{\varepsilon }\right) \) worst-case iteration-complexity while preserving the same per-iteration complexity as in Nesterov’s method and automatically updating the smoothness parameter at each iteration. We then customize our algorithm to solve four special cases that cover various applications. We also specialize our algorithm to constrained convex optimization problems and establish a convergence guarantee on the primal sequence of iterates. We demonstrate our algorithm through three numerical examples and compare it with other related algorithms.
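To make the idea concrete, the following is a minimal Python sketch of the general scheme described above: an accelerated proximal gradient loop applied to a smoothed objective whose smoothness parameter decreases along the iterations. It is not the paper's Algorithm 1; the toy problem \(\min_x \Vert Ax-b\Vert_1 + \lambda \Vert x\Vert_1\), the Huber smoothing of the first term, and the specific updates \(\gamma_k = \gamma_1/k\), \(\tau_k = 1/k\) are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def grad_huber(r, gamma):
    # Gradient of the componentwise Huber smoothing of ||.||_1.
    return np.clip(r / gamma, -1.0, 1.0)

def adaptive_smoothing_apg(A, b, lam, gamma1, iters=500):
    # Accelerated proximal gradient on the smoothed objective, with a
    # decreasing smoothness parameter gamma_k (a simple homotopy).
    x = np.zeros(A.shape[1])
    x_tilde = x.copy()
    normA2 = np.linalg.norm(A, 2) ** 2       # ||A||^2 (squared spectral norm)
    for k in range(1, iters + 1):
        gamma = gamma1 / k                   # illustrative homotopy rule
        tau = 1.0 / k                        # illustrative averaging weight
        x_hat = (1.0 - tau) * x + tau * x_tilde
        L = normA2 / gamma                   # Lipschitz constant of the smoothed gradient
        grad = A.T @ grad_huber(A @ x_hat - b, gamma)
        x_next = soft_threshold(x_hat - grad / L, lam / L)
        x_tilde = x_tilde - (x_hat - x_next) / tau
        x = x_next
    return x

# Example: A = np.random.randn(20, 50); b = np.random.randn(20)
# x_sol = adaptive_smoothing_apg(A, b, lam=0.1, gamma1=1.0)
```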
References
Argyriou, A., Signoretto, M., Suykens, J.A.K.: Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. In: Suykens, J.A.K., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, pp. 53–82. CRC Press, Boca Raton (2014)
Baes, M., Bürgisser, M.: Smoothing techniques for solving semi-definite programs with many constraints. Optimization Online (2009)
Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)
Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)
Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Prog. Comput. 3(3), 165–218 (2011)
Belloni, A., Chernozhukov, V., Wang, L.: Square-root LASSO: pivotal recovery of sparse signals via conic programming. Biometrika 98(4), 791–806 (2011)
Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. MPS/SIAM Series on Optimization, vol. 3. SIAM, Philadelphia (2001)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical methods. Prentice Hall, Englewood Cliffs (1989)
Boţ, R.I., Hendrich, C.: A variable smoothing algorithm for solving convex optimization problems. TOP 23(1), 124–150 (2012)
Boţ, R.I., Hendrich, C.: A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems. Comput. Optim. Appl. 54(2), 239–262 (2013)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chen, J., Burer, S.: A first-order smoothing technique for a class of large-scale linear programs. SIAM J. Optim. 24(2), 598–620 (2014)
Combettes, P., Pesquet, J.-C.: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Chapter Proximal Splitting Methods in Signal Processing, pp. 185–212. Springer, New York (2011)
Devolder, O., Glineur, F., Nesterov, Y.: Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J. Optim. 22(2), 702–727 (2012)
Goldfarb, D., Ma, S.: Fast alternating linearization methods for minimization of the sum of two convex functions. Math. Prog. A 141(1), 349–382 (2012)
Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and Its Applications, pp. 155–210. Springer, New York (2006)
Necoara, I., Suykens, J.A.K.: Applications of a smoothing technique to decomposition in convex optimization. IEEE Trans. Autom. Control 53(11), 2674–2679 (2008)
Nedelcu, V., Necoara, I., Tran-Dinh, Q.: Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained MPC. SIAM J. Control Optim. 52(5), 3109–3134 (2014)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\mathcal{O}(1/k^2)\). Doklady AN SSSR 269 (Soviet Math. Dokl.), 543–547 (1983)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Applied Optimization, vol. 87. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16(1), 235–249 (2005)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Prog. 103(1), 127–152 (2005)
Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Math. Prog. 110(2), 245–259 (2007)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Prog. 140(1), 125–161 (2013)
Orabona, F., Argyriou, A., Srebro, N.: PRISMA: proximal iterative smoothing algorithm. Tech. Report, pp. 1–21 (2012). http://arxiv.org/abs/1206.2372
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Rockafellar, R.T.: Monotropic programming: a generalization of linear programming and network programming. In: Rockafellar, R.T. (ed.) Convexity and Duality in Optimization, pp. 10–36. Springer, New York (1985)
Tran-Dinh, Q.: Sequential Convex Programming and Decomposition Approaches for Nonlinear Optimization. PhD Thesis, Arenberg Doctoral School, KU Leuven, Department of Electrical Engineering (ESAT/SCD) and Optimization in Engineering Center, Kasteelpark Arenberg 10, 3001-Heverlee, Belgium (2012)
Tran-Dinh, Q., Cevher, V.: A primal-dual algorithmic framework for constrained convex minimization, pp. 1–54. Tech. Report, LIONS (2014)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
Tran-Dinh, Q., Savorgnan, C., Diehl, M.: Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems. Comput. Optim. Appl. 55(1), 75–111 (2013)
Yang, J., Zhang, Y.: Alternating direction algorithms for \(\ell _1\) -problems in compressive sensing. SIAM J. Sci. Comput. 33(1–2), 250–278 (2011)
Acknowledgments
This research was partially supported by NSF, Grant No. 161-9884, and NAFOSTED, Grant No. 101.01-2014-24.
Appendix: the proof of technical results
This appendix provides the full proofs of the technical results presented in the main text.
1.1 The proof of Lemma 2: descent property of the proximal gradient step
By using (9) with \(f_{\gamma }({x}) := \varphi ^{*}_{\gamma }({A}^{\top }{x})\), \(\nabla {f_{\gamma }}(\bar{{x}}) = {A}\nabla {\varphi _{\gamma }^{*}}({A}^{\top }\bar{{x}})\), \({z}:= {A}^{\top }{x}\), \(\bar{{z}} := {A}^{\top }\bar{{x}}\), and \(\Vert {A}^{\top }({x}- \bar{{x}})\Vert \le \Vert {A}\Vert \Vert {x}- \bar{{x}}\Vert \) we can show that
Using this estimate, the proof of (14) can be completed similarly to that in [31]. \(\square \)
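For reference, the standard estimate behind this chain-rule argument reads as follows (a sketch under the usual assumption that \(\nabla{\varphi ^{*}_{\gamma }}\) is \((1/\gamma )\)-Lipschitz, as in Nesterov's smoothing framework; it is not a reproduction of (9) or (14) from the main text):
\[
f_{\gamma }({x}) - f_{\gamma }(\bar{{x}}) - \langle \nabla {f_{\gamma }}(\bar{{x}}), {x}- \bar{{x}}\rangle
= \varphi ^{*}_{\gamma }({z}) - \varphi ^{*}_{\gamma }(\bar{{z}}) - \langle \nabla {\varphi ^{*}_{\gamma }}(\bar{{z}}), {z}- \bar{{z}}\rangle
\le \frac{1}{2\gamma }\Vert {z}- \bar{{z}}\Vert ^2
\le \frac{\Vert {A}\Vert ^2}{2\gamma }\Vert {x}- \bar{{x}}\Vert ^2,
\]
with \({z} = {A}^{\top }{x}\) and \(\bar{{z}} = {A}^{\top }\bar{{x}}\).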
1.2 The proof of Lemma 3: key estimate
We first substitute \(\beta = \frac{\gamma _{k+1}}{\Vert {A}\Vert ^2}\) into (14) and use (15) to obtain
Multiplying this inequality by \((1-\tau _k)\) and (14) by \(\tau _k\), and summing up the results, we obtain
where \(\hat{\ell }_{\gamma }^k({x}) := f_{\gamma }(\hat{{x}}^k) + \langle \nabla {f_{\gamma }}(\hat{{x}}^k), {x}- \hat{{x}}^k\rangle + g({x})\).
From (16), we have \(\tau _k\tilde{{x}}^k = \hat{{x}}^k - (1-\tau _k){x}^k\); hence, we can write this inequality as
Using (10) with \(\gamma := \gamma _{k+1}\), \(\hat{\gamma } := \gamma _k\) and \({z}:= {A}^{\top }{x}^k\), we get
which leads to (recall that \(F_{\gamma } = f_{\gamma } + g\)):
Next, we estimate \(\hat{\ell }_{\gamma _{k+1}}^k\). Using the definition of \(f_{\gamma }\) and \(\nabla {f}_{\gamma }\), we can deduce
Substituting \(\tilde{{x}}^{k+1} := \tilde{{x}}^k - \frac{1}{\tau _k}(\hat{{x}}^k - {x}^{k+1})\) from the third line of (16) together with (42), and (43) into (41), we can derive
which is indeed (18), where \(R_k\) is given by (19).
Finally, we prove (20). Indeed, using the strong convexity and the \(L_b\)-smoothness of \(b_{\mathcal {U}}\), we can lower bound
Letting \(\hat{{v}}_k := {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) - \bar{{u}}^c\) and \( {v}_k := {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - \bar{{u}}^c\), we write \(R_k\) as
which obviously implies (20). \(\square \)
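The two-sided bound on \(b_{\mathcal {U}}\) used at the start of this argument is the standard one (a sketch, assuming \(b_{\mathcal {U}}\) is \(1\)-strongly convex and \(L_b\)-smooth, which is the usual normalization for prox-functions):
\[
\frac{1}{2}\Vert {u}- {v}\Vert ^2 \;\le \; b_{\mathcal {U}}({u}) - b_{\mathcal {U}}({v}) - \langle \nabla {b_{\mathcal {U}}}({v}), {u}- {v}\rangle \;\le \; \frac{L_b}{2}\Vert {u}- {v}\Vert ^2, \qquad {u}, {v}\in \mathcal {U}.
\]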
1.3 The proof of Lemma 4: the choice of parameters
First, using the update rules of \(\tau _k\) and \(\gamma _k\) in (21), we can express the quantity \(m_k\) as
Moreover, it follows from the properties of \(b_{\mathcal {U}}\) that
Multiplying the lower bound (20) by \(\frac{\gamma _{k+1}}{\tau _k^2}\) and combining the result with the last inequality and the estimate of \(m_k\), we obtain the first lower bound in (22).
Next, using the update rules (21) of \(\tau _k\) and \(\gamma _k\), we have \(\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2} = \frac{(k+\bar{c}-1)\gamma _1\bar{c}(k+\bar{c})^2}{(k+\bar{c})^2} = \gamma _1\bar{c}(k+\bar{c}-1) = \frac{\gamma _1\bar{c}(k+\bar{c}-1)^2}{(k+\bar{c}-1)} = \frac{\gamma _k}{\tau _{k-1}^2}\), which is the second equality in (22).
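This identity is easy to check numerically; a small verification sketch, assuming the update rules \(\tau _k = \frac{1}{k+\bar{c}}\) and \(\gamma _{k+1} = \frac{\bar{c}\gamma _1}{k+\bar{c}}\) recalled in the proof of Theorem 2:

```python
import numpy as np

gamma1, c_bar = 2.0, 3.0                               # arbitrary test values
tau = lambda k: 1.0 / (k + c_bar)                      # tau_k = 1/(k + c_bar)
gam = lambda k: c_bar * gamma1 / (k - 1 + c_bar)       # gamma_k = c_bar*gamma1/(k - 1 + c_bar)

for k in range(1, 50):
    lhs = (1.0 - tau(k)) * gam(k + 1) / tau(k) ** 2
    rhs = gam(k) / tau(k - 1) ** 2
    # both sides equal gamma_1 * c_bar * (k + c_bar - 1)
    assert np.isclose(lhs, rhs) and np.isclose(lhs, gamma1 * c_bar * (k + c_bar - 1))
```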
Using (18) with the lower bound of \(R_k\) from (22), we have
where \(\Delta {F}_k := F_{\gamma _k}({x}^k) - F^{\star }\) and \(s_k := \frac{\gamma _1^2\bar{c}^2\left[ (L_b-1)(k + \bar{c}) + 1\right] }{(k + \bar{c})^2}\). Using this inequality and the relation \(\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2} = \frac{\gamma _k}{\tau _{k-1}^2}\) in (22), we can easily show that
By induction, we obtain from the last inequality that
which implies (23), where \(S_k := \sum _{i=0}^k s_i = \gamma _1^2\bar{c}^2\sum _{i=0}^k\frac{\left[ (L_b-1)(i + \bar{c}) + 1\right] }{(i + \bar{c})^2}\).
Finally, to prove (24), we use two elementary inequalities \(\sum _{i=1}^{k+\bar{c}}\frac{1}{i} < 1 + \ln (k+\bar{c})\) and \(\sum _{i=0}^k\frac{1}{(i+\bar{c})^2} \le \frac{1}{\bar{c}^2} + \sum _{i=1}^k\frac{1}{(i+\bar{c}-1)(i+\bar{c})} < \frac{1}{\bar{c}^2} + \frac{1}{\bar{c}}\). \(\square \)
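Both elementary inequalities are straightforward to verify numerically; a small sanity check (the first one requires \(k + \bar{c} \ge 2\)):

```python
import math

c_bar = 2
for k in range(1, 200):
    n = k + c_bar
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    assert harmonic < 1.0 + math.log(n)                     # sum_{i=1}^{k+c} 1/i < 1 + ln(k+c)
    tail = sum(1.0 / (i + c_bar) ** 2 for i in range(0, k + 1))
    assert tail < 1.0 / c_bar ** 2 + 1.0 / c_bar            # sum_{i=0}^{k} 1/(i+c)^2 < 1/c^2 + 1/c
```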
1.4 The proof of Corollary 1: the smooth accelerated gradient method
First, similarly to the proof of (44), we can derive
where \(\Delta {F}_k := F_{\gamma _{k}}({x}^k) - F^{\star }\), and \(\hat{s}_k := \frac{\gamma _{k+1}(1-\tau _k)\left[ L_b(\gamma _k - \gamma _{k+1}) - \gamma _{k+1}\tau _k\right] }{\tau _k^2(L_g\gamma _{k+1}+\Vert {A}\Vert ^2)}\).
Next, we impose the condition \(\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2(L_g\gamma _{k+1} + \Vert {A}\Vert ^2)} = \frac{\gamma _k}{\tau _{k-1}^2(L_g\gamma _k + \Vert {A}\Vert ^2)}\) and choose \(\tau _k = \frac{1}{k+1}\). Then, we can show from this condition that \(\gamma _{k+1} = \frac{k\gamma _k\Vert {A}\Vert ^2}{L_g\gamma _k + \Vert {A}\Vert ^2(k+1)}\). Now, we show that \(\gamma _k \le \frac{\gamma _1}{k}\). Indeed, we have \(\frac{1}{\gamma _{k+1}} = \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k} + \frac{L_g}{\Vert {A}\Vert ^2k} \ge \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k}\), which implies that \(\gamma _{k+1} \le \frac{k}{k+1}\gamma _k\). By induction, we get \(\gamma _{k+1} \le \frac{\gamma _1}{k+1}\). On the other hand, for \(k\ge 2\), the condition \(\frac{1}{\gamma _{k+1}} = \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k} + \frac{L_g}{\Vert {A}\Vert ^2k} \le \frac{1}{\gamma _k} \left( \frac{k}{k-1}\right) \) is equivalent to \(\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_g(k-1)}\), which holds because \(\gamma _k \le \frac{\gamma _1}{k} = \frac{\Vert {A}\Vert ^2}{L_gk}\) due to the choice of \(\gamma _1\). Hence, with the choice \(\gamma _1 := \frac{\Vert {A}\Vert ^2}{L_g}\), the estimate \(\frac{1}{\gamma _{k+1}} \le \frac{1}{\gamma _k} \left( \frac{k}{k-1}\right) \) and the update rule of \(\gamma _k\) eventually imply
This condition leads to \(\frac{\tau _{k-1}^2(L_g\gamma _k + \Vert {A}\Vert ^2)}{\gamma _k} = L_g\tau _{k-1}^2 + \frac{\tau _{k-1}^2}{\gamma _k}\Vert {A}\Vert ^2 \le \frac{L_g}{k^2} + \frac{3L_g(k-1)}{k^2} \le \frac{3L_g}{k}\). Using the estimates of \(\tau _k\) and \(\gamma _k\), we can easily show that \(\hat{s}_k \le \frac{L_b\Vert {A}\Vert ^2}{L_g^2k(k+2)} + \frac{L_b\Vert {A}\Vert ^2}{L_g^2(k+1)(k+2)} + \frac{(L_b-1)\Vert {A}\Vert ^2k}{L_g(k+1)(k+2)}\). Hence, we can prove that
Using this estimate, we can show that
Finally, using the bound (11) and \(\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_g(k+1)} < \frac{\Vert {A}\Vert ^2}{L_gk}\), we obtain (32). \(\square \)
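As a side remark, the update rule \(\gamma _{k+1} = \frac{k\gamma _k\Vert {A}\Vert ^2}{L_g\gamma _k + \Vert {A}\Vert ^2(k+1)}\) with \(\gamma _1 = \frac{\Vert {A}\Vert ^2}{L_g}\) admits the closed form \(\gamma _k = \frac{\Vert {A}\Vert ^2}{L_g(2k-1)}\) (by induction on the recursion), from which \(\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_gk}\) and, for \(k\ge 2\), \(\frac{1}{\gamma _k} \le \frac{3L_g(k-1)}{\Vert {A}\Vert ^2}\) follow immediately. This closed form is not stated explicitly above; the snippet below checks it numerically under arbitrary test values for \(L_g\) and \(\Vert {A}\Vert ^2\).

```python
import numpy as np

Lg, normA2 = 4.0, 9.0                     # test values for L_g and ||A||^2
gamma = normA2 / Lg                       # gamma_1 = ||A||^2 / L_g
for k in range(1, 200):
    assert np.isclose(gamma, normA2 / (Lg * (2 * k - 1)))    # closed form gamma_k
    assert gamma <= normA2 / (Lg * k) + 1e-12                # gamma_k <= ||A||^2/(L_g k)
    if k >= 2:
        assert gamma >= normA2 / (3 * Lg * (k - 1)) - 1e-12  # 1/gamma_k <= 3 L_g (k-1)/||A||^2
    gamma = k * gamma * normA2 / (Lg * gamma + normA2 * (k + 1))   # update rule from the proof
```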
1.5 The proof of Theorem 2: primal solution recovery
Let \(\Delta {F}_k := F_{\gamma _{k}}({x}^k) - F^{\star }\). Then, by (11), we have \(\Delta {F}_k \ge F({x}^k) - F^{\star }- \gamma _kD_{\mathcal {U}} \ge -\gamma _kD_{\mathcal {U}}\). Similar to the proof of Lemma 4, we can prove that
where \(s_i := \frac{\left[ (L_b-1)(i + \bar{c}) + 1\right] }{(i + \bar{c})^2}\) as in the proof of Lemma 4, and \(\Delta {\hat{\ell }}_{\gamma _{k+1}}^k({x}) = \big \langle {x}, {A}{u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)) + g({x}) - F^{\star }= \big \langle {x}, {A}{u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k) - {b}\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)) + s_{\mathcal {K}}({x}) - F^{\star }\). Summing up this inequality from \(i=1\) to \(i=k\) and using \(\tau _0 = 1\) and \(\tilde{{x}}^0 ={x}^0\), we obtain
where \(S_k := \sum _{i=1}^ks_i\). Now, using (46) again with \(k=1\), \({x}^0 = \tilde{{x}}^0\) and \(\tau _0 = 1\), we get \(\gamma _1\Delta {F}_1 \le \gamma _1\tau _0\Delta {\hat{\ell }}_{\gamma _1}(x) + \frac{\Vert {A}\Vert ^2}{2}\left( \Vert {x}^0 - {x}^{\star }\Vert ^2 - \Vert \tilde{{x}}^1 - {x}^{\star }\Vert ^2\right) \). Substituting this into (47), we obtain
Combining \(\bar{{u}}^k\) defined by (35) with \(w_i :=\frac{\gamma _{i+1}}{\tau _i}\), and the convexity of \(\varphi \), we have
Substituting this into (48) and then using \(\Delta {F}_{k+1} \ge -\gamma _{k+1}D_{\mathcal {U}}\) we get
which implies
Rearranging this inequality, we get
where we use the relation \(-s_{\mathcal {K}}({x}) = -\sup _{{r}\in \mathcal {K}}\langle {x}, {r}\rangle = \inf _{{r}\in \mathcal {K}}\langle {x}, -{r}\rangle \). On the other hand, by the saddle point theory for the primal and dual problems (33) and (34), for any optimal solution \({x}^{\star }\), we can show that
Since this inequality holds for any \({r}\in \mathcal {K}\) and \({u}\in \mathcal {U}\), by using \({u}= \bar{{u}}^k\), it leads to
Combining (49) and (50) yields
Taking \({x}:= {x}^0 - \Vert {A}\Vert ^{-2}\varGamma _k({A}\bar{{u}}^k - {b}+ {r})\) for any \({r}\in \mathcal {K}\), we obtain from (51) that
which implies (by the Cauchy–Schwarz inequality)
By elementary calculations and \(\mathrm {dist}\left( {b}- {A}\bar{{u}}^k, \mathcal {K}\right) = \inf \left\{ \Vert {A}\bar{{u}}^k-{b}+ {r}\Vert : {r}\in \mathcal {K}\right\} \), we can show from the last inequality that
To prove the first estimate of (37), we use (49) with \({x}= \varvec{0}^p\) and \(F^{\star } = -\varphi ^{\star }\) to get
Since we apply Algorithm 1(c) to solve the dual problem (34) using \(b_{\mathcal {U}}\) such that \(L_b = 1\), we have \(S_k \le 2\gamma _1^2\). Then, by using \(\gamma _{k+1} = \frac{\bar{c}\gamma _1}{k+\bar{c}}\) and \(\tau _k := \frac{1}{k+\bar{c}}\), we can show that \(\frac{\gamma _{k+1}^2}{\tau _k^2} = \gamma _1^2\bar{c}^2\). Moreover, we also have \(\varGamma _k := \sum _{i=0}^k\frac{\gamma _{i+1}}{\tau _i} = \gamma _1\bar{c}(k+1)\). Using these estimates and \(S_k \le 2\gamma _1^2\) (see (4)) in (52) and (53), we obtain (37). For the left-hand side inequality in the first estimate of (37), we use the simple bound \(-\Vert {x}^{\star }\Vert \mathrm {dist}\left( {b}- {A}{u},\mathcal {K}\right) \le \varphi ({u}) - \varphi ^{\star }\) for \({u}= \bar{{u}}^k \in \mathcal {U}\), which follows from the saddle point theory as in (50). \(\square \)
Cite this article
Tran-Dinh, Q. Adaptive smoothing algorithms for nonsmooth composite convex minimization. Comput Optim Appl 66, 425–451 (2017). https://doi.org/10.1007/s10589-016-9873-6
Keywords
- Nesterov’s smoothing technique
- Accelerated proximal-gradient method
- Adaptive algorithm
- Composite convex minimization
- Nonsmooth convex optimization