Abstract
In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach, motivated by regularized least squares estimation, generates a parametric family with the original BFGS update at one extreme and not updating the inverse Hessian approximation at the other extreme. Furthermore, we find that the curvature condition is relaxed as the family moves towards not updating the inverse Hessian approximation, and disappears entirely at that extreme. These developments lead to a method we call Secant Penalized BFGS (SP-BFGS), which relaxes the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, replacing the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.
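A minimal schematic of the penalized secant idea described above, in generic notation (the paper's exact weighted norms and formulation are not reproduced here):
\[
H_{k+1} \in \mathop {\mathrm {arg\,min}}_{H = H^T} \; \left\| H - H_k \right\| ^2 + \beta _k \left\| H y_k - s_k \right\| ^2 ,
\]
so that \(\beta _k \rightarrow \infty\) enforces the secant condition \(H y_k = s_k\) and recovers the BFGS update, while \(\beta _k \rightarrow 0\) leaves \(H_{k+1} = H_k\) unchanged.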
Data availability
The CUTEst test problems used in the numerical experiments are available at https://www.cuter.rl.ac.uk/Problems/mastsif.shtml.
References
Aydin, L., Aydin, O., Artem, H.S., Mert, A.: Design of dimensionally stable composites using efficient global optimization method. Proc. Inst. Mech. Eng. Part L: J. Mater. Design Appl. 233(2), 156–168 (2019). https://doi.org/10.1177/1464420716664921
Berahas, A.S., Byrd, R.H., Nocedal, J.: Derivative-free optimization of noisy functions via quasi-Newton methods. SIAM J. Optim. 29, 965–993 (2019). https://doi.org/10.1137/18M1177718
Besançon, M., Anthoff, D., Arslan, A., Byrne, S., Lin, D., Papamarkou, T., Pearson, J.: Distributions.jl: definition and modeling of probability distributions in the JuliaStats ecosystem. arXiv e-prints arXiv:1907.08611 (2019)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017). https://doi.org/10.1137/141000671
Bons, N.P., He, X., Mader, C.A., Martins, J.R.R.A.: Multimodality in aerodynamic wing design optimization. AIAA J. 57(3), 1004–1018 (2019). https://doi.org/10.2514/1.J057294
Broyden, C.G.: The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA J. Appl. Math. 6(1), 76–90 (1970). https://doi.org/10.1093/imamat/6.1.76
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016). https://doi.org/10.1137/140954362
Byrd, R.H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995). https://doi.org/10.1137/0916069
Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26(3), 727–739 (1989). https://doi.org/10.2307/2157680
Byrd, R.H., Nocedal, J., Yuan, Y.X.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 24(5), 1171–1190 (1987). https://doi.org/10.2307/2157646
Chang, D., Sun, S., Zhang, C.: An accelerated linearly convergent stochastic L-BFGS algorithm. IEEE Trans. Neural Netw. Learn. Syst. 30(11), 3338–3346 (2019). https://doi.org/10.1109/TNNLS.2019.2891088
Fasano, G., Pintér, J.D.: Modeling and Optimization in Space Engineering: State of the Art and New Challenges. Springer (2019). https://doi.org/10.1007/978-1-4614-4469-5
Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970). https://doi.org/10.1093/comjnl/13.3.317
Gal, R., Haber, E., Irwin, B., Saleh, B., Ziv, A.: How to catch a lion in the desert: on the solution of the coverage directed generation (CDG) problem. Optim. Eng. 22, 217–245 (2021). https://doi.org/10.1007/s11081-020-09507-w
Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23–26 (1970). https://doi.org/10.2307/2004873
Gould, N.I.M., Orban, D., contributors: The Constrained and Unconstrained Testing Environment with safe threads (CUTEst) for optimization software. https://github.com/ralna/CUTEst (2019)
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEr a Constrained and Unconstrained Testing Environment, revisited. https://www.cuter.rl.ac.uk (2001)
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEr and SifDec: a constrained and unconstrained testing environment, revisited. ACM Trans. Math. Softw. 29(4), 373–394 (2003). https://doi.org/10.1145/962437.962439
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2015). https://doi.org/10.1007/s10589-014-9687-3
Gower, R., Goldfarb, D., Richtarik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 1869–1878. PMLR, New York, New York, USA (2016). http://proceedings.mlr.press/v48/gower16.html
Graf, P.A., Billups, S.: MDTri: robust and efficient global mixed integer search of spaces of multiple ternary alloys. Comput. Optim. Appl. 68(3), 671–687 (2017). https://doi.org/10.1007/s10589-017-9922-9
Güler, O., Gürtuna, F., Shevchenko, O.: Duality in quasi-Newton methods and new variational characterizations of the DFP and BFGS updates. Optim. Methods Softw. 24(1), 45–62 (2009). https://doi.org/10.1080/10556780802367205
Hager, W.W.: Updating the inverse of a matrix. SIAM Review 31(2), 221–239 (1989). https://doi.org/10.2307/2030425
Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, New York (2013). https://doi.org/10.1017/CBO9781139020411
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013). https://doi.org/10.5555/2999611.2999647
Johnson, S.G.: Quasi-Newton optimization: origin of the BFGS update (2019). https://ocw.mit.edu/courses/mathematics/18-335j-introduction-to-numerical-methods-spring-2019/week-11/MIT18_335JS19_lec30.pdf
Keane, A.J., Nair, P.B.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. Wiley (2005). https://doi.org/10.1002/0470855487
Kelley, C.: Implicit Filtering. SIAM, Philadelphia (2011). https://doi.org/10.1137/1.9781611971903
Koziel, S., Ogurtsov, S.: Antenna Design by Simulation-Driven Optimization. Springer (2014). https://doi.org/10.1007/978-3-319-04367-8
Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141, 135–163 (2013). https://doi.org/10.1007/s10107-012-0514-2
Lin, D., White, J.M., Byrne, S., Bates, D., Noack, A., Pearson, J., Arslan, A., Squire, K., Anthoff, D., Papamarkou, T., Besançon, M., Drugowitsch, J., Schauer, M., other contributors: JuliaStats/Distributions.jl: a Julia package for probability distributions and associated functions. https://github.com/JuliaStats/Distributions.jl (2019). https://doi.org/10.5281/zenodo.2647458
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989). https://doi.org/10.1007/BF01589116
Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16(1), 3151–3181 (2015). https://doi.org/10.5555/2789272.2912100
Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Gretton, A., Robert, C.C. (eds.) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 51, pp. 249–258. PMLR, Cadiz, Spain (2016). http://proceedings.mlr.press/v51/moritz16.html
Muñoz-Rojas, P.A.: Computational Modeling, Optimization and Manufacturing Simulation of Advanced Engineering Materials. Springer (2016). https://doi.org/10.1007/978-3-319-04265-7
Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Orban, D., Siqueira, A.S., contributors: CUTEst.jl: Julia’s CUTEst interface. https://github.com/JuliaSmoothOptimizers/CUTEst.jl (2020). https://doi.org/10.5281/zenodo.1188851
Orban, D., Siqueira, A.S., contributors: NLPModels.jl: Data structures for optimization models. https://github.com/JuliaSmoothOptimizers/NLPModels.jl (2020). https://doi.org/10.5281/zenodo.2558627
Powell, M.J.D.: Algorithms for nonlinear constraints that use Lagrangian functions. Math. Program. 14(1), 224–248 (1978). https://doi.org/10.1007/BF01588967
Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. Comput. J. 3(3), 175–184 (1960). https://doi.org/10.1093/comjnl/3.3.175
Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Meila, M., Shen, X. (eds.) Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 2, pp. 436–443. PMLR, San Juan, Puerto Rico (2007). http://proceedings.mlr.press/v2/schraudolph07a.html
Shanno, D.F.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24(111), 647–656 (1970). https://doi.org/10.2307/2004840
Shi, H.J.M., Xie, Y., Byrd, R., Nocedal, J.: A noise-tolerant quasi-Newton algorithm for unconstrained optimization. SIAM J. Optim. 32(1), 29–55 (2022). https://doi.org/10.1137/20M1373190
Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017). https://doi.org/10.1137/15M1053141
Xie, Y., Byrd, R.H., Nocedal, J.: Analysis of the BFGS method with errors. SIAM J. Optim. 30(1), 182–209 (2020). https://doi.org/10.1137/19M1240794
Zhao, R., Haskell, W.B., Tan, V.Y.F.: Stochastic L-BFGS: improved convergence rates and practical acceleration strategies. IEEE Trans. Signal Process. 66, 1155–1169 (2018). https://doi.org/10.1109/TSP.2017.2784360
Zhu, J.: Optimization of Power System Operation. Wiley (2008). https://doi.org/10.1002/9780470466971
Acknowledgements
EH and BI’s work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia (UBC).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Proof of Theorem 1
To produce the SP-BFGS update, we first rearrange (26a), revealing that
and so the symmetry requirement that \(H = H^T\) means transposing (77) gives
which rearranges to
and so
Next, we right multiply (80) by \(y_k\) to get
and use (26b) to get that
We now left multiply both sides by \(-2 W\) and rearrange, giving
This can be rearranged so that u is isolated, giving
To get rid of the \(u^T\) on the right hand side, we first left multiply both sides by \(y_k^T W^{-1}\), and then transpose to get
where we have taken advantage of the fact that the transpose of a scalar returns the same scalar. This now allows us to solve for \(u^T W^{-1} y_k\) using some basic algebra, resulting in
Substituting (86) into (84) gives
Now, if we substitute the expression for u in (87) into (80), after some simplification we get
We further simplify by applying \(W s_k = y_k\), and thus \(W^{-1} y_k = s_k\), revealing
which, after a bit of algebra, reveals that the update formula solving the system defined by (26a), (26b), and (26c) can be expressed as
We can make (89) look similar to the common form of the BFGS update given in (19) by defining the two quantities \(\gamma _k\) and \(\omega _k\) as in (28) and observing that completing the square gives
which is equivalent to
concluding the proof.
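For reference, the common form of the BFGS update referred to as (19) above is the standard textbook formula for the inverse Hessian approximation:
\[
H_{k+1} = \left( I - \rho _k s_k y_k^T \right) H_k \left( I - \rho _k y_k s_k^T \right) + \rho _k s_k s_k^T , \qquad \rho _k = \frac{1}{y_k^T s_k} ,
\]
which, per the abstract, one extreme of the SP-BFGS family recovers.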
Appendix 2: Proof of Lemma 1
The \(H_{k+1}\) given by (27) has the general form
\[
H_{k+1} = G^T H_k G + d \, s_k s_k^T \qquad (92)
\]
with the specific choices
By definition, \(H_{k+1}\) is positive definite if
\[
v^T H_{k+1} v > 0 \quad \text {for all } v \in {\mathbb {R}}^{n} \setminus 0 . \qquad (94)
\]
We first show that (29) is a sufficient condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. By applying (92) to (94), we see that
\[
v^T \left( G^T H_k G + d \, s_k s_k^T \right) v > 0 \qquad (95)
\]
must be true for the choices of G and d in (93) if \(H_{k+1}\) is positive definite. Substituting (93) into (95) reveals that
\[
v^T G^T H_k G v + d \left( s_k^T v \right)^2 > 0 \qquad (96)
\]
must be true for all \(v \in {\mathbb {R}}^{n} \setminus 0\) if \(H_{k+1}\) is positive definite. Both \((s_k^T v)^2\) and \(v^T G^T H_k G v\) are always nonnegative. To see that \(v^T G^T H_k G v \ge 0\), note that because \(H_k\) is positive definite, it has a principal square root \(H_k^{1/2}\), and so
\[
v^T G^T H_k G v = \left\| H_k^{1/2} G v \right\| _2^2 \ge 0 . \qquad (97)
\]
We now observe that if \(d > 0\), the right term \(d (s_k^T v)^2\) in (96) is zero if and only if \((s_k^T v) = 0\). However, if \((s_k^T v) = 0\), then the left term \(v^T G^T H_k G v\) in (96) is zero only when \(v = 0\). Hence, the condition \(d > 0\) guarantees that (96) is true for all v excluding the zero vector, and thus that \(H_{k+1}\) is positive definite. The condition \(d > 0\) expands to
\[
\gamma _k + \omega _k (\gamma _k - \omega _k) y_k^T H_k y_k > 0 . \qquad (98)
\]
Using the definitions of \(\gamma _k\) and \(\omega _k\) in (28), it is clear that \((\gamma _k - \omega _k) \ge 0\), as \(\beta _k\) can only take nonnegative values. Furthermore, as \(H_k\) is positive definite, \(y_k^T H_k y_k \ge 0\) for all \(y_k\). As it is possible for \((\gamma _k - \omega _k) y_k^T H_k y_k\) to be zero, we require \(\gamma _k > 0\). The condition \(\gamma _k > 0\) immediately gives (29), as \(\gamma _k\) can only be positive if the denominator in its definition is positive. Finally, as \(\beta _k\) can only take nonnegative values, (29) also ensures that \(\omega _k\) is nonnegative, and so when (29) is true, \(\omega _k (\gamma _k - \omega _k) y_k^T H_k y_k \ge 0\). In summary, we have shown that the condition (29) ensures that the left term in (98) is positive, and the right term nonnegative, so \(d > 0\), and thus \(H_{k+1}\) is positive definite.
We now show that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. If \(H_{k+1}\) is positive definite, then
\[
y_k^T H_{k+1} y_k > 0 , \qquad (99)
\]
assuming \(y_k \ne 0\). Substituting (26b) into (99) gives
and using (86) shows that (100) is equivalent to
Now, some algebra shows that
and we also know that because \(H_k\) is positive definite, \(y_k^T H_k y_k > 0\) for all \(y_k \ne 0\), by definition \(\beta _k \ge 0\), and by the definition of the square of a real number, \((y_k^T s_{k})^2 \ge 0\). As a result,
is guaranteed only if the denominator \(1 + \beta _k y_k^T s_{k}\) is positive, which occurs when
\[
y_k^T s_{k} > -\frac{1}{\beta _k} .
\]
This establishes that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite, and concludes the proof.
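Assuming condition (29) takes the form recovered above, \(y_k^T s_k > -1/\beta _k\), its limiting behaviour matches the description in the abstract:
\[
y_k^T s_k > -\frac{1}{\beta _k} \;\; \longrightarrow \;\; {\left\{ \begin{array}{ll} y_k^T s_k > 0 , &{} \beta _k \rightarrow \infty \ \text {(the classical BFGS curvature condition)} , \\ \text {no restriction} , &{} \beta _k \rightarrow 0^{+} \ \text {(the inverse Hessian approximation is not updated).} \end{array}\right. }
\]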
Appendix 3: Proof of Theorem 2
The Sherman-Morrison-Woodbury formula says that, whenever the relevant inverses exist,
\[
\left( A + UCV \right)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1} . \qquad (105)
\]
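As a quick numerical sanity check of this identity (illustrative only, not part of the proof; assumes NumPy and a randomly generated well-conditioned instance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # diagonal shift keeps A safely invertible
U = rng.normal(size=(n, k))
C = rng.normal(size=(k, k)) + k * np.eye(k)   # invertible k x k block
V = rng.normal(size=(k, n))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
assert np.allclose(lhs, rhs)                  # the identity holds to machine precision
```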
Now, observe that the SP-BFGS update (27) can be written in the factored form
Applying the Sherman-Morrison-Woodbury formula (105) to the factored SP-BFGS update (106) with
yields
Inverting C here gives
and we also have
which is just a \(2 \times 2\) matrix with real entries. Now, it becomes clear that
For notational compactness, let
so
where the determinant of D is
and we have used the fact that \(y_k^T s_k = s_k^T y_k\), as this is a scalar quantity. Next,
so \(U \det (D) D^{-1} V\) fully expanded becomes
This looks rather ugly at the moment, but we continue by breaking the problem down further, noting that
and
The above intermediate results further simplify \(U \det (D) D^{-1} V\) to
Left and right multiplying the line immediately above by \(A^{-1} = H_k^{-1}\) gives
and thus, after dividing out \(\det (D)\) and applying \(B_{k} = H_{k}^{-1}\), we arrive at the following final formula
for the SP-BFGS inverse update, which concludes the proof.
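To make the inverse-pair relationship concrete, the following sketch numerically verifies the analogous duality for the classical BFGS update (the textbook formulas for the \(\beta _k \rightarrow \infty\) extreme of the family, not the SP-BFGS update itself); it assumes NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Random symmetric positive definite H_k and a curvature pair with y^T s > 0
M = rng.normal(size=(n, n))
H = M @ M.T + n * np.eye(n)
s, y = rng.normal(size=n), rng.normal(size=n)
if y @ s < 0:
    y = -y                      # enforce the classical curvature condition y^T s > 0
rho = 1.0 / (y @ s)

I = np.eye(n)
# H-form (inverse Hessian approximation) update
H_next = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
# B-form (Hessian approximation) update
B = np.linalg.inv(H)
B_next = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (y @ s)

assert np.allclose(np.linalg.inv(H_next), B_next)   # the two forms are exact inverses
```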
Appendix 4: Proof of Theorem 3
Referring to Theorem 2, taking the trace of both sides of (107) and applying the linearity and cyclic invariance properties of the trace yields
where
with \({\hat{D}}\) and \({\hat{E}}\) defined as
We now observe that after applying some basic algebra, and recalling that \(B_k\) is positive definite, one can deduce that for all \(\beta _k \in [0, +\infty ]\), the following inequalities hold
By minimizing the absolute value of the common denominator in \(\kappa _2, \kappa _3\), and \(\kappa _4\) using the inequalities above, one can obtain the bounds
As a result,
and applying \(\lambda _{max}(B_k) < {{\,\textrm{Tr}\,}}(B_k)\) establishes (53). Similarly, referring to (89) reveals the upper bound
To establish (52), we apply \(\lambda _{max}(H_k) < {{\,\textrm{Tr}\,}}(H_k)\) and \(\omega _k \le \gamma _k\) to the line above, and then factor. This completes the proof.
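For completeness, the eigenvalue-trace bound invoked in the last two steps: if \(B \in {\mathbb {R}}^{n \times n}\) is symmetric positive definite and \(n \ge 2\), then
\[
\lambda _{max}(B) < \sum _{i=1}^{n} \lambda _i(B) = {{\,\textrm{Tr}\,}}(B) ,
\]
with strict inequality because the remaining \(n - 1\) eigenvalues are positive.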
Appendix 5: Proof of Lemma 2
As \(\phi\) is m-strongly convex due to Assumption 3, it is true that
\[
\phi (y) \ge \phi (x) + \nabla \phi (x)^T (y - x) + \frac{m}{2} \left\| y - x \right\| _2^2 \quad \text {for all } x, y . \qquad (118)
\]
Note that for any fixed x, the right side of (118) provides a global quadratic lower bound on \(\phi\). As these bounds are global lower bounds, minimizing both sides of (118) with respect to y preserves the inequality, so
\[
\phi ^{\star } \ge \phi (x) - \frac{1}{2m} \left\| \nabla \phi (x)\right\| _2^2 , \qquad (119)
\]
which simplifies to
\[
\left\| \nabla \phi (x)\right\| _2^2 \ge 2m \left( \phi (x) - \phi ^{\star } \right) . \qquad (120)
\]
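The minimization step here is the standard one: the right side of (118) is a convex quadratic in y, so setting its gradient with respect to y to zero gives
\[
\nabla \phi (x) + m (y - x) = 0 \quad \Longrightarrow \quad y = x - \frac{1}{m} \nabla \phi (x) ,
\]
and substituting this minimizer back into the right side of (118) yields the value \(\phi (x) - \frac{1}{2m} \left\| \nabla \phi (x)\right\| _2^2\) appearing in (119).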
Proceeding, recalling that the measured gradient is \(g(x) = \nabla \phi (x) + e(x)\), the inner product condition \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) expands to
\[
\nabla \phi (x)^T H \nabla \phi (x) + \nabla \phi (x)^T H e(x) > \xi \left\| \nabla \phi (x)\right\| _2 . \qquad (121)
\]
The smallest possible value of \(\nabla \phi (x)^T H \nabla \phi (x)\) is
\[
\psi \left\| \nabla \phi (x)\right\| _2^2 . \qquad (122)
\]
By applying the Cauchy-Schwarz inequality and Assumption 2, the most negative possible value of \(\nabla \phi (x)^T H e(x)\) is
\[
- {\varPsi } {\bar{\epsilon }}_g \left\| \nabla \phi (x)\right\| _2 . \qquad (123)
\]
Thus, we see that if
\[
\psi \left\| \nabla \phi (x)\right\| _2^2 - {\varPsi } {\bar{\epsilon }}_g \left\| \nabla \phi (x)\right\| _2 > \xi \left\| \nabla \phi (x)\right\| _2 , \qquad (124)
\]
which rearranges to
\[
\left\| \nabla \phi (x)\right\| _2 > \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } , \qquad (125)
\]
then \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) is guaranteed. Note that (125) implies (126)
when combined with the inner product condition. Combining (125) with Assumption 2 and the definition of the gradient noise to signal ratio \(\delta (x)\) given by (58) reveals that
\[
\delta (x) = \frac{\left\| e(x)\right\| _2}{\left\| \nabla \phi (x)\right\| _2} \le \frac{{\bar{\epsilon }}_g}{\left\| \nabla \phi (x)\right\| _2} < \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } , \qquad (127)
\]
and so \(\delta (x) < \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \le \frac{\psi }{{\varPsi }}\).
Contrapositively, if \(\nabla \phi (x)^T H g(x) \le \xi \left\| \nabla \phi (x)\right\| _2\), then
\[
\left\| \nabla \phi (x)\right\| _2 \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } , \qquad (128)
\]
or if \(\delta (x) \ge \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \ge 0\), then
\[
\left\| \nabla \phi (x)\right\| _2 \le \frac{{\bar{\epsilon }}_g}{\delta (x)} \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } . \qquad (129)
\]
Squaring either inequality (128) or (129) and then combining it with a rearranged (120) given by
\[
\phi (x) - \phi ^{\star } \le \frac{1}{2m} \left\| \nabla \phi (x)\right\| _2^2 \qquad (130)
\]
gives \({\mathcal {N}}_1(\psi ,{\varPsi },\xi )\), completing the proof.
Appendix 6: Proof of Lemma 3
Similar to (122) and (123), by using the definition of \(\delta (x)\), the lower bound
and the upper bound
can be established. Observe that if the lower bound (131) is always greater than or equal to the upper bound (132)
it implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). Hence, the condition
implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). By applying Lemma 2, we see that for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), it is true that \(\delta (x) < \frac{\psi }{(1 + A) {\varPsi }}\). Thus, setting
guarantees that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\) for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), completing the proof.
Appendix 7: Proof of Theorem 4
As \(\phi \in C^2\) by Assumption 3, applying Taylor’s theorem and using (62) and strong convexity gives
where u is some convex combination of \(x_{k+1}\) and \(x_{k}\). Proceeding, note that the smallest possible region \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) from Lemma 3 occurs with the choice \(\psi = {\varPsi }\). In this case \(H = {\varPsi } I\), and (59) from Lemma 2 becomes
and so \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi = {\varPsi },{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Hence, for all possible choices of \(0 < \psi \le {\varPsi }\) in \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Combining this with Lemma 3 gives
if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). With (137) in hand, continuing to bound terms gives
where the last inequality follows from expanding
and using \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) in (63). Simplifying the last inequality reveals that
Since \(\phi\) is m-strongly convex by Assumption 3, we can apply
\[
- \left\| \nabla \phi (x_k)\right\| _2^2 \le -2m \left( \phi (x_k) - \phi ^{\star } \right) , \qquad (140)
\]
which comes from rearranging (120) in the proof of Lemma 2 (see Appendix 5). Combining (140) with (139) and Assumption 2 gives
Subtracting \(\phi ^{\star }\) from both sides, and using the notation \({\tilde{A}} :=(1+A)\), we get
which, by subtracting \(\frac{1}{2 m} \big ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \big )^2\) from both sides and simplifying, gives
thus establishing the Q-linear result. We obtain the R-linear result (64) by recursively applying the worst case bound given by the Q-linear result, noting that in the worst case if \(x_0 \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), then the sequence of iterates \(\{ x_k \}\) remains outside of \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), only approaching \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) in the limit \(k \rightarrow \infty\).
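The recursive mechanism behind these Q-linear and R-linear statements has a simple fixed-point interpretation: a contraction plus a constant noise term converges to a noise floor. The toy simulation below illustrates this (the values of \(\rho\) and \(\eta\) are placeholders, not the paper's constants; the same floor \(\eta / (1 - \rho )\) appears explicitly as \({\bar{\eta }}\) in Appendix 8):

```python
rho, eta = 0.9, 1e-3         # placeholder contraction factor and additive noise term
gap = 1.0                    # initial optimality gap phi(x_0) - phi_star
for k in range(200):
    gap = rho * gap + eta    # worst-case recursion: gap_{k+1} <= rho * gap_k + eta
print(gap, eta / (1 - rho))  # the gap approaches the noise floor eta / (1 - rho)
```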
Appendix 8: Proof of Theorem 5
From (139) in Appendix 7, if the step size \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) from (63), one has
which combines with Assumption 1 to give
The relaxed Armijo condition (38) expands to
and so the strongest possible condition (i.e. the condition requiring the greatest decrease in f) can be written as
Comparing (144) and (146) reveals that for the bound given by (144) to also imply the bound given by (146), it must be true that
which rearranges to
As \(\epsilon _{A} - {\bar{\epsilon }}_{f} > 0\), it is clear that the right side of (148) can be made arbitrarily large by sending \(\alpha \rightarrow 0\). Hence, the relaxed Armijo condition (38) will be satisfied for sufficiently small \(\alpha\) and the backtracking line search will always find an \(\alpha _k\) small enough to satisfy (38).
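This argument translates directly into a standard backtracking loop. The sketch below is illustrative only: the acceptance test assumes the relaxed Armijo condition (38) adds a fixed slack term \(2 \epsilon _A\) to the classical sufficient decrease condition, which is our assumption rather than a formula reproduced from the paper.

```python
import numpy as np

def relaxed_backtracking(f, x, p, g_dot_p, c1=1e-4, tau=0.5, eps_A=1e-6,
                         alpha0=1.0, max_backtracks=60):
    """Backtracking line search with a noise-relaxed Armijo test.

    ASSUMED form of (38): f(x + a*p) <= f(x) + c1 * a * g^T p + 2 * eps_A.
    """
    alpha, fx = alpha0, f(x)
    for _ in range(max_backtracks):
        if f(x + alpha * p) <= fx + c1 * alpha * g_dot_p + 2 * eps_A:
            break
        alpha *= tau   # shrink by the backtracking factor tau < 1
    return alpha

# Toy usage on a quadratic along a descent direction:
f = lambda x: 0.5 * float(x @ x)
x = np.array([1.0, -2.0]); g = x                 # gradient of f at x
alpha = relaxed_backtracking(f, x, -g, g @ (-g))
```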
By Lemma 2, outside of \({\mathcal {N}}_1(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), one has \(\delta _k < \frac{\psi }{(1+A){\varPsi }}\). Applying the triangle and reverse triangle inequalities to (32) gives
which can be written using the gradient noise to signal ratio \(\delta _k\) as
Combining the definition of \(\delta _k\) (see (58) in Lemma 2), (150), and \(\delta _k < \frac{\psi }{(1+A) {\varPsi }}\) with (144) gives
Now, as \({\bar{\epsilon }}_{f} < \epsilon _{A}\), the bound (154) above implies the bound (146) for any \(\alpha \le \frac{\varepsilon }{{\varPsi }^2 M}\) if \(c_1 \le \frac{\varepsilon }{2 {\varPsi }} \frac{ \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big ) }{ \big ( 1 + \frac{\psi }{(1+A) {\varPsi }} \big ) }\). Since \(\alpha _k\) is chosen using a backtracking line search with backtracking factor \(\tau < 1\), it is true that \(\frac{\tau \varepsilon }{{\varPsi }^2 M} < \alpha _k \le \frac{\varepsilon }{{\varPsi }^2 M}\). Thus, combining the bound (146) with Assumption 1 and (150) shows that
The expression (157) measures the reduction in the value of \(\phi\) for iterates where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Proceeding, we take the following bound
and subtract \(\phi ^{\star }\) from both sides as well as apply the inequality (140) to get
For ease of notation, define the following quantities
Thus, for all k where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have shown that the following bound holds
Subtracting \(\frac{\eta }{(1 - \rho )}\) from both sides shows that
and thus one has
where \({\bar{\eta }} :=\frac{\eta }{(1 - \rho )}\). Using the definitions in (160) shows that
which establishes (68) and (69). Similar to Appendix 7, we obtain the R-linear result (71) by recursively applying the bound in (68), stopping once an iterate enters \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). This concludes the proof.
Appendix 9: Extended numerical experiments
Table 5 below shows the performance of gradient descent for the same problem (ROSENBR) and noise combinations as in Table 1.
Tables 6, 7 and 8 compare the performance of SP-BFGS, BFGS, and gradient descent on the 32 CUTEst test problems with only gradient noise present (i.e. \({\bar{\epsilon }}_f = 0\)). Gradient noise was generated using \({\bar{\epsilon }}_g = 10^{-4} \left\| \nabla \phi (x^0)\right\| _2\), where the starting point \(x^0\) varies by CUTEst problem, to ensure that noise does not initially dominate gradient evaluations. By examining the mean and median columns in Tables 6, 7 and 8, one sees that SP-BFGS outperforms both BFGS and gradient descent on \(\frac{18}{32} \approx 56 \%\) of the CUTEst problems with only gradient noise present, and performs at least as well as the best performing alternative on \(\frac{28}{32} \approx 88 \%\) of these problems. Equivalently, SP-BFGS was only outperformed by BFGS or gradient descent on \(\frac{4}{32} \approx 12 \%\) of these problems.
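For concreteness, gradient noise scaled this way can be generated as in the following sketch (illustrative; `make_noisy_gradient` is our hypothetical helper, and the uniformly bounded spherical noise distribution is an assumption, not necessarily the one used in the paper's experiments):

```python
import numpy as np

def make_noisy_gradient(grad, x0, rel=1e-4, rng=None):
    """Wrap an exact gradient oracle with uniformly bounded additive noise,
    using the scaling eps_g = rel * ||grad(x0)||_2 described above."""
    rng = rng or np.random.default_rng(0)
    eps_g = rel * np.linalg.norm(grad(x0))
    def noisy_grad(x):
        e = rng.normal(size=np.shape(x))
        e *= eps_g * rng.uniform() / np.linalg.norm(e)   # ensures ||e||_2 <= eps_g
        return grad(x) + e
    return noisy_grad
```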
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Irwin, B., Haber, E. Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition. Comput Optim Appl 84, 651–702 (2023). https://doi.org/10.1007/s10589-022-00448-x