
Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition


Abstract

In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation generates a parametric family with the original BFGS update at one extreme and not updating the inverse Hessian approximation at the other extreme. Furthermore, we find that the curvature condition is relaxed as the family moves towards not updating the inverse Hessian approximation, and disappears entirely at that extreme. These developments allow us to construct a method we refer to as Secant Penalized BFGS (SP-BFGS) that relaxes the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, which replaces the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.


Data availability

The CUTEst test problems used in the numerical experiments are available at https://www.cuter.rl.ac.uk/Problems/mastsif.shtml.


Acknowledgements

EH and BI’s work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia (UBC).

Author information


Corresponding author

Correspondence to Brian Irwin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Proof of Theorem 1

To produce the SP-BFGS update, we first rearrange (26a), revealing that

$$\begin{aligned} (H - H_k) = - W^{-1} ( u y_k^T + {\varGamma }^T - {\varGamma } ) W^{-1} \end{aligned}$$
(77)

and so the symmetry requirement that \(H = H^T\) means transposing (77) gives

$$\begin{aligned} u y_k^T + {\varGamma }^T - {\varGamma } = ( u y_k^T + {\varGamma }^T - {\varGamma } )^T = y_k u^T + {\varGamma } - {\varGamma }^T \end{aligned}$$
(78)

which rearranges to

$$\begin{aligned} {\varGamma }^T - {\varGamma } = \frac{1}{2} ( y_k u^T - u y_k^T ) \end{aligned}$$
(79)

and so

$$\begin{aligned} (H - H_k) = - \frac{1}{2} W^{-1} (y_k u^T + u y_k^T) W^{-1} \text { . } \end{aligned}$$
(80)

Next, we right multiply (80) by \(y_k\) to get

$$\begin{aligned} (H - H_k) y_k = - \frac{1}{2} W^{-1} \bigg ( y_k u^T W^{-1} y_k + u (y_k^T W^{-1} y_k) \bigg ) \end{aligned}$$
(81)

and use (26b) to get that

$$\begin{aligned} s_k + \frac{W^{-1} u}{\beta _k} - H_k y_k = - \frac{1}{2} W^{-1} \bigg ( y_k u^T W^{-1} y_k + u (y_k^T W^{-1} y_k) \bigg ) \text { . } \end{aligned}$$
(82)

We now left multiply both sides by \(-2 W\) and rearrange, giving

$$\begin{aligned} -2 W (s_k - H_k y_k) = y_k u^T W^{-1} y_k + u \bigg ( y_k^T W^{-1} y_k + \frac{2}{\beta _k} \bigg ) \text { . } \end{aligned}$$
(83)

This can be rearranged so that u is isolated, giving

$$\begin{aligned} u &= \frac{-2 W (s_k - H_k y_k) - y_k u^T W^{-1} y_k}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \\ &= - \frac{2 W (s_k - H_k y_k) + y_k u^T W^{-1} y_k}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \text { . } \end{aligned}$$
(84)

To get rid of the \(u^T\) on the right hand side, we first left multiply both sides by \(y_k^T W^{-1}\), and then transpose to get

$$\begin{aligned} u^T W^{-1} y_k = - \frac{2 (s_k - H_k y_k)^T y_k + (y_k^T W^{-1} y_k) (u^T W^{-1} y_k)}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \end{aligned}$$
(85)

where we have taken advantage of the fact that the transpose of a scalar returns the same scalar. This now allows us to solve for \(u^T W^{-1} y_k\) using some basic algebra, resulting in

$$\begin{aligned} u^T W^{-1} y_k = - \frac{(s_k - H_k y_k)^T y_k}{y_k^T W^{-1} y_k + \frac{1}{\beta _k}} \text { . } \end{aligned}$$
(86)

Substituting (86) into (84) gives

$$\begin{aligned} u = \frac{y_k y_k^T (s_k - H_k y_k)}{(y_k^T W^{-1} y_k + \frac{2}{\beta _k})(y_k^T W^{-1} y_k + \frac{1}{\beta _k})} - \frac{2 W (s_k - H_k y_k)}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \text { . } \end{aligned}$$
(87)

Now, if we substitute the expression for u in (87) into (80), after some simplification we get

$$\begin{aligned} (H - H_k) = \frac{1}{\big ( y_k^T W^{-1} y_k + \frac{2}{\beta _k}\big )} \bigg [ (s_k - H_k y_k) y_k^T W^{-1} + W^{-1} y_k (s_k - H_k y_k)^T - \frac{y_k^T(s_k - H_k y_k)}{(y_k^T W^{-1} y_k + \frac{1}{\beta _k})} W^{-1} y_k y_k^T W^{-1} \bigg ] \text { . } \end{aligned}$$

Now, we further simplify by applying that \(W s_k = y_k\), and thus \(W^{-1} y_k = s_k\), revealing

$$ \begin{aligned} H = H_k + \frac{(s_k - H_k y_k) s_k^T + s_k (s_k - H_k y_k)^T}{(y_k^T s_k + \frac{2}{\beta _k})} - \frac{y_k^T(s_k - H_k y_k)}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} s_k s_k^T \end{aligned}$$
(88)

which, after a bit of algebra, reveals that the update formula solving the system defined by (26a), (26b), and (26c) can be expressed as

$$\begin{aligned} H^{*} = H_k - \frac{H_k y_k s_k^T + s_k y_k^T H_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} + \bigg [ \frac{y_k^T s_k + \frac{2}{\beta _k} + y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} \bigg ] s_k s_k^T \text { . } \end{aligned}$$
(89)

We can make (89) look similar to the common form of the BFGS update given in (19) by defining the two quantities \(\gamma _k\) and \(\omega _k\) as in (28) and observing that completing the square gives

$$\begin{aligned} H^{*} = \bigg ( I - \frac{s_k y_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} \bigg ) H_k \bigg ( I - \frac{y_k s_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} \bigg ) + \bigg [ \frac{y_k^T s_k + \frac{2}{\beta _k} + y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} - \frac{y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})^2} \bigg ] s_k s_k^T \end{aligned}$$
(90)

which is equivalent to

$$ \begin{aligned} H^{*} = \bigg ( I - \omega _k s_k y_k^T \bigg ) H_k \bigg ( I - \omega _k y_k s_k^T \bigg ) + \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] s_k s_k^T \end{aligned}$$
(91)

concluding the proof.
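To make the algebra above concrete, the following short NumPy sketch checks numerically that the expanded form (89) and the completed-square form (91) agree, and that the update interpolates between the standard BFGS update (\(\beta _k \rightarrow \infty\)) and leaving \(H_k\) unchanged (\(\beta _k \rightarrow 0\)). It assumes \(\omega _k = 1/(y_k^T s_k + 2/\beta _k)\) and \(\gamma _k = 1/(y_k^T s_k + 1/\beta _k)\), which is the identification suggested by comparing (90) with (91); the definitions in (28) are not reproduced here, so this is a sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random symmetric positive definite inverse Hessian approximation H_k,
# and a step / gradient-difference pair with positive curvature.
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk

def spbfgs_91(Hk, sk, yk, beta):
    """SP-BFGS update in the completed-square form (91)."""
    sty = sk @ yk
    omega = 1.0 / (sty + 2.0 / beta)   # assumed form of omega_k
    gamma = 1.0 / (sty + 1.0 / beta)   # assumed form of gamma_k
    G = np.eye(len(sk)) - omega * np.outer(yk, sk)   # I - omega_k * y_k s_k^T
    return G.T @ Hk @ G + omega * (gamma / omega
                                   + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

def spbfgs_89(Hk, sk, yk, beta):
    """Same update written as in (89)."""
    sty = sk @ yk
    a = sty + 2.0 / beta
    b = sty + 1.0 / beta
    Hy = Hk @ yk
    return (Hk - (np.outer(Hy, sk) + np.outer(sk, Hy)) / a
            + (a + yk @ Hy) / (a * b) * np.outer(sk, sk))

assert np.allclose(spbfgs_91(Hk, sk, yk, 1.0), spbfgs_89(Hk, sk, yk, 1.0))

# beta_k -> infinity recovers the standard BFGS update; beta_k -> 0 leaves H_k unchanged.
rho = 1.0 / (sk @ yk)
Gb = np.eye(n) - rho * np.outer(yk, sk)
H_bfgs = Gb.T @ Hk @ Gb + rho * np.outer(sk, sk)
assert np.allclose(spbfgs_91(Hk, sk, yk, 1e12), H_bfgs, atol=1e-6)
assert np.allclose(spbfgs_91(Hk, sk, yk, 1e-12), Hk, atol=1e-6)
print("SP-BFGS update forms (89) and (91) agree; BFGS and no-update limits recovered")
```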

Appendix 2: Proof of Lemma 1

The \(H_{k+1}\) given by (27) has the general form

$$\begin{aligned} H_{k+1} = G^T H_k G + d s_k s_k^T \end{aligned}$$
(92)

with the specific choices

$$\begin{aligned} G = I - \omega _k y_k s_k^T \text { , } \quad d = \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] \text { . } \end{aligned}$$
(93)

By definition, \(H_{k+1}\) is positive definite if

$$\begin{aligned} v^T H_{k+1} v > 0 \text { , } \quad \forall v \in {\mathbb {R}}^{n} \setminus 0 \text{ . } \end{aligned}$$
(94)

We first show that (29) is a sufficient condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. By applying (92) to (94), we see that

$$\begin{aligned} v^T \bigg ( G^T H_k G + d s_k s_k^T \bigg ) v > 0 \text { , } \quad \forall v \in {\mathbb {R}}^{n} \setminus 0 \end{aligned}$$
(95)

must be true for the choices of G and d in (93) if \(H_{k+1}\) is positive definite. Substituting (93) into (95) reveals that

$$ \begin{aligned} \bigg ( v - \omega_k (s_k^T v) y_k \bigg )^T H_k \bigg ( v - \omega _k (s_k^T v) y_k \bigg ) + \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] (s_k^T v)^2 > 0 \end{aligned}$$
(96)

must be true for all \(v \in {\mathbb {R}}^{n} \setminus 0\) if \(H_{k+1}\) is positive definite. Both \((s_k^T v)^2\) and \(v^T G^T H_k G v\) are always nonnegative. To see that \(v^T G^T H_k G v \ge 0\), note that because \(H_k\) is positive definite, it has a principal square root \(H_k^{1/2}\), and so

$$\begin{aligned} v^T G^T H_k G v = v^T G^T H_k^{1/2} H_k^{1/2} G v = \left\| H_k^{1/2} G v\right\| _2^2 \ge 0 \text{ . } \end{aligned}$$
(97)

We now observe that if \(d > 0\), the right term \(d (s_k^T v)^2\) in (96) is zero if and only if \((s_k^T v) = 0\). However, if \((s_k^T v) = 0\), then the left term \(v^T G^T H_k G v\) in (96) is zero only when \(v = 0\). Hence, the condition \(d > 0\) guarantees that (96) is true for all v excluding the zero vector, and thus that \(H_{k+1}\) is positive definite. The condition \(d > 0\) expands to

$$\begin{aligned} \gamma _k + \omega _k (\gamma _k - \omega _k) y_k^T H_k y_k > 0 \text{ . } \end{aligned}$$
(98)

Using the definitions of \(\gamma _k\) and \(\omega _k\) in (28), it is clear that \((\gamma _k - \omega _k) \ge 0\), as \(\beta _k\) can only take nonnegative values. Furthermore, as \(H_k\) is positive definite, \(y_k^T H_k y_k \ge 0\) for all \(y_k\). As it is possible for \((\gamma _k - \omega _k) y_k^T H_k y_k\) to be zero, we require \(\gamma _k > 0\). The condition \(\gamma _k > 0\) immediately gives (29), as \(\gamma _k\) can only be positive if the denominator in its definition is positive. Finally, as \(\beta _k\) can only take nonnegative values, (29) also ensures that \(\omega _k\) is nonnegative, and so when (29) is true, \(\omega _k (\gamma _k - \omega _k) y_k^T H_k y_k \ge 0\). In summary, we have shown that the condition (29) ensures that the left term in (98) is positive, and the right term nonnegative, so \(d > 0\), and thus \(H_{k+1}\) is positive definite.

We now show that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. If \(H_{k+1}\) is positive definite, then

$$\begin{aligned} y_k^T H_{k+1} y_k > 0 \end{aligned}$$
(99)

assuming \(y_k \ne 0\). Substituting (26b) into (99) gives

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{W^{-1} u}{\beta _k} \bigg ] > 0 \end{aligned}$$
(100)

and using (86) shows that (100) is equivalent to

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{\gamma _k (H_k y_k - s_k)}{\beta _k} \bigg ] > 0 \text { . } \end{aligned}$$
(101)

Now, some algebra shows that

$$\begin{aligned} \begin{aligned} y_k^T \bigg [ s_{k} + \frac{\gamma _k (H_k y_k - s_k)}{\beta _k} \bigg ]&= y_k^T s_{k} + \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg [ y_k^T H_k y_k - y_k^T s_{k} \bigg ] \\&= \bigg ( 1 - \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T s_{k} + \bigg ( \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T H_k y_k \\&= \bigg ( \frac{\beta _k y_k^T s_{k}}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T s_{k} + \bigg ( \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T H_k y_k \\&= \frac{\beta _k (y_k^T s_{k})^2 + y_k^T H_k y_k}{1 + \beta _k y_k^T s_{k}} \end{aligned} \end{aligned}$$
(102)

and we also know that because \(H_k\) is positive definite, \(y_k^T H_k y_k > 0\) for all \(y_k \ne 0\), by definition \(\beta _k \ge 0\), and by the definition of the square of a real number, \((y_k^T s_{k})^2 \ge 0\). As a result,

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{W^{-1} u}{\beta _k} \bigg ] = \frac{\beta _k (y_k^T s_{k})^2 + y_k^T H_k y_k}{1 + \beta _k y_k^T s_{k}} > 0 \end{aligned}$$
(103)

is guaranteed only if the denominator \(1 + \beta _k y_k^T s_{k}\) is positive, which occurs when

$$\begin{aligned} s_k^T y_k > - \frac{1}{\beta _k} \text { . } \end{aligned}$$
(104)

This establishes that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite, and concludes the proof.
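The role of the relaxed curvature condition can be illustrated numerically. The sketch below assumes that (29) is the inequality \(s_k^T y_k > -1/\beta _k\) recovered in (104) and reuses the \(\omega _k, \gamma _k\) identification assumed in the sketch of Appendix 1; it constructs pairs \((s_k, y_k)\) with \(s_k^T y_k\) just above and just below the threshold \(-1/\beta _k\) and reports the smallest eigenvalue of \(H_{k+1}\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)      # positive definite H_k
beta = 2.0                        # threshold is s_k^T y_k > -1/2

def spbfgs_update(Hk, sk, yk, beta):
    sty = sk @ yk
    omega = 1.0 / (sty + 2.0 / beta)
    gamma = 1.0 / (sty + 1.0 / beta)
    G = np.eye(len(sk)) - omega * np.outer(yk, sk)
    return G.T @ Hk @ G + omega * (gamma / omega
                                   + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

sk = rng.standard_normal(n)
d = rng.standard_normal(n)
d -= (d @ sk) / (sk @ sk) * sk    # component of d orthogonal to s_k

for target in (-1.0 / beta + 0.1, -1.0 / beta - 0.1):
    yk = d + target / (sk @ sk) * sk          # constructed so that s_k^T y_k == target
    Hnext = spbfgs_update(Hk, sk, yk, beta)
    print(f"s_k'y_k = {sk @ yk:+.3f}, min eigenvalue of H_(k+1) = "
          f"{np.linalg.eigvalsh(Hnext).min():+.3e}")
```

In line with Lemma 1, the first case (threshold satisfied) produces a positive definite \(H_{k+1}\) despite the negative curvature measurement, while the second case does not.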

Appendix 3: Proof of Theorem 2

The Sherman-Morrison-Woodbury formula says

$$\begin{aligned} (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} \text { . } \end{aligned}$$
(105)

Now, observe that the SP-BFGS update (27) can be written in the factored form

$$\begin{aligned} H_{k+1} = H_k + \omega _k \big [ s_k \quad H_k y_k \big ] \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \text { . } \end{aligned}$$
(106)

Applying the Sherman-Morrison-Woodbury formula (105) to the factored SP-BFGS update (106) with

$$\begin{aligned} A &= H_k \text { , } \\ U &= \omega _k \big [ s_k \quad H_k y_k \big ] \text { , } \\ C &= \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] , \\ V &= \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \end{aligned}$$

yields

$$ \begin{aligned} H_{k+1}^{-1} = H_k^{-1} - H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \bigg ( C^{-1} + V H_k^{-1} U \bigg )^{-1} \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \text { . } \end{aligned}$$

Inverting C here gives

$$\begin{aligned} C^{-1} = \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] ^{-1} = \left[ \begin{array}{cc} 0 &{} -1 \\ -1 &{} -\gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \end{aligned}$$

and we also have

$$\begin{aligned} \begin{aligned} V H_k^{-1} U&= \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \\ &= \omega _k \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \big [ H_k^{-1} s_k \quad y_k \big ] \\ &= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k & \omega _k s_k^T y_k \\ \omega _k y_k^T s_k & \omega _k y_k^T H_k y_k \end{array} \right] \end{aligned} \end{aligned}$$

which is just a \(2 \times 2\) matrix with real entries. Now, it becomes clear that

$$\begin{aligned} \begin{aligned} (C^{-1} + V H_k^{-1} U)&= \bigg ( \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] ^{-1} + \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \bigg ) \\&= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k &{} -1 + \omega _k s_k^T y_k \\ -1 + \omega _k y_k^T s_k &{} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \text { . } \end{aligned} \end{aligned}$$

For notational compactness, let

$$\begin{aligned} D &= (C^{-1} + V H_k^{-1} U) \\ &= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k &{} -1 + \omega _k s_k^T y_k \\ -1 + \omega _k y_k^T s_k &{} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \end{aligned}$$

so

$$\begin{aligned} D^{-1} = \frac{1}{\det (D)} \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} 1 - \omega _k s_k^T y_k \\ 1 - \omega _k y_k^T s_k &{} \omega _k s_k^T H_k^{-1} s_k \end{array} \right] \end{aligned}$$

where the determinant of D is

$$ \begin{aligned} \begin{aligned} \det (D) &= \bigg ( \omega _k y_k^T H_k y_k - \gamma _k \bigg ( \frac{1}{\omega _k} + y_k^T H_k y_k \bigg ) \bigg ) \bigg ( \omega _k s_k^T H_k^{-1} s_k \bigg ) - (1 - \omega _k y_k^T s_k)^2 \\ &= \bigg ( (\omega _k - \gamma _k ) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \bigg ) \bigg ( \omega _k s_k^T H_k^{-1} s_k \bigg ) - (1 - \omega _k y_k^T s_k)^2 \end{aligned} \end{aligned}$$

and we have used the fact that \(y_k^T s_k = s_k^T y_k\), as this is a scalar quantity. Next,

$$\begin{aligned} \begin{aligned} U \det (D) D^{-1} V &= U \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k - \gamma _k (\frac{1}{\omega _k} + y_k^T H_k y_k) &{} 1 - \omega _k s_k^T y_k \\ 1 - \omega _k y_k^T s_k &{} \omega _k s_k^T H_k^{-1} s_k \end{array} \right] \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \\ &= U \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k s_k^T - \gamma _k (\frac{1}{\omega _k} + y_k^T H_k y_k) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \\ (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \end{array} \right] \end{aligned} \end{aligned}$$

so \(U \det (D) D^{-1} V\) fully expanded becomes

$$\tiny \begin{aligned} U \det (D) D^{-1} V = \omega _k \big [ s_k \big ( \omega _k y_k^T H_k y_k s_k^T - \gamma_k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \big ) + H_k y_k \big ( (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \big ) \big ] \text { . } \end{aligned}$$

This looks rather ugly at the moment, but we continue by breaking the problem down further, noting that

$$\tiny \begin{aligned} s_k \big ( \omega _k y_k^T H_k y_k s_k^T - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \big ) = \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) s_k s_k^T + (1 - \omega _k s_k^T y_k) s_k y_k^T H_k \end{aligned}$$

and

$$ \begin{aligned} H_k y_k \bigg ( (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \bigg ) = (1 - \omega _k y_k^T s_k) H_k y_k s_k^T + \omega _k H_k y_k (s_k^T H_k^{-1} s_k) y_k^T H_k \text { . } \end{aligned}$$

The above intermediate results further simplify \(U \det (D) D^{-1} V\) to

$$\tiny \begin{aligned} U \det (D) D^{-1} V = \omega_k \big [ \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) s_k s_k^T + (1 - \omega _k s_k^T y_k) ( s_k y_k^T H_k + H_k y_k s_k^T ) + \omega _k H_k y_k (s_k^T H_k^{-1} s_k) y_k^T H_k \big ] \text { . } \end{aligned}$$

Left and right multiplying the line immediately above by \(A^{-1} = H_k^{-1}\) gives

$$\tiny \begin{aligned} H_k^{-1} U \det (D) D^{-1} V H_k^{-1} = \omega _k \big [ \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) H_k^{-1} s_k s_k^T H_k^{-1} + (1 - \omega _k s_k^T y_k) ( H_k^{-1} s_k y_k^T + y_k s_k^T H_k^{-1} ) + \omega _k y_k (s_k^T H_k^{-1} s_k) y_k^T \big ] \end{aligned}$$

and thus, after dividing out \(\det (D)\) and applying \(B_{k} = H_{k}^{-1}\), we arrive at the following final formula

$$\tiny \begin{aligned} B_{k+1} = B_k - \frac{\omega _k \big [ \big ( (\omega _k - \gamma _k) y_k^T B_k^{-1} y_k - \frac{\gamma _k}{\omega _k} \big ) B_k s_k s_k^T B_k + (1 - \omega _k s_k^T y_k) ( B_k s_k y_k^T + y_k s_k^T B_k ) + \omega _k (s_k^T B_k s_k) y_k y_k^T \big ]}{\big ( (\omega _k - \gamma _k) y_k^T B_k^{-1} y_k - \frac{\gamma _k}{\omega _k} \big ) \big ( \omega _k s_k^T B_k s_k \big ) - (1 - \omega _k y_k^T s_k)^2} \end{aligned}$$
(107)

for the SP-BFGS inverse update, which concludes the proof.
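As a sanity check on the algebra, the sketch below forms \(H_{k+1}\) from the direct update (91) and \(B_{k+1}\) from the inverse formula (107), with the two bracketed scalar factors written out as \({\hat{D}}\) and \({\hat{E}}\) (matching (111) in Appendix 4), and then verifies that the two matrices are indeed inverses of each other; the \(\omega _k, \gamma _k\) identification is the same one assumed in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
Bk = np.linalg.inv(Hk)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk
beta = 3.0

sty = sk @ yk
omega = 1.0 / (sty + 2.0 / beta)
gamma = 1.0 / (sty + 1.0 / beta)

# H_{k+1} from the direct SP-BFGS update (91).
G = np.eye(n) - omega * np.outer(yk, sk)
Hnext = G.T @ Hk @ G + omega * (gamma / omega
                                + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

# B_{k+1} from the inverse update formula (107).
Dhat = (omega - gamma) * (yk @ Hk @ yk) - gamma / omega   # y_k^T B_k^{-1} y_k = y_k^T H_k y_k
Ehat = 1.0 - omega * sty
Bs = Bk @ sk
numerator = (Dhat * np.outer(Bs, Bs)
             + Ehat * (np.outer(Bs, yk) + np.outer(yk, Bs))
             + omega * (sk @ Bs) * np.outer(yk, yk))
Bnext = Bk - omega * numerator / (Dhat * (omega * (sk @ Bs)) - Ehat ** 2)

assert np.allclose(Bnext, np.linalg.inv(Hnext))
print("Inverse update (107) matches the inverse of the direct update (91)")
```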

Appendix 4: Proof of Theorem 3

Referring to Theorem 2, taking the trace of both sides of (107) and applying the linearity and cyclic invariance properties of the trace yields

$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1}) = \kappa _1 {{\,\textrm{Tr}\,}}(B_k) + \kappa _2 \left\| B_k s_k\right\| _2^2 + 2 \kappa _3 (y_k^T B_k s_k) + \kappa _4 \left\| y_k\right\| _2^2 \end{aligned}$$
(108)

where

$$\begin{aligned} \kappa _1 = 1 \text { , } \quad \kappa _2 = - \frac{\omega _k {\hat{D}}}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text { , } \end{aligned}$$
(109)
$$\begin{aligned} \kappa _3 = - \frac{\omega _k {\hat{E}}}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text { , } \quad \kappa _4 = - \frac{(\omega _k)^2 s_k^T B_k s_k}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text{ , } \end{aligned}$$
(110)

with \({\hat{D}}\) and \({\hat{E}}\) defined as

$$\begin{aligned} {\hat{D}} = \bigg [ (\omega _k - \gamma _k) (y_k^T B_k^{-1} y_k) - \frac{\gamma _k}{\omega _k} \bigg ] \text { , } \quad {\hat{E}} = (1 - \omega _k s_k^T y_k ) = \frac{2 \omega _k}{\beta _k} \text { . } \end{aligned}$$
(111)

We now observe that after applying some basic algebra, and recalling that \(B_k\) is positive definite, one can deduce that for all \(\beta _k \in [0, +\infty ]\), the following inequalities hold

$$\begin{aligned} (\omega _k - \gamma _k) \le 0 \text { , } \quad 1 \le \frac{\gamma _k}{\omega _k} \text { , } \quad {\hat{D}} \le -1 \text { , } \quad 0 \le \frac{2 \omega _k}{\beta _k} \le 1 \text { . } \end{aligned}$$
(112)

By minimizing the absolute value of the common denominator in \(\kappa _2, \kappa _3\), and \(\kappa _4\) using the inequalities above, one can obtain the bounds

$$\begin{aligned} - \frac{1}{s_k^T B_k s_k} \le \kappa _2 \le 0 \text { , } \qquad 0 \le \kappa _4 \le \omega _k \le \gamma _k \text{ , } \end{aligned}$$
(113)
$$\begin{aligned} 0 \le \kappa _3 \le \frac{2 \omega _k}{\beta _k} \frac{1}{s_k^T B_k s_k + \frac{2 \omega _k}{\beta _k} \frac{2}{\beta _k}} \le \frac{\beta _k}{2} \text { . } \end{aligned}$$
(114)

As a result,

$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1})&\le {{\,\textrm{Tr}\,}}(B_k) + 2 \kappa _3 | y_k^T B_k s_k | + \kappa _4 \left\| y_k\right\| _2^2 \end{aligned}$$
(115)
$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1}) \le {{\,\textrm{Tr}\,}}(B_k) + \beta _k \left\| y_k\right\| _2 \lambda _{max}(B_k) \left\| s_k\right\| _2 + \gamma _k \left\| y_k\right\| _2^2 \end{aligned}$$
(116)

and applying \(\lambda _{max}(B_k) < {{\,\textrm{Tr}\,}}(B_k)\) establishes (53). Similarly, referring to (89) reveals the upper bound

$$ \begin{aligned} {{\,\textrm{Tr}\,}}(H_{k+1}) \le {{\,\textrm{Tr}\,}}(H_k) + 2 \omega _k | y_k^T H_k s_k | + \big [ \gamma _k + \omega _k \gamma _k (y_k^T H_k y_k) \big ] \left\| s_k\right\|_2^2 \text { . } \end{aligned}$$
(117)

To establish (52), we apply \(\lambda _{max}(H_k) < {{\,\textrm{Tr}\,}}(H_k)\) and \(\omega _k \le \gamma _k\) to the line above, and then factor. This completes the proof.
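A quick numerical check of the trace bound is also possible. The sketch below evaluates both sides of (116) (whose right-hand side is only enlarged by replacing \(\lambda _{max}(B_k)\) with \({{\,\textrm{Tr}\,}}(B_k)\) to obtain (53)) for several penalty parameters \(\beta _k\), again using the \(\omega _k, \gamma _k\) identification assumed in the earlier sketches; the problem data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
Bk = np.linalg.inv(Hk)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk
sty = sk @ yk

for beta in (0.1, 1.0, 10.0, 1e6):
    omega = 1.0 / (sty + 2.0 / beta)
    gamma = 1.0 / (sty + 1.0 / beta)
    G = np.eye(n) - omega * np.outer(yk, sk)
    Hnext = G.T @ Hk @ G + omega * (gamma / omega
                                    + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)
    Bnext = np.linalg.inv(Hnext)
    # Right-hand side of (116): Tr(B_k) + beta*||y||*lambda_max(B_k)*||s|| + gamma*||y||^2.
    rhs = (np.trace(Bk)
           + beta * np.linalg.norm(yk) * np.linalg.eigvalsh(Bk).max() * np.linalg.norm(sk)
           + gamma * (yk @ yk))
    print(f"beta_k = {beta:8.1e}:  Tr(B_(k+1)) = {np.trace(Bnext):10.4f}  <=  bound = {rhs:10.4f}")
```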

Appendix 5: Proof of Lemma 2

As \(\phi\) is m-strongly convex due to Assumption 3, it is true that

$$\begin{aligned} \phi (y) \ge \phi (x) + \nabla \phi (x)^T (y - x) + \frac{m}{2} \left\| y - x\right\| _2^2 \text { , } \quad \forall x, y \in {\mathbb {R}}^n \text { . } \end{aligned}$$
(118)

Note that for any fixed x, the right side of (118) provides a global quadratic lower bound on \(\phi\). As these bounds are global lower bounds, minimizing both sides of (118) with respect to y preserves the inequality, so

$$\begin{aligned} \min _{y} \bigg \{ \phi (y) \bigg \} \ge \min _{y} \bigg \{ \phi (x) + \nabla \phi (x)^T (y - x) + \frac{m}{2} \left\| y - x\right\| _2^2 \bigg \} \end{aligned}$$
(119)

which simplifies to

$$\begin{aligned} \phi ^{\star } \ge \phi (x) - \frac{1}{2 m} \left\| \nabla \phi (x)\right\| _2^2 \text { . } \end{aligned}$$
(120)

Proceeding, the inner product condition \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) expands to

$$\begin{aligned} \nabla \phi (x)^T H g(x) = \nabla \phi (x)^T H \nabla \phi (x) + \nabla \phi (x)^T H e(x) > \xi \left\| \nabla \phi (x)\right\| _2 \text { . } \end{aligned}$$
(121)

The quantity \(\nabla \phi (x)^T H \nabla \phi (x)\) is bounded below by

$$\begin{aligned} \nabla \phi (x)^T H \nabla \phi (x) \ge \psi \left\| \nabla \phi (x)\right\| _2^2 \text { . } \end{aligned}$$
(122)

By applying the Cauchy-Schwarz inequality and Assumption 2, the quantity \(\nabla \phi (x)^T H e(x)\) is bounded below by

$$\begin{aligned} \nabla \phi (x)^T H e(x) \ge - {\varPsi } \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 \ge - {\varPsi } \left\| \nabla \phi (x)\right\| _2 {\bar{\epsilon }}_g \text { . } \end{aligned}$$
(123)

Thus, we see that if

$$\begin{aligned} \psi \left\| \nabla \phi (x)\right\| _2^2 - {\varPsi } \left\| \nabla \phi (x)\right\| _2 {\bar{\epsilon }}_g > \xi \left\| \nabla \phi (x)\right\| _2 \text { , } \end{aligned}$$
(124)

which rearranges to

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 > \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { , } \end{aligned}$$
(125)

then \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) is guaranteed. Note that (125) implies

$$\begin{aligned} \nabla \phi (x)^T H g(x)> \xi \left\| \nabla \phi (x)\right\| _2 > \xi \bigg [ \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \bigg ] \end{aligned}$$
(126)

when combined with the inner product condition. Combining (125) with Assumption 2 and the definition of the gradient noise to signal ratio \(\delta (x)\) given by (58) reveals that

$$\begin{aligned} \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } < \left\| \nabla \phi (x)\right\| _2 = \frac{\left\| e(x)\right\| _2}{\delta (x)} \le \frac{{\bar{\epsilon }}_g}{\delta (x)} \end{aligned}$$
(127)

and so \(\delta (x) < \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \le \frac{\psi }{{\varPsi }}\).

Contrapositively, if \(\nabla \phi (x)^T H g(x) \le \xi \left\| \nabla \phi (x)\right\| _2\), then

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { , } \end{aligned}$$
(128)

or if \(\delta (x) \ge \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \ge 0\), then

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 \le \bigg ( \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi {\bar{\epsilon }}_g} \bigg ) \left\| e(x)\right\| _2 \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { . } \end{aligned}$$
(129)

Squaring either inequality (128) or (129) and then combining it with a rearranged (120) given by

$$\begin{aligned} \phi (x) - \phi ^{\star } \le \frac{1}{2 m} \left\| \nabla \phi (x)\right\| _2^2 \end{aligned}$$
(130)

gives \({\mathcal {N}}_1(\psi ,{\varPsi },\xi )\), completing the proof.
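The inequality (120) is the familiar Polyak-Lojasiewicz-type consequence of strong convexity, and it is the one step above that is easy to check numerically in isolation. The sketch below does so for a randomly generated m-strongly convex quadratic; the quadratic itself is purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)            # Hessian of an m-strongly convex quadratic
m = np.linalg.eigvalsh(Q).min()    # strong convexity constant
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ Q @ x + b @ x
grad = lambda x: Q @ x + b
phi_star = phi(np.linalg.solve(Q, -b))   # minimum value of phi

for _ in range(5):
    x = rng.standard_normal(n)
    # Inequality (120): phi_star >= phi(x) - ||grad phi(x)||^2 / (2m).
    assert phi_star >= phi(x) - np.linalg.norm(grad(x)) ** 2 / (2 * m)
print("Inequality (120) holds at all sampled points")
```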

Appendix 6: Proof of Lemma 3

Similar to (122) and (123), by using the definition of \(\delta (x)\), the lower bound

$$ \begin{aligned} \nabla \phi (x)^T H g(x) \ge \psi \left\| \nabla \phi (x)\right\| _2^2 - {\varPsi } \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 = \big ( \psi - {\varPsi } \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \end{aligned}$$
(131)

and the upper bound

$$ \begin{aligned} \varepsilon \big ( 1 + \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 = \varepsilon \big ( \left\| \nabla \phi (x)\right\| _2^2 + \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 \big ) \ge \varepsilon \nabla \phi (x)^T g(x) \end{aligned}$$
(132)

can be established. Observe that if the lower bound (131) is always greater than or equal to the upper bound (132)

$$\begin{aligned} \big ( \psi - {\varPsi } \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \ge \varepsilon \big ( 1 + \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \text { , } \end{aligned}$$
(133)

it implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). Hence, the condition

$$\begin{aligned} \varepsilon \le \frac{\big ( \psi - {\varPsi } \delta (x) \big )}{\big ( 1 + \delta (x) \big )} \end{aligned}$$
(134)

implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). By applying Lemma 2, we see that for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), it is true that \(\delta (x) < \frac{\psi }{(1 + A) {\varPsi }}\). Thus, setting

$$\begin{aligned} \varepsilon < \frac{\big ( \psi - \frac{\psi }{(1 + A) } \big ) }{ \big ( 1 + \frac{\psi }{(1 + A) {\varPsi } } \big ) } = \frac{ A \psi {\varPsi }}{\big ( (1+A) {\varPsi } + \psi \big )} \end{aligned}$$
(135)

guarantees that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\) for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), completing the proof.
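The implication proved above can be spot-checked numerically. The sketch below draws a symmetric \(H\) with spectrum inside \([\psi , {\varPsi }]\), fixes a gradient noise-to-signal ratio \(\delta\) with \(\psi - {\varPsi } \delta > 0\), sets \(\varepsilon\) equal to the right-hand side of (134), and confirms \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\) on random samples; the specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
psi, Psi = 0.5, 2.0
delta = 0.2                                    # noise-to-signal ratio, with psi - Psi*delta > 0

# Random symmetric H with eigenvalues in [psi, Psi].
Qm, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Qm @ np.diag(rng.uniform(psi, Psi, n)) @ Qm.T

eps = (psi - Psi * delta) / (1.0 + delta)      # condition (134) taken with equality

for _ in range(1000):
    grad_phi = rng.standard_normal(n)
    e = rng.standard_normal(n)
    e *= delta * np.linalg.norm(grad_phi) / np.linalg.norm(e)   # ||e|| = delta * ||grad phi||
    g = grad_phi + e                                            # noisy gradient measurement
    assert grad_phi @ H @ g >= eps * (grad_phi @ g) - 1e-12
print("grad_phi' H g >= eps * grad_phi' g held on all samples")
```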

Appendix 7: Proof of Theorem 4

As \(\phi \in C^2\) by Assumption 3, applying Taylor’s theorem and using (62) and strong convexity gives

$$\begin{aligned} \phi _{k+1}&= \phi _k + \nabla \phi _k^T [ x_{k+1} - x_{k} ] + \frac{1}{2} [ x_{k+1} - x_{k} ]^T \nabla ^2 \phi (u) [ x_{k+1} - x_{k} ] \\&\le \phi _k - \alpha \nabla \phi _k^T H_k g_k + \frac{\alpha ^2 M}{2} \left\| H_k g_k\right\| _2^2 \end{aligned}$$

where u is some convex combination of \(x_{k+1}\) and \(x_{k}\). Proceeding, note that the smallest possible region \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) from Lemma 3 occurs with the choice \(\psi = {\varPsi }\). In this case \(H = {\varPsi } I\), and (59) from Lemma 2 becomes

$$\begin{aligned} \nabla \phi _k^T g_k> A {\bar{\epsilon }}_g \bigg [ (1+A) {\bar{\epsilon }}_g \bigg ] > 0 \end{aligned}$$
(136)

and so \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi = {\varPsi },{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Hence, for all possible choices of \(0 < \psi \le {\varPsi }\) in \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Combining this with Lemma 3 gives

$$\begin{aligned} \nabla \phi _k^T H_k g_k \ge \varepsilon \nabla \phi _k^T g_k > 0 \end{aligned}$$
(137)

if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). With (137) in hand, continuing to bound terms gives

$$ \begin{aligned} \phi _{k+1} &\le \phi _k - \alpha \varepsilon \nabla \phi _k^T [ \nabla \phi _k + e_k ] + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| \nabla \phi _k + e_k\right\| _2^2 \\ &= \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \nabla \phi _k^T e_k + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\| _2^2 \\ & \le \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 + \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \left\| \nabla \phi _k\right\|_2 \left\| e_k\right\| _2 + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\| _2^2 \\ & \le \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 + \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \bigg [ \frac{1}{2} \left\| \nabla \phi _k\right\| _2^2 + \frac{1}{2} \left\| e_k\right\|_2^2 \bigg ] + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\|_2^2 \end{aligned}$$

where the last inequality follows from expanding

$$\begin{aligned} 0 \leq \bigg ( \frac{1}{\sqrt{2}} \left\| \nabla \phi _k\right\| _2 - \frac{1}{\sqrt{2}} \left\| e_k\right\| _2 \bigg )^2 = \frac{1}{2} \left\| \nabla \phi _k\right\| _2^2 - \left\| \nabla \phi _k\right\| _2 \left\| e_k\right\| _2 + \frac{1}{2} \left\| e_k\right\| _2^2 \end{aligned}$$
(138)

and using \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) in (63). Simplifying the last inequality reveals that

$$\begin{aligned} \phi _{k+1} \le \phi _k - \frac{\alpha \varepsilon }{2} \left\| \nabla \phi _k\right\| _2^2 + \frac{\alpha \varepsilon }{2} \left\| e_k\right\| _2^2 \text { . } \end{aligned}$$
(139)

Since \(\phi\) is m-strongly convex by Assumption 3, we can apply

$$\begin{aligned} \left\| \nabla \phi _k\right\| _2^2 \ge 2 m ( \phi _k - \phi ^{\star } ) \end{aligned}$$
(140)

which comes from rearranging (120) in the proof of Lemma 2 (see Appendix 5). Combining (140) with (139) and Assumption 2 gives

$$\begin{aligned} \phi _{k+1} \le \phi _k - \alpha \varepsilon m ( \phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{(1+A) {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \text { . } \end{aligned}$$
(141)

Subtracting \(\phi ^{\star }\) from both sides, and using the notation \({\tilde{A}} :=(1+A)\), we get

$$\begin{aligned} \phi _{k+1} - \phi ^{\star } \le (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \end{aligned}$$
(142)

which, by subtracting \(\frac{1}{2 m} \big ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \big )^2\) from both sides and simplifying, gives

$$\begin{aligned} \phi _{k+1} - \phi ^{\star } - \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2&\le (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 - \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \\&= (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + (\alpha \varepsilon m - 1) \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \\&= (1 - \alpha \varepsilon m) \bigg ( \phi _k - \bigg [ \phi ^{\star } + \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \bigg ] \bigg ) \end{aligned}$$

thus establishing the Q-linear result. We obtain the R-linear result (64) by recursively applying the worst case bound given by the Q-linear result, noting that in the worst case if \(x_0 \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), then the sequence of iterates \(\{ x_k \}\) remains outside of \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), only approaching \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) in the limit \(k \rightarrow \infty\).
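Although the constants in the theorem are conservative, the qualitative behavior (linear decrease of the optimality gap down to a noise-determined floor) is easy to observe in a toy simulation. The sketch below runs the fixed-step iteration \(x_{k+1} = x_k - \alpha H_k g_k\) with \(H_k = I\) (so \(\psi = {\varPsi } = 1\) and the inequality \(\nabla \phi _k^T H_k g_k \ge \varepsilon \nabla \phi _k^T g_k\) holds trivially with \(\varepsilon = 1\)), step size \(\alpha = 1/M\), and uniformly bounded gradient noise on a strongly convex quadratic; all problem data are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                       # Hessian of a strongly convex quadratic
b = rng.standard_normal(n)
M = np.linalg.eigvalsh(Q).max()               # Lipschitz constant of the gradient

phi = lambda x: 0.5 * x @ Q @ x + b @ x
grad = lambda x: Q @ x + b
phi_star = phi(np.linalg.solve(Q, -b))

eps_g = 1e-2                                  # uniform bound on the gradient noise
alpha = 1.0 / M                               # satisfies alpha <= eps/(M*Psi^2) with eps = Psi = 1
x = 10.0 * rng.standard_normal(n)

for k in range(201):
    if k % 50 == 0:
        print(f"k = {k:3d}   phi(x_k) - phi* = {phi(x) - phi_star:.3e}")
    e = rng.standard_normal(n)
    e *= eps_g * rng.uniform() / np.linalg.norm(e)   # ||e_k|| <= eps_g
    x = x - alpha * (grad(x) + e)                    # H_k = I

```

The printed optimality gap shrinks rapidly at first and then stagnates at a small value, consistent with convergence to a neighborhood of the minimizer whose size is governed by the noise level.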

Appendix 8: Proof of Theorem 5

From (139) in Appendix 7, if the step size \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) from (63), one has

$$\begin{aligned} \phi (x_k - \alpha H_k g_k) \le \phi (x_k) - \frac{\alpha \varepsilon }{2} \left\| \nabla \phi (x_k)\right\| _2^2 + \frac{\alpha \varepsilon }{2} \left\| e(x_k)\right\| _2^2 \end{aligned}$$
(143)

which combines with Assumption 1 to give

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} \big ( \left\| \nabla \phi (x_k)\right\| _2^2 - \left\| e(x_k)\right\| _2^2 \big ) + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(144)

The relaxed Armijo condition (38) expands to

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha g_k^T H_k g_k + 2 \epsilon _{A} \end{aligned}$$
(145)

and so the strongest possible condition (i.e. the condition requiring the greatest decrease in f) can be written as

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} \text { . } \end{aligned}$$
(146)

Comparing (144) and (146) reveals that for the bound given by (144) to also imply the bound given by (146), it must be true that

$$\begin{aligned} - \frac{\alpha \varepsilon }{2} \big ( \left\| \nabla \phi (x_k)\right\| _2^2 - \left\| e(x_k)\right\| _2^2 \big ) + 2 {\bar{\epsilon }}_{f} \le - c_1 \alpha {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} \end{aligned}$$
(147)

which rearranges to

$$\begin{aligned} c_1 {\varPsi } \left\| g_k\right\| _2^2 + \frac{\varepsilon }{2} \left\| e(x_k)\right\| _2^2 \le \frac{\varepsilon }{2} \left\| \nabla \phi (x_k)\right\| _2^2 + \frac{2}{\alpha } ( \epsilon _{A} - {\bar{\epsilon }}_{f} ) \text { . } \end{aligned}$$
(148)

As \(\epsilon _{A} - {\bar{\epsilon }}_{f} > 0\), it is clear that the right side of (148) can be made arbitrarily large by sending \(\alpha \rightarrow 0\). Hence, the relaxed Armijo condition (38) will be satisfied for sufficiently small \(\alpha\) and the backtracking line search will always find an \(\alpha _k\) small enough to satisfy (38).

By Lemma 2, outside of \({\mathcal {N}}_1(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), one has \(\delta _k < \frac{\psi }{(1+A){\varPsi }}\). Applying the triangle and reverse triangle inequalities to (32) gives

$$ \begin{aligned} \left\| \nabla \phi (x_k)\right\| _2 - \left\| e(x_k)\right\| _2 \le \left\| \nabla \phi (x_k) + e(x_k)\right\| _2 \le \left\| \nabla \phi (x_k)\right\| _2 + \left\| e(x_k)\right\| _2 \end{aligned}$$
(149)

which can be written using the gradient noise to signal ratio \(\delta _k\) as

$$\begin{aligned} (1 - \delta _k) \left\| \nabla \phi (x_k)\right\| _2 \le \left\| g_k\right\| _2 \le (1 + \delta _k) \left\| \nabla \phi (x_k)\right\| _2 \text { . } \end{aligned}$$
(150)

Combining the definition of \(\delta _k\) (see (58) in Lemma 2), (150), and \(\delta _k < \frac{\psi }{(1+A) {\varPsi }}\) with (144) gives

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} (1 - \delta _k^2) \left\| \nabla \phi (x_k)\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(151)
$$ \begin{aligned} f(x_k) - \frac{\alpha \varepsilon }{2} (1 - \delta _k^2) \left\| \nabla \phi (x_k)\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k^2)}{(1 + \delta _k)^2} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(152)
$$ \begin{aligned} f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k^2)}{(1 + \delta _k)^2} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} = f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k)}{(1 + \delta _k)} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(153)
$$ \begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k)}{(1 + \delta _k)} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{ \bigg ( 1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )}{ \bigg ( 1 + \frac{\psi }{(1+A) {\varPsi }} \bigg ) } \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(154)

Now, as \({\bar{\epsilon }}_{f} < \epsilon _{A}\), the bound (154) above implies the bound (146) for any \(\alpha \le \frac{\varepsilon }{{\varPsi }^2 M}\) if \(c_1 \le \frac{\varepsilon }{2 {\varPsi }} \frac{ \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big ) }{ \big ( 1 + \frac{\psi }{(1+A) {\varPsi }} \big ) }\). Since \(\alpha _k\) is chosen using a backtracking line search with backtracking factor \(\tau < 1\), it is true that \(\frac{\tau \varepsilon }{{\varPsi }^2 M} < \alpha _k \le \frac{\varepsilon }{{\varPsi }^2 M}\). Thus, combining the bound (146) with Assumption 1 and (150) shows that

$$\begin{aligned} \phi (x_k - \alpha _k H_k g_k) \le \phi (x_k) - c_1 \alpha _k {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(155)
$$ \begin{aligned} \phi (x_k) - c_1 \alpha _k {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} (1 - \delta _k)^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(156)
$$ \begin{aligned} \phi (x_k - \alpha _k H_k g_k) \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} \bigg (1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(157)

The expression (157) measures the reduction in the value of \(\phi\) for iterates where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Proceeding, we take the following bound

$$\begin{aligned} \phi (x_{k+1}) \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} \bigg (1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(158)

and subtract \(\phi ^{\star }\) from both sides as well as apply the inequality (140) to get

$$ \begin{aligned} \phi (x_{k+1}) - \phi ^{\star } \le \bigg ( 1 - \frac{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2}{{\varPsi } M} \bigg ) ( \phi (x_k) - \phi ^{\star } ) + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(159)

For ease of notation, define the following quantities

$$\begin{aligned} \rho :=\bigg ( 1 - \frac{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2}{{\varPsi } M} \bigg ) \text { , } \qquad \eta :=2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(160)

Thus, for all k where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have shown that the following bound holds

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } \le \rho ( \phi (x_k) - \phi ^{\star } ) + \eta \text { . } \end{aligned}$$
(161)

Subtracting \(\frac{\eta }{(1 - \rho )}\) from both sides shows that

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } - \frac{\eta }{1 - \rho }&\le \rho ( \phi (x_k) - \phi ^{\star } ) + \eta - \frac{\eta }{1 - \rho } \\&= \rho ( \phi (x_k) - \phi ^{\star } ) - \frac{\rho \eta }{1 - \rho } \\&= \rho \bigg ( \phi (x_k) - \phi ^{\star } - \frac{\eta }{1 - \rho } \bigg ) \end{aligned}$$

and thus one has

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } - {\bar{\eta }} \le \rho ( \phi (x_{k}) - \phi ^{\star } - {\bar{\eta }} ) \end{aligned}$$
(162)

where \({\bar{\eta }} :=\frac{\eta }{(1 - \rho )}\). Using the definitions in (160) shows that

$$\begin{aligned} {\bar{\eta }} = \frac{{\varPsi } M}{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2} \bigg ( 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \bigg ) \end{aligned}$$
(163)
$$\begin{aligned} {\bar{\eta }} = \frac{{\varPsi } M}{m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2} ( \epsilon _{A} + {\bar{\epsilon }}_{f} ) \end{aligned}$$
(164)

which establishes (68) and (69). Similar to Appendix 7, we obtain the R-linear result (71) by recursively applying the bound in (68), stopping once an iterate enters \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). This concludes the proof.
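For readers who want to connect the analysis to an implementation, the following sketch shows a backtracking line search built around the relaxed Armijo condition (38) as written in (145), i.e. \(f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha g_k^T H_k g_k + 2 \epsilon _A\). The function name, default constants, and the noisy quadratic in the usage example are illustrative assumptions, not the paper's code.

```python
import numpy as np

def backtracking_relaxed_armijo(f, x, g, H, c1=1e-4, tau=0.5, eps_A=1e-3,
                                alpha0=1.0, max_backtracks=50):
    """Backtrack until the relaxed Armijo condition (38)/(145) holds:
    f(x - alpha*H*g) <= f(x) - c1*alpha*g'Hg + 2*eps_A, with eps_A > bar(eps)_f assumed."""
    p = H @ g                    # search direction built from the noisy gradient g
    gHg = g @ p
    fx = f(x)                    # one stored (noisy) function measurement at x_k
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x - alpha * p) <= fx - c1 * alpha * gHg + 2.0 * eps_A:
            break
        alpha *= tau             # backtracking factor tau < 1
    return alpha

# Illustrative usage on a quadratic with bounded function and gradient noise.
rng = np.random.default_rng(7)
Q = np.diag([1.0, 5.0, 10.0])
f_noisy = lambda x: 0.5 * x @ Q @ x + 1e-4 * rng.uniform(-1.0, 1.0)
x = np.ones(3)
g = Q @ x + 1e-3 * rng.standard_normal(3)      # noisy gradient measurement
alpha_k = backtracking_relaxed_armijo(f_noisy, x, g, np.eye(3))
print("accepted step size:", alpha_k)
```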

Appendix 9: Extended numerical experiments

Table 5 below shows the performance of gradient descent for the same problem (ROSENBR) and noise combinations as in Table 1.

Table 5 Performance of gradient descent on the Rosenbrock function (i.e. ROSENBR) corrupted by noise

Tables 6, 7, and 8 compare the performance of SP-BFGS, BFGS, and gradient descent on the 32 CUTEst test problems with only gradient noise present (i.e. \({\bar{\epsilon }}_f = 0\)). Gradient noise was generated using \({\bar{\epsilon }}_g = 10^{-4} \left\| \nabla \phi (x^0)\right\| _2\), where the starting point \(x^0\) varies by CUTEst problem, to ensure that noise does not initially dominate gradient evaluations; a sketch of one way such bounded gradient noise can be generated follows the table listing below. By examining the mean and median columns in Tables 6, 7, and 8, one sees that SP-BFGS outperforms both BFGS and gradient descent on \(\frac{18}{32} \approx 56 \%\) of the CUTEst problems with only gradient noise present, and performs at least as well as the best performing alternative on \(\frac{28}{32} \approx 88 \%\) of these problems. Equivalently, SP-BFGS was only outperformed by BFGS or gradient descent on \(\frac{4}{32} \approx 12 \%\) of these problems.

Table 6 Performance of SP-BFGS on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
Table 7 Performance of BFGS on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
Table 8 Performance of gradient descent on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
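The gradient noise scaling described above can be reproduced in a few lines. The sketch below shows one plausible way of drawing gradient noise bounded by \({\bar{\epsilon }}_g = 10^{-4} \left\| \nabla \phi (x^0)\right\| _2\), using the Rosenbrock function (ROSENBR) and its usual starting point as an example; the exact sampling distribution used in the paper's experiments is not specified here, so the noise model is an assumption.

```python
import numpy as np

def noisy_gradient(grad_phi, x, eps_g, rng):
    """Return grad_phi(x) plus a noise vector with 2-norm at most eps_g
    (one plausible bounded-noise model; the paper's exact scheme may differ)."""
    e = rng.standard_normal(x.shape)
    e *= eps_g * rng.uniform() / np.linalg.norm(e)
    return grad_phi(x) + e

# Rosenbrock gradient and its usual starting point x^0 = (-1.2, 1).
grad_rosenbrock = lambda x: np.array([
    -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
    200.0 * (x[1] - x[0] ** 2),
])
x0 = np.array([-1.2, 1.0])

rng = np.random.default_rng(8)
eps_g = 1e-4 * np.linalg.norm(grad_rosenbrock(x0))   # noise level scaled off the starting gradient
g0 = noisy_gradient(grad_rosenbrock, x0, eps_g, rng)
print("true gradient:", grad_rosenbrock(x0), " noisy gradient:", g0)
```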

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Irwin, B., Haber, E. Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition. Comput Optim Appl 84, 651–702 (2023). https://doi.org/10.1007/s10589-022-00448-x

