
Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition


Abstract

In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation generates a parametric family with the original BFGS update at one extreme and not updating the inverse Hessian approximation at the other extreme. Furthermore, we find that the curvature condition is relaxed as the family moves towards not updating the inverse Hessian approximation, and disappears entirely at that extreme. These developments allow us to construct a method we refer to as Secant Penalized BFGS (SP-BFGS) that relaxes the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, which replaces the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.


Data availability

The CUTEst test problems used in the numerical experiments are available at https://www.cuter.rl.ac.uk/Problems/mastsif.shtml.


Acknowledgements

EH and BI’s work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia (UBC).

Author information


Corresponding author

Correspondence to Brian Irwin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Proof of Theorem 1

To produce the SP-BFGS update, we first rearrange (26a), revealing that

$$\begin{aligned} (H - H_k) = - W^{-1} ( u y_k^T + {\varGamma }^T - {\varGamma } ) W^{-1} \end{aligned}$$
(77)

and so the symmetry requirement that \(H = H^T\) means transposing (77) gives

$$\begin{aligned} u y_k^T + {\varGamma }^T - {\varGamma } = ( u y_k^T + {\varGamma }^T - {\varGamma } )^T = y_k u^T + {\varGamma } - {\varGamma }^T \end{aligned}$$
(78)

which rearranges to

$$\begin{aligned} {\varGamma }^T - {\varGamma } = \frac{1}{2} ( y_k u^T - u y_k^T ) \end{aligned}$$
(79)

and so

$$\begin{aligned} (H - H_k) = - \frac{1}{2} W^{-1} (y_k u^T + u y_k^T) W^{-1} \text { . } \end{aligned}$$
(80)

Next, we right multiply (80) by \(y_k\) to get

$$\begin{aligned} (H - H_k) y_k = - \frac{1}{2} W^{-1} \bigg ( y_k u^T W^{-1} y_k + u (y_k^T W^{-1} y_k) \bigg ) \end{aligned}$$
(81)

and use (26b) to get that

$$\begin{aligned} s_k + \frac{W^{-1} u}{\beta _k} - H_k y_k = - \frac{1}{2} W^{-1} \bigg ( y_k u^T W^{-1} y_k + u (y_k^T W^{-1} y_k) \bigg ) \text { . } \end{aligned}$$
(82)

We now left multiply both sides by \(-2 W\) and rearrange, giving

$$\begin{aligned} -2 W (s_k - H_k y_k) = y_k u^T W^{-1} y_k + u \bigg ( y_k^T W^{-1} y_k + \frac{2}{\beta _k} \bigg ) \text { . } \end{aligned}$$
(83)

This can be rearranged so that u is isolated, giving

$$\begin{aligned} u &= \frac{-2 W (s_k - H_k y_k) - y_k u^T W^{-1} y_k}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \\ &= - \frac{2 W (s_k - H_k y_k) + y_k u^T W^{-1} y_k}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \text { . } \end{aligned}$$
(84)

To get rid of the \(u^T\) on the right hand side, we first left multiply both sides by \(y_k^T W^{-1}\), and then transpose to get

$$\begin{aligned} u^T W^{-1} y_k = - \frac{2 (s_k - H_k y_k)^T y_k + (y_k^T W^{-1} y_k) (u^T W^{-1} y_k)}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \end{aligned}$$
(85)

where we have taken advantage of the fact that the transpose of a scalar returns the same scalar. This now allows us to solve for \(u^T W^{-1} y_k\) using some basic algebra, resulting in

$$\begin{aligned} u^T W^{-1} y_k = - \frac{(s_k - H_k y_k)^T y_k}{y_k^T W^{-1} y_k + \frac{1}{\beta _k}} \text { . } \end{aligned}$$
(86)

Substituting (86) into (84) gives

$$\begin{aligned} u = \frac{y_k y_k^T (s_k - H_k y_k)}{(y_k^T W^{-1} y_k + \frac{2}{\beta _k})(y_k^T W^{-1} y_k + \frac{1}{\beta _k})} - \frac{2 W (s_k - H_k y_k)}{y_k^T W^{-1} y_k + \frac{2}{\beta _k}} \text { . } \end{aligned}$$
(87)

Now, if we substitute the expression for u in (87) into (80), after some simplification we get

$$\begin{aligned} (H - H_k) = \frac{1}{\big ( y_k^T W^{-1} y_k + \frac{2}{\beta _k}\big )} \bigg [ (s_k - H_k y_k) y_k^T W^{-1} + W^{-1} y_k (s_k - H_k y_k)^T - \frac{y_k^T(s_k - H_k y_k)}{(y_k^T W^{-1} y_k + \frac{1}{\beta _k})} W^{-1} y_k y_k^T W^{-1} \bigg ] \text { . } \end{aligned}$$

Now, we further simplify by applying that \(W s_k = y_k\), and thus \(W^{-1} y_k = s_k\), revealing

$$ \begin{aligned} H = H_k + \frac{(s_k - H_k y_k) s_k^T + s_k (s_k - H_k y_k)^T}{(y_k^T s_k + \frac{2}{\beta _k})} - \frac{y_k^T(s_k - H_k y_k)}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} s_k s_k^T \end{aligned}$$
(88)

which, after a bit of algebra, reveals that the update formula solving the system defined by (26a), (26b), and (26c) can be expressed as

$$\begin{aligned} H^{*} = H_k - \frac{H_k y_k s_k^T + s_k y_k^T H_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} + \bigg [ \frac{y_k^T s_k + \frac{2}{\beta _k} + y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} \bigg ] s_k s_k^T \text { . } \end{aligned}$$
(89)

We can make (89) look similar to the common form of the BFGS update given in (19) by defining the two quantities \(\gamma _k\) and \(\omega _k\) as in (28) and observing that completing the square gives

$$\begin{aligned} H^{*} = \bigg ( I - \frac{s_k y_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} \bigg ) H_k \bigg ( I - \frac{y_k s_k^T}{(y_k^T s_k + \frac{2}{\beta _k})} \bigg ) + \bigg [ \frac{y_k^T s_k + \frac{2}{\beta _k} + y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})(y_k^T s_k + \frac{1}{\beta _k})} - \frac{y_k^T H_k y_k}{(y_k^T s_k + \frac{2}{\beta _k})^2} \bigg ] s_k s_k^T \end{aligned}$$
(90)

which is equivalent to

$$ \begin{aligned} H^{*} = \bigg ( I - \omega _k s_k y_k^T \bigg ) H_k \bigg ( I - \omega _k y_k s_k^T \bigg ) + \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] s_k s_k^T \end{aligned}$$
(91)

concluding the proof.
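To make the algebra above concrete, the following short NumPy sketch checks numerically that the expanded form (89) and the completed-square form (91) agree, and that the update interpolates between the standard BFGS update (\(\beta _k \rightarrow \infty\)) and leaving \(H_k\) unchanged (\(\beta _k \rightarrow 0\)). It assumes \(\omega _k = 1/(y_k^T s_k + 2/\beta _k)\) and \(\gamma _k = 1/(y_k^T s_k + 1/\beta _k)\), which is the identification suggested by comparing (90) with (91); the definitions in (28) are not reproduced here, so this is a sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random symmetric positive definite inverse Hessian approximation H_k,
# and a step / gradient-difference pair with positive curvature.
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk

def spbfgs_91(Hk, sk, yk, beta):
    """SP-BFGS update in the completed-square form (91)."""
    sty = sk @ yk
    omega = 1.0 / (sty + 2.0 / beta)   # assumed form of omega_k
    gamma = 1.0 / (sty + 1.0 / beta)   # assumed form of gamma_k
    G = np.eye(len(sk)) - omega * np.outer(yk, sk)   # I - omega_k * y_k s_k^T
    return G.T @ Hk @ G + omega * (gamma / omega
                                   + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

def spbfgs_89(Hk, sk, yk, beta):
    """Same update written as in (89)."""
    sty = sk @ yk
    a = sty + 2.0 / beta
    b = sty + 1.0 / beta
    Hy = Hk @ yk
    return (Hk - (np.outer(Hy, sk) + np.outer(sk, Hy)) / a
            + (a + yk @ Hy) / (a * b) * np.outer(sk, sk))

assert np.allclose(spbfgs_91(Hk, sk, yk, 1.0), spbfgs_89(Hk, sk, yk, 1.0))

# beta_k -> infinity recovers the standard BFGS update; beta_k -> 0 leaves H_k unchanged.
rho = 1.0 / (sk @ yk)
Gb = np.eye(n) - rho * np.outer(yk, sk)
H_bfgs = Gb.T @ Hk @ Gb + rho * np.outer(sk, sk)
assert np.allclose(spbfgs_91(Hk, sk, yk, 1e12), H_bfgs, atol=1e-6)
assert np.allclose(spbfgs_91(Hk, sk, yk, 1e-12), Hk, atol=1e-6)
print("SP-BFGS update forms (89) and (91) agree; BFGS and no-update limits recovered")
```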

Appendix 2: Proof of Lemma 1

The \(H_{k+1}\) given by (27) has the general form

$$\begin{aligned} H_{k+1} = G^T H_k G + d s_k s_k^T \end{aligned}$$
(92)

with the specific choices

$$\begin{aligned} G = I - \omega _k y_k s_k^T \text { , } \quad d = \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] \text { . } \end{aligned}$$
(93)

By definition, \(H_{k+1}\) is positive definite if

$$\begin{aligned} v^T H_{k+1} v > 0 \text { , } \quad \forall v \in {\mathbb {R}}^{n} \setminus 0 \text{ . } \end{aligned}$$
(94)

We first show that (29) is a sufficient condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. By applying (92) to (94), we see that

$$\begin{aligned} v^T \bigg ( G^T H_k G + d s_k s_k^T \bigg ) v > 0 \text { , } \quad \forall v \in {\mathbb {R}}^{n} \setminus 0 \end{aligned}$$
(95)

must be true for the choices of G and d in (93) if \(H_{k+1}\) is positive definite. Substituting (93) into (95) reveals that

$$ \begin{aligned} \bigg ( v - \omega_k (s_k^T v) y_k \bigg )^T H_k \bigg ( v - \omega _k (s_k^T v) y_k \bigg ) + \omega _k \bigg [ \frac{\gamma _k}{\omega _k} + (\gamma _k - \omega _k) y_k^T H_k y_k \bigg ] (s_k^T v)^2 > 0 \end{aligned}$$
(96)

must be true for all \(v \in {\mathbb {R}}^{n} \setminus 0\) if \(H_{k+1}\) is positive definite. Both \((s_k^T v)^2\) and \(v^T G^T H_k G v\) are always nonnegative. To see that \(v^T G^T H_k G v \ge 0\), note that because \(H_k\) is positive definite, it has a principal square root \(H_k^{1/2}\), and so

$$\begin{aligned} v^T G^T H_k G v = v^T G^T H_k^{1/2} H_k^{1/2} G v = \left\| H_k^{1/2} G v\right\| _2^2 \ge 0 \text{ . } \end{aligned}$$
(97)

We now observe that if \(d > 0\), the right term \(d (s_k^T v)^2\) in (96) is zero if and only if \((s_k^T v) = 0\). However, if \((s_k^T v) = 0\), then the left term \(v^T G^T H_k G v\) in (96) is zero only when \(v = 0\). Hence, the condition \(d > 0\) guarantees that (96) is true for all v excluding the zero vector, and thus that \(H_{k+1}\) is positive definite. The condition \(d > 0\) expands to

$$\begin{aligned} \gamma _k + \omega _k (\gamma _k - \omega _k) y_k^T H_k y_k > 0 \text{ . } \end{aligned}$$
(98)

Using the definitions of \(\gamma _k\) and \(\omega _k\) in (28), it is clear that \((\gamma _k - \omega _k) \ge 0\), as \(\beta _k\) can only take nonnegative values. Furthermore, as \(H_k\) is positive definite, \(y_k^T H_k y_k \ge 0\) for all \(y_k\). As it is possible for \((\gamma _k - \omega _k) y_k^T H_k y_k\) to be zero, we require \(\gamma _k > 0\). The condition \(\gamma _k > 0\) immediately gives (29), as \(\gamma _k\) can only be positive if the denominator in its definition is positive. Finally, as \(\beta _k\) can only take nonnegative values, (29) also ensures that \(\omega _k\) is nonnegative, and so when (29) is true, \(\omega _k (\gamma _k - \omega _k) y_k^T H_k y_k \ge 0\). In summary, we have shown that the condition (29) ensures that the left term in (98) is positive, and the right term nonnegative, so \(d > 0\), and thus \(H_{k+1}\) is positive definite.

We now show that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite. If \(H_{k+1}\) is positive definite, then

$$\begin{aligned} y_k^T H_{k+1} y_k > 0 \end{aligned}$$
(99)

assuming \(y_k \ne 0\). Substituting (26b) into (99) gives

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{W^{-1} u}{\beta _k} \bigg ] > 0 \end{aligned}$$
(100)

and using (86) shows that (100) is equivalent to

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{\gamma _k (H_k y_k - s_k)}{\beta _k} \bigg ] > 0 \text { . } \end{aligned}$$
(101)

Now, some algebra shows that

$$\begin{aligned} \begin{aligned} y_k^T \bigg [ s_{k} + \frac{\gamma _k (H_k y_k - s_k)}{\beta _k} \bigg ]&= y_k^T s_{k} + \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg [ y_k^T H_k y_k - y_k^T s_{k} \bigg ] \\&= \bigg ( 1 - \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T s_{k} + \bigg ( \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T H_k y_k \\&= \bigg ( \frac{\beta _k y_k^T s_{k}}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T s_{k} + \bigg ( \frac{1}{1 + \beta _k y_k^T s_{k}} \bigg ) y_k^T H_k y_k \\&= \frac{\beta _k (y_k^T s_{k})^2 + y_k^T H_k y_k}{1 + \beta _k y_k^T s_{k}} \end{aligned} \end{aligned}$$
(102)

and we also know that because \(H_k\) is positive definite, \(y_k^T H_k y_k > 0\) for all \(y_k \ne 0\), by definition \(\beta _k \ge 0\), and by the definition of the square of a real number, \((y_k^T s_{k})^2 \ge 0\). As a result,

$$\begin{aligned} y_k^T \bigg [ s_{k} + \frac{W^{-1} u}{\beta _k} \bigg ] = \frac{\beta _k (y_k^T s_{k})^2 + y_k^T H_k y_k}{1 + \beta _k y_k^T s_{k}} > 0 \end{aligned}$$
(103)

is guaranteed only if the denominator \(1 + \beta _k y_k^T s_{k}\) is positive, which occurs when

$$\begin{aligned} s_k^T y_k > - \frac{1}{\beta _k} \text { . } \end{aligned}$$
(104)

This establishes that (29) is a necessary condition for \(H_{k+1}\) to be positive definite, given that \(H_k\) is positive definite, and concludes the proof.
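The role of the relaxed curvature condition can be illustrated numerically. The sketch below assumes that (29) is the inequality \(s_k^T y_k > -1/\beta _k\) recovered in (104) and reuses the \(\omega _k, \gamma _k\) identification assumed in the sketch of Appendix 1; it constructs pairs \((s_k, y_k)\) with \(s_k^T y_k\) just above and just below the threshold \(-1/\beta _k\) and reports the smallest eigenvalue of \(H_{k+1}\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)      # positive definite H_k
beta = 2.0                        # threshold is s_k^T y_k > -1/2

def spbfgs_update(Hk, sk, yk, beta):
    sty = sk @ yk
    omega = 1.0 / (sty + 2.0 / beta)
    gamma = 1.0 / (sty + 1.0 / beta)
    G = np.eye(len(sk)) - omega * np.outer(yk, sk)
    return G.T @ Hk @ G + omega * (gamma / omega
                                   + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

sk = rng.standard_normal(n)
d = rng.standard_normal(n)
d -= (d @ sk) / (sk @ sk) * sk    # component of d orthogonal to s_k

for target in (-1.0 / beta + 0.1, -1.0 / beta - 0.1):
    yk = d + target / (sk @ sk) * sk          # constructed so that s_k^T y_k == target
    Hnext = spbfgs_update(Hk, sk, yk, beta)
    print(f"s_k'y_k = {sk @ yk:+.3f}, min eigenvalue of H_(k+1) = "
          f"{np.linalg.eigvalsh(Hnext).min():+.3e}")
```

In line with Lemma 1, the first case (threshold satisfied) produces a positive definite \(H_{k+1}\) despite the negative curvature measurement, while the second case does not.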

Appendix 3: Proof of Theorem 2

The Sherman-Morrison-Woodbury formula says

$$\begin{aligned} (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} \text { . } \end{aligned}$$
(105)

Now, observe that the SP-BFGS update (27) can be written in the factored form

$$\begin{aligned} H_{k+1} = H_k + \omega _k \big [ s_k \quad H_k y_k \big ] \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \text { . } \end{aligned}$$
(106)

Applying the Sherman-Morrison-Woodbury formula (105) to the factored SP-BFGS update (106) with

$$\begin{aligned} A &= H_k \text { , } \\ U &= \omega _k \big [ s_k \quad H_k y_k \big ] \text { , } \\ C &= \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] , \\ V &= \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \end{aligned}$$

yields

$$ \begin{aligned} H_{k+1}^{-1} = H_k^{-1} - H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \bigg ( C^{-1} + V H_k^{-1} U \bigg )^{-1} \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \text { . } \end{aligned}$$

Inverting C here gives

$$\begin{aligned} C^{-1} = \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] ^{-1} = \left[ \begin{array}{cc} 0 &{} -1 \\ -1 &{} -\gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \end{aligned}$$

and we also have

$$\begin{aligned} \begin{aligned} V H_k^{-1} U&= \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \\ &= \omega _k \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \big [ H_k^{-1} s_k \quad y_k \big ] \\ &= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k & \omega _k s_k^T y_k \\ \omega _k y_k^T s_k & \omega _k y_k^T H_k y_k \end{array} \right] \end{aligned} \end{aligned}$$

which is just a \(2 \times 2\) matrix with real entries. Now, it becomes clear that

$$\begin{aligned} \begin{aligned} (C^{-1} + V H_k^{-1} U)&= \bigg ( \left[ \begin{array}{cc} \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} -1 \\ -1 &{} 0 \end{array} \right] ^{-1} + \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] H_k^{-1} \omega _k \big [ s_k \quad H_k y_k \big ] \bigg ) \\&= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k &{} -1 + \omega _k s_k^T y_k \\ -1 + \omega _k y_k^T s_k &{} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \text { . } \end{aligned} \end{aligned}$$

For notational compactness, let

$$\begin{aligned} D &= (C^{-1} + V H_k^{-1} U) \\ &= \left[ \begin{array}{cc} \omega _k s_k^T H_k^{-1} s_k &{} -1 + \omega _k s_k^T y_k \\ -1 + \omega _k y_k^T s_k &{} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) \end{array} \right] \end{aligned}$$

so

$$\begin{aligned} D^{-1} = \frac{1}{\det (D)} \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) &{} 1 - \omega _k s_k^T y_k \\ 1 - \omega _k y_k^T s_k &{} \omega _k s_k^T H_k^{-1} s_k \end{array} \right] \end{aligned}$$

where the determinant of D is

$$ \begin{aligned} \begin{aligned} \det (D) &= \bigg ( \omega _k y_k^T H_k y_k - \gamma _k \bigg ( \frac{1}{\omega _k} + y_k^T H_k y_k \bigg ) \bigg ) \bigg ( \omega _k s_k^T H_k^{-1} s_k \bigg ) - (1 - \omega _k y_k^T s_k)^2 \\ &= \bigg ( (\omega _k - \gamma _k ) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \bigg ) \bigg ( \omega _k s_k^T H_k^{-1} s_k \bigg ) - (1 - \omega _k y_k^T s_k)^2 \end{aligned} \end{aligned}$$

and we have used the fact that \(y_k^T s_k = s_k^T y_k\), as this is a scalar quantity. Next,

$$\begin{aligned} \begin{aligned} U \det (D) D^{-1} V &= U \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k - \gamma _k (\frac{1}{\omega _k} + y_k^T H_k y_k) &{} 1 - \omega _k s_k^T y_k \\ 1 - \omega _k y_k^T s_k &{} \omega _k s_k^T H_k^{-1} s_k \end{array} \right] \left[ \begin{array}{c} s_k^T \\ y_k^T H_k \end{array} \right] \\ &= U \left[ \begin{array}{cc} \omega _k y_k^T H_k y_k s_k^T - \gamma _k (\frac{1}{\omega _k} + y_k^T H_k y_k) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \\ (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \end{array} \right] \end{aligned} \end{aligned}$$

so \(U \det (D) D^{-1} V\) fully expanded becomes

$$\tiny \begin{aligned} U \det (D) D^{-1} V = \omega _k \big [ s_k \big ( \omega _k y_k^T H_k y_k s_k^T - \gamma_k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \big ) + H_k y_k \big ( (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \big ) \big ] \text { . } \end{aligned}$$

This looks rather ugly at the moment, but we continue by breaking the problem down further, noting that

$$\tiny \begin{aligned} s_k \big ( \omega _k y_k^T H_k y_k s_k^T - \gamma _k \big ( \frac{1}{\omega _k} + y_k^T H_k y_k \big ) s_k^T + (1 - \omega _k s_k^T y_k) y_k^T H_k \big ) = \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) s_k s_k^T + (1 - \omega _k s_k^T y_k) s_k y_k^T H_k \end{aligned}$$

and

$$ \begin{aligned} H_k y_k \bigg ( (1 - \omega _k y_k^T s_k) s_k^T + \omega _k s_k^T H_k^{-1} s_k y_k^T H_k \bigg ) = (1 - \omega _k y_k^T s_k) H_k y_k s_k^T + \omega _k H_k y_k (s_k^T H_k^{-1} s_k) y_k^T H_k \text { . } \end{aligned}$$

The above intermediate results further simplify \(U \det (D) D^{-1} V\) to

$$\tiny \begin{aligned} U \det (D) D^{-1} V = \omega_k \big [ \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) s_k s_k^T + (1 - \omega _k s_k^T y_k) ( s_k y_k^T H_k + H_k y_k s_k^T ) + \omega _k H_k y_k (s_k^T H_k^{-1} s_k) y_k^T H_k \big ] \text { . } \end{aligned}$$

Left and right multiplying the line immediately above by \(A^{-1} = H_k^{-1}\) gives

$$\tiny \begin{aligned} H_k^{-1} U \det (D) D^{-1} V H_k^{-1} = \omega _k \big [ \big ( (\omega _k - \gamma _k) y_k^T H_k y_k - \frac{\gamma _k}{\omega _k} \big ) H_k^{-1} s_k s_k^T H_k^{-1} + (1 - \omega _k s_k^T y_k) ( H_k^{-1} s_k y_k^T + y_k s_k^T H_k^{-1} ) + \omega _k y_k (s_k^T H_k^{-1} s_k) y_k^T \big ] \end{aligned}$$

and thus, after dividing out \(\det (D)\) and applying \(B_{k} = H_{k}^{-1}\), we arrive at the following final formula

$$\tiny \begin{aligned} B_{k+1} = B_k - \frac{\omega _k \big [ \big ( (\omega _k - \gamma _k) y_k^T B_k^{-1} y_k - \frac{\gamma _k}{\omega _k} \big ) B_k s_k s_k^T B_k + (1 - \omega _k s_k^T y_k) ( B_k s_k y_k^T + y_k s_k^T B_k ) + \omega _k (s_k^T B_k s_k) y_k y_k^T \big ]}{\big ( (\omega _k - \gamma _k) y_k^T B_k^{-1} y_k - \frac{\gamma _k}{\omega _k} \big ) \big ( \omega _k s_k^T B_k s_k \big ) - (1 - \omega _k y_k^T s_k)^2} \end{aligned}$$
(107)

for the SP-BFGS inverse update, which concludes the proof.
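As a sanity check on the algebra, the sketch below forms \(H_{k+1}\) from the direct update (91) and \(B_{k+1}\) from the inverse formula (107), with the two bracketed scalar factors written out as \({\hat{D}}\) and \({\hat{E}}\) (matching (111) in Appendix 4), and then verifies that the two matrices are indeed inverses of each other; the \(\omega _k, \gamma _k\) identification is the same one assumed in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
Bk = np.linalg.inv(Hk)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk
beta = 3.0

sty = sk @ yk
omega = 1.0 / (sty + 2.0 / beta)
gamma = 1.0 / (sty + 1.0 / beta)

# H_{k+1} from the direct SP-BFGS update (91).
G = np.eye(n) - omega * np.outer(yk, sk)
Hnext = G.T @ Hk @ G + omega * (gamma / omega
                                + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)

# B_{k+1} from the inverse update formula (107).
Dhat = (omega - gamma) * (yk @ Hk @ yk) - gamma / omega   # y_k^T B_k^{-1} y_k = y_k^T H_k y_k
Ehat = 1.0 - omega * sty
Bs = Bk @ sk
numerator = (Dhat * np.outer(Bs, Bs)
             + Ehat * (np.outer(Bs, yk) + np.outer(yk, Bs))
             + omega * (sk @ Bs) * np.outer(yk, yk))
Bnext = Bk - omega * numerator / (Dhat * (omega * (sk @ Bs)) - Ehat ** 2)

assert np.allclose(Bnext, np.linalg.inv(Hnext))
print("Inverse update (107) matches the inverse of the direct update (91)")
```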

Appendix 4: Proof of Theorem 3

Referring to Theorem 2, taking the trace of both sides of (107) and applying the linearity and cyclic invariance properties of the trace yields

$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1}) = \kappa _1 {{\,\textrm{Tr}\,}}(B_k) + \kappa _2 \left\| B_k s_k\right\| _2^2 + 2 \kappa _3 (y_k^T B_k s_k) + \kappa _4 \left\| y_k\right\| _2^2 \end{aligned}$$
(108)

where

$$\begin{aligned} \kappa _1 = 1 \text { , } \quad \kappa _2 = - \frac{\omega _k {\hat{D}}}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text { , } \end{aligned}$$
(109)
$$\begin{aligned} \kappa _3 = - \frac{\omega _k {\hat{E}}}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text { , } \quad \kappa _4 = - \frac{(\omega _k)^2 s_k^T B_k s_k}{[ {\hat{D}} (\omega _k s_k^T B_k s_k) - ({\hat{E}})^2 ]} \text{ , } \end{aligned}$$
(110)

with \({\hat{D}}\) and \({\hat{E}}\) defined as

$$\begin{aligned} {\hat{D}} = \bigg [ (\omega _k - \gamma _k) (y_k^T B_k^{-1} y_k) - \frac{\gamma _k}{\omega _k} \bigg ] \text { , } \quad {\hat{E}} = (1 - \omega _k s_k^T y_k ) = \frac{2 \omega _k}{\beta _k} \text { . } \end{aligned}$$
(111)

We now observe that after applying some basic algebra, and recalling that \(B_k\) is positive definite, one can deduce that for all \(\beta _k \in [0, +\infty ]\), the following inequalities hold

$$\begin{aligned} (\omega _k - \gamma _k) \le 0 \text { , } \quad 1 \le \frac{\gamma _k}{\omega _k} \text { , } \quad {\hat{D}} \le -1 \text { , } \quad 0 \le \frac{2 \omega _k}{\beta _k} \le 1 \text { . } \end{aligned}$$
(112)

By minimizing the absolute value of the common denominator in \(\kappa _2, \kappa _3\), and \(\kappa _4\) using the inequalities above, one can obtain the bounds

$$\begin{aligned} - \frac{1}{s_k^T B_k s_k} \le \kappa _2 \le 0 \text { , } \qquad 0 \le \kappa _4 \le \omega _k \le \gamma _k \text{ , } \end{aligned}$$
(113)
$$\begin{aligned} 0 \le \kappa _3 \le \frac{2 \omega _k}{\beta _k} \frac{1}{s_k^T B_k s_k + \frac{2 \omega _k}{\beta _k} \frac{2}{\beta _k}} \le \frac{\beta _k}{2} \text { . } \end{aligned}$$
(114)

As a result,

$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1})&\le {{\,\textrm{Tr}\,}}(B_k) + 2 \kappa _3 | y_k^T B_k s_k | + \kappa _4 \left\| y_k\right\| _2^2 \end{aligned}$$
(115)
$$\begin{aligned} {{\,\textrm{Tr}\,}}(B_{k+1}) \le {{\,\textrm{Tr}\,}}(B_k) + \beta _k \left\| y_k\right\| _2 \lambda _{max}(B_k) \left\| s_k\right\| _2 + \gamma _k \left\| y_k\right\| _2^2 \end{aligned}$$
(116)

and applying \(\lambda _{max}(B_k) < {{\,\textrm{Tr}\,}}(B_k)\) establishes (53). Similarly, referring to (89) reveals the upper bound

$$ \begin{aligned} {{\,\textrm{Tr}\,}}(H_{k+1}) \le {{\,\textrm{Tr}\,}}(H_k) + 2 \omega _k | y_k^T H_k s_k | + \big [ \gamma _k + \omega _k \gamma _k (y_k^T H_k y_k) \big ] \left\| s_k\right\|_2^2 \text { . } \end{aligned}$$
(117)

To establish (52), we apply \(\lambda _{max}(H_k) < {{\,\textrm{Tr}\,}}(H_k)\) and \(\omega _k \le \gamma _k\) to the line above, and then factor. This completes the proof.
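A quick numerical check of the trace bound is also possible. The sketch below evaluates both sides of (116) (whose right-hand side is only enlarged by replacing \(\lambda _{max}(B_k)\) with \({{\,\textrm{Tr}\,}}(B_k)\) to obtain (53)) for several penalty parameters \(\beta _k\), again using the \(\omega _k, \gamma _k\) identification assumed in the earlier sketches; the problem data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
Hk = A @ A.T + n * np.eye(n)
Bk = np.linalg.inv(Hk)
sk = rng.standard_normal(n)
yk = rng.standard_normal(n)
if sk @ yk < 0:
    yk = -yk
sty = sk @ yk

for beta in (0.1, 1.0, 10.0, 1e6):
    omega = 1.0 / (sty + 2.0 / beta)
    gamma = 1.0 / (sty + 1.0 / beta)
    G = np.eye(n) - omega * np.outer(yk, sk)
    Hnext = G.T @ Hk @ G + omega * (gamma / omega
                                    + (gamma - omega) * (yk @ Hk @ yk)) * np.outer(sk, sk)
    Bnext = np.linalg.inv(Hnext)
    # Right-hand side of (116): Tr(B_k) + beta*||y||*lambda_max(B_k)*||s|| + gamma*||y||^2.
    rhs = (np.trace(Bk)
           + beta * np.linalg.norm(yk) * np.linalg.eigvalsh(Bk).max() * np.linalg.norm(sk)
           + gamma * (yk @ yk))
    print(f"beta_k = {beta:8.1e}:  Tr(B_(k+1)) = {np.trace(Bnext):10.4f}  <=  bound = {rhs:10.4f}")
```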

Appendix 5: Proof of Lemma 2

As \(\phi\) is m-strongly convex due to Assumption 3, it is true that

$$\begin{aligned} \phi (y) \ge \phi (x) + \nabla \phi (x)^T (y - x) + \frac{m}{2} \left\| y - x\right\| _2^2 \text { , } \quad \forall x, y \in {\mathbb {R}}^n \text { . } \end{aligned}$$
(118)

Note that for any fixed x, the right side of (118) provides a global quadratic lower bound on \(\phi\). As these bounds are global lower bounds, minimizing both sides of (118) with respect to y preserves the inequality, so

$$\begin{aligned} \min _{y} \bigg \{ \phi (y) \bigg \} \ge \min _{y} \bigg \{ \phi (x) + \nabla \phi (x)^T (y - x) + \frac{m}{2} \left\| y - x\right\| _2^2 \bigg \} \end{aligned}$$
(119)

which simplifies to

$$\begin{aligned} \phi ^{\star } \ge \phi (x) - \frac{1}{2 m} \left\| \nabla \phi (x)\right\| _2^2 \text { . } \end{aligned}$$
(120)

Proceeding, the inner product condition \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) expands to

$$\begin{aligned} \nabla \phi (x)^T H g(x) = \nabla \phi (x)^T H \nabla \phi (x) + \nabla \phi (x)^T H e(x) > \xi \left\| \nabla \phi (x)\right\| _2 \text { . } \end{aligned}$$
(121)

The quantity \(\nabla \phi (x)^T H \nabla \phi (x)\) is bounded below by

$$\begin{aligned} \nabla \phi (x)^T H \nabla \phi (x) \ge \psi \left\| \nabla \phi (x)\right\| _2^2 \text { . } \end{aligned}$$
(122)

By applying the Cauchy-Schwarz inequality and Assumption 2, the quantity \(\nabla \phi (x)^T H e(x)\) is bounded below by

$$\begin{aligned} \nabla \phi (x)^T H e(x) \ge - {\varPsi } \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 \ge - {\varPsi } \left\| \nabla \phi (x)\right\| _2 {\bar{\epsilon }}_g \text { . } \end{aligned}$$
(123)

Thus, we see that if

$$\begin{aligned} \psi \left\| \nabla \phi (x)\right\| _2^2 - {\varPsi } \left\| \nabla \phi (x)\right\| _2 {\bar{\epsilon }}_g > \xi \left\| \nabla \phi (x)\right\| _2 \text { , } \end{aligned}$$
(124)

which rearranges to

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 > \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { , } \end{aligned}$$
(125)

then \(\nabla \phi (x)^T H g(x) > \xi \left\| \nabla \phi (x)\right\| _2\) is guaranteed. Note that (125) implies

$$\begin{aligned} \nabla \phi (x)^T H g(x)> \xi \left\| \nabla \phi (x)\right\| _2 > \xi \bigg [ \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \bigg ] \end{aligned}$$
(126)

when combined with the inner product condition. Combining (125) with Assumption 2 and the definition of the gradient noise to signal ratio \(\delta (x)\) given by (58) reveals that

$$\begin{aligned} \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } < \left\| \nabla \phi (x)\right\| _2 = \frac{\left\| e(x)\right\| _2}{\delta (x)} \le \frac{{\bar{\epsilon }}_g}{\delta (x)} \end{aligned}$$
(127)

and so \(\delta (x) < \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \le \frac{\psi }{{\varPsi }}\).

Contrapositively, if \(\nabla \phi (x)^T H g(x) \le \xi \left\| \nabla \phi (x)\right\| _2\), then

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { , } \end{aligned}$$
(128)

or if \(\delta (x) \ge \frac{\psi {\bar{\epsilon }}_g}{{\varPsi } {\bar{\epsilon }}_g + \xi } \ge 0\), then

$$\begin{aligned} \left\| \nabla \phi (x)\right\| _2 \le \bigg ( \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi {\bar{\epsilon }}_g} \bigg ) \left\| e(x)\right\| _2 \le \frac{{\varPsi } {\bar{\epsilon }}_g + \xi }{\psi } \text { . } \end{aligned}$$
(129)

Squaring either inequality (128) or (129) and then combining it with a rearranged (120) given by

$$\begin{aligned} \phi (x) - \phi ^{\star } \le \frac{1}{2 m} \left\| \nabla \phi (x)\right\| _2^2 \end{aligned}$$
(130)

gives \({\mathcal {N}}_1(\psi ,{\varPsi },\xi )\), completing the proof.
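The inequality (120) is the familiar Polyak-Lojasiewicz-type consequence of strong convexity, and it is the one step above that is easy to check numerically in isolation. The sketch below does so for a randomly generated m-strongly convex quadratic; the quadratic itself is purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)            # Hessian of an m-strongly convex quadratic
m = np.linalg.eigvalsh(Q).min()    # strong convexity constant
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ Q @ x + b @ x
grad = lambda x: Q @ x + b
phi_star = phi(np.linalg.solve(Q, -b))   # minimum value of phi

for _ in range(5):
    x = rng.standard_normal(n)
    # Inequality (120): phi_star >= phi(x) - ||grad phi(x)||^2 / (2m).
    assert phi_star >= phi(x) - np.linalg.norm(grad(x)) ** 2 / (2 * m)
print("Inequality (120) holds at all sampled points")
```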

Appendix 6: Proof of Lemma 3

Similar to (122) and (123), by using the definition of \(\delta (x)\), the lower bound

$$ \begin{aligned} \nabla \phi (x)^T H g(x) \ge \psi \left\| \nabla \phi (x)\right\| _2^2 - {\varPsi } \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 = \big ( \psi - {\varPsi } \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \end{aligned}$$
(131)

and the upper bound

$$ \begin{aligned} \varepsilon \big ( 1 + \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 = \varepsilon \big ( \left\| \nabla \phi (x)\right\| _2^2 + \left\| \nabla \phi (x)\right\| _2 \left\| e(x)\right\| _2 \big ) \ge \varepsilon \nabla \phi (x)^T g(x) \end{aligned}$$
(132)

can be established. Observe that if the lower bound (131) is always greater than or equal to the upper bound (132)

$$\begin{aligned} \big ( \psi - {\varPsi } \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \ge \varepsilon \big ( 1 + \delta (x) \big ) \left\| \nabla \phi (x)\right\| _2^2 \text { , } \end{aligned}$$
(133)

it implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). Hence, the condition

$$\begin{aligned} \varepsilon \le \frac{\big ( \psi - {\varPsi } \delta (x) \big )}{\big ( 1 + \delta (x) \big )} \end{aligned}$$
(134)

implies that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\). By applying Lemma 2, we see that for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), it is true that \(\delta (x) < \frac{\psi }{(1 + A) {\varPsi }}\). Thus, setting

$$\begin{aligned} \varepsilon < \frac{\big ( \psi - \frac{\psi }{(1 + A) } \big ) }{ \big ( 1 + \frac{\psi }{(1 + A) {\varPsi } } \big ) } = \frac{ A \psi {\varPsi }}{\big ( (1+A) {\varPsi } + \psi \big )} \end{aligned}$$
(135)

guarantees that \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\) for all \(x \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), completing the proof.
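The implication proved above can be spot-checked numerically. The sketch below draws a symmetric \(H\) with spectrum inside \([\psi , {\varPsi }]\), fixes a gradient noise-to-signal ratio \(\delta\) with \(\psi - {\varPsi } \delta > 0\), sets \(\varepsilon\) equal to the right-hand side of (134), and confirms \(\nabla \phi (x)^T H g(x) \ge \varepsilon \nabla \phi (x)^T g(x)\) on random samples; the specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
psi, Psi = 0.5, 2.0
delta = 0.2                                    # noise-to-signal ratio, with psi - Psi*delta > 0

# Random symmetric H with eigenvalues in [psi, Psi].
Qm, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Qm @ np.diag(rng.uniform(psi, Psi, n)) @ Qm.T

eps = (psi - Psi * delta) / (1.0 + delta)      # condition (134) taken with equality

for _ in range(1000):
    grad_phi = rng.standard_normal(n)
    e = rng.standard_normal(n)
    e *= delta * np.linalg.norm(grad_phi) / np.linalg.norm(e)   # ||e|| = delta * ||grad phi||
    g = grad_phi + e                                            # noisy gradient measurement
    assert grad_phi @ H @ g >= eps * (grad_phi @ g) - 1e-12
print("grad_phi' H g >= eps * grad_phi' g held on all samples")
```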

Appendix 7: Proof of Theorem 4

As \(\phi \in C^2\) by Assumption 3, applying Taylor’s theorem and using (62) and strong convexity gives

$$\begin{aligned} \phi _{k+1}&= \phi _k + \nabla \phi _k^T [ x_{k+1} - x_{k} ] + \frac{1}{2} [ x_{k+1} - x_{k} ]^T \nabla ^2 \phi (u) [ x_{k+1} - x_{k} ] \\&\le \phi _k - \alpha \nabla \phi _k^T H_k g_k + \frac{\alpha ^2 M}{2} \left\| H_k g_k\right\| _2^2 \end{aligned}$$

where u is some convex combination of \(x_{k+1}\) and \(x_{k}\). Proceeding, note that the smallest possible region \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) from Lemma 3 occurs with the choice \(\psi = {\varPsi }\). In this case \(H = {\varPsi } I\), and (59) from Lemma 2 becomes

$$\begin{aligned} \nabla \phi _k^T g_k> A {\bar{\epsilon }}_g \bigg [ (1+A) {\bar{\epsilon }}_g \bigg ] > 0 \end{aligned}$$
(136)

and so \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi = {\varPsi },{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Hence, for all possible choices of \(0 < \psi \le {\varPsi }\) in \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have \(\nabla \phi _k^T g_k > 0\) if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Combining this with Lemma 3 gives

$$\begin{aligned} \nabla \phi _k^T H_k g_k \ge \varepsilon \nabla \phi _k^T g_k > 0 \end{aligned}$$
(137)

if \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). With (137) in hand, continuing to bound terms gives

$$ \begin{aligned} \phi _{k+1} &\le \phi _k - \alpha \varepsilon \nabla \phi _k^T [ \nabla \phi _k + e_k ] + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| \nabla \phi _k + e_k\right\| _2^2 \\ &= \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \nabla \phi _k^T e_k + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\| _2^2 \\ & \le \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 + \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \left\| \nabla \phi _k\right\|_2 \left\| e_k\right\| _2 + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\| _2^2 \\ & \le \phi _k - \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \frac{\alpha {\varPsi } M}{2} \bigg ) \left\| \nabla \phi _k\right\| _2^2 + \alpha {\varPsi } \bigg ( \frac{\varepsilon }{{\varPsi }} - \alpha {\varPsi } M \bigg ) \bigg [ \frac{1}{2} \left\| \nabla \phi _k\right\| _2^2 + \frac{1}{2} \left\| e_k\right\|_2^2 \bigg ] + \frac{\alpha ^2 {\varPsi }^2 M}{2} \left\| e_k\right\|_2^2 \end{aligned}$$

where the last inequality follows from expanding

$$\begin{aligned} 0 \leq \bigg ( \frac{1}{\sqrt{2}} \left\| \nabla \phi _k\right\| _2 - \frac{1}{\sqrt{2}} \left\| e_k\right\| _2 \bigg )^2 = \frac{1}{2} \left\| \nabla \phi _k\right\| _2^2 - \left\| \nabla \phi _k\right\| _2 \left\| e_k\right\| _2 + \frac{1}{2} \left\| e_k\right\| _2^2 \end{aligned}$$
(138)

and using \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) in (63). Simplifying the last inequality reveals that

$$\begin{aligned} \phi _{k+1} \le \phi _k - \frac{\alpha \varepsilon }{2} \left\| \nabla \phi _k\right\| _2^2 + \frac{\alpha \varepsilon }{2} \left\| e_k\right\| _2^2 \text { . } \end{aligned}$$
(139)

Since \(\phi\) is m-strongly convex by Assumption 3, we can apply

$$\begin{aligned} \left\| \nabla \phi _k\right\| _2^2 \ge 2 m ( \phi _k - \phi ^{\star } ) \end{aligned}$$
(140)

which comes from rearranging (120) in the proof of Lemma 2 (see Appendix 5). Combining (140) with (139) and Assumption 2 gives

$$\begin{aligned} \phi _{k+1} \le \phi _k - \alpha \varepsilon m ( \phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{(1+A) {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \text { . } \end{aligned}$$
(141)

Subtracting \(\phi ^{\star }\) from both sides, and using the notation \({\tilde{A}} :=(1+A)\), we get

$$\begin{aligned} \phi _{k+1} - \phi ^{\star } \le (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \end{aligned}$$
(142)

which, by subtracting \(\frac{1}{2 m} \big ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \big )^2\) from both sides and simplifying, gives

$$\begin{aligned} \phi _{k+1} - \phi ^{\star } - \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2&\le (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + \frac{\alpha \varepsilon }{2} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 - \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \\&= (1 - \alpha \varepsilon m) (\phi _k - \phi ^{\star } ) + (\alpha \varepsilon m - 1) \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \\&= (1 - \alpha \varepsilon m) \bigg ( \phi _k - \bigg [ \phi ^{\star } + \frac{1}{2 m} \bigg ( \frac{{\tilde{A}} {\varPsi } {\bar{\epsilon }}_g}{\psi } \bigg )^2 \bigg ] \bigg ) \end{aligned}$$

thus establishing the Q-linear result. We obtain the R-linear result (64) by recursively applying the worst case bound given by the Q-linear result, noting that in the worst case if \(x_0 \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), then the sequence of iterates \(\{ x_k \}\) remains outside of \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), only approaching \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\) in the limit \(k \rightarrow \infty\).
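Although the constants in the theorem are conservative, the qualitative behavior (linear decrease of the optimality gap down to a noise-determined floor) is easy to observe in a toy simulation. The sketch below runs the fixed-step iteration \(x_{k+1} = x_k - \alpha H_k g_k\) with \(H_k = I\) (so \(\psi = {\varPsi } = 1\) and the inequality \(\nabla \phi _k^T H_k g_k \ge \varepsilon \nabla \phi _k^T g_k\) holds trivially with \(\varepsilon = 1\)), step size \(\alpha = 1/M\), and uniformly bounded gradient noise on a strongly convex quadratic; all problem data are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                       # Hessian of a strongly convex quadratic
b = rng.standard_normal(n)
M = np.linalg.eigvalsh(Q).max()               # Lipschitz constant of the gradient

phi = lambda x: 0.5 * x @ Q @ x + b @ x
grad = lambda x: Q @ x + b
phi_star = phi(np.linalg.solve(Q, -b))

eps_g = 1e-2                                  # uniform bound on the gradient noise
alpha = 1.0 / M                               # satisfies alpha <= eps/(M*Psi^2) with eps = Psi = 1
x = 10.0 * rng.standard_normal(n)

for k in range(201):
    if k % 50 == 0:
        print(f"k = {k:3d}   phi(x_k) - phi* = {phi(x) - phi_star:.3e}")
    e = rng.standard_normal(n)
    e *= eps_g * rng.uniform() / np.linalg.norm(e)   # ||e_k|| <= eps_g
    x = x - alpha * (grad(x) + e)                    # H_k = I

```

The printed optimality gap shrinks rapidly at first and then stagnates at a small value, consistent with convergence to a neighborhood of the minimizer whose size is governed by the noise level.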

Appendix 8: Proof of Theorem 5

From (139) in Appendix 7, if the step size \(\alpha \le \frac{\varepsilon }{M {\varPsi }^2}\) from (63), one has

$$\begin{aligned} \phi (x_k - \alpha H_k g_k) \le \phi (x_k) - \frac{\alpha \varepsilon }{2} \left\| \nabla \phi (x_k)\right\| _2^2 + \frac{\alpha \varepsilon }{2} \left\| e(x_k)\right\| _2^2 \end{aligned}$$
(143)

which combines with Assumption 1 to give

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} \big ( \left\| \nabla \phi (x_k)\right\| _2^2 - \left\| e(x_k)\right\| _2^2 \big ) + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(144)

The relaxed Armijo condition (38) expands to

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha g_k^T H_k g_k + 2 \epsilon _{A} \end{aligned}$$
(145)

and so the strongest possible condition (i.e. the condition requiring the greatest decrease in f) can be written as

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} \text { . } \end{aligned}$$
(146)

Comparing (144) and (146) reveals that for the bound given by (144) to also imply the bound given by (146), it must be true that

$$\begin{aligned} - \frac{\alpha \varepsilon }{2} \big ( \left\| \nabla \phi (x_k)\right\| _2^2 - \left\| e(x_k)\right\| _2^2 \big ) + 2 {\bar{\epsilon }}_{f} \le - c_1 \alpha {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} \end{aligned}$$
(147)

which rearranges to

$$\begin{aligned} c_1 {\varPsi } \left\| g_k\right\| _2^2 + \frac{\varepsilon }{2} \left\| e(x_k)\right\| _2^2 \le \frac{\varepsilon }{2} \left\| \nabla \phi (x_k)\right\| _2^2 + \frac{2}{\alpha } ( \epsilon _{A} - {\bar{\epsilon }}_{f} ) \text { . } \end{aligned}$$
(148)

As \(\epsilon _{A} - {\bar{\epsilon }}_{f} > 0\), it is clear that the right side of (148) can be made arbitrarily large by sending \(\alpha \rightarrow 0\). Hence, the relaxed Armijo condition (38) will be satisfied for sufficiently small \(\alpha\) and the backtracking line search will always find an \(\alpha _k\) small enough to satisfy (38).

By Lemma 2, outside of \({\mathcal {N}}_1(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), one has \(\delta _k < \frac{\psi }{(1+A){\varPsi }}\). Applying the triangle and reverse triangle inequalities to (32) gives

$$ \begin{aligned} \left\| \nabla \phi (x_k)\right\| _2 - \left\| e(x_k)\right\| _2 \le \left\| \nabla \phi (x_k) + e(x_k)\right\| _2 \le \left\| \nabla \phi (x_k)\right\| _2 + \left\| e(x_k)\right\| _2 \end{aligned}$$
(149)

which can be written using the gradient noise to signal ratio \(\delta _k\) as

$$\begin{aligned} (1 - \delta _k) \left\| \nabla \phi (x_k)\right\| _2 \le \left\| g_k\right\| _2 \le (1 + \delta _k) \left\| \nabla \phi (x_k)\right\| _2 \text { . } \end{aligned}$$
(150)

Combining the definition of \(\delta _k\) (see (58) in Lemma 2), (150), and \(\delta _k < \frac{\psi }{(1+A) {\varPsi }}\) with (144) gives

$$\begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} (1 - \delta _k^2) \left\| \nabla \phi (x_k)\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(151)
$$ \begin{aligned} f(x_k) - \frac{\alpha \varepsilon }{2} (1 - \delta _k^2) \left\| \nabla \phi (x_k)\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k^2)}{(1 + \delta _k)^2} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(152)
$$ \begin{aligned} f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k^2)}{(1 + \delta _k)^2} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} = f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k)}{(1 + \delta _k)} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(153)
$$ \begin{aligned} f(x_k - \alpha H_k g_k) \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{(1 - \delta _k)}{(1 + \delta _k)} \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \le f(x_k) - \frac{\alpha \varepsilon }{2} \frac{ \bigg ( 1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )}{ \bigg ( 1 + \frac{\psi }{(1+A) {\varPsi }} \bigg ) } \left\| g_k\right\| _2^2 + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(154)

Now, as \({\bar{\epsilon }}_{f} < \epsilon _{A}\), the bound (154) above implies the bound (146) for any \(\alpha \le \frac{\varepsilon }{{\varPsi }^2 M}\) if \(c_1 \le \frac{\varepsilon }{2 {\varPsi }} \frac{ \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big ) }{ \big ( 1 + \frac{\psi }{(1+A) {\varPsi }} \big ) }\). Since \(\alpha _k\) is chosen using a backtracking line search with backtracking factor \(\tau < 1\), it is true that \(\frac{\tau \varepsilon }{{\varPsi }^2 M} < \alpha _k \le \frac{\varepsilon }{{\varPsi }^2 M}\). Thus, combining the bound (146) with Assumption 1 and (150) shows that

$$\begin{aligned} \phi (x_k - \alpha _k H_k g_k) \le \phi (x_k) - c_1 \alpha _k {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(155)
$$ \begin{aligned} \phi (x_k) - c_1 \alpha _k {\varPsi } \left\| g_k\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} (1 - \delta _k)^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(156)
$$ \begin{aligned} \phi (x_k - \alpha _k H_k g_k) \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} \bigg (1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(157)

The expression (157) measures the reduction in the value of \(\phi\) for iterates where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). Proceeding, we take the following bound

$$\begin{aligned} \phi (x_{k+1}) \le \phi (x_k) - \frac{c_1 \tau \varepsilon }{{\varPsi } M} \bigg (1 - \frac{\psi }{(1+A) {\varPsi }} \bigg )^2 \left\| \nabla \phi (x_k)\right\| _2^2 + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \end{aligned}$$
(158)

and subtract \(\phi ^{\star }\) from both sides as well as apply the inequality (140) to get

$$ \begin{aligned} \phi (x_{k+1}) - \phi ^{\star } \le \bigg ( 1 - \frac{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2}{{\varPsi } M} \bigg ) ( \phi (x_k) - \phi ^{\star } ) + 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(159)

For ease of notation, define the following quantities

$$\begin{aligned} \rho :=\bigg ( 1 - \frac{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2}{{\varPsi } M} \bigg ) \text { , } \qquad \eta :=2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \text { . } \end{aligned}$$
(160)

Thus, for all k where \(x_k \notin {\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\), we have shown that the following bound holds

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } \le \rho ( \phi (x_k) - \phi ^{\star } ) + \eta \text { . } \end{aligned}$$
(161)

Subtracting \(\frac{\eta }{(1 - \rho )}\) from both sides shows that

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } - \frac{\eta }{1 - \rho }&\le \rho ( \phi (x_k) - \phi ^{\star } ) + \eta - \frac{\eta }{1 - \rho } \\&= \rho ( \phi (x_k) - \phi ^{\star } ) - \frac{\rho \eta }{1 - \rho } \\&= \rho \bigg ( \phi (x_k) - \phi ^{\star } - \frac{\eta }{1 - \rho } \bigg ) \end{aligned}$$

and thus one has

$$\begin{aligned} \phi (x_{k+1}) - \phi ^{\star } - {\bar{\eta }} \le \rho ( \phi (x_{k}) - \phi ^{\star } - {\bar{\eta }} ) \end{aligned}$$
(162)

where \({\bar{\eta }} :=\frac{\eta }{(1 - \rho )}\). Using the definitions in (160) shows that

$$\begin{aligned} {\bar{\eta }} = \frac{{\varPsi } M}{2 m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2} \bigg ( 2 \epsilon _{A} + 2 {\bar{\epsilon }}_{f} \bigg ) \end{aligned}$$
(163)
$$\begin{aligned} {\bar{\eta }} = \frac{{\varPsi } M}{m c_1 \tau \varepsilon \big ( 1 - \frac{\psi }{(1+A) {\varPsi }} \big )^2} ( \epsilon _{A} + {\bar{\epsilon }}_{f} ) \end{aligned}$$
(164)

which establishes (68) and (69). Similar to Appendix 7, we obtain the R-linear result (71) by recursively applying the bound in (68), stopping once an iterate enters \({\mathcal {N}}_{1}(\psi ,{\varPsi }, A {\varPsi } {\bar{\epsilon }}_g)\). This concludes the proof.
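For readers who want to connect the analysis to an implementation, the following sketch shows a backtracking line search built around the relaxed Armijo condition (38) as written in (145), i.e. \(f(x_k - \alpha H_k g_k) \le f(x_k) - c_1 \alpha g_k^T H_k g_k + 2 \epsilon _A\). The function name, default constants, and the noisy quadratic in the usage example are illustrative assumptions, not the paper's code.

```python
import numpy as np

def backtracking_relaxed_armijo(f, x, g, H, c1=1e-4, tau=0.5, eps_A=1e-3,
                                alpha0=1.0, max_backtracks=50):
    """Backtrack until the relaxed Armijo condition (38)/(145) holds:
    f(x - alpha*H*g) <= f(x) - c1*alpha*g'Hg + 2*eps_A, with eps_A > bar(eps)_f assumed."""
    p = H @ g                    # search direction built from the noisy gradient g
    gHg = g @ p
    fx = f(x)                    # one stored (noisy) function measurement at x_k
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x - alpha * p) <= fx - c1 * alpha * gHg + 2.0 * eps_A:
            break
        alpha *= tau             # backtracking factor tau < 1
    return alpha

# Illustrative usage on a quadratic with bounded function and gradient noise.
rng = np.random.default_rng(7)
Q = np.diag([1.0, 5.0, 10.0])
f_noisy = lambda x: 0.5 * x @ Q @ x + 1e-4 * rng.uniform(-1.0, 1.0)
x = np.ones(3)
g = Q @ x + 1e-3 * rng.standard_normal(3)      # noisy gradient measurement
alpha_k = backtracking_relaxed_armijo(f_noisy, x, g, np.eye(3))
print("accepted step size:", alpha_k)
```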

Appendix 9: Extended numerical experiments

Table 5 below shows the performance of gradient descent for the same problem (ROSENBR) and noise combinations as in Table 1.

Table 5 Performance of gradient descent on the Rosenbrock function (i.e. ROSENBR) corrupted by noise

Tables 6, 7, and 8 compare the performance of SP-BFGS, BFGS, and gradient descent on the 32 CUTEst test problems with only gradient noise present (i.e. \({\bar{\epsilon }}_f = 0\)). Gradient noise was generated using \({\bar{\epsilon }}_g = 10^{-4} \left\| \nabla \phi (x^0)\right\| _2\), where the starting point \(x^0\) varies by CUTEst problem, to ensure that noise does not initially dominate gradient evaluations; a sketch of one way such bounded gradient noise can be generated follows the table listing below. By examining the mean and median columns in Tables 6, 7, and 8, one sees that SP-BFGS outperforms both BFGS and gradient descent on \(\frac{18}{32} \approx 56 \%\) of the CUTEst problems with only gradient noise present, and performs at least as well as the best performing alternative on \(\frac{28}{32} \approx 88 \%\) of these problems. Equivalently, SP-BFGS was only outperformed by BFGS or gradient descent on \(\frac{4}{32} \approx 12 \%\) of these problems.

Table 6 Performance of SP-BFGS on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
Table 7 Performance of BFGS on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
Table 8 Performance of gradient descent on 32 selected CUTEst test problems with noise added to gradient evaluations only (i.e. \({\bar{\epsilon }}_f = 0\))
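The gradient noise scaling described above can be reproduced in a few lines. The sketch below shows one plausible way of drawing gradient noise bounded by \({\bar{\epsilon }}_g = 10^{-4} \left\| \nabla \phi (x^0)\right\| _2\), using the Rosenbrock function (ROSENBR) and its usual starting point as an example; the exact sampling distribution used in the paper's experiments is not specified here, so the noise model is an assumption.

```python
import numpy as np

def noisy_gradient(grad_phi, x, eps_g, rng):
    """Return grad_phi(x) plus a noise vector with 2-norm at most eps_g
    (one plausible bounded-noise model; the paper's exact scheme may differ)."""
    e = rng.standard_normal(x.shape)
    e *= eps_g * rng.uniform() / np.linalg.norm(e)
    return grad_phi(x) + e

# Rosenbrock gradient and its usual starting point x^0 = (-1.2, 1).
grad_rosenbrock = lambda x: np.array([
    -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
    200.0 * (x[1] - x[0] ** 2),
])
x0 = np.array([-1.2, 1.0])

rng = np.random.default_rng(8)
eps_g = 1e-4 * np.linalg.norm(grad_rosenbrock(x0))   # noise level scaled off the starting gradient
g0 = noisy_gradient(grad_rosenbrock, x0, eps_g, rng)
print("true gradient:", grad_rosenbrock(x0), " noisy gradient:", g0)
```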

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Irwin, B., Haber, E. Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition. Comput Optim Appl 84, 651–702 (2023). https://doi.org/10.1007/s10589-022-00448-x

