1 Introduction

Deep learning (DL) as a leading technique of machine learning (ML) has attracted much attention and become one of the most popular directions of research. DL approaches have been applied to solve many large-scale problems in different fields by training deep neural networks (DNNs) over large available datasets. Let \({\mathcal {N}}=\{1, 2, \ldots , N\}\) be the index set of the training dataset \(\{(x_i, y_i)\}_{i=1}^{N}\) with \(N=|{\mathcal {N}}|\) sample pairs including input \(x_i\in {\mathbb {R}}^d\) and target \(y_i\in {\mathbb {R}}^C\). DL problems are often formulated as unconstrained optimization problems with an empirical risk function, in which a parametric function \({\hat{h}}( x_i;\cdot ):{\mathbb {R}}^d \longrightarrow {\mathbb {R}}^C\) is found such that the prediction errors are minimized. More precisely, we obtain the following problem

$$\begin{aligned} \min _{w \in {\mathbb {R}}^n} f(w) \triangleq \frac{1}{N} \sum _{i=1}^{N}{f_i(w)}, \end{aligned}$$
(1)

where \(w \in {\mathbb {R}}^n\) is the vector of trainable parameters and \(f_i(w) \triangleq L(y_i, {\hat{h}}(x_i; w))\) with a relevant loss function \(L(\cdot )\) measuring the prediction error between the target \(y_i\) and the network’s output \({\hat{h}}(x_i; w)\). The DL problem (1) is large-scale, highly nonlinear, and often non-convex, and thus it is not straightforward to apply traditional (deterministic) optimization algorithms like steepest descent or Newton-type methods. Recently, much effort has been devoted to the development of DL optimization algorithms. Popular DL optimization methods can be divided into two general categories: first-order methods, which use gradient information, e.g. steepest gradient descent, and second- (or higher-) order methods, which also use curvature information, e.g. Newton methods [1]. On the other hand, since the full (training) sample size N in (1) is usually excessively large for a deterministic approach, these optimizers are further adapted to use subsampling strategies that aim to reduce computational costs. Subsampling strategies employ sample average approximations of the function and its gradient as follows

$$\begin{aligned} f_{{\mathcal {N}}_k}(w) = \frac{1}{N_k} \sum _{i \in {\mathcal {N}}_k}{f_i(w)}, \quad \nabla f_{{\mathcal {N}}_k}(w) = \frac{1}{N_k} \sum _{i \in {\mathcal {N}}_k}{\nabla f_i(w)}, \end{aligned}$$
(2)

where \({\mathcal {N}}_k \subseteq {\mathcal {N}}\) represents a subset of the full (training) sample set at iteration k and \(N_k\) is the subsample size, i.e., \(N_k=|{\mathcal {N}}_k|\).
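To make the subsampled quantities in (2) concrete, the following minimal sketch (plain NumPy, with an illustrative least-squares loss rather than a network loss) forms \(f_{{\mathcal {N}}_k}\) and \(\nabla f_{{\mathcal {N}}_k}\) for a randomly drawn index set; the loss and the sampling routine are assumptions made only for this example.

```python
import numpy as np

def subsampled_loss_and_grad(w, X, y, idx):
    """Sample-average loss and gradient over the index set idx, as in (2),
    illustrated with the least-squares loss f_i(w) = 0.5*(x_i^T w - y_i)^2."""
    Xk, yk = X[idx], y[idx]               # samples indexed by N_k
    residual = Xk @ w - yk                # shape (N_k,)
    f_Nk = 0.5 * np.mean(residual ** 2)   # f_{N_k}(w)
    g_Nk = Xk.T @ residual / len(idx)     # gradient of the sample average
    return f_Nk, g_Nk

# toy usage: N = 1000 samples in dimension d = 5, a mini-batch of size N_k = 64
rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
idx = rng.choice(1000, size=64, replace=False)
f_val, g_val = subsampled_loss_and_grad(np.zeros(5), X, y, idx)
```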

In this work, we propose a second-order trust-region (TR) algorithm [2] adapted to the stochastic framework where the step and the candidate point for the next iterate are obtained using subsampled function values and subsampled gradients (2). The quadratic TR models are constructed by using Hessian approximations, without imposing a positive definiteness assumption, as the true Hessian in DL problems may not be positive definite due to their non-convex nature. Moreover, having in mind that we work with noisy approximations (2), imposing a strict decrease might be unnecessary. Thus, we employ a non-monotone trust-region (NTR) approach; see e.g. [3] or references therein. Unlike the classical TR, our decision on acceptance of the trial point is not based only on the agreement between the model and the approximate objective function decrease, but also on an independent subsampled function. This "control" function, formed through additional sampling similar to the one proposed in [4] for the line-search framework, also plays a role in controlling the sample average approximation error by adaptively choosing the sample size. Depending on the estimated progress of our algorithm, this can yield sample size scenarios ranging from mini-batch to full sample functions. We provide convergence analysis for all possible scenarios and show that the proposed method achieves almost sure convergence under standard assumptions for the TR framework such as Lipschitz-continuous gradients and bounded Hessian approximations.

Literature review. Stochastic first-order methods such as the stochastic gradient descent (SGD) method [5, 6], and its variance-reduced [7,8,9,10] and adaptive [11, 12] variants have been widely used in many ML and DL applications, likely because of their proven efficiency in practice. However, due to the use of only gradient information, these methods come with several issues such as relatively slow convergence, high sensitivity to the hyper-parameters, stagnation at high training loss [13], difficulty in escaping saddle points, and ill-conditioning [14]. To cope with some of these issues, there are attractive alternatives, as well as their stochastic variants, aimed at incorporating second-order information, e.g. Hessian-Free methods [15,16,17,18,19,20], which find an estimate of the Newton direction by (subsampled) Hessian-vector products without directly calculating the (subsampled) Hessian matrix, and limited memory Quasi-Newton methods, which construct approximations of the true Hessian using only gradient information. Furthermore, algorithms based on Quasi-Newton methods have been the subject of many research efforts both in convex (see e.g. [21, 22] and references therein) and non-convex settings (see e.g. [23,24,25] and references therein), or [26, 27] where the advantage of modern computational architectures and parallelization for evaluating the full objective function and its derivatives is employed. In almost all these articles, the Quasi-Newton Hessian approximation used is BFGS or limited memory BFGS (L-BFGS), which is positive definite by construction and is usually considered in a line-search framework, with exceptions such as [28, 29]. A disadvantage of BFGS may occur when it tries to approximate the true Hessian of a non-convex objective function in DL optimization. We refer to [30] as one of the earliest works in which the limited memory Quasi-Newton update SR1 (L-SR1), allowing for indefinite Hessian approximations, was used in a trust-region framework with periodic progressive overlap batching.

The potential usefulness of non-monotonicity may be traced back to [31], where a non-monotone line-search technique was proposed for the Newton method to relax some standard line-search conditions and to avoid slow convergence of a deterministic method. Similarly, the idea of exploiting non-monotonicity in trust-region methods can be traced back to [4, 38, 39]. The rule that we apply in our method mainly corresponds to that presented in [4], where it is used in a line-search framework and plays a role in deciding whether or not to switch from the line search to a predefined step-size sequence. We adapted this strategy to the TR framework and used it to control the sample size in our method. It is worth pointing out that the additional sampling can be arbitrarily small, i.e. the sample size can even be 1, and hence it does not increase the computational cost significantly. Adaptive sample size strategies can also be found in other works; for instance, a type of adaptive subsampling strategy was applied in the STORM algorithm [40, 41], which is also a second-order method in a standard TR framework. A different strategy using inexact restoration was proposed in [42] for a first-order standard TR approach. Variable-size subsampling is not restricted to TR frameworks; see e.g. [25], where progressive subsampling was considered for a line-search method.

Notation. Throughout this paper, vectors and matrices are respectively written in lowercase and uppercase letters unless otherwise specified in the context. The symbol \(\triangleq \) is used to define a new variable. \({\mathbb {N}}\) and \({\mathbb {R}}^n\) denote the set of natural numbers and the real coordinate space of dimension n, respectively. The set of positive real numbers and non-negative integers are denoted by \({\mathbb {R}}_+\) and \({\mathbb {N}}_0\), respectively. Subscripts indicate the elements of a sequence. For a random variable X, the mathematical expectation of X and the conditional expectation of X given the \(\sigma \)-algebra \({\mathcal {F}}\) are respectively denoted by \({\mathbb {E}}(X)\) and \({\mathbb {E}}(X|{\mathcal {F}})\). The Euclidean vector norm and the corresponding matrix norm are denoted by \(\Vert .\Vert \), while the cardinality of a set or the absolute value of a number is indicated by |.|. Finally, "a.s" abbreviates the expression "almost surely".

Outline of the paper. In Sect. 2, we describe the algorithm and all the necessary ingredients. In Sect. 3, we state the assumptions and provide an almost sure convergence analysis of the proposed method. Section 4 is devoted to the numerical evaluation of a specific version of the proposed algorithm that makes use of an L-SR1 update and a simple sampling rule; we make a comparison with the state-of-the-art method STORM [40, 41] to show the effectiveness of the proposed method for training DNNs in classification and regression tasks. A comparison with the popular first-order method ADAM [12] is also presented. Some conclusions are drawn in Sect. 5.

2 The algorithm

Within this section, we describe the proposed method called ASNTR (Adaptive Subsample Non-monotone Trust-Region). At iteration k, given the current iterate \( w_k\), we form a quadratic model based on the subsampled function (2), with \(g_k \triangleq \nabla f_{{\mathcal {N}}_k}(w_k)\) and an arbitrary Hessian approximation \(B_k\) satisfying

$$\begin{aligned} \Vert B_k\Vert \le L, \end{aligned}$$
(3)

for some \(L>0\) and solve the common TR subproblem to obtain the relevant direction

$$\begin{aligned} p_k \triangleq \arg \min _{p \in {\mathbb {R}}^n} Q_k(p) \triangleq \frac{1}{2}p^T B_k p + g_k^T p \quad \text {s.t.} \quad \left\| p \right\| _2 \le \delta _k, \end{aligned}$$
(4)

for some TR radius \(\delta _k>0\). We assume that at least some fraction of the Cauchy decrease is obtained, i.e., the direction satisfies

$$\begin{aligned} Q_k(p_k) \le -\frac{c}{2} \Vert g_k\Vert \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\right\} , \end{aligned}$$
(5)

for some \(c \in (0,1]\). This is a standard assumption in TR frameworks and it is not restrictive even in the stochastic setting. In the classical deterministic TR approach, acceptance of the trial point \( w_t = w_k + p_k \) is based on the agreement between the decrease in the function and that of its quadratic model. However, since we are dealing with noisy approximations (2), we modify the acceptance strategy as follows. Motivated by the study in [35], we propose a non-monotone TR (NTR) framework instead of the standard one because we do not want to impose a strict decrease on the approximate function. Therefore, we define the relevant ratio as follows

$$\begin{aligned} \rho _{{\mathcal {N}}_k}\triangleq \frac{f_{{\mathcal {N}}_k}(w_t)-r_{{\mathcal {N}}_k}}{Q_k(p_k)}, \end{aligned}$$
(6)

where

$$\begin{aligned} r_{{\mathcal {N}}_k}\triangleq f_{{\mathcal {N}}_k}(w_k)+ t_k \delta _k,\quad t_k>0, \end{aligned}$$
(7)

and

$$\begin{aligned} \sum _{k=0}^{\infty } t_k \le t < \infty . \end{aligned}$$
(8)

The sequence \( \{t_k\} \) is a summable sequence of positive numbers and can be defined in different ways to control the level of non-monotonicity at each iteration. Different choices of \(\{t_k\}\) lead to different upper bounds of its sum, which we denote by t. We allow \({\mathcal {N}}_k\) to be chosen arbitrarily in ASNTR. However, even if we impose uniform sampling with replacement from \({\mathcal {N}}\) so that \(f_{{\mathcal {N}}_k}(w_k)\) is an unbiased estimator of the objective function at \(w_k\), the dependence between \({\mathcal {N}}_k\) and \(w_t\) may produce a biased estimator \(f_{{\mathcal {N}}_k}(w_t)\) of \(f(w_t)\). This is because \({\mathcal {N}}_k\) directly affects the TR model \(Q_k(p)\) through the approximate gradient \(g_k= \nabla f_{{\mathcal {N}}_k}(w_k)\), so \(p_k\), and hence \(w_t=w_k+p_k\), also depends on the choice of \({\mathcal {N}}_k\). To overcome this difficulty, we apply an additional sampling strategy [4]. To this end, at every iteration at which \(N_k<N\), we choose another independent subsample represented by the index set \({\mathcal {D}}_k \subset {\mathcal {N}}\) of size \(D_k=|{\mathcal {D}}_k|<N\) and calculate \(f_{{\mathcal {D}}_k}(w_k), f_{{\mathcal {D}}_k}(w_t)\) and \({\bar{g}}_k \triangleq \nabla f_{{\mathcal {D}}_k}(w_k)\) (see lines 5–6 of the ASNTR algorithm). There are no theoretical requirements on the size of \({\mathcal {D}}_k\), and hence the additional sampling can be done cheaply, i.e. with a modest number of additional samples. In fact, in our experiments, we set \(D_k=1\) for all k. Furthermore, in the spirit of TR, we define a linear model \(L_k(v)\triangleq v^T {\bar{g}}_k\) and consider another agreement measure defined as follows

$$\begin{aligned} \rho _{{\mathcal {D}}_k} \triangleq \frac{f_{{\mathcal {D}}_k}(w_t)-r_{{\mathcal {D}}_k}}{L_k(-{\bar{g}}_k)}, \end{aligned}$$
(9)

where

$$\begin{aligned} r_{{\mathcal {D}}_k}\triangleq f_{{\mathcal {D}}_k}(w_k)+\delta _k {\tilde{t}}_k,\quad {\tilde{t}}_k>0, \end{aligned}$$
(10)

and

$$\begin{aligned} \sum _{k=0}^{\infty } {\tilde{t}}_k \le {\tilde{t}} < \infty . \end{aligned}$$
(11)

We assume that \( \{{\tilde{t}}_k\} \) is a summable sequence of positive numbers and that \({\tilde{t}}\) is an upper bound of the sum of \( \{{\tilde{t}}_k\} \). Therefore, \({\tilde{t}}\) is controllable through the choice of \( \{{\tilde{t}}_k\} \). Notice that choosing a greater \({\tilde{t}}_k\) yields more chances for the trial point \(w_t\) to be accepted. The denominator in (9) is the linear model computed along the negative gradient \( {\bar{g}}_k \) and thus it is negative. Therefore, the condition \(\rho _{{\mathcal {D}}_k}\ge \nu \) corresponds to an Armijo-like condition for the function \( f_{{\mathcal {D}}_k}, \) similar to the one in [4], i.e., \(f_{{\mathcal {D}}_k}(w_{t}) \le f_{{\mathcal {D}}_k}(w_{k})- \nu \Vert \nabla f_{{\mathcal {D}}_k}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k\). If \(N_k<N\), the trial point is accepted only if both \(\rho _{{\mathcal {N}}_k}\) and \(\rho _{{\mathcal {D}}_k}\) are bounded away from zero; otherwise, if the full sample is used, the decision is made solely by \(\rho _{{\mathcal {N}}_k}\), as in the deterministic NTR (see lines 23–35 of the ASNTR algorithm). The rationale behind this is the following: we double-check that the trial point obtained employing \(f_{{\mathcal {N}}_k}\) is acceptable also with respect to another approximation of the objective function, \(f_{{\mathcal {D}}_k}\). Notice that \({\mathcal {D}}_k\) is chosen with replacement, uniformly and randomly from \({\mathcal {N}}\), independently of the choice of \({\mathcal {N}}_k\), and after the trial point \(w_t\) is already determined. Therefore, conditionally on the \(\sigma \)-algebra generated by all the previous choices of \({\mathcal {N}}_j, j=1,...,k\) and \({\mathcal {D}}_j, j=1,...,k-1\), the approximation \(f_{{\mathcal {D}}_k}(w_t)\) is an unbiased estimator of \(f(w_t)\).
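To fix ideas, the following plain-NumPy sketch implements two pieces discussed above: a Cauchy step that guarantees the model decrease (5) (with \(c=1\)), and the double acceptance test based on the ratios (6) and (9). The function handles, the dense \(B_k\), and the parameter names (nu, eta) are illustrative assumptions; this is a sketch, not the authors' MATLAB implementation.

```python
import numpy as np

def cauchy_step(g, B, delta):
    """Cauchy point of Q(p) = 0.5 p^T B p + g^T p over ||p|| <= delta.
    Any subproblem solver only needs to match its decrease, i.e. condition (5)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = g @ (B @ g)
    tau = 1.0 if gBg <= 0.0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def accept_trial(f_N, f_D, gbar, Q_pk, w_k, w_t, delta, t_k, ttilde_k,
                 nu, eta, full_sample):
    """Acceptance test of ASNTR as described in the text.

    f_N, f_D: callables returning the subsampled losses f_{N_k} and f_{D_k};
    gbar: gradient of f_{D_k} at w_k; Q_pk: model value Q_k(p_k) < 0."""
    rho_N = (f_N(w_t) - (f_N(w_k) + t_k * delta)) / Q_pk                  # (6)-(7)
    if full_sample:                                                       # N_k = N: rho_N decides alone
        return rho_N >= eta, rho_N, None
    rho_D = (f_D(w_t) - (f_D(w_k) + ttilde_k * delta)) / (-(gbar @ gbar)) # (9)-(10), L_k(-g) = -||g||^2
    return (rho_N >= eta) and (rho_D >= nu), rho_N, rho_D
```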

Another role of \(\rho _{{\mathcal {D}}_k}\) is to control the sample size. If the obtained trial point \(w_t\) yields an uncontrolled increase in \(f_{{\mathcal {D}}_k}\), in the sense that \(\rho _{{\mathcal {D}}_k}< \nu \), i.e., \(f_{{\mathcal {D}}_k}(w_{t}) > f_{{\mathcal {D}}_k}(w_{k})- \nu \Vert \nabla f_{{\mathcal {D}}_k}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k\), we conclude that we need a better approximation of the objective function and we increase the sample size \(N_k\) by choosing a new sample set \({\mathcal {N}}_k\) for the next iteration. Roughly speaking, an uncontrolled increase in \(f_{{\mathcal {D}}_k}\) is possible if the approximate function \( f_{{\mathcal {D}}_k} \) is very different from \(f_{{\mathcal {N}}_k}\), given that the search direction is computed for \(f_{{\mathcal {N}}_k}\). The sample size is also increased if we are too close to a stationary point of the approximate function \(f_{{\mathcal {N}}_k}\). This is stated in line 7 of ASNTR, where h represents an SAA error estimate given by

$$\begin{aligned} h(N_k)\triangleq \frac{N-N_k}{N}. \end{aligned}$$
Algorithm 1 ASNTR

The algorithm can also keep the same sample size (see lines 14 and 16 of the ASNTR algorithm). Keeping the same sample \({\mathcal {N}}_k\) in line 14 corresponds to the case where the trial point is acceptable with respect to \(f_{{\mathcal {D}}_k}\), but we do not have a decrease in \(f_{{\mathcal {N}}_k}\). In that case, the (non-monotone) TR radius \(\delta _k\) is decreased (see line 37 of ASNTR). Otherwise, the algorithm is allowed in line 16 to choose a new sample of the current size and exploit some new data points. The strategy for updating the sample size is described in lines 7–19 of the ASNTR algorithm and is sketched below.
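The following minimal sketch summarizes the sample-size decisions described above. Since the algorithm listing itself is not reproduced here, the branch order, the growth factor `growth`, and the parameter names are assumptions reconstructed from the prose; the SAA error estimate \(h(N_k)=(N-N_k)/N\) is the one defined above.

```python
import math

def next_sample_size(N, N_k, grad_norm, rho_N, rho_D, nu, eta, eps, growth=1.01):
    """Sketch of the sample-size update of ASNTR (cf. lines 7-19 of the algorithm).

    The size never decreases and, once N_k = N, the full sample is kept."""
    if N_k == N:
        return N, "full sample kept"
    h = (N - N_k) / N                                # SAA error estimate
    if grad_norm < eps * h:                          # too close to a stationary point of f_{N_k}
        return min(N, math.ceil(growth * N_k)), "increase the sample size"
    if rho_D < nu:                                   # uncontrolled increase of the control function
        return min(N, math.ceil(growth * N_k)), "increase the sample size"
    if rho_N < eta:                                  # acceptable for f_{D_k} but no decrease in f_{N_k}
        return N_k, "keep the same sample N_k"
    return N_k, "same size, draw a new sample"
```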

Notice that the sample size cannot be decreased, and if the full sample is reached it is kept until the end of the procedure. Moreover, it should be noted that ASNTR provides complete freedom in terms of the increment in the sample size as well as the choice of samples \({\mathcal {N}}_k\). This leaves enough space for different sampling strategies within the algorithm. As we already mentioned, mostly depending on the problem, ASNTR can result in a mini-batch strategy, but it can also yield an increasing sample size procedure.

The TR radius is updated within lines 36–42 of ASNTR. We follow a common update strategy for TR approaches. It is completely based on \(f_{{\mathcal {N}}_k}\) since it is related to the error of the quadratic model, and not to the SAA error, which we control by additional sampling. Thus, if \(\rho _{{\mathcal {N}}_k}\) is small, we decrease the trust-region size (see lines 36–37 of ASNTR); otherwise, the trust region is either enlarged or kept the same (see lines 38–42 of ASNTR), as sketched below. Notice that the additional condition that relates the norm of the search direction \( p_k \) and the current trust-region size \( \delta _k \) does not play any role in the theoretical analysis but it is important in the practical implementation. We need some predetermined hyper-parameters for ASNTR, which are established in the algorithm’s initial line according to those outlined in relevant references (e.g. [1, 30]), and to meet the assumptions underlying the convergence analysis.
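A sketch of a radius update of this kind is given next; the shrink and growth factors and the boundary test involving \(\Vert p_k\Vert \) are illustrative defaults and not values taken from the paper.

```python
def update_radius(rho_N, p_norm, delta, delta_max,
                  eta1=0.1, eta2=0.75, shrink=0.5, grow=2.0):
    """TR radius update driven by rho_{N_k} only (cf. lines 36-42 of ASNTR)."""
    if rho_N < eta1:                                # poor model agreement: shrink the region
        return shrink * delta
    if rho_N > eta2 and p_norm >= 0.8 * delta:      # good agreement and step near the boundary
        return min(grow * delta, delta_max)
    return delta                                    # otherwise keep the radius unchanged
```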

3 Convergence analysis

We make the following standard assumption for the TR framework needed to prove the a.s. convergence result of ASNTR.

Assumption 1

The functions \(f_i, i=1,...,N\) are bounded from below and twice continuously-differentiable with L-Lipschitz-continuous gradients.

First, we prove an important lemma that will help us prove the main result, the a.s. convergence of ASNTR. There are two possible scenarios depending on the size of the sample sequence: (1) mini-batch scenario—where the subsampling is employed in each iteration; (2) deterministic scenario—where the full sample is reached eventually. We start the analysis by considering the first, mini-batch scenario. In Lemma 1, we show that in this case there holds \(\rho _{{\mathcal {D}}_k}\ge \nu \) for any realization of \({\mathcal {D}}_k\) and all k sufficiently large.

Let us denote by \({\mathcal {D}}_k^{+}\) the subset of all possible outcomes of \({\mathcal {D}}_k\) at iteration k that satisfy \(\rho _{{\mathcal {D}}_k} \ge \nu \), i.e.,

$$\begin{aligned} {\mathcal {D}}_k^{+}= \{{\mathcal {D}}_k \subset {\mathcal {N}}\; | \; f_{{\mathcal {D}}_k}(w_{t}) \le f_{{\mathcal {D}}_k}(w_{k})- \nu \Vert \nabla f_{{\mathcal {D}}_k}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k \}. \end{aligned}$$
(12)

Notice that if \({\mathcal {D}}_k \in {\mathcal {D}}_k^{+}\) and \(\rho _{{\mathcal {N}}_k} \ge \eta \) then \(w_{k+1}=w_t\). On the other hand, if \({\mathcal {D}}_k \in {\mathcal {D}}_k^{+}\) and \(\rho _{{\mathcal {N}}_k} < \eta \) then \(w_{k+1}=w_k\). Finally, if \({\mathcal {D}}_k \in {\mathcal {D}}_k^{-}\), where

$$\begin{aligned} {\mathcal {D}}_k^{-}= \{{\mathcal {D}}_k \subset {\mathcal {N}}\; | \; f_{{\mathcal {D}}_k}(w_{t}) > f_{{\mathcal {D}}_k}(w_{k})- \nu \Vert \nabla f_{{\mathcal {D}}_k}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k \}, \end{aligned}$$
(13)

we have again \(w_{k+1}=w_k\). Notice that \({\mathcal {D}}_k^{-} = \emptyset \) if and only if \(\rho _{{\mathcal {D}}_k}\ge \nu \) for all possible realizations of \({\mathcal {D}}_k\).

Lemma 1

Suppose that Assumption 1 holds. If \(N_k<N\) for all \(k \in {\mathbb {N}}\), then a.s. there exists \(k_1 \in {\mathbb {N}}\) such that \(\rho _{{\mathcal {D}}_k}\ge \nu \) for all \(k \ge k_1\) and for all possible realizations \({\mathcal {D}}_k\).

Proof

Assume that \(N_k<N\) for all \(k \in {\mathbb {N}}\). So, there exist some \({\bar{N}}<N\) and \(k_0\in {\mathbb {N}}\) such that \(N_k={\bar{N}}\) for all \(k \ge k_0\). Now, let us use the notation as in (12)–(13) and assume that there exists an infinite subsequence of iterations \(K \subseteq {\mathbb {N}}\) such that \({\mathcal {D}}_k^{-}\ne \emptyset \) for all \(k \in K\). Since \({\mathcal {D}}_k\) is chosen randomly and uniformly with replacement from the finite set \({\mathcal {N}}\) and \(D_k\le N-1\), we know that each \({\mathcal {D}}_k\) has only finitely many possible outcomes. More precisely, we conclude that the number \(S(D_k)\) of possible outcomes of \({\mathcal {D}}_k\) satisfies \(S(D_k)\le {\bar{S}}:=(2N-2)!/((N-1)!)^2\), where the upper bound comes from the combinatorics of unordered sampling with replacement. Therefore, there exists \({\underline{p}} \in (0,1)\) such that \(P({\mathcal {D}}_k \in {\mathcal {D}}_k^{-})\ge {\underline{p}}\), i.e., \(P({\mathcal {D}}_k \in {\mathcal {D}}_k^{+})\le 1-{\underline{p}}=:{\bar{p}}<1\) for all \(k \in K\). So,

$$\begin{aligned} P({\mathcal {D}}_k \in {\mathcal {D}}_k^{+}, k \in K)\le \prod _{k \in K} {\bar{p}}=0. \end{aligned}$$

Therefore, a.s. there exists an iteration \( k \in K \) such that \(\rho _{{\mathcal {D}}_k} < \nu \) and the sample size is increased, which is in contradiction with \(N_k={\bar{N}}\) for all \(k \ge k_0\). This completes the proof. \(\square \)

Next, we prove that the mini-batch scenario implies a non-monotone type of decrease for all k large enough.

Lemma 2

Suppose that Assumption 1 is satisfied and \(N_k<N\) for all \(k \in {\mathbb {N}}\). Then a.s.

$$\begin{aligned} f(w_t) \le f(w_k)- \nu \Vert \nabla f(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k, \end{aligned}$$

holds for all \(k \ge k_1\), where \(k_1\) is as in Lemma 1.

Proof

Lemma 1 implies that \(\rho _{{\mathcal {D}}_k}\ge \nu \), i.e.,

$$\begin{aligned} f_{{\mathcal {D}}_k}(w_{t}) \le f_{{\mathcal {D}}_k}(w_{k})- \nu \Vert \nabla f_{{\mathcal {D}}_k}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k,\end{aligned}$$
(14)

for all \(k \ge k_1\) and for all possible realizations of \({\mathcal {D}}_k\) a.s.. Since the sampling of \({\mathcal {D}}_k\) is with replacement, notice that this further yields

$$\begin{aligned} f_{i}(w_{t}) \le f_{i}(w_{k})- \nu \Vert \nabla f_{i}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k, \end{aligned}$$
(15)

for all \(i \in {\mathcal {N}}\) and all \(k \ge k_1\) a.s.. Indeed, if there exists \(i \in {\mathcal {N}}\) such that (15) is violated, then (14) is violated for at least one possible realization of \({\mathcal {D}}_k\), namely, for \({\mathcal {D}}_k=\{i,i,...,i\}\). Thus, a.s. for all \(k \ge k_1\) there holds

$$\begin{aligned} f(w_t)&= \frac{1}{N} \sum _{i=1}^{N} f_{i}(w_{t}) \le \frac{1}{N} \sum _{i=1}^{N} \left( f_{i}(w_{k})- \nu \Vert \nabla f_{i}(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k\right) \\ &= f(w_k)- \nu \frac{1}{N} \sum _{i=1}^{N} \Vert \nabla f_i(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k \\ &\le f(w_k)- \nu \Vert \nabla f(w_k)\Vert ^2+\delta _k {\tilde{t}}_k, \end{aligned}$$
(16)

where the last inequality comes from the fact that \(\Vert \cdot \Vert ^2\) is convex and therefore

$$\begin{aligned} \Vert \nabla f(w_k)\Vert ^2=\left\| \frac{1}{N} \sum _{i=1}^{N} \nabla f_i(w_{k})\right\| ^2\le \frac{1}{N} \sum _{i=1}^{N} \Vert \nabla f_i(w_{k})\Vert ^2. \end{aligned}$$

\(\square \)

Now, let us prove that starting from some finite but random iteration, the sequence of iterates generated by ASNTR belongs to a level set of the original objective function f a.s. This level set is also random since it depends on the sample path.

Lemma 3

Suppose that Assumption 1 holds. Then a.s.

$$\begin{aligned} f(w_{{\tilde{k}}+k})\le f(w_{{\tilde{k}}}) + \delta _{max}\max \{t,{\tilde{t}}\}, \quad k = 0,1,\ldots \, \end{aligned}$$

where \({\tilde{k}}\) is some finite random number, t and \({\tilde{t}}\) correspond to those in (8) and (11) respectively.

Proof

Let us consider the two scenarios, mini-batch and deterministic, separately. In the mini-batch case, when \({N_k<N}\) for each k, Lemma 2 implies that a.s.

$$\begin{aligned} f(w_t) \le f(w_k)- \nu \Vert \nabla f(w_{k})\Vert ^2+\delta _k {\tilde{t}}_k\le f(w_k)+\delta _k {\tilde{t}}_k, \end{aligned}$$

holds for all \(k \ge k_1\), where \(k_1\) is as in Lemma 1. Since \(w_{k+1} = w_t \) or \(w_{k+1}=w_k\), a.s. the inequality

$$\begin{aligned} f(w_{k+1}) \le f(w_k)+\delta _k {\tilde{t}}_k, \end{aligned}$$

holds for all \(k \ge k_1\). So, the summability of \(\{{\tilde{t}}_k\}\) (11) and the fact that \(\delta _k \le \delta _{max}\) for all k together imply that the statement of this lemma holds with \({\tilde{k}}=k_1\).

Assume now that N is achieved at some finite iteration, i.e., there exists a finite iteration \(k_2\) such that \(N_k=N\) for all \(k \ge k_2\). Thus, the trial point for all \(k \ge k_2\) is accepted if and only if \(\rho _{{\mathcal {N}}_k}\ge \eta \). This implies that for all \(k \ge k_2\) we either have \(f(w_{k+1})= f(w_k)\) or

$$\begin{aligned} f(w_{k+1}) \le f(w_k)+\delta _k t_k-\frac{\eta c}{2}\Vert g_k\Vert \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\right\} \le f(w_k)+\delta _{max}t_k, \end{aligned}$$
(17)

where \(\Vert g_k\Vert = \Vert \nabla f(w_k)\Vert \) in this scenario. Thus, using the summability of \(\{t_k\}\) in (8) we obtain the result with \({\tilde{k}}=k_2\). \(\square \)

In order to prove the main convergence result, we assume that the expected value of \(f(w_{{\tilde{k}}})\) is uniformly bounded.

Assumption 2

There exists a constant \(C>0\) such that \({\mathbb {E}}(|f(w_{{\tilde{k}}})| ) \le C\), where \({\tilde{k}}\) is as in Lemma 3 and the expectation is taken over all possible sample paths.

This assumption, together with the result of Lemma 3, implies that a.s. the sequence \(\{f(w_k)\}_{k \ge {\tilde{k}}}\) is uniformly bounded in expectation. Moreover, the Assumption 2 also implies that \({\mathbb {E}}(|f(w_{{\tilde{k}}})|\; | \; A) \le C_1\), where A represents the subset of all possible outcomes (sample paths) such that the full sample is reached eventually and \(C_1\) is some positive constant. To see this, observe that the following holds

$$\begin{aligned} P(A) {\mathbb {E}}(|f(w_{{\tilde{k}}})| | A)\!\le \! P(A) {\mathbb {E}}(|f(w_{{\tilde{k}}})| | A)\!+\!P({\bar{A}}) {\mathbb {E}}(|f(w_{{\tilde{k}}})| | {\bar{A}})\!=\! {\mathbb {E}}(|f(w_{{\tilde{k}}})| ) \le C, \end{aligned}$$

where \({\bar{A}}\) represents all possible sample paths that remain in the mini-batch scenario. Thus, we obtain

$$\begin{aligned} {\mathbb {E}}_A (|f(w_{{\tilde{k}}})| ):={\mathbb {E}}(|f(w_{{\tilde{k}}})|\; | \; A)\le C/P(A)=:C_1. \end{aligned}$$

Similarly, there exists a constant \(C_2>0\) such that

$$\begin{aligned} {\mathbb {E}}_{{\bar{A}}} (|f(w_{{\tilde{k}}})| ):={\mathbb {E}}(|f(w_{{\tilde{k}}})|\; | \; {\bar{A}})\le C/P({\bar{A}})=:C_2. \end{aligned}$$

Notice that Assumption 2 is satisfied under the assumption of bounded iterates (\(\{w_k\}_{k \in {\mathbb {N}}} \subset {\bar{G}}\), where \({\bar{G}}\) is a compact subset of \({\mathbb {R}}^n\)) which is fairly common in a stochastic optimization framework.

Theorem 1

Suppose that Assumption 1 and Assumption 2 hold. Then the sequence \(\{w_k\}\) generated by ASNTR algorithm satisfies

$$\begin{aligned} \liminf _{k \rightarrow \infty } \Vert \nabla f(w_k)\Vert =0 \quad \text{ a.s }. \end{aligned}$$

Proof

Let us consider two different scenarios, namely, \(N_k=N \) for all k large enough, and \(N_k<N\) for all k. Let us start with the first one in which \(N_k=N\) for all \(k \ge {\tilde{k}}\), where \({\tilde{k}}\) is random but finite. In this case, ASNTR reduces to the non-monotone deterministic trust-region algorithm applied on f. By L-Lipschitz continuity of \(\nabla f\), we obtain

$$\begin{aligned} |f(w_k+p_k)- f(w_k)-\nabla ^T f(w_k) p_k| \le \frac{L}{2} \Vert p_k\Vert ^2. \end{aligned}$$
(18)

Now, let us denote \(\rho _k \triangleq \rho _{{\mathcal {N}}_k}\) and assume that \(\Vert \nabla f(w_k)\Vert \ge \varepsilon >0\) for all \(k\ge {\tilde{k}}\). Then,

$$\begin{aligned} |\rho _k-1|&= \frac{|f(w_k+p_k)- f(w_k)-\delta _k t_k-\nabla ^T f(w_k) p_k-0.5 p_k^T B_k p_k|}{|Q_k(p_k)|}\\ &\le \frac{0.5 L\Vert p_k\Vert ^2 + \delta _k t_k+0.5 L \Vert p_k\Vert ^2}{0.5 c \Vert g_k\Vert \min \{\delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\}}\\ &\le \frac{L\Vert p_k\Vert ^2 + \delta _k t_k}{0.5 c \varepsilon \min \{\delta _k, \frac{\varepsilon }{L}\}}, \end{aligned}$$
(19)

where \(\Vert g_k\Vert = \Vert \nabla f(w_k)\Vert \) in this scenario. Let us define \(\tilde{\delta }=\frac{\varepsilon c }{20 L}\). Without loss of generality, given that the sequence \(\{t_k\}\) is summable and hence \( t_k \rightarrow 0\) (see (8)), we can assume that \(t_k \le L \tilde{\delta }\) for all \(k\ge {\tilde{k}}\). Consider now the iterations \(k > {\tilde{k}}\). If \(\delta _k \le \tilde{\delta }\), then \(\delta _{k+1} \ge \delta _k\). This is due to the fact that

$$\begin{aligned} |\rho _k-1| \le \frac{L\delta _k^2 + \delta _k t_k}{0.5 c \varepsilon \delta _k}\le \frac{2 L }{0.5 c \varepsilon } \tilde{\delta } {= \frac{2 L }{0.5 c \varepsilon }\frac{\varepsilon c }{20 L}=\frac{1}{5}}< \frac{1}{4}, \end{aligned}$$
(20)

i.e., \(\rho _k>\frac{3}{4}\ge \eta _2\), which implies that the NTR radius is not decreased; see lines 36–42 in ASNTR. Hence, there exists \(\hat{\delta }>0\) such that \(\delta _k \ge \hat{\delta }\) for all \(k \ge {\tilde{k}}\). Moreover, assuming \(\rho _k<\eta \) for all \(k \ge {\tilde{k}}\), where \(\eta <\eta _1\), would yield a contradiction since it would imply \(\lim _{k \rightarrow \infty } \delta _k=0\). Therefore, the set \(K=\{ k \ge {\tilde{k}} \,| \,\rho _k \ge \eta \}\) must be infinite. For all \(k \in K\), we have

$$\begin{aligned} f(w_{k+1}) \!\le \!f(w_k)\!+\!\delta _k t_k\!-\!\frac{c}{8}\Vert g_k\Vert \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\right\} \le f(w_k)\!+\!\delta _{max}t_k\!-\!\frac{c}{8} \varepsilon \min \{\hat{\delta }, \frac{\varepsilon }{L}\}. \end{aligned}$$

Now, let \({\bar{c}} \triangleq \frac{c}{8} \varepsilon \min \{\hat{\delta }, \frac{\varepsilon }{L}\}\). Since \(t_k\) tends to zero, we have \(\delta _{max}t_k \le \frac{{\bar{c}}}{2}\) for all k large enough; without loss of generality, we can assume that this holds for all \(k \in K\). Rewriting \(K=\{k_j\}_{j \in {\mathbb {N}}_0}\), we have

$$\begin{aligned} f(w_{k_j+1})\le f(w_{k_j})-\frac{{\bar{c}}}{2}. \end{aligned}$$

Since \(w_{k+1}=w_k\) for all \(k\ge {\tilde{k}}\) and \(k \notin K\), i.e. when \(\rho _k < \eta \), we obtain

$$\begin{aligned} f(w_{k_{j+1}})\le f(w_{k_j})-\frac{{\bar{c}}}{2}. \end{aligned}$$

Thus we obtain for all \(j \in {\mathbb {N}}_0\)

$$\begin{aligned} f(w_{k_{j}})\le f(w_{k_0})- j \frac{{\bar{c}}}{2} \le f(w_{{\tilde{k}}})+\delta _{max}\max \{t, {\tilde{t}}\}- j \frac{{\bar{c}}}{2}, \end{aligned}$$
(21)

where the last inequality is due to Lemma 3. Furthermore, applying the expectation to both sides of (21) and using Assumption 2 we get

$$\begin{aligned} {\mathbb {E}}_{A}(f(w_{k_{j}}))\le C_1+\delta _{max}\max \{t, {\tilde{t}}\}-j \frac{{\bar{c}}}{2}. \end{aligned}$$
(22)

Letting j tend to infinity in (22), we obtain \(\lim _{j \rightarrow \infty } {\mathbb {E}}_{A}(f(w_{k_j}))=-\infty \), which contradicts the fact that f is bounded from below (by Assumption 1). Therefore, \(\Vert \nabla f(w_k)\Vert \ge \varepsilon >0\) for all \(k\ge {\tilde{k}}\) cannot be true, so we conclude that

$$\begin{aligned} \liminf _{k \rightarrow \infty } \Vert \nabla f(w_k)\Vert =0. \end{aligned}$$

Now let us consider the mini-batch scenario, i.e., \(N_k < N\) for all k, in which the sample size is increased only finitely many times. According to Lemma 1 and lines 7–8 of the algorithm, this scenario implies the existence of a finite \({\tilde{k}}_1\) such that

$$\begin{aligned} N_k={\tilde{N}}, \quad \Vert g_k\Vert \ge \epsilon h(N_k)\triangleq \epsilon _{{\tilde{N}}}>0 \quad \text{ and } \quad \rho _{{\mathcal {D}}_k}\ge \nu , \end{aligned}$$
(23)

for all \(k\ge {\tilde{k}}_1\) and some \({\tilde{N}}<N\) a.s. Now, let us prove that there exists an infinite subset of iterations \({\tilde{K}} \subseteq {\mathbb {N}}\) such that \(\rho _{k}\ge \eta \) for all \(k \in {\tilde{K}}\), i.e., the trial point \(w_t\) is accepted infinitely many times. Assume the contrary, i.e., that there exists some finite \({\tilde{k}}_2\) such that \(\rho _{k}<\eta \) for all \(k \ge {\tilde{k}}_2\). Since \(\eta < \eta _1\), this further implies that \(\lim _{k \rightarrow \infty }\delta _k=0\); see lines 36 and 37 in Algorithm 1. Moreover, line 13 of Algorithm 1 implies that the sample does not change, meaning that there exists \(\tilde{{\mathcal {N}}}\subset {\mathcal {N}}\) such that \({\mathcal {N}}_k=\tilde{{\mathcal {N}}}\) for all \(k \ge {\tilde{k}}_3\triangleq \max \{{\tilde{k}}_1, {\tilde{k}}_2\}\). By L-Lipschitz continuity of \(\nabla f_{\tilde{{\mathcal {N}}}}\), we obtain

$$\begin{aligned} |f_{\tilde{{\mathcal {N}}}}(w_k+p_k)- f_{\tilde{{\mathcal {N}}}}(w_k)-\nabla ^T f_{\tilde{{\mathcal {N}}}}(w_k) p_k| \le \frac{L}{2} \Vert p_k\Vert ^2. \end{aligned}$$
(24)

For every \(k \ge {\tilde{k}}_3\), we then have

$$\begin{aligned} |\rho _{k}-1|&= \frac{|f_{\tilde{{\mathcal {N}}}}(w_k+p_k)- f_{\tilde{{\mathcal {N}}}}(w_k)-\delta _k t_k-\nabla ^T f_{\tilde{{\mathcal {N}}}}(w_k) p_k-0.5 p_k^T B_k p_k|}{|Q_k(p_k)|}\\ &\le \frac{0.5 L\Vert p_k\Vert ^2 + \delta _k t_k+0.5 L \Vert p_k\Vert ^2}{0.5 c \Vert g_k\Vert \min \{\delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\}}\\ &\le \frac{L\delta _k^2 + \delta _k t_k}{0.5 c \epsilon _{{\tilde{N}}} \min \{\delta _k, \frac{\epsilon _{{\tilde{N}}}}{L}\}}. \end{aligned}$$
(25)

Since \(\lim _{k \rightarrow \infty }\delta _k=0\), there exists \({\tilde{k}}_4\) such that for all \(k \ge {\tilde{k}}_4\) we obtain

$$\begin{aligned} |\rho _{k}-1|\le \frac{L\delta _k^2 + \delta _k t_k}{0.5 c \epsilon _{{\tilde{N}}} \delta _k}=\frac{L\delta _k + t_k}{0.5 c \epsilon _{{\tilde{N}}} }. \end{aligned}$$

Due to the fact that \(t_k\) tends to zero, we obtain that \(\lim _{k \rightarrow \infty } \rho _{k}=1\) which is in contradiction with \(\rho _{k}<\eta <1/4.\) Thus, we conclude that there must exist \({\tilde{K}} \subseteq {\mathbb {N}}\) such that \(\rho _{k}\ge \eta \) for all \(k \in {\tilde{K}}\). Let us define \(K_1 \triangleq {\tilde{K}}\cap \{{\tilde{k}}_1,{\tilde{k}}_1+1,\ldots \}\). Notice that we have \(\rho _{{\mathcal {D}}_k}\ge \nu \) and \(\rho _{k}\ge \eta \) for all \(k \in K_1\). Thus, due to Lemma 2, a.s. the following holds for all \(k \in K_1\)

$$\begin{aligned} \begin{aligned} f(w_{k+1})=f(w_t)&\le f(w_k)-\nu \Vert \nabla f(w_{k})\Vert ^2 + \delta _k {\tilde{t}}_k, \\&\le f(w_k)-\nu \Vert \nabla f(w_{k})\Vert ^2 + \delta _{max} {\tilde{t}}_k. \end{aligned} \end{aligned}$$
(26)

Notice that \(w_{k+1}=w_k\) in all other iterations \(k \ge {\tilde{k}}_1\) and \(k \notin K_1\). Rewriting \(K_1=\{k_j\}_{j \in {\mathbb {N}}_0}\), for all \(j \in {\mathbb {N}}_0\), we obtain

$$\begin{aligned} f(w_{k_{j+1}})=f(w_{k_{j}+1}) \le f(w_{k_j})-\nu \Vert \nabla f(w_{k_j})\Vert ^2+ \delta _{max} {\tilde{t}}_{k_j}. \end{aligned}$$

Then, due to the summability of the sequence \(\{{\tilde{t}}_k\}\) in (11), and Lemma 3, the following holds a.s. for any \(s \in {\mathbb {N}}_0\)

$$\begin{aligned} \begin{aligned} f(w_{k_{s+1}})&\le f(w_{k_0})-\nu \sum _{j=0}^{s}\Vert \nabla f(w_{k_j})\Vert ^2+ \delta _{max} {\tilde{t}} \\&\le f(w_{{\tilde{k}}})+\delta _{max}\max \{t,{\tilde{t}}\}-\nu \sum _{j=0}^{s}\Vert \nabla f(w_{k_j})\Vert ^2+ \delta _{max} {\tilde{t}}. \end{aligned} \end{aligned}$$
(27)

Now, applying the expectation to both sides of the second inequality in (27), using Assumption 2, which implies \({\mathbb {E}}_{{\bar{A}}} (|f(w_{{\tilde{k}}})| )\le C_2\), and the boundedness of f from below, and letting s tend to infinity, we obtain

$$\begin{aligned} \sum _{j=0}^{\infty } {\mathbb {E}}_{{\bar{A}}}(\Vert \nabla f(w_{k_j})\Vert ^2) <\infty . \end{aligned}$$

Moreover, the following holds as a consequence of an extended version of Markov’s inequality with the specific choice of the nonnegative function \(\Phi (y)=y^2\) nondecreasing on \( y\ge 0\): for any \(\epsilon >0\)

$$\begin{aligned} P_{{\bar{A}}}(\Vert \nabla f(w_{k_j})\Vert \ge \epsilon )\le \frac{{\mathbb {E}}_{{\bar{A}}}(\Phi (\Vert \nabla f(w_{k_j})\Vert ))}{\Phi (\epsilon )} = \frac{{\mathbb {E}}_{{\bar{A}}}(\Vert \nabla f(w_{k_j})\Vert ^2)}{\epsilon ^2}. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \sum _{j=0}^{\infty } P_{{\bar{A}}}(\Vert \nabla f(w_{k_j})\Vert \ge \epsilon )< \infty . \end{aligned}$$

Finally, the Borel–Cantelli Lemma implies that (in the current scenario) almost surely \( \lim _{j \rightarrow \infty } \Vert \nabla f(w_{k_j})\Vert =0\). Combining the two scenarios, the result follows since in both of them we have at least

$$\begin{aligned} \liminf _{k \rightarrow \infty } \Vert \nabla f(w_{k})\Vert =0 \quad \text{ a.s }. \end{aligned}$$

\(\square \)

Finally, we analyze the convergence rate of the proposed algorithm for a restricted class of problems stated in the following theorem.

Theorem 2

Suppose that the assumptions of Theorem 1 hold and that \(\nu <L/2\). Moreover, suppose that f is m-strongly convex and the sequence \(\{{\tilde{t}}_k\}\) converges to zero R-linearly. If \(N_k < N\) for all \(k \in {\mathbb {N}}\), then the sequence \(\{w_k\}_{k \in K_1}\) with \(K_1\) as in (26) converges to the unique minimizer \(w^*\) of f R-linearly in the mean squared sense.

Proof

Considering the mini-batch scenario in the proof of the previous theorem, we obtain

$$\begin{aligned} f(w_{k_{j+1}}) \le f(w_{k_j})-\nu \Vert \nabla f(w_{k_j})\Vert ^2+ \delta _{max} {\tilde{t}}_{k_j}, \end{aligned}$$

for all \(j \in {\mathbb {N}}\) a.s.. Moreover, the strong convexity of f implies

$$\begin{aligned} f(w_{k_{j+1}})-f(w^*) \le f(w_{k_j})-f(w^*)-\nu \frac{2}{L} (f(w_{k_j})-f(w^*))+ \delta _{max} {\tilde{t}}_{k_j}, \end{aligned}$$

for all \(j \in {\mathbb {N}}\) a.s.. Now, applying the expectation and defining \(e_j:={\mathbb {E}}_{{\bar{A}}}(f(w_{k_j})-f(w^*)),\) from the previous inequality we obtain

$$\begin{aligned} e_{j+1} \le \theta e_j+ \varepsilon _j, \end{aligned}$$

where \( \varepsilon _j:=\delta _{max} {\tilde{t}}_{k_j}\) and \(\theta :=1-2 \nu /L \in (0,1)\) according to the assumption \(\nu < L/2\). Since the previous inequality holds for each j, we conclude by induction that for all \(j \in {\mathbb {N}}\) there holds

$$\begin{aligned} e_j\le \theta ^j e_0+s_j, \end{aligned}$$

where \(s_j:=\sum _{i=1}^{j} \theta ^{i-1} \varepsilon _{j-i}.\) Notice that \(\{\varepsilon _j\}\) converges to zero R-linearly according to the assumption on \({\tilde{t}}_k\). This further implies that \(s_j\) converges to zero R-linearly (see Lemma 4.2 in [34]). Thus, we conclude that \(\{e_j\}\) converges to zero R-linearly. Finally, since strong convexity also implies

$$\begin{aligned} \frac{m}{2} {\mathbb {E}}_{{\bar{A}}}(\Vert w_{k_j}-w^*\Vert ^2)\le e_j, \end{aligned}$$

we obtain the result. \(\square \)

4 Numerical experiments

In this section, we provide some experimental results comparing ASNTR and STORM (Algorithm 5 in [41]). We examine the performance of both algorithms for training DNNs on two types of problems: (i) regression with the synthetic DIGITS dataset and (ii) classification with the MNIST and CIFAR10 datasets.

We also provide additional results to give more insight into the behavior of the ASNTR algorithm, especially concerning the sampling strategy. All experiments were conducted with the MATLAB DL toolbox on a VMware Ubuntu 20.04.4 LTS (64-bit) Linux server with 20 GB of memory, using an NVIDIA A100D-20C vGPU.

Table 1 Hyper-parameters

4.1 Experimental configuration

All three image datasets are divided into training and testing sets with N and \({\hat{N}}\) samples, respectively, and their pixel values in the range [0, 255] are divided by 255 so that the pixel intensities are bounded in [0, 1] (zero–one rescaling). To define the initialized networks and the training loops of both algorithms, we have applied the dlarray and dlnetwork MATLAB objects. The network parameters in \(w\in {\mathbb {R}}^n\) are initialized with the Glorot (Xavier) initializer [43] for the weights and with zeros for the biases of the convolutional layers, and with ones and zeros for the scale and offset variables of the batch normalization layers, respectively. Table 1 describes the hyper-parameters applied in both algorithms.
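For illustration, the zero–one rescaling and the Glorot (Xavier) initialization described above can be sketched as follows in NumPy; the function names and the dense weight shape are assumptions made for this example only, and the actual experiments rely on the MATLAB DL toolbox.

```python
import numpy as np

def zero_one_rescale(images):
    """Divide pixel intensities in [0, 255] by 255 (zero-one rescaling)."""
    return images.astype(np.float64) / 255.0

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initializer [43] for a weight matrix; biases are set to zero."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W0 = glorot_uniform(fan_in=784, fan_out=128, rng=rng)   # e.g. a 784 -> 128 layer
b0 = np.zeros(128)
```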

Since ASNTR and STORM allow the use of any Hessian approximations, the underlying quadratic model can be constructed by exploiting a Quasi-Newton update for \(B_k\). Quasi-Newton updates aim at constructing Hessian approximations using only gradient information and thus incorporating second-order information without computing and storing the true Hessian matrix. We have considered the Limited Memory Symmetric Rank one (L-SR1) update formula to generate \(B_k\) rather than other widely used alternatives such as limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). The L-SR1 updates might better navigate the pathological saddle points present in the non-convex optimization found in DL applications. Given \(B_0 = \gamma _ k I\) at iteration k, and curvature pair \((s_k, y_k)=(w_t-w_k, g_t - g_k)\) where \(g_k \triangleq \nabla f_{{\mathcal {N}}_k}(w_k)\) and \(g_t \triangleq \nabla f_{{\mathcal {N}}_k}(w_t)\) provided that \((y_k - B_ks_k)^Ts_k\ne 0\), the SR1 updates are obtained as follows

$$\begin{aligned} B_{k+1} = B_k + \dfrac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}, \quad k=0,1,\ldots . \end{aligned}$$
(28)

Using two limited memory storage matrices \(S_k\) and \(Y_k\) with at most \(l \ll n\) columns for storing the most recent l pairs \(\{ s_j, y_j\}\), \( j \in \{1,\ldots ,l\}\), the compact form of the L-SR1 update can be written as

$$\begin{aligned} B_k = B_0 + \Psi _k M_k \Psi _k^T, \quad k = 1,2,\ldots , \end{aligned}$$
(29)

where

$$\begin{aligned} \Psi _k = Y_k - B_0S_k,\qquad M_k = (D_k + L_k + L_k^T - S_k^T B_0 S_k)^{-1} \end{aligned}$$

with matrices \(L_k\) and \(D_k\) which respectively are the strictly lower triangular part and the diagonal part of \(S_k^TY_k\); see [1]. Regarding the selection of the variable multiplier \(\gamma _k\) in \(B_0\), we refer to the initialization strategy proposed in [30]. Given the quadratic model \(Q_k(p)\) using L-SR1 in ASNTR and STORM, we have used an efficient algorithm called OBS [44], exploiting the structure of the L-SR1 matrix (29) and the Sherman-Morrison-Woodbury formula for inversions to obtain \(p_k\).
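As an illustration of the compact representation (29), the following NumPy sketch applies \(B_k\) to a vector with \(B_0=\gamma _k I\); it assumes the stored pairs already satisfy the SR1 well-definedness condition \((y_j - B s_j)^T s_j \ne 0\), and it does not reproduce the OBS subproblem solver [44] itself.

```python
import numpy as np

def lsr1_matvec(v, S, Y, gamma):
    """Apply the compact L-SR1 matrix (29) to a vector v:
    B_k v with B_0 = gamma*I, Psi = Y - gamma*S and
    M^{-1} = D + L + L^T - gamma*S^T S, where D and L are the diagonal
    and strictly lower triangular parts of S^T Y.

    S, Y are n-by-l arrays whose columns store the recent pairs (s_j, y_j)."""
    Psi = Y - gamma * S
    SY = S.T @ Y
    D = np.diag(np.diag(SY))
    L = np.tril(SY, k=-1)
    M_inv = D + L + L.T - gamma * (S.T @ S)
    # solve the small l-by-l system instead of forming M explicitly
    return gamma * v + Psi @ np.linalg.solve(M_inv, Psi.T @ v)
```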

We have randomly (without replacement) chosen the index subset \({\mathcal {N}}_k \subseteq \{1,2,\dots , N\}\) to generate a mini-batch of size \(N_k\) for computing the required quantities, i.e., the subsampled functions and gradients. Given \(N_0 = d+1\), where d is the dimension of a single input sample \(x_i \in {\mathbb {R}}^d\), we have adopted the linearly increasing sampling rule \(N_k = \min ( N, \max (100k + N_0, \lceil \frac{1}{{\delta _k}^2}\rceil ))\) for STORM as in [41], while for ASNTR we have exploited the following simple sampling rule whenever an increase is necessary

$$\begin{aligned} N_{k+1}= \lceil 1.01 N_k\rceil , \end{aligned}$$
(30)

otherwise \(N_{k+1}=N_k\). Using the non-monotone TR framework in our algorithm, we set \(t_k = \frac{C_1}{(k+1)^{1.1}}\) and \({\tilde{t}}_k = \frac{C_2}{(k+1)^{1.1}}\) for some \(C_1, C_2 > 0\) in (7) and (10), respectively. We have also selected \({\mathcal {D}}_k\) with cardinality 1 at every single iteration in ASNTR.
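The two batch-size rules and the non-monotone sequences just described can be summarized as follows; the cap at N in rule (30) is implied by the algorithm (the sample size never exceeds the full set), and the helper names are illustrative.

```python
import math

def storm_batch_size(k, N, N0, delta_k):
    """Linearly increasing rule used for STORM as in [41]."""
    return min(N, max(100 * k + N0, math.ceil(1.0 / delta_k ** 2)))

def asntr_batch_size(N_k, N, increase_needed):
    """Rule (30): enlarge the batch by 1% only when ASNTR requires a better approximation."""
    return min(N, math.ceil(1.01 * N_k)) if increase_needed else N_k

def nonmonotone_terms(k, C1=1.0, C2=1.0):
    """Summable sequences used in (7) and (10): t_k = C1/(k+1)^1.1 and t~_k = C2/(k+1)^1.1."""
    return C1 / (k + 1) ** 1.1, C2 / (k + 1) ** 1.1
```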

In our implementations, each algorithm was run with 5 different initial random seeds. The performance criteria of both algorithms (accuracy and loss) are compared against the number of gradient calls (\(N_g\)) during the training phase. The algorithms were terminated when \(N_g\) reached the predetermined budget of gradient evaluations (\(N_g^{\text {max}}\)). Each network is trained by ASNTR and STORM; then the trained network is used for prediction on the testing dataset. Notice that the training loss and accuracy are, respectively, the subsampled function value and the percentage of correct predictions with respect to the given mini-batch.

4.2 Classification problems

To solve an image classification problem for images with unknown classes/labels, we seek an optimal classification model by using a C-class training dataset \(\{(x_i,y_i)\}_{i=1}^{N}\) with an image \(x_i \in {\mathbb {R}}^{d}\) and its one-hot encoded label \(y_i \in {\mathbb {R}}^C\). To this end, the generic DL problem (1) is minimized, where the per-sample loss function \(f_i = L(y_i, h(x_i;\cdot ))\) is the softmax cross-entropy

$$\begin{aligned} f_i(w) = - \sum _{k=1}^{C} (y_i)_k \log (h(x_i; w))_k,\quad i=1,\dots ,N. \end{aligned}$$
(31)

In (31), the output \(h(x_i; w)\) is a prediction provided by a DNN whose last layer includes the softmax function. For this classification task, we have considered two types of networks: the LeNet-like network with a shallow structure inspired by LeNet-5 [48], and the modern residual network ResNet-20 [46] with a deep structure. See Table 2 for the networks’ architectures. We have also considered the two most popular benchmarks: the MNIST dataset [49] with \(7\times 10^4\) samples of handwritten gray-scale digit images of \(28\times 28\) pixels, and the CIFAR10 dataset [50] with \(6\times 10^4\) RGB images of \(32\times 32\) pixels, both in 10 categories. Every single image of the MNIST and CIFAR10 datasets is defined as a 3-D numeric array \(x_i \in {\mathbb {R}}^d\), where \(d = {28 \times 28 \times 1}\) and \(d = {32 \times 32 \times 3}\), respectively. Moreover, every single label \(y_i\) is converted into a one-hot encoded label \(y_i \in {\mathbb {R}}^C\), where \(C=10\). In both datasets, \({\hat{N}}=10^4\) images are set aside as testing sets, and the remaining N images are used as training sets. Besides the zero–one rescaling, we implemented z-score normalization to obtain zero mean and unit variance. Precisely, all N training images undergo normalization by subtracting the mean (an \(h \times w \times c\) array) and dividing by the standard deviation (an \(h \times w \times c\) array) of the training images as an array of size \(h \times w \times c \times N\). Here, h, w, and c denote the height, width, and number of channels of the images, respectively, while N represents the total number of images. Test data are also normalized using the same parameters as in the training data.
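For concreteness, a minimal NumPy version of the per-sample loss (31) is given below; the network producing the last-layer scores is assumed, and only the softmax and the cross-entropy with a one-hot label are shown.

```python
import numpy as np

def softmax_cross_entropy(scores, y_onehot):
    """Per-sample loss (31): f_i(w) = -sum_k (y_i)_k log(h(x_i; w))_k,
    where h(x_i; w) is the softmax of the network's last-layer scores."""
    z = scores - scores.max()                 # numerically stabilized softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -float(y_onehot @ log_probs)

# toy usage with C = 10 classes
rng = np.random.default_rng(0)
y = np.zeros(10)
y[3] = 1.0                                    # one-hot encoded label
loss = softmax_cross_entropy(rng.standard_normal(10), y)
```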

Table 2 Architectures of the networks
Fig. 1 The accuracy variations of STORM and ASNTR on MNIST

Fig. 2 The accuracy variations of STORM and ASNTR on CIFAR10

Figures 1, 2, 3 and 4 show the variations (mean and standard error over 5 separate runs) of the aforementioned measures for both training and testing datasets of ASNTR with \(C_2=1, 10^2,10^8\) in \(\rho _{{\mathcal {D}}_k}\) and \(C_1=1\) in \(\rho _{{\mathcal {N}}_k}\), within fixed budgets of gradient evaluations \(N_g^{\text {max}}=6\times 10^5\) for MNIST and \(N_g^{\text {max}}=9\times 10^6\) for CIFAR10. These figures demonstrate that ASNTR achieves higher training and testing accuracy than STORM for all considered values of \(C_2\), except in Fig. 2 with \(C_2=1\), where ASNTR is comparable with STORM. Nevertheless, ASNTR is capable of achieving higher accuracy (lower loss) with fewer \(N_g\) in comparison with STORM. Moreover, these figures indicate the robustness of ASNTR for larger values of \(C_2\), i.e., higher levels of non-monotonicity.

Fig. 3 The loss variation of STORM and ASNTR on MNIST

Fig. 4 The loss variation of STORM and ASNTR on CIFAR10

4.3 Regression problem

This example shows how to fit a regression model using a neural network to predict the angles of rotation of handwritten digits, which is useful for optical character recognition. To find an optimal regression model, the generic problem (1) is minimized, where the per-sample loss \(f_i = L(y_i, h(x_i;\cdot ))\), with predicted output \(h(x_i; w)\), is the half-mean-squared error

$$\begin{aligned} f_i(w) = \frac{1}{2} ( y_i - h(x_i; w) )^2,\quad i=1,\dots ,N. \end{aligned}$$
(32)

In this regression example, we have considered a convolutional neural network (CNN) with an architecture named CNN-Rn, as indicated in Table 2. We have also used the DIGITS dataset containing \(10\times 10^3\) synthetic images with \(28\times 28\) pixels, together with the angles (in degrees) by which each image is rotated. Every single image is defined as a 3-D numeric array \(x_i \in {\mathbb {R}}^d\), where \(d = {28 \times 28 \times 1}\). Moreover, the response \(y_i\) (the rotation angle in degrees) is approximately uniformly distributed between \(-45^{\circ }\) and \(45^{\circ }\). The training and testing datasets have the same number of images (\(N={\hat{N}}=5\times 10^3\)). Besides the zero–one rescaling, we have also applied zero-center normalization to obtain zero mean. Problem (1) with the single loss function (32) is solved for \(w \in {\mathbb {R}}^n\), where \(n=16,881\), using CNN-Rn for DIGITS. In this experiment, the accuracy is the percentage of predictions within an acceptable error margin (threshold), which we set to 10 degrees.
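A minimal sketch of the per-sample loss (32) and of the threshold-based accuracy used in this experiment follows; the 10-degree threshold is the one stated above, and the function names are illustrative.

```python
import numpy as np

def half_squared_error(y, y_hat):
    """Per-sample regression loss (32): 0.5 * (y_i - h(x_i; w))^2."""
    return 0.5 * (y - y_hat) ** 2

def angle_accuracy(y, y_hat, threshold=10.0):
    """Percentage of predictions whose absolute error is within the threshold (in degrees)."""
    return 100.0 * np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)) <= threshold)
```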

Fig. 5 The accuracy variations of STORM and ASNTR on DIGITS

Fig. 6 The loss variation of STORM and ASNTR on DIGITS

Fig. 7 Tracking subsampling in ASNTR

Fig. 8 Batch size progress with initial \(\texttt {rng(42)}\)

Figures 5 and 6 show training accuracy and loss, and testing accuracy and loss variations of ASNTR for DIGITS dataset with three values of \(C_2\) within a fixed budget of gradient evaluations \(N_g^{\text {max}}=2\times 10^6\). These figures also illuminate how resilient ASNTR is for the highest value of \(C_2\). Despite several challenges in the early stages of the training phase with \(C_2=1\) and \(C_2=10^2\), ASNTR can overcome them and achieve accuracy levels comparable to those of the STORM algorithm.

4.4 Additional results

We present two additional figures (Figs. 7 and 8) containing further details regarding our proposed algorithm, ASNTR, with \(C_2=1,\, 10^{2},\, 10^{8}\) in \(\rho _{{\mathcal {D}}_k}\) and \(C_1=1\) in \(\rho _{{\mathcal {N}}_k}\). More specifically, these results aim to give useful insights concerning the sampling behavior of ASNTR. Let S1 and S2 indicate the iterations of ASNTR at which steps 7 and 10, respectively, are executed using the increasing sampling rule (30). When the cardinality of the sample set is not changed, i.e., \(N_{k+1}=N_k\), let S3 and S0 denote the iterations at which a new sample is selected through step 15 and the current sample is kept through step 13, respectively. We also define S4 to represent the iterations of ASNTR at which all available samples (\(N_k=N\)) are needed for computing the required quantities.

Figure 7a shows the (average) contributions of the aforementioned sampling types in ASNTR run with five different initial random generators for MNIST, CIFAR10, and DIGITS, with predetermined budgets \(N_g^{\text {max}}=0.6\times 10^{6},\, 9\times 10^{6}\) and \(2\times 10^{6}\), respectively; for a specific initial seed, e.g. rng(42), Fig. 7b–d indicate when each of these sampling types is utilized in ASNTR. Obviously, the contribution rate of S3 is influenced by S2, where ASNTR has to increase the batch size if \(\rho _{{\mathcal {D}}_k}<\nu \). In fact, the larger the \(C_2\) in \(\rho _{{\mathcal {D}}_k}\), the higher the portion of S3 and the smaller the portion of S2. Moreover, using a large value of \(C_2\), there is no need to increase the current batch size unless the current iterate approaches a stationary point of the current approximate function. This, in turn, increases the portion of S1, which usually happens at the end of the training stage. In addition, we have also found that the value of \(C_2\) in \({\tilde{t}}_k\) plays an important role in the robustness of our proposed algorithm, as observed in Figs. 1 to 6; in other words, the higher the \(C_2\), the more robust ASNTR is. These observations can be detailed for each dataset as follows:

  • MNIST: according to Fig. 7a, the portion of the sampling type S3 is larger than the others. This means many new batches without an increase in the batch size are applied in ASNTR during training; i.e., the proposed algorithm can train a network with fewer samples, and thus fewer gradient evaluations. Nevertheless, ASNTR with different values of \(C_2\) in \(\rho _{{\mathcal {D}}_k}\) increases the size of the mini-batches in some of its iterations; see the portions of S1 and S2 in Fig. 7a or their scatters in Fig. 7b. We should note that the sampling type S1 occurs almost at the end of the training phase where the algorithm tends to be close to a stationary point of the current approximate function; Fig. 7b shows this fact.

  • CIFAR10: according to Fig. 7a, the portion of the sampling type S3 is still large. Unlike MNIST, the sampling type S1 rarely occurs during the training phase. On the other hand, a large portion of the increase of the sample size through S2 may compensate for the lack of sufficiently accurate functions and gradients required in ASNTR. These points are also illustrated in Fig. 7c, which shows how ASNTR successfully trained the ResNet-20 model without frequently enlarging the sample sizes. For both the MNIST and CIFAR10 problems, S3 as the predominant type corresponds to \(C_2=10^{8}\).

  • DIGITS: according to Fig. 7a, we observe that the main sampling types are S2 and S4. As the portion of S2 increases, the portion of S3 decreases and the highest portion of S3 corresponds to the largest value of \(C_2\). This pattern is similar to what is seen in the MNIST and CIFAR10 datasets. However, in the case of DIGITS, the portion of S4 is higher. This higher portion of S4 in DIGITS may be attributed to the smaller number of samples in this dataset (\(N=5000\)), which causes ASNTR to quickly encompass all the samples after a few iterations. Notably, the sampling type S4 occurs towards the end of the training phase, as shown in Fig. 7d.

Figure 8 compares the progression of the batch size in ASNTR and STORM. In contrast to the STORM algorithm, ASNTR increases the batch size only when necessary, which can reduce the computational cost of gradient evaluations. This is considered a significant advantage of ASNTR over STORM. According to this figure, the proposed algorithm needs fewer samples than STORM during the initial phase of training, but it requires more samples toward the end. Nevertheless, we should notice that the increase in batch size occurring at the end of the training phase is determined either by S1 or by S4 (see Figs. 7b and d). In our experiments, we have observed that ASNTR does not need a very large \(N_g^{\text {max}}\), as it typically achieves the required training accuracy.

Fig. 9 Comparison of ASNTR vs. tuned ADAM (first row), and vs. untuned ADAM (second row) on MNIST for training LeNet-like with rng(42)

4.5 Comparison with ADAM

We have compared the proposed stochastic second-order method (ASNTR) with a popular and efficient first-order optimizer used in DL, namely Adaptive Moment Estimation (ADAM) [12]. We have implemented ADAM using the MATLAB built-in function adamupdate in customized training loops. It is widely recognized that this optimizer is highly sensitive to the values of its hyper-parameters, including the learning rate \(\alpha _k\), the gradient decay factor \(\beta _1\), the squared gradient decay factor \(\beta _2\), a small constant \(\epsilon \) to prevent division by zero, and the batch size. Users should be aware of the hyper-parameter choices and invest time in tuning them using techniques such as grid search. In our experiments, we set \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon =10^{-8}\), and focus our tuning effort on the learning rate and the batch size. The choices for tuning the learning rate were \(\alpha _k = \alpha \) with \(\alpha \) taking values from the set \(\{10^{-4}, 10^{-3}, 10^{-2}\}\), and the corresponding values for the batch size were \(N_k = bs\), where bs varied within \(\{128, 256, 784\}\) for both MNIST and DIGITS, and within \(\{128, 256, 3072\}\) for CIFAR10.
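The grid search just described can be summarized by the following sketch, where `evaluate` is a placeholder for one full training run returning the final testing accuracy; the fixed values of \(\beta _1\), \(\beta _2\) and \(\epsilon \) are those stated above.

```python
import itertools

def tune_adam(evaluate, dataset="MNIST"):
    """Grid search over the learning rate and the batch size with (beta1, beta2, eps) fixed."""
    alphas = [1e-4, 1e-3, 1e-2]
    batch_sizes = [128, 256, 784] if dataset in ("MNIST", "DIGITS") else [128, 256, 3072]
    return max(itertools.product(alphas, batch_sizes),
               key=lambda cfg: evaluate(alpha=cfg[0], batch_size=cfg[1],
                                        beta1=0.9, beta2=0.999, eps=1e-8))

# In the experiments reported below, alpha = 1e-3 and bs = 128 turned out to be the best pair.
```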

The best hyper-parameters were those that yielded the highest testing accuracy within a fixed budget of gradient evaluations. In all experiments, the optimal performance with ADAM was consistently achieved using \(\alpha =10^{-3}\) and \(bs=128\). These settings were employed with ADAM in Figs. 9, 10 and 11 (first rows) for comparison against ASNTR with \(C_2=10^8\), as demonstrated in Figs. 1 to 6. As these figures show, the tuned ADAM could produce the highest accuracy and lowest loss with fewer gradient evaluations in comparison with ASNTR. The common observation of rapid initial improvement achieved by ADAM, followed by a drastic slowdown, is well understood in practice for some first-order methods (see, e.g., [13]). First-order methods such as ADAM are computationally less expensive per iteration than second-order methods such as ASNTR, as they involve the computation of only one gradient instead of two. Moreover, as already mentioned, the initial sample size for ASNTR was set to \(N_0 = d+1\), where \(d=784\) for both MNIST and DIGITS and \(d=3072\) for CIFAR10. By initially employing such large batch sizes, compared with the optimal batch size obtained for ADAM, i.e., \(N_k=128\), only a small number of iterates (\(w_k\)) are updated during training with the second-order methods within the fixed budget of gradient evaluations. Therefore, allowing ASNTR to train the networks with a larger budget of gradient evaluations helps it to eventually achieve a higher level of accuracy, while this does not help ADAM. Note that we have performed this analysis between the tuned ADAM (\(N_k=128\) and \(\alpha _k=10^{-3}\)) and ASNTR, where tuning the hyper-parameters of ADAM incurs significant time costs. When ADAM is used with suboptimal hyper-parameters, its sensitivity becomes evident, as illustrated in Figs. 9, 10 and 11 (second rows), where the batch size \(N_k=d+1\) and different learning rates are considered. As these figures show, for a challenging problem such as CIFAR10 classification, ADAM may not achieve a lower loss value than ASNTR, even when employing an optimal learning rate (\(\alpha =10^{-2}\)). Nor could ADAM produce higher testing accuracy than ASNTR on the other challenging problem, DIGITS. All the results shown in the three figures were obtained using the same seed for the random number generator (rng(42)). Moreover, due to strong oscillations in the loss and accuracy obtained by ADAM on the CIFAR10 and DIGITS problems, we have applied a moving-mean filter using the MATLAB built-in function movmean (with a window of length c) in Figs. 10 and 11; in our experiments, \(c=30\).

Fig. 10 Comparison of ASNTR vs. tuned ADAM (first row), and vs. untuned ADAM (second row) on CIFAR10 for training ResNet-20 with rng(42)

Fig. 11 Comparison of ASNTR vs. tuned ADAM (first row), and vs. untuned ADAM (second row) on DIGITS for training CNN-Rn with rng(42)

5 Conclusion

In this work, we have presented ASNTR, a second-order non-monotone trust-region method that employs an adaptive subsampling strategy. We have incorporated additional sampling into the TR framework to control the noise and to overcome issues in the convergence analysis coming from biased estimators. Depending on the estimated progress of the algorithm, this can yield different scenarios ranging from mini-batch to full-sample functions. We provide convergence analysis for all possible scenarios and show that the proposed method achieves almost sure convergence under standard assumptions for the TR framework. The experiments in deep neural network training for both image classification and regression show the efficiency of the proposed method. In comparison to the state-of-the-art second-order method STORM, ASNTR achieves higher testing accuracy within a fixed budget of gradient evaluations. However, our experiments also show that the popular first-order method ADAM, tuned with its optimal hyper-parameters, can produce higher accuracy than ASNTR within a fixed and smaller budget of gradient computations, if one does not count the effort needed for tuning its parameters. As expected, ASNTR is more robust and performs better than ADAM with suboptimal parameters. Future work on ASNTR could include more specialized sample size updates and Hessian approximation strategies.