Stochastic projective splitting

A Correction to this article was published on 27 October 2023

Abstract

We present a new, stochastic variant of the projective splitting (PS) family of algorithms for inclusion problems involving the sum of any finite number of maximal monotone operators. This new variant uses a stochastic oracle to evaluate one of the operators, which is assumed to be Lipschitz continuous, and (deterministic) resolvents to process the remaining operators. Our proposal is the first version of PS with such stochastic capabilities. We envision the primary application being machine learning (ML) problems, with the method’s stochastic features facilitating “mini-batch” sampling of datasets. Since it uses a monotone operator formulation, the method can handle not only Lipschitz-smooth loss minimization, but also min–max and noncooperative game formulations, with better convergence properties than the gradient descent-ascent methods commonly applied in such settings. The proposed method can handle any number of constraints and nonsmooth regularizers via projection and proximal operators. We prove almost-sure convergence of the iterates to a solution and a convergence rate result for the expected residual, and close with numerical experiments on a distributionally robust sparse logistic regression problem.

Data Availability

The data analyzed during the current study are from the public LIBSVM repository available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Notes

  1. Original data source http://largescale.ml.tu-berlin.de/instructions/.

  2. Original data source https://people.cs.umass.edu/~mccallum/data.html.

References

  1. Alacaoglu, A., Malitsky, Y., Cevher, V.: Forward-reflected-backward method with variance reduction. Comput. Optim. Appl. (2021)

  2. Alotaibi, A., Combettes, P.L., Shahzad, N.: Solving coupled composite monotone inclusions by successive Fejér approximations of their Kuhn–Tucker set. SIAM J. Optim. 24(4), 2076–2095 (2014)

  3. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)

  4. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of \(n\)-player differentiable games. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 354–363. PMLR (2018)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  6. Boţ, R.I., Mertikopoulos, P., Staudigl, M., Vuong, P.T.: Minibatch forward-backward-forward methods for solving stochastic variational inequalities. Stoch. Syst. 11(2), 112–139 (2021)

  7. Böhm, A., Sedlmayer, M., Csetnek, E.R., Boţ, R.I.: Two steps at a time—taking GAN training in stride with Tseng’s method. arXiv preprint arXiv:2006.09033 (2020)

  8. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, pp. 177–186. Springer, Berlin (2010)

  9. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  10. Briceño-Arias, L.M., Combettes, P.L.: A monotone+skew splitting model for composite monotone inclusions in duality. SIAM J. Optim. 21(4), 1230–1250 (2011)

  11. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  12. Celis, L.E., Keswani, V.: Improved adversarial learning for fair classification. arXiv preprint arXiv:1901.10443 (2019)

  13. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  14. Chavdarova, T., Pagliardini, M., Stich, S.U., Fleuret, F., Jaggi, M.: Taming GANs with lookahead-minmax. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=ZW0yXJyNmoG

  15. Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168(1–2), 645–672 (2018)

  16. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H., Burachik, R., Combettes, P., Elser, V., Luke, D., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, Berlin (2011)

  17. Combettes, P.L., Pesquet, J.C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued Var. Anal. 20(2), 307–330 (2012)

  18. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  19. Daskalakis, C., Ilyas, A., Syrgkanis, V., Zeng, H.: Training GANs with optimism. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=SJJySbbAZ

  20. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25(4), 829–858 (2017)

  21. Diakonikolas, J.: Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. In: Conference on Learning Theory, pp. 1428–1451. PMLR (2020)

  22. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml

  23. Eckstein, J.: A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block alternating direction method of multipliers. J. Optim. Theory Appl. 173(1), 155–182 (2017)

  24. Eckstein, J., Svaiter, B.F.: A family of projective splitting methods for the sum of two maximal monotone operators. Math. Program. 111(1), 173–199 (2008)

  25. Eckstein, J., Svaiter, B.F.: General projective splitting methods for sums of maximal monotone operators. SIAM J. Control. Optim. 48(2), 787–811 (2009)

  26. Edwards, H., Storkey, A.: Censoring representations with an adversary. arXiv preprint arXiv:1511.05897 (2015)

  27. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Solution of Boundary Value Problems, chap. IX, pp. 299–340. North-Holland, Amsterdam (1983)

  28. Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality perspective on generative adversarial networks. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=r1laEnA5Ym

  29. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates (2014)

  30. Grnarova, P., Kilcher, Y., Levy, K.Y., Lucchi, A., Hofmann, T.: Generative minimization networks: training GANs without competition. arXiv preprint arXiv:2103.12685 (2021)

  31. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: On the convergence of single-call stochastic extra-gradient methods. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates (2019)

  32. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 16223–16234. Curran Associates (2020)

  33. Huang, C., Kairouz, P., Chen, X., Sankar, L., Rajagopal, R.: Context-aware generative adversarial privacy. Entropy 19(12), 656 (2017)

  34. Johnstone, P.R., Eckstein, J.: Convergence rates for projective splitting. SIAM J. Optim. 29(3), 1931–1957 (2019)

  35. Johnstone, P.R., Eckstein, J.: Single-forward-step projective splitting: exploiting cocoercivity. arXiv preprint arXiv:1902.09025 (2019)

  36. Johnstone, P.R., Eckstein, J.: Projective splitting with forward steps only requires continuity. Optim. Lett. 14(1), 229–247 (2020)

  37. Johnstone, P.R., Eckstein, J.: Single-forward-step projective splitting: exploiting cocoercivity. Comput. Optim. Appl. 78(1), 125–166 (2021)

  38. Johnstone, P.R., Eckstein, J.: Projective splitting with forward steps. Math. Program. 191(2), 631–670 (2022)

  39. Korpelevich, G.: Extragradient method for finding saddle points and other problems. Matekon 13(4), 35–49 (1977)

  40. Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Netessine, S. (ed.) Operations Research & Management Science in the Age of Analytics, Tutorials in Operations Research, pp. 130–166. INFORMS (2019)

  41. Li, C.J., Yu, Y., Loizou, N., Gidel, G., Ma, Y., Roux, N.L., Jordan, M.I.: On the convergence of stochastic extragradient for bilinear games with restarted iteration averaging. arXiv preprint arXiv:2107.00464 (2021)

  42. Lin, T., Jin, C., Jordan, M.: On gradient descent ascent for nonconvex-concave minimax problems. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, pp. 6083–6093. PMLR (2020)

  43. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

  44. Malitsky, Y., Tam, M.K.: A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30(2), 1451–1472 (2020)

  45. Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for GANs do actually converge? In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 3481–3490. PMLR (2018)

  46. Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates (2017)

  47. Monteiro, R.D., Svaiter, B.F.: On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM J. Optim. 20(6), 2755–2787 (2010)

  48. Nagarajan, V., Kolter, J.Z.: Gradient descent GAN optimization is locally stable. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates (2017)

  49. Namkoong, H., Duchi, J.C.: Stochastic gradient methods for distributionally robust optimization with \(f\)-divergences. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates (2016)

  50. Nemirovski, A.: Prox-method with rate of convergence O\((1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  51. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  52. Pedregosa, F., Fatras, K., Casotto, M.: Proximal splitting meets variance reduction. In: Chaudhuri, K., Sugiyama, M. (eds.) Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 89, pp. 1–10. PMLR (2019)

  53. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  54. Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. Nonlinear Funct. Anal. 18(part 1), 397–407 (1970)

  55. Ryu, E.K., Boyd, S.: Primer on monotone operator methods. Appl. Comput. Math 15(1), 3–43 (2016)

  56. Shafieezadeh-Abadeh, S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1576–1584. Curran Associates (2015)

  57. Sinha, A., Namkoong, H., Duchi, J.: Certifying some distributional robustness with principled adversarial training. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk6kPgZA-

  58. Tseng, P.: A modified forward–backward splitting method for maximal monotone mappings. SIAM J. Control. Optim. 38(2), 431–446 (2000)

  59. Van Dung, N., Vu, B.C.: Convergence analysis of the stochastic reflected forward–backward splitting algorithm. arXiv preprint arXiv:2102.08906 (2021)

  60. Wadsworth, C., Vera, F., Piech, C.: Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199 (2018)

  61. Yu, Y., Lin, T., Mazumdar, E., Jordan, M.I.: Fast distributionally robust learning with variance reduced min-max optimization. arXiv preprint arXiv:2104.13326 (2021)

  62. Yurtsever, A., Vu, B.C., Cevher, V.: Stochastic three-composite convex minimization. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates (2016)

  63. Zhang, B.H., Lemoine, B., Mitchell, M.: Mitigating unwanted biases with adversarial learning. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340 (2018)

Author information

Corresponding author

Correspondence to Patrick R. Johnstone.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare with regard to the current study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Approximation residuals

In this section we derive the approximation residual used to assess the performance of the algorithms in the numerical experiments. This residual relies on the following product-space reformulation of (1).

1.1 Appendix A.1: Product-space reformulation and residual principle

Recall (1), the monotone inclusion we are solving:

$$\begin{aligned} \text {Find}\,z\in \mathbb {R}^d: 0 \in \sum _{i=1}^nA_i(z) + B(z). \end{aligned}$$

In this section we present a “product-space” reformulation of (1) that rewrites it in a standard form involving just two operators, one maximal monotone and the other monotone and Lipschitz. This approach was pioneered in [10, 17]. Besides yielding a simple approximation residual measuring the error in solving (1), the reformulation allows one to apply operator splitting methods originally formulated for two operators to problems such as (1) for any finite n.

Observe that solving (1) is equivalent to

$$\begin{aligned} \text {Find}\; (w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}: \quad w_i&\in A_i (z),\quad i\in 1..n\\ 0&\in \sum _{i=1}^nw_i + B(z). \end{aligned}$$

This formulation resembles that of the extended solution set \(\mathcal {S}\) used in projective splitting, as given in (2), except that it combines the final two conditions in the definition of \(\mathcal {S}\), and thus does not need the final dual variable \(w_{n+1}\). From the definition of the inverse of an operator, the above formulation is equivalent to

$$\begin{aligned} \text {Find}\, (w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}: \quad 0&\in A_i^{-1}(w_i) - z,\quad i\in 1..n\\ 0&\in \sum _{i=1}^nw_i + B(z). \end{aligned}$$

These conditions are in turn equivalent to finding \((w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}\) such that

$$\begin{aligned} 0\in {\mathscr {A}}(w_1,\ldots ,w_n,z) + {\mathscr {B}}(w_1,\ldots ,w_n,z), \end{aligned}$$
(60)

where \({\mathscr {A}}\) is the set-valued map

$$\begin{aligned} {\mathscr {A}}(w_1,\ldots ,w_n,z)\mapsto A_1^{-1}(w_1)\times A_2^{-1}(w_2)\times \ldots \times A_n^{-1}(w_n)\times \{0\} \end{aligned}$$
(61)

and \({\mathscr {B}}\) is the single-valued operator

$$\begin{aligned} {\mathscr {B}}(w_1,\ldots ,w_n,z)\mapsto \left[ \begin{array}{cccc} 0 &{} \cdots &{} 0 &{} -I\\ \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} -I\\ I &{} \cdots &{} I &{} 0 \end{array} \right] \left[ \begin{array}{c} w_1\\ \vdots \\ w_n\\ z \end{array} \right] + \left[ \begin{array}{c} 0\\ \vdots \\ 0\\ B(z) \end{array} \right] . \end{aligned}$$
(62)

It is easily established that \({\mathscr {B}}\) is maximal monotone and Lipschitz continuous, while \({\mathscr {A}}\) is maximal monotone. Letting \( \mathscr {T}\doteq {\mathscr {A}}+ {\mathscr {B}}, \) it follows from [5, Prop. 20.23] that \(\mathscr {T}\) is maximal monotone. Thus, we have reformulated (1) as the monotone inclusion \(0\in \mathscr {T}(q)\) for q in the product space \(\mathbb {R}^{(n+1)d}\). A vector \(z\in \mathbb {R}^d\) solves (1) if and only if there exists \((w_1,\ldots ,w_n)\in \mathbb {R}^{nd}\) such that \(0\in \mathscr {T}(q)\), where \(q=(w_1,\ldots ,w_n,z)\).
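For concreteness, the following Python sketch applies the single-valued operator \({\mathscr {B}}\) in (62) without forming the block matrix explicitly. It is purely illustrative (function and argument names are ours, not the authors’); it assumes the blocks are stored as NumPy arrays and that a callable B implements the Lipschitz operator B.

```python
import numpy as np

def apply_B_product(w_list, z, B):
    """Apply the single-valued operator (62) to (w_1, ..., w_n, z).

    w_list : list of n arrays of shape (d,), the dual blocks w_i
    z      : array of shape (d,), the primal block
    B      : callable implementing the Lipschitz operator B
    Returns the n+1 output blocks as (list of n arrays, array).
    """
    top = [-z.copy() for _ in w_list]   # first n block rows each equal -z
    bottom = sum(w_list) + B(z)         # last block row: sum_i w_i + B(z)
    return top, bottom
```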

For any pair \((q,v)\) such that \(v\in \mathscr {T}(q)\), \(\Vert v\Vert ^2\) represents an approximation residual for q in the sense that it can only be 0 if q is a solution to (60); one may therefore take \(\Vert v\Vert ^2\) as a measure of the error of q as an approximate solution to (60). Given two approximate solutions \(q_1\) and \(q_2\) with certificates \(v_1\in \mathscr {T}(q_1)\) and \(v_2\in \mathscr {T}(q_2)\), we will treat \(q_1\) as a “better” approximate solution than \(q_2\) if \(\Vert v_1\Vert ^2<\Vert v_2\Vert ^2\). Doing so is somewhat analogous to the practice, common in optimization, of using the squared gradient norm \(\Vert \nabla f(x)\Vert ^2\) as a measure of quality of an approximate minimizer of a differentiable function f. However, note that since \(\mathscr {T}(q_1)\) is a set, there may exist elements of \(\mathscr {T}(q_1)\) with smaller norm than \(v_1\); thus any given certificate \(v_1\) only yields an upper bound on \({{\,\textrm{dist}\,}}^2(0,\mathscr {T}(q_1))\).

1.2 Appendix A.2: Approximation residual for projective splitting

In SPS (Algorithm 1), for \(i\in 1..n\), the pairs \((x_i^k,y_i^k)\) are chosen so that \(y_i^k\in A_i(x_i^k)\). This can be seen from the definition of the resolvent. Thus \(x_i^k\in A_i^{-1}(y_i^k)\). Observe that

$$\begin{aligned} v^k\doteq \left[ \begin{array}{c} x_1^k - z^k\\ \vdots \\ x_n^k - z^k\\ B(z^k) + \sum _{i=1}^ny_i^k \end{array} \right] \in \mathscr {T}(y_1^k,\ldots ,y_n^k,z^k). \end{aligned}$$
(63)

The approximation residual for SPS is thus

$$\begin{aligned} R_k&\doteq \Vert v^k\Vert ^2 = \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + \big \Vert B(z^k) + \sum _{i=1}^ny_i^k \big \Vert ^2 \end{aligned}$$
(64)

which is an approximation residual for \((y_1^k,\ldots ,y_n^k,z^k)\) in the sense defined above. We may relate \(R_k\) to the approximation residual \({\mathcal {G}}_k\) for SPS from Sect. 5.2 as follows:

$$\begin{aligned} R_k&= \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 +\left\| B(z^k) +\sum _{i=1}^ny_i^k\right\| ^2\\&= \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 +\left\| B(z^k) +\sum _{i=1}^ny_i^k - \sum _{i=1}^{n+1}w_i^k\right\| ^2\\&\le \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + 2\Vert B(z^k) - w_{n+1}^k\Vert ^2 + 2\left\| \sum _{i=1}^n(y_i^k - w_i^k)\right\| ^2\\&\le \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + 2\Vert B(z^k) - w_{n+1}^k\Vert ^2 + 2n\sum _{i=1}^n\left\| y_i^k - w_i^k\right\| ^2\\&\le 2n {\mathcal {G}}_k \end{aligned}$$

where in the second equality we have used the fact that \(\sum _{i=1}^{n+1}w_i^k = 0\). Thus, \(R_k\) has the same convergence rate as \({\mathcal {G}}_k\) given in Theorem 2.
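In practice, (64) can be evaluated directly from the stored SPS iterates; the following is a minimal sketch, assuming the iterates are NumPy arrays and a callable B implements the deterministic operator (names are illustrative).

```python
import numpy as np

def sps_residual(z_k, x_list, y_list, B):
    """Approximation residual R_k in (64) for the SPS iterates.

    x_list[i], y_list[i] hold the pairs (x_i^k, y_i^k) with y_i^k in A_i(x_i^k).
    """
    primal_terms = sum(np.linalg.norm(z_k - x_i) ** 2 for x_i in x_list)
    dual_term = np.linalg.norm(B(z_k) + sum(y_list)) ** 2
    return primal_terms + dual_term
```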

Note that while the certificate given in (63) focuses on the primal iterate \(z^k\), it may be changed to focus on any \(x_i^k\) for \(i=1,\ldots ,n\), by using

$$\begin{aligned} v^k_i\doteq \left[ \begin{array}{c} x_1^k - x_i^k\\ \vdots \\ x_n^k - x_i^k\\ B(x_i^k) + \sum _{j=1}^ny_j^k \end{array} \right] \in \mathscr {T}(y_1^k,\ldots ,y_n^k,x_i^k). \end{aligned}$$

The approximation residual \(\Vert v^k_i\Vert ^2\) may also be shown to have the same rate as \({\mathcal {G}}_k\) by following similar derivations to those above for \(R_k\).

1.3 Appendix A.3: Tseng’s method

Tseng’s method [58] can be applied to (60), resulting in the following recursion with iterates \(q^k,{\bar{q}}^k \in \mathbb {R}^{(n+1)d}\):

$$\begin{aligned} {\bar{q}}^k&= J_{\alpha {\mathscr {A}}}(q^k - \alpha {\mathscr {B}}(q^k)) \end{aligned}$$
(65)
$$\begin{aligned} q^{k+1}&= {\bar{q}}^k + \alpha \left( {\mathscr {B}}(q^k) - {\mathscr {B}}({\bar{q}}^k)\right) , \end{aligned}$$
(66)

where \({\mathscr {A}}\) and \({\mathscr {B}}\) are defined in (61) and (62). The resolvent of \({\mathscr {A}}\) may be readily computed from the resolvents of the \(A_i\) using Moreau’s identity [5, Prop. 23.20].
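As a concrete illustration, \(J_{\alpha {\mathscr {A}}}\) can be assembled blockwise from the resolvents of the \(A_i\) via the inverse-resolvent (Moreau) identity \(J_{\alpha A^{-1}}(x) = x - \alpha J_{\alpha ^{-1}A}(x/\alpha )\). The sketch below assumes resolvents_A[i](x, gamma) returns \(J_{\gamma A_i}(x)\); the names are ours, not the authors’ code.

```python
def resolvent_A_product(w_list, z, resolvents_A, alpha):
    """Blockwise resolvent J_{alpha * script_A} for script_A in (61).

    The i-th block is J_{alpha A_i^{-1}}, computed from J_{gamma A_i} via the
    Moreau identity; the last block of script_A is the zero operator, so the
    z-block passes through unchanged.
    """
    new_w = [w - alpha * resolvents_A[i](w / alpha, 1.0 / alpha)
             for i, w in enumerate(w_list)]
    return new_w, z
```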

Analogous to SPS, Tseng’s method has an approximation residual, which in this case is an element of \(\mathscr {T}({\bar{q}}^k)\). In particular, using the general properties of resolvent operators as applied to \(J_{\alpha {\mathscr {A}}}\), we have

$$\begin{aligned} \frac{1}{\alpha }(q^k - {\bar{q}}^k) - {\mathscr {B}}(q^k) \in {\mathscr {A}}({\bar{q}}^k). \end{aligned}$$

Also, rearranging (66) produces

$$\begin{aligned} \frac{1}{\alpha }({\bar{q}}^k - q^{k+1}) + {\mathscr {B}}(q^k) = {\mathscr {B}}({\bar{q}}^k). \end{aligned}$$

Adding these two relations produces

$$\begin{aligned} \frac{1}{\alpha }(q^k - q^{k+1}) \in {\mathscr {A}}({\bar{q}}^k) + {\mathscr {B}}({\bar{q}}^k) = \mathscr {T}({\bar{q}}^k). \end{aligned}$$

Therefore,

$$\begin{aligned} R^{\text {Tseng}}_k \doteq \frac{1}{\alpha ^2}\Vert q^k - q^{k+1}\Vert ^2 \end{aligned}$$

represents a measure of the approximation error for Tseng’s method equivalent to \(R_k\) defined in (64) for SPS.
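A minimal sketch of one iteration of (65)–(66) together with \(R^{\text {Tseng}}_k\) follows, assuming the product-space point q is stored as a single flat array, B_op applies \({\mathscr {B}}\), and resolvent_A(x, alpha) returns \(J_{\alpha {\mathscr {A}}}(x)\) (all names illustrative).

```python
import numpy as np

def tseng_step_with_residual(q, B_op, resolvent_A, alpha):
    """One Tseng iteration (65)-(66) plus the residual R_k^{Tseng}."""
    Bq = B_op(q)
    q_bar = resolvent_A(q - alpha * Bq, alpha)       # forward-backward step (65)
    q_next = q_bar + alpha * (Bq - B_op(q_bar))      # correction step (66)
    residual = np.linalg.norm(q - q_next) ** 2 / alpha ** 2
    return q_next, residual
```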

1.4 Appendix A.4: FRB

The forward-reflected-backward method (FRB) [44] is another method that may be applied to the splitting \(\mathscr {T}= {\mathscr {A}}+ {\mathscr {B}}\) for \({\mathscr {A}}\) and \({\mathscr {B}}\) as defined in (61) and (62). Doing so yields recursion

$$\begin{aligned} q^{k+1} = J_{\alpha {\mathscr {A}}} \!\Big (q^k - \alpha \big (2{\mathscr {B}}(q^k) - {\mathscr {B}}(q^{k-1})\big ) \Big ). \end{aligned}$$

Following similar arguments to those for Tseng’s method, it can be shown that

$$\begin{aligned} v_{\text {FRB}}^k \doteq \frac{1}{\alpha } \left( q^{k-1} -q^k \right) + {\mathscr {B}}(q^k) + {\mathscr {B}}(q^{k-2}) - 2{\mathscr {B}}(q^{k-1}) \in \mathscr {T}(q^k). \end{aligned}$$

Thus, FRB admits the following approximation residual equivalent to \(R_k\) for SPS:

$$\begin{aligned} R^{\text {FRB}}_k\doteq \Vert v_{\text {FRB}}^k\Vert ^2. \end{aligned}$$
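The corresponding sketch for FRB keeps the two previous iterates so that both the update and \(R^{\text {FRB}}_k\) can be evaluated; again the names are illustrative and the conventions (flat array q, callables B_op and resolvent_A) are our assumptions.

```python
import numpy as np

def frb_step(q_k, q_km1, B_op, resolvent_A, alpha):
    """FRB update: q^{k+1} = J_{alpha A}(q^k - alpha(2 B(q^k) - B(q^{k-1})))."""
    return resolvent_A(q_k - alpha * (2.0 * B_op(q_k) - B_op(q_km1)), alpha)

def frb_residual(q_k, q_km1, q_km2, B_op, alpha):
    """R_k^{FRB} = ||(q^{k-1}-q^k)/alpha + B(q^k) + B(q^{k-2}) - 2 B(q^{k-1})||^2."""
    v = (q_km1 - q_k) / alpha + B_op(q_k) + B_op(q_km2) - 2.0 * B_op(q_km1)
    return np.linalg.norm(v) ** 2
```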

Finally, we remark that the stepsizes used in both the Tseng and FRB methods can be chosen via a linesearch procedure that we do not detail here.

1.5 Appendix A.5: Stochastic Tseng Method

The stochastic version of Tseng’s method of [7] (S-Tseng) may be applied to the inclusion \(0\in {\mathscr {A}}(q)+{\mathscr {B}}(q)\), since the operator \({\mathscr {A}}\) may be written as a subdifferential. However, unlike the deterministic Tseng method, it does not produce a valid residual. Note also that S-Tseng outputs an ergodic sequence \(q_{\text {erg}}^k\). To construct a residual for the ergodic sequence, we compute a deterministic step of Tseng’s method according to (65)-(66), starting at \(q_{\text {erg}}^k\). That is, letting

$$\begin{aligned} {\bar{q}}^k&= J_{\alpha {\mathscr {A}}}(q_{\text {erg}}^k - \alpha {\mathscr {B}}(q_{\text {erg}}^k))\\ q^{k+1}&= {\bar{q}}^k + \alpha ({\mathscr {B}}(q_{\text {erg}}^k) - {\mathscr {B}}({\bar{q}}^k)), \end{aligned}$$

we can then compute essentially the same residual as for Tseng’s method in Appendix A.3,

$$\begin{aligned} R^{\text {S-Tseng}}_k \doteq \frac{1}{\alpha ^2}\Vert q_{\text {erg}}^k - q^{k+1}\Vert ^2. \end{aligned}$$

To construct the stochastic oracle for S-Tseng, we assumed \(B(z)=\frac{1}{m}\sum _{i=1}^m B_i(z)\). Then we used

$$\begin{aligned} {\tilde{{\mathscr {B}}}}(w_1,\ldots ,w_n,z)\mapsto \left[ \begin{array}{cccc} 0 &{} \cdots &{} 0 &{} -I\\ \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} -I\\ I &{} \cdots &{} I &{} 0 \end{array} \right] \left[ \begin{array}{c} w_1\\ \vdots \\ w_n\\ z \end{array} \right] + \left[ \begin{array}{c} 0\\ \vdots \\ 0\\ \frac{1}{|{\textbf{B}}|}\sum _{j\in {\textbf{B}}}B_j(z) \end{array} \right] . \end{aligned}$$
(67)

for some minibatch \({\textbf{B}}\subseteq \{1,\ldots ,m\}\).
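The mini-batch oracle (67) only perturbs the last block of (62); a minimal sketch, assuming B_components[j] implements \(B_j\) and the mini-batch is drawn uniformly without replacement (names illustrative):

```python
import numpy as np

def stochastic_oracle_B(w_list, z, B_components, batch_size, rng):
    """Mini-batch version (67) of the product-space operator (62)."""
    batch = rng.choice(len(B_components), size=batch_size, replace=False)
    B_tilde = sum(B_components[j](z) for j in batch) / batch_size
    top = [-z.copy() for _ in w_list]      # same deterministic top blocks as (62)
    bottom = sum(w_list) + B_tilde         # stochastic estimate in the last block
    return top, bottom
```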

1.6 Appendix A.6: Variance-reduced FRB

The FRB-VR method of [1] can also be applied to \(0\in {\mathscr {A}}(q)+{\mathscr {B}}(q)\), using the same stochastic oracle \({\tilde{{\mathscr {B}}}}\) defined in (67). If we let the iterates of FRB-VR be \((q^k,p^k)\), then line 4 of Algorithm 1 of [1] can be written as

$$\begin{aligned} {\hat{q}}^k&= q^k - \tau ({\mathscr {B}}(p^k) + {\tilde{{\mathscr {B}}}}(q^k) - {\tilde{{\mathscr {B}}}}(p^k)) \end{aligned}$$
(68)
$$\begin{aligned} q^{k+1}&= J_{\tau {\mathscr {A}}}({\hat{q}}^k). \end{aligned}$$
(69)

Once again, the method does not directly produce a residual, but one can be developed from the algorithm definition as follows: (69) yields \(\tau ^{-1}({\hat{q}}^k - q^{k+1}) \in {\mathscr {A}}(q^{k+1})\) and hence

$$\begin{aligned} \tau ^{-1}({\hat{q}}^k - q^{k+1})+{\mathscr {B}}(q^{k+1})\in ({\mathscr {A}}+{\mathscr {B}})(q^{k+1}). \end{aligned}$$

Therefore we use the residual

$$\begin{aligned} R_k^{\text {FRB-VR}} = \Vert \tau ^{-1}({\hat{q}}^k - q^{k+1})+{\mathscr {B}}(q^{k+1})\Vert ^2. \end{aligned}$$
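In code, this residual is computed directly from \({\hat{q}}^k\) and \(q^{k+1}\) after the resolvent step (69); a sketch with illustrative names:

```python
import numpy as np

def frb_vr_residual(q_hat, q_next, B_op, tau):
    """R_k^{FRB-VR} = ||(q_hat^k - q^{k+1})/tau + B(q^{k+1})||^2, from (69)."""
    v = (q_hat - q_next) / tau + B_op(q_next)
    return np.linalg.norm(v) ** 2
```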

Figure 1 plots \(R_k\) for SPS, \(R^{\text {Tseng}}_k\) for Tseng’s method, \(R^{\text {FRB}}_k\) for FRB, \(R^\text {S-Tseng}_k\) for S-Tseng, and \(R^\text {FRB-VR}_k\) for FRB-VR.

Appendix B: Additional information about the numerical experiments

We now show how we converted Problem (59) to the form (1) for our experiments. Let z be a shorthand for \((\lambda ,\beta ,\gamma )\), and define

$$\begin{aligned} {\mathcal {L}}(z)\doteq \lambda (\delta - \kappa ) + \frac{1}{m}\sum _{i=1}^m\Psi (\langle {\hat{x}}_i,\beta \rangle ) + \frac{1}{m} \sum _{i=1}^m \gamma _i( {\hat{y}}_i\langle {\hat{x}}_i,\beta \rangle - \lambda \kappa ). \end{aligned}$$

The first-order necessary and sufficient conditions for the convex–concave saddle-point problem in (59) are

$$\begin{aligned} 0 \in B(z) + A_1(z) + A_2(z) \end{aligned}$$
(70)

where the vector field B(z) is defined as

$$\begin{aligned} B(z) \doteq \left[ \begin{array}{c} \nabla _{\lambda ,\beta } {\mathcal {L}}(z)\\ -\nabla _{\gamma } {\mathcal {L}}(z) \end{array} \right] , \end{aligned}$$
(71)

with

$$\begin{aligned} \nabla _{\lambda ,\beta } {\mathcal {L}}(z) = \left[ \begin{array}{c} \delta - \kappa (1+\frac{1}{m}\sum _{i=1}^m\gamma _i)\\ \frac{1}{m}\sum _{i=1}^m\Psi '(\langle {\hat{x}}_{i},\beta \rangle ){\hat{x}}_i +\frac{1}{m}\sum _{i=1}^m\gamma _i{\hat{y}}_i{\hat{x}}_i \end{array} \right] \end{aligned}$$

and

$$\begin{aligned} \nabla _\gamma {\mathcal {L}}(z) = \left[ \begin{array}{c} \frac{1}{m}({\hat{y}}_1\langle {\hat{x}}_{1},\beta \rangle -\lambda \kappa ) \\ \vdots \\ \frac{1}{m}({\hat{y}}_m\langle {\hat{x}}_{m},\beta \rangle -\lambda \kappa ) \end{array} \right] . \end{aligned}$$

It is readily confirmed that B defined in this manner is Lipschitz. The monotonicity of B follows from its being the generalized gradient of a convex–concave saddle function [54]. The set-valued operators \(A_1\) and \(A_2\) correspond to the constraints and the nonsmooth \(\ell _1\) regularizer, respectively, and are defined as

$$\begin{aligned} A_1(z) \doteq N_{\mathcal {C}_1}(\lambda ,\beta )\times N_{\mathcal {C}_2}(\gamma ), \end{aligned}$$

where

$$\begin{aligned} \mathcal {C}_1 \doteq \big \{ (\lambda ,\beta ): \Vert \beta \Vert _2\le \lambda /(L_\Psi +1) \big \} \quad \text {and} \quad \mathcal {C}_2\doteq \{\gamma : \Vert \gamma \Vert _\infty \le 1 \}, \end{aligned}$$

and

$$\begin{aligned} A_2(z) \doteq \{{\textbf{0}}_{1\times 1}\}\times c\partial \Vert \beta \Vert _1 \times \{{\textbf{0}}_{m\times 1}\}. \end{aligned}$$

Here, the notation \({\textbf{0}}_{p\times 1}\) denotes the p-dimensional vector of all zeros. \(\mathcal {C}_1\) is a scaled version of the second-order cone, well known to be a closed convex set, while \(\mathcal {C}_2\) is the unit ball of the \(\ell _\infty \) norm, also closed and convex. Since \(A_1\) is a normal cone map of a closed convex set and \(A_2\) is the subgradient map of a closed proper convex function (the scaled 1-norm), both of these operators are maximal monotone and problem (70) is a special case of (1) for \(n=2\).

Stochastic oracle implementation. The operator \(B:\mathbb {R}^{m+d+1}\rightarrow \mathbb {R}^{m+d+1}\), defined in (71), can be written as

$$\begin{aligned} B(z) = \frac{1}{m}\sum _{i=1}^m B_i(z) \end{aligned}$$

where

$$\begin{aligned} B_i(z) \doteq \left[ \begin{array}{c} \delta - \kappa (1+\gamma _i)\\ \Psi '(\langle {\hat{x}}_{i},\beta \rangle ){\hat{x}}_i +\gamma _i{\hat{y}}_i{\hat{x}}_i \\ {\textbf{0}}_{(i-1)\times 1} \\ -({\hat{y}}_i\langle {\hat{x}}_{i},\beta \rangle -\lambda \kappa ) \\ {\textbf{0}}_{(m - i)\times 1} \end{array} \right] . \end{aligned}$$

In our SPS experiments, the stochastic oracle for B is simply \({\tilde{B}}(z) = \frac{1}{|{\textbf{B}}|}\sum _{i\in {\textbf{B}}} B_i(z)\) for some minibatch \({\textbf{B}}\subseteq \{1,\ldots ,m\}\). We used a batch size of 100.
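To make the component structure explicit, the sketch below evaluates a single \(B_i(z)\) with \(z=(\lambda ,\beta ,\gamma )\) packed as one vector of length \(1+d+m\), and averages over a mini-batch. Here psi_prime denotes \(\Psi '\); the function and argument names are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def B_component(z, i, x_hat, y_hat, d, delta, kappa, psi_prime):
    """Evaluate B_i(z) for z = (lambda, beta, gamma) stacked in R^{1+d+m}."""
    lam, beta, gamma = z[0], z[1:1 + d], z[1 + d:]
    inner = x_hat[i] @ beta
    out = np.zeros_like(z)
    out[0] = delta - kappa * (1.0 + gamma[i])                        # lambda block
    out[1:1 + d] = psi_prime(inner) * x_hat[i] + gamma[i] * y_hat[i] * x_hat[i]
    out[1 + d + i] = -(y_hat[i] * inner - lam * kappa)               # i-th gamma entry
    return out

def oracle_B_tilde(z, batch, x_hat, y_hat, d, delta, kappa, psi_prime):
    """Mini-batch oracle: average of B_i over the sampled indices."""
    return sum(B_component(z, i, x_hat, y_hat, d, delta, kappa, psi_prime)
               for i in batch) / len(batch)
```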

Resolvent computations. The resolvent of \(A_1\) is readily constructed from the projection maps of the simple sets \(\mathcal {C}_1\) and \(\mathcal {C}_2\), while the resolvent of \(A_2\) involves the proximal operator of the \(\ell _1\) norm. Specifically,

$$\begin{aligned} J_{ \rho A_1}(z) = \left[ \begin{array}{c} \text {proj}_{\mathcal {C}_1}\!(\lambda ,\beta )\\ \text {proj}_{\mathcal {C}_2}\!(\gamma ) \end{array} \right] \quad \text {and} \quad J_{\rho A_2}(z) = \left[ \begin{array}{c} {\textbf{0}}_{1\times 1}\\ \text {prox}_{\rho c\Vert \cdot \Vert _1}\!(\beta )\\ {\textbf{0}}_{m\times 1} \end{array} \right] . \end{aligned}$$

The constraint \(\mathcal {C}_1\) is a scaled second-order cone and \(\mathcal {C}_2\) is the \(\ell _\infty \) ball, both of which have closed-form projections. The proximal operator of the \(\ell _1\) norm is the well-known soft-thresholding operator [51, Sec. 6.5.2]. Therefore all resolvents in the formulation may be computed quickly and accurately.
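These resolvents admit short closed-form implementations. The sketch below uses the standard second-order-cone projection formula for the scaled cone \(\mathcal {C}_1\), clipping for the \(\ell _\infty \) ball, and soft-thresholding for the \(\ell _1\) proximal operator; it is illustrative code under our naming conventions, not the authors’ implementation.

```python
import numpy as np

def proj_C2(gamma):
    """Projection onto C_2 = {gamma : ||gamma||_inf <= 1}."""
    return np.clip(gamma, -1.0, 1.0)

def prox_l1(beta, t):
    """prox_{t ||.||_1}: componentwise soft-thresholding [51, Sec. 6.5.2]."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def proj_C1(lam, beta, s):
    """Projection onto C_1 = {(lam, beta): ||beta||_2 <= lam / s}, s = L_Psi + 1."""
    a = 1.0 / s
    nb = np.linalg.norm(beta)
    if nb <= a * lam:                    # already inside the cone
        return lam, beta
    if a * nb <= -lam:                   # in the polar cone: project onto the origin
        return 0.0, np.zeros_like(beta)
    t = (a * nb + lam) / (a * a + 1.0)   # otherwise project onto the cone boundary
    return t, (a * t / nb) * beta
```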

SPS stepsize choices. For the stepsize in SPS, we ordinarily require \(\rho _k \le \overline{\rho }< 1/L\) for the global Lipschitz constant L of B. However, since the global Lipschitz constant may be pessimistic, better performance can often be achieved by experimenting with larger stepsizes; if divergence is observed, the stepsize can then be decreased. This type of strategy is common for SGD and similar stochastic methods. For SPS-decay we set \(\alpha _k = C_d k^{-0.51} \) and \( \rho _k = C_d k^{-0.25}, \) and performed a grid search to select the best \(C_d\) from \(\{0.1,0.5,1,5,10\}\), arriving at \(C_d=1\) for epsilon and SUSY, and \(C_d=0.5\) for real-sim. For SPS-fixed we used \(\rho = K^{-1/4}\) and \(\alpha = C_f\rho ^2\), and performed a grid search to select \(C_f\) over \(\{0.1,0.5,1,5,10\}\), arriving at \(C_f=1\) for epsilon and real-sim, and \(C_f=5\) for SUSY. The total number of iterations K for SPS-fixed was \(K=5000\) for the epsilon dataset, \(K=200\) for SUSY, and \(K=1000\) for real-sim.
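The two stepsize schedules above amount to the following simple rules, with the constants \(C_d\) and \(C_f\) taken from the grid searches (a sketch):

```python
def sps_decay_stepsizes(k, C_d):
    """SPS-decay: alpha_k = C_d * k^{-0.51}, rho_k = C_d * k^{-0.25}."""
    return C_d * k ** (-0.51), C_d * k ** (-0.25)

def sps_fixed_stepsizes(K, C_f):
    """SPS-fixed: rho = K^{-1/4}, alpha = C_f * rho^2, held constant for K iterations."""
    rho = K ** (-0.25)
    return C_f * rho ** 2, rho
```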

Parameter choices for the other algorithms. All methods were initialized at the same random point. For Tseng’s method, we used the backtracking linesearch variant with an initial stepsize of 1, \(\theta =0.8\), and a stepsize reduction factor of 0.7. For FRB, we used the backtracking linesearch variant with the same settings as for Tseng’s method. For deterministic PS, we used a fixed stepsize of 0.9/L. For the stochastic Tseng’s method of [7], the stepsize \(\alpha _k\) must satisfy \(\sum _{k=1}^\infty \alpha _k=\infty \) and \(\sum _{k=1}^\infty \alpha _k^2<\infty \). We therefore set \(\alpha _k=C k ^{-d}\) and performed a grid search over \((C,d)\) in the range \([10^{-4},10]\times [0.51,1]\), checking a \(5\times 5\) grid of values to find the best setting for each of the three problems. The selected values are in Table 1.

Table 1 Parameter Values for S-Tseng

The work of [7] also introduced FBFp, a stochastic version of Tseng’s method that reuses a previously computed gradient and therefore needs only one additional gradient calculation per iteration. In our experiments, the performance of the two methods was about the same, so we report only the results for the stochastic Tseng method.

For variance-reduced FRB, the main parameter is the probability p. We hand-tuned p, arriving at \(p=0.01\) for all problems. We set the stepsize to its maximum allowed value of

$$\begin{aligned} \tau = \frac{1-\sqrt{1-p}}{2L}. \end{aligned}$$

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnstone, P.R., Eckstein, J., Flynn, T. et al. Stochastic projective splitting. Comput Optim Appl 87, 397–437 (2024). https://doi.org/10.1007/s10589-023-00528-6
