Stochastic projective splitting

A Correction to this article was published on 27 October 2023

Abstract

We present a new, stochastic variant of the projective splitting (PS) family of algorithms for inclusion problems involving the sum of any finite number of maximal monotone operators. This new variant uses a stochastic oracle to evaluate one of the operators, which is assumed to be Lipschitz continuous, and (deterministic) resolvents to process the remaining operators. Our proposal is the first version of PS with such stochastic capabilities. We envision the primary application being machine learning (ML) problems, with the method’s stochastic features facilitating “mini-batch” sampling of datasets. Since it uses a monotone operator formulation, the method can handle not only Lipschitz-smooth loss minimization, but also min–max and noncooperative game formulations, with better convergence properties than the gradient descent-ascent methods commonly applied in such settings. The proposed method can handle any number of constraints and nonsmooth regularizers via projection and proximal operators. We prove almost-sure convergence of the iterates to a solution and a convergence rate result for the expected residual, and close with numerical experiments on a distributionally robust sparse logistic regression problem.

Data Availability

The data analyzed during the current study are from the public LIBSVM repository available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Notes

  1. Original data source http://largescale.ml.tu-berlin.de/instructions/.

  2. Original data source https://people.cs.umass.edu/~mccallum/data.html.

References

  1. Alacaoglu, A., Malitsky, Y., Cevher, V.: Forward-reflected-backward method with variance reduction. Comput. Optim. Appl. (2021)

  2. Alotaibi, A., Combettes, P.L., Shahzad, N.: Solving coupled composite monotone inclusions by successive Fejér approximations of their Kuhn–Tucker set. SIAM J. Optim. 24(4), 2076–2095 (2014)

  3. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)

  4. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of \(n\)-player differentiable games. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 354–363. PMLR (2018)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  6. Boţ, R.I., Mertikopoulos, P., Staudigl, M., Vuong, P.T.: Minibatch forward-backward-forward methods for solving stochastic variational inequalities. Stoch. Syst. 11(2), 112–139 (2021)

  7. Böhm, A., Sedlmayer, M., Csetnek, E.R., Boţ, R.I.: Two steps at a time—taking GAN training in stride with Tseng’s method. arXiv preprint arXiv:2006.09033 (2020)

  8. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, pp. 177–186. Springer, Berlin (2010)

  9. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  10. Briceño-Arias, L.M., Combettes, P.L.: A monotone+skew splitting model for composite monotone inclusions in duality. SIAM J. Optim. 21(4), 1230–1250 (2011)

  11. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  12. Celis, L.E., Keswani, V.: Improved adversarial learning for fair classification. arXiv preprint arXiv:1901.10443 (2019)

  13. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  14. Chavdarova, T., Pagliardini, M., Stich, S.U., Fleuret, F., Jaggi, M.: Taming GANs with lookahead-minmax. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=ZW0yXJyNmoG

  15. Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168(1–2), 645–672 (2018)

  16. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H., Burachik, R., Combettes, P., Elser, V., Luke, D., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, Berlin (2011)

  17. Combettes, P.L., Pesquet, J.C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued Var. Anal. 20(2), 307–330 (2012)

  18. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  19. Daskalakis, C., Ilyas, A., Syrgkanis, V., Zeng, H.: Training GANs with optimism. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=SJJySbbAZ

  20. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25(4), 829–858 (2017)

  21. Diakonikolas, J.: Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. In: Conference on Learning Theory, pp. 1428–1451. PMLR (2020)

  22. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml

  23. Eckstein, J.: A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block alternating direction method of multipliers. J. Optim. Theory Appl. 173(1), 155–182 (2017)

  24. Eckstein, J., Svaiter, B.F.: A family of projective splitting methods for the sum of two maximal monotone operators. Math. Program. 111(1), 173–199 (2008)

  25. Eckstein, J., Svaiter, B.F.: General projective splitting methods for sums of maximal monotone operators. SIAM J. Control. Optim. 48(2), 787–811 (2009)

  26. Edwards, H., Storkey, A.: Censoring representations with an adversary. arXiv preprint arXiv:1511.05897 (2015)

  27. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Solution of Boundary Value Problems, chap. IX, pp. 299–340. North-Holland, Amsterdam (1983)

  28. Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality perspective on generative adversarial networks. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=r1laEnA5Ym

  29. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates (2014)

  30. Grnarova, P., Kilcher, Y., Levy, K.Y., Lucchi, A., Hofmann, T.: Generative minimization networks: training GANs without competition. arXiv preprint arXiv:2103.12685 (2021)

  31. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: On the convergence of single-call stochastic extra-gradient methods. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates (2019)

  32. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 16223–16234. Curran Associates (2020)

  33. Huang, C., Kairouz, P., Chen, X., Sankar, L., Rajagopal, R.: Context-aware generative adversarial privacy. Entropy 19(12), 656 (2017)

  34. Johnstone, P.R., Eckstein, J.: Convergence rates for projective splitting. SIAM J. Optim. 29(3), 1931–1957 (2019)

  35. Johnstone, P.R., Eckstein, J.: Single-forward-step projective splitting: exploiting cocoercivity. arXiv preprint arXiv:1902.09025 (2019)

  36. Johnstone, P.R., Eckstein, J.: Projective splitting with forward steps only requires continuity. Optim. Lett. 14(1), 229–247 (2020)

  37. Johnstone, P.R., Eckstein, J.: Single-forward-step projective splitting: exploiting cocoercivity. Comput. Optim. Appl. 78(1), 125–166 (2021)

  38. Johnstone, P.R., Eckstein, J.: Projective splitting with forward steps. Math. Program. 191(2), 631–670 (2022)

  39. Korpelevich, G.: Extragradient method for finding saddle points and other problems. Matekon 13(4), 35–49 (1977)

  40. Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Netessine, S. (ed.) Operations Research & Management Science in the Age of Analytics, Tutorials in Operations Research, pp. 130–166. INFORMS (2019)

  41. Li, C.J., Yu, Y., Loizou, N., Gidel, G., Ma, Y., Roux, N.L., Jordan, M.I.: On the convergence of stochastic extragradient for bilinear games with restarted iteration averaging. arXiv preprint arXiv:2107.00464 (2021)

  42. Lin, T., Jin, C., Jordan, M.: On gradient descent ascent for nonconvex-concave minimax problems. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, pp. 6083–6093. PMLR (2020)

  43. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

  44. Malitsky, Y., Tam, M.K.: A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30(2), 1451–1472 (2020)

  45. Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for GANs do actually converge? In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 3481–3490. PMLR (2018)

  46. Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates (2017)

  47. Monteiro, R.D., Svaiter, B.F.: On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM J. Optim. 20(6), 2755–2787 (2010)

  48. Nagarajan, V., Kolter, J.Z.: Gradient descent GAN optimization is locally stable. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates (2017)

  49. Namkoong, H., Duchi, J.C.: Stochastic gradient methods for distributionally robust optimization with \(f\)-divergences. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates (2016)

  50. Nemirovski, A.: Prox-method with rate of convergence O\((1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  51. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  52. Pedregosa, F., Fatras, K., Casotto, M.: Proximal splitting meets variance reduction. In: Chaudhuri, K., Sugiyama, M. (eds.) Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 89, pp. 1–10. PMLR (2019)

  53. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  54. Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. Nonlinear Funct. Anal. 18(part 1), 397–407 (1970)

  55. Ryu, E.K., Boyd, S.: Primer on monotone operator methods. Appl. Comput. Math 15(1), 3–43 (2016)

  56. Shafieezadeh-Abadeh, S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1576–1584. Curran Associates (2015)

  57. Sinha, A., Namkoong, H., Duchi, J.: Certifying some distributional robustness with principled adversarial training. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk6kPgZA-

  58. Tseng, P.: A modified forward–backward splitting method for maximal monotone mappings. SIAM J. Control. Optim. 38(2), 431–446 (2000)

  59. Van Dung, N., Vu, B.C.: Convergence analysis of the stochastic reflected forward–backward splitting algorithm. arXiv preprint arXiv:2102.08906 (2021)

  60. Wadsworth, C., Vera, F., Piech, C.: Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199 (2018)

  61. Yu, Y., Lin, T., Mazumdar, E., Jordan, M.I.: Fast distributionally robust learning with variance reduced min-max optimization. arXiv preprint arXiv:2104.13326 (2021)

  62. Yurtsever, A., Vu, B.C., Cevher, V.: Stochastic three-composite convex minimization. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates (2016)

  63. Zhang, B.H., Lemoine, B., Mitchell, M.: Mitigating unwanted biases with adversarial learning. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340 (2018)

Author information

Corresponding author

Correspondence to Patrick R. Johnstone.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare with regard to the current study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Approximation residuals

In this section we derive the approximation residual used to assess the performance of the algorithms in the numerical experiments. This residual relies on the following product-space reformulation of (1).

1.1 Appendix A.1: Product-space reformulation and residual principle

Recall (1), the monotone inclusion we are solving:

$$\begin{aligned} \text {Find}\,z\in \mathbb {R}^d: 0 \in \sum _{i=1}^nA_i(z) + B(z). \end{aligned}$$

In this section we present a “product-space” reformulation of (1) that rewrites it in a standard form involving just two operators, one maximal monotone and the other monotone and Lipschitz. This approach was pioneered in [10, 17]. Besides yielding a simple approximation residual measuring the error in solving (1), the reformulation allows one to apply operator splitting methods originally formulated for two operators to problems such as (1) for any finite n.

Observe that solving (1) is equivalent to

$$\begin{aligned} \text {Find}\; (w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}: \quad w_i&\in A_i (z),\quad i\in 1..n\\ 0&\in \sum _{i=1}^nw_i + B(z). \end{aligned}$$

This formulation resembles that of the extended solution set \(\mathcal {S}\) used in projective splitting, as given in (2), except that it combines the final two conditions in the definition of \(\mathcal {S}\), and thus does not need the final dual variable \(w_{n+1}\). From the definition of the inverse of an operator, the above formulation is equivalent to

$$\begin{aligned} \text {Find}\, (w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}: \quad 0&\in A_i^{-1}(w_i) - z,\quad i\in 1..n\\ 0&\in \sum _{i=1}^nw_i + B(z). \end{aligned}$$

These conditions are in turn equivalent to finding \((w_1,\ldots ,w_n,z)\in \mathbb {R}^{(n+1)d}\) such that

$$\begin{aligned} 0\in {\mathscr {A}}(w_1,\ldots ,w_n,z) + {\mathscr {B}}(w_1,\ldots ,w_n,z), \end{aligned}$$
(60)

where \({\mathscr {A}}\) is the set-valued map

$$\begin{aligned} {\mathscr {A}}(w_1,\ldots ,w_n,z)\mapsto A_1^{-1}(w_1)\times A_2^{-1}(w_2)\times \ldots \times A_n^{-1}(w_n)\times \{0\} \end{aligned}$$
(61)

and \({\mathscr {B}}\) is the single-valued operator

$$\begin{aligned} {\mathscr {B}}(w_1,\ldots ,w_n,z)\mapsto \left[ \begin{array}{cccc} 0 &{} \cdots &{} 0 &{} -I\\ \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} -I\\ I &{} \cdots &{} I &{} 0 \end{array} \right] \left[ \begin{array}{c} w_1\\ \vdots \\ w_n\\ z \end{array} \right] + \left[ \begin{array}{c} 0\\ \vdots \\ 0\\ B(z) \end{array} \right] . \end{aligned}$$
(62)

It is easily established that \({\mathscr {B}}\) is maximal monotone and Lipschitz continuous, while \({\mathscr {A}}\) is maximal monotone. Letting \( \mathscr {T}\doteq {\mathscr {A}}+ {\mathscr {B}}, \) it follows from [5, Prop. 20.23] that \(\mathscr {T}\) is maximal monotone. Thus, we have reformulated (1) as the monotone inclusion \(0\in \mathscr {T}(q)\) for q in the product space \(\mathbb {R}^{(n+1)d}\). A vector \(z\in \mathbb {R}^d\) solves (1) if and only if there exists \((w_1,\ldots ,w_n)\in \mathbb {R}^{nd}\) such that \(0\in \mathscr {T}(q)\), where \(q=(w_1,\ldots ,w_n,z)\).
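For concreteness, the following Python sketch applies the single-valued operator \({\mathscr {B}}\) in (62) without forming the block matrix explicitly. It is purely illustrative (function and argument names are ours, not the authors’); it assumes the blocks are stored as NumPy arrays and that a callable B implements the Lipschitz operator B.

```python
import numpy as np

def apply_B_product(w_list, z, B):
    """Apply the single-valued operator (62) to (w_1, ..., w_n, z).

    w_list : list of n arrays of shape (d,), the dual blocks w_i
    z      : array of shape (d,), the primal block
    B      : callable implementing the Lipschitz operator B
    Returns the n+1 output blocks as (list of n arrays, array).
    """
    top = [-z.copy() for _ in w_list]   # first n block rows each equal -z
    bottom = sum(w_list) + B(z)         # last block row: sum_i w_i + B(z)
    return top, bottom
```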

For any pair \((q,v)\) such that \(v\in \mathscr {T}(q)\), \(\Vert v\Vert ^2\) represents an approximation residual for q in the sense that it can only be 0 if q is a solution to (60); one may therefore take \(\Vert v\Vert ^2\) as a measure of the error of q as an approximate solution to (60). Given two approximate solutions \(q_1\) and \(q_2\) with certificates \(v_1\in \mathscr {T}(q_1)\) and \(v_2\in \mathscr {T}(q_2)\), we will treat \(q_1\) as a “better” approximate solution than \(q_2\) if \(\Vert v_1\Vert ^2<\Vert v_2\Vert ^2\). Doing so is somewhat analogous to the practice, common in optimization, of using the squared gradient norm \(\Vert \nabla f(x)\Vert ^2\) as a measure of quality of an approximate minimizer of a differentiable function f. However, note that since \(\mathscr {T}(q_1)\) is a set, there may exist elements of \(\mathscr {T}(q_1)\) with smaller norm than \(v_1\); thus any given certificate \(v_1\) only yields an upper bound on \({{\,\textrm{dist}\,}}^2(0,\mathscr {T}(q_1))\).

1.2 Appendix A.2: Approximation residual for projective splitting

In SPS (Algorithm 1), for \(i\in 1..n\), the pairs \((x_i^k,y_i^k)\) are chosen so that \(y_i^k\in A_i(x_i^k)\). This can be seen from the definition of the resolvent. Thus \(x_i^k\in A_i^{-1}(y_i^k)\). Observe that

$$\begin{aligned} v^k\doteq \left[ \begin{array}{c} x_1^k - z^k\\ \vdots \\ x_n^k - z^k\\ B(z^k) + \sum _{i=1}^ny_i^k \end{array} \right] \in \mathscr {T}(y_1^k,\ldots ,y_n^k,z^k). \end{aligned}$$
(63)

The approximation residual for SPS is thus

$$\begin{aligned} R_k&\doteq \Vert v^k\Vert ^2 = \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + \big \Vert B(z^k) + \sum _{i=1}^ny_i^k \big \Vert ^2 \end{aligned}$$
(64)

which is an approximation residual for \((y_1^k,\ldots ,y_n^k,z^k)\) in the sense defined above. We may relate \(R_k\) to the approximation residual \({\mathcal {G}}_k\) for SPS from Sect. 5.2 as follows:

$$\begin{aligned} R_k&= \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 +\left\| B(z^k) +\sum _{i=1}^ny_i^k\right\| ^2\\&= \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 +\left\| B(z^k) +\sum _{i=1}^ny_i^k - \sum _{i=1}^{n+1}w_i^k\right\| ^2\\&\le \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + 2\Vert B(z^k) - w_{n+1}^k\Vert ^2 + 2\left\| \sum _{i=1}^n(y_i^k - w_i^k)\right\| ^2\\&\le \sum _{i=1}^n\Vert z^k - x_i^k\Vert ^2 + 2\Vert B(z^k) - w_{n+1}^k\Vert ^2 + 2n\sum _{i=1}^n\left\| y_i^k - w_i^k\right\| ^2\\&\le 2n {\mathcal {G}}_k \end{aligned}$$

where in the second equality we have used the fact that \(\sum _{i=1}^{n+1}w_i^k = 0\). Thus, \(R_k\) has the same convergence rate as \({\mathcal {G}}_k\) given in Theorem 2.
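In practice, (64) can be evaluated directly from the stored SPS iterates; the following is a minimal sketch, assuming the iterates are NumPy arrays and a callable B implements the deterministic operator (names are illustrative).

```python
import numpy as np

def sps_residual(z_k, x_list, y_list, B):
    """Approximation residual R_k in (64) for the SPS iterates.

    x_list[i], y_list[i] hold the pairs (x_i^k, y_i^k) with y_i^k in A_i(x_i^k).
    """
    primal_terms = sum(np.linalg.norm(z_k - x_i) ** 2 for x_i in x_list)
    dual_term = np.linalg.norm(B(z_k) + sum(y_list)) ** 2
    return primal_terms + dual_term
```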

Note that while the certificate given in (63) focuses on the primal iterate \(z^k\), it may be changed to focus on any \(x_i^k\) for \(i=1,\ldots ,n\), by using

$$\begin{aligned} v^k_i\doteq \left[ \begin{array}{c} x_1^k - x_i^k\\ \vdots \\ x_n^k - x_i^k\\ B(x_i^k) + \sum _{j=1}^ny_j^k \end{array} \right] \in \mathscr {T}(y_1^k,\ldots ,y_n^k,x_i^k). \end{aligned}$$

The approximation residual \(\Vert v^k_i\Vert ^2\) may also be shown to have the same rate as \({\mathcal {G}}_k\) by following similar derivations to those above for \(R_k\).

1.3 Appendix A.3: Tseng’s method

Tseng’s method [58] can be applied to (60), resulting in the following recursion with iterates \(q^k,{\bar{q}}^k \in \mathbb {R}^{(n+1)d}\):

$$\begin{aligned} {\bar{q}}^k&= J_{\alpha {\mathscr {A}}}(q^k - \alpha {\mathscr {B}}(q^k)) \end{aligned}$$
(65)
$$\begin{aligned} q^{k+1}&= {\bar{q}}^k + \alpha \left( {\mathscr {B}}(q^k) - {\mathscr {B}}({\bar{q}}^k)\right) , \end{aligned}$$
(66)

where \({\mathscr {A}}\) and \({\mathscr {B}}\) are defined in (61) and (62). The resolvent of \({\mathscr {A}}\) may be readily computed from the resolvents of the \(A_i\) using Moreau’s identity [5, Prop. 23.20].
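As a concrete illustration, \(J_{\alpha {\mathscr {A}}}\) can be assembled blockwise from the resolvents of the \(A_i\) via the inverse-resolvent (Moreau) identity \(J_{\alpha A^{-1}}(x) = x - \alpha J_{\alpha ^{-1}A}(x/\alpha )\). The sketch below assumes resolvents_A[i](x, gamma) returns \(J_{\gamma A_i}(x)\); the names are ours, not the authors’ code.

```python
def resolvent_A_product(w_list, z, resolvents_A, alpha):
    """Blockwise resolvent J_{alpha * script_A} for script_A in (61).

    The i-th block is J_{alpha A_i^{-1}}, computed from J_{gamma A_i} via the
    Moreau identity; the last block of script_A is the zero operator, so the
    z-block passes through unchanged.
    """
    new_w = [w - alpha * resolvents_A[i](w / alpha, 1.0 / alpha)
             for i, w in enumerate(w_list)]
    return new_w, z
```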

Analogous to SPS, Tseng’s method has an approximation residual, which in this case is an element of \(\mathscr {T}({\bar{q}}^k)\). In particular, using the general properties of resolvent operators as applied to \(J_{\alpha {\mathscr {A}}}\), we have

$$\begin{aligned} \frac{1}{\alpha }(q^k - {\bar{q}}^k) - {\mathscr {B}}(q^k) \in {\mathscr {A}}({\bar{q}}^k). \end{aligned}$$

Also, rearranging (66) produces

$$\begin{aligned} \frac{1}{\alpha }({\bar{q}}^k - q^{k+1}) + {\mathscr {B}}(q^k) = {\mathscr {B}}({\bar{q}}^k). \end{aligned}$$

Adding these two relations produces

$$\begin{aligned} \frac{1}{\alpha }(q^k - q^{k+1}) \in {\mathscr {A}}({\bar{q}}^k) + {\mathscr {B}}({\bar{q}}^k) = \mathscr {T}({\bar{q}}^k). \end{aligned}$$

Therefore,

$$\begin{aligned} R^{\text {Tseng}}_k \doteq \frac{1}{\alpha ^2}\Vert q^k - q^{k+1}\Vert ^2 \end{aligned}$$

represents a measure of the approximation error for Tseng’s method equivalent to \(R_k\) defined in (64) for SPS.
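A minimal sketch of one iteration of (65)–(66) together with \(R^{\text {Tseng}}_k\) follows, assuming the product-space point q is stored as a single flat array, B_op applies \({\mathscr {B}}\), and resolvent_A(x, alpha) returns \(J_{\alpha {\mathscr {A}}}(x)\) (all names illustrative).

```python
import numpy as np

def tseng_step_with_residual(q, B_op, resolvent_A, alpha):
    """One Tseng iteration (65)-(66) plus the residual R_k^{Tseng}."""
    Bq = B_op(q)
    q_bar = resolvent_A(q - alpha * Bq, alpha)       # forward-backward step (65)
    q_next = q_bar + alpha * (Bq - B_op(q_bar))      # correction step (66)
    residual = np.linalg.norm(q - q_next) ** 2 / alpha ** 2
    return q_next, residual
```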

1.4 Appendix A.4: FRB

The forward-reflected-backward method (FRB) [44] is another method that may be applied to the splitting \(\mathscr {T}= {\mathscr {A}}+ {\mathscr {B}}\) for \({\mathscr {A}}\) and \({\mathscr {B}}\) as defined in (61) and (62). Doing so yields recursion

$$\begin{aligned} q^{k+1} = J_{\alpha {\mathscr {A}}} \!\Big (q^k - \alpha \big (2{\mathscr {B}}(q^k) - {\mathscr {B}}(q^{k-1})\big ) \Big ). \end{aligned}$$

Following similar arguments to those for Tseng’s method, it can be shown that

$$\begin{aligned} v_{\text {FRB}}^k \doteq \frac{1}{\alpha } \left( q^{k-1} -q^k \right) + {\mathscr {B}}(q^k) + {\mathscr {B}}(q^{k-2}) - 2{\mathscr {B}}(q^{k-1}) \in \mathscr {T}(q^k). \end{aligned}$$

Thus, FRB admits the following approximation residual equivalent to \(R_k\) for SPS:

$$\begin{aligned} R^{\text {FRB}}_k\doteq \Vert v_{\text {FRB}}^k\Vert ^2. \end{aligned}$$
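The corresponding sketch for FRB keeps the two previous iterates so that both the update and \(R^{\text {FRB}}_k\) can be evaluated; again the names are illustrative and the conventions (flat array q, callables B_op and resolvent_A) are our assumptions.

```python
import numpy as np

def frb_step(q_k, q_km1, B_op, resolvent_A, alpha):
    """FRB update: q^{k+1} = J_{alpha A}(q^k - alpha(2 B(q^k) - B(q^{k-1})))."""
    return resolvent_A(q_k - alpha * (2.0 * B_op(q_k) - B_op(q_km1)), alpha)

def frb_residual(q_k, q_km1, q_km2, B_op, alpha):
    """R_k^{FRB} = ||(q^{k-1}-q^k)/alpha + B(q^k) + B(q^{k-2}) - 2 B(q^{k-1})||^2."""
    v = (q_km1 - q_k) / alpha + B_op(q_k) + B_op(q_km2) - 2.0 * B_op(q_km1)
    return np.linalg.norm(v) ** 2
```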

Finally, we remark that the stepsizes used in both the Tseng and FRB methods can be chosen via a linesearch procedure that we do not detail here.

1.5 Appendix A.5: Stochastic Tseng Method

The stochastic version of Tseng’s method of [7] (S-Tseng) may be applied to the inclusion \(0\in {\mathscr {A}}(q)+{\mathscr {B}}(q)\), since the operator \({\mathscr {A}}\) may be written as a subdifferential. However, unlike the deterministic Tseng method, it does not produce a valid residual. Note also that S-Tseng outputs an ergodic sequence \(q_{\text {erg}}^k\). To construct a residual for the ergodic sequence, we compute a deterministic step of Tseng’s method according to (65)-(66), starting at \(q_{\text {erg}}^k\). That is, letting

$$\begin{aligned} {\bar{q}}^k&= J_{\alpha {\mathscr {A}}}(q_{\text {erg}}^k - \alpha {\mathscr {B}}(q_{\text {erg}}^k))\\ q^{k+1}&= {\bar{q}}^k + \alpha ({\mathscr {B}}(q_{\text {erg}}^k) - {\mathscr {B}}({\bar{q}}^k)), \end{aligned}$$

we can then compute essentially the same residual as for Tseng’s method in Appendix A.3,

$$\begin{aligned} R^{\text {S-Tseng}}_k \doteq \frac{1}{\alpha ^2}\Vert q_{\text {erg}}^k - q^{k+1}\Vert ^2. \end{aligned}$$

To construct the stochastic oracle for S-Tseng, we assumed \(B(z)=\frac{1}{m}\sum _{i=1}^m B_i(z)\). Then we used

$$\begin{aligned} {\tilde{{\mathscr {B}}}}(w_1,\ldots ,w_n,z)\mapsto \left[ \begin{array}{cccc} 0 &{} \cdots &{} 0 &{} -I\\ \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} -I\\ I &{} \cdots &{} I &{} 0 \end{array} \right] \left[ \begin{array}{c} w_1\\ \vdots \\ w_n\\ z \end{array} \right] + \left[ \begin{array}{c} 0\\ \vdots \\ 0\\ \frac{1}{|{\textbf{B}}|}\sum _{j\in {\textbf{B}}}B_j(z) \end{array} \right] . \end{aligned}$$
(67)

for some minibatch \({\textbf{B}}\subseteq \{1,\ldots ,m\}\).
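The mini-batch oracle (67) only perturbs the last block of (62); a minimal sketch, assuming B_components[j] implements \(B_j\) and the mini-batch is drawn uniformly without replacement (names illustrative):

```python
import numpy as np

def stochastic_oracle_B(w_list, z, B_components, batch_size, rng):
    """Mini-batch version (67) of the product-space operator (62)."""
    batch = rng.choice(len(B_components), size=batch_size, replace=False)
    B_tilde = sum(B_components[j](z) for j in batch) / batch_size
    top = [-z.copy() for _ in w_list]      # same deterministic top blocks as (62)
    bottom = sum(w_list) + B_tilde         # stochastic estimate in the last block
    return top, bottom
```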

1.6 Appendix A.6: Variance-reduced FRB

The FRB-VR method of [1] can also be applied to \(0\in {\mathscr {A}}(q)+{\mathscr {B}}(q)\), using the same stochastic oracle \({\tilde{{\mathscr {B}}}}\) defined in (67). If we let the iterates of FRB-VR be \((q^k,p^k)\), then line 4 of Algorithm 1 of [1] can be written as

$$\begin{aligned} {\hat{q}}^k&= q^k - \tau ({\mathscr {B}}(p^k) + {\tilde{{\mathscr {B}}}}(q^k) - {\tilde{{\mathscr {B}}}}(p^k)) \end{aligned}$$
(68)
$$\begin{aligned} q^{k+1}&= J_{\tau {\mathscr {A}}}({\hat{q}}^k). \end{aligned}$$
(69)

Once again, the method does not directly produce a residual, but one can be developed from the algorithm definition as follows: (69) yields \(\tau ^{-1}({\hat{q}}^k - q^{k+1}) \in {\mathscr {A}}(q^{k+1})\) and hence

$$\begin{aligned} \tau ^{-1}({\hat{q}}^k - q^{k+1})+{\mathscr {B}}(q^{k+1})\in ({\mathscr {A}}+{\mathscr {B}})(q^{k+1}). \end{aligned}$$

Therefore we use the residual

$$\begin{aligned} R_k^{\text {FRB-VR}} = \Vert \tau ^{-1}({\hat{q}}^k - q^{k+1})+{\mathscr {B}}(q^{k+1})\Vert ^2. \end{aligned}$$
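In code, this residual is computed directly from \({\hat{q}}^k\) and \(q^{k+1}\) after the resolvent step (69); a sketch with illustrative names:

```python
import numpy as np

def frb_vr_residual(q_hat, q_next, B_op, tau):
    """R_k^{FRB-VR} = ||(q_hat^k - q^{k+1})/tau + B(q^{k+1})||^2, from (69)."""
    v = (q_hat - q_next) / tau + B_op(q_next)
    return np.linalg.norm(v) ** 2
```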

Figure 1 plots \(R_k\) for SPS, \(R^{\text {Tseng}}_k\) for Tseng’s method, \(R^{\text {FRB}}_k\) for FRB, \(R^\text {S-Tseng}_k\) for S-Tseng, and \(R^\text {FRB-VR}_k\) for FRB-VR.

Appendix B: Additional information about the numerical experiments

We now show how we converted Problem (59) to the form (1) for our experiments. Let z be a shorthand for \((\lambda ,\beta ,\gamma )\), and define

$$\begin{aligned} {\mathcal {L}}(z)\doteq \lambda (\delta - \kappa ) + \frac{1}{m}\sum _{i=1}^m\Psi (\langle {\hat{x}}_i,\beta \rangle ) + \frac{1}{m} \sum _{i=1}^m \gamma _i( {\hat{y}}_i\langle {\hat{x}}_i,\beta \rangle - \lambda \kappa ). \end{aligned}$$

The first-order necessary and sufficient conditions for the convex–concave saddle-point problem in (59) are

$$\begin{aligned} 0 \in B(z) + A_1(z) + A_2(z) \end{aligned}$$
(70)

where the vector field B(z) is defined as

$$\begin{aligned} B(z) \doteq \left[ \begin{array}{c} \nabla _{\lambda ,\beta } {\mathcal {L}}(z)\\ -\nabla _{\gamma } {\mathcal {L}}(z) \end{array} \right] , \end{aligned}$$
(71)

with

$$\begin{aligned} \nabla _{\lambda ,\beta } {\mathcal {L}}(z) = \left[ \begin{array}{c} \delta - \kappa (1+\frac{1}{m}\sum _{i=1}^m\gamma _i)\\ \frac{1}{m}\sum _{i=1}^m\Psi '(\langle {\hat{x}}_{i},\beta \rangle ){\hat{x}}_i +\frac{1}{m}\sum _{i=1}^m\gamma _i{\hat{y}}_i{\hat{x}}_i \end{array} \right] \end{aligned}$$

and

$$\begin{aligned} \nabla _\gamma {\mathcal {L}}(z) = \left[ \begin{array}{c} \frac{1}{m}({\hat{y}}_1\langle {\hat{x}}_{1},\beta \rangle -\lambda \kappa ) \\ \vdots \\ \frac{1}{m}({\hat{y}}_m\langle {\hat{x}}_{m},\beta \rangle -\lambda \kappa ) \end{array} \right] . \end{aligned}$$

It is readily confirmed that B defined in this manner is Lipschitz. The monotonicity of B follows from its being the generalized gradient of a convex–concave saddle function [54]. The set-valued operators \(A_1\) and \(A_2\) correspond to the constraints and the nonsmooth \(\ell _1\) regularizer, respectively, and are defined as

$$\begin{aligned} A_1(z) \doteq N_{\mathcal {C}_1}(\lambda ,\beta )\times N_{\mathcal {C}_2}(\gamma ), \end{aligned}$$

where

$$\begin{aligned} \mathcal {C}_1 \doteq \big \{ (\lambda ,\beta ): \Vert \beta \Vert _2\le \lambda /(L_\Psi +1) \big \} \quad \text {and} \quad \mathcal {C}_2\doteq \{\gamma : \Vert \gamma \Vert _\infty \le 1 \}, \end{aligned}$$

and

$$\begin{aligned} A_2(z) \doteq \{{\textbf{0}}_{1\times 1}\}\times c\partial \Vert \beta \Vert _1 \times \{{\textbf{0}}_{m\times 1}\}. \end{aligned}$$

Here, the notation \({\textbf{0}}_{p\times 1}\) denotes the p-dimensional vector of all zeros. \(\mathcal {C}_1\) is a scaled version of the second-order cone, well known to be a closed convex set, while \(\mathcal {C}_2\) is the unit ball of the \(\ell _\infty \) norm, also closed and convex. Since \(A_1\) is a normal cone map of a closed convex set and \(A_2\) is the subgradient map of a closed proper convex function (the scaled 1-norm), both of these operators are maximal monotone and problem (70) is a special case of (1) for \(n=2\).

Stochastic oracle implementation. The operator \(B:\mathbb {R}^{m+d+1}\rightarrow \mathbb {R}^{m+d+1}\), defined in (71), can be written as

$$\begin{aligned} B(z) = \frac{1}{m}\sum _{i=1}^m B_i(z) \end{aligned}$$

where

$$\begin{aligned} B_i(z) \doteq \left[ \begin{array}{c} \delta - \kappa (1+\gamma _i)\\ \Psi '(\langle {\hat{x}}_{i},\beta \rangle ){\hat{x}}_i +\gamma _i{\hat{y}}_i{\hat{x}}_i \\ {\textbf{0}}_{(i-1)\times 1} \\ -({\hat{y}}_i\langle {\hat{x}}_{i},\beta \rangle -\lambda \kappa ) \\ {\textbf{0}}_{(m - i)\times 1} \end{array} \right] . \end{aligned}$$

In our SPS experiments, the stochastic oracle for B is simply \({\tilde{B}}(z) = \frac{1}{|{\textbf{B}}|}\sum _{i\in {\textbf{B}}} B_i(z)\) for some minibatch \({\textbf{B}}\subseteq \{1,\ldots ,m\}\). We used a batch size of 100.
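To make the component structure explicit, the sketch below evaluates a single \(B_i(z)\) with \(z=(\lambda ,\beta ,\gamma )\) packed as one vector of length \(1+d+m\), and averages over a mini-batch. Here psi_prime denotes \(\Psi '\); the function and argument names are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def B_component(z, i, x_hat, y_hat, d, delta, kappa, psi_prime):
    """Evaluate B_i(z) for z = (lambda, beta, gamma) stacked in R^{1+d+m}."""
    lam, beta, gamma = z[0], z[1:1 + d], z[1 + d:]
    inner = x_hat[i] @ beta
    out = np.zeros_like(z)
    out[0] = delta - kappa * (1.0 + gamma[i])                        # lambda block
    out[1:1 + d] = psi_prime(inner) * x_hat[i] + gamma[i] * y_hat[i] * x_hat[i]
    out[1 + d + i] = -(y_hat[i] * inner - lam * kappa)               # i-th gamma entry
    return out

def oracle_B_tilde(z, batch, x_hat, y_hat, d, delta, kappa, psi_prime):
    """Mini-batch oracle: average of B_i over the sampled indices."""
    return sum(B_component(z, i, x_hat, y_hat, d, delta, kappa, psi_prime)
               for i in batch) / len(batch)
```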

Resolvent computations. The resolvent of \(A_1\) is readily constructed from the projection maps of the simple sets \(\mathcal {C}_1\) and \(\mathcal {C}_2\), while the resolvent of \(A_2\) involves the proximal operator of the \(\ell _1\) norm. Specifically,

$$\begin{aligned} J_{ \rho A_1}(z) = \left[ \begin{array}{c} \text {proj}_{\mathcal {C}_1}\!(\lambda ,\beta )\\ \text {proj}_{\mathcal {C}_2}\!(\gamma ) \end{array} \right] \quad \text {and} \quad J_{\rho A_2}(z) = \left[ \begin{array}{c} {\textbf{0}}_{1\times 1}\\ \text {prox}_{\rho c\Vert \cdot \Vert _1}\!(\beta )\\ {\textbf{0}}_{m\times 1} \end{array} \right] . \end{aligned}$$

The constraint \(\mathcal {C}_1\) is a scaled second-order cone and \(\mathcal {C}_2\) is the \(\ell _\infty \) ball, both of which have closed-form projections. The proximal operator of the \(\ell _1\) norm is the well-known soft-thresholding operator [51, Sec. 6.5.2]. Therefore all resolvents in the formulation may be computed quickly and accurately.
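These resolvents admit short closed-form implementations. The sketch below uses the standard second-order-cone projection formula for the scaled cone \(\mathcal {C}_1\), clipping for the \(\ell _\infty \) ball, and soft-thresholding for the \(\ell _1\) proximal operator; it is illustrative code under our naming conventions, not the authors’ implementation.

```python
import numpy as np

def proj_C2(gamma):
    """Projection onto C_2 = {gamma : ||gamma||_inf <= 1}."""
    return np.clip(gamma, -1.0, 1.0)

def prox_l1(beta, t):
    """prox_{t ||.||_1}: componentwise soft-thresholding [51, Sec. 6.5.2]."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def proj_C1(lam, beta, s):
    """Projection onto C_1 = {(lam, beta): ||beta||_2 <= lam / s}, s = L_Psi + 1."""
    a = 1.0 / s
    nb = np.linalg.norm(beta)
    if nb <= a * lam:                    # already inside the cone
        return lam, beta
    if a * nb <= -lam:                   # in the polar cone: project onto the origin
        return 0.0, np.zeros_like(beta)
    t = (a * nb + lam) / (a * a + 1.0)   # otherwise project onto the cone boundary
    return t, (a * t / nb) * beta
```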

SPS stepsize choices. For the stepsize in SPS, we ordinarily require \(\rho _k \le \overline{\rho }< 1/L\) for the global Lipschitz constant L of B. However, since the global Lipschitz constant may be pessimistic, better performance can often be achieved by experimenting with larger stepsizes; if divergence is observed, the stepsize can then be decreased. This type of strategy is common for SGD and similar stochastic methods. For SPS-decay we set \(\alpha _k = C_d k^{-0.51} \) and \( \rho _k = C_d k^{-0.25}, \) and performed a grid search to select the best \(C_d\) from \(\{0.1,0.5,1,5,10\}\), arriving at \(C_d=1\) for epsilon and SUSY, and \(C_d=0.5\) for real-sim. For SPS-fixed we used \(\rho = K^{-1/4}\) and \(\alpha = C_f\rho ^2\), and performed a grid search to select \(C_f\) over \(\{0.1,0.5,1,5,10\}\), arriving at \(C_f=1\) for epsilon and real-sim, and \(C_f=5\) for SUSY. The total number of iterations K for SPS-fixed was \(K=5000\) for the epsilon dataset, \(K=200\) for SUSY, and \(K=1000\) for real-sim.
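The two stepsize schedules above amount to the following simple rules, with the constants \(C_d\) and \(C_f\) taken from the grid searches (a sketch):

```python
def sps_decay_stepsizes(k, C_d):
    """SPS-decay: alpha_k = C_d * k^{-0.51}, rho_k = C_d * k^{-0.25}."""
    return C_d * k ** (-0.51), C_d * k ** (-0.25)

def sps_fixed_stepsizes(K, C_f):
    """SPS-fixed: rho = K^{-1/4}, alpha = C_f * rho^2, held constant for K iterations."""
    rho = K ** (-0.25)
    return C_f * rho ** 2, rho
```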

Parameter choices for the other algorithms. All methods were initialized at the same random point. For Tseng’s method, we used the backtracking linesearch variant with an initial stepsize of 1, \(\theta =0.8\), and a stepsize reduction factor of 0.7. For FRB, we used the backtracking linesearch variant with the same settings as for Tseng’s method. For deterministic PS, we used a fixed stepsize of 0.9/L. For the stochastic Tseng’s method of [7], the stepsize \(\alpha _k\) must satisfy \(\sum _{k=1}^\infty \alpha _k=\infty \) and \(\sum _{k=1}^\infty \alpha _k^2<\infty \). We therefore set \(\alpha _k=C k ^{-d}\) and performed a grid search over \((C,d)\) in the range \([10^{-4},10]\times [0.51,1]\), checking a \(5\times 5\) grid of values to find the best setting for each of the three problems. The selected values are in Table 1.

Table 1 Parameter Values for S-Tseng

The work of [7] also introduced FBFp, a stochastic version of Tseng’s method that reuses a previously computed gradient and therefore needs only one additional gradient calculation per iteration. In our experiments, the performance of the two methods was about the same, so we report only the results for the stochastic Tseng method.

For variance-reduced FRB, the main parameter is the probability p. We hand-tuned p, arriving at \(p=0.01\) for all problems. We set the stepsize to its maximum allowed value of

$$\begin{aligned} \tau = \frac{1-\sqrt{1-p}}{2L}. \end{aligned}$$

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnstone, P.R., Eckstein, J., Flynn, T. et al. Stochastic projective splitting. Comput Optim Appl 87, 397–437 (2024). https://doi.org/10.1007/s10589-023-00528-6
