Generalization in Deep RL for TSP Problems via Equivariance and Local Search

Abstract

Deep reinforcement learning (RL) has proved to be a competitive heuristic for solving small-sized instances of traveling salesman problems (TSP), but its performance on larger-sized instances is insufficient. Since training on large instances is impractical, we design a novel deep RL approach with a focus on generalizability. Our proposition, consisting of a simple deep learning architecture trained with novel RL techniques, exploits two main ideas. First, we exploit equivariance to facilitate training. Second, we interleave efficient local search heuristics with the usual RL training to smooth the value landscape. Our experimental evaluation demonstrates that our method achieves state-of-the-art performance not only when generalizing to large random TSP instances, but also on realistic TSP instances. Moreover, an ablation study shows that all the components of our method contribute to its performance.


Notes

  1. Our implementation takes advantage of GPU acceleration when possible. The source code will be shared after publication.

References

  1. Applegate D, Bixby R, Chvátal V, et al. Concorde TSP solver. http://www.math.uwaterloo.ca/tsp/concorde; 2004.

  2. Bai R, Chen X, Chen ZL, et al. Analytics and machine learning in vehicle routing research. Int J Prod Res. 2021;61(1):4–30. https://doi.org/10.1080/00207543.2021.2013566.


  3. Battaglia PW, Hamrick JB, Bapst V, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261; 2018.

  4. Bello I, Pham H, Le QV, et al. Neural combinatorial optimization with reinforcement learning. In: International conference on learning representations; 2016.

  5. Cai Q, Hang W, Mirhoseini A, et al. Reinforcement learning driven heuristic optimization. In: DRL4KDD; 2019.

  6. Cohen TS, Welling M. Group equivariant convolutional networks. In: International conference on machine learning; 2016. pp 2990–2999.

  7. da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian conference on machine learning; 2020. pp 465–480.

  8. da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for routing problems via deep reinforcement learning. SN Comput Sci. 2021;2:1–16. https://doi.org/10.1007/s42979-021-00779-2.


  9. Dai H, Khalil EB, Zhang Y, et al. Learning combinatorial optimization algorithms over graphs. Adv Neural Inf Process Syst. 2017;30:6351–61.


  10. Deudon M, Cournut P, Lacoste A, et al. Learning heuristics for the TSP by policy gradient. In: International conference on the integration of constraint programming, artificial intelligence, and operations research, vol 10848 LNCS. Springer Verlag. 2018; pp 170–181, https://doi.org/10.1007/978-3-319-93031-2_12

  11. François-Lavet V, Henderson P, Islam R, et al. An introduction to deep reinforcement learning. Found Trends Mach Learn. 2018;11(3–4):219. https://doi.org/10.1561/2200000071.


  12. Fu Z, Qiu K, Zha H. Generalize a small pre-trained model to arbitrarily large TSP instances. In: AAAI conference on artificial intelligence. 2021; pp 7474–7482. https://doi.org/10.1609/aaai.v35i8.16916

  13. Gens R, Domingos PM. Deep symmetry networks. In: Advances in neural information processing systems. 2014.

  14. Gerez SH. Algorithms for VLSI design automation, Wiley, chap Routing. 1999.

  15. Helsgaun K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Technical report, Roskilde University; 2017.

  16. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.


  17. Jones NC, Pevzner PA. An introduction to bioinformatics algorithms. MIT Press; 2004.


  18. Joshi CK, Laurent T, Bresson X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv:1906.01227; 2019.

  19. Joshi CK, Laurent T, Bresson X. On learning paradigms for the travelling salesman problem. In: NeurIPS graph representation learning workshop, arXiv:1910.07210; 2019.

  20. Kool W, van Hoof H, Welling M. Attention, learn to solve routing problems! In: International conference on learning representations 2019.

  21. Kwon YD, Choo J, Kim B, et al. POMO: policy optimization with multiple optima for reinforcement learning. Adv Neural Inf Process Syst. 2020;33:21188–98.


  22. LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. Cambridge, MA, USA: MIT Press; 1998. p. 255–8.


  23. Li Z, Chen Q, Koltun V. Combinatorial optimization with graph convolutional networks and guided tree search. Adv Neural Inf Process Syst. 2018;31:539–48.


  24. Lisicki M, Afkanpour A, Taylor GW. Evaluating curriculum learning strategies in neural combinatorial optimization. In: NeurIPS workshop on learning meets combinatorial algorithms, arXiv:2011.06188; 2020.

  25. Ma Q, Ge S, He D, et al. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: AAAI workshop on deep learning on graphs: methodologies and applications, arXiv:1911.04936; 2020.

  26. Ouyang W, Wang Y, Han S, et al. Improving generalization of deep reinforcement learning-based TSP solvers. In: IEEE SSCI ADPRL, arXiv:2110.02843; 2021.

  27. Papadimitriou CH. The Euclidean travelling salesman problem is NP-complete. Theor Comput Sci. 1977;4(3):237–44. https://doi.org/10.1016/0304-3975(77)90012-3.


  28. Peng B, Wang J, Zhang Z. A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems. In: Artificial intelligence algorithms and applications. Springer, Singapore, Communications in Computer and Information Science, pp 636–650, https://doi.org/10.1007/978-981-15-5577-0_51 2020.

  29. Perron L, Furnon V. Or-tools. https://developers.google.com/optimization/ 2019.

  30. Prates MOR, Avelar PHC, Lemos H, et al. Learning to solve NP-complete problems - a graph neural network for decision TSP. In: AAAI conference on artificial intelligence; 2019. pp 4731–4738. https://doi.org/10.1609/aaai.v33i01.33014731.

  31. Reinelt G. TSPLIB-a traveling salesman problem library. ORSA J Comput. 1991;3(4):376–84. https://doi.org/10.1287/ijoc.3.4.376.


  32. Snyder L, Shen ZJ. Fundamentals of Supply Chain Theory, Wiley, chap The Traveling Salesman Problem, 2019;403–461.

  33. Soviany P, Ionescu RT, Rota P, et al. Curriculum learning: a survey. Int J Comput Vis. 2021;130:1526–65. https://doi.org/10.1007/s11263-022-01611-x.


  34. Sutton R, Barto A. Reinforcement learning: an introduction. MIT Press; 1998.


  35. Vinyals O, Fortunato M, Jaitly N. Pointer networks. Adv Neural Inf Process Syst. 2015;28:2692–700.


  36. Vo TQT, Nguyen VH, Weng P, et al. Improving subtour elimination constraint generation in branch-and-cut algorithms for the TSP with machine learning. In: Learning and intelligent optimization conference 2023.

  37. Weinshall D, Cohen G, Amir D. Curriculum learning by transfer learning: theory and experiments with deep networks. In: International conference on machine learning; 2018. pp 5235–5243. https://doi.org/10.48550/arXiv.1802.03796.

  38. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8:229–56. https://doi.org/10.1007/BF00992696.


  39. Wu Y, Song W, Cao Z, et al. Learning improvement heuristics for solving routing problems. IEEE Trans Neural Netw Learn Syst. 2021. https://doi.org/10.1109/TNNLS.2021.3068828.


  40. Xing Z, Tu S. A graph neural network assisted Monte Carlo tree search approach to traveling salesman problem. IEEE Access. 2020;8:108418–28. https://doi.org/10.1109/ACCESS.2020.3000236.


  41. Zheng J, He K, Zhou J, et al. Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem. In: AAAI conference on artificial intelligence; 2021. pp 12445–12452. https://doi.org/10.1609/aaai.v35i14.17476.


Funding

This work has been supported in part by the program of National Natural Science Foundation of China (No. 62176154) and the program of the Shanghai NSF (No. 19ZR1426700).

Author information

Corresponding author

Correspondence to Paul Weng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Policy Gradient of eMAGIC

As promised in Sect. “Algorithm & Training” - Smoothed Policy Gradient, we provide here the detailed mathematical derivation of Eq. (13).

The objective function \(J(\varvec{\theta })\) in Eq. (3) can be approximated by the empirical mean of the total rewards over a batch B of trajectories sampled with policy \(\pi _{\varvec{\theta }}\):

$$\begin{aligned} J(\varvec{\theta }) = {-{\mathbb {E}}[L_\sigma ({\varvec{X}})]}\approx {-\hat{{\mathbb {E}}}_B\left[ \sum _{t=1}^N r^{(b)}_{t}\right] } = {-\frac{1}{|B|} \sum _{b=1}^{|B|} \sum _{t=1}^{N} r^{(b)}_{t}}, \end{aligned}$$
(16)

where \(\hat{{\mathbb {E}}}_B\) represents the empirical mean operator, \(L_\sigma ({\varvec{X}})\) is the tour length of \(\sigma\) output by the RL policy, and \(r^{(b)}_{t}\) is the t-th reward of the b-th trajectory. The policy gradient to optimize \(J(\varvec{\theta })\) can be estimated by:

$$\begin{aligned} \begin{aligned} \nabla _{\varvec{\theta }} J(\varvec{\theta })&=-{\mathbb {E}}_\tau \left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}_{t} \mid \varvec{s}_{t})\right) \left( \sum _{t=1}^{N} r_t\right) \right] \\ {}&\approx -\hat{\mathbb E}_B\left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid \varvec{s}^{(b)}_{t})\right) \left( \sum _{t=1}^{N} r^{(b)}_{t}\right) \right] \end{aligned} \end{aligned}$$
(17)

However, recall that \(J(\varvec{\theta })\) is the standard objective used in most deep RL methods applied to TSP. Instead, we optimize \(J^+(\varvec{\theta })= -{\mathbb {E}}[L_{\sigma _+}({\varvec{X}})]\), where \(L_{\sigma _+}({\varvec{X}})\) is the length of the tour \(\sigma _+\) obtained by applying local search to \(\sigma\). This helps better integrate RL and local search by smoothing the value landscape and training the RL agent to output tours that can be improved by local search. This new objective function can be rewritten as:

$$\begin{aligned} \begin{aligned} J^+(\varvec{\theta })&= -{\mathbb {E}}_{\sigma \sim \pi _{\varvec{\theta }},\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}})]\\&= -{\mathbb {E}}_{\sigma \sim \pi _{\varvec{\theta }}}[\mathbb E_{\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}}) \mid \sigma ]] \end{aligned} \end{aligned}$$
(18)

where \(\rho (\sigma )\) denotes the distribution over tours induced by applying the stochastic local search to \(\sigma\). Taking the gradient of this new objective yields:

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{\theta }} J^+(\varvec{\theta })= - {\mathbb {E}}_\tau \left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}_{t} \mid {\varvec{s}}_{t})\right) {\mathbb {E}}_{\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}}) \mid \sigma ] \right] \\&\approx - \hat{{\mathbb {E}}}_B\left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid {\varvec{s}}^{(b)}_{t})\right) \left( L_{\sigma ^{(b)}_+}({\varvec{X}}^{(b)})\right) \right] \\&\approx - \frac{1}{|B|} \sum _{b=1}^{|B|} \left( \sum _{t=2}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid \varvec{s}^{(b)}_{t})\right) \left( L_{\sigma _+^{(b)}}(\varvec{X}^{(b)})-l^{(b)}\right) , \end{aligned} \end{aligned}$$
(19)

where \(\tau = ({\varvec{s}}_1, a_1, \ldots )\) and \(\sigma\) is its associated tour. We simply approximate the conditional expectation over \(\rho (\sigma )\) with a single sample. Therefore, before subtracting the baseline \(l^{(b)}\), our gradient estimate is an unbiased estimator of the gradient of our new objective \(J^+(\varvec{\theta })\).

Using our policy rollout baseline \(l^{(b)}\) introduces some bias into the estimation of the smoothed policy gradient. However, the resulting variance reduction leads to better and more stable performance, as observed in our experiments.
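To make the gradient estimate in Eq. (19) concrete, the following PyTorch-style sketch performs one REINFORCE update on the smoothed objective. It is a minimal illustration under simplifying assumptions rather than an excerpt of our implementation: `policy`, `local_search`, `rollout_baseline`, and `tour_length` are hypothetical helpers, and `policy` is assumed to sample one tour per instance and return the log-probabilities of the chosen actions.

```python
import torch


def smoothed_policy_gradient_step(policy, local_search, rollout_baseline,
                                  optimizer, coords):
    """One REINFORCE step on the smoothed objective J^+ of Eq. (19).

    coords: tensor of shape (B, N, 2), a batch of TSP instances.
    policy(coords) is assumed to return (tours, log_probs), where tours has
    shape (B, N) and log_probs has shape (B, N) with one entry per decision.
    """
    tours, log_probs = policy(coords)                   # sample sigma ~ pi_theta
    with torch.no_grad():
        improved = local_search(coords, tours)          # sigma_+ ~ rho(sigma), no gradient flows here
        lengths_plus = tour_length(coords, improved)    # L_{sigma_+}(X)
        baseline = rollout_baseline(coords)             # l^{(b)}
        advantage = lengths_plus - baseline
    # Minimizing this loss performs gradient ascent on J^+; its gradient
    # matches the estimator of Eq. (19) up to the sign convention.
    loss = (advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def tour_length(coords, tours):
    """Length of each closed tour; tours holds city indices of shape (B, N)."""
    idx = tours.unsqueeze(-1).expand(-1, -1, 2)
    ordered = torch.gather(coords, 1, idx)
    diffs = ordered - torch.roll(ordered, shifts=-1, dims=1)
    return diffs.norm(dim=-1).sum(dim=1)
```

In an actual training loop, this step would simply be repeated over the batches and epochs described in Appendix B.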

Appendix B: Settings and Hyperparameters

All our experiments are run on a computer with an Intel(R) Xeon(R) E5-2678 v3 CPU and an NVIDIA 1080Ti GPU. Consistent with previous work, all randomly generated TSP instances are sampled uniformly in \([0,1]^2\). During training, stochastic CL chooses a TSP size in \({\mathcal {R}}=\{10,11,\dots ,50\}\) for each epoch e according to Eqs. (14) and (15), with \(\sigma _N\) set to 3. We train for 200 epochs in each experiment, with 1000 batches of 128 random TSP instances per epoch. In each experiment, we set the learning rate to \(10^{-3}\) with a decay factor of 0.96, and we use \(\alpha =0.5\), \(\beta =1.5\), \(\gamma =0.25\), and \(I=10\) for the parameters of our combined local search.

For random TSP testing and the ablation study, we test on TSP instances of size 20, 50, 100, 200, 500, and 1000 to evaluate the generalization capability of our model; for realistic TSP, we test on instances with sizes up to 1002 from the TSPLIB library. Regarding the model architecture, our MLP encoder has an input layer of dimension 2, two hidden layers of dimension 128 and 256, respectively, and an output layer of dimension 128; for the GNN encoder, we set \(H=128\) and \(n_{GNN}=3\).
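For convenience, the settings listed above can be gathered into a single configuration object. The dictionary below only restates the values given in this appendix; the key names are illustrative and do not come from our released code.

```python
# Hyperparameters of eMAGIC as listed in Appendix B (key names are ours).
EMAGIC_CONFIG = {
    # Stochastic curriculum learning over training sizes (Eqs. 14-15)
    "train_sizes": list(range(10, 51)),        # R = {10, 11, ..., 50}
    "sigma_N": 3,
    # Optimization
    "epochs": 200,
    "batches_per_epoch": 1000,
    "batch_size": 128,
    "learning_rate": 1e-3,
    "lr_decay": 0.96,
    # Combined local search
    "alpha": 0.5,
    "beta": 1.5,
    "gamma": 0.25,
    "local_search_iterations": 10,             # I
    # Model architecture
    "mlp_encoder_dims": (2, 128, 256, 128),    # input, hidden, hidden, output
    "gnn_hidden_dim": 128,                     # H
    "gnn_layers": 3,                           # n_GNN
    # Evaluation
    "random_test_sizes": [20, 50, 100, 200, 500, 1000],
    "tsplib_max_size": 1002,
}
```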

Appendix C: More Experiments on eMAGIC

More versions of eMAGIC

As promised in Sect. 6, we report the results of all versions of eMAGIC in this section; see Tables 7 and 8.

Table 7 Results of eMAGIC vs baselines, tested on 10,000 instances for TSP 20, 50, and 100
Table 8 Results of all versions of eMAGIC, tested on 128 instances for TSP 200, 500, and 1000

Variance Analysis of eMAGIC

As promised in Sect. 6, we provide the variance analysis of eMAGIC in this section. We repeated all experiments (training + testing) with 3 random seeds; the variances are shown in Table 9.

Table 9 Variance analysis of eMAGIC

The variances are quite low, showing that our method yields relatively good and stable results. Note that the variances increase with the number of samples (s1, s10, and S) for larger TSP instances, since there is more room for improvement.

Experiments on Extremely Large TSP Instances

In Table 10, we show the performance of eMAGIC on extremely large TSP instances (i.e., TSP 10,000) and compare it with the few methods that are able to generalize to TSP 10,000. For the hyperparameters of the combined local search, we use \(I=2\) and \(\beta =1.4\); we modify these hyperparameters because we find that, for such large TSP instances, running many iterations of local search is not efficient. The other experimental settings are the same as in Sect. 6. We also test the method of [12] with a limited time budget, denoted AGCRN+M\(^{lim}\): since we could only reproduce their method successfully on CPU, we decrease their hyperparameter T (the MCTS terminates after at most T seconds) from \(0.04n\) to \(0.04n / 1.8 \times (28/60) \approx 0.010n\) (1.8 h and 28 min are, respectively, the runtimes of their model and of eMAGIC(G)), as detailed below. With this adjustment, we expect their model and eMAGIC(G) to have similar running times.
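Explicitly, converting the 28-minute runtime to hours, the rescaled budget is obtained as

$$\begin{aligned} T = 0.04n \times \frac{28/60}{1.8} \approx 0.04n \times 0.26 \approx 0.010n. \end{aligned}$$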

Table 10 Results of eMAGIC vs baselines, tested on 16 instances for TSP 10,000

We observe that our models outperform AM [20] by a large margin. Compared to AGCRN+M [12] and LKH3 [15], our methods run much faster without a large gap in performance. Moreover, if we give AGCRN+M [12] a time budget comparable to ours (i.e., run AGCRN+M and eMAGIC(G) for the same amount of time), our algorithm outperforms AGCRN+M. Note that the MCTS in AGCRN+M is written in C++; we could reduce our runtime further if our code were also written in C++ instead of Python. Therefore, our method offers a better trade-off between performance and runtime: it generates relatively good results in much less time.

Appendix D: Evaluation of Equivariant Preprocessing Methods

Comparisons Between Different Symmetries Used During Preprocessing

We present comparison results for applying rotation, translation, and reflection during preprocessing. The comparisons are performed on random TSP instances, following the same settings as in the experiment section. As shown in Table 11, rotation achieves the best performance on both small and large TSP instances; we therefore choose rotation for our final algorithm.

Table 11 Comparisons between rotation, translation, and reflection
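To give a concrete picture of how rotation can be used as an equivariant preprocessing step, the sketch below applies a random rotation to each instance and rescales the coordinates back into the unit square; because a rotation followed by a uniform rescaling does not change the optimal visiting order, the model can be run on the transformed coordinates instead of the original ones. This is only a minimal sketch under our own assumptions (random angles, centroid-centered rotation, per-instance rescaling), not a verbatim description of the preprocessing used in eMAGIC.

```python
import math

import torch


def random_rotation_preprocess(coords):
    """Rotate each TSP instance by a random angle and rescale into [0, 1]^2.

    coords: tensor of shape (B, N, 2) with city coordinates in [0, 1]^2.
    The optimal tour of the returned instance visits the cities in the same
    order as the optimal tour of the original instance.
    """
    B = coords.shape[0]
    angle = 2 * math.pi * torch.rand(B, device=coords.device)     # one angle per instance
    cos, sin = torch.cos(angle), torch.sin(angle)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)  # (B, 2, 2) rotation matrices
    centered = coords - coords.mean(dim=1, keepdim=True)          # rotate around the centroid
    rotated = torch.einsum("bij,bnj->bni", rot, centered)
    # Rescale uniformly so the input distribution matches the training one.
    mins = rotated.amin(dim=1, keepdim=True)
    span = (rotated.amax(dim=1, keepdim=True) - mins).amax(dim=-1, keepdim=True)
    return (rotated - mins) / span.clamp_min(1e-9)
```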

Comparisons Between a Single Preprocessing Application and Iterated Preprocessing Applications

We present comparison results for a single preprocessing application versus iterated preprocessing applications. The comparisons are performed on random TSP instances, following the same settings as in the experiment section. As shown in Table 12, iterated preprocessing performs better on both small and large TSP instances; we therefore use iterated preprocessing in our final algorithm.

Table 12 Comparisons between iterated preprocessing and a single preprocessing application

Realistic TSP Instances in TSPLIB

As promised in Sect. 6, Tables 13, 14 and 15 list the performances of various learning-based TSP solvers and heuristic approaches on realistic TSP instances from TSPLIB. The bold numbers indicate the best performance among all approaches. Most bold numbers are obtained by eMAGIC, showing that our approach provides excellent results on TSPLIB. In addition, we provide results for the larger instances (with sizes up to 1002) from TSPLIB in Table 16. Table 16 only contains our model, since no other model generalizes to large TSP instances with adequate performance.

Table 13 Comparison of performances on TSPLIB—Part I
Table 14 Comparison of performances on TSPLIB—Part II
Table 15 Comparison of performances on TSPLIB—Part III
Table 16 Comparison of performances on large size TSP instances in TSPLIB

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ouyang, W., Wang, Y., Weng, P. et al. Generalization in Deep RL for TSP Problems via Equivariance and Local Search. SN COMPUT. SCI. 5, 369 (2024). https://doi.org/10.1007/s42979-024-02689-5

