Generalization in Deep RL for TSP Problems via Equivariance and Local Search

Abstract

Deep reinforcement learning (RL) has proved to be a competitive heuristic for solving small-sized instances of traveling salesman problems (TSP), but its performance on larger-sized instances is insufficient. Since training on large instances is impractical, we design a novel deep RL approach with a focus on generalizability. Our proposition, consisting of a simple deep learning architecture trained with novel RL techniques, exploits two main ideas. First, we exploit equivariance to facilitate training. Second, we interleave efficient local search heuristics with the usual RL training to smooth the value landscape. Our experimental evaluation demonstrates that our method achieves state-of-the-art performance not only when generalizing to large random TSP instances, but also on realistic TSP instances. Moreover, an ablation study shows that all the components of our method contribute to its performance.


Notes

  1. Our implementation takes advantage of GPU acceleration when possible. The source code will be shared after publication.

References

  1. Applegate D, Bixby R, Chvátal V, et al. Concorde TSP solver. http://www.math.uwaterloo.ca/tsp/concorde; 2004.

  2. Bai R, Chen X, Chen ZL, et al. Analytics and machine learning in vehicle routing research. Int J Prod Res. 2021;61(1):4–30. https://doi.org/10.1080/00207543.2021.2013566.


  3. Battaglia PW, Hamrick JB, Bapst V, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261; 2018.

  4. Bello I, Pham H, Le QV, et al. Neural combinatorial optimization with reinforcement learning. In: International conference on learning representations; 2016.

  5. Cai Q, Hang W, Mirhoseini A, et al. Reinforcement learning driven heuristic optimization. In: DRL4KDD; 2019.

  6. Cohen TS, Welling M. Group equivariant convolutional networks. In: International conference on machine learning; 2016. pp 2990–2999.

  7. da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian conference on machine learning; 2020. pp 465–480.

  8. da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for routing problems via deep reinforcement learning. SN Comput Sci. 2021;2:1–16. https://doi.org/10.1007/s42979-021-00779-2.


  9. Dai H, Khalil EB, Zhang Y, et al. Learning combinatorial optimization algorithms over graphs. Adv Neural Inf Process Syst. 2017;30:6351–61.


  10. Deudon M, Cournut P, Lacoste A, et al. Learning heuristics for the TSP by policy gradient. In: International conference on the integration of constraint programming, artificial intelligence, and operations research, vol 10848 LNCS. Springer Verlag. 2018; pp 170–181, https://doi.org/10.1007/978-3-319-93031-2_12

  11. François-Lavet V, Henderson P, Islam R, et al. An introduction to deep reinforcement learning. Found Trends Mach Learn. 2018;11(3–4):219. https://doi.org/10.1561/2200000071.


  12. Fu Z, Qiu K, Zha H. Generalize a small pre-trained model to arbitrarily large TSP instances. In: AAAI conference on artificial intelligence. 2021; pp 7474–7482. https://doi.org/10.1609/aaai.v35i8.16916

  13. Gens R, Domingos PM. Deep symmetry networks. In: Advances in neural information processing systems. 2014.

  14. Gerez SH. Algorithms for VLSI design automation, Wiley, chap Routing. 1999.

  15. Helsgaun K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Technical report, Roskilde University; 2017.

  16. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.


  17. Jones NC, Pevzner PA. An introduction to bioinformatics algorithms. MIT Press; 2004.


  18. Joshi CK, Laurent T, Bresson X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv:1906.01227; 2019.

  19. Joshi CK, Laurent T, Bresson X. On learning paradigms for the travelling salesman problem. In: NeurIPS graph representation learning workshop, arXiv:1910.07210; 2019.

  20. Kool W, van Hoof H, Welling M. Attention, learn to solve routing problems! In: International conference on learning representations 2019.

  21. Kwon YD, Choo J, Kim B, et al. POMO: policy optimization with multiple optima for reinforcement learning. Adv Neural Inf Process Syst. 2020;33:21188–98.


  22. LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. Cambridge, MA, USA: MIT Press; 1998. p. 255–8.


  23. Li Z, Chen Q, Koltun V. Combinatorial optimization with graph convolutional networks and guided tree search. Adv Neural Inf Process Syst. 2018;31:539–48.


  24. Lisicki M, Afkanpour A, Taylor GW. Evaluating curriculum learning strategies in neural combinatorial optimization. In: NeurIPS workshop on learning meets combinatorial algorithms, arXiv:2011.06188; 2020.

  25. Ma Q, Ge S, He D, et al. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: AAAI workshop on deep learning on graphs: methodologies and applications, arXiv:1911.04936; 2020.

  26. Ouyang W, Wang Y, Han S, et al. Improving generalization of deep reinforcement learning-based TSP solvers. In: IEEE SSCI ADPRL, arXiv:2110.02843; 2021.

  27. Papadimitriou CH. The Euclidean travelling salesman problem is NP-complete. Theor Comput Sci. 1977;4(3):237–44. https://doi.org/10.1016/0304-3975(77)90012-3.


  28. Peng B, Wang J, Zhang Z. A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems. In: Artificial intelligence algorithms and applications. Springer, Singapore, Communications in Computer and Information Science, pp 636–650, https://doi.org/10.1007/978-981-15-5577-0_51 2020.

  29. Perron L, Furnon V. Or-tools. https://developers.google.com/optimization/ 2019.

  30. Prates MOR, Avelar PHC, Lemos H, et al. Learning to solve NP-complete problems - a graph neural network for decision TSP. In: AAAI conference on artificial intelligence; 2019. pp 4731–4738. https://doi.org/10.1609/aaai.v33i01.33014731.

  31. Reinelt G. TSPLIB-a traveling salesman problem library. ORSA J Comput. 1991;3(4):376–84. https://doi.org/10.1287/ijoc.3.4.376.


  32. Snyder L, Shen ZJ. Fundamentals of Supply Chain Theory, Wiley, chap The Traveling Salesman Problem, 2019;403–461.

  33. Soviany P, Ionescu RT, Rota P, et al. Curriculum learning: a survey. Int J Comput Vis. 2021;130:1526–65. https://doi.org/10.1007/s11263-022-01611-x.


  34. Sutton R, Barto A. Reinforcement learning: an introduction. MIT Press; 1998.


  35. Vinyals O, Fortunato M, Jaitly N. Pointer networks. Adv Neural Inf Process Syst. 2015;28:2692–700.


  36. Vo TQT, Nguyen VH, Weng P, et al. Improving subtour elimination constraint generation in branch-and-cut algorithms for the TSP with machine learning. In: Learning and intelligent optimization conference 2023.

  37. Weinshall D, Cohen G, Amir D. Curriculum learning by transfer learning: theory and experiments with deep networks. In: International conference on machine learning; 2018. pp 5235–5243. https://doi.org/10.48550/arXiv.1802.03796.

  38. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8:229–56. https://doi.org/10.1007/BF00992696.


  39. Wu Y, Song W, Cao Z, et al. Learning improvement heuristics for solving routing problems. IEEE Trans Neural Netw Learn Syst. 2021. https://doi.org/10.1109/TNNLS.2021.3068828.


  40. Xing Z, Tu S. A graph neural network assisted Monte Carlo tree search approach to traveling salesman problem. IEEE Access. 2020;8:108418–28. https://doi.org/10.1109/ACCESS.2020.3000236.


  41. Zheng J, He K, Zhou J, et al. Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem. In: AAAI conference on artificial intelligence; 2021. pp 12445–12452. https://doi.org/10.1609/aaai.v35i14.17476.


Funding

This work has been supported in part by the program of National Natural Science Foundation of China (No. 62176154) and the program of the Shanghai NSF (No. 19ZR1426700).

Author information

Corresponding author

Correspondence to Paul Weng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Policy Gradient of eMAGIC

As promised in Sect. “Algorithm & Training” - Smoothed Policy Gradient, we provide here the detailed mathematical derivation of Eq. (13).

The objective function \(J(\varvec{\theta })\) in Eq. (3) can be approximated by the empirical mean of the total rewards over a batch B of trajectories sampled with policy \(\pi _{\varvec{\theta }}\):

$$\begin{aligned} J(\varvec{\theta }) = {-{\mathbb {E}}[L_\sigma ({\varvec{X}})]}\approx {-\hat{{\mathbb {E}}}_B\left[ \sum _{t=1}^N r^{(b)}_{t}\right] } = {-\frac{1}{|B|} \sum _{b=1}^{|B|} \sum _{t=1}^{N} r^{(b)}_{t}}, \end{aligned}$$
(16)

where \(\hat{{\mathbb {E}}}_B\) represents the empirical mean operator, \(L_\sigma ({\varvec{X}})\) is the tour length of \(\sigma\) output by the RL policy, and \(r^{(b)}_{t}\) is the t-th reward of the b-th trajectory. The policy gradient to optimize \(J(\varvec{\theta })\) can be estimated by:

$$\begin{aligned} \begin{aligned} \nabla _{\varvec{\theta }} J(\varvec{\theta })&=-{\mathbb {E}}_\tau \left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}_{t} \mid \varvec{s}_{t})\right) \left( \sum _{t=1}^{N} r_t\right) \right] \\ {}&\approx -\hat{\mathbb E}_B\left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid \varvec{s}^{(b)}_{t})\right) \left( \sum _{t=1}^{N} r^{(b)}_{t}\right) \right] \end{aligned} \end{aligned}$$
(17)

However, recall that \(J(\varvec{\theta })\) is the standard objective used in most deep RL methods applied to TSP. Instead, we optimize \(J^+(\varvec{\theta })= -{\mathbb {E}}[L_{\sigma _+}({\varvec{X}})]\), where \(L_{\sigma _+}({\varvec{X}})\) is the length of the tour \(\sigma _+\) obtained by applying local search to \(\sigma\). This helps better integrate RL and local search by smoothing the value landscape and training the RL agent to output tours that can be improved by local search. This new objective function can be rewritten as:

$$\begin{aligned} \begin{aligned} J^+(\varvec{\theta })&= -{\mathbb {E}}_{\sigma \sim \pi _{\varvec{\theta }},\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}})]\\&= -{\mathbb {E}}_{\sigma \sim \pi _{\varvec{\theta }}}[\mathbb E_{\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}}) \mid \sigma ]] \end{aligned} \end{aligned}$$
(18)

where \(\rho (\sigma )\) denotes the distribution over tours induced by applying the stochastic local search to \(\sigma\). Taking the gradient of this new objective yields:

$$\begin{aligned} \begin{aligned}&\nabla _{\varvec{\theta }} J^+(\varvec{\theta })= - {\mathbb {E}}_\tau \left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}_{t} \mid {\varvec{s}}_{t})\right) {\mathbb {E}}_{\sigma _+\sim \rho (\sigma )}[L_{\sigma _+}({\varvec{X}}) \mid \sigma ] \right] \\&\approx - \hat{{\mathbb {E}}}_B\left[ \left( \sum _{t=1}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid {\varvec{s}}^{(b)}_{t})\right) \left( L_{\sigma ^{(b)}_+}({\varvec{X}}^{(b)})\right) \right] \\&\approx - \frac{1}{|B|} \sum _{b=1}^{|B|} \left( \sum _{t=2}^{N} \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}({a}^{(b)}_{t} \mid \varvec{s}^{(b)}_{t})\right) \left( L_{\sigma _+^{(b)}}(\varvec{X}^{(b)})-l^{(b)}\right) , \end{aligned} \end{aligned}$$
(19)

where \(\tau = ({\varvec{s}}_1, a_1, \ldots )\) and \(\sigma\) is its associated tour. We simply approximate the conditional expectation over \(\rho (\sigma )\) with a single sample. Therefore, before subtracting the baseline \(l^{(b)}\), our gradient estimate is an unbiased estimator of the gradient of our new objective \(J^+(\varvec{\theta })\).

Using our policy rollout baseline \(l^{(b)}\) introduces some bias into the estimation of the smoothed policy gradient. However, the resulting variance reduction leads to better and more stable performance, as observed in our experiments.
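To make the gradient estimate in Eq. (19) concrete, the following PyTorch-style sketch performs one REINFORCE update on the smoothed objective. It is a minimal illustration under simplifying assumptions rather than an excerpt of our implementation: `policy`, `local_search`, `rollout_baseline`, and `tour_length` are hypothetical helpers, and `policy` is assumed to sample one tour per instance and return the log-probabilities of the chosen actions.

```python
import torch


def smoothed_policy_gradient_step(policy, local_search, rollout_baseline,
                                  optimizer, coords):
    """One REINFORCE step on the smoothed objective J^+ of Eq. (19).

    coords: tensor of shape (B, N, 2), a batch of TSP instances.
    policy(coords) is assumed to return (tours, log_probs), where tours has
    shape (B, N) and log_probs has shape (B, N) with one entry per decision.
    """
    tours, log_probs = policy(coords)                   # sample sigma ~ pi_theta
    with torch.no_grad():
        improved = local_search(coords, tours)          # sigma_+ ~ rho(sigma), no gradient flows here
        lengths_plus = tour_length(coords, improved)    # L_{sigma_+}(X)
        baseline = rollout_baseline(coords)             # l^{(b)}
        advantage = lengths_plus - baseline
    # Minimizing this loss performs gradient ascent on J^+; its gradient
    # matches the estimator of Eq. (19) up to the sign convention.
    loss = (advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def tour_length(coords, tours):
    """Length of each closed tour; tours holds city indices of shape (B, N)."""
    idx = tours.unsqueeze(-1).expand(-1, -1, 2)
    ordered = torch.gather(coords, 1, idx)
    diffs = ordered - torch.roll(ordered, shifts=-1, dims=1)
    return diffs.norm(dim=-1).sum(dim=1)
```

In an actual training loop, this step would simply be repeated over the batches and epochs described in Appendix B.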

Appendix B: Settings and Hyperparameters

All our experiments are run on a computer with an Intel(R) Xeon(R) E5-2678 v3 CPU and an NVIDIA 1080Ti GPU. Consistent with previous work, all randomly generated TSP instances are sampled uniformly in \([0,1]^2\). During training, stochastic CL chooses a TSP size in \({\mathcal {R}}=\{10,11,\dots ,50\}\) for each epoch e according to Eqs. (14) and (15), with \(\sigma _N\) set to 3. We train for 200 epochs in each experiment, with 1000 batches of 128 random TSP instances per epoch. In each experiment, we set the learning rate to \(10^{-3}\) with a decay factor of 0.96, and we use \(\alpha =0.5\), \(\beta =1.5\), \(\gamma =0.25\), and \(I=10\) for the parameters of our combined local search.

For random TSP testing and the ablation study, we test on TSP instances of size 20, 50, 100, 200, 500, and 1000 to evaluate the generalization capability of our model; for realistic TSP, we test on instances with sizes up to 1002 from the TSPLIB library. Regarding the model architecture, our MLP encoder has an input layer of dimension 2, two hidden layers of dimension 128 and 256, respectively, and an output layer of dimension 128; for the GNN encoder, we set \(H=128\) and \(n_{GNN}=3\).
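For convenience, the settings listed above can be gathered into a single configuration object. The dictionary below only restates the values given in this appendix; the key names are illustrative and do not come from our released code.

```python
# Hyperparameters of eMAGIC as listed in Appendix B (key names are ours).
EMAGIC_CONFIG = {
    # Stochastic curriculum learning over training sizes (Eqs. 14-15)
    "train_sizes": list(range(10, 51)),        # R = {10, 11, ..., 50}
    "sigma_N": 3,
    # Optimization
    "epochs": 200,
    "batches_per_epoch": 1000,
    "batch_size": 128,
    "learning_rate": 1e-3,
    "lr_decay": 0.96,
    # Combined local search
    "alpha": 0.5,
    "beta": 1.5,
    "gamma": 0.25,
    "local_search_iterations": 10,             # I
    # Model architecture
    "mlp_encoder_dims": (2, 128, 256, 128),    # input, hidden, hidden, output
    "gnn_hidden_dim": 128,                     # H
    "gnn_layers": 3,                           # n_GNN
    # Evaluation
    "random_test_sizes": [20, 50, 100, 200, 500, 1000],
    "tsplib_max_size": 1002,
}
```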

Appendix C: More Experiments on eMAGIC

More versions of eMAGIC

As promised in Sect. 6, we report the results of all versions of eMAGIC in this section; see Tables 7 and 8.

Table 7 Results of eMAGIC vs baselines, tested on 10,000 instances for TSP 20, 50, and 100
Table 8 Results of all versions of eMAGIC, tested on 128 instances for TSP 200, 500, and 1000

Variance Analysis of eMAGIC

As promised in Sect. 6, we provide the variance analysis of eMAGIC in this section. We repeated all experiments (training + testing) with 3 random seeds; the variances are shown in Table 9.

Table 9 Variance analysis of eMAGIC

The variances are quite low, showing that our method yields relatively good and stable results. Note that the variances increase with the number of samples (s1, s10, and S) for larger TSP instances, since there is more room for improvement.

Experiments on Extremely Large TSP Instances

In Table 10, we show the performance of eMAGIC on extremely large TSP instances (i.e., TSP 10,000) and compare it with the few methods that are able to generalize to TSP 10,000. For the hyperparameters of the combined local search, we use \(I=2\) and \(\beta =1.4\); we modify these hyperparameters because we find that, for such large TSP instances, running many iterations of local search is not efficient. The other experimental settings are the same as in Sect. 6. We also test the method of [12] with a limited time budget, denoted AGCRN+M\(^{lim}\): since we could only reproduce their method successfully on CPU, we decrease their hyperparameter T (the MCTS terminates after at most T seconds) from \(0.04n\) to \(0.04n / 1.8 \times (28/60) \approx 0.010n\) (1.8 h and 28 min are, respectively, the runtimes of their model and of eMAGIC(G)), as detailed below. With this adjustment, we expect their model and eMAGIC(G) to have similar running times.
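Explicitly, converting the 28-minute runtime to hours, the rescaled budget is obtained as

$$\begin{aligned} T = 0.04n \times \frac{28/60}{1.8} \approx 0.04n \times 0.26 \approx 0.010n. \end{aligned}$$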

Table 10 Results of eMAGIC vs baselines, tested on 16 instances for TSP 10,000

We observe that our models outperform AM [20] by a large margin. Compared to AGCRN+M [12] and LKH3 [15], our methods run much faster without a large gap in performance. Moreover, if we give AGCRN+M [12] a time budget comparable to ours (i.e., run AGCRN+M and eMAGIC(G) for the same amount of time), our algorithm outperforms AGCRN+M. Note that the MCTS in AGCRN+M is written in C++; we could reduce our runtime further if our code were also written in C++ instead of Python. Therefore, our method offers a better trade-off between performance and runtime: it generates relatively good results in much less time.

Appendix D: Evaluation of Equivariant Preprocessing Methods

Comparisons Between Different Symmetries Used During Preprocessing

We present comparison results for applying rotation, translation, and reflection during preprocessing. The comparisons are performed on random TSP instances, following the same settings as in the experiment section. As shown in Table 11, rotation achieves the best performance on both small and large TSP instances; we therefore choose rotation for our final algorithm.

Table 11 Comparisons between rotation, translation, and reflection
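To give a concrete picture of how rotation can be used as an equivariant preprocessing step, the sketch below applies a random rotation to each instance and rescales the coordinates back into the unit square; because a rotation followed by a uniform rescaling does not change the optimal visiting order, the model can be run on the transformed coordinates instead of the original ones. This is only a minimal sketch under our own assumptions (random angles, centroid-centered rotation, per-instance rescaling), not a verbatim description of the preprocessing used in eMAGIC.

```python
import math

import torch


def random_rotation_preprocess(coords):
    """Rotate each TSP instance by a random angle and rescale into [0, 1]^2.

    coords: tensor of shape (B, N, 2) with city coordinates in [0, 1]^2.
    The optimal tour of the returned instance visits the cities in the same
    order as the optimal tour of the original instance.
    """
    B = coords.shape[0]
    angle = 2 * math.pi * torch.rand(B, device=coords.device)     # one angle per instance
    cos, sin = torch.cos(angle), torch.sin(angle)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)  # (B, 2, 2) rotation matrices
    centered = coords - coords.mean(dim=1, keepdim=True)          # rotate around the centroid
    rotated = torch.einsum("bij,bnj->bni", rot, centered)
    # Rescale uniformly so the input distribution matches the training one.
    mins = rotated.amin(dim=1, keepdim=True)
    span = (rotated.amax(dim=1, keepdim=True) - mins).amax(dim=-1, keepdim=True)
    return (rotated - mins) / span.clamp_min(1e-9)
```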

Comparisons Between a Single Preprocessing Application and Iterated Preprocessing Applications

We present comparison results for a single preprocessing application versus iterated preprocessing applications. The comparisons are performed on random TSP instances, following the same settings as in the experiment section. As shown in Table 12, iterated preprocessing performs better on both small and large TSP instances; we therefore use iterated preprocessing in our final algorithm.

Table 12 Comparisons between iterated preprocessing and a single preprocessing application

Realistic TSP Instances in TSPLIB

As promised in Sect. 6, Tables 13, 14 and 15 list the performances of various learning-based TSP solvers and heuristic approaches on realistic TSP instances from TSPLIB. The bold numbers indicate the best performance among all approaches. Most bold numbers are obtained by eMAGIC, showing that our approach provides excellent results on TSPLIB. In addition, we provide results for the larger instances (with sizes up to 1002) from TSPLIB in Table 16. Table 16 only contains our model, since no other model generalizes to large TSP instances with adequate performance.

Table 13 Comparison of performances on TSPLIB—Part I
Table 14 Comparison of performances on TSPLIB—Part II
Table 15 Comparison of performances on TSPLIB—Part III
Table 16 Comparison of performances on large size TSP instances in TSPLIB

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ouyang, W., Wang, Y., Weng, P. et al. Generalization in Deep RL for TSP Problems via Equivariance and Local Search. SN COMPUT. SCI. 5, 369 (2024). https://doi.org/10.1007/s42979-024-02689-5

