Abstract
Deep reinforcement learning (RL) has proved to be a competitive heuristic for solving small-sized instances of traveling salesman problems (TSP), but its performance on larger-sized instances is insufficient. Since training on large instances is impractical, we design a novel deep RL approach with a focus on generalizability. Our proposal, a simple deep learning architecture trained with novel RL techniques, exploits two main ideas. First, we exploit equivariance to facilitate training. Second, we interleave efficient local search heuristics with the usual RL training to smooth the value landscape. Our experimental evaluation demonstrates that our method achieves state-of-the-art performance not only when generalizing to large random TSP instances, but also on realistic TSP instances. Moreover, an ablation study shows that all the components of our method contribute to its performance.
Notes
Our implementation takes advantage of GPU acceleration when possible. The source code will be shared after publication.
References
Applegate D, Bixby R, Chvátal V, et al. Concorde TSP solver. http://www.math.uwaterloo.ca/tsp/concorde; 2004.
Bai R, Chen X, Chen ZL, et al. Analytics and machine learning in vehicle routing research. Int J Prod Res. 2021;61(1):4–30. https://doi.org/10.1080/00207543.2021.2013566.
Battaglia PW, Hamrick JB, Bapst V, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261; 2018.
Bello I, Pham H, Le QV, et al. Neural combinatorial optimization with reinforcement learning. In: International conference on learning representations; 2016.
Cai Q, Hang W, Mirhoseini A, et al. Reinforcement learning driven heuristic optimization. In: DRL4KDD; 2019.
Cohen TS, Welling M. Group equivariant convolutional networks. In: International conference on machine learning; 2016. pp 2990–2999.
da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian conference on machine learning, 2020;pp 465–480.
da Costa P, Rhuggenaath J, Zhang Y, et al. Learning 2-opt heuristics for routing problems via deep reinforcement learning. SN Comput Sci. 2021;2:1–16. https://doi.org/10.1007/s42979-021-00779-2.
Dai H, Khalil EB, Zhang Y, et al. Learning combinatorial optimization algorithms over graphs. Adv Neural Inf Process Syst. 2017;30:6351–61.
Deudon M, Cournut P, Lacoste A, et al. Learning heuristics for the TSP by policy gradient. In: International conference on the integration of constraint programming, artificial intelligence, and operations research, vol 10848 LNCS. Springer Verlag. 2018; pp 170–181, https://doi.org/10.1007/978-3-319-93031-2_12
François-Lavet V, Henderson P, Islam R, et al. An introduction to deep reinforcement learning. Found Trends Mach Learn. 2018;11(3–4):219. https://doi.org/10.1561/2200000071.
Fu Z, Qiu K, Zha H. Generalize a small pre-trained model to arbitrarily large TSP instances. In: AAAI conference on artificial intelligence. 2021; pp 7474–7482. https://doi.org/10.1609/aaai.v35i8.16916
Gens R, Domingos PM. Deep symmetry networks. In: Advances in neural information processing systems. 2014.
Gerez SH. Algorithms for VLSI design automation, Wiley, chap Routing. 1999.
Helsgaun K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Technical report, Roskilde University; 2017.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Jones NC, Pevzner PA. An introduction to bioinformatics algorithms. MIT Press; 2004.
Joshi CK, Laurent T, Bresson X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv:1906.01227; 2019a.
Joshi CK, Laurent T, Bresson X. On learning paradigms for the travelling salesman problem. In: NeurIPS graph representation learning workshop, arXiv:1910.07210; 2019b.
Kool W, van Hoof H, Welling M. Attention, learn to solve routing problems! In: International conference on learning representations 2019.
Kwon YD, Choo J, Kim B, et al. POMO: policy optimization with multiple optima for reinforcement learning. Adv Neural Inf Process Syst. 2020;33:21188–98.
LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. Cambridge, MA, USA: MIT Press; 1998. p. 255–8.
Li Z, Chen Q, Koltun V. Combinatorial optimization with graph convolutional networks and guided tree search. Adv Neural Inf Process Syst. 2018;31:539–48.
Lisicki M, Afkanpour A, Taylor GW. Evaluating curriculum learning strategies in neural combinatorial optimization. In: NeurIPS workshop on learning meets combinatorial algorithms, arXiv:2011.06188; 2020.
Ma Q, Ge S, He D, et al. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: AAAI workshop on deep learning on graphs: methodologies and applications, arXiv:1911.04936; 2020.
Ouyang W, Wang Y, Han S, et al. Improving generalization of deep reinforcement learning-based TSP solvers. In: IEEE SSCI ADPRL, arXiv:2110.02843; 2021.
Papadimitriou CH. The Euclidean travelling salesman problem is NP-complete. Theor Comput Sci. 1977;4(3):237–44. https://doi.org/10.1016/0304-3975(77)90012-3.
Peng B, Wang J, Zhang Z. A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems. In: Artificial intelligence algorithms and applications. Springer, Singapore, Communications in Computer and Information Science, pp 636–650, https://doi.org/10.1007/978-981-15-5577-0_51 2020.
Perron L, Furnon V. Or-tools. https://developers.google.com/optimization/ 2019.
Prates MOR, Avelar PHC, Lemos H, et al. Learning to solve NP-complete problems: a graph neural network for decision TSP. In: AAAI conference on artificial intelligence; 2019. pp 4731–4738. https://doi.org/10.1609/aaai.v33i01.33014731.
Reinelt G. TSPLIB-a traveling salesman problem library. ORSA J Comput. 1991;3(4):376–84. https://doi.org/10.1287/ijoc.3.4.376.
Snyder L, Shen ZJ. Fundamentals of Supply Chain Theory, Wiley, chap The Traveling Salesman Problem, 2019;403–461.
Soviany P, Ionescu RT, Rota P, et al. Curriculum learning: a survey. Int J Comput Vis. 2021;130:1526–65. https://doi.org/10.1007/s11263-022-01611-x.
Sutton R, Barto A. Reinforcement learning: an introduction. MIT Press; 1998.
Vinyals O, Fortunato M, Jaitly N. Pointer networks. Adv Neural Inf Process Syst. 2015;28:2692–700.
Vo TQT, Nguyen VH, Weng P, et al. Improving subtour elimination constraint generation in branch-and-cut algorithms for the TSP with machine learning. In: Learning and intelligent optimization conference 2023.
Weinshall D, Cohen G, Amir D. Curriculum learning by transfer learning: theory and experiments with deep networks. In: International conference on machine learning; 2018. pp 5235–5243. https://doi.org/10.48550/arXiv.1802.03796.
Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8:229–56. https://doi.org/10.1007/BF00992696.
Wu Y, Song W, Cao Z, et al. Learning improvement heuristics for solving routing problems. IEEE Trans Neural Netw Learn Syst. 2021. https://doi.org/10.1109/TNNLS.2021.3068828.
Xing Z, Tu S. A graph neural network assisted Monte Carlo tree search approach to traveling salesman problem. IEEE Access. 2020;8:108418–28. https://doi.org/10.1109/ACCESS.2020.3000236.
Zheng J, He K, Zhou J, et al. Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem. In: AAAI conference on artificial intelligence; 2021. pp 12445–12452. https://doi.org/10.1609/aaai.v35i14.17476.
Funding
This work has been supported in part by the program of National Natural Science Foundation of China (No. 62176154) and the program of the Shanghai NSF (No. 19ZR1426700).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical Approval
Not applicable.
Consent to Participate
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Policy Gradient of eMAGIC
As promised in Sect. “Algorithm & Training” - Smoothed Policy Gradient, we elaborate on the detailed mathematical derivation for Equation (13).
The objective function \(J(\varvec{\theta })\) in Eq. (3) can be approximated with the empirical mean of the total rewards using B trajectories sampled with policy \(\pi _{\varvec{\theta }}\):
\[J(\varvec{\theta }) \approx \hat{{\mathbb {E}}}_B\Big [\sum _t r^{(b)}_{t}\Big ] = -\hat{{\mathbb {E}}}_B\big [L_\sigma ({\varvec{X}})\big ],\]
where \(\hat{{\mathbb {E}}}_B\) represents the empirical mean operator, \(L_\sigma ({\varvec{X}})\) is the tour length of \(\sigma\) output by the RL policy, and \(r^{(b)}_{t}\) is the t-th reward of the b-th trajectory. The policy gradient to optimize \(J(\varvec{\theta })\) can be estimated by:
\[\nabla _{\varvec{\theta }} J(\varvec{\theta }) \approx -\hat{{\mathbb {E}}}_B\big [L_\sigma ({\varvec{X}})\, \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}(\tau )\big ].\]
However, recall that \(J(\varvec{\theta })\) is the standard objective used in most deep RL methods applied to TSP. Instead, we optimize \(J^+(\varvec{\theta })= -{\mathbb {E}}[L_{\sigma _+}({\varvec{X}})]\), where \(L_{\sigma _+}({\varvec{X}})\) is the tour length of \(\sigma\) after applying local search. This better integrates RL and local search by smoothing the value landscape and training the RL agent to output tours that local search can improve. This new objective function can be rewritten:
\[J^+(\varvec{\theta }) = -{\mathbb {E}}_{\tau \sim \pi _{\varvec{\theta }}}\Big [{\mathbb {E}}_{\sigma _+ \sim \rho (\sigma )}\big [L_{\sigma _+}({\varvec{X}})\big ]\Big ],\]
where \(\rho (\sigma )\) denotes the distribution over tours induced by the application of the stochastic local search on \(\sigma\). Taking the gradient of this new objective:
\[\nabla _{\varvec{\theta }} J^+(\varvec{\theta }) = -{\mathbb {E}}_{\tau \sim \pi _{\varvec{\theta }}}\Big [{\mathbb {E}}_{\sigma _+ \sim \rho (\sigma )}\big [L_{\sigma _+}({\varvec{X}})\big ]\, \nabla _{\varvec{\theta }} \log \pi _{\varvec{\theta }}(\tau )\Big ],\]
where \(\tau = ({\varvec{s}}_1, a_1, \ldots )\) and \(\sigma\) is its associated tour. We approximate the inner conditional expectation over \(\rho (\sigma )\) by a single sample. Therefore, our gradient estimate is an unbiased estimator of the gradient of our new objective \(J^+(\varvec{\theta })\).
Using our policy rollout baseline introduces some bias into the estimation of the smoothed policy gradient. However, the variance reduction helps achieve better performance with smaller variance, as we observed in our experiments.
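To make the smoothed objective concrete, here is a minimal, self-contained sketch, not the paper's implementation: a tour is sampled from a toy per-step softmax policy (a single distance-sensitivity parameter `theta` stands in for the network), a greedy 2-opt sweep stands in for the stochastic local search \(\rho\), and the improved length \(L_{\sigma _+}\) replaces \(L_\sigma\) in a REINFORCE-style gradient estimate. All function names and the policy parameterization are illustrative.

```python
import numpy as np

def tour_length(coords, tour):
    """Total length of the closed tour visiting `coords` in the order `tour`."""
    pts = coords[tour]
    return float(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum())

def two_opt_pass(coords, tour):
    """One greedy 2-opt sweep, standing in for the stochastic local search rho."""
    tour = list(tour)
    n = len(tour)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            cand = tour[:i] + tour[i:j][::-1] + tour[j:]  # reverse segment [i, j)
            if tour_length(coords, cand) < tour_length(coords, tour):
                tour = cand
    return tour

def step_probs(coords, remaining, cur, theta):
    """Toy softmax policy preferring nearby remaining cities (illustrative only)."""
    d = np.linalg.norm(coords[remaining] - coords[cur], axis=1)
    logits = -theta * d
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_tour(coords, theta, rng):
    """Sample a tour from the toy policy; return it with its log-probability."""
    tour, logp = [0], 0.0
    remaining = list(range(1, len(coords)))
    while remaining:
        p = step_probs(coords, remaining, tour[-1], theta)
        k = int(rng.choice(len(remaining), p=p))
        logp += float(np.log(p[k]))
        tour.append(remaining.pop(k))
    return tour, logp

def log_prob(coords, tour, theta):
    """Log-probability of a fixed tour under the toy policy."""
    logp = 0.0
    remaining = list(range(1, len(coords)))
    for cur, nxt in zip(tour[:-1], tour[1:]):
        p = step_probs(coords, remaining, cur, theta)
        k = remaining.index(nxt)
        logp += float(np.log(p[k]))
        remaining.pop(k)
    return logp

rng = np.random.default_rng(0)
coords = rng.random((12, 2))
theta = 1.0
tour, logp = sample_tour(coords, theta, rng)
improved = two_opt_pass(coords, tour)      # sigma_+ : local search applied to sigma
reward = -tour_length(coords, improved)    # smoothed reward -L_{sigma_+}(X)
# REINFORCE-style smoothed gradient estimate; finite differences stand in
# for autograd in this one-parameter toy.
eps = 1e-5
grad = reward * (log_prob(coords, tour, theta + eps)
                 - log_prob(coords, tour, theta - eps)) / (2 * eps)
```

The only change from vanilla REINFORCE is which length multiplies the score function: the post-local-search length, which is exactly the single-sample approximation of the inner expectation above.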
Appendix B: Settings and Hyperparameters
All our experiments are run on a computer with an Intel(R) Xeon(R) E5-2678 v3 CPU and an NVIDIA 1080Ti GPU. Consistent with previous work, all our randomly generated TSP instances are sampled uniformly in \([0,1]^2\). During training, stochastic CL chooses a TSP size in \({\mathcal {R}}=\{10,11,\dots ,50\}\) for each epoch e according to Eqs. (14) and (15), with \(\sigma _N\) set to 3. We trained for 200 epochs in each experiment, with 1000 batches of 128 random TSP instances in each epoch. We set the learning rate to \(10^{-3}\) and the learning rate decay to 0.96 in each experiment. In each experiment, we set \(\alpha =0.5\), \(\beta =1.5\), \(\gamma =0.25\) and \(I=10\) for the parameters of our combined local search. For random TSP testing and the ablation study, we test on TSP instances with sizes 20, 50, 100, 200, 500, and 1000 to evaluate the generalization capability of our model; for realistic TSP, we test on TSP instances with sizes up to 1002 from the TSPLIB library. With respect to our model architecture, our MLP encoder has an input layer with dimension 2, two hidden layers with dimensions 128 and 256, respectively, and an output layer with dimension 128; for the GNN encoder, we set \(H=128\) and \(n_{GNN}=3\).
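As an illustration, the MLP encoder dimensions described above (2 → 128 → 256 → 128) can be sketched as a plain NumPy forward pass. The text does not specify the activation function, so the ReLU on hidden layers is an assumption, and the initialization is an illustrative choice.

```python
import numpy as np

def init_mlp(rng, sizes=(2, 128, 256, 128)):
    """Random weights for the encoder: input dim 2, hidden 128 and 256, output 128."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass; ReLU on hidden layers (an assumption), linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
params = init_mlp(rng)
cities = rng.random((50, 2))          # a TSP-50 instance sampled in [0, 1]^2
embeddings = mlp_forward(params, cities)  # one 128-dim embedding per city
```

Each city's 2D coordinates are thus mapped independently to a 128-dimensional embedding before being passed to the GNN encoder.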
Appendix C: More Experiments on eMAGIC
More versions of eMAGIC
As promised in Sect. 6, we report results for all versions of eMAGIC in this section; see Tables 7 and 8.
Variance Analysis of eMAGIC
As promised in Sect. 6, we provide the variance analysis of eMAGIC in this section. We repeated all our experiments (training + testing) with 3 random seeds; the variances are shown in Table 9.
The variances are quite low, showing that our method gives relatively good and stable results. Note that the variances increase with the number of samples (s1, s10, and S) for larger TSP instances, since there is more room for improvement.
Experiments on Extremely Large TSP Instances
In Table 10, we show the performance of eMAGIC on extremely large TSP instances (i.e., TSP 10,000) and compare it with the few methods that are able to generalize to TSP 10,000. For the hyperparameters of the combined local search, we use \(I=2\) and \(\beta =1.4\); we modify them because we find that, for large TSP instances, running too many iterations of local search is not efficient. The other experimental settings in this test are the same as in Sect. 6. We also test [12]'s method with a limited time budget, denoted by AGCRN+M\(^{lim}\): since we could only reproduce their method on CPU, we decrease their hyperparameter T (the MCTS runs for at most T seconds) from 0.04n to \(0.04n / 1.8 \times (28/60) \approx 0.010n\) (1.8 h and 28 min are, respectively, the runtimes of their model and eMAGIC(G)). With this choice, we expect their model and eMAGIC(G) to have similar running times.
We can observe that our models outperform AM [20] by a large margin. Compared to AGCRN+M [12] and LKH3 [15], our methods run much faster without a large performance gap. Moreover, if we give AGCRN+M [12] a time budget comparable to ours (i.e., run AGCRN+M and eMAGIC(G) for the same amount of time), our algorithm outperforms AGCRN+M. Note that the MCTS in AGCRN+M is written in C++; we could reduce our runtime if we had also written our code in C++ instead of Python. Therefore, our method offers a better trade-off between performance and runtime: it generates relatively good results in much less time.
Appendix D: Evaluation of Equivariant Preprocessing Methods
Comparisons Between Different Symmetries Used During Preprocessing
We present the comparison results of applying rotation, translation, and reflection. The comparisons are done by testing on random TSP instances following the same settings as in the experiment section. As shown in Table 11, rotation has the best performance on both small and large TSP instances; based on these results, we choose rotation for our final algorithm.
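As an illustration of why rotation is a safe preprocessing step, the sketch below (not the paper's exact procedure; the rotation angle and the centering about the centroid are illustrative choices) rotates an instance and checks that every tour keeps its length, so the optimal tour is unchanged and the policy only has to handle a canonical orientation.

```python
import numpy as np

def rotate_instance(coords, angle):
    """Rotate city coordinates about their centroid by `angle` radians."""
    c = coords.mean(axis=0)
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return (coords - c) @ R.T + c

def tour_length(coords, tour):
    """Total length of the closed tour visiting `coords` in the order `tour`."""
    pts = coords[tour]
    return float(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum())

rng = np.random.default_rng(0)
coords = rng.random((20, 2))
tour = list(rng.permutation(20))
rotated = rotate_instance(coords, np.pi / 3)
# Rotation is an isometry: pairwise distances, hence all tour lengths,
# are preserved, so the instance and its rotation share the same solutions.
assert np.isclose(tour_length(coords, tour), tour_length(rotated, tour))
```

Translation and reflection are isometries too; the experiments above only compare which of these symmetries helps training the most.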
Comparisons Between a Single Preprocessing Application and Iterated Preprocessing Applications
We present the comparison results of applying the preprocessing once versus iteratively. The comparisons are done by testing on random TSP instances following the same settings as in the experiment section. As shown in Table 12, iterated preprocessing applications perform better on both small and large TSP instances; based on these results, we choose iterated preprocessing applications for our final algorithm.
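One natural way to apply the preprocessing iteratively during tour construction is to renormalize the still-unvisited cities into the unit square at every decoding step, so the policy always sees inputs at the scale it was trained on. The sketch below only illustrates this idea; the exact iterated procedure in the paper may differ, and all names are illustrative.

```python
import numpy as np

def renormalize(coords):
    """Map a point set into [0, 1]^2, preserving the aspect ratio."""
    lo = coords.min(axis=0)
    span = float((coords.max(axis=0) - lo).max())
    return (coords - lo) / max(span, 1e-12)

rng = np.random.default_rng(1)
coords = rng.random((30, 2))
visited = [0, 7, 3]                      # cities chosen so far (illustrative)
remaining = [i for i in range(30) if i not in visited]
# At each construction step, the model could be fed the renormalized
# coordinates of the remaining cities instead of the raw ones.
normed = renormalize(coords[remaining])
```

Because the renormalization is a similarity transform, the relative geometry of the remaining subproblem is unchanged, only its scale and position.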
Realistic TSP Instances in TSPLIB
As promised in Sect. 6, Tables 13, 14 and 15 list the performances of various learning-based TSP solvers and heuristic approaches on realistic TSP instances from TSPLIB. The bold numbers show the best performance among all the approaches. We can observe that most bold numbers are provided by eMAGIC, meaning our approach provides excellent results on TSPLIB. In addition, we also provide our results for the larger instances (with sizes up to 1002) from TSPLIB in Table 16. Table 16 only contains the experiments with our model, since no other model can generalize to large TSP instances with adequate performance.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ouyang, W., Wang, Y., Weng, P. et al. Generalization in Deep RL for TSP Problems via Equivariance and Local Search. SN COMPUT. SCI. 5, 369 (2024). https://doi.org/10.1007/s42979-024-02689-5