1 Introduction

NP-hard combinatorial optimization problems are the family of integer-constrained optimization problems which are intractable to solve optimally at large scales. Robust approximation algorithms for popular problems have immense practical applications and are the backbone of modern industries. Among combinatorial problems, the 2D Euclidean Travelling Salesperson Problem (TSP) has been the most intensely studied NP-hard graph problem in the Operations Research (OR) community, with applications in logistics, genetics and scheduling [1]. TSP is intractable to solve optimally above thousands of nodes for modern computers [2]. In practice, the Concorde TSP solver [3] uses linear programming with carefully handcrafted heuristics to find solutions for instances of up to tens of thousands of nodes, but with prohibitive execution times (Footnote 1). Besides, the development of problem-specific OR solvers such as Concorde for novel or under-studied problems arising in scientific discovery [4] or computer architecture [5] requires significant time and specialized knowledge.

An alternative approach from the Machine Learning community is to develop generic learning algorithms which can be trained to solve any combinatorial problem directly from problem instances themselves [6,7,8]. Using classical problems such as TSP, Minimum Vertex Cover and Boolean Satisfiability as benchmarks, recent end-to-end approaches [9,10,11] leverage advances in graph representation learning [12,13,14,15] and have shown competitive performance with OR solvers on trivially small problem instances of up to a few hundred nodes. Once trained, approximate solvers based on Graph Neural Networks (GNNs) have significantly more favorable time complexity than their OR counterparts, making them highly desirable for real-time decision-making problems such as TSP and the associated class of Vehicle Routing Problems (VRPs).

1.1 Motivation

Scaling end-to-end approaches to practical and real-world instances is still an open question [8], as the training phase of state-of-the-art models on large graphs is extremely time-consuming. For graphs larger than a few hundred nodes, the gap between GNN-based solvers and simple non-learnt heuristics is especially evident for routing problems like TSP [16, 17].

Fig. 1

Computational challenges of learning large scale TSP. We compare three identical autoregressive GNN-based models trained on 12.8 Million TSP instances via reinforcement learning. We plot average optimality gap to the Concorde solver on 1,280 held-out TSP200 instances vs. number of training samples (left) and wall clock time (right) during the learning process. Training on large TSP200 from scratch is intractable and sample inefficient. Active Search [7], which learns to directly overfit to the 1,280 held-out samples, further demonstrates the computational challenge of memorizing very few TSP200 instances. Comparatively, learning efficiently from trivial TSP20-TSP50 allows models to better generalize to TSP200 in a zero-shot manner, indicating positive knowledge transfer from small to large graphs. Performance can further improve via rapid finetuning on 1.28 Million TSP200 instances or by Active Search. Within our computational budget, a simple non-learnt furthest insertion heuristic still outperforms all models. Precise experimental setup is described in Appendix A

As an illustration, Fig. 1 presents the computational challenge of learning TSP on 200-node graphs (TSP200) in terms of both sample efficiency and wall clock time. Surprisingly, it is difficult to outperform a simple insertion heuristic when directly training on 12.8 Million TSP200 samples for 500 hours on university-scale hardware.

We advocate for an alternative to expensive large-scale training: learning efficiently from trivially small TSP and transferring the learnt policy to larger graphs in a zero-shot fashion or via fast finetuning. Thus, identifying promising inductive biases, architectures and learning paradigms that enable such zero-shot generalization to large and more complex instances is a key concern for training practical solvers for real-world problems.

1.2 Contributions

Towards end-to-end learning of scale-invariant TSP solvers, we unify several state-of-the-art architectures and learning paradigms [16,17,18,19] into one experimental pipeline and provide the first principled investigation on zero-shot generalization to large instances. Our findings suggest that learning scale-invariant TSP solvers requires rethinking the status quo of neural combinatorial optimization to explicitly account for generalization:

  • The prevalent evaluation paradigm overshadows models’ poor generalization capabilities by measuring performance on fixed or trivially small TSP sizes.

  • Generalization performance of GNN aggregation functions and normalization schemes benefits from explicit redesigns which account for shifting graph distributions, and can be further boosted by enforcing regularities such as constant graph diameters when defining problems using graphs.

  • Autoregressive decoding enforces a sequential inductive bias which improves generalization over non-autoregressive models, but is costlier in terms of inference time.

  • Models trained with expert supervision are more amenable to post-hoc search, while reinforcement learning approaches scale better with more computation as they do not rely on labelled data.

Our framework and datasets are available online (Footnote 2). Additionally, we use our pipeline to characterize the recent state-of-the-art in deep learning for routing problems and provide new directions for future research.

2 Related work

Neural combinatorial optimization

In a recent survey, Bengio et al. [8] identified three broad approaches to leveraging machine learning for combinatorial optimization problems: learning alongside optimization algorithms [20,21,22], learning to configure optimization algorithms [23, 24], and end-to-end learning to approximately solve optimization problems, a.k.a. neural combinatorial optimization [6, 7].

State-of-the-art end-to-end approaches for TSP use Graph Neural Networks (GNNs) [12,13,14,15] and sequence-to-sequence learning [25] to construct approximate solutions directly from problem instances. Architectures for TSP can be classified as: (1) autoregressive approaches, which build solutions in a step-by-step fashion [9, 16, 19, 26,27,28]; and (2) non-autoregressive models, which produce the solution in one shot [17, 18, 29,30,31]. Models can be trained to imitate optimal solvers via supervised learning or by minimizing the length of TSP tours via reinforcement learning [32].

Other classical problems tackled by similar architectures include Vehicle Routing [33, 34], Maximum Cut [9], Minimum Vertex Cover [11], Boolean Satisfiability [10, 35], and Graph Coloring [36]. Using TSP as an illustration, we present a unified pipeline for characterizing neural combinatorial optimization architectures in Section 3.

Notably, TSP has emerged as a challenging testbed for neural combinatorial optimization. Whereas generalization to problem instances larger and more complex than those seen in training has at least partially been demonstrated on non-sequential problems such as SAT, MaxCut, and MVC [10, 11], the same architectures do not show strong generalization for TSP [16, 17].

Combinatorial optimization and GNNs

From the perspective of graph representation learning, algorithmic and combinatorial problems have recently been used to characterize the expressive power of GNNs [37, 38]. An emerging line of work on learning to execute graph algorithms [39, 40] has led to the development of provably more expressive GNNs [41] and improved understanding of their generalization capability [42, 43]. Towards tackling realistic and large-scale combinatorial problems, this paper aims to quantify the limitations of prevalent GNN architectures and learning paradigms via zero-shot generalization to problems larger than those seen during training.

Novel applications

Advances on classical combinatorial problems have shown promising results in downstream applications to novel or under-studied optimization problems in the physical sciences [4, 44] and computer architecture [5, 45, 46], where the development of exact solvers is expensive and intractable. For example, autoregressive architectures provide a strong inductive bias for device placement optimization problems [47, 48], while non-autoregressive models [49] are competitive with autoregressive approaches [50, 51] for molecule generation tasks.

3 Neural combinatorial optimization pipeline

NP-hard problems can be formulated as sequential decision making tasks on graphs due to their highly structured nature. Towards a controlled study of neural combinatorial optimization for TSP, we unify recent ideas [16,17,18,19] via a five stage end-to-end pipeline illustrated in Fig. 2. Our discussion focuses on TSP, but the pipeline presented is generic and can be extended to characterize modern architectures for several NP-hard graph problems.

Fig. 2

End-to-end neural combinatorial optimization pipeline: The entire model is trained end-to-end via imitating an optimal solver (i.e. supervised learning) or through minimizing a cost function (i.e. reinforcement learning)

3.1 Problem definition

The 2D Euclidean TSP is defined as follows: “Given a set of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?” Formally, given a fully-connected input graph of n cities (nodes) in the two dimensional unit square \(S = \{x_{i}\}_{i=1}^{n}\) where each \(x_{i} \in {[0,1]}^{2}\), we aim to find a permutation of the nodes π, termed a tour, that visits each node once and has the minimum total length, defined as:

$$L(\pi | s) = \|x_{\pi_{n}} - x_{\pi_{1}}\|_{2} + \sum\limits_{i=1}^{n-1} \|x_{\pi_{i}} - x_{\pi_{i+1}}\|_{2},$$
(1)

where \(\|\cdot\|_{2}\) denotes the \(\ell_{2}\) norm.
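As a concrete reading of Eq. (1), the following minimal sketch computes the length of a closed tour with NumPy. The function name `tour_length` and the toy unit-square instance are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def tour_length(coords: np.ndarray, tour: np.ndarray) -> float:
    """Length of the closed tour `tour` (a permutation of node indices), as in Eq. (1)."""
    ordered = coords[tour]                 # node coordinates in visiting order
    nxt = np.roll(ordered, -1, axis=0)     # successor of each node; the last node wraps to the first
    return float(np.linalg.norm(ordered - nxt, axis=1).sum())

# Toy instance (assumption): the four corners of the unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
perimeter_tour = np.array([0, 1, 2, 3])    # walks around the square
crossing_tour = np.array([0, 2, 1, 3])     # uses both diagonals

print(tour_length(square, perimeter_tour))
print(tour_length(square, crossing_tour))
```

The perimeter tour is shorter than the crossing tour, which uses both diagonals of the square.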

Graph sparsification

Classically, TSP is defined on fully-connected graphs, see Fig. 2(b). Graph sparsification heuristics based on k-nearest neighbors aim to reduce TSP graphs, enabling models to scale up to large instances where pairwise computation for all nodes is intractable [9] or learn faster by reducing the search space [17]. Notably, problem-specific graph reduction techniques have proven effective for out-of-distribution generalization to larger graphs for other NP-hard problems such as MVC and SAT [11].
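A k-nearest-neighbor sparsification of this kind might be sketched as follows. The function name `knn_adjacency`, the directed (non-symmetrized) adjacency and the choice of k = 5 are assumptions made for illustration; the original implementation may differ.

```python
import numpy as np

def knn_adjacency(coords: np.ndarray, k: int) -> np.ndarray:
    """Boolean matrix with adj[i, j] = True iff j is among the k nearest neighbors of i."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # a node is never its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]         # indices of the k closest nodes per row
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    return adj

rng = np.random.default_rng(0)
coords = rng.random((50, 2))                       # a random TSP50 instance in the unit square
adj = knn_adjacency(coords, k=5)
print(adj.sum(), "directed edges instead of", 50 * 49)
```

Each node keeps exactly k outgoing edges, so the number of edges grows linearly rather than quadratically with the number of nodes.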

Fixed size vs. variable size graphs

Most work on learning for TSP has focused on training with a fixed graph size [16, 17], likely due to ease of implementation. Learning from multiple graph sizes naturally enables better generalization within training size ranges, but its impact on generalization to larger TSP instances remains to be analyzed.

3.2 Graph embedding

A Graph Neural Network (GNN) encoder computes d-dimensional representations for each node in the input TSP graph, see Fig. 2(c). At each layer, nodes gather features from their neighbors to represent local graph structure via recursive message passing [13]. Stacking L layers allows the network to build representations from the L-hop neighborhood of each node. Let \(h_{i}^{\ell }\) and \(e_{ij}^{\ell }\) denote respectively the node and edge feature at layer \(\ell\) associated with node i and edge ij. We define the feature at the next layer via an anisotropic message passing scheme using an edge gating mechanism [52]:

$$\begin{array}{@{}rcl@{}} h_{i}^{\ell+1} &=& h_{i}^{\ell} + \text{ReLU} \Big(\textsc{Norm} \Big(U^{\ell} h_{i}^{\ell} + \textsc{Aggr}_{j \in \mathcal{N}_{i}} \Big(\sigma(e_{ij}^{\ell}) \odot V^{\ell} h_{j}^{\ell} \Big) \Big)\Big) \ , \end{array}$$
(2)
$$\begin{array}{@{}rcl@{}} e_{ij}^{\ell+1} &=& e^{\ell}_{ij} + \text{ReLU}\Big(\textsc{Norm} \Big(A^{\ell} e_{ij}^{\ell} + B^{\ell} h^{\ell}_{i} + C^{\ell} h^{\ell}_{j} \Big)\Big) , \end{array}$$
(3)

where \(U^{\ell }, V^{\ell }, A^{\ell }, B^{\ell }, C^{\ell } \in \mathbb {R}^{d \times d}\) are learnable parameters, Norm denotes the normalization layer (BatchNorm [53], LayerNorm [54]), Aggr represents the neighborhood aggregation function (Sum, Mean or Max), σ is the sigmoid function, and ⊙ is the Hadamard product. As inputs \(h_{i}^{\ell =0}\) and \(e_{ij}^{\ell =0}\), we use d-dimensional linear projections of the node coordinate \(x_{i}\) and the Euclidean distance \(\|x_{i} - x_{j}\|_{2}\), respectively.
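To make Eqs. (2) and (3) concrete, here is a minimal NumPy sketch of a single layer on a dense graph. Several details are assumptions for illustration: Norm is a parameter-free LayerNorm, features are stored as row vectors (so `h @ U` plays the role of \(U^{\ell} h_{i}^{\ell}\)), weights are random, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the feature dimension (assumed stand-in for Norm).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_gnn_layer(h, e, adj, params, aggr="max"):
    """One layer of Eqs. (2)-(3). h: (n, d), e: (n, n, d), adj: (n, n) bool."""
    U, V, A, B, C = (params[name] for name in "UVABC")
    msg = sigmoid(e) * (h @ V)[None, :, :]        # msg[i, j] = sigma(e_ij) * V h_j
    mask = adj[:, :, None]
    if aggr == "max":
        agg = np.where(mask, msg, -np.inf).max(axis=1)
    elif aggr == "mean":
        agg = np.where(mask, msg, 0.0).sum(axis=1) / adj.sum(axis=1, keepdims=True)
    else:                                         # "sum"
        agg = np.where(mask, msg, 0.0).sum(axis=1)
    h_next = h + np.maximum(0.0, layer_norm(h @ U + agg))                     # Eq. (2)
    edge_in = e @ A + (h @ B)[:, None, :] + (h @ C)[None, :, :]
    e_next = e + np.maximum(0.0, layer_norm(edge_in))                         # Eq. (3)
    return h_next, e_next

rng = np.random.default_rng(0)
n, d = 5, 4
h = rng.normal(size=(n, d))
e = rng.normal(size=(n, n, d))
adj = ~np.eye(n, dtype=bool)                      # fully connected, no self loops
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in "UVABC"}
h_next, e_next = gated_gnn_layer(h, e, adj, params, aggr="max")
print(h_next.shape, e_next.shape)
```

Because of the residual connection around a ReLU, each updated feature is at least as large as its input, a property the sketch preserves.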

Anisotropic aggregation

We make the aggregation function anisotropic or directional via a dense attention mechanism which scales the neighborhood features \(h_{j}, \forall j \in \mathcal {N}_{i},\) using edge gates \(\sigma(e_{ij})\). Anisotropic and attention-based GNNs such as Graph Attention Networks [14], Transformers [55, 56], and Gated Graph ConvNets [52] have been shown to outperform isotropic Graph ConvNets [12] across several challenging domains [57], including TSP [16, 17].

3.3 Solution decoding

Non-autoregressive decoding (NAR)

Consider TSP as a link prediction task: each edge may or may not belong to the optimal TSP solution, independently of the other edges [18]. We define the edge predictor as a two-layer MLP on the node embeddings produced by the final GNN encoder layer L, following Joshi et al. [17], see Fig. 2(d). For adjacent nodes i and j, we compute the unnormalized edge logits:

$$\hat{p}_{ij} = W_{2} \Big(\text{ReLU} \big(W_{1} \big(\left[ h_{G}, {h_{i}^{L}}, {h_{j}^{L}} \right] \big) \big) \Big), \ \text{where} \ h_{G} = \frac{1}{n} \sum\limits_{k=1}^{n} {h_{k}^{L}},$$
(4)

\(W_{1} \in \mathbb {R}^{3d \times d}, W_{2} \in \mathbb {R}^{d \times 2}\), and [⋅,⋅,⋅] is the concatenation operator. The logits \(\hat {p}_{ij}\) are converted to probabilities \(p_{ij}\) over each edge via a softmax.

Autoregressive decoding (AR)

Although NAR decoders are fast as they produce predictions in one shot, they ignore the sequential ordering of TSP tours. Autoregressive decoders, based on attention [16, 19] or recurrent neural networks [6, 26], explicitly model this sequential inductive bias through step-by-step graph traversal. We follow the attention decoder from Kool et al. [16], which starts from a random node and outputs a probability distribution over its neighbors at each step. Greedy search is used to perform the traversal over n time steps and masking enforces constraints such as not visiting previously visited nodes.

At time step t at node i, the decoder builds a context \(\hat {h}_{i}^{C}\) for the partial tour \(\pi ^{\prime }_{t^{\prime }}\), generated at time \(t^{\prime } < t\), by packing together the graph embedding \(h_{G}\) and the embeddings of the first and last node in the partial tour: \(\hat {h}_{i}^{C} = W_{C} \left [ h_{G}, h^{L}_{\pi ^{\prime }_{t-1}}, h^{L}_{\pi ^{\prime }_{1}} \right ],\) where \(W_{C} \in \mathbb {R}^{3d \times d}\) and learned placeholders are used for \(h^{L}_{\pi ^{\prime }_{t-1}}\) and \(h^{L}_{\pi ^{\prime }_{1}}\) at t = 1. The context \(\hat {h}_{i}^{C}\) is then refined via a standard Multi-Head Attention (MHA) operation [55] over the node embeddings:

$${h_{i}^{C}} = \text{MHA} \left( Q=\hat{h}_{i}^{C}, K=\left\{{h^{L}_{1}},\dots,{h^{L}_{n}}\right\}, V=\left\{{h^{L}_{1}},\dots,{h^{L}_{n}}\right\} \right),$$
(5)

where Q, K, V are inputs to the M-headed MHA (M = 8). The unnormalized logits for each edge \(e_{ij}\) are computed via a final attention mechanism between the context \({h_{i}^{C}}\) and the embedding \(h_{j}^{L}\):

$$\hat{p}_{ij} = \begin{cases} C \cdot \tanh \left( \frac{ \left( W_{Q} {h_{i}^{C}} \right)^{T} \cdot \left( W_{K} {h^{L}_{j}} \right) }{\sqrt{d}} \right) & \text{if } j \neq \pi_{t^{\prime}} \quad \forall t^{\prime} < t \\ -\infty & \text{otherwise}. \end{cases}$$
(6)

The \(\tanh\) is used to maintain the value of the logits within [−C, C] (C = 10) [7]. The logits \(\hat {p}_{ij}\) at the current node i are converted to probabilities \(p_{ij}\) via a softmax over all edges.
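The clipped and masked logits of Eq. (6), followed by the softmax, can be sketched as below. The helper names, the random weights and the row-vector convention (`context @ W_Q`) are assumptions for illustration; the refined context from Eq. (5) is replaced by a random vector.

```python
import numpy as np

def masked_logits(context, node_emb, W_Q, W_K, visited, clip=10.0):
    """Eq. (6): tanh-clipped attention logits, with visited nodes set to -inf."""
    d = node_emb.shape[1]
    q = context @ W_Q                                # query from the context, shape (d,)
    k = node_emb @ W_K                               # one key per node, shape (n, d)
    logits = clip * np.tanh(k @ q / np.sqrt(d))      # values lie in [-clip, clip]
    return np.where(visited, -np.inf, logits)

def masked_softmax(logits):
    # Assumes at least one node is still unvisited (one finite logit).
    z = np.exp(logits - logits[np.isfinite(logits)].max())
    return z / z.sum()

rng = np.random.default_rng(0)
n, d = 6, 8
node_emb = rng.normal(size=(n, d))                   # stand-in for final-layer embeddings
context = rng.normal(size=d)                         # stand-in for the refined context
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
visited = np.array([True, False, False, True, False, False])

logits = masked_logits(context, node_emb, W_Q, W_K, visited)
probs = masked_softmax(logits)
print(np.round(probs, 3))
```

Masked nodes receive exactly zero probability, so sampling or greedy selection from `probs` can never revisit a node.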

Inductive biases

NAR approaches, which make predictions over edges independently of one another, have shown strong out-of-distribution generalization for non-sequential problems such as SAT and MVC [11]. On the other hand, AR decoders come with the sequential/tour constraint built-in and are the default choice for routing problems [16]. Although both approaches have shown close to optimal performance on fixed and small TSP sizes under different experimental settings, it is important to fairly compare which inductive biases are most useful for generalization.

3.4 Solution search

Greedy search

For AR decoding, the predicted probabilities at node i are used to select the edge to travel along at the current step, either by sampling from the probability distribution \(p_{i}\) or by greedily selecting the most probable edge \(p_{ij}\), i.e. greedy search. Since NAR decoders directly output probabilities over all edges independently of one another, we can obtain valid TSP tours using greedy search to traverse the graph starting from a random node and masking previously visited nodes. Thus, the probability of a partial tour \(\pi ^{\prime }\) can be formulated as \(p(\pi ^{\prime }) = {\prod }_{j^{\prime } \sim i^{\prime } \in \pi ^{\prime }} p_{i^{\prime }j^{\prime }}\), where each node \(j^{\prime }\) follows node \(i^{\prime }\).
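Greedy traversal with masking over a matrix of edge probabilities might look like the following sketch. The 4-node probability matrix is a made-up example, and the start node is fixed to 0 rather than drawn at random so that the result is reproducible.

```python
import numpy as np

def greedy_tour(probs: np.ndarray, start: int = 0) -> list:
    """Follow the most probable unvisited edge at each step."""
    n = probs.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    tour = [start]
    for _ in range(n - 1):
        row = np.where(visited, -np.inf, probs[tour[-1]])   # mask visited nodes
        nxt = int(np.argmax(row))
        visited[nxt] = True
        tour.append(nxt)
    return tour

# Hypothetical edge probabilities p_ij (each row sums to 1).
probs = np.array([
    [0.0, 0.7, 0.1, 0.2],
    [0.6, 0.0, 0.3, 0.1],
    [0.1, 0.5, 0.0, 0.4],
    [0.3, 0.3, 0.4, 0.0],
])

tour = greedy_tour(probs, start=0)
# Probability of the tour as the product of the chosen edge probabilities.
tour_prob = float(np.prod([probs[i, j] for i, j in zip(tour[:-1], tour[1:])]))
print(tour, tour_prob)
```

At node 1 the most probable edge leads back to node 0, which is already visited; masking forces the traversal on to node 2 instead.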

Beam search and sampling

During inference, we can increase the capacity of greedy search via limited width breadth-first beam search, which maintains the b most probable tours during decoding. Similarly, we can sample b solutions from the learnt policy and select the shortest tour among them. Naturally, searching longer, with more sophisticated techniques, or sampling more solutions allows trading off run time for solution quality. However, it has been noted that using large b for search/sampling or local search during inference may overshadow an architecture’s inability to generalize [58]. To better understand generalization, we focus on using greedy search and beam search/sampling with small b = 128.
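The sampling variant (draw b tours from the policy, keep the shortest) can be sketched as follows. The policy here is a hypothetical stand-in that simply prefers nearby nodes, not a trained model, and all function names are illustrative.

```python
import numpy as np

def tour_length(coords, tour):
    ordered = coords[np.asarray(tour)]
    return float(np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1).sum())

def sample_tour(probs, rng, start=0):
    """Sample a tour step by step from the masked, renormalized edge probabilities."""
    n = probs.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    tour = [start]
    for _ in range(n - 1):
        p = np.where(visited, 0.0, probs[tour[-1]])
        p = p / p.sum()                              # renormalize over unvisited nodes
        nxt = int(rng.choice(n, p=p))
        visited[nxt] = True
        tour.append(nxt)
    return tour

def best_of_b(coords, probs, b, rng):
    """Sample b tours and return the shortest one with its length."""
    tours = [sample_tour(probs, rng) for _ in range(b)]
    lengths = [tour_length(coords, t) for t in tours]
    best = int(np.argmin(lengths))
    return tours[best], lengths[best]

rng = np.random.default_rng(0)
coords = rng.random((10, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
probs = np.exp(-5.0 * dist)                          # stand-in policy: closer nodes are more likely
np.fill_diagonal(probs, 0.0)
probs /= probs.sum(axis=1, keepdims=True)

tour_1, len_1 = best_of_b(coords, probs, 1, np.random.default_rng(1))
tour_128, len_128 = best_of_b(coords, probs, 128, np.random.default_rng(1))
print(len_1, len_128)
```

With the same random seed, the first of the 128 samples coincides with the single sample, so the best-of-128 tour can never be longer. This illustrates how a larger b trades run time for solution quality.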

3.5 Policy learning

Supervised learning

Models can be trained end-to-end via imitating an optimal solver at each step (i.e. supervised learning). For models with NAR decoders, the edge predictions are linked to the ground-truth TSP tour by minimizing the binary cross-entropy loss for each edge [17]. For AR architectures, at each step, we minimize the cross-entropy loss between the predicted probability distribution over all edges leaving the current node and the next node from the groundtruth tour, following Vinyals et al. [6]. We use teacher-forcing to stabilize training [59].

Reinforcement learning

Reinforcement learning is an elegant alternative in the absence of groundtruth solutions, as is often the case for understudied combinatorial problems. Models can be trained by minimizing problem-specific cost functions (the tour length in the case of TSP) via policy gradient algorithms [7, 16] or Q-Learning [9]. We focus on policy gradient methods due to their simplicity, and define the loss for an instance s and a model with parameters \(\theta\) as \({\mathcal{L}}(\theta | s) = \mathbb {E}_{p_{\theta }(\pi | s)}\left [L(\pi )\right ]\), the expectation of the tour length L(π), where \(p_{\theta }(\pi | s)\) is the probability distribution from which we sample to obtain the tour π|s. We use the REINFORCE gradient estimator [60] to minimize \({\mathcal{L}}\):

$$\nabla \mathcal{L}(\theta | s) = \mathbb{E}_{p_{\theta}(\pi | s)}\left[\left( L(\pi) - b(s)\right) \nabla \log p_{\theta}(\pi | s)\right],$$
(7)

where the baseline b(s) reduces gradient variance. Our experiments compare standard critic network baselines [7, 19] and the greedy rollout baseline proposed by Kool et al. [16].
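The estimator in Eq. (7) can be checked numerically on a deliberately tiny toy problem. As a loudly simplified assumption, the policy below is a softmax over three enumerated candidate tours with made-up lengths, rather than an autoregressive decoder, and the baseline is the exact expected length rather than a critic or greedy rollout.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(theta, lengths, baseline, num_samples, rng):
    """Monte Carlo estimate of Eq. (7) for a categorical policy softmax(theta)."""
    p = softmax(theta)
    sampled = rng.choice(len(theta), size=num_samples, p=p)
    grad_log_p = np.eye(len(theta))[sampled] - p     # gradient of log softmax at each sample
    advantage = lengths[sampled] - baseline          # L(pi) - b(s)
    return (advantage[:, None] * grad_log_p).mean(axis=0)

theta = np.zeros(3)                                  # uniform policy over 3 candidate tours
lengths = np.array([4.0, 4.8, 5.5])                  # hypothetical tour lengths
p = softmax(theta)
exact = p * (lengths - p @ lengths)                  # closed-form gradient of E[L]

rng = np.random.default_rng(0)
est_no_baseline = reinforce_gradient(theta, lengths, 0.0, 200_000, rng)
est_baseline = reinforce_gradient(theta, lengths, p @ lengths, 200_000, rng)
print(exact)
print(est_no_baseline)
print(est_baseline)
```

Both estimates target the same gradient, since subtracting a baseline that does not depend on the sampled tour leaves the expectation unchanged; the baseline only reduces the variance of the estimate.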

4 Experimental setup

We design controlled experiments to probe the unified pipeline described in Section 3 in order to identify inductive biases, architectures and learning paradigms that promote zero-shot generalization. We focus on learning efficiently from small problem instances (TSP20-50) and measure generalization to a wider range of sizes, including large instances which are intractable to learn from (e.g. TSP200). Each experiment starts with a ‘base’ model configuration and ablates the impact of a specific component of the five-stage pipeline. We aim to fairly compare state-of-the-art ideas in terms of model capacity and training data, and expect models with good inductive biases for TSP to: (1) learn trivially small TSPs without hundreds of millions of training samples and model parameters; and (2) generalize reasonably well across smaller and larger instances than those seen in training.

To quantify ‘good’ generalization, we additionally evaluate our models against a simple, non-learnt furthest insertion heuristic baseline, which constructively builds a partial tour \(\pi ^{\prime }\) by inserting node i between tour nodes \(j_{1}, j_{2} \in \pi ^{\prime }\) such that the distance from node i to its nearest node j1 is maximized. Kool et al. [16] provide a detailed description of insertion heuristic baselines.
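A minimal sketch of a furthest insertion heuristic is given below. It follows the common textbook variant (start from node 0, pick the unvisited node furthest from the tour, insert it where it increases tour length the least); these details are assumptions and may differ from the exact baseline implementation used in the experiments.

```python
import numpy as np

def furthest_insertion(coords: np.ndarray) -> list:
    n = len(coords)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    tour = [0]                                       # assumed start node
    remaining = set(range(1, n))
    while remaining:
        rem = np.array(sorted(remaining))
        # Distance from each remaining node to its nearest node in the partial tour.
        nearest = dist[np.ix_(rem, tour)].min(axis=1)
        i = int(rem[np.argmax(nearest)])             # the furthest such node
        # Insert i between consecutive tour nodes (a, b) at the cheapest position.
        best_pos, best_cost = 1, np.inf
        for pos in range(len(tour)):
            a, b = tour[pos], tour[(pos + 1) % len(tour)]
            cost = dist[a, i] + dist[i, b] - dist[a, b]
            if cost < best_cost:
                best_pos, best_cost = pos + 1, cost
        tour.insert(best_pos, i)
        remaining.remove(i)
    return tour

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
square_tour = furthest_insertion(square)

rng = np.random.default_rng(0)
random_tour = furthest_insertion(rng.random((30, 2)))
print(square_tour, len(random_tour))
```

On the unit-square corners the heuristic recovers the perimeter tour, which is optimal for that instance.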

Training datasets

We perform ablation studies of each component of the pipeline by training on variable TSP20-50 graphs for rapid experimentation. We also compare to learning from fixed graph sizes up to TSP100. Each TSP instance consists of n nodes sampled uniformly in the unit square \(S = \{x_{i}\}_{i=1}^{n}\) and \(x_{i} \in {[0,1]}^{2}\). In the supervised learning paradigm, we generate a training set of 1,280,000 TSP samples and groundtruth tours using the optimal Concorde solver as an oracle. Models are trained using the Adam optimizer for 10 epochs with a batch size of 128 and a fixed learning rate of 1e-4. For reinforcement learning, models are trained for 100 epochs on 128,000 TSP samples which are randomly generated for each epoch (without optimal solutions) with the same batch size and learning rate. Thus, both learning paradigms see 12,800,000 TSP samples in total. Considering that TSP20-50 are trivial in terms of complexity as they can be solved by simpler non-learnt heuristics, training good solvers at this scale should ideally not require billions of instances.

Model hyperparameters

For models with AR decoders, we use 3 GNN encoder layers followed by the attention decoder head, setting hidden dimension d = 128. For NAR models, we use the same hidden dimension and opt for 4 GNN encoder layers followed by the edge predictor. This results in approximately 350,000 trainable parameters for each model, irrespective of decoder type. Unless specified, most experiments use our best model configuration: AR decoding scheme and Graph ConvNet encoder with Max aggregation and BatchNorm (with batch statistics). All models are trained via supervised learning except when comparing learning paradigms.

Evaluation

We compare models on a held-out test set of 25,600 TSPs, consisting of 1,280 samples each of TSP10, TSP20, \(\dots\), TSP200. Our evaluation metric is the optimality gap w.r.t. the Concorde solver, i.e. the average percentage ratio of predicted tour lengths relative to optimal tour lengths. To compare design choices among identical models, we plot line graphs of the optimality gap as TSP size increases (along with a 99%-ile confidence interval) using beam search with a width of 128. Compared to previous work which evaluated on fixed problem sizes, our protocol identifies not only those models that perform well on training sizes, but also those that generalize better than non-learnt heuristics for large instances which are intractable to train on.
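For concreteness, the metric might be computed as in the sketch below. We assume the common convention that the gap is the average percentage by which predicted tours exceed the optimal length; the function name and the two example lengths are made up for illustration.

```python
import numpy as np

def optimality_gap(pred_lengths, opt_lengths) -> float:
    """Average percentage excess of predicted tour lengths over optimal tour lengths."""
    pred = np.asarray(pred_lengths, dtype=float)
    opt = np.asarray(opt_lengths, dtype=float)
    return float(np.mean(pred / opt - 1.0) * 100.0)

# Hypothetical lengths for two instances: 5% and 10% above optimal.
gap = optimality_gap([4.2, 5.5], [4.0, 5.0])
print(f"{gap:.2f}%")
```

Averaging per-instance ratios, rather than dividing total predicted length by total optimal length, gives every instance equal weight regardless of its tour length.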

Fig. 3

Learning from various TSP sizes. The prevalent protocol of evaluation on training sizes overshadows brittle out-of-distribution performance to larger and smaller graphs

Fig. 4

Impact of graph sparsification. Maintaining a constant graph diameter across TSP sizes leads to better generalization on larger problems than using full graphs

5 Results

5.1 Does learning from variable sizes help generalization?

We train five identical models on fully connected graphs of instances from TSP20, TSP50, TSP100, TSP200 and variable TSP20-50. The line plots of optimality gap across TSP sizes in Fig. 3 indicate that learning from variable TSP sizes helps models retain performance across the range of graph sizes seen during training (TSP20-50). Compared to training solely on the maximum-sized instances (TSP50), variable graph training leads to marginal gains on small instances but, somewhat counter-intuitively, does not enable better generalization to larger problems. Models trained on small TSP20 are unable to generalize to large sizes, while TSP100 models generalize poorly to trivially easy sizes, suggesting that the prevalent protocol of evaluation on training sizes [16,

Fig. 5

Impact of GNN aggregation functions. For larger graphs, aggregators that are agnostic to node degree (Mean, Max) are able to outperform theoretically more expressive aggregators

Fig. 6

Impact of normalization schemes. Modifying BatchNorm to account for changing graph statistics across graph sizes leads to better generalization

5.3 What is the relationship between GNN aggregation functions and normalization layers?

In Fig. 5, we compare identical models with anisotropic Sum, Mean and Max aggregation functions. As baselines, we consider the Transformer encoder on full graphs [16, 19] as well as a structure-agnostic MLP on each node, which can be instantiated by not using any aggregation function in (2), i.e. \(h_{i}^{\ell +1} = h_{i}^{\ell } + \text {ReLU} \left (\textsc {Norm} \left (U^{\ell } h_{i}^{\ell } \right ) \right )\).

We find that the choice of GNN aggregation function does not have an impact when evaluating models within the training size range TSP20-50. As we tackle larger graphs, GNNs with aggregation functions that are agnostic to node degree (Mean and Max) are able to outperform Transformers and MLPs. Importantly, the theoretically more expressive Sum aggregator [61] generalizes worse than structure-agnostic MLPs, as it cannot handle the distribution shift in node degree and neighborhood statistics across graph sizes, leading to unstable or exploding node embeddings [39]. We use the Max aggregator in further experiments, as it generalizes well for both AR and NAR decoders (not shown).

We also experiment with the following normalization schemes: (1) standard BatchNorm, which learns mean and variance from training data; (2) BatchNorm with batch statistics; and (3) LayerNorm, which normalizes along the embedding dimension instead of across the batch. Figure 6 indicates that BatchNorm with batch statistics and LayerNorm are able to better account for changing statistics across different graph sizes. Standard BatchNorm generalizes worse than not doing any normalization, thus our other experiments use BatchNorm with batch statistics.
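The difference between the three schemes at inference time can be illustrated with a small NumPy sketch. The learnable affine parameters are omitted, and the shift between "training" and "test" embeddings is synthetic and chosen arbitrarily to mimic changing statistics across graph sizes; it is not measured data.

```python
import numpy as np

def batchnorm_running(x, running_mean, running_var, eps=1e-5):
    # Standard BatchNorm at inference: uses statistics stored during training.
    return (x - running_mean) / np.sqrt(running_var + eps)

def batchnorm_batch_stats(x, eps=1e-5):
    # BatchNorm with batch statistics: recomputes mean and variance on the current batch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layernorm(x, eps=1e-5):
    # LayerNorm: normalizes each embedding along its own feature dimension.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
train_emb = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))   # synthetic "small graph" embeddings
test_emb = rng.normal(loc=5.0, scale=3.0, size=(1000, 16))    # synthetic shifted "large graph" embeddings

running_mean, running_var = train_emb.mean(axis=0), train_emb.var(axis=0)
out_running = batchnorm_running(test_emb, running_mean, running_var)
out_batch = batchnorm_batch_stats(test_emb)
out_layer = layernorm(test_emb)

print(out_running.mean(), out_batch.mean(), out_layer.mean())
```

Under this synthetic shift, stored training statistics leave the test embeddings far from zero mean, whereas batch statistics and LayerNorm re-center them by construction.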

We further dissect the relationship between graph representations and normalization in Appendix D, confirming that poor performance on large graphs can be explained by unstable representations due to the choice of aggregation and normalization schemes. Using Max aggregators and BatchNorm with batch statistics are temporary hacks to overcome the failure of the current architectural components. Overall, our results suggest that inference beyond training sizes will require the development of expressive GNN mechanisms that are able to leverage global graph topology [62] while being invariant to distribution shifts in terms of node degree and other graph statistics [63].

Fig. 7

Comparing AR and NAR decoders. Sequential AR decoding is a powerful inductive bias for TSP as it enables significantly better generalization, even in the absence of graph structure (MLP encoders)

Fig. 8

Inference time for various decoders. One-shot NAR decoding is significantly faster than sequential AR, especially when re-embedding the graph at each decoding step [9]

5.4 Which decoder has a better inductive bias for TSP?

Figure 7 compares NAR and AR decoders for identical models. To isolate the impact of the decoder’s inductive bias without the inductive bias imposed by GNNs, we also show Transformer encoders on full graphs as well as structure-agnostic MLPs. Within our experimental setup, AR decoders are able to fit the training data as well as generalize significantly better than NAR decoders, indicating that sequential decoding is powerful for TSP even without graph information.

Conversely, NAR architectures are a poor inductive bias as they require significantly more computation to perform competitively with AR decoders. For instance, recent models [17, 18] used more than 30 GNN layers with over 10 Million parameters. We believe that such overparameterized networks are able to memorize all patterns for small TSP training sizes [64], but the learnt policy is unable to generalize beyond training graph sizes. At the same time, when compared fairly within the same experimental settings, NAR decoders are significantly faster than the AR decoders described in Section 3.3 as well as those which re-embed the graph at each decoding step [9], see Fig. 8.

5.5 How do learning paradigms impact the search phase?

Identical models are trained via supervised learning (SL) and reinforcement learning (RL) (Footnote 3). Figure 9 illustrates that, when using greedy decoding during inference, RL models perform better on the training size as well as on larger graphs. Conversely, SL models improve over their RL counterparts when performing beam search or sampling.

In Appendix C, we find that the rollout baseline, which encourages better greedy behaviour, leads to the model making very confident predictions about selecting the next node at each decoding step, even out of training size range. In contrast, SL models are trained with teacher forcing, i.e. imitating the optimal solver at each step instead of using their own prediction. This results in less confident predictions and poor greedy decoding, but makes the probability distribution more amenable to beam search and sampling, as shown in Fig. 10. Our results advocate for tighter coupling between the training and inference phase of learning-driven TSP solvers, mirroring recent findings in generative models for text [65].

Fig. 9

Comparing solution search settings. Under greedy decoding, RL demonstrates better performance and generalization. Conversely, SL models improve over their RL counterparts when performing beam search or sampling

Fig. 10

Impact of increasing beam width. Teacher-forcing during SL leads to poor generalization under greedy decoding, but makes the probability distribution more amenable to beam search

5.6 Which learning paradigm scales better?

Our experiments up to this point have focused on isolating the impact of various pipeline components on zero-shot generalization under limited computation. At the same time, recent results in natural language processing have highlighted the power of large scale pre-training for effective transfer learning [66]. To better understand the impact of learning paradigms when scaling computation, we double the model parameters (up to 750,000) and train on ten times more data (12.8M samples) for AR architectures. We monitor optimality gap on the training size range (TSP20-50) as well as a larger size (TSP100) vs. the number of training samples.

In Fig. 11, we see that increasing model capacity leads to better learning. Notably, RL models, which train on unique randomly generated samples throughout, are able to keep improving their performance within as well as outside of the training size range as they see more samples. On the other hand, SL is bottlenecked by the need for optimal groundtruth solutions: SL models iterate over the same 1.28M unique labelled samples and eventually stop improving. Beyond favorable inductive biases, distributed and sample-efficient RL algorithms [67] may be a key ingredient for learning from and scaling up to larger TSPs beyond tens of nodes.

Fig. 11

Scaling computation and parameters for SL and RL-trained models. All models are trained on TSP20-50. We plot optimality gap on 1,280 held-out samples of both TSP50 (performance on training size) and TSP100 (out-of-distribution generalization) under greedy decoding. Note that SL models are less amenable than RL models to greedy search. RL models are able to keep improving their performance within as well as outside of training size range with more data. On the other hand, SL performance is bottlenecked by the need for optimal groundtruth solutions