1 Introduction

In the past decade, significant progress has been achieved with the use of machine learning (ML) in various fields, such as image recognition and natural language processing. Deep learning, a subfield of ML, has driven much of this success and has led to growing interest in both research and practice. ML and operations research (OR) are historically interconnected through optimization, but only recently has the use of ML for OR received more attention. In this study, we focus on this direction. Specifically, we leverage deep learning algorithms to predict solutions to an OR problem by taking advantage of previously solved problems. In various applications in operations planning and management, such as energy demand-side management, airline scheduling, and vehicle routing, problems with the same structure must be solved repeatedly with different parameters within a very short period of time. In such settings, the reduced solution time obtained by fast algorithms can be highly beneficial for improving the efficiency and performance of businesses.

One complex and recurring problem for industrial companies is to determine the amount and timing of production over a planning horizon under resource constraints. It is an important challenge in industry and supply chain management because a production plan directly impacts companies’ output and their ability to compete in operational costs and customer service levels [1]. Production planning is also a highly complex task because firms strive to optimize multiple conflicting objectives, such as minimizing production and inventory costs, while maximizing customer satisfaction under tight constraints on resources, such as budget, raw materials, and machine availability.

In this paper, we present a general prediction framework to learn optimal solutions of combinatorial optimization problems, focusing on one core production planning problem: the single-item Capacitated Lot Sizing Problem (CLSP). The CLSP is an appropriate choice to demonstrate the computational benefits of our framework for two reasons. First, it is a well-studied and practical combinatorial optimization problem. Second, decisions are made in a dynamic setting involving time. The practical importance of the CLSP is apparent from numerous examples of its application in various production and manufacturing industries, including but not limited to the textile industry, oil and gas companies, car manufacturers, and the pharmaceutical industry [1, 2]. The CLSP determines the optimal production and inventory levels that meet periodic demand under a given production capacity by minimizing the sum of production, setup, and inventory holding costs over a finite planning horizon. In the mixed-integer programming (MIP) formulation of the CLSP, the decision of whether to produce or not in each period is represented by a binary variable, and the CLSP with time-varying capacities is NP-hard, i.e., a very difficult problem to optimize [3, 4]. In this paper, we focus on tackling the computational difficulty of the CLSP and provide an ML-based optimization framework to solve its MIP formulation more efficiently. Thus, we study the CLSP at a high formulation level rather than focusing on a specific real-life application.

The CLSP is a sequential decision-making problem because, in each time period, the production level is determined to meet the periodic demand, and any additional produced items not used for current demand are placed in inventory to be used for future demand. Therefore, demand, capacity, cost inputs, and production decisions constitute highly correlated temporal sequences that classical supervised classification might not capture. Thus, the CLSP can be treated as a sequence labeling task to which a recurrent neural network (RNN) is applicable.

The RNN is a specialized type of neural network that can process sequential data by enabling information to flow across time steps. A network with the same parameters is applied at each time step of the sequence. The input at each time step consists of the data at that time step and the network activations from the previous step. As a result, the output of an RNN is affected by previous inputs rather than only the current input. Developed by Hochreiter and Schmidhuber [5], Long Short-Term Memory (LSTM) is a specialized RNN that can store information over long time spans, which can be challenging for a classical RNN [6]. The bidirectional RNN (BRNN) allows the use of input information from future time steps rather than processing information only in sequential order [7]. The main idea is to train two separate RNNs, one in each time direction, that connect to the same output layer. Bidirectional LSTM is an extension of the BRNN that uses the LSTM architecture [8]. The LSTM architecture might be preferable to the classical RNN due to its ability to capture long-term dependencies, which comes at a computational cost. We train the bidirectional LSTM network on datasets with different characteristics and evaluate the quality of the resulting predictions in terms of feasibility and optimality. Our computational results show that a significant reduction in solution time can be achieved without much loss in feasibility and optimality.

Our main contribution is to develop an LSTM-based framework for quickly learning optimal solutions to the CLSP. We propose using a bidirectional LSTM to predict the binary production decision variables. The bidirectional LSTM can process information in both time directions, which is critical for predicting solutions of various OR problems with a dynamic nature where data is available over the planning horizon. Also, instead of using all the predicted variables, we propose using them partially to reduce the number of infeasible solutions. We present results on how the quality of the predictions changes with respect to feasibility and optimality at different levels of predicted variables. Additionally, we show results on generalization to instances with different characteristics and instances with longer time horizons, and compare them with those of traditional dynamic programming and cutting plane algorithms as well as well-known learning algorithms: logistic regression and random forest.

2 Literature Review and Contributions

2.1 Literature Review

In recent years, significant results have been achieved in various fields by deep learning, a sub-field of ML. As a result, there has been a growing literature on the interaction between ML and OR. In this paper, we focus on the use of ML to improve the solution of OR problems, particularly the CLSP. The origin of the interest in using ML algorithms for OR can be traced back to the 1980s, when neural computation was used for solving combinatorial optimization problems (see, e.g., the survey of Smith [9] on this topic). The approaches in the literature are structured into two parts: approaches that use ML to predict solutions directly from inputs and approaches that predict valuable pieces of information to utilize in solution algorithms.

In one of the studies, which focuses on predicting the optimal solution directly, Larsen et al. [10] propose a new methodology to predict solution descriptions of a stochastic load planning problem using deep learning. According to the authors, a solution can be described at different levels. The most detailed solution describes the values taken by each variable, and the least detailed solution gives the value of the objective function. Their desired level of description is somewhere in the middle. At the time of the prediction, using a deterministic optimization model is not possible because the information available is imperfect, and the computational budget is limited. They generate training data by solving a large number of deterministic problems offline and combining solutions to the desired level of description at the prediction time. They train feedforward neural networks using this generated data and predict the actual problem instances. With a similar approach, Fischetti and Fraccaro [11] use various ML techniques, including neural networks, to estimate the optimal value of the offshore wind farm layout optimization problem. Their goal is to determine the optimal allocation of the wind turbines in a site to maximize park power production. The authors argue that ML can be used as a fast tool to estimate the optimal value of the problem for pre-selecting between candidate wind farm sites. The optimization model can be evaluated at these promising sites instead of all candidates. Based on their findings, a fast ML+OR tool can dramatically increase the number of sites and turbine types investigated.

In a recent study, Bertsimas and Stellato [23] provide a solution methodology based on dynamic programming (DP) for lot-sizing. An exact solution approach presented by Barany et al. [24] involves generating valid (\(\ell\), S) inequalities and adding them to the formulation with a separation algorithm. Eppen and Martin [25] redefine variables to generate a graph representation of the problem, which has a tighter linear relaxation than the original formulation. More recently, DP-based and partial-objective inequalities have been proposed for the single-item CLSP [4] and multi-item CLSP [26], respectively. For a detailed discussion of the exact and heuristic approaches to different versions of the lot-sizing problem, we refer the readers to the excellent review of [27].

Readers are referred to Goodfellow et al. [28] for a detailed discussion on deep learning algorithms. We refer to Graves [29] for a detailed discussion on RNNs, LSTMs, and sequence labeling. Readers are referred to Karimi et al. [2] and Pochet and Wolsey [27] for an extensive survey on the capacitated lot-sizing problem, its variants, and exact and heuristic approaches for its solution.

There has been a growing interest in using ML algorithms to help solve OR problems in recent years. Despite all the advancements in ML-OR integration, there is still a research gap in learning optimal solutions to MIP problems, including the CLSP, from previously solved instances, and in evaluating the effectiveness and generalization of the learning-based optimization approach.

2.2 Key Contributions of the Paper

To our knowledge, none of the former studies have used a deep learning algorithm, such as LSTM, to capture the sequential nature of multi-period MIP models and predict their optimal solutions. Decisions are closely linked over multiple periods in a multi-stage or sequential problem. Thus, an ML approach that does not consider patterns across time may not capture the dynamic nature of the problem. The LSTM, on the other hand, is a recurrent network capable of learning long- and short-term dependencies and temporal patterns in the optimal solutions given specific problem characteristics.

In this study, we present a new deep learning LSTM-Optimization (LSTM-Opt) architecture to learn optimal solutions for one of the most well-known combinatorial optimization problems and a classical example of a sequential decision-making problem, the CLSP. Our goal is to reduce the solution time in settings where numerous similar CLSPs must be solved repeatedly and quickly with a small optimality gap. Our specific contributions are:

  1. To our knowledge, this is the first study that utilizes an LSTM approach to make predictions from the optimal solutions of CLSP instances and uses those predictions to solve similar CLSPs with different data. Specifically, we propose an LSTM-Opt framework, which predicts the binary decision variables of the CLSP. The bidirectional LSTM learns optimal solutions to sequential decision-making problems where the input data is available over the planning horizon. We compare the computational performance of our algorithm with other ML approaches, such as logistic regression and random forest. We show that the LSTM networks capture the time-wise dependency in sequential decision making and are thus superior to those ML algorithms.

  2. We evaluate the effectiveness of the predictions in terms of their feasibility and optimality for the original CLSP by defining optimization-based metrics, such as the optimality gap and the percent of feasibly-predicted instances in the test set. Using all predictions could help reduce the solution time but may also increase the infeasibility in the test set. To improve the feasibility of the solutions, we propose using the predictions partially as an input into the MIP solver, CPLEX [30]. This approach provides a significant reduction in solution time while improving the optimality gap and the feasibility of solutions. To remedy the infeasibility problem, additional methods, such as CPLEX user cuts, are utilized to solve the problem with a reasonable optimality gap and no infeasibility.

  3. We utilize benchmark CLSP instances from the literature to demonstrate the efficiency of our LSTM-Opt approach. In addition to comparing with direct solutions of the CLSP by CPLEX, we utilize a dynamic programming formulation [23], dynamic programming-based inequalities [4], and (\(\ell\), S) inequalities [24] to show that LSTM-Opt can be beneficial in reducing the solution time even when compared with these traditional exact OR methodologies proposed for solving the CLSP more efficiently.

  4. Our LSTM-Opt framework helps decrease the CPLEX solution time by multiple orders of magnitude when predicting CLSP instances. Furthermore, this prediction architecture provides greater time-gain benefits as the CLSP instances get harder; i.e., for the most difficult test problems generated with the same distribution as the training instances, the solution time is reduced by a factor of 13 without any infeasibility or optimality gap.

  5. We investigate whether the trained LSTM model can predict instances with different underlying data distributions or instances with a larger planning horizon. The results imply that one must be careful in picking the prediction level when solving instances with different characteristics. The computational results also show that the trained LSTM model can successfully predict longer, and thus harder, instances without extra training. As an example, in those generalization experiments to predict longer planning horizons, using a prediction level of 25%, we reduce an average solution time of 70 CPU hours to only 2 CPU minutes with a 0.8% optimality gap, which is a significant computational achievement.

  6. Once an LSTM model is trained from previously solved instances, predictions for new problems can be generated in milliseconds in an online setting. Thus, our LSTM-Opt approach could be particularly useful for solving practical and recurring sequential decision-making problems, such as power generation scheduling, energy demand-side management, and pricing optimization, where the same problem formulations are solved repeatedly over time with updated parameters.

  7. Our LSTM-Opt framework is generalizable since it does not assume any information specific to the CLSP. Thus, it can be applied to other MIPs, such as the binary knapsack problem, one of the most well-known MIP formulations and a relaxation of the CLSP.

The remainder of the paper is organized as follows. Section 3 presents the MIP formulation of the CLSP. Section 4 describes the proposed LSTM-Opt framework. Section 5 describes the details of implementation and experimentation. Section 6 presents the computational results on datasets with different characteristics and a comparison with other ML and exact approaches. Section 7 concludes the paper with future research directions. Appendices A1-A3 provide a discussion of the LSTM training time and additional results with different datasets and characteristics, respectively.

3 Capacitated Lot Sizing Problem

The CLSP is a fundamental problem in production planning. It determines the production and inventory levels over a multi-period planning horizon to fulfill deterministic demand without back-ordering while minimizing the sum of production, setup, and inventory holding costs. The CLSP with time-varying capacities is NP-hard, and it has numerous variations and applications in the production and manufacturing industries [31].

To formulate the CLSP as a mixed-integer program (MIP), the following parameters and decision variables are defined. Let T be the number of periods in the planning horizon. For each period \(t \in \left\{ 1,2,\ldots ,T \right\}\), the demand \(d_{t}\) is known in advance, the associated costs are the unit production cost \(p_{t}\), the setup cost \(f_{t}\), and the unit inventory holding cost \(h_{t}\), and the production capacity is denoted by \(c_{t}\). Note that the setup cost \(f_{t}\) is not per unit. Without loss of generality, all parameters can be assumed to be non-negative. The number of units produced and the ending inventory in period t are represented by the non-negative variables \(x_{t}\) and \(s_{t}\), respectively. The binary variable \(y_{t}\) takes value 1 if there is production in period t, and 0 otherwise. The CLSP can be formulated as:

$$\begin{aligned} \min \quad & \sum _{t=1}^{T} (p_{t}x_{t} + f_{t}y_{t} + h_{t}s_{t}) \end{aligned}$$
(1a)
$$\begin{aligned} \text {s.t.} \quad & s_{t-1} + x_{t} - d_{t} = s_{t} \quad \forall t=1,2,\ldots ,T \end{aligned}$$
(1b)
$$\begin{aligned} & x_{t} \le y_{t}c_{t} \quad \forall t=1,2,\ldots ,T \end{aligned}$$
(1c)
$$\begin{aligned} & x_{t},s_{t} \ge 0 \quad \forall t=1,2,\ldots ,T \end{aligned}$$
(1d)
$$\begin{aligned} & y_{t} \in \{0,1\} \quad \forall t=1,2,\ldots ,T. \end{aligned}$$
(1e)

The objective function (1a) minimizes the sum of production costs, setup costs, and inventory holding costs over all periods \(t \in \left\{ 1,2,\ldots ,T \right\}\). Constraints (1b) ensure the inventory flow over multiple periods. Specifically, the demand in period t must be satisfied by inventory at the end of period \(t-1\) and units produced in period t. The remaining amount is the inventory at the end of period t. Constraints (1c) limit the production by capacity and ensure that a fixed cost of production is incurred in the objective function if there is production in period t. Constraints (1d) enforce that the amounts of units produced and kept in inventory are non-negative. Finally, constraints (1e) ensure that \(y_t\) are binary variables. The parameter \(s_{0}\) represents the initial inventory and is assumed to be zero.
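For illustration, formulation (1a)-(1e) can be assembled programmatically. The following minimal sketch uses CPLEX's Python API (docplex); it is only an illustrative rendering of the model, not the C++ implementation used in our experiments, and the function name is our own.

```python
from docplex.mp.model import Model

def solve_clsp(d, p, h, cap, setup):
    """Build and solve the CLSP MIP (1a)-(1e); a minimal docplex sketch (0-indexed periods)."""
    T = len(d)
    m = Model(name="CLSP")
    x = m.continuous_var_list(T, name="x")   # production amounts x_t >= 0
    s = m.continuous_var_list(T, name="s")   # ending inventory s_t >= 0
    y = m.binary_var_list(T, name="y")       # setup decisions y_t in {0, 1}
    # (1a): production + setup + inventory holding costs
    m.minimize(m.sum(p[t] * x[t] + setup[t] * y[t] + h[t] * s[t] for t in range(T)))
    for t in range(T):
        prev = s[t - 1] if t > 0 else 0      # s_0 = 0 (no initial inventory)
        m.add_constraint(prev + x[t] - d[t] == s[t])   # (1b): inventory balance
        m.add_constraint(x[t] <= cap[t] * y[t])        # (1c): capacity / setup link
    return m.solve()
```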

4 LSTM-Optimization Framework

In this section, we present the LSTM-Opt framework that we develop to predict the optimal solution of the CLSP. Using the LSTM-Opt framework, we predict only the binary decision variables \(y_{t}\) that correspond to the production decisions instead of predicting all decision variables. As depicted in Fig. 1, the LSTM-based framework starts with data generation. The datasets with different characteristics are constructed according to the data-generation scheme described in Sect. 5.1. The resulting datasets are divided into training, validation, and test sets, which consist of 64%, 16%, and 20% of the data, respectively. The LSTM network parameters are optimized using the training set by minimizing a loss function that measures the discrepancy between the model's predictions and the actual values. The binary cross-entropy is a common choice of smooth loss function for binary classification because it leads to faster training with better generalization performance than the sum-of-squares error [32]. Note that the objective function of the CLSP given by equation (1a) is not minimized by the LSTM network; in the training step, we minimize the binary cross-entropy loss function given in the following equation (2):

Fig. 1 LSTM-Opt framework

$$L(y^*,\hat{y}) = -\frac{1}{T} \sum_{t=1}^{T} \left( y^*_{t} \log (\hat{y}_{t}) + (1-y^*_{t}) \log (1-\hat{y}_{t}) \right)$$
(2)

where \(y^*\) represents the optimal values of the binary decision variables, and \(\hat{y}\) represents predicted values of the binary decision variables of a CLSP instance. The binary cross-entropy loss function in Eq. (2) measures the discrepancy between \(y^*\) and \(\hat{y}\).
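As a concrete check of Eq. (2), the per-instance loss can be computed directly from the optimal and predicted labels. The snippet below is a minimal NumPy sketch; the function name and the clipping constant are our own additions for numerical safety.

```python
import numpy as np

def binary_cross_entropy(y_star, y_hat, eps=1e-7):
    """Eq. (2): mean binary cross-entropy over the T periods of one CLSP instance."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0) for saturated predictions
    return -np.mean(y_star * np.log(y_hat) + (1.0 - y_star) * np.log(1.0 - y_hat))

# Example: optimal setups y* = [1, 0, 1] and LSTM outputs y_hat = [0.9, 0.2, 0.7]
loss = binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]))
```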

Fig. 2 Bidirectional LSTM [7] adapted to represent the CLSP multi-period structure

The LSTM model consists of several bidirectional LSTM layers, which process information in both time directions, and an output layer with a sigmoid activation function. Figure 2 shows the flow of information in the forward and backward layers of the bidirectional LSTM. The input at each period consists of the available features for that period: the unit production cost \(p_{t}\), setup cost \(f_{t}\), production capacity \(c_{t}\), and demand \(d_{t}\) for \(t \in \left\{ 1,2,\ldots ,T \right\}\). Note that we omit the holding cost \(h_{t}\) because it is constant across periods. For period t, information is carried from period \(t-1\) in the forward layer and used to generate the output in period t. In the backward layer, information is carried from period \(t+1\) to period t and used, together with the inputs in period t, to generate the output. The outputs of the forward and backward layers are combined to generate the prediction \(\hat{y}_{t}\). After each hidden layer, a dropout layer is added for regularization.

We compare models with different parameters using the instances in the validation set, a procedure known as hyperparameter tuning. We then choose the model with the highest validation accuracy, defined as the proportion of correctly predicted variables. Note that the validation set is not used to minimize the binary cross-entropy in Eq. (2); it is only used to compare LSTM networks with different hyperparameters, such as the learning rate, number of layers, number of hidden nodes, and dropout rate. Then, for each instance in the test set, a prediction is generated using the chosen model. The framework described so far does not indicate whether the resulting prediction is feasible or how good it is in terms of the objective function value. Therefore, the resulting predictions are added to problem (1a-1e) as constraints, and the CLSP is re-solved using CPLEX. The described approach can deliver optimal solutions quickly without much loss in feasibility and optimality, as demonstrated in the computational results in Sect. 6.

The CLSP is an MIP because of the binary decision variables. Predicting all binary decision variables and then fixing the predicted values in the MIP formulation (1a)-(1e) turns the problem into a linear program, which yields a significant reduction in solution time. However, this approach often leads to infeasibility due to its strict nature. Instead, predicting only some of the binary decision variables leaves more flexibility when re-solving the problem instance and reduces the number of infeasible problems while still improving the solution time.

Additionally, predicting only the binary variables is sufficient because, once the binary variables are fixed in the CLSP, the problem reduces to a linear program in the remaining continuous variables [27]. Also, predicting the binary variable carries the interpretable meaning of the production decision and its timing: once the decision of whether to produce or not is determined and fixed in a period, the MIP solver determines the amount of production and the inventory levels.

Our integrated ML+OR tool can be beneficial for real-time applications where problems with different parameters are solved repeatedly. Lot-sizing and its variants commonly arise in the energy, pharmaceutical, electronics, food, processing, and consumer goods industries [33]. After an ML model is trained, it does not need to be updated after each prediction. Therefore, once an ML model is trained, predictions can be generated in milliseconds by an LSTM forward pass, allowing many CLSP instances sampled from the same distribution to be solved very quickly. While LSTM-Opt requires an initial computational investment for training and hyperparameter optimization, the long-term computational benefits significantly outweigh this upfront cost, because the training time per solved problem decreases significantly as the framework is used to solve problems frequently. In addition, deep learning models can be trained in parallel on large computing clusters to reduce the training time.

In this study, we present a general framework where learning-based decisions and mathematical solvers are integrated to generate a complete solution framework. However, special solution characteristics can be exploited for certain practical applications to provide further benefits during training, and predictions can be integrated into uniquely crafted solution algorithms. For the sake of generalization, we opted not to use such approaches.

5 Implementation and Experimentation

This section presents the CLSP instance generation scheme and the implementation details of our LSTM-Opt framework. All code for generating CLSP instances and running the LSTM-Opt framework is written in C++ and Python.

5.1 CLSP Instance Generation

The training, validation, and testing data were generated by the scheme presented in Atamtürk and Muñoz [34]. The difficulty of the problems is determined by two main factors: the tightness of the capacities with respect to demand and the ratio between setup and holding costs. Following the parameters used in Büyüktahtakın and Liu [35], instances are generated with capacity-to-demand ratios \(c \in \left\{ 3,5,8\right\}\), setup-to-holding cost ratios \(f \in \left\{ 1000,10000\right\}\), and numbers of periods \(T \in \left\{ 90,120\right\}\). The demand \(d_{t}\), unit production cost \(p_{t}\), production capacity \(c_{t}\), and setup cost \(f_{t}\) are generated from integer uniform distributions over the ranges \(d_{t} \in \left[ 1,600\right]\), \(p_{t} \in \left[ 1,5\right]\), \(c_{t} \in \left[ 0.7c\bar{d},1.1c\bar{d}\right]\), and \(f_{t} \in \left[ 0.9f\bar{h},1.1f\bar{h}\right]\), where \(\bar{d}= \frac{1}{T} \left( \sum _{t=1}^{T} d_{t}\right)\) and \(\bar{h}= \frac{1}{T} \left( \sum _{t=1}^{T} h_{t}\right)\), respectively. The unit inventory holding cost \(h_{t}\) is set to one in each period. These CLSP instances span a variety of real applications. For example, setup costs increase as the complexity and value of a product increase [36]. The type of CLSP problem (e.g., small- versus big-bucket) also impacts the length of each time period and thus the size of the full planning horizon [1, 2]. In our computational study, we test a variety of CLSP instances with different sizes, characteristics, parameter configurations, and computational complexity, similar to former studies that test the success of CLSP solution algorithms (see, e.g., [4, 34, 37,38,39,40]). Moreover, the solution times of problems generated with the described parameters range from a few seconds to a couple of days. Therefore, the generated problems represent a wide range of computationally different problems.
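As an illustration of this scheme, one instance could be sampled as in the following sketch, assuming NumPy; the function name and random seed are our own, and this is not the instance generator used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_clsp_instance(T=90, c=3, f=1000):
    """Sample one CLSP instance following the scheme in Sect. 5.1 (a sketch).

    c: capacity-to-demand ratio, f: setup-to-holding cost ratio.
    """
    d = rng.integers(1, 601, size=T)          # demand d_t ~ U[1, 600]
    p = rng.integers(1, 6, size=T)            # unit production cost p_t ~ U[1, 5]
    h = np.ones(T)                            # holding cost fixed at 1 per period
    d_bar, h_bar = d.mean(), h.mean()
    cap = rng.integers(int(0.7 * c * d_bar), int(1.1 * c * d_bar) + 1, size=T)
    setup = rng.integers(int(0.9 * f * h_bar), int(1.1 * f * h_bar) + 1, size=T)
    return d, p, h, cap, setup
```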

For each of the 12 combinations of parameters cf, and T as described above, 100000 instances (problems) are generated, resulting in a total of 1200000 instances. All instances are solved using CPLEX. Infeasible problems are eliminated and replaced by feasible instances by regenerating new instances. The training, validation, and test set consists of 64000, 16000, and 20000 CLSP instances, respectively, for each combination of parameters.

5.2 LSTM-Opt Implementation Details

Before training, the data is standardized by subtracting the feature mean and dividing by the feature standard deviation as a preprocessing step, which is often useful in practice for faster convergence when different inputs have typical values that differ significantly [41]. In the hyperparameter tuning step, we compared LSTM models with different parameters, such as the learning rate, number of layers, number of hidden nodes, and dropout rate, using the validation set. The ranges considered are \([2, 6]\) for the number of hidden layers, \([10, 150]\) for the number of units in the hidden layers, \([0.1, 0.5]\) for the dropout rate, and \([0.001, 0.1]\) for the learning rate. The selected LSTM model contains three hidden LSTM layers, each with 40 hidden units in each time direction; therefore, the bidirectional LSTM has 80 hidden units per layer. After each LSTM layer, a dropout layer with a drop rate of 0.3 is added to regularize the network. We used the Adam optimizer, an adaptive learning rate optimization algorithm that has been shown to work well in practice [42], with an initial learning rate of 0.01.
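For concreteness, the selected architecture could be assembled as in the following minimal sketch, assuming a TensorFlow/Keras implementation; this reconstruction is ours and not necessarily the exact code used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, n_features = 90, 4   # planning horizon and per-period features (p_t, f_t, c_t, d_t)

model = models.Sequential([
    layers.Input(shape=(T, n_features)),
    # Three bidirectional LSTM layers, 40 units per time direction (80 in total per layer)
    layers.Bidirectional(layers.LSTM(40, return_sequences=True)),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(40, return_sequences=True)),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(40, return_sequences=True)),
    layers.Dropout(0.3),
    # Sigmoid output per period: predicted probability that y_t = 1
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",   # Eq. (2)
    metrics=["accuracy"],
)
```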

Hyperparameter tuning is critical for successfully training deep learning models, including LSTMs, and can be challenging due to the large number of hyperparameters [43]. For this step, we utilized a random search strategy, which is considered superior to grid search [44] and preferable in many practical settings [45]. In this strategy, the hyperparameters are sampled independently of each other for each experiment from a predetermined distribution. Different ranges for the hyperparameters might be required for different problems; for example, a higher range for the number of layers and hidden nodes can be considered for a more challenging problem to increase the model complexity. For a detailed discussion of hyperparameter optimization, we refer the reader to the review of Yu and Zhu [46]. Also, alternative approaches such as Bayesian optimization [47] or particle swarm optimization [48] can be employed for efficient hyperparameter selection. Furthermore, a neural architecture search can be performed to automate the network architecture design and reduce the dependency on decisions made by a human expert [49].
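A minimal random-search loop over the ranges listed above might look as follows; the sampling distributions (e.g., log-uniform for the learning rate) are our own illustrative assumptions.

```python
import random

def sample_config():
    """Draw one hyperparameter configuration from the ranges in Sect. 5.2."""
    return {
        "n_layers": random.randint(2, 6),
        "hidden_units": random.randint(10, 150),
        "dropout": random.uniform(0.1, 0.5),
        "learning_rate": 10 ** random.uniform(-3, -1),  # roughly [0.001, 0.1]
    }

# Each sampled configuration is trained on the training set and compared by
# validation accuracy; the best configuration is kept for the test predictions.
configs = [sample_config() for _ in range(20)]
```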

For each instance in the test set, the values of the binary variables are predicted. For each period \(t \in \left\{ 1,2,\ldots ,T \right\}\), the LSTM network generates a prediction in the range \([0, 1]\) for each \(y_{t}\). The values \(\max (\hat{y}_{t},1-\hat{y}_{t})\) for \(t \in \left\{ 1,2,\ldots ,T \right\}\), where \(\hat{y}_{t}\) denotes the predicted value of the binary variable \(y_{t}\), are calculated and sorted in decreasing order. Predicted variables are selected up to the desired prediction level, and each selected \(\hat{y}_{t}\) is labeled as 0 or 1 using a cut-off value of 0.5. These variables can be interpreted as the ones closest to either zero or one, and thus the ones for which we are most confident in the LSTM model's prediction. Let \(D \subseteq \{1,2,\ldots,T\}\) be the set of indices of the binary decision variables predicted and selected using the \(\max (\hat{y}_{t},1-\hat{y}_{t})\) criterion and a pre-set prediction percentage. Finally, these values are added as constraints to the original model (1a-1e), as shown in the following modified problem (3a-3c):

$$\begin{aligned} \min \quad & \sum _{t=1}^{T} (p_{t}x_{t} + f_{t}y_{t} + h_{t}s_{t}) \end{aligned}$$
(3a)
$$\begin{aligned} \text {s.t.} \quad & \text {Constraints } (1b)\text {-}(1e) \end{aligned}$$
(3b)
$$\begin{aligned} & y_{t} = \hat{y}_{t} \quad \forall t\in D. \end{aligned}$$
(3c)

Problem (3a-3c) is solved again to assess the quality of the LSTM predictions.
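A minimal NumPy sketch of the selection step described above is given below; the function name is hypothetical, and adding the resulting constraints (3c) to the solver model is left to whichever MIP interface is in use.

```python
import numpy as np

def select_fixed_indices(y_hat, pred_level):
    """Pick the most confident LSTM predictions to fix, per the LSTM-Opt scheme.

    y_hat: array of LSTM outputs in [0, 1], one per period.
    pred_level: fraction of binary variables to fix (e.g., 0.5 for 50%).
    Returns the index set D and the rounded 0/1 values for the selected periods.
    """
    confidence = np.maximum(y_hat, 1.0 - y_hat)    # distance from the 0.5 cut-off
    k = int(round(pred_level * len(y_hat)))        # number of variables to fix
    D = np.argsort(-confidence)[:k]                # most confident periods first
    fixed_values = (y_hat[D] >= 0.5).astype(int)   # label with cut-off value 0.5
    return D, fixed_values

# Example: fix 50% of a 6-period prediction
D, vals = select_fixed_indices(np.array([0.02, 0.97, 0.55, 0.48, 0.91, 0.10]), 0.5)
# The constraints y_t = vals for t in D are then added to problem (3a-3c).
```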

6 Computational Results

This section presents results from computational experiments performed using the LSTM-Opt framework described in Sect. 4 on randomly generated CLSP instances with various characteristics, as defined in Sect. 5. All experiments are performed on a computer running Windows 10 with a 3.6 GHz Intel i7 CPU and 64 GB of memory. The CLSP instances are solved with IBM ILOG CPLEX 12.7.1. All of the results regarding the test-set solution times are reported in CPU seconds. The detailed results for training the LSTM models are presented in Appendix A1.

We solve problem (1a-1e) instances using the default CPLEX as a benchmark to compare the performance of our LSTM-Opt framework for solving similar and different CLSP instances. The test dataset consists of 20000 CLSP instances for each combination of the parameters f, T, and c.

As an alternative to solving problem (3a-3c), where we fix the predicted values in the CLSP formulation (1a-1e) as constraints, we utilize two CPLEX solver methods, AddUserCuts and AddMIPStart, to eliminate infeasibility, as described below:

  • 100(UC): The AddUserCuts method of CPLEX, which allows user-defined cuts to be added to the cut pool so that CPLEX can apply them as needed.

  • 100(MS): The AddMIPStart method of CPLEX, which allows the user to provide a starting solution (MIP start) to the model.

6.1 Quality of Predictions

Here, we present a number of metrics with their formal definitions that are used to evaluate the effectiveness of our LSTM-Opt framework. Specifically, these metrics assess the proposed method with respect to the improvement in solution time as well as the feasibility and optimality of the resulting solutions. The following metrics are used in Tables 1-6 and 8-11:

  • timeCPX: Mean solution time of a CLSP instance of the problem (1a-1e) in CPU seconds without any predictions using default CPLEX.

  • timeML: Mean solution time of the LSTM-Opt framework, including the prediction generation time by the LSTM model.

  • pred(%): Percent of binary variables predicted by the LSTM-Opt framework.

Additionally, we provide the following metrics and their formal definitions and use a combination of them to present the results:

Definition 6.1

The solution time improvement factor obtained by fixing the predicted variables as constraints (or using AddUserCuts and AddMIPStart) is given by:

$$\begin{aligned} \textbf{timeimp} = \frac{timeCPX}{timeML}. \end{aligned}$$
(4)

Definition 6.2

The percent solution time gain obtained by fixing the predicted variables as constraints (or using AddUserCuts and AddMIPStart) is given by:

$$\begin{aligned} \mathbf {timegain (\%)} = \frac{(timeCPX-timeML)}{timeCPX} \times 100. \end{aligned}$$
(5)

Definition 6.3

The percent infeasibility of a test set resulting from using the predicted binary variables is given by:

$$\begin{aligned} \mathbf {inf (\%)} = \frac{\hat{m}}{m} \times 100, \end{aligned}$$
(6)

where \(\hat{m}\) represents the number of CLSP instances that become infeasible by adding predictions as a constraint, and m represents the total number of CLSP instances in the test set.

Definition 6.4

Let \(\left( x^*,y^*,s^*\right)\) be the optimal solution for the original MIP problem (1a)-(1e) that is obtained by the CPLEX solver and \({Z}(x^*,y^*,s^*)\) be the corresponding optimal objective value. Let \(\hat{y}\) be the partial or full prediction of binary variables, \(\left( \tilde{x},\tilde{y},\tilde{s}\right)\) be the optimal solution obtained by CPLEX using predictions \(\hat{y}\) in problem (3a-3c), and \({Z}(\tilde{x},\tilde{y},\tilde{s})\) be the resulting objective function value. Note \(\tilde{y}\) is equivalent to \(\hat{y}\) when a full prediction is made. The optimality gap due to using solutions in our LSTM-Opt prediction framework is defined over feasibly-solved instances as follows:

$$\begin{aligned} \mathbf {optgap (\%)} = \frac{\left( {Z}(\tilde{x},\tilde{y},\tilde{s})-{Z}(x^*,y^*,s^*)\right) }{{Z}(x^*,y^*,s^*) } \times 100. \end{aligned}$$
(7)
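The four metrics above can be computed directly from the recorded solve statistics; the following is a small illustrative helper, with the function and argument names our own.

```python
def evaluation_metrics(time_cpx, time_ml, obj_opt, obj_pred, n_infeasible, n_total):
    """Compute the metrics of Definitions 6.1-6.4 for one test set."""
    timeimp = time_cpx / time_ml                          # Eq. (4)
    timegain = 100.0 * (time_cpx - time_ml) / time_cpx    # Eq. (5)
    inf_pct = 100.0 * n_infeasible / n_total              # Eq. (6)
    optgap = 100.0 * (obj_pred - obj_opt) / obj_opt       # Eq. (7), feasible instances only
    return timeimp, timegain, inf_pct, optgap
```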

In the next section, we present computational results to demonstrate the effectiveness of our prediction-optimization method, using the training and test instances with the same distribution. The predictions fed into the CPLEX solver may not be feasible for the CLSP instance. The test set instances for which the LSTM prediction leads to an infeasible solution are not included in the calculations of timeML, timeimp, timegain(%), and optgap(%).

6.2 Predicting Instances with Same Distribution

Each row of Tables 1, 3, 8, 9, 10, and 11 presents the average over 20000 instances, whereas Tables 4, 5, and 6 present the mean result over 10 instances, due to long solution times, for each c value and each \(f-T\) pair. Table 1 presents the results for instances with \(T = 120\) and \(f = 10000\). The dataset with \(c = 3\) is harder than the datasets with \(c=5\) and \(c=8\). Both the 25% and 50% prediction levels achieve all-feasible predictions that do not increase the objective function value. With the 50% prediction level, the mean solution time decreases by more than 10-fold. As the prediction level increases, the time factor improvement increases as well. Predictions at the 75% level reduce the solution time by a factor of 50 with an infeasibility below 0.3% in the test set and without any optimality gap. However, as the prediction level increases, the predictions lead to more infeasible instances in the test set. At the full prediction level (pred(%) = 100), more than half of the predictions lead to infeasible solutions to problem (3a-3c); however, the issue of infeasibility is remedied by CPLEX's user cuts approach (AddUserCuts), which eliminates the infeasible solutions in the cut pool. The user cuts (UC) approach provides a significant solution time factor improvement of 68 with an optimality gap of over 1% and without any infeasibility in the test set. The detailed results of experiments with \(f = 10000\), \(T = 90\) and \(f = 1000\), \(T = 90,120\) are presented in Tables A2-A4 in Appendix A2.

Table 1 Summary of experiments for \(f = 10000\) and \(T = 120\)

Figure 3a-d summarize the results with optgap(%), inf(%), and timeimp for changing c for instances with \(T = 90,120\) and \(f = 1000,10000\). Figure 3a-d show that as the level of predicted variables increases, the time improvement also increases at the price of an increased optimality gap and infeasibility in the test set. The problems are harder when \(c = 3\) (\(f = 10000\)) compared to \(c = 8\) (\(f = 1000\)) for the same T. Predicting at lower levels provides good results for harder problems, with significant time improvement and without causing much optimality gap or infeasibility; e.g., a time improvement factor of 13 is achieved at the 50% prediction level without any infeasibility or optimality gap for instances with \(c = 3\), \(T = 120\), and \(f = 10000\) (Fig. 3d). The time improvement factor is the highest when using the 100% prediction level. For instances with \(f=1000\), the full (100%) prediction results in less than 1.5% infeasibility. However, it can produce high levels of infeasibility in the test set for harder problems with \(f=10000\). On the other hand, the 50% prediction level provides a time factor improvement of over 3 and reduces the infeasibility to 0.01% and the optimality gap to zero (Fig. 3b). When \(f = 1000\), the time improvement increases significantly with the level of prediction, but the increase in the optimality gap and infeasibility is much smaller than for the counterpart instances with \(f = 10000\); e.g., using full predictions, an infeasibility of 0.4% and an optimality gap of zero are obtained for instances with \(c = 5\), \(T = 90\), and \(f = 1000\), whereas an infeasibility of 23.3% and an optimality gap of 1.2% are observed for instances with \(c = 5\), \(T = 90\), and \(f = 10000\).

Fig. 3 Summary of Results with optgap(%), inf(%), and timeimp

Figures 4a-d show averages for changing c, f, and T, and the overall average. As the value of c increases, the optimality gap, infeasibility, and time improvement generally decrease (Fig. 4a). Figure 4b shows that the value of f has a significant impact on results. The optimality gap and infeasibility are significantly lower when \(f = 10000\), with lower-level predictions. Also, the time factor improvement is significantly greater at all prediction levels when \(f = 10000\). Both the optimality gap and time improvement are slightly higher when \(T = 120\) compared to the instances with \(T = 90\). Figure 4d shows that using a prediction level of around 85% can balance all evaluation metrics by providing a solution time factor improvement of 9.

Fig. 4 Summary of Results with Different Data Generation Parameters

Table 2 shows the averages presented in Tables 1, 8, 9, and 10 for the LSTM-Opt 85% prediction level, the 100(MS), and the 100(UC). When \(f = 1000\), the average infeasibility and the optimality gap are zero with the 85% prediction level. The UC provides a similar average time improvement without infeasibility. When \(f = 10000\), the average time improvement is around 6 and 27 for the instances with \(T = 90\) and \(T = 120\), respectively, with a prediction level of 85%. The infeasibility is slightly higher than for the instances with \(f = 1000\) since the higher-level predictions increase the percent infeasibility for the harder test instances. The UC remedies the infeasibility problem and improves the solution time by factors of 6 and 28, with optimality gaps of around 0.7% and 0.8% for the instances with \(T = 90\) and \(T = 120\), respectively. Also, the UC outperforms the MS in time gain for all cases. Looking at the overall averages in Table 2, the LSTM-Opt predictions at the 85% level reduce the CPLEX solution time by a factor of 9 on average over more than 240000 test instances, with an infeasibility below 0.4% and an optimality gap of less than 0.05%. The UC provides a similar time gain without any infeasibility and with a slightly higher optimality gap of 0.4% than the 85% level of prediction.

Table 2 Summary of averages in Tables 1, 8, 9, and 10

In summary, the level of predictions used to get the best results varies notably between datasets with different characteristics. This level should be adjusted carefully, considering the trade-off between time gain, infeasibility, and optimality gap. Using an appropriate level of predicted variables leads to major reductions in solution time, up to an order of magnitude, without increasing infeasibility or the optimality gap. It is beneficial to use lower prediction levels for harder instances and higher prediction levels for easier instances, but a prediction level of around 85% can be a reasonable choice for all instances considered in this study. Also, the UC outperforms the MS in terms of providing lower optimality gaps. The UC can also be an alternative to the approach that uses predictions as constraints because it achieves zero infeasibility at the cost of a slightly higher optimality gap. As c, f, and T increase, i.e., as the instances get harder, we observe a higher time factor improvement using the LSTM-Opt framework, highlighting the potential of our approach for solving instances with varying sizes and distributions, as discussed in the next section.

6.3 Results on Generalization

Here, we present the results on the generalization of our approach to instances with a larger planning horizon T. It is not uncommon to have long production planning horizons in industries where daily (or even hourly) production planning is necessary, such as large-scale semiconductor manufacturing, biofuel production, and energy production [50,51,52]. Results on different data distributions are presented in Appendix A3. We omit the results with the MS approach in favor of the UC due to its weaker performance. Generalization is a desirable property because it might be beneficial to train the LSTM model on a relatively small horizon to predict instances with a larger planning horizon, saving training time. Specifically, training the LSTM model on shorter-horizon instances is faster than training it directly on the longer instances for which the predictions are made, since the input sequences are shorter. We also compare our framework with two other well-known ML algorithms (logistic regression and random forest) and the state-of-the-art cutting plane algorithms proposed for the CLSP.

6.3.1 Predicting Instances with Longer Horizons

Table 3 presents the results for predicting datasets with longer planning horizons. The predictions for a larger horizon are generated by concatenating the smaller LSTM predictions obtained by the LSTM model, which has a shorter planning horizon. For example, in the second block of rows in Table 3, the LSTM model with \(c = 3\), \(f = 1000\), and \(T = 90\) is used to generate predictions for the dataset with the same c and f, and \(T = 360\). Here, four separate prediction sets, each with 90 periods, are concatenated into a single set of predictions for generating a prediction for the test set with \(T = 360\).

For those instances, predicting 85% of the variables results in a time improvement of 3, with a 0.3% optimality gap and zero infeasibility in the test set of 20000 instances. The dataset with \(c = 5\), \(f = 10000\), and \(T = 180\) is predicted with the LSTM model trained on instances with the same c and f but a half-length planning horizon of \(T = 90\), as shown in the third block of rows in Table 3. For those instances, predicting 50% of the variables yields a significant time improvement of 9 and all-feasible solutions in the test set, at the cost of an optimality gap below 0.5%. The dataset with \(c = 8\), \(f = 10000\), and \(T = 480\) constitutes the hardest instances presented in Table 3, with a mean solution time of over 40 s, and is predicted using the LSTM model trained with \(T = 120\). Here, we observe significant solution time factor improvements over CPLEX using our LSTM-Opt framework. Predictions at the 75% level reduce the CPLEX solution time by a factor of 25 with no infeasibility and an optimality gap of 1%.
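The block-wise concatenation described above could be implemented along the following lines, assuming the Keras-style model sketched in Sect. 5.2 and a planning horizon that is a multiple of the training horizon; this is an illustrative sketch rather than the experimental code.

```python
import numpy as np

def predict_long_horizon(model, features, block_T=90):
    """Predict a longer instance by concatenating block-wise LSTM predictions.

    features: array of shape (T, n_features) with T a multiple of block_T.
    Returns one predicted probability per period of the long horizon.
    """
    T = features.shape[0]
    blocks = [features[t:t + block_T][None, ...] for t in range(0, T, block_T)]
    preds = [model.predict(block)[0, :, 0] for block in blocks]  # (block_T,) per block
    return np.concatenate(preds)
```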

Table 3 Summary of generalization experiments to test datasets with longer planning horizons

Table 4 presents the results for predicting datasets with significantly longer planning horizons; therefore, the test instances are much harder than the training instances. The predictions are generated using the model trained with \(c = 8\), \(f = 10000\), and \(T = 120\). The test sets for all three datasets consist of 10 instances, instead of 20000 as previously presented, due to computational complexity and long solution times. For the first dataset with \(T = 600\), problems are solved 70 times faster than the default CPLEX using predictions at the 50% level without any infeasibility and with an optimality gap below 1%. The mean CPLEX solution time for the next dataset with \(T = 720\) is more than 8 CPU hours. Here, the solution time of 8 h is reduced to under 1 min, with the predictions used at the 25% level without any infeasibility and with an optimality gap below 1%. Predictions used at 75% reduce the solution time by more than four orders of magnitude from more than 8 CPU hours to 2.5 CPU seconds without infeasibility and with an optimality gap below 2%. The last test dataset with \(c = 8\), \(f = 10000\), and \(T = 960\) constitutes the hardest instances presented with a mean solution time of over 70 h using CPLEX at its default settings. For those instances, predictions at the 25% level reduce the solution time of 70 CPU hours to only 79 CPU seconds with an optimality gap of 0.8% and without any infeasibility. Predictions at the 50% level reduce the solution time by more than a factor of 16000, with an optimality gap of 1.2% and zero infeasibility.

Generating 100000 instances for training data with \(c = 8\), \(f = 10000\), and \(T = 120\) takes 140170 s, whereas the LSTM training time takes 52724 s. In this specific example, it can be concluded that generating training data, training the LSTM model, and resolving with predictions for a single instance takes 16 h less than solving the instance with CPLEX. For such hard problems, our LSTM-Opt framework achieves a significant time reduction even in the case where just a single instance must be solved. The results discussed above highlight that our approach could be generalizable to predict larger instances with substantial benefits in reducing the solution time of those hard CLSPs with the cost of a small optimality gap.

Table 4 Summary of generalization experiments to test datasets with longer planning horizons contd*

6.4 Comparison with other ML and Exact Algorithms

Table 5 presents the computational comparison of our LSTM-Opt framework with two other machine learning approaches that perform a binary classification task and shows that their prediction quality is not comparable to that of LSTM-Opt. Additionally, a comparison with two other exact approaches is presented in Table 6 to show that the LSTM-Opt framework can produce good-quality solutions in much less time than those exact approaches. The machine learning and exact approaches used for comparison with LSTM-Opt are defined as follows.

ML Approaches:

  • Logistic Regression (LR): An extension of linear regression to classification tasks. The LR is more interpretable than tree-based ensemble methods, such as random forest, at the cost of accuracy.

  • Random Forest (RF): One of the best algorithms for classification tasks [53] in terms of classification accuracy at the cost of reduced interpretability.

Exact Approaches:

  • CPLEX (CPX): Direct solution of the CLSP formulation (1a)-(1e) with default CPLEX.

  • Dynamic programming-based inequalities (DPineq) of Hartman et al. [4]: We used the weaklu strategy to create a tighter CLSP polyhedron. The generated inequalities are added to the formulation (1a)-(1e), and the proposed algorithm is shown to outpace the dynamic programming algorithm by Hartman et al. [4] for some cases. In the experiments, we generate cuts for the first 100 periods with \(c = 3\), for the first 75 periods with \(c = 5\), and for the first 50 periods with \(c = 8\) for instances with \(T = 360\) to combat the growing DP-based inequality generation time with increasing c.

  • The (\(\ell\), S) inequalities (LSineq) of Barany et al. [24]: Implemented with a separation algorithm since the number of (\(\ell\), S) inequalities grows exponentially. The separation algorithm is iterated five times, which tends to give the best computational performance [26].

  • Dynamic programming (DP) solution approach for the CLSP [4, 23]: Results are omitted from the tables due to its poor performance. For example, while the mean solution times of the \(c = 3,5,8\) instances with CPLEX were 22.4, 3.3, and 1.3 s, respectively, the dynamic programming solution times were 610.7, 1054.9, and 1938.8 s for the same first 20 instances presented in Table 1. Additionally, the dynamic programming approach has a complexity of \(\mathcal {O}(TD_{T}^2)\), where \(D_{T} = \sum _{t=1}^{T} d_{t}\). We do not further compare the dynamic programming solution with the LSTM-Opt framework since CPLEX is superior for the considered instances.

Table 5 presents the set of instances used for the comparison of LSTM-Opt, LR, and RF. Here, we have utilized a different structure to generate our test instances. For the instances with \(c = 3,5\), a solution time limit of 86400 CPU seconds (24 CPU hours) is set for CPLEX to restrict the solution time. The metrics, including time improvement, time gain, and optimality gap, are calculated based on the best solution found by CPLEX within the solution time limit. The integrality gap, \(IGap=100 \times (objCPX-objLP)/objCPX\), where objCPX is the objective function value of the best feasible solution to the original problem and objLP is the objective function value of its linear programming relaxation, is 7.5%, 16.1%, and 27.7% for the instances with \(c=3,5,\) and 8, respectively. CPLEX reports an MIP optimality gap of 0.64%, 0.06%, and 0.00% on average for the test instances with \(c = 3,5,\) and 8, respectively, with a one-day time limit. For the instances with \(c = 3,5\), our preliminary results revealed that the test problems were still computationally very expensive to solve without a time limit at the 25% prediction level. Therefore, the results with the 25% prediction level are omitted in this section.

In Table 5, for the \(c = 3\) instances, predictions at the 50% level improve the solution time by more than a factor of 7500 by reducing the limited average solution time from one day to only 12 s, without any infeasibility, and with an optimality gap of 0.8%. Predictions of more than 50% lead to some infeasibility in the test set, while predictions at the 100% level lead to all-infeasible predictions; thus, the corresponding results are shown with a “-” in the first row block of Table 5. For the same instances solved at the 50% prediction level, LR and RF cause an infeasibility of 70% and 50%, respectively, since neither considers sequential dependencies like the LSTM networks do. The optimality gaps over the feasible instances, at 1.9% for LR and 2.3% for RF, were significantly higher than the 0.8% of the LSTM-Opt framework. Also, the time improvements are not as large as those of the LSTM-Opt framework. For the instances with \(c = 5\), predictions using LSTM-Opt at the 50% level decrease the limited solution time from one day to 20 CPU seconds and reduce the solution time by more than a factor of 4000, including the prediction generation time, without any infeasibility and with an optimality gap of 0.9%. The 85% level with LSTM-Opt reduces the solution time by five orders of magnitude without infeasibility and with an optimality gap of 2.3%. For the same prediction level, both LR and RF produce all-infeasible predictions. The datasets with \(c = 8\) are significantly faster to solve than the instances with \(c = 3,5\), and the results resemble those for easier instances. LR and RF cause high infeasibility at all prediction levels compared to the LSTM-Opt framework. Table 5 shows that a significant solution time reduction of around four to five orders of magnitude is achieved by LSTM-Opt without any infeasibility, depending on the desired optimality gap. We anticipate that the solution time gains would have been even larger if the original solution time had not been limited to 24 h. Additionally, both LR and RF cause higher infeasibility, a higher optimality gap, and a lower time improvement.

Table 6 presents a comparison of LSTM-Opt at the 50% prediction level with two exact approaches, namely the (\(\ell\), S) valid inequalities (LSineq) and the DP-based cutting planes (DPineq). The solution time is limited to 1 h, including the inequality generation time for both approaches. Here, the time improvement metric is calculated with respect to the set solution time limit of 1 h. The optimality gap metric is calculated with respect to the best integer solution found by CPLEX within a 1-day solution time limit. For the \(c=3\) instances, the LSTM-Opt framework achieves a 1.6% optimality gap. With both the DP-based and the (\(\ell\), S) inequalities, the objective function value obtained within the 1-hour time limit is below the value found by CPLEX within the 1-day limit, resulting in a negative optimality gap of \(-0.01\)%. Therefore, both types of inequalities are effective at reducing the optimality gap, but they cannot solve the CLSP very quickly, although they speed up the solution for the harder instances with \(c = 3,5\). Even though the LSTM-Opt framework has an optimality gap of 0.8%, its solution is found 322 times faster than with CPLEX, showing that LSTM-Opt can solve those problems very fast. The results with \(c = 5\) show a similar pattern, and the LSTM-Opt framework has an optimality gap of 0.9% on top of the CPLEX gap. The results with \(c = 8\) show that the LSTM-Opt framework achieves a time improvement of 3, while the inequality-based methods increase the solution time for these easier instances.

Figure 5a and b present a summary of the results in Tables 5 and 6 for \(c=3\) and 5, respectively. Both DPineq and LSineq improve over the CPLEX gap in Fig. 5a for the \(c = 3\) instances. LSTM-Opt achieves a solution with a slightly larger optimality gap much faster than those exact inequality-based methods, and it is also faster than LR and RF with a lower gap. The latter two algorithms result in high infeasibility rates of 70% and 50%, respectively, while LSTM-Opt achieves all-feasible predictions. The results are similar for the \(c = 5\) instances in Fig. 5b, with the exception that neither exact method improves over the CPLEX gap or solution time. Even though LR and RF have a higher time improvement than LSTM-Opt, both lead to high and impractical infeasibility rates of 60% and 40%, respectively.

Table 5 Computational results for comparing LSTM-Opt with other ML algorithms*
Table 6 Computational results for comparing LSTM-Opt with different exact methods*
Fig. 5 Comparison of exact and ML algorithms

6.5 Summary of Results

The results presented in the generalization experiments show that a network trained on a smaller planning horizon can successfully predict the optimal solutions of instances with larger horizons without any additional training. The solution time can be reduced by up to six orders of magnitude without increasing the optimality gap or infeasibility much, especially for harder problems. Also, LSTM-Opt can capture sequential dependencies, while LR and RF cannot. Classical exact approaches cannot produce solutions as quickly as LSTM-Opt.

Figure 6a-f present the results for the datasets with longer planning horizons. The results for \(T = 360\) in Fig. 6a are similar to those of the dataset on which the LSTM model was trained, with \(c = 3\), \(f = 1000\), and \(T = 90\), as shown in Fig. 3a. For the dataset with \(c = 8\), \(f = 10000\), and \(T = 720\) in Fig. 6c, using predictions at the 50% level reduces the solution time from more than 8 h to under 8 s without any infeasibility and with an optimality gap of 1.2%. Figure 6d presents the instances with the longest solution times. The instances with \(c = 8\), \(f = 10000\), and \(T = 960\) have a mean solution time of over 70 h. Predictions at the 25% and 50% levels reduce the solution time of those instances by more than factors of 3000 and 16000, with optimality gaps of 0.8% and 1.2%, respectively, without any infeasibility in the test set.

Overall averages in Table 4 show that the solution time can be decreased by a factor of more than 9000, with an infeasibility in the test set of only 0.5% and an optimality gap of approximately 2.1%. The UC reduces the solution time by more than a factor of 90000 on average, without infeasibility in the test set and with an average optimality gap of around 3.5%. Overall, predictions at levels between 25% and 85% provide significant time improvements with less than a 1% optimality gap and without any infeasibility in the test set. Specifically, predictions at the 85% level can balance a high solution time factor improvement with reasonable levels of infeasibility and optimality gap when predicting longer planning horizons using an LSTM model trained on shorter, and thus easier, instances.

Fig. 6 Summary of Generalization Experiments

7 Conclusions and Future Work

In this study, we present a new LSTM-Opt framework to predict the optimal solution of the CLSP, a fundamental production planning problem in various industry settings. Our ML approach could be beneficial in reducing the solution time for many practical problems that are solved repeatedly with different parameters. We utilize bidirectional LSTMs to process information in both time directions. Metrics defined as the time factor improvement, infeasibility, and optimality gap are presented to assess the quality of the predictions. The results for the CLSP instances with the same characteristics show that a time factor improvement of more than an order of magnitude can be achieved without much loss in feasibility or optimality if the level of predictions used to solve the problem is well adjusted. Also, we tested whether the trained LSTM models can generalize to instances with different data distributions or longer planning horizons. The results show that one should be careful in selecting the prediction level when predicting instances with different data distributions. The LSTM models trained on shorter planning horizons achieve great success in predicting instances with longer planning horizons at any prediction level and reduce the solution time by up to six orders of magnitude with a small optimality gap. Specifically, we observe the highest computational benefit from our LSTM-Opt approach when predicting the hardest set of instances. Also, the LSTM-Opt framework outperforms classical ML algorithms in terms of solution quality and exact approaches in terms of solution time improvement.

Our LSTM-Opt framework can be especially useful for reducing the solution time of dynamic combinatorial optimization problems that are solved in a repetitive setting [54]. In this paper, we have used the CLSP as a specific case to show that deep-learning approaches have great potential for learning optimal solutions to MIP problems. Future research could further investigate the generalizability of our approach to instances with a larger planning horizon and different distributions in more detail. Another possible research direction is to develop methods to eliminate infeasible predictions. Additionally, the developed LSTM-Opt framework can be extended to solve more complex versions of the CLSP, such as the multi-item or multi-level CLSP, as well as other sequential decision-making problems. Moreover, other contemporary sequential learning models, such as the recurrent encoder-decoder with attention [55] or transformers [56] can be utilized to tackle more complex problems.