Introduction

With climate change becoming increasingly evident over the last decade, reducing carbon emissions is an urgent need (Yang et al. 2022). Wastewater treatment is one of the most energy-intensive industries in China; it consumed 18.4 billion kWh in 2020, and its consumption continues to rise every year (Chang et al. 2021). Large amounts of greenhouse gases (GHGs), such as CO2, N2O and CH4, are generated during wastewater treatment, which has been identified as an anthropogenic source of GHG emissions (Yoshida et al. 2014). Of the total energy consumed by wastewater treatment plants (WWTPs), the electricity used for aeration accounts for 70–80%, followed by pumping and chemicals (Yang et al. 2021). At present, aeration control in Chinese WWTPs is coarse and largely manual, and excess aeration is routinely supplied, generating unnecessary energy consumption (Yang et al. 2021), especially in small-scale WWTPs. Most European WWTPs have installed supervisory control and data acquisition (SCADA) systems to control the aeration process precisely, whereas most Chinese WWTPs are equipped only with facilities to check whether the effluent meets the imposed discharge standards. As discharge standards become stricter, the demand for energy savings grows. It is therefore necessary to determine the energy saving potential of existing equipment while still meeting these standards.

During the last decade, with the rapid development of information technologies such as big data and artificial intelligence (AI), new energy saving solutions have been provided by constructing intelligent management systems (Wang et al. 2022). Machine learning (ML) is an AI technology used to recognize patterns in data for prediction or classification. Because it achieves high precision from data relationships alone, ML is becoming increasingly popular in fields such as effluent prediction and process optimization (Picos-Benítez et al. 2020). Since wastewater treatment is a nonlinear process, simple mechanistic models are often difficult to construct, and a data-driven ML approach is preferable (Bagherzadeh et al. 2021). Many studies have verified the feasibility of ML in wastewater treatment tasks. Nourani et al. (2021) proposed an approach based on black-box models, including a feedforward neural network, support vector regression (SVR) and an adaptive neuro-fuzzy inference system, to predict effluent biological oxygen demand (BOD5) and chemical oxygen demand (COD). Ly et al. (2022) investigated and compared six ML algorithms for predicting effluent total phosphorus (TP). El-Rawy et al. (2021) compared the performance of different models for predicting the removal efficiencies of total suspended solids, COD, BOD5 and NH4+–N (ammonia nitrogen). Wan et al. (2022) proposed a model based on a convolutional neural network, weight-sharing long short-term memory and Gaussian process regression for paper-making wastewater treatment, which exhibited comprehensive forecasting ability. Das et al. (2021) used the standard mean absolute error (MAE) and root mean square error (RMSE) metrics to compare four ML algorithms and selected a gated recurrent unit as the best model. Żyłka et al. (2020) evaluated a least-squares linear regression model for predicting electricity consumption and found that the main explanatory parameters were the organic loading rate and temperature.

However, to date, research using ML in wastewater treatment has mainly focused on effluent quality, while few studies have addressed energy consumption. In addition, some factors that determine energy consumption have not been considered, and their influence has not yet been fully evaluated; moreover, model performance is strongly determined by the accuracy of the input data.

In this study, a mapping relationship between energy consumption and management parameters was established, and an energy-saving strategy for WWTPs was developed based on a genetic algorithm. Daily operation parameters were ranked through regression analysis, and the top-ranking parameters were selected as inputs to an XGB model for predicting energy consumption. Furthermore, an energy-saving control strategy was evaluated, which is expected to offer an instant energy-saving strategy for WWTPs in practical applications.

Materials and methodology

Background of the target WWTP

An urban WWTP (anaerobic-anoxic-aerobic process) located in Henan Province (China) had a designed flow rate of 30,000 m3 day−1 and followed the Chinese discharge standard GB 18918-2002. The sludge was dewatered by a belt filter press and then disposed of in a sanitary landfill (Fig. 1). Operation data were collected for 353 days from 1st January to 31st December 2020 and consisted of the influent flow rate (IFR), influent NH4+–N concentration (IAN), influent COD (ICOD), influent TN (ITN), influent TP (ITP), effluent NH4+–N concentration (EAN), effluent COD (ECOD), effluent TN (ETN), effluent TP (ETP), mixed liquor suspended solids (MLSS) of the aerobic tank, DO at the end of the aerobic tank (DO), ORP at the end of the anoxic tank (ORP), organic loading rate (OLR), NH4+–N loading rate (ANLR) and energy consumption per cubic metre (EC). The units and statistics of each feature are shown in Table 1.

Fig. 1
figure 1

Treatment process of the WWTP

Table 1 Dataset

Feature selection

Ordinary least squares

Ordinary least squares (OLS) is a method that can be used for variable selection. For a dataset \(D=\{\left({{\varvec{x}}}_{1},{y}_{1}\right),\left({{\varvec{x}}}_{2},{y}_{2}\right),\dots ,\left({{\varvec{x}}}_{n},{y}_{n}\right)\}\), where \({\varvec{x}}\in {\mathbb{R}}^{d},y\in {\mathbb{R}}\), its basic expression is given in Eq. (1). The fitting criterion is to minimize the sum of squared residuals between \(y\) and \(f\left({\varvec{x}}\right)\) (Eq. 2), and the parameters with nonzero coefficients are selected.

$$ \begin{array}{*{20}c} {f\left( {\varvec{x}} \right) = {\varvec{\omega}}^{T} {\varvec{x}} + \beta } \\ \end{array} $$
(1)
$$ \begin{array}{*{20}c} {\min \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - f\left( {{\varvec{x}}_{i} } \right)} \right)^{2} } \\ \end{array} $$
(2)

where \({\varvec{\omega}}\) and \(\beta \) are the coefficients to be estimated.
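As a minimal numerical sketch (not the software used in the study), Eq. (2) can be solved with NumPy's least-squares routine; the toy data here are hypothetical:

```python
import numpy as np

def ols_fit(X, y):
    """Fit f(x) = w^T x + beta by minimizing the sum of squared
    residuals (Eq. 2) via NumPy's least-squares solver."""
    Xa = np.column_stack([X, np.ones(len(X))])  # append intercept column
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return coef[:-1], coef[-1]  # (w, beta)

# Hypothetical noiseless data: y = 2*x1 - 3*x2 + 1, so OLS should recover it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1
w, beta = ols_fit(X, y)
```

With noiseless data the recovered coefficients match the generating ones exactly (up to floating-point error); with real plant data they would only approximate the underlying relationship.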

Least absolute shrinkage and selection operator

To avoid overfitting during OLS, a penalty function with the L1 norm is added to the objective function to simplify the model structure and decrease the empirical risk; this is the least absolute shrinkage and selection operator (Lasso). Compared with ridge regression, which adopts the L2 norm, the Lasso more easily obtains sparse solutions and thus selects features.

$$ \begin{array}{*{20}c} {\min \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - f\left( {x_{i} } \right)} \right)^{2} + \lambda \left\| {\varvec{\omega}} \right\|_{1} } \\ \end{array} $$
(3)
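A minimal coordinate-descent sketch of Eq. (3), with a hypothetical dataset in which only the first feature matters; the soft-threshold level λ/2 follows from the subgradient of the L1 term:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for Eq. (3); a sketch only,
    not the solver used in the study."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]      # partial residual excluding feature j
            rho = X[:, j] @ r
            # soft-thresholding: coefficients below the threshold are set to 0
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0) / col_sq[j]
    return w

# Hypothetical data: only feature 0 drives y, so the Lasso should zero out the rest.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)
w = lasso_cd(X, y, lam=20.0)
```

The irrelevant coefficients collapse toward zero, which is exactly the sparsity property that makes the Lasso usable for feature selection.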

Smoothly clipped absolute deviation

The smoothly clipped absolute deviation (SCAD) approach was proposed by Fan and Li (2001). Compared with the Lasso, this method reduces the bias of parameter estimation. The principle behind the imposed penalty is similar to that of the minimax concave penalty (MCP), but the transition between regimes differs. The penalty function of SCAD is defined in Eq. (4).

$$ \begin{array}{*{20}c} {p_{\lambda ,\gamma } \left( \theta \right) = \left\{ {\begin{array}{*{20}l} {\lambda \theta ,} \hfill & {\theta \le \lambda } \hfill \\ { - \frac{{\theta^{2} - 2\gamma \lambda \theta + \lambda^{2} }}{{2\left( {\gamma - 1} \right)}},} \hfill & { \lambda < \theta \le \gamma \lambda } \hfill \\ {\frac{{\left( {\gamma + 1} \right)\lambda^{2} }}{2}, } \hfill & {\theta > \gamma \lambda } \hfill \\ \end{array} } \right.} \\ \end{array} $$
(4)

where \(\lambda \ge 0\), \(\gamma >2\) and \(\theta \ge 0\).

The objective function used by SCAD is as follows:

$$ \begin{array}{*{20}c} {\min \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - f\left( {x_{i} } \right)} \right)^{2} + \mathop \sum \limits_{j = 1}^{d} p_{\lambda ,\gamma } \left( {\left| {\omega_{j} } \right|} \right)} \\ \end{array} $$
(5)
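Equation (4) translates directly into a small NumPy function; the values of λ and γ below are illustrative only:

```python
import numpy as np

def scad_penalty(theta, lam, gamma):
    """SCAD penalty p_{lam,gamma}(theta) for theta >= 0 (lam >= 0, gamma > 2)."""
    theta = np.abs(theta)
    return np.where(
        theta <= lam,
        lam * theta,                                             # linear part
        np.where(
            theta <= gamma * lam,                                # quadratic taper
            -(theta**2 - 2 * gamma * lam * theta + lam**2) / (2 * (gamma - 1)),
            (gamma + 1) * lam**2 / 2,                            # constant plateau
        ),
    )
```

The penalty grows linearly up to λ, tapers quadratically between λ and γλ, and is constant beyond γλ; this flattening is what reduces the estimation bias on large coefficients relative to the Lasso.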

Cp criterion

Mallows' Cp is used for variable selection based on the residual sum of squares obtained from an OLS regression model (James et al. 2021). The independent-variable subset that minimizes Cp is selected, and the regression equation corresponding to this subset is the optimal regression equation.

$$ \begin{array}{*{20}c} {C_{{\text{p}}} = \frac{{{\text{SSR}}_{d} }}{{\hat{\sigma }^{2} }} - n + 2d} \\ \end{array} $$
(6)

where \({\widehat{\sigma }}^{2}\) is an estimate of the variance of the residuals, \(d\) is the number of parameters, and \(n\) is the sample size.
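Equation (6) is simple enough to evaluate directly; the SSR values below are hypothetical and only illustrate that a larger subset can win if it reduces the residual sum of squares enough:

```python
def mallows_cp(ssr_d, sigma2_hat, n, d):
    """Mallows' Cp of Eq. (6): SSR_d / sigma2_hat - n + 2*d."""
    return ssr_d / sigma2_hat - n + 2 * d

# Hypothetical comparison of two subsets; the lower Cp is preferred.
cp_small = mallows_cp(ssr_d=120.0, sigma2_hat=1.0, n=100, d=3)   # 3 predictors
cp_large = mallows_cp(ssr_d=95.0, sigma2_hat=1.0, n=100, d=10)   # 10 predictors
```

Here the ten-predictor subset attains the lower Cp despite the 2d complexity term, because its residual sum of squares is sufficiently smaller.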

The above methods are based on a linear regression model. By adding a penalty term, the coefficients of the input variables are shrunk to varying degrees, which enables feature selection.

Akaike information criterion

The Akaike information criterion (AIC) is widely used to evaluate the fitness of statistical models (Eq. 7). The first term reflects goodness of fit, and the second term penalizes the number of parameters. The best model is the one with the minimum AIC (Ingdal et al. 2019).

$$ \begin{array}{*{20}c} {{\text{AIC}} = - 2\ln \hat{L} + 2d} \\ \end{array} $$
(7)

where \(\widehat{L}\) is the maximized likelihood and \(d\) is the number of parameters.

Bayesian information criterion

Similar to the AIC, the Bayesian information criterion (BIC) is also a basic criterion for model selection, striking a balance between simplicity and mapping ability. The standard expression of the BIC is given in Eq. (8). Different from the AIC, its second term depends on both the number of parameters and the sample size. The model with the minimum BIC achieves the best balance between simplicity and mapping ability (Liu et al. 2022b).

$$ \begin{array}{*{20}c} {{\text{BIC}} = - 2\ln \hat{L} + d\ln n} \\ \end{array} $$
(8)

where \(\widehat{L}\) is the maximized likelihood, \(d\) is the number of parameters, and \(n\) is the sample size.
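For Gaussian errors, \(-2\ln \widehat{L}\) reduces to \(n\ln({\text{SSR}}/n)\) up to additive constants, giving a convenient form for comparing Eqs. (7) and (8); the SSR value below is hypothetical:

```python
import math

def aic_gaussian(ssr, n, d):
    """Eq. (7) with -2 ln L_hat replaced by n*ln(SSR/n)
    (Gaussian errors, additive constants dropped)."""
    return n * math.log(ssr / n) + 2 * d

def bic_gaussian(ssr, n, d):
    """Eq. (8) under the same simplification; the d*ln(n) term
    penalizes parameters more heavily than AIC once n > e^2."""
    return n * math.log(ssr / n) + d * math.log(n)

# With n = 353 daily records, BIC charges ln(353) ≈ 5.9 per parameter vs AIC's 2.
a = aic_gaussian(ssr=100.0, n=353, d=8)
b = bic_gaussian(ssr=100.0, n=353, d=8)
```

For the 353-day dataset used here, the BIC therefore prefers more parsimonious models than the AIC does.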

Minimum redundancy–maximum relevance

The minimum redundancy-maximum relevance (mRMR) algorithm addresses the problem that, owing to redundant variables, the feature subset that maximizes the correlation with the target variable does not necessarily yield the best prediction accuracy (Ding and Peng 2003). Mutual information is used to measure the correlation between two variables. The mutual information \(I\) of two discrete random variables \(x\) and \(y\) is defined as follows:

$$ \begin{array}{*{20}c} {I\left( {x,y} \right) = \mathop \sum \limits_{i,j} p\left( {x_{i} ,y_{j} } \right)\log \frac{{p\left( {x_{i} ,y_{j} } \right)}}{{p\left( {x_{i} } \right)p\left( {y_{j} } \right)}}} \\ \end{array} $$
(9)

where \(p(x,y)\) is the joint probability distribution of the two variables \(x\) and \(y\), and \(p(x)\) and \(p(y)\) are the marginal probabilities of \(x\) and \(y\), respectively.
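Equation (9) can be evaluated from a joint distribution table; the two toy tables below show the extreme cases of independence (I = 0) and a deterministic relation (I = ln 2 nats):

```python
import numpy as np

def mutual_info(p_xy):
    """Mutual information of Eq. (9), in nats, from a joint
    distribution table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # skip zero-probability cells
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independent variables carry zero mutual information...
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
# ...while a deterministic relation carries the full entropy of x.
p_dep = np.array([[0.5, 0.0], [0.0, 0.5]])
```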

For the mRMR algorithm, the mutual information \(I(x,c)\) is used to find the feature subset \(S\) among the \(m\) features that is most closely related to the target variable \(c\).

$$ \begin{array}{*{20}c} {\max D\left( {S,c} \right),D = \frac{1}{\left| S \right|}\mathop \sum \limits_{{x_{i} \in S}} I\left( {x_{i} ,c} \right)} \\ \end{array} $$
(10)

The minimum redundant feature condition is:

$$ \begin{array}{*{20}c} {\min R\left( S \right),R = \frac{1}{{\left| S \right|^{2} }}\mathop \sum \limits_{{x_{i} ,x_{j} \in S}} I\left( {x_{i} ,x_{j} } \right)} \\ \end{array} $$
(11)

Then, the maximal correlation-minimal redundancy feature set \(S\) is:

$$ \begin{array}{*{20}c} {{\text{mRMR}} = \max \left[ {\frac{1}{\left| S \right|}\mathop \sum \limits_{{x_{i} \in S}} I\left( {x_{i} ,c} \right) - \frac{1}{{\left| S \right|^{2} }}\mathop \sum \limits_{{x_{i} ,x_{j} \in S}} I\left( {x_{i} ,x_{j} } \right)} \right]} \\ \end{array} $$
(12)

By selecting the subset of variables that maximizes this criterion, redundant variables are eliminated while relevant ones are retained.
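A common greedy approximation of Eq. (12) adds, at each step, the feature with the best relevance-minus-mean-redundancy score; the mutual-information values below are hypothetical:

```python
import numpy as np

def mrmr_select(relevance, redundancy, k):
    """Greedy mRMR sketch: relevance[i] = I(x_i, c),
    redundancy[i, j] = I(x_i, x_j). Selects k features."""
    selected = [int(np.argmax(relevance))]            # start with max relevance
    remaining = set(range(len(relevance))) - set(selected)
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in sorted(remaining):
            # relevance minus average redundancy with already-selected features
            score = relevance[i] - redundancy[i, selected].mean()
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical MI values: feature 2 is relevant but highly redundant with feature 0.
rel = np.array([0.9, 0.4, 0.8])
red = np.array([[0.0, 0.1, 0.7],
                [0.1, 0.0, 0.1],
                [0.7, 0.1, 0.0]])
chosen = mrmr_select(rel, red, k=2)
```

Feature 1 is chosen over the more relevant feature 2 precisely because feature 2 duplicates information already carried by feature 0.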

Model construction

XGB

Extreme gradient boosting (XGB) is an optimized gradient boosting algorithm (Chen and Guestrin 2016). When a tree is added, a new function \(f\left(x\right)\) is learned to fit the residual of the previous prediction. Once \(K\) trained trees are obtained, each sample falls to a leaf node in every tree, and every leaf corresponds to a score. The predicted value is the summation of the scores produced by the different trees, which is calculated as follows:

$$ \begin{array}{*{20}c} {\hat{y} = \mathop \sum \limits_{k = 1}^{K} f_{k} \left( {x_{i} } \right)} \\ \end{array} $$
(13)

where \(K\) is the number of trees and \({f}_{k}\left({x}_{i}\right)\) is the score of the k-th tree for sample \({x}_{i}\).

The objective function consists of a loss function and a regularization penalty:

$$ \begin{array}{*{20}c} {{\text{Obj}}\left( \theta \right) = \mathop \sum \limits_{i = 1}^{n} l\left( {y_{i} ,\widehat{{y_{i} }}} \right) + \mathop \sum \limits_{k = 1}^{K} \Omega \left( {f_{k} } \right)} \\ \end{array} $$
(14)

where \(l\left({y}_{i},{\widehat{y}}_{i}\right)\) is the error of the i-th sample and \(\Omega \left({f}_{k}\right)\) is the regularization penalty term of the k-th tree.

$$ \Omega \left( {f_{k} } \right) = \gamma T + \frac{1}{2}\lambda \mathop \sum \limits_{j = 1}^{T} \omega_{j}^{2} $$
(15)

where \(T\) is the total number of leaf nodes in the k-th tree, \({\omega }_{j}\) is the weight of the j-th leaf node, and \(\gamma \) and \(\lambda \) are regularization coefficients: \(\gamma \) penalizes the number of leaf nodes, and \(\lambda \) keeps the leaf weights small.

For the t-th iteration, the objective function can be expressed as follows:

$$ {\text{Obj}}^{\left( t \right)} = \sum\limits_{i = 1}^{n} {l\left[ {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right]} + \Omega \left( {f_{t} } \right) + C $$
(16)

where \({f}_{t}\left({x}_{i}\right)\) is the newly added t-th tree and \(C\) is the complexity of the previous t − 1 trees, that is, \(C={\sum }_{i=1}^{t-1}\Omega \left({f}_{i}\right)\).

Treating \(l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\) and \(C\) as constants and applying a second-order Taylor expansion to the loss function, the objective can be approximated as:

$$ \begin{array}{*{20}c} {{\text{Obj}}^{\left( t \right)} \simeq \mathop \sum \limits_{j = 1}^{T} \left[ {G_{j} w_{j} + \frac{1}{2}\left( {H_{j} + \lambda } \right)w_{j}^{2} } \right] + \gamma T} \\ \end{array} $$
(17)

where \({G}_{j}={\sum }_{i\in {I}_{j}}{\partial }_{{\widehat{y}}^{\left(t-1\right)}}l\left({y}_{i},{\widehat{y}}^{\left(t-1\right)}\right)\), \({H}_{j}={\sum }_{i\in {I}_{j}}{\partial }_{{\widehat{y}}^{\left(t-1\right)}}^{2}l\left({y}_{i},{\widehat{y}}^{\left(t-1\right)}\right)\), and \({I}_{j}\) is the set of samples assigned to leaf \(j\).

When the partial derivative of the objective function \({{\text{Obj}}}^{\left(t\right)}\) with respect to \({\omega }_{j}\) is set to zero, the optimal weight is obtained as follows:

$$ \begin{array}{*{20}c} {\omega_{j}^{*} = - \frac{{G_{j} }}{{H_{j} + \lambda }}} \\ \end{array} $$
(18)

The optimal objective function value is

$$ \begin{array}{*{20}c} {{\text{Obj}}^{\left( t \right)} = - \frac{1}{2}\mathop \sum \limits_{j = 1}^{T} \frac{{G_{j}^{2} }}{{H_{j} + \lambda }} + \gamma T} \\ \end{array} $$
(19)
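Equations (18) and (19) are easy to verify numerically from per-leaf gradient statistics; the G and H values below are illustrative:

```python
import numpy as np

def leaf_weights_and_obj(G, H, lam, gamma):
    """Optimal leaf weights (Eq. 18) and objective value (Eq. 19)
    from per-leaf gradient sums G_j and Hessian sums H_j."""
    G, H = np.asarray(G, float), np.asarray(H, float)
    w = -G / (H + lam)                                    # Eq. (18)
    obj = -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)  # Eq. (19), T = len(G)
    return w, obj

# Two leaves under squared-error loss (H_j = number of samples in leaf j).
w, obj = leaf_weights_and_obj(G=[6.0, -4.0], H=[3.0, 1.0], lam=1.0, gamma=0.5)
```

A larger λ shrinks the weights toward zero, and a larger γ raises the objective per leaf, discouraging further splits.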

Multilayer perceptron artificial neural network

As a robust ML technology, the multilayer perceptron artificial neural network (MLPANN) model has been widely used for prediction in various energy systems. The basic MLPANN consists of three layers: an input layer, a hidden layer and an output layer (Faegh et al. 2021). Given a series of characteristics \(X=({x}_{1}, {x}_{2},...)\) and a target \(Y\), a multilayer perceptron can learn the relationships between the features and targets for classification or regression purposes.

Light gradient boosting machine

The light gradient boosting machine (LightGBM) is a distributed gradient boosting framework based on a decision tree algorithm. For a given training set, LightGBM obtains a strong learner by combining multiple classification and regression trees (Sun et al. 2022).

Support vector regression

Support vector regression (SVR) is a powerful approach for problems with small sample sizes and high dimensionality (Huang et al. 2022). The SVR algorithm seeks a regression hyperplane that minimizes the deviation of all data points in the set from that plane.

Model evaluation

To evaluate different models in terms of energy consumption prediction, the coefficient of determination (R2), MAE, mean absolute percentage error (MAPE), and RMSE can be calculated as follows:

$$ \begin{array}{*{20}c} {R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \hat{y}_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \frac{1}{n}\mathop \sum \nolimits_{j = 1}^{n} y_{j} } \right)^{2} }}} \\ \end{array} $$
(20)
$$ \begin{array}{*{20}c} {{\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\hat{y}_{i} - y_{i} } \right)^{2} } } \\ \end{array} $$
(21)
$$ \begin{array}{*{20}c} {{\text{MAE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\hat{y}_{i} - y_{i} } \right|} \\ \end{array} $$
(22)
$$ \begin{array}{*{20}c} {{\text{MAPE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{\hat{y}_{i} - y_{i} }}{{y_{i} }}} \right| \times 100\% } \\ \end{array} $$
(23)

where \(n\) is the sample size, \({y}_{i}\) is the actual value, \({\widehat{y}}_{i}\) is the predicted value and \(i\) is the index.

R2 is a classic indicator of the agreement between actual and predicted values. Both the MAE and RMSE are expressed in the units of the target variable, but the RMSE magnifies larger errors. A MAPE of 0% indicates a perfect model, while a value greater than 100% indicates an inferior one.
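Equations (20)-(23) can be sketched in a few lines of NumPy:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """R2, RMSE, MAE and MAPE of Eqs. (20)-(23)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean((y_hat - y) ** 2))),
        "MAE": float(np.mean(np.abs(y_hat - y))),
        "MAPE": float(np.mean(np.abs((y_hat - y) / y)) * 100),  # percent
    }

m = regression_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
```

Note that the MAPE is undefined when any actual value is zero, which is not an issue for strictly positive quantities such as EC.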

SEGA

The genetic algorithm (GA) is a heuristic algorithm inspired by Darwinian evolution that simulates natural selection through reproduction, crossover and mutation. During the evolutionary process, individuals with low fitness are eliminated. Through repeated selection and these three genetic operators, a near-optimal result can be obtained. However, it has been proven that the canonical GA, which uses only the selection, crossover and mutation operators with crossover and mutation probabilities in (0, 1), does not converge to the global optimum. Thus, the strengthened elitist GA (SEGA) adopts an elitism strategy that copies the best individual directly into the next generation without applying the crossover operator (Fig. 2).

Fig. 2
figure 2

Flow chart of the strengthen elitist genetic algorithm
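A toy real-coded sketch of the elitism idea (tournament selection, uniform crossover, Gaussian mutation; not the study's SEGA implementation), minimizing a hypothetical quadratic "energy" surface:

```python
import random

def sega_minimize(f, bounds, pop=30, gens=60, pc=0.9, pm=0.2, seed=0):
    """Elitist GA sketch: the best individual is copied unchanged into
    each new generation, so fitness never regresses."""
    rnd = random.Random(seed)
    P = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop)]
    for _ in range(gens):
        elite = min(P, key=f)[:]              # elitism: best survives unchanged
        nxt = [elite]
        while len(nxt) < pop:
            a = min(rnd.sample(P, 3), key=f)  # tournament selection
            b = min(rnd.sample(P, 3), key=f)
            if rnd.random() < pc:             # uniform crossover
                child = [x if rnd.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = a[:]
            if rnd.random() < pm:             # Gaussian mutation of one gene
                j = rnd.randrange(len(bounds))
                lo, hi = bounds[j]
                child[j] = min(max(child[j] + rnd.gauss(0, 0.3), lo), hi)
            nxt.append(child)
        P = nxt
    return min(P, key=f)

# Hypothetical objective with optimum at (1, 2), standing in for the XGB-based EC.
best = sega_minimize(lambda v: (v[0] - 1) ** 2 + (v[1] - 2) ** 2,
                     bounds=[(-5, 5), (-5, 5)])
```

Because the elite is never crossed or mutated, the best fitness is monotonically non-increasing across generations, which is the convergence property the canonical GA lacks.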

Results and discussion

Parameter selection results

If all the available parameters were used as model inputs, the model complexity would be very high, possibly resulting in long training times and poor performance. Therefore, it is necessary to select the inputs first, which is crucial to model performance (Chu et al. 2009). Both the linear and nonlinear selection methods described above were used to select the input parameters (Table 2). The eight parameters with the highest selection frequencies were chosen as inputs, and they can be classified into three categories: water quantity (IFR), water quality (ETN, IAN, ITP and ETP) and management regulation (DO, ANLR and ORP). Furthermore, a Kendall correlation analysis was carried out between all parameters and energy consumption to evaluate these results (Fig. 3). The absolute correlation coefficients of the selected parameters ranked near the top, confirming that the selection was reasonable. Among the eight selected parameters, ETP and ANLR had limited correlations with energy consumption (0 < r < 0.2) (Khamis 2008).

Table 2 Parameter selection frequency
Fig. 3
figure 3

Heatmap of the Kendall correlation coefficient

In terms of water quantity, the IFR exhibited a strong negative correlation with EC. This can be attributed to the fact that most equipment cannot operate at its energy-efficient design point when the IFR deviates from the designed value (Hanna et al. 2018). In addition, poor management and limited regulation may also account for the excessive energy consumption of small-scale WWTPs (Vaccari et al. 2018).

In terms of water quality, the ETN was closely related to nitrogen removal performance. The conventional nitrogen removal process consists of nitrification and denitrification. To achieve sufficient nitrification, a large amount of energy is consumed for aeration; once the required oxygen has been supplied, further aeration not only wastes energy but also increases the effluent NO3–N (Liu et al. 2022a). When the MLSS is low, phosphorus removal depends mainly on the performance of phosphorus-accumulating organisms. The biological phosphorus removal process consists of phosphorus release and phosphorus uptake. Electron acceptors such as oxygen are necessary for phosphorus uptake, so more aeration is needed when the ITP is higher.

In terms of management regulation, a lower OLR implies a longer sludge retention time, which increases the MLSS. Hence, a lower ANLR was associated with less NH4+–N, and DO should theoretically correlate negatively with the OLR. Owing to the excess aeration strategy, however, DO usually fluctuated more strongly than the OLR demanded: as the OLR and influent COD increased, excessive aeration was supplied to ensure that the effluent COD satisfied the discharge standard, so DO was positively correlated with the OLR. The excess DO at the end of the aerobic tank returned to the anaerobic tank through the external recycle, resulting in a higher ORP; therefore, the correlation between energy consumption and the ORP at the end of the anaerobic tank was weak. Moreover, when the influent COD decreased, aeration control lagged behind the actual demand, resulting in excessive energy consumption for aeration.

Performance analysis of the XGB model

An XGB model and three other models were established based on the relationships between energy consumption and the selected parameters. The first 70% of the data were used as the training set, and the remaining 30% as the test set. A grid search was used to optimize the hyperparameters of all models (Table 3). The prediction performance is shown in Table 4 and Fig. 4. Compared with the other methods, the XGB model performed best. Although a gap remains between the real and predicted values, their variation trends are almost identical, which verifies that the model is feasible for predicting energy consumption.

Table 3 Hyperparameters of each model
Table 4 XGB model evaluation indicators
Fig. 4
figure 4

Comparison between the values predicted by XGB and the actual values. a Training set; b testing set
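The chronological 70/30 split described above (no shuffling, so the test period is strictly later than the training period) can be sketched as:

```python
import numpy as np

def time_ordered_split(X, y, train_frac=0.7):
    """Chronological split: the first 70% of the daily records train
    the model, the last 30% test it. No shuffling, so the test set
    never leaks information from the future into training."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]

# Hypothetical stand-in for the 353 daily records with 8 selected inputs.
X = np.arange(353 * 8, dtype=float).reshape(353, 8)
y = np.arange(353, dtype=float)
X_tr, X_te, y_tr, y_te = time_ordered_split(X, y)
```

Preserving temporal order matters here because the records are a daily time series; a random split would let the model train on days that follow its test days.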

Parameter impact analysis

The F score was used to evaluate the influence of each input on energy consumption (Fig. 5). In descending order of effect, the parameters ranked as IFR, DO, IAN, ITP, ETN, ANLR, ORP and ETP. This order was similar to that obtained by the Kendall analysis: when a high-ranking parameter changes markedly, energy consumption changes accordingly.

Fig. 5
figure 5

F score of XGB

Among the uncontrollable parameters, the IFR, IAN and ITP had large influences. This means that once the treatment process and designed flow rate have been fixed, the energy consumption level is largely determined. Among the controllable parameters, ETN, DO and ANLR can be adjusted to influence energy consumption.

Energy saving performance

The energy savings achievable under different conditions were evaluated. For each parameter, its average, maximum and minimum values were fed into the established model while the other seven variables were held at their average values (Fig. 6). The energy saving efficiency was most sensitive to the influent flow rate, consistent with its maximal correlation coefficient. Furthermore, the variations in energy saving efficiency were also consistent with the Kendall correlation coefficients. In addition, the amount of energy saved at the maximum (or minimum) value was always similar to the amount wasted at the minimum (or maximum) value.

Fig. 6
figure 6

Energy saving efficiency
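The one-at-a-time evaluation described above can be sketched as follows; the linear surrogate stands in for the trained XGB model and is purely hypothetical:

```python
import numpy as np

def one_at_a_time(predict, X):
    """For each input, evaluate the model at that input's min / mean / max
    while the other inputs are held at their means."""
    means = X.mean(axis=0)
    out = {}
    for j in range(X.shape[1]):
        rows = np.tile(means, (3, 1))                       # all inputs at mean
        rows[:, j] = [X[:, j].min(), means[j], X[:, j].max()]  # vary input j only
        out[j] = predict(rows)
    return out

# Hypothetical surrogate: EC rises with input 0 and falls with input 1.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
ec = one_at_a_time(lambda R: R @ np.array([0.5, -0.02]), X)
```

The spread of each input's three EC values indicates its sensitivity, which is how the per-parameter energy saving efficiencies of Fig. 6 were obtained.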

In practical applications, these parameters often change simultaneously. Considering the synergy among the different parameters, the SEGA was used to minimize the energy consumption, with XGB serving as the mapping function between the eight parameters and EC. In the SEGA, the eight parameters are the genes of each individual, and the EC calculated by XGB is the fitness function. The upper and lower bounds (UBs and LBs, respectively) of the eight input parameters were determined by the discharge standard and the extreme values in the historical data. Owing to the randomness of the SEGA, several optimization runs were performed (Table 5).

Table 5 Optimized parameters

In scenarios 1–3, the UBs and LBs of the eight parameters were varied to investigate the resulting changes in the optimal EC. In scenario 1, the UBs of ETN and ETP were the maximum allowed values, 15 mg L−1 and 0.5 mg L−1, respectively; the other UBs were the historical maxima, and the LBs were zero (except that of ORP). The UBs of scenario 2 were the same as those of scenario 1, but its LBs were the historical minima. In scenario 3, the UB of ORP was set to 0, and the other boundary conditions remained the same as in scenario 2; this restriction on the search range of ORP did not affect the final energy savings. However, since water quality and quantity are infeasible to regulate in practice, their UBs and LBs were set to the mean values in scenario 4, and only the management regulation parameters were optimized, with their boundary conditions set according to the historical extreme values.

DO and ORP probes are widely used in practical applications. Based on previous research, the DO concentration at the end of the aerobic tank should be 1–5 mg L−1 (Qiu et al. 2017), and to achieve biological phosphorus removal, the ORP in an anaerobic environment should be no higher than −50 mV (Tae et al. 2005; Tang et al. 2012). The optimal parameters obtained from the GA were essentially within these reasonable ranges. According to the above results, 13–27% of the total energy consumption (22% on average) could be saved by optimizing the management process, while the effluent met the discharge standard at all times. The minimum energy consumption was obtained when the IFR was maximized, the IAN was close to its average value and the ANLR was high. In practice, energy savings could be achieved by setting the management regulation parameters near the optimization results: the optimal ORP could be reached by adjusting the internal reflow rate, and flexibly switching the air pumps and adjusting the air supply could bring DO to the value given by the GA. For the ANLR, because the IAN is uncontrollable, the difference between IAN and EAN was the determining factor. These results show that energy savings in WWTPs can be achieved by adjusting the operation parameters through the GA, which provides a simple and feasible energy saving strategy.

Conclusions

The energy consumption levels of WWTPs can be predicted and optimized by XGB and the SEGA. In terms of prediction, the XGB model achieved good performance, which was verified by a series of indicators, namely the R2, MAE, MAPE and RMSE metrics. The most important parameter influencing energy consumption is the influent flow rate; therefore, compared with small-scale WWTPs, large-scale WWTPs with high IFR values incur lower EC and operating costs. In terms of optimization, 13–27% of the total energy consumption (22% on average) could be saved with the optimized management regulation parameters obtained from the SEGA model. This research provides a convenient and reliable strategy for saving energy in WWTPs, which can be applied to other treatment processes in practical applications.