Introduction

Climate change is one of the biggest challenges facing humanity, with the risk of dramatic consequences if certain limits of warming are exceeded (Pörtner et al. 2022). To mitigate climate change, the energy system must be decarbonized. A difficulty in decarbonization is that renewable energy supply fluctuates depending on the weather. However, supply and demand must be balanced in the grid at every moment to prevent outages (Machowski et al. 1997). In addition, with the ongoing decentralization of the renewable energy supply and the installation of large consumers, such as electric vehicle chargers and heat pumps, low-voltage grids are expected to reach their limits (Çakmak and Hagenmeyer 2022). Thus, to balance the grid and to avoid congestion, advanced operation and control mechanisms must be installed in the smart grid of the future (Ramchurn et al. 2012; Haben et al. 2021). This requires accurate forecasts on various aggregation levels, down to fine-grained load forecasts at the low-voltage level (Haben et al. 2021; Ordiano et al. 2018). Such fine-grained load forecasts can be used for demand-side management, energy management systems, distribution grid state estimation, grid management, storage optimization, peer-to-peer trading, peak shaving, smart electric vehicle charging, dispatchable feeders, provision of feedback to customers, anomaly detection and intervention evaluation (Haben et al. 2021; Yildiz et al. 2017; Voß et al. 2018; Grabner et al. 2023; Werling et al. 2022). Moreover, the aggregation of fine-grained load forecasts can result in a more accurate forecast of the aggregated load (Hong et al. 2020).

With the smart meter rollout, fine-grained electrical load data will become available for an increasing number of clients. In such a scenario where load time series from multiple clients are available, different model training strategies are possible. The goal of our work is to compare training strategies for the Transformer (Vaswani et al. 2017), which was recently used for load forecasting (Zhang et al. 2022; Hertel et al. 2022a, b; Cao et al. 2022; Giacomazzi et al. 2023; Huy et al. 2022).

Fig. 1: The three training strategies, with models depicted as networks. An example with three load time series, four days input and one day output is shown. (a) Multivariate: one model processes all load time series simultaneously; (b) local: separate models (blue, orange, green) process each load time series; (c) global: one model (black) processes all load time series one at a time

Task definition

We address the following multiple load time series forecasting problem: At a time step t, given the history of the electrical load of C clients \(x_0^c, \ldots, x_t^c\) with \(1 \le c \le C\), the goal is to predict the next h electrical load values \(x_{t+1}^c, \ldots, x_{t+h}^c\) for all clients \(1 \le c \le C\), where h is called the forecast horizon.
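To make the notation concrete, the following minimal NumPy sketch (with hypothetical array names) only illustrates the shapes involved in this task.

```python
import numpy as np

C, t, h = 3, 168, 24                  # three clients, one week of hourly history, one day ahead

history = np.random.rand(C, t + 1)    # x_0^c, ..., x_t^c for every client c
forecast = np.zeros((C, h))           # placeholder for x_{t+1}^c, ..., x_{t+h}^c

# A forecasting model maps the history (plus optional calendar features)
# to the next h load values of every client:
assert history.shape == (C, t + 1) and forecast.shape == (C, h)
```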

Contribution

We compare three training strategies for the Transformer in a scenario with multiple load time series. The training strategies are depicted in Fig. 1.

  1. A multivariate model training strategy, where a single model gets all load time series as input and forecasts all load time series simultaneously.

  2. A local model training strategy, where a separate univariateFootnote 1 model is trained for each load time series.

  3. A global model training strategy, where a generalized univariate model is used to forecast each load time series separately.

We compare our models with the models from related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022), as well as with multiple baselines. In particular, we compare with the linear models used in Zeng et al. (2022) to investigate whether Transformers are effective for load forecasting and which training strategy is the most promising one.

Paper structure

First, we describe the "Related work". Then, the Transformer architecture and the training strategies are described in the "Approach". This is followed by the "Experimental setup", "Results" and a "Discussion". Finally, the paper concludes with the "Conclusion and future work".

Related work

This section first presents related work on long time series forecasting and load forecasting with Transformers. Most of the load forecasting literature uses local models, but few works use global models, which are presented next. The global training strategy can be understood as a transfer learning technique. We therefore discuss transfer learning in the field of load forecasting at the end of this section.

As Transformers are often used for long time series forecasting with horizons of up to one month, various extensions to the Transformer architecture exist that aim to reduce the time and space complexity. This is done by the Informer using ProbSparse self-attention (Zhou et al. 2021), by the Autoformer using auto-correlation (Wu et al. 2021), by the FEDformer using frequency enhanced decomposition (Zhou et al. 2022) and by PatchTST using patching (Nie et al. 2022). The proposed models are multivariate or local, except for the global PatchTST (Nie et al. 2022). The experiments in these works are conducted on six datasets from different domains, including one load forecasting dataset, which we also use in our experiments (see section Datasets). A global linear model called LTSF-Linear (Zeng et al. 2022) gives better results than the aforementioned multivariate Transformers. In parallel to our work, global Transformers were shown to beat the aforementioned multivariate Transformers (Murphy and Chen 2023). However, this work does not optimize the model’s lookback size and therefore achieves sub-optimal results. PatchTST (Nie et al. 2022) is a global Transformer with patched inputs and is superior to LTSF-Linear (Zeng et al. 2022) on the six datasets.

Transformer architectures for short-term load forecasting are designed to use external calendar and weather features (Wang et al. 2022; Huy et al. 2022). An evaluation of different architectures is undertaken in Hertel et al. (2022a). Further work modifies the architecture for multi-energy load forecasting (Wang et al. 2022). Upstream decompositions are used to improve the forecast quality (Ran et al. 2023). These models are not compared on a common benchmark dataset, but evaluated on different datasets on city or national level. There, usually only one load time series is available, which only allows for local models. Furthermore, the models are not compared to the Transformer architectures for long time series.

Global load forecasting models are already used with convolutional neural networks (Voß et al. 2018) and N-BEATS (Grabner et al. 2023). A mixture between a multivariate and a global model is investigated in Shi et al. (2017), where a single recurrent neural network (RNN) model is trained on randomly pooled subsets of the time series. Some works cluster the time series and then train global or multivariate models for each cluster (Han et al. 2020; Yang and Youn 2021). PatchTST (Nie et al. 2022) is a global Transformer with patched inputs. We compare to this approach in our experiments.

The authors of Pinto et al. (2022) and Himeur et al. (2022) present current literature on transfer learning in the domain of energy systems. They define a taxonomy of transfer learning methods and discuss different strategies of using transfer learning with buildings from different domains. Two works (Nawar et al. 2023; Gao et al. 2022) use transfer learning by pre-training and fine-tuning Transformers. Transferability from one building to another is tested in Nawar et al. (2023), and from one district to another in Gao et al. (2022). In contrast to these works, our transfer learning approach is to train a generalized model on the data from many clients, without fine-tuning for a target time series.

Approach

We use an encoder–decoder Transformer (Vaswani et al. 2017) as a load forecasting model. This model architecture has self-attention and cross-attention as its main components and was initially used for machine translation. It was used as a forecasting model in Wu et al. (2020) and later adopted for load forecasting (Zhang et al. 2022; Hertel et al. 2022a, b). We use the model implementation from Hertel et al. (2022a).

The encoder gets L vectors as input, which represent the last L time steps, where L is called the lookback size. Each input vector consists of one (in the case of local and global models) or C (in the case of multivariate models) load values, and nine additional time and calendar features. The features are the hour of the day, the day of the week and the month (all cyclically encoded with a sine and a cosine function), whether it is a workday, whether it is a holiday and whether the next day is a workday (all binary features). The input to the decoder consists of h vectors, which represent the following h time steps for which a forecast will be made. In the decoder input, the load values are set to zero, so that each value is forecasted independently from the previous forecasted values, allowing for a direct multi-step forecast instead of generating all values iteratively. The input vectors to the encoder and the decoder are first fed through linear layers to increase the dimensionality to the hidden dimension of the model \(d_{\text{model}}\). Both the encoder and the decoder consist of multiple layers with eight self-attention heads and the decoder layers have eight additional masked cross-attention heads. Finally, a linear layer transforms the h decoder output vectors into a forecast with \(h \times 1\) (for local and global models) or \(h \times C\) (for multivariate models) values. We varied the number of encoder and decoder layers and the hidden dimension \(d_{\text{model}}\), and found three layers with \(d_{\text{model}} = 128\) to give the best results. The full model architecture is shown in Fig. 2.
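For illustration, the following sketch outlines the described input encoding and model dimensions in PyTorch. It uses the built-in nn.Transformer as a simplified stand-in for the implementation from Hertel et al. (2022a); the exact feature construction (e.g., the holiday flag and the cyclical encoding of the month) and the remaining hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

def calendar_features(index: pd.DatetimeIndex) -> np.ndarray:
    """Nine time and calendar features: hour, weekday and month cyclically
    encoded with sine/cosine, plus three binary flags (workday, holiday,
    next day is a workday). The holiday flag is a placeholder here."""
    cyc = lambda v, period: (np.sin(2 * np.pi * v / period),
                             np.cos(2 * np.pi * v / period))
    dow = index.dayofweek
    feats = [*cyc(index.hour, 24), *cyc(dow, 7), *cyc(index.month - 1, 12),
             (dow < 5).astype(float),              # workday
             np.zeros(len(index)),                 # holiday (placeholder)
             (((dow + 1) % 7) < 5).astype(float)]  # next day is a workday
    return np.stack(feats, axis=-1)                # shape (len(index), 9)

class LoadTransformer(nn.Module):
    """Encoder-decoder Transformer for direct multi-step load forecasting."""
    def __init__(self, n_series=1, n_feats=9, d_model=128, n_layers=3, n_heads=8):
        super().__init__()
        self.enc_embed = nn.Linear(n_series + n_feats, d_model)
        self.dec_embed = nn.Linear(n_series + n_feats, d_model)
        self.transformer = nn.Transformer(d_model, n_heads, n_layers, n_layers,
                                          batch_first=True)
        self.head = nn.Linear(d_model, n_series)

    def forward(self, enc_x, dec_x):
        # enc_x: (batch, L, n_series + n_feats) past loads and features,
        # dec_x: (batch, h, n_series + n_feats) with the load entries set to zero
        z = self.transformer(self.enc_embed(enc_x), self.dec_embed(dec_x))
        return self.head(z)                        # (batch, h, n_series)

# Example with one week lookback and one day horizon (local/global variant):
model = LoadTransformer(n_series=1)
idx = pd.date_range("2014-01-01", periods=168 + 24, freq="h")
feats = torch.tensor(calendar_features(idx), dtype=torch.float32)
load = torch.randn(168 + 24, 1)
enc = torch.cat([load[:168], feats[:168]], dim=-1).unsqueeze(0)
dec = torch.cat([torch.zeros(24, 1), feats[168:]], dim=-1).unsqueeze(0)
print(model(enc, dec).shape)                       # torch.Size([1, 24, 1])
```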

Fig. 2: Architecture of the Transformer forecasting model. The input and output dimensions differ for the multivariate model and the local and global models. The shown dimensions refer to the Electricity dataset with 321 clients

Training strategies

We compare multivariate, local and global Transformers. The training strategies are depicted in Fig. 1 and are further explained in the following. Details on the inputs, outputs, number of models and training data size for each training strategy are given in Table 1.

  • Multivariate training strategy: In the input to the model, each time step is represented by a vector of size \(C + f\), where C is the number of load time series and f is the number of calendar features. The model forecasts C values for the next h time steps, i.e. its output consists of h vectors of size C. A single model is used to forecast all time series simultaneously.

  • Local training strategy: Local models get only one time series as input and generate a forecast for this time series. In the input, each time step is represented by a vector with \(f + 1\) entries for the f calendar features and the electrical load value. C separate models are trained for the C time series, each using the training data from one time series.

  • Global training strategy: The global approach is a single model that generalizes over all load time series. The model gets one load time series as input and generates a forecast for that load time series. In contrast to the local models, only one global model is trained on samples from all load time series, and this model is used to forecast all load time series. This results in C times as much training data for the global model as for a local model. To generate forecasts for all C time series, the global model is used C times with the history of one load time series as input.
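The following NumPy sketch (with hypothetical helper names) illustrates how the training samples differ between the three strategies for a lookback L and a horizon h.

```python
import numpy as np

def windows(series, lookback, horizon):
    """Sliding windows over a (T, C) array; returns inputs of shape
    (N, lookback, C) and targets of shape (N, horizon, C)."""
    starts = range(series.shape[0] - lookback - horizon + 1)
    X = np.stack([series[s:s + lookback] for s in starts])
    Y = np.stack([series[s + lookback:s + lookback + horizon] for s in starts])
    return X, Y

T, C, L, h = 1000, 3, 168, 24
loads = np.random.rand(T, C)

# Multivariate: one model, every sample contains all C series.
X_mv, Y_mv = windows(loads, L, h)                    # (N, L, C), (N, h, C)

# Local: C separate models, each trained only on its own series.
local = [windows(loads[:, [c]], L, h) for c in range(C)]

# Global: one univariate model, samples from all C series are pooled,
# giving roughly C times as much training data as one local model.
X_gl = np.concatenate([x for x, _ in local])         # (C * N, L, 1)
Y_gl = np.concatenate([y for _, y in local])         # (C * N, h, 1)
print(X_mv.shape, local[0][0].shape, X_gl.shape)
```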

Table 1 Training strategy details for the Electricity dataset with 321 load time series, 2.1 years training data and nine time and calendar features

Experimental setup

Datasets

As recommended in recent literature reviews on load forecasting (Haben et al. 2021; Hong et al. 2020; vom Scheidt et al. 2020), we conduct experiments on multiple datasets, namely the Electricity and the Ausgrid solar home datasets. For both datasets we make a temporal split and use the first 70% of each time series for training, the next 10% for validation, and the last 20% for testing, as in related work (Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022; Zeng et al. 2022).
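A minimal sketch of this temporal split, assuming the loads are given as a (T, C) array, could look as follows.

```python
import numpy as np

def temporal_split(loads, train_frac=0.7, val_frac=0.1):
    """First 70% of each series for training, next 10% for validation,
    last 20% for testing (split along the time axis of a (T, C) array)."""
    T = loads.shape[0]
    i_train = int(T * train_frac)
    i_val = int(T * (train_frac + val_frac))
    return loads[:i_train], loads[i_train:i_val], loads[i_val:]

loads = np.random.rand(26304, 321)         # roughly three years of hourly data, 321 clients
train, val, test = temporal_split(loads)
print(train.shape, val.shape, test.shape)  # (18412, 321) (2631, 321) (5261, 321)
```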

The Electricity datasetFootnote 2 is published in Lai et al. (2018) and used in related work on long-term forecasting (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022; Zeng et al. 2022). It is a subset of the UCI Electricity Load Diagrams datasetFootnote 3 first presented in Rodrigues and Trindade (2018), containing only the time series without missing values. The dataset contains hourly electrical load data from 321 clients of a Portuguese energy supplier. The clients are from different economic sectors, including offices, factories, supermarkets, hotels and restaurants (Rodrigues and Trindade 2018). The time series range from 2012 to 2014.

The Ausgrid solar home datasetFootnote 4 contains solar generation and electrical load data from 300 clientsFootnote 5 of an Australian energy supplier. The clients are private houses with rooftop solar systems. The time series range from July 2010 to June 2013. We only use the electrical load data transformed into hourly resolution.

Comparison methods

We compare our models with models from related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022; Zeng et al. 2022), as well as with a persistence baseline, linear regression models, multi-layer perceptrons and long short-term memory networks.

  • Models from related work: For Informer (Zhou et al. 2021), Autoformer (Wu et al. 2021), FEDformer (Zhou et al. 2022), PatchTST (Nie et al. 2022) and LTSF-Linear (Zeng et al. 2022), we take the results reported in the publications where applicable, and run the code published with the papers otherwise. All parameters except for the forecast horizon are left unchanged.

  • Persistence baseline: The persistence baseline takes the value from one week before the predicted hour as the forecast for the 24 h and 96 h horizons, and the value from one month before the predicted hour as the 720 h forecast (a minimal sketch of this baseline is given after this list).

  • Linear regression: For each load time series, we train a linear regression model with h outputs. The input consists of the last 336 load values and the nine time and calendar features for the current hour when the prediction is made (see “Approach” for a description of the features). The main difference to LTSF-Linear (Zeng et al. 2022) is that the linear regression models are local models, but LTSF-Linear is a global model. Furthermore, the two approaches use different training algorithms and LTSF-Linear does not use time and calendar features.

  • Multi-layer perceptron (MLP): As for the linear regression, we train a local MLP for each load time series. The MLPs get the last 168 load values and the nine time and calendar features of the current hour as input. Using more than 168 load values as input did not improve the results. Each MLP has two hidden layers with ReLU activation (ReLU 2023) and 1024 neurons per layer.

  • Long short-term memory (LSTM): We train multivariate, local and global LSTM (Hochreiter and Schmidhuber 1997) models. We use the same architecture as in Kong et al. (2017), consisting of two LSTM layers with 20 units each and a linear prediction layer. Using larger models did not improve the results.
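As referenced above, the following sketch illustrates the persistence baseline. It assumes a single hourly load series and takes the one-month lag as 720 h.

```python
import numpy as np

def persistence_forecast(load, t, horizon):
    """Repeat past observed values: the value one week (168 h) before each
    predicted hour for the 24 h and 96 h horizons, and the value one month
    (taken as 720 h) before for the 720 h horizon."""
    lag = 168 if horizon <= 96 else 720
    steps = np.arange(t + 1, t + horizon + 1)
    return load[steps - lag]

hourly_load = np.sin(np.arange(5000) * 2 * np.pi / 168) + 0.1 * np.random.rand(5000)
print(persistence_forecast(hourly_load, t=4000, horizon=24).shape)   # (24,)
```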

Training details

All models are trained with the AdamW optimizer (Loshchilov and Hutter 2019) using the mean squared error loss. We use a batch size of 128 and a learning rate of 0.0001 with 1000 warm-up steps and cosine decay with \(\gamma = 0.8\). When testing different lookback sizes L, we find one week to be optimal for the multivariate Transformer and the local Transformers. For the global Transformer, the results improve with increasing lookback size until \(L=336\) (two weeks), and stay almost the same for \(L=720\) (one month). For Transformer models with two weeks input and one month output, the batch size has to be reduced to 64 due to the quadratic memory consumption of the model. For the multivariate Transformer, the batch size is set to 32 as in related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022), and the learning rate is decayed with \(\gamma = 0.5\) after every epoch.
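The following sketch outlines this optimizer setup under simplifying assumptions: AdamW with a learning rate of 0.0001, a linear warm-up over 1000 steps and a multiplicative per-epoch decay. The exact shape of the decay schedule, the stand-in model and the number of steps per epoch are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(168 + 9, 24)            # stand-in for any of the forecasting models
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

warmup_steps, steps_per_epoch, gamma = 1000, 500, 0.8   # decay factor per epoch (assumed)

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps                        # linear warm-up
    return gamma ** ((step - warmup_steps) // steps_per_epoch)  # per-epoch decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):                                    # abbreviated training loop
    x, y = torch.randn(128, 168 + 9), torch.randn(128, 24)   # batch size 128
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```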

Metric

As in related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022; Zeng et al. 2022), every load time series is standardized by subtracting its mean and dividing by its standard deviation, and the metrics are computed on these standardized time series. For every hour \(t \in T_{\text{test}}\) in the test set, a forecasting model predicts the next h hourly loads \(\hat{y}_t^c = \hat{y}_{t,t+1}^c, \ldots, \hat{y}_{t,t+h}^c\) for time series c. Then, the mean absolute error (MAE) between the predictions \(\hat{y}^c = \{\hat{y}_i^c \ \forall \ i \in T_{\text{test}}\}\) and the ground truth \(y^c = y_1^c, \ldots, y_{|T_{\text{test}}|}^c\) is computed. As the final result, the MAE averaged across all C load time series, the \(|T_{\text{test}}|\) evaluation time points and the h forecasting steps is reported.

$$\begin{aligned} \mathrm{MAE}(y, \hat{y}) = \dfrac{1}{C \cdot |T_{\text{test}}| \cdot h} \sum _{c=1}^C \sum _{t\in T_{\text{test}}} \sum _{i=1}^{h} |y_{t+i}^c - \hat{y}_{t,t+i}^c| \end{aligned}$$

The mean squared error (MSE) is computed analogously, using the squared residuals instead of the absolute residuals.
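A minimal NumPy sketch of this evaluation procedure (with hypothetical array shapes) is given below.

```python
import numpy as np

def standardize(series):
    """Standardize each of the C columns of a (T, C) load array."""
    return (series - series.mean(axis=0)) / series.std(axis=0)

def mae(y_true, y_pred):
    """Mean absolute error averaged over clients, evaluation time points
    and forecast steps; inputs have shape (C, |T_test|, h)."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

C, n_test, h = 321, 100, 24
loads_std = standardize(np.random.rand(1000, C))   # metrics use the standardized scale
y_true = np.random.rand(C, n_test, h)
y_pred = y_true + np.random.normal(scale=0.1, size=y_true.shape)
print(mae(y_true, y_pred), mse(y_true, y_pred))
```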

Results

Forecast accuracy

Table 2 shows the MAE results on the two datasetsFootnote 6. On the Electricity dataset, the global Transformer is the best model for the 24 h horizon, and PatchTST is the best model for longer horizons. On the Ausgrid solar home dataset, PatchTST is the best model for all three horizons. The global Transformer beats the local Transformers and the multivariate Transformer across all tested horizons. On average, it reduces the error by 21.8% compared to the multivariate Transformer and by 12.8% compared to the local Transformers. Compared to the best local model, the linear regression, it reduces the error by 2.9%. Compared to the best multivariate model, FEDformer, it reduces the error by 15.4%. All multivariate models, including Informer, Autoformer, FEDformer and the multivariate Transformer, perform poorly and do not beat the persistence baseline with a lag of one week. The local linear regression models are slightly better than the global linear model, LTSF-Linear, on the Electricity dataset, but the opposite holds on the Ausgrid solar home dataset. In five out of six cases, the MLP is slightly worse than the linear regression, with a 1.5% larger error on average. The local LSTMs are better than the local Transformers, but the Transformer is better as a multivariate model and as a global model (except for the one month horizon on the Electricity dataset). The forecast errors are lower on the Electricity dataset than on the Ausgrid solar home dataset, which is more fine-grained and contains single private houses.

Computational cost

The training times are given in Table 3. The local Transformer models need by far the longest time to train. Their training time increases sharply with longer forecast horizons. The multivariate Transformer trains fast and is even faster than the MLPs for short horizons. Training a global Transformer is much faster than training the many local Transformers but takes longer than the linear regression, MLP and the multivariate Transformer. The LSTM always trains faster than the Transformer with the same training strategy.

Table 2 MAE results on the two datasets, with 24, 96 and 720 h forecast horizon
Table 3 Training times in hours, measured on a machine with a Nvidia 3090 RTX GPU

Discussion

Best Transformer training strategy: On the two datasets, the global Transformer is superior to the multivariate and local Transformers. We hypothesize that this is a result of the larger number of training samples for the global model (see Table 1). The Transformer benefits from more training data, even if the training data comes from different sources. The multivariate models, on the other hand, are prone to overfitting.

Best Transformer architecture: PatchTST is the best model in five out of six cases. However, the difference to the global Transformer is small. This shows that the success of PatchTST is mainly a result of its global training strategy. Its improvement upon the global Transformer can be due to the patching mechanism, a better hyperparameter configuration, or the encoder-only architecture. Among the multivariate models, Autoformer (Wu et al. 2021) and FEDformer (Zhou et al. 2022) give better results than the multivariate Transformer. It remains an open question whether these architectures are also better global models than the standard Transformer and PatchTST (Nie et al. 2022). Another promising architecture is the Temporal Fusion Transformer (Huy et al. 2022). In previous work with just one aggregated time series, the Informer (Zhou et al. 2021) also gave better results than the Transformer (Hertel et al. 2022a).

Comparison with the state of the art: The global Transformer achieves a better result for short-term forecasting on the Electricity dataset than related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2022; Zeng et al. 2022), and achieves results close to the best results from PatchTST (Nie et al. 2022) for longer horizons and on the Ausgrid solar home dataset. However, to establish a state of the art for short-term and medium-term load forecasting, a comparison to other forecasting models must be undertaken, including models that are not based on the Transformer architecture and that are more sophisticated than our baselines. Using weather data could improve the forecasts, because some electrical load patterns, such as the usage of electrical heating, are weather-dependent. Weather features could also affect which model gives the best results, because some models might be better at capturing these dependencies than others.

Linear models: As in related work (Zeng et al. 2022), we observe that linear models are strong baselines. The linear regression is the best local model in five out of six cases and is outperformed only by the local MLP for the one day horizon on the Electricity dataset. No general answer can be given as to whether the local linear regression models or the global LTSF-Linear model are better, because each variant is better on one of the two datasets.

Task complexity: For longer horizons, the global Transformer’s performance compared to the linear models deteriorates. This can be due to the increasing complexity when the model forecasts many values simultaneously. We chose a direct multi-step forecasting model because good results were achieved with this procedure before (Nie et al. 2022; Zeng et al. 2022). However, other multi-step forecasting procedures, such as iterative single-step and iterative multi-step forecasting (An and Anh 2015; Sahoo et al. 2020), could be beneficial for long-term forecasting because they reduce the number of forecasted values per model run.

Transfer learning: According to the definition of transfer learning in Pinto et al. (2022), the global training strategy can be seen as a transfer learning method, because the model must transfer knowledge between different types of buildings with different consumption patterns. Pre-training on other tasks than forecasting or on less similar data from domains other than electricity, as well as fine-tuning for a time series of interest, could improve the results. An advantage of the global model is that it can be applied to new time series without retraining. In Hertel et al. (2022b) it was shown that the Transformer generalizes better to new time series than other approaches, but the forecasts are still better when training data from the target time series is available.

Other forecasting tasks: The Transformer model and the different training strategies are not designed for load forecasting in particular, but can also be applied to other forecasting tasks. We hypothesize that the global training strategy can also be beneficial for other datasets containing multiple time series with similar patterns.

Conclusion and future work

We compare three Transformer training strategies for load forecasting on two datasets with multiple years of data for several hundred clients. We show that the multivariate training strategy used in related work on forecasting with Transformers (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022) is not optimal, and that it is better to use a global model instead. This shows that the right training strategy is crucial to get good results from a Transformer. Our approach achieves better results than related work (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022), and comes close to the best results from PatchTST (Nie et al. 2022). In particular, our approach gives better results than the linear models from Zeng et al. (2022) for forecast horizons of one to four days, which shows that, with the right training strategy, Transformers are effective for load forecasting. However, simple linear models give decent results for both short-term and medium-term horizons and train much faster than the Transformers.

In the future, more sophisticated Transformer architectures could be tested with the global training strategy. A comparison to other forecasting methods could be undertaken, and weather data could be incorporated into the models to see how it affects the results. Experiments with other datasets and varying amounts of training data could show under which circumstances, and with how much training data, the global Transformer model surpasses local models and other approaches. Additionally, transfer learning from other tasks and datasets could be tested. A compromise between local and global models could be established by first clustering similar time series and then training one global model per cluster. The cluster-specific models would have less training data than the global model, but could benefit from the training data being more similar. Potentially, the global training strategy could also be beneficial for forecasting tasks other than load forecasting.