1 Introduction

1.1 Overview

The success of machine learning in the field of artificial intelligence has led to the development of a variety of algorithms, each with its own strengths and weaknesses. This has created a level of complexity for practitioners when deciding on an appropriate algorithm for a given dataset, as the decision is not always obvious. Automated machine learning (Auto-ML) is a mechanism proposed to address this situation through the use of an automated pipeline that requires minimal manual effort [1]. While it is common to use this process for regression and classification problems, multistep-ahead time series prediction problems have not received the same attention.

Multistep-ahead time series prediction involves forecasting multiple future time steps based on historical data. Unlike single-step prediction, where only the next time step is forecasted, multistep-ahead prediction requires predicting a sequence of future values. This task is more challenging due to the accumulation of errors over multiple steps and the increasing uncertainty in longer-term predictions.

Time series prediction finds applications across diverse domains, particularly those characterized by time-ordered data streams. Economic time series [2], financial time series [3], agricultural time series [4] and recommender time series [5] are just a few of the numerous domains where predictive models play a pivotal role. Traditionally, time series prediction has relied on parametric approaches such as exponential smoothing (ES) [6], Holt–Winters [7] and the autoregressive integrated moving average (ARIMA) model [8]. Multistep-ahead time series prediction endeavors to forecast n steps ahead rather than a single step, thereby constituting a specific branch of time series problems.

The advent of machine learning techniques such as neural networks (NN), recurrent neural networks (RNN) and support vector regression (SVR) offered nonparametric solutions where limited assumptions can be made about the data prior to the development of the model [9]. Nonparametric approaches do not necessarily outperform parametric approaches across all types of data, as making limited a priori assumptions can lead to a large and complex problem space [10]. Within the nonparametric space, there are also no available guidelines on the most appropriate algorithm choice. Effectively, researchers are faced with the problem that no single method is guaranteed to outperform all others [11].

1.2 Problem statement

Meta-learning has been used to find an appropriate algorithm both within Auto-ML pipeline generation and as a stand-alone process. Within Auto-ML pipeline generation, significant contributions have been made by libraries such as Auto-Sklearn 2.0 [12], Auto-Keras and Auto-Weka [1]. These systems have primarily focused on creating efficient search mechanisms for hyper-parameter selection and feature engineering for both regression and classification problems. More recently, the Auto-Sklearn 2.0 library has been upgraded to predict multistep-ahead time series using multilayer perceptrons (MLPs) as the base architecture [13]. This does not, however, include methods such as RNN or long short-term memory networks (LSTM), which can outperform MLPs. Outside Auto-ML pipeline generation, model selection has typically been implemented using classifiers and relatively simple architectures. This eliminates the opportunity to include expert knowledge in the decision-making process. For example, if the difference between two models is negligible, an expert can choose the model that requires less computational power where computing resources are limited. While both approaches have their advantages, there is a gap in the breadth of architectures used in each, and the process for choosing an algorithm in stand-alone Meta-learning is relatively rigid.

1.3 Contribution

The study of Auto-ML pipelines and learners has predominantly focused on either hyper-parameter selection or the use of classifiers to select a relatively simple algorithm. In this research, a regression model is used to build a Meta-Learner that can suggest complex time series approaches to practitioners or future Auto-ML systems. Thus, our contribution can be articulated as follows:

  • A Meta-learning approach for time series multistep-ahead prediction problems is presented that uses a regression model as the performance modeling engine. The benefit of this approach is that practitioners can make nuanced decisions through a comparative analysis of the predicted performance of the individual models.

  • The current state-of-the-art Auto-ML or Meta-learning approaches require the generation of a significant number of meta-features to establish the appropriateness of a range of algorithms. In contrast, this research demonstrates that a reduced selection of meta-features can successfully find an appropriate multistep-ahead time series prediction model from a set of bespoke, complex time series algorithms.

  • Our evaluation uses 5,000 time series which have been engineered to be sufficiently diverse [14], thus ensuring a robust challenge for the Meta-Learner.

1.4 Paper structure

The rest of this paper is organized as follows: Sect. 2 provides a literature review of Meta-Learner research with a comparison table showing the commonality and differences in these approaches; Sect. 3 presents a four-step method to constructing and deploying the Meta-Learner; the evaluation and discussion of the experimental results are presented in Sect. 4; and finally, conclusions are presented in Sect. 5.

2 Literature review

In this section, a review of the major studies that have addressed model selection in the time series prediction domain for multistep-ahead problems is provided. In particular, the range of prediction approaches evaluated and the learners used to model the meta-features are identified. This is subsequently presented in a summary analysis of the most relevant research.

2.1 Background research

The research conducted in [15] was one of the earliest studies to suggest that the accuracy of prediction methods depends on the properties of a given time series. This was followed by [16], where a selection of characteristics from a time series was extracted and used in a pre-trained knowledge-based system to recommend an appropriate method. Subsequent studies, including the M-Competition, have confirmed this viewpoint, demonstrating that the performance of time series prediction methods is influenced by the specific characteristics of the time series.

In [17], the authors developed a rule induction-based approach for selecting prediction methods. They trained a decision tree (using the ID3 technique) on a set of time series features to recommend appropriate prediction methods. The decision tree induced a set of rules representing the conditions within the feature space under which specific prediction methods are recommended. However, the sample size used to build the decision tree was limited, raising concerns about the generalizability and reliability of the obtained rules.

In [18], 26 features were used to train a discriminant analysis (DA) Meta-learner to recommend an appropriate prediction model from among several statistical approaches. Despite offering evaluations from various perspectives, this work only analyzes some deterministic time series prediction methods and does not cover advanced models such as artificial neural networks (ANNs).

The research conducted in [19] was pioneering in using neural networks (NN) as a method selector. The input to the NN was a set of features and candidate predictions made by various prediction methods, and the output was the final forecast. This contrasted with previous approaches that recommended a single candidate model. Although the neural network-based selection approach allows for the development of more complex rules for prediction method selection, the overall prediction performance remains constrained by the limited capabilities of the candidate models used.

In [20], the authors suggested an alternative approach that incorporated expert knowledge in conjunction with time series features to recommend the appropriate prediction method. This work also presented a set of criteria for selecting the appropriate prediction method. A major challenge in this study is that the selection strategy depends on expert judgments, which can be inaccurate as time series become increasingly complex. [21] presented a rule-based expert system for model selection. This work proposed an automatic feature selection algorithm with a set of (mainly judgmental) rules for determining the appropriate prediction model. It is similarly challenged by growth in the complexity of time series due to its use of judgmental features.

An approach to prediction method selection was suggested in [22], where the model was chosen based on the goodness of its prediction performance (minimum error). Although this work presents a quantitative (error-based) selection strategy, it ignores the factors that lead to inaccuracies in individual prediction models. In other words, prediction error can depend on various factors and cannot solely represent prediction performance. [23] compared the traditional validation-based model selection approach with the information criterion-based model selection approach. The results demonstrated that the information criterion approach was preferable with regard to model selection. This work provides a modest improvement in model selection by including both validation and information criteria in the selection process. However, setting aside the benefits of the validation-based strategy, this work remains limited by the narrow range of time series features used.

The first explicit use of the term “Meta-learning” in the context of time series prediction was in [24]. This work suggested two approaches. The first used six time series features to determine the appropriate prediction method, while the second employed the NOEMON model, an intelligent assistant for classifier selection originating from the NOEMON project [25]. This work was later improved in [26] by the same authors, using the NOEMON ranking technique as the basis of the selection approach. Despite introducing a ranking rather than proposing a single best model, which allows more flexibility, this work shows limited potential as it lacks analysis of advanced prediction models such as ANNs.

In [27], a prediction method selection approach was presented which was based on a combination of analysis of variance and Duncan’s multiple range tests on time series data. However, it was only applied to ARIMA, regression and a decomposition-based method. In [28], a rule induction approach was presented for prediction method selection. In this work, the self-organizing map (SOM) and decision tree classification techniques were applied to a set of characteristics, including measures of chaos, self-similarity and traditional statistics (trend, seasonality and kurtosis), to extract a set of prediction method selection inference rules.

In [29], the authors built a large pool of meta-features and attempted to relate prediction model performance to these features using a number of approaches, including NN, DT, SVM, zoomed ranking of the best and zoomed ranking of the combination. Despite the flexibility provided by the zoomed ranking method, this work still lacks analysis of more complex time series and of RNNs in the candidate model set. The work by [30] suggests a prediction method selection approach that recommends the appropriate method based on out-of-sample rolling horizon weighted errors, an extension of the traditional selection criteria that use the minimum one-step out-of-sample error for performance evaluation.

An approach recommending the prediction method based on its previous performance on similar datasets, which required a database of historical records of the predictors’ performances, was presented in [31]. The similarity of time series datasets is measured on a set of time series characteristics. In this work, principal component analysis (PCA) was used to reduce the dimensionality of the data. Prediction method selection was also studied in [32] for chaotic time series. This approach uses Meta-learning and the SOM, where the SOM provides a topology-preserving mapping from the high-dimensional space to map units that preserves the relative distances between points. Points that are relatively close to each other in the input space are mapped to nearby map units in the SOM. The SOM can therefore serve as a cluster analysis tool for high-dimensional data [33].

In [34], the authors studied the predictive accuracy of different feature sets for an NN Meta-Learner which recommends the appropriate method from a set of four statistical forecasting models. This work incorporates a set of error-based features (landmarkers) together with several statistical tests to build time series meta-features. Despite providing greater capability for rule extraction via NN-based model selection, this work still presents low capacity, as the candidate prediction model set is limited to a number of exponential smoothing methods. In [35], the authors presented an approach referred to as the self-learning (method selection) approach that conducts cluster analysis to recommend the most appropriate prediction method. However, the limited cluster dimensions and the limited candidate prediction model set indicate that this approach will struggle with many complex time series examples.

In [36], the authors explored how judgments can be used to improve the selection of a forecasting model. They compared the performance of judgmental model selection against a standard algorithm based on information criteria. They also examined the efficacy of a judgmental model generation approach, in which experts were asked to decide on the existence of the structural components (trend and seasonality) of the time series instead of directly selecting a model from a candidate model pool. While this study provides valuable insights into the role of expert judgment in model selection, it lacks a comprehensive comparison with advanced Meta-learning techniques that can automate model selection and potentially outperform both judgmental and traditional algorithmic methods.

Meta-learning has been proposed as a means of choosing a machine learning model in Auto-ML solutions for both regression and classification problems [37]. In Auto-ML, a machine learning model is incorporated on top of a model selection architecture to recommend the appropriate model and hyper-parameters from a set of candidate machine learning models. It effectively uses meta-features extracted from the candidate dataset to train a learner to identify an appropriate algorithm from a range of pre-built candidate models. More recently, [37] have proposed a portfolio approach that removes meta-features from the model selection process. These approaches have been successful in the past, but were limited to a relatively narrow set of machine learning approaches [13] focused on deep learning architectures such as RNN. They have not considered alternatives such as LSTM networks or multiresolution forecast aggregation (MRFA). Additionally, this work focused on just 20 datasets.

2.2 Summary

This review suggests that an effective comparative analysis should focus on the following factors: prediction models, features (time series features), selection criteria, incorporation of hyper-parameter selection, the provision of experimental results and the type of Meta-Learner (the method selection approach). A summary of the literature is presented in Table 1. In this table, MAPS stands for multistep-ahead prediction, while under Selection criteria, Err, Var, RI and MCCV represent error, variance, rule induction and Monte Carlo cross-validation, respectively. Feature vector size is represented by S, M, L and VL, corresponding to small (less than 5 features), medium (5–10 features), large (10–20 features) and very large (more than 20 features), respectively. Additionally, Advanced features encompass frequency-domain features, the Hurst exponent and detrended fluctuation analysis, as elucidated in Sect. 3.1.

Table 1 Comparison between the existing Meta-learning approaches

In summary, most prior approaches use stochastic models in their analyses. Only [24, 28, 29, 32] and [35] incorporated machine learning models into their candidate prediction models, and of these, only [24, 28] and [32] implemented NN (and no other machine learning method) in their candidate models. The approaches presented in [29] and [35] were the only studies to use state-of-the-art techniques, such as RNN, as candidate prediction models. While RNN and LSTM have recently received considerable attention, special consideration is required as they are highly sensitive to the choice of their hyper-parameters. In [29] and [35], hyper-parameter selection was not considered. The recent study by [13] incorporated an Auto-ML approach and specifically focused on multistep-ahead problems, but it does not include LSTM, NN or any traditional approaches. It was also implemented on a relatively small dataset.

Table 2 identifies which of the most commonly used machine learning models for time series prediction were included in the Meta-learning studies as candidate models. The table shows that ARIMA has been the most popular candidate model, followed by NN. RNN was used only once, despite its recent popularity in time series analysis. SVR and LSTM have not been studied as candidate models in past Meta-learning studies. Although LSTM has recently received considerable attention in time series prediction applications, primarily due to its ability to handle long-term memory, it has not featured in any of these studies. Similarly, SVR has been widely used for prediction purposes due to its good generalization abilities for time series, but it does not feature either.

Table 2 Most common machine learning approaches used for time series prediction

3 Methodology

In this section, a four-step strategy is presented and illustrated in Fig. 1 to determine the best prediction model for a given time series dataset. At the outset, we have a time series dataset D and a set of machine learning models M = {\(m_1\),\(m_2\),...,\(m_n\)}. The goal is the construction of a Meta-Learner (\(\hbox {ML}\)) which approximates a function denoted by \(\mathcal {G} \) and described in Eq. 1, where \(\mathcal {G} \) calculates the predicted \(\hbox {error}\) and is a function of the model \(m_i\) and the set of time series features F.

$$\begin{aligned} \hbox {error} = \mathcal {G} (m_i,F) \end{aligned}$$
(1)
Fig. 1
figure 1

MSAP Meta-learner architecture

The approach, as shown in Fig. 1, is as follows:

  • Feature selection Given a time series dataset D, segmented into \(D_{80}\) and \(D_{20}\) (an 80/20 split), this process generates feature sets \(F_{80}\) and \(F_{20}\), respectively.

  • Baseline model evaluation This takes as input F, a feature set from Step 1, and M, the set of models, and tests each model to produce \(E_{m \times f}\), a matrix of normalized mean squared errors (nMSE) for each model \(m_i\) by feature set instance \(f_j\).

  • Meta-learner construction This step takes \(F_{80}\) and E and constructs the Meta-Learner \(\hbox {ML}\).

  • Model selection The final step takes feature set \(F_{20}\) and computes \(\hbox {EML}_{m \times f}\), a matrix of (nMSE) for each model \(m_i\) and instance set \(F_{20j}\), computed by the Meta-Learner \(\hbox {ML}\).

From Table 2, NN [38], RNN [39], LSTM [40], SVR [41] and ARIMA [42] emerged as the most commonly used prediction approaches and were thus chosen to constitute the prediction model set M. Additionally, either a one-step-ahead prediction (OSAP) or a multistep-ahead prediction (MSAP) strategy is adopted as the time series method combined with each of these approaches.

3.1 Feature selection

From the literature review outlined in Sect. 2, a set of 10 features was selected to describe the generated time series. These features were chosen to provide a basic representation of an effective feature set for time series prediction. The choice of features might be questioned, as feature selection is an open problem in the community and has recently received significant attention. For this research, we focus on these 10 features and leave further refinement to future work, since a valid analysis of feature selection requires a comprehensive study that is beyond the scope of this paper. Here, the entire dataset is separated into two parts, one for training (\(D_{80}\), with 80% of the samples) and one for testing (\(D_{20}\), with 20% of the samples). Two sets of 10 features are then extracted, \(F_{80}\) and \(F_{20}\), from the time series datasets \(D_{80}\) and \(D_{20}\), respectively. The functions providing each feature in \(F_{20}\) (and \(F_{80}\)) are listed below, and a sketch of extracting several of them follows the list.

  1. Shannon entropy [43] For a signal y with sample size N, sample entropy is the negative logarithm of the conditional probability that a sub-series of length M matches point-wise with the next point with tolerance (distance less than) r. Shannon entropy was used as a feature as it reflects the difficulty of encoding the inherent patterns in time series data and thus can help classify a time series based on the complexity level of the predictivity of the encoded patterns.

  2. Spectral entropy [44] Spectral entropy is calculated using Shannon entropy and quantifies the spectral complexity or randomness of the power spectrum of the time series over a long period of time. Spectral entropy was used as it can help classify a time series through the presence of dominant peaks. Different classes would occur as a function of the presence of anomalies, repeating peaks or occasional unconditional peaks.

  3. Singular value decomposition entropy [45] Singular value decomposition (SVD) entropy is an indicator of the dimensionality of the time series, i.e., the number of eigenvectors needed for an adequate description of the given time series. SVD entropy was used as a feature as it indicates the difficulty of encoding time series patterns in short vectors and thus is reflective of the dimensions of the majority of occurring patterns.

  4. Fisher information [46] Fisher information (FI) quantifies the amount of information that the data represents regarding unknown parameters and determines how much information can be obtained from a specific quantity of data. FI was used as it quantifies the level of information within the time series.

  5. Kurtosis [47] This function measures the number of outliers in the dataset with respect to a normal distribution. When kurtosis is high, the number of outliers is high. When kurtosis is low, outliers are low or zero. Kurtosis was used as it reflects how much the formation of the time series is affected by the presence of outliers.

  6. Skewness [47] This metric measures the degree of asymmetry of the distribution of the given time series. Skewness was used as it can evaluate time series based on their levels of asymmetry in their distribution.

  7. Gaussianity of the differences [48] Gaussianity of the differences (GoD) measures the normality of the distribution of the first lag difference of the time series. The Shapiro–Wilk test was implemented in the present method.

  8. The Hurst exponent [49] The Hurst exponent H attempts to explain long-range dependence as a property of stochastic self-similar processes. It can also help to classify a time series based on the presence/quality of long-term memory within the data.

  9. Detrended fluctuation analysis [50] Detrended fluctuation analysis (DFA) is a method for evaluating the statistical self-similarity of a signal.

  10. Stationarity [51] Stationarity can be defined as the state in which the statistical properties of the given time series such as mean, variance and auto-correlation remain steady over time. Stationarity was calculated using the augmented Dickey–Fuller test.

3.2 Baseline model evaluation

In this step, each candidate model is applied to the training time series features F and the error matrix \(E_{m \times f}\) is computed. Both standardized error and hyper-parameter selection are computed to train and construct the Meta-Learner, as outlined below.

3.2.1 Standardized error

An appropriate error metric is required to represent the prediction accuracy of candidate models and should be independent of the scale of the time series. The mean squared error (MSE) is the most popular metric for prediction accuracy in regression problems. However, MSE is not independent of the scale or the magnitude of the time series. The nMSE was therefore incorporated to circumvent the problem of scale dependency in MSE. The nMSE provides independence of scale and is calculated using Eq. 2. The present Meta-Learner is trained using \(\hbox {log}(n\hbox {MSE})\) to maximize its performance.

$$\begin{aligned} n\hbox {MSE} = \sum _{i=1}^{N} \left( \frac{y_i-t_i}{\hbox {Max}(y)-\hbox {Min}(y)}\right) ^2 \end{aligned}$$
(2)
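
A direct transcription of Eq. 2 follows; reading \(y_i\) as the actual values and \(t_i\) as the predictions is an assumption, as the equation does not define the symbols explicitly.

```python
import numpy as np

def nmse(y, t):
    """nMSE per Eq. 2: squared errors scaled by the range of the series.
    y is assumed to hold the actual values and t the predictions; the
    sum (rather than mean) over N matches Eq. 2 as printed."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    scale = y.max() - y.min()
    return float(np.sum(((y - t) / scale) ** 2))

# The Meta-Learner is trained on the log of this value, e.g., np.log(nmse(y, t)).
```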

3.2.2 Hyper-parameter selection

To build the Meta-Learner, a large number of time series should be predicted using multiple candidate prediction models. Therefore, implementing hyper-parameter optimization for each candidate model on individual time series is computationally expensive and often impractical. The present approach is as follows: for each hyper-parameter \(h_i\), \(1 \le i \le I\), a set of choices \(c_{i1}\), \(c_{i2}\), ..., \(c_{in_i}\) is considered. Here, I denotes the total number of hyper-parameters that need to be tuned, and \(n_i\) is the number of possible values for hyper-parameter \(h_i\). The model’s error (in terms of nMSE) is recorded for all possible combinations of the hyper-parameters. The total number of combinations is given by \(\prod _{i=1}^{I} n_i\). For example, if there are three hyper-parameters with choices \(c_{11}\), \(c_{12}\); \(c_{21}\), \(c_{22}\), \(c_{23}\); and \(c_{31}\), \(c_{32}\), the total number of combinations would be \(2 \times 3 \times 2 = 12\).

To manage the computational complexity, the average nMSE across different time series is computed for each combination of hyper-parameters. This approach is chosen because nMSE is a common evaluation metric that allows for comparison across different time series. Bayesian optimization was not used to fine-tune the hyper-parameters in this study. This method, although powerful, assumes that the objective function follows a Gaussian process. This assumption might not hold for our problem, which involves diverse time series data that may not exhibit Gaussian properties. Therefore, we opted for an exhaustive search over a pre-defined grid of hyper-parameter values, ensuring a more straightforward and interpretable optimization process.

To validate each combination of the hyper-parameters, the error (nMSE) is measured through bootstrapping. In the bootstrapping process, k=5 validation sets are randomly selected from the data and the average prediction error is reported. Depending on the choice of hyper-parameters, machine learning models can produce substantially different output for each experiment (given the same input). This instability is due to random initial settings (for instance, random weights in NN), which are a component of learning algorithms. To address this, each experiment is repeated d=10 times and the average error (nMSE) is reported.
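
The procedure can be sketched as follows, with `evaluate` standing in (as an assumed callable, not part of this work) for a single fit-and-validate run returning nMSE:

```python
import itertools
import numpy as np

def exhaustive_grid_search(grids, evaluate, k=5, d=10):
    """Sketch of the exhaustive search described above. `grids` maps
    each hyper-parameter name to its list of choices; `evaluate(params)`
    is a user-supplied callable that fits a model on a random bootstrap
    validation split and returns its nMSE. Each of the prod(n_i)
    combinations is scored by averaging over k bootstrap sets and d
    repetitions to damp the effect of random initialization."""
    best_params, best_score = None, np.inf
    for combo in itertools.product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        score = float(np.mean([evaluate(params) for _ in range(k * d)]))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```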

The ranges of the hyper-parameters for each model are outlined as follows (and collected into a grid structure in the sketch after the list):

  • For all models, the number of lags (the size of the sliding window) used to predict the given time series ranged from 4 to 12 backward steps, reflecting the window sizes typically found in the literature reviewed in Sect. 2.

  • For each series, the order of the ARIMA(p,d,q) model was obtained using the maximum likelihood method with a Kalman filter, as motivated in [52].

  • The set of neurons in the hidden layer of the NN model was {4, 6, 8, 10, 12}.

  • The number of neurons in the hidden layer of the RNN model was {4, 6, 8, 10} and the number of recurrent connections was {1, 2, 3, 4, 5, 6, 7}.

  • The set of C-values for the SVR model was {0.1, 0.5, 1, 10, 100, 1000}.

  • The number of cells in the LSTM model was {6, 8, 10, 12}.
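
For reference, these ranges can be collected into the grid structure consumed by the search sketch above; the dictionary below is an assumed encoding, not code from this study.

```python
# Hedged encoding of the ranges listed above; the key names are illustrative.
hyper_grids = {
    "all_models": {"lags": list(range(4, 13))},  # sliding window: 4..12
    "nn":   {"hidden_neurons": [4, 6, 8, 10, 12]},
    "rnn":  {"hidden_neurons": [4, 6, 8, 10],
             "recurrent_connections": [1, 2, 3, 4, 5, 6, 7]},
    "svr":  {"C": [0.1, 0.5, 1, 10, 100, 1000]},
    "lstm": {"cells": [6, 8, 10, 12]},
    # ARIMA(p, d, q) orders are fitted per series by maximum likelihood
    # with a Kalman filter rather than grid-searched.
}
```

A model-specific search would then merge the shared lag grid with the model’s own grid, e.g., `exhaustive_grid_search({**hyper_grids["all_models"], **hyper_grids["rnn"]}, evaluate)`.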

3.3 Model selection

In the final step, the Meta-Learner is applied to a set of features extracted from a given time series and recommends the model showing the least predicted error. For this purpose, consider a time series TS with features \(F'\) as the input to the Meta-Learner. The Meta-Learner recommends the \( i^{\textrm{th}} \) model based on Eq. 3:

$$\begin{aligned} i = \hbox {argmin}(\hbox {ML}(M_{1..n},F')) \end{aligned}$$
(3)

In Eq. 3, ML is the Meta-Learner, \(F'\) is the feature set for the given time series, and \( M_{1..n} \) is a set of candidate models. Thus, \( \hbox {argmin} \) returns the index for the model that has shown the least error in predicting the given time series.
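
Eq. 3 translates directly into code; treating the Meta-Learner as a callable returning a predicted log(nMSE) per candidate model is an assumed interface for illustration:

```python
import numpy as np

def select_model(ml, models, features):
    """Eq. 3 as code: `ml(model, features)` is assumed to return the
    Meta-Learner's predicted log(nMSE) for one candidate model; argmin
    returns the index of the model with the smallest predicted error."""
    predicted_errors = [ml(m, features) for m in models]
    return int(np.argmin(predicted_errors))
```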

3.4 Meta-learner construction

The aim of the Meta-Learner in the present research is to determine if a regression approach would successfully estimate the nMSE of an algorithm given the features of a proposed time series. In theory, any machine learning regression model that approximates a variable with a real value (nMSE of a candidate model) could be used to achieve this task. However, nonparametric approaches are preferable as the level of assumptions is minimal. In the present study, three of the most popular nonparametric approaches are proposed as the base Meta-Learner: SVR (ML\(_{\textrm{SVR}}\)), NN (ML\(_{\textrm{NN}}\)) and random forest regression (ML\(_{\textrm{RF}}\)), and an evaluation of each is used to determine the most appropriate.
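
A minimal sketch of this comparison using scikit-learn follows, assuming X encodes a candidate model identifier alongside the 10 meta-features and y holds the corresponding log(nMSE) values; the hyper-parameters shown are placeholders rather than the tuned settings used here.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score

# The three candidate base Meta-Learners; settings are illustrative.
candidates = {
    "ML_RF":  RandomForestRegressor(n_estimators=200, random_state=0),
    "ML_NN":  MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "ML_SVR": SVR(C=10.0),
}

def score_meta_learners(X_train, y_train, X_test, y_test):
    """Fit each base regressor on (meta-features, log nMSE) pairs and
    report R^2; the highest score indicates the preferred Meta-Learner."""
    return {
        name: r2_score(y_test, reg.fit(X_train, y_train).predict(X_test))
        for name, reg in candidates.items()
    }
```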

4 Evaluation

In this section, the evaluation strategy is presented. This begins with a discussion on the experimental setup and the need for a sufficiently diverse set of time series to robustly test predictive models. This is followed by validation of a range of machine learning models for time series prediction. This step helps to rank time series models as a prerequisite to assessing Meta-Learner performance. The following steps articulate the evaluation methodology: predictive models are ranked based on performance. The ability of each Meta-Learner to select the highest performing models is then evaluated. Finally, the best performing Meta-Learner is selected (using a new real dataset), and a final evaluation to analyze its performance on unseen data is performed.

4.1 Experimental setup

There are a number of existing multistep-ahead prediction strategies, such as recursive (REC), direct (DIR), direct-recursive (DiREC) [53, 54] and multistep recursive filter approach (MRFA) [55]. In the present research, a REC strategy is implemented, as it is the most popular multistep-ahead prediction strategy in the literature [56]. REC is known for its advantage of maintaining serial relationships between subsequent observations by using the prediction from the previous step as an input for the next step [54].
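
The REC mechanism can be sketched as follows; the one-step model interface (a scikit-learn-style `predict()`) is an assumption for illustration:

```python
import numpy as np

def recursive_forecast(model, history, lags, horizon):
    """Sketch of the REC strategy: a fitted one-step model is applied
    repeatedly, feeding each prediction back into the sliding window as
    an input for the next step."""
    window = list(np.asarray(history, dtype=float)[-lags:])
    predictions = []
    for _ in range(horizon):
        y_hat = float(model.predict(np.array(window).reshape(1, -1))[0])
        predictions.append(y_hat)
        window = window[1:] + [y_hat]  # slide the window forward
    return predictions
```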

The DIR strategy involves predicting each step ahead separately using models trained specifically for each step, which can reduce error propagation but may ignore inter-step dependencies. DiREC combines REC and DIR by recursively applying direct strategies for intermediate steps, aiming to balance error accumulation and step-specific training.

The MRFA is a relatively recent approach to multistep-ahead prediction. It leverages recursive filters to generate predictions for multiple steps ahead, effectively balancing the trade-off between complexity and prediction accuracy [55]. MRFA applies a recursive filter mechanism to smooth the predictions, reducing the impact of potential errors in earlier predictions on subsequent steps.

From the models selected in Sect. 3, all five models (NN, RNN, LSTM, SVR and ARIMA) were implemented with the REC strategy due to its robustness and popularity. However, with the MRFA strategy, only SVR, RNN and NN were implemented and tested, as these models are particularly suited to handling the computational load and complexities involved in generating multiple sequential predictions. While the work focuses on multistep-ahead prediction, the implementation was strategically limited to these models to balance computational efficiency and accuracy.

In constructing the Meta-Learner, the training dataset needs to consider two key factors. Firstly, it should encompass a diverse range of time series data to ensure that the Meta-Learner can generalize effectively and recommend suitable prediction models across various types and features of time series. Secondly, the dataset selection must be tailored to our prediction tasks, as time series analysis encompasses different problem categories, including classification, segmentation, motif discovery and prediction. Each category presents specific requirements, thus necessitating careful consideration during dataset curation.

In [14], a library of 50,000 synthetic time series datasets was created to help researchers conduct more robust evaluations of new time series algorithms. This work explains that the dataset was designed to offer a high level of diversity and demonstrates methods for testing and validating diversity across a large time series dataset. While a full discussion of the dataset construction is outside the scope of this paper, the points relevant to this validation are highlighted. The dataset was partitioned into 378 categories using five features, categorized as follows (a sketch encoding these thresholds appears after the list):

  • Spectral entropy was categorized according to three categories, where A:\( X<1 \), B:\( 1\le X<9 \) and C:\( 9 \le X \).

  • Kurtosis was categorized into three categories, where A:\( X<-0.3 \), B:\( -0.3\le X< 0.3 \) and C:\( 0.3 \le X \), based on [57].

  • Skewness was categorized into three categories, where A:\( X<-0.3 \), B:\( -0.3\le X< 0.3 \) and C:\( 0.3 \le X \), based on [57].

  • GoD was categorized into two categories, where A:\( X<0.02 \) and B:\( 0.02 \le X \).

  • DFA was categorized into seven categories, where A:\( X<0.45 \), B:\( 0.45\le X < 0.55 \), C:\( 0.55 \le X < 0.95 \), D:\( 0.95 \le X < 1.05 \), E:\( 1.05\le X<1.45 \), F:\( 1.45\le X < 1.55 \) and G:\( 1.55 \le X \).
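
These thresholds can be encoded as a small categorization function; the concatenated-letter label format below is an assumption for illustration:

```python
def categorize(spectral_entropy, kurtosis, skewness, god, dfa):
    """Maps the five features to one of 3*3*3*2*7 = 378 category labels
    using the thresholds listed above."""
    def bucket(x, cuts, labels):
        # Return the label of the first interval whose upper cut exceeds x.
        for cut, label in zip(cuts, labels):
            if x < cut:
                return label
        return labels[-1]
    return "".join([
        bucket(spectral_entropy, [1, 9], "ABC"),
        bucket(kurtosis, [-0.3, 0.3], "ABC"),
        bucket(skewness, [-0.3, 0.3], "ABC"),
        bucket(god, [0.02], "AB"),
        bucket(dfa, [0.45, 0.55, 0.95, 1.05, 1.45, 1.55], "ABCDEFG"),
    ])
```

For example, `categorize(0.5, 0.0, 0.5, 0.01, 1.0)` yields "ABCAD", one of the \(3 \times 3 \times 3 \times 2 \times 7 = 378\) possible labels.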

This categorization enables the division of the feature space into multiple subspaces, which can subsequently aid in identifying similar groups within time series datasets. The application of the 8 candidate models (5 REC and 3 MRFA) was executed across 5,819 randomly selected datasets. This resulted in a total of 46,552 model implementations for each of the 20 prediction horizon steps. The nMSE, expected to range between 0 and 1, was calculated for each implementation. Our evaluation dataset is accessible on Zenodo [58].

4.2 Evaluating the machine learning models

As stated previously, each multistep-ahead prediction strategy was implemented across the complete set of datasets. Figure 2 outlines the average log nMSE adjusted by dataset for each multistep-ahead prediction from 1 to 20 steps. As expected, the error for each approach increases as the prediction horizon lengthens. However, in terms of models, the RNN and SVR strategies (i.e., RNN-OSAP, RNN-MRFA, SVR-OSAP and SVR-MRFA) deliver better results than other approaches (i.e., ARIMA-OSAP, LSTM-OSAP, NN-OSAP and NN-MRFA). The average log nMSE of the RNN and SVR strategies is closer to 0 as the prediction horizon increases, suggesting that these strategies’ predictions are close to the actual values, resulting in lower error.

Fig. 2
figure 2

Mean adjusted Log nMSE by prediction horizon

Fig. 3
figure 3

Frequency of algorithms required for the complete prediction horizon

Fig. 4
figure 4

Comparison between regression models for implementing the Meta-Learner over the entire prediction horizon

Unlike many time series research studies, an analysis was performed for each of the 20 time points in the prediction horizon, where the single best performing model was identified for every dataset. This drill-down analysis provided additional insights into the diversity of the challenge and highlights how the different models perform across the horizon. It is interesting to note that the best performing strategy does not always work for all time horizons. In fact, for \(77\%\) of the datasets, two or more best performing methods were required to cover the prediction horizon, as shown in Fig. 3.

4.3 Evaluation of Meta-learner regression models

The next step in our evaluation was to analyze three different Meta-Learner regression approaches to determine which performs best in identifying the most appropriate model given a particular dataset. For this, NN, SVR and random forest models were selected as the base regression Meta-Learner, together with the meta-features outlined in Sect. 3, using an 80/20 train/test strategy. Configurations were determined using a grid search approach.

For the three techniques, a deterioration in performance was identified as the number of prediction steps moves closer to the end of the prediction horizon. Figure 4 compares the performances of ML\(_{\textrm{SVR}}\), ML\(_{\textrm{NN}}\) and ML\(_{\textrm{RF}}\) for prediction horizons spanning 20 steps, using \(R^2\) as the evaluation metric, and shows the ensemble approach to be the best Meta-Learner for this particular problem. \(R^2\) (the coefficient of determination) is the square of the Pearson correlation coefficient [59]. Despite strong evidence for an ensemble method as the base learner, this requires further examination.

4.4 Random forest Meta-learner evaluation

As each part of the validation was performed using synthetic time series, it is a useful exercise to take the best performing Meta-Learner and perform a further evaluation step, this time using non-synthetic data. The best performing ML\(_{\textrm{RF}}\) Meta-Learner was deployed using a set of 20 non-synthetic time series. Based on the categorization strategy described in Sect. 4.1, these 20 series cover 13 categories (out of 378) in the feature space.

The 20 time series were divided into two categories: (1) covered, time series whose feature space is included in the training data; and (2) non-covered, time series whose feature space is not included in the training data. There were 9 time series in the covered category and 11 in the non-covered category. The evaluation used the same method with the 8 candidate models, resulting in 20\(\times \)8=160 test samples for the Meta-Learner.

The distributions of the Meta-Learner’s errors, i.e., \(\hbox {Actual}\) \(\hbox {log}(n\hbox {MSE})\) - \(\hbox {Predicted}\) \(\hbox {log}(n\hbox {MSE})\), for covered and non-covered series are compared in Fig. 5. Here, 1 represents covered and 0 represents non-covered. The results show that the median error for the covered series is closer to zero than that for the non-covered series, indicating that the Meta-Learner performs better on the covered dataset. However, the differences in the means and variances between the covered and non-covered data were not statistically significant, likely due to the small sample size.

Fig. 5
figure 5

Centered errors

Fig. 6
figure 6

One-way analysis of rank difference

A subsequent analysis was therefore implemented on the ranking of the models from most to least accurate. The predicted ranks are obtained by sorting the candidate models according to the \(\hbox {Predicted}\) \(\hbox {log}(n\hbox {MSE})\) results. The performance of the Meta-Learner, in terms of ranking accuracy, is evaluated by comparing the predicted ranks with the ranks induced by the \(\hbox {Actual}\) \(\hbox {log}(n\hbox {MSE})\). A variance analysis of the signed differences between the actual and predicted ranks was performed, comparing the results for the covered and non-covered series, as shown in Fig. 6.

As shown in Fig. 6, the variance of rank differences (i.e., actual rank - predicted rank) is significantly smaller for the covered category, and thus, the Meta-Learner presents a better ranking of the candidate models for the series whose feature space is covered in the training data. This supports the requirement for confirming that the meta-features of the input dataset are covered in the training dataset.

The Wilcoxon test indicated that the predicted ranks were significantly more accurate for the covered category than for the non-covered category, with \(Z= -2.383\) \((p= 0.017)\). The final evaluation investigates how accurately the Meta-Learner identifies the top ranked models. The results on the 20 series show that the Meta-Learner correctly identified the top two models for 7 out of 9 (\(78\%\)) series from the covered category, as opposed to 5 out of 11 (\(45\%\)) series from the non-covered category.
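
The rank analysis can be sketched as follows; note that scipy’s paired `wilcoxon` does not apply to the unequal-sized groups here (9 vs. 11 series), so the unpaired `mannwhitneyu` rank test is used as an assumed stand-in for the reported Wilcoxon comparison:

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

def rank_differences(actual_lognmse, predicted_lognmse):
    """Signed differences (actual rank - predicted rank) across the
    candidate models for one series, as analyzed above."""
    return rankdata(actual_lognmse) - rankdata(predicted_lognmse)

def compare_rank_accuracy(covered_scores, noncovered_scores):
    """Unpaired rank test between the two groups of per-series scores
    (e.g., mean absolute rank difference per series)."""
    return mannwhitneyu(covered_scores, noncovered_scores)
```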

5 Conclusions

Meta-learning has been employed to select appropriate predictive models from a set of candidate models in many data analysis applications, including time series. Typically, these approaches have used either a classification approach or a regression approach within Auto-ML. The classification approach is limited as it only reflects the up or down movement in the time series. The Auto-ML approach uses regression and optimizes the architecture using Bayesian optimization; however, it relies on repeated estimation of models and in practice can be slow to deliver results.

In the approach outlined in this research, an ensemble regression method was incorporated that estimates the log-based nMSE. This was used to construct a matrix of model error scores for every time series and can thus output either a single model for all time series or the best performing model for each individual time series. This level of granularity gives researchers the ability to choose any combination of high performing models and avoid a potentially time-expensive analysis. Furthermore, the synthetic data used for the evaluation was designed for diversity to ensure a robust evaluation.

When assessing the performance of the 8 candidate models, our results revealed that the optimal approach for a specific dataset does not consistently involve a machine learning model. Surprisingly, the straightforward ARIMA-OSAP strategy emerged as the top performer on 35% of the datasets, indicating its potential utility as a baseline technique. Additionally, our validation indicated that over 77% of the datasets exhibited high performance with two or more strategies across the 20-step prediction horizon. While this observation might seem intuitive, it underscores the importance of considering the prediction horizon as a meta-feature in time series Meta-learning studies.

In evaluating our Meta-Learner, we observed strong overall performance, albeit with a decline towards the end of the prediction horizon, as expected. Among the three Meta-Learner models (RF, NN and SVR), the random forest method notably outperformed the other regression models, achieving an impressive \(R^2=0.958\). Notably, when applied to non-synthetic time series, our top-performing Meta-Learner demonstrated remarkable performance, particularly for time series with feature spaces covered by the training data. This finding suggests that further diversification of synthetic datasets could enhance the development of future Meta-learners.