Introduction

In general, variables can be absolute or relational [1]. A person’s age, education, gender, and level of environmental awareness are absolute variables. If, however, a variable is defined by the relationship between entities, it is called a relational variable. Examples of relational variables are the difference in status between persons x and y, the intensity of friendships, the power of one person over another, and the existence or absence of trade relations between countries. Relational variables are often dynamic, i.e., they evolve over time.

Contemporary “big” data sets frequently exhibit relational variables [2, 3]. In retail, for instance, data sets are more granular than traditional data, often indexing individual products, outlets, or even users, rather than aggregating them at the group level [4, 5]. Consequently, there are different types of dependencies between the variables of interest (e.g., dependencies across products, dependencies among stores). The relational character of the data is, however, often neglected, notably in prediction tasks. Instead, univariate extrapolation approaches are applied; they cannot capture inter-series dependencies. Moreover, existing multivariate forecasting methods (e.g., vector autoregressions) are restricted to low-dimensional settings and are, hence, not suitable for practical use in large-scale forecasting problems [6, 7].

Tensor extrapolation aims to forecast relational time series data using multi-linear algebra. It proceeds as follows: Multi-way data are arranged in the form of multi-dimensional arrays, i.e., tensors. Tensor decompositions are then used to identify periodic patterns in the data. Subsequently, these patterns serve as input for time series methods. Tensor extrapolation originated in the work of Dunlavy et al. [8] and Spiegel et al. [9], but was limited to preselected time series approaches and binary data. Only recently has Schosser [10, 11] laid the foundations for applications in large-scale forecasting problems.

So far, tensor extrapolation has been restricted to complete data sets. However, values are often missing from time series data. Instances of corruption, inattention, or sensor malfunctions all require forecasting processes to be equipped to handle missing elements [8]. The paper at hand therefore extends the tensor extrapolation approach adapted by Schosser [11] to situations with missing entries.

To evaluate our approach, we generate synthetic data. Except for the scaling (and, of course, the missing values), the resulting data correspond to those used by Schosser [11]. First, the component matrices are generated. Thereby, we are not subject to specific limitations. In particular, the CP model does not place orthogonality constraints on the component matrices [8, 29]. The matrices \({\mathbf {A}}\) and \({\mathbf {B}}\) of size \((I \times R)\) and \((J \times R)\), respectively, are “entity participation” matrices. In other words, column \({\mathbf {a}}_r\) (resp. \({\mathbf {b}}_r\)) is the vector of participation levels of all the entities in component r. User participation is captured in matrix \({\mathbf {A}}\): The users are assumed to react in different ways to a specific product-time combination. The extent of this reaction is measured according to the density function of a Gaussian distribution (cf. Fig. 1a). Product participation is captured in matrix \({\mathbf {B}}\): There are groups of products that respond similarly to the same time and user effect combination (cf. Fig. 1b). The columns of matrix \({\mathbf {C}}\) of size \((T \times R)\) record different periodic patterns. We create a sinusoidal pattern, a trend, a structural break, and another break (cf. Fig. 1c). Combining the matrices \({\mathbf {A}}\), \({\mathbf {B}}\), and \({\mathbf {C}}\) according to the CP model yields a noise-free version of the tensor. Finally, we add Gaussian noise to every entry; the standard deviation of the error term is assumed to equal half of the mean of the noise-free tensor entries.
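To make the data-generating process concrete, the following is a minimal sketch in Python. The dimensions (I, J, T, R) = (160, 120, 100, 4) and the noise level follow the text; the exact shapes of the participation vectors and temporal patterns (grid positions, break points, seasonal period) are illustrative assumptions, not the original specification.

```python
import numpy as np
import tensorly as tl

rng = np.random.default_rng(0)
I, J, T, R = 160, 120, 100, 4                     # users x products x time, R components

# A: user participation, shaped like a Gaussian density (illustrative parameters)
grid = np.linspace(-3, 3, I)
A = np.column_stack([np.exp(-0.5 * (grid - mu) ** 2) for mu in np.linspace(-2, 2, R)])

# B: groups of products that respond similarly to the same component
B = np.zeros((J, R))
for r, block in enumerate(np.array_split(np.arange(J), R)):
    B[block, r] = 1.0

# C: a sinusoidal pattern, a trend, and two structural breaks (illustrative shapes)
t = np.arange(T)
C = np.column_stack([
    np.sin(2 * np.pi * t / 12),                   # seasonality
    t / T,                                        # trend
    (t >= 40).astype(float),                      # structural break
    0.5 * (t >= 70).astype(float),                # another break
])

# Combine the component matrices according to the CP model, then add Gaussian noise
# whose standard deviation equals half the mean of the noise-free entries.
X_clean = tl.cp_to_tensor((np.ones(R), [A, B, C]))
X = X_clean + rng.normal(scale=0.5 * X_clean.mean(), size=X_clean.shape)
```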

Fig. 1 Generation of synthetic data

Table 1 Forecasting accuracy based on synthetic data in terms of MAPE and sMAPE

To investigate the effect of missing entries, we eliminate fractions of the data of different sizes. In the words of Rubin [40], the elements are missing completely at random: there is no relationship between whether an entry is missing and any other entry in the data set, missing or observed. For instance, where the desired level of missingness equals 20%, each element has a probability of 0.2 of being missing. Our implementation uses consecutive draws of a Bernoulli random variable with probability of success equal to the intended level of missingness [41]. Due to the large number of entries, the realized amount of missing values differs only slightly from the level intended. As a consequence of our procedure, the missing elements are randomly scattered across the array without any specific pattern. Please note that we do not investigate systematically missing entries (the subject-based literature uses the somewhat confusing terms missing at random and missing not at random; [40]). The CP factorization proposed by Tomasi and Bro [14] can cope with systematic missingness; for preparatory completion techniques, this holds only to a limited extent or not at all. A meaningful comparison is, therefore, not possible.
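A sketch of the masking step, continuing the synthetic-data example above. Each entry is removed independently via a Bernoulli draw with success probability equal to the intended level of missingness; the boolean mask (True = observed) is also what the factorization later consumes.

```python
def mcar_mask(shape, p_missing, rng):
    """Bernoulli draw per entry: True marks an observed value,
    False a value that is missing completely at random."""
    return rng.random(shape) >= p_missing

mask = mcar_mask(X.shape, 0.20, rng)                     # e.g., 20% intended missingness
X_obs = np.where(mask, X, np.nan)                        # missing entries flagged as NaN
print(f"realized missingness: {1 - mask.mean():.4f}")    # close to 0.20
```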

For each fraction of missing values, we split the data into an estimation sample and a hold-out sample. The latter consists of the 20 most recent observations. We thus obtain \(\underline{{\mathbf {X}}}^{est}\) of size \((160 \times 120 \times 80)\) and \(\underline{{\mathbf {X}}}^{hold}\) of size \((160 \times 120 \times 20)\), respectively. The methods under consideration are fitted, or trained, on the estimation sample. Forecasts are produced for the whole of the hold-out sample and arranged as tensor \(\underline{{\hat{\mathbf {X}}}}\) of size \((160 \times 120 \times 20)\). Finally, the forecasts are compared to the actual withheld observations.
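Continuing the sketch above, the split is a simple slice along the time mode; the hold-out part is kept complete because it is only used for evaluation.

```python
T_hold = 20
X_est, mask_est = X_obs[:, :, :-T_hold], mask[:, :, :-T_hold]   # (160, 120, 80)
X_hold = X[:, :, -T_hold:]                                      # (160, 120, 20), withheld actuals
```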

Note that the included time series are differently scaled. This has two consequences. First, since CP minimizes squared error, differences in scale may lead to distortions. Therefore, data preprocessing is necessary [11]. We choose a simple centering across the time mode [31]: the observations of each time series are averaged over its (available) elements, i.e., across the mode corresponding to matrix \({\mathbf {C}}\), and the resulting average is then subtracted from all observations that partake in it. Formally, the preprocessing step (please compare Line 1 in Algorithm 1) implies

$$\begin{aligned} x_{ijt,trans} = x_{ijt} - {\bar{x}}_{ij\cdot }, \end{aligned}$$

where the subscript dot is used to indicate the mean across \(t \in 1, \ldots , T\). During back-transformation (please compare Line 9 in Algorithm 1), the averages previously deducted are added back. This involves

$$\begin{aligned} {\hat{x}}_{ijt} = {\hat{x}}_{ijt,trans} + {\bar{x}}_{ij\cdot } \end{aligned}$$

for \(t \in T+1, \ldots , T+L\). Second, only scale-free performance measures can be employed. We use the Mean Absolute Percentage Error (MAPE) and the Symmetric Mean Absolute Percentage Error (sMAPE) [11, 42].
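In code, the centering, the back-transformation, and the two accuracy measures might look as follows. This is a sketch continuing the example above; the sMAPE variant with the factor 2 in the numerator is one common definition.

```python
# Line 1 of Algorithm 1: center each series over its available time points
x_bar = np.nanmean(X_est, axis=2, keepdims=True)
X_trans = X_est - x_bar                          # NaNs stay NaN at missing positions

# ... decomposition and extrapolation yield X_hat_trans of shape (160, 120, 20) ...

# Line 9 of Algorithm 1: add the previously deducted averages back
# X_hat = X_hat_trans + x_bar

def mape(actual, forecast):
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def smape(actual, forecast):
    return 100.0 * np.mean(2.0 * np.abs(forecast - actual)
                           / (np.abs(actual) + np.abs(forecast)))
```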

Following Schosser [11], our application of tensor extrapolation connects a state-of-the-art tensor decomposition with an automatic forecasting procedure based on a general class of state-space time series models. For this purpose, we use the programming language Python. The function parafac (library TensorLy; [43]) supports the factorization proposed by Tomasi and Bro [14] and, hence, allows for missing values. When parafac is called, the number of components R must be specified. The function ETSModel (library statsmodels; [44]) offers a Python-based implementation of the automatic exponential smoothing algorithm developed by Hyndman et al. [35] and Hyndman and Khandakar [24]. This algorithm provides a rich set of possible models and has been shown to perform well in comparisons with other forecasting methods [23, 26, 45, 46]. Further, in relation to more complex forecasting techniques, its requirements in terms of data availability and computational resources are fairly low [7, 11]. As a baseline, we use univariate, i.e., per-series, extrapolation by means of ETSModel. Here, the missing values must be filled in: we propagate the last valid observation forward, and any remaining gaps are filled by backward propagation.
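The following sketch ties the pieces together, continuing the example above. The mask argument of parafac handles the missing entries; the small loop over candidate ETS specifications is a simplified stand-in for the automatic model selection of Hyndman and Khandakar [24], which we do not reproduce in full here. Names outside the TensorLy, statsmodels, pandas, and NumPy APIs are our own.

```python
import numpy as np
import pandas as pd
from tensorly.decomposition import parafac
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

R, L = 4, 20

def ets_forecast(series, horizon):
    """Simplified automatic ETS: fit a few candidate specifications and keep the
    one with the lowest AIC (the full algorithm searches a larger model space)."""
    best = None
    for trend in (None, "add"):
        fit = ETSModel(np.asarray(series, dtype=float), error="add", trend=trend).fit(disp=False)
        if best is None or fit.aic < best.aic:
            best = fit
    return np.asarray(best.forecast(horizon))

# 1) CP decomposition of the centered estimation tensor; mask marks observed entries
weights, (A_hat, B_hat, C_hat) = parafac(
    np.where(mask_est, X_trans, 0.0), rank=R, mask=mask_est.astype(float)
)

# 2) Extrapolate each temporal component, rebuild the forecast tensor, undo the centering
C_future = np.column_stack([ets_forecast(C_hat[:, r], L) for r in range(R)])
X_hat = np.einsum("r,ir,jr,tr->ijt", weights, A_hat, B_hat, C_future) + x_bar

# Baseline: per-series ETS after forward/backward filling of missing values
def baseline_forecast(series, horizon):
    filled = pd.Series(series).ffill().bfill().to_numpy()
    return np.asarray(ETSModel(filled, error="add").fit(disp=False).forecast(horizon))
```

Forecast accuracy is then obtained by comparing the forecast tensor with the withheld observations, e.g., mape(X_hold, X_hat).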

Results and discussion

Table 1 displays our results. With regard to small amounts of missing data, the situation is as follows: As measured by MAPE, tensor extrapolation outperforms the baseline. If, for instance, 5% of the data are missing, the prediction error is reduced by up to 21.40%. On the basis of sMAPE, no clear ranking can be determined. As the level of missing data increases, tensor extrapolation becomes even more attractive. In the case of MAPE, the gap to univariate extrapolation widens. Where half of the data are missing, the prediction error is reduced by up to 26.28%. In terms of sMAPE, our proposed method now also dominates. Our results are largely unaffected by the hyperparameter choice, i.e., the number of components R. One way to circumvent the problem of committing to a specific number of components is to use an (equally weighted) combination or ensemble of forecasts, as sketched below. Here, again, the results are encouraging, as is often the case with the combination of forecasts [47]. Moreover, the signal-to-noise ratio does not influence the hierarchy described. Detailed results on this are available upon request.
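The forecast combination mentioned above can be as simple as averaging the forecast tensors obtained for several choices of R; forecast_with_rank is a hypothetical wrapper around the pipeline sketched earlier.

```python
# Equally weighted ensemble over several hyperparameter choices (illustrative ranks)
X_hat_ensemble = np.mean([forecast_with_rank(r) for r in (3, 4, 5, 6)], axis=0)
```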

Using the Python library timeit, we quantify the computational burden associated with the methods in question. The measurements refer to a commodity notebook with an Intel Core i5-6300 CPU (2 × 2.40 GHz) and 8 GB RAM. By way of example, we assume 20% of the data to be missing. Averaged over 100 executions, the runtime of tensor extrapolation with \(R=4\) components equals 21.28 s. The baseline, ETSModel, takes 193.16 s on average. The reason for this difference lies in the computational cost of the automatic exponential smoothing algorithm, i.e., the selection and estimation of an adequate exponential smoothing method. Tensor extrapolation requires a tensor decomposition, but significantly reduces the dimension of the forecasting problem: the automatic exponential smoothing algorithm is applied to only \(R=4\) time series, whereas 19,200 per-series function calls are necessary for the baseline approach. Regardless of the computational resources available, tensor extrapolation should be computationally cheaper even with the combination of forecasts.
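A minimal timing sketch with timeit; run_tensor_extrapolation and run_baseline are hypothetical wrappers around the two pipelines above, and the 100 executions mirror the setup reported in the text.

```python
import timeit

n = 100
t_tensor = timeit.timeit(lambda: run_tensor_extrapolation(X_est, mask_est, R=4), number=n) / n
t_base   = timeit.timeit(lambda: run_baseline(X_est), number=n) / n
print(f"tensor extrapolation: {t_tensor:.2f} s per run, baseline: {t_base:.2f} s per run")
```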

Conclusions

In spite of the possibilities arising from the “big data revolution”, the relational character of many time series is largely neglected in forecasting tasks. Recently, tensor extrapolation has been shown to be effective in forecasting large-scale relational data [11]. However, the results so far have been limited to complete data sets. The paper at hand adapts tensor extrapolation to situations with missing entries. The results demonstrate that the method can be successfully applied with up to 50% missing values. Notwithstanding the missing elements, tensor extrapolation is able to extract meaningful latent structure from the data and to use this information for prediction. A preparatory completion of the data set (e.g., by replacing missing elements) is not required. Given the importance of missing values in practice [48], the findings of this paper provide a compelling argument in favor of tensor extrapolation.