
Multivariate long-time series traffic passenger flow prediction using causal convolutional sparse self-attention MTS-Informer

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

As an important part of the operation planning process in intelligent transportation systems, the distribution law and forecast of passenger flow can guide urban rail transit operators in formulating reasonable operation scheduling plans. Because traffic passenger flow data are complex, multivariate, and unstable, accurate passenger flow prediction is difficult. Building on convolutional neural networks, this paper proposes MTS-Informer, a causal-convolution sparse self-attention framework for traffic passenger flow prediction. The method follows the changing law of auxiliary variables, applies a stabilization method to reduce the instability of the original sequence, and uses causal convolution to improve the self-attention mechanism's ability to extract local information from the input sequence. The weakening effect of the self-attention mechanism ensures that it can learn features analogous to the differenced features of the original sequence data. In addition, stationarity detection of the original sequence data is added. Experimental results show that the fit to the sample data improves significantly and the standard error decreases by 10–40%, verifying the effectiveness of the proposed modeling technique. The model achieves higher prediction accuracy and operating efficiency and can provide a basis for urban traffic passenger flow prediction.



Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work is supported by the open project of the State Key Laboratory of Integrated Automation of Process Industry, Northeastern University (No. 2020052).

Author information

Corresponding author

Correspondence to Wei Wang.

Ethics declarations

Conflict of interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. I declare on behalf of my co-authors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All listed authors have approved the enclosed manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

To complement the causal convolutional deep neural network structure introduced in Sect. 3.2, a more detailed description is given here:

A causal convolutional network consists of an input layer, hidden layers, and an output layer, as shown in Fig. 8. Each layer uses the same type of neurons, and adjacent layers are not fully connected; this sparse connectivity is realized through a mask.

Fig. 8

The input layer receives the feature vectors, the hidden layers extract information from the raw data, and the output layer produces the results. In this work, the self-attention mechanism with the built-in causal convolutional neural network plays a crucial role in capturing the local characteristics of the input time series.
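The following is a minimal PyTorch sketch of such a masked causal convolution. The layer sizes, names, and toy input are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that attends only to current and past time steps.

    Causality is enforced by left-padding the input with (kernel_size - 1)
    zeros, so the output at time t never depends on inputs after t.
    """
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1  # pad only on the left
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # left padding keeps causality
        return self.conv(x)

# Toy usage: a multivariate series with 4 variables and 96 time steps.
x = torch.randn(8, 4, 96)
layer = CausalConv1d(in_channels=4, out_channels=16, kernel_size=3)
y = layer(x)  # shape (8, 16, 96); y[..., t] uses only x[..., :t+1]
```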

A supplementary explanation of the sparse probabilistic self-attention algorithm mentioned in Sect. 3.2:

$$A(q_i,K,V)=\sum _{j=1}^{L_k}\frac{k(q_i,k_j)}{\sum _{l}k(q_i,k_l)}v_j=\mathbb {E}_{p(k_j\vert q_i)}[v_j]$$
(1)
$$p(k_j\vert q_i)=\frac{k(q_i,k_j)}{\sum _{l}k(q_i,k_l)}$$
(2)
$$q(k_j\vert q_i)=\frac{1}{L_k}$$
(3)

Here, \(q_i\), \(k_i\), and \(v_i\) denote the i-th rows of Q, K, and V, respectively; \(p(k_j\vert q_i)\) is the attention probability distribution of the i-th query over all keys, and \(q(k_j\vert q_i)\) is the corresponding uniform distribution.

The KL divergence between the two distributions P and Q measures the sparsity of a query. Its discrete and continuous forms are defined as follows:

$$D(P\Vert Q)=\sum _{i\in x}P(i)\log \frac{P(i)}{Q(i)}$$
(4)
$$D(P\Vert Q)=\int _x P(x)\log \frac{P(x)}{Q(x)}\,dx$$
(5)

Substituting the query's attention probability distribution and the uniform distribution into the KL divergence yields the following approximate expression for the sparsity of a query:

$$KL(q\Vert p)=\max _{j}\{q_i k_j^{T}\cdot d^{*}\}-\frac{1}{L_k}\sum _{j=1}^{L_k}q_i k_j^{T}\cdot d^{*}$$
(6)

In this formula, the first term computes the inner products of the i-th query with all keys and takes the maximum; the second term is their arithmetic mean. The larger the gap between the two, the greater the difference between p and q. Within the set selection interval, whose size is determined by the sampling parameters, the queries with the highest-ranked gaps are selected. Consequently, for a time series of length L, computing the similarity between Q and K with the sparse probabilistic self-attention mechanism requires only \(O(L\log L)\) dot-product operations. The dot products of traditional self-attention cost \(O(L^2)\) time and memory per layer, and stacking n encoder-decoders for long-sequence inputs costs \(O(n\cdot L^2)\) in total. By improving the network components as described above, the probabilistic sparse self-attention mechanism achieves \(O(L\log L)\) time complexity and memory usage.
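As a rough illustration of Eq. (6), the sketch below scores each query by the gap between the maximum and the mean of its scaled dot products with the keys, then runs full attention only for the top-u queries; the remaining queries fall back to the mean of V, mirroring the uniform distribution q. The \(1/\sqrt{d}\) scaling in place of \(d^{*}\) and the omission of key sampling are simplifying assumptions, so this is not the full \(O(L\log L)\) implementation.

```python
import torch

def query_sparsity(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # Q: (L_q, d), K: (L_k, d). Per-query Eq. (6) measure: the max of the
    # scaled dot products with all keys minus their arithmetic mean.
    scores = Q @ K.T / Q.shape[-1] ** 0.5      # (L_q, L_k)
    return scores.max(dim=-1).values - scores.mean(dim=-1)

def sparse_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, u: int):
    """Full attention for the u queries with the largest sparsity measure;
    all other queries receive the mean of V (the uniform-distribution case)."""
    top = query_sparsity(Q, K).topk(u).indices
    out = V.mean(dim=0, keepdim=True).expand(Q.shape[0], -1).clone()
    attn = torch.softmax(Q[top] @ K.T / Q.shape[-1] ** 0.5, dim=-1)
    out[top] = attn @ V
    return out

# Toy usage: 128 queries/keys of width 32; keep the 16 most "active" queries.
Q, K, V = (torch.randn(128, 32) for _ in range(3))
y = sparse_attention(Q, K, V, u=16)  # shape (128, 32)
```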

Appendix B

In the experiment, the visualization results of the sequence normality tests are supplemented as follows:

Fig. 9 Normality test P-P diagram

Fig. 10 Normality test Q-Q plot

The cumulative probability method (P-P plot, Fig. 9) is a scatter diagram that plots the cumulative probability of the predicted variable against the cumulative probability of the specified theoretical distribution; the x-axis carries the sample's cumulative probability, and both axes range over (0, 1). The quantile method (Q-Q plot, Fig. 10) places the quantiles of the expected theoretical distribution on the x-axis and the quantiles of the empirical distribution of the predicted sample on the y-axis. We apply both graphical methods to the normality test with the expected distribution set to the standard normal distribution and examine whether the points fall on the line \(y=x\); the intercept of the fitted line gives the mean, and its slope gives the standard deviation. If the population underlying the sample is normally distributed, the scatter points in the P-P and Q-Q plots should fall near the \(45^{\circ }\) line through the origin.
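For reference, a minimal Q-Q check of this kind can be run with scipy.stats.probplot; the synthetic sample below is a placeholder for the passenger flow series, and the fitted line's slope and intercept recover the standard deviation and mean as described above.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Placeholder sample standing in for the predicted passenger flow series.
sample = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=500)

# probplot plots sample quantiles against standard normal quantiles and
# fits a line whose slope/intercept estimate the sample's std/mean.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm", plot=plt)
print(f"slope (std) = {slope:.2f}, intercept (mean) = {intercept:.2f}, r = {r:.3f}")
plt.title("Normality test Q-Q plot")
plt.show()
```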

The optimization operator is particularly critical to improving the self-attention performance in this paper, and this group of control experiments shows its advantages intuitively and clearly. Introducing the optimization operator has a marked impact on the self-attention mechanism: its prediction results are generally better than those of the basic model, an effect demonstrated across the various data sets (Fig. 11).

Fig. 11 Optimization operator control experiment group

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, M., Wang, W., Hu, X. et al. Multivariate long-time series traffic passenger flow prediction using causal convolutional sparse self-attention MTS-Informer. Neural Comput & Applic 35, 24207–24223 (2023). https://doi.org/10.1007/s00521-023-09003-z


  • DOI: https://doi.org/10.1007/s00521-023-09003-z
