TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors

The Journal of Supercomputing

Abstract

Web crawler detection is critical for preventing unauthorized extraction of valuable information from websites. Current methods rely on heuristics, which makes them time-consuming and unable to detect novel crawlers. Privacy protection and communication overhead during training are also overlooked, which can lead to privacy leaks. To address these issues, we propose a federated deep learning crawler detection model that analyzes access behaviors while preserving privacy. First, individual clients host website data locally, while the central server aggregates only detection model parameters, so raw user data is never transmitted or accessed. We then develop an algorithm that constructs access path trees from user logs to extract temporal and spatial behavior features. Additionally, we propose a time series model with fused additive attention, enabling effective web crawler detection while preserving privacy and reducing data transmission. Finally, comprehensive evaluations on public datasets demonstrate robust privacy protection and effective detection of emerging crawler types.
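The federated setup described in the abstract, where clients keep their logs and the server sees only model parameters, can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors use, so the sample-count-weighted averaging below is an assumption, and the function name `federated_average`, the parameter names `w` and `b`, and the two example sites are hypothetical.

```python
from typing import Dict, List, Tuple

# A model is represented here as named lists of floats (an illustrative choice).
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its local sample count.

    Only parameters and counts reach this function; raw access logs never do.
    """
    if not updates:
        raise ValueError("need at least one client update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_params, _ = updates[0]
    averaged: Params = {}
    for name, values in first_params.items():
        acc = [0.0] * len(values)
        for params, n in updates:
            weight = n / total
            for i, v in enumerate(params[name]):
                acc[i] += weight * v
        averaged[name] = acc
    return averaged


# Hypothetical example: two sites train locally and report parameters only.
site_a = ({"w": [1.0, 2.0], "b": [0.0]}, 100)  # 100 local sessions
site_b = ({"w": [3.0, 4.0], "b": [1.0]}, 300)  # 300 local sessions
global_params = federated_average([site_a, site_b])
print(global_params)
```

In this sketch the site with more sessions contributes proportionally more to the global parameters, and the server would send `global_params` back to the clients for the next local training round.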

Data availability

The implemented code used to support the findings of this study is available from the corresponding author upon request. The datasets used in this paper are publicly available for download.

Acknowledgements

We would like to sincerely thank the editors and anonymous reviewers for their helpful comments.

Author information

Contributions

JZ (First Author): review and editing, supervision; RC (Author 2, Corresponding author): conceptualization, methodology, software, investigation, formal analysis, writing—original draft; PF (Author 3): methodology, software, investigation, data curation, writing—original draft.

Corresponding author

Correspondence to Rui Chen.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, J., Chen, R. & Fan, P. TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors. J Supercomput 80, 17400–17422 (2024). https://doi.org/10.1007/s11227-024-06133-6

  • DOI: https://doi.org/10.1007/s11227-024-06133-6
