Abstract
Web crawler detection is critical for preventing unauthorized extraction of valuable information from websites. Current methods rely on heuristics, which makes them time-consuming and unable to detect novel crawlers. They also overlook privacy protection and communication overhead during training, which can lead to privacy leaks. To address these issues, we propose a federated deep learning crawler detection model that analyzes access behaviors while preserving privacy. First, each client keeps its website data locally, while the central server aggregates only the detection model parameters, so raw user data is never transmitted to or accessed by the server. Second, we develop an algorithm that constructs access path trees from user logs and extracts temporal and spatial behavior features from them. Third, we propose a time series model with fused additive attention that detects web crawlers effectively while preserving privacy and reducing data transmission. Finally, comprehensive evaluations on public datasets demonstrate robust privacy protection and effective detection of emerging crawler types.
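The server-side aggregation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a FedAvg-style rule in which each client's parameters are weighted by its local sample count; the abstract does not state the exact aggregation rule, so the weighting, the function name `federated_average`, and the toy parameter layout are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: layer name -> flattened list of weights.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Return the sample-weighted mean of client parameters (FedAvg-style).

    Each update is (parameters, number_of_local_samples). Only parameters
    reach the server; raw access logs stay on the clients.
    """
    if not updates:
        raise ValueError("need at least one client update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name, values in updates[0][0].items():
        acc = [0.0] * len(values)
        for params, n in updates:
            weight = n / total  # assumed weighting: share of all samples
            for i, v in enumerate(params[name]):
                acc[i] += weight * v
        averaged[name] = acc
    return averaged


# Toy example: two clients with different amounts of local log data.
client_a = ({"w": [1.0, 2.0], "b": [0.0]}, 100)
client_b = ({"w": [3.0, 4.0], "b": [1.0]}, 300)
global_params = federated_average([client_a, client_b])
```

In this toy run the second client holds three times as many samples, so the averaged parameters sit closer to its values. A real deployment would repeat this step over many training rounds and operate on tensors rather than Python lists.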
Data availability
The implemented code used to support the findings of this study is available from the corresponding author upon request. The datasets used in this paper are publicly available for download.
Acknowledgements
We would like to sincerely thank the editors and anonymous reviewers for their helpful comments.
Author information
Contributions
JZ (First Author): review and editing, supervision; RC (Author 2, Corresponding author): conceptualization, methodology, software, investigation, formal analysis, writing—original draft; PF (Author 3): methodology, software, investigation, data curation, writing—original draft.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, J., Chen, R. & Fan, P. TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors. J Supercomput 80, 17400–17422 (2024). https://doi.org/10.1007/s11227-024-06133-6
DOI: https://doi.org/10.1007/s11227-024-06133-6