TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors

Zhao, Jing; Chen, Rui; Fan, Pengcheng

doi:10.1007/s11227-024-06133-6

Published: 27 April 2024

The Journal of Supercomputing, volume 80, pages 17400–17422 (2024)

Abstract

Web crawler detection is critical for preventing unauthorized extraction of valuable information from websites. Current methods rely on heuristics, leading to time-consuming processes and inability to detect novel crawlers. Privacy protection and communication burdens during training are overlooked, resulting in potential privacy leaks. To address these issues, we propose a federated deep learning crawler detection model that analyzes access behaviors while preserving privacy. First, individual clients locally host website data, while the central server aggregates information for detection model parameters, eliminating raw user data transmission or access. We then develop an innovative algorithm constructing access path trees from user logs, effectively extracting temporal and spatial behavior features. Additionally, we propose a novel time series model with fused additive attention, enabling effective web crawler detection while preserving privacy and reducing data transmission. Finally, comprehensive evaluations on public datasets demonstrate robust privacy protection and effective detection of emerging crawler types.
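
The server-side aggregation step described above can be illustrated with a minimal sketch. The abstract does not state which aggregation rule TS-Finder uses, so this example assumes plain federated averaging (FedAvg), where each client's parameters are weighted by its local sample count. The function name `federated_average`, the parameter layout, and the example values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count.

    Assumption: FedAvg-style aggregation. Only parameters and sample counts
    reach the server; raw access logs stay on the clients.
    """
    if not updates:
        raise ValueError("need at least one client update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_params, _ = updates[0]
    averaged: Params = {}
    for name, values in first_params.items():
        acc = [0.0] * len(values)
        for params, n in updates:
            weight = n / total  # share of all training samples held by this client
            for i, v in enumerate(params[name]):
                acc[i] += weight * v
        averaged[name] = acc
    return averaged


# Hypothetical example: two websites (clients) with different log volumes.
client_a = ({"w": [1.0, 2.0], "b": [0.0]}, 100)
client_b = ({"w": [3.0, 4.0], "b": [1.0]}, 300)
global_params = federated_average([client_a, client_b])
print(global_params)
```

In this sketch the client with three times as many samples pulls the global parameters three times as strongly, which is the usual motivation for weighting by sample count rather than taking a simple mean.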

Data availability

The implemented code used to support the findings of this study is available from the corresponding author upon request. The datasets used in this paper are publicly available for download.

Acknowledgements

We would like to sincerely thank the editors and anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

School of Software Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
Jing Zhao, Rui Chen & Pengcheng Fan

Contributions

JZ (First Author): review and editing, supervision; RC (Author 2, Corresponding author): conceptualization, methodology, software, investigation, formal analysis, writing—original draft; PF (Author 3): methodology, software, investigation, data curation, writing—original draft.

Corresponding author

Correspondence to Rui Chen.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, J., Chen, R. & Fan, P. TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors. J Supercomput 80, 17400–17422 (2024). https://doi.org/10.1007/s11227-024-06133-6

Accepted: 08 April 2024
Published: 27 April 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s11227-024-06133-6
