Web page classification: a survey of perspectives, gaps, and future directions

Hashemi, Mahdi

doi:10.1007/s11042-019-08373-8

Published: 10 January 2020

Volume 79, pages 11921–11945, (2020)

Multimedia Tools and Applications

Mahdi Hashemi ORCID: orcid.org/0000-0003-0212-0228¹

Abstract

The explosive growth of the amount of information on the Internet has made Web page classification essential for Web information management, retrieval, and integration, Web page indexing, topic-specific Web crawling, topic-specific information extraction models, advertisement removal, filtering out unwanted, futile, or harmful content, and parental control systems. Owing to the recent staggering growth of performance and memory space in computing machines, along with the specialization of machine learning models for text and image classification, many researchers have begun to target the Web page classification problem. Yet, automatic Web page classification remains in its early stages because of its complexity, the diversity of Web pages' contents (images of different sizes, text, hyperlinks, etc.), and its computational cost. This paper not only surveys the methodologies proposed in the literature, but also traces their evolution and portrays different perspectives on this problem. Our study finds the following: (a) metadata and contextual information surrounding the terms are mostly ignored in textual content classification, (b) the structure and distribution of text in HTML tags and hyperlinks are understudied in textual content classification, (c) measuring the effectiveness of features in distinguishing among Web page classes, or measuring the contribution of each feature to the classification accuracy, is a prominent research gap, (d) image classification methods rely heavily on computationally intensive and problem-specific analyses for feature extraction, (e) semi-supervised learning is understudied, despite its importance in Web page classification because of the massive amount of unlabeled Web pages and the high cost of labeling, (f) deep learning, convolutional and recurrent networks, and reinforcement learning remain underexplored but intriguing for Web page classification, and last but not least (g) developing a detailed testbed along with evaluation metrics and establishing standard benchmarks remain a gap in assessing Web page classifiers.
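
As an illustration of point (b), the following minimal sketch shows one way the HTML tag enclosing a term could be folded into a textual feature. It is not a method taken from the survey: the tag weights, the `TagWeightedTerms` class, and the `tag_weighted_terms` function are hypothetical and chosen only for demonstration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights (an assumption for this sketch, not from the survey):
# terms inside these tags count more than ordinary body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}  # their text is not page content


class TagWeightedTerms(HTMLParser):
    """Accumulate term counts, weighting each term by its innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []           # tags that are currently open
        self.weights = Counter()  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else None
        if tag in SKIPPED_TAGS:
            return
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.weights[term] += weight


def tag_weighted_terms(html):
    """Return a Counter mapping each term to its tag-weighted frequency."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.weights


page = (
    "<html><head><title>Cheap flights</title></head>"
    "<body><h1>Flights</h1>"
    "<p>Book cheap flights <a href='/x'>deals</a></p></body></html>"
)
features = tag_weighted_terms(page)
```

The resulting weighted counts could replace raw term frequencies as input to any text classifier; in practice the weights would need to be tuned or learned rather than fixed by hand as they are here.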

Author information

Authors and Affiliations

Department of Information Sciences and Technology, George Mason University, 4400 University Dr, Fairfax, VA, 22030, USA
Mahdi Hashemi

Corresponding author

Correspondence to Mahdi Hashemi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hashemi, M. Web page classification: a survey of perspectives, gaps, and future directions. Multimed Tools Appl 79, 11921–11945 (2020). https://doi.org/10.1007/s11042-019-08373-8

Received: 18 May 2018
Revised: 21 August 2019
Accepted: 09 October 2019
Published: 10 January 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11042-019-08373-8
