Web page classification: a survey of perspectives, gaps, and future directions

Multimedia Tools and Applications

Abstract

The explosive growth of information on the Internet has made Web page classification essential for Web information management, retrieval, and integration; Web page indexing; topic-specific Web crawling; topic-specific information extraction models; advertisement removal; filtering out unwanted, futile, or harmful content; and parental control systems. Owing to the recent staggering growth in the performance and memory of computing machines, along with the specialization of machine learning models for text and image classification, many researchers have begun to target the Web page classification problem. Yet automatic Web page classification remains in its early stages because of its complexity, the diversity of Web page content (images of different sizes, text, hyperlinks, etc.), and its computational cost. This paper not only surveys the methodologies proposed in the literature but also traces their evolution and portrays different perspectives on this problem. Our study finds the following: (a) metadata and contextual information surrounding the terms are mostly ignored in textual content classification, (b) the structure and distribution of text in HTML tags and hyperlinks are understudied in textual content classification, (c) measuring the effectiveness of features in distinguishing among Web page classes, or measuring the contribution of each feature to the classification accuracy, is a prominent research gap, (d) image classification methods rely heavily on computationally intensive and problem-specific analyses for feature extraction, (e) semi-supervised learning is understudied, despite its importance in Web page classification given the massive number of unlabeled Web pages and the high cost of labeling, (f) deep learning, convolutional and recurrent networks, and reinforcement learning remain underexplored but intriguing for Web page classification, and last but not least (g) developing a detailed testbed along with evaluation metrics and establishing standard benchmarks remain a gap in assessing Web page classifiers.

Author information

Corresponding author

Correspondence to Mahdi Hashemi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hashemi, M. Web page classification: a survey of perspectives, gaps, and future directions. Multimed Tools Appl 79, 11921–11945 (2020). https://doi.org/10.1007/s11042-019-08373-8

  • DOI: https://doi.org/10.1007/s11042-019-08373-8
