Abstract
The explosive growth of the amount of information on the Internet has made Web page classification essential for Web information management, retrieval, and integration; Web page indexing; topic-specific Web crawling; topic-specific information extraction models; advertisement removal; filtering out unwanted, futile, or harmful content; and parental control systems. Owing to the recent staggering growth in the performance and memory capacity of computing machines, along with the specialization of machine learning models for text and image classification, many researchers have begun to target the Web page classification problem. Yet automatic Web page classification remains in its early stages because of its complexity, the diversity of Web page content (images of different sizes, text, hyperlinks, etc.), and its computational cost. This paper not only surveys the methodologies proposed in the literature, but also traces their evolution and portrays different perspectives on this problem. Our study identifies the following gaps: (a) metadata and contextual information surrounding the terms are mostly ignored in textual content classification, (b) the structure and distribution of text in HTML tags and hyperlinks are understudied in textual content classification, (c) measuring the effectiveness of features in distinguishing among Web page classes, or measuring the contribution of each feature to classification accuracy, is a prominent research gap, (d) image classification methods rely heavily on computationally intensive and problem-specific analyses for feature extraction, (e) semi-supervised learning is understudied, despite its importance in Web page classification given the massive amount of unlabeled Web pages and the high cost of labeling, (f) deep learning, convolutional and recurrent networks, and reinforcement learning remain underexplored but intriguing for Web page classification, and, last but not least, (g) developing a detailed testbed along with evaluation metrics and establishing standard benchmarks remain a gap in assessing Web page classifiers.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-019-08373-8/MediaObjects/11042_2019_8373_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-019-08373-8/MediaObjects/11042_2019_8373_Fig2_HTML.png)
Cite this article
Hashemi, M. Web page classification: a survey of perspectives, gaps, and future directions. Multimed Tools Appl 79, 11921–11945 (2020). https://doi.org/10.1007/s11042-019-08373-8
DOI: https://doi.org/10.1007/s11042-019-08373-8