Digilog: Enhancing Website Embedding on Local Governments - A Comparative Analysis

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14670)

Abstract

The ability to understand and process websites, known as website embedding, is crucial across various domains because it lays the foundation for machine understanding of websites. Website embedding proves particularly valuable when monitoring local government websites in the context of digital transformation. In this paper, we compare different state-of-the-art website embedding methods and assess, based on several clustering scores, how well each creates a reasonable website embedding for our specific task. The models comprise visual, mixed, and text-based embedding methods. We compare them with a baseline model that embeds the header section of a website. We measure their performance in an off-the-shelf evaluation as well as after transfer learning. Additionally, we evaluate the models’ capability of distinguishing municipality websites from other websites, such as tourist websites. We found that among off-the-shelf models, Homepage2Vec, a combination of visual and textual embedding, performs best. After transfer learning, MarkupLM, a markup language-based model, outperforms the others in cluster scoring as well as in precision and F1-score on the classification task. All mixed or markup language-based models achieve an F1-score and a precision over 97%. However, time is an important factor when computing on large quantities of data. Thus, when the time needed is also considered, our baseline model performs best.
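To make the idea of scoring an embedding by clustering quality concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes the silhouette coefficient as one possible clustering score (the abstract does not name the scores used), Euclidean distance, and a handful of made-up two-dimensional vectors standing in for real website embeddings. The function name `silhouette_score` and the labels `municipality` and `tourism` are illustrative choices.

```python
import math
from collections import defaultdict


def silhouette_score(embeddings, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Values near 1 mean that points lie close to their own class and far
    from the nearest other class; values near or below 0 mean the classes
    overlap in the embedding space.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to every other point in `members`.
        others = [j for j in members if j != i]
        return sum(math.dist(embeddings[i], embeddings[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Hypothetical toy data: two well-separated groups of "website embeddings".
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
true_labels = ["municipality", "municipality", "tourism", "tourism"]
mixed_labels = ["municipality", "tourism", "municipality", "tourism"]

good_score = silhouette_score(toy_embeddings, true_labels)
bad_score = silhouette_score(toy_embeddings, mixed_labels)
print(f"labels match geometry:   {good_score:.3f}")
print(f"labels ignore geometry:  {bad_score:.3f}")
```

In this setting an embedding method would be judged better the higher the score it yields for the known website classes. On real data, a library implementation would normally be preferable to this hand-written loop, which is quadratic in the number of websites.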

Notes

  1. https://www.digilog-project.org/

  2. https://pypi.org/project/homepage2vec/

  3. https://huggingface.co/microsoft/markuplm-base

  4. https://huggingface.co/microsoft/markuplm-large

  5. https://libretranslate.com/

Acknowledgment

This work is supported by Grant No. GR 200839 of the Swiss National Science Foundation (SNF) and German Research Foundation (DFG) for the research project “Digital Transformation at the Local Tier of Government in Europe: Dynamics and Effects from a Cross-Countries and Over-Time Comparative Perspective (DIGILOG)”.

Author information

Corresponding author

Correspondence to Jonathan Gerber.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gerber, J., Kreiner, B., Saxer, J., Weiler, A. (2024). Digilog: Enhancing Website Embedding on Local Governments - A Comparative Analysis. In: Appice, A., Azzag, H., Hacid, MS., Hadjali, A., Ras, Z. (eds) Foundations of Intelligent Systems. ISMIS 2024. Lecture Notes in Computer Science, vol 14670. Springer, Cham. https://doi.org/10.1007/978-3-031-62700-2_12

  • DOI: https://doi.org/10.1007/978-3-031-62700-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-62699-9

  • Online ISBN: 978-3-031-62700-2

  • eBook Packages: Computer Science, Computer Science (R0)
