Understanding Interpretability: Explainable AI Approaches for Hate Speech Classifiers

  • Conference paper
  • Explainable Artificial Intelligence (xAI 2023)

Abstract

Cyberbullying and hate speech are among the most significant problems in today’s cyberspace. Automated artificial intelligence models could help detect and remove online hate speech, addressing a critical problem. As artificial intelligence continues to permeate numerous industries and drive consequential decisions, a variety of explainable AI techniques are being developed to make model judgments and their justifications intelligible to people. Our study focuses on code-mixed language (a mix of Hindi and English, commonly called Hinglish) in the Indian subcontinent; this language combination is used extensively across the SAARC nations. Three transformer-based models and one machine learning model were trained and fine-tuned on the modified HASOC Identification of Conversational Hate-Speech in Code-Mixed Languages (ICHCL) dataset for hate speech classification. Several explainability techniques, namely Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and model attention, were then applied to analyze the behavior of the respective models. The analysis suggests that better-trained models and a systematic comparison of Explainable Artificial Intelligence (XAI) techniques would yield deeper insight.
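
To make the workflow concrete, here is a minimal sketch (not the authors’ code) of how LIME can be applied to a fine-tuned transformer hate speech classifier using the lime and transformers libraries: LIME perturbs the input text, queries the model’s probability function, and fits a local surrogate that weights individual tokens. The checkpoint name, label names, and example sentence are hypothetical placeholders.

    import torch
    from lime.lime_text import LimeTextExplainer
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "my-hasoc-ichcl-model"  # hypothetical fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def predict_proba(texts):
        # LIME expects a function from a list of strings to an
        # (n_samples, n_classes) array of class probabilities.
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        return torch.softmax(logits, dim=-1).numpy()

    explainer = LimeTextExplainer(class_names=["NOT", "HOF"])  # HASOC-style labels
    explanation = explainer.explain_instance(
        "yeh comment bahut bura hai",  # hypothetical code-mixed input
        predict_proba,
        num_features=6,
    )
    print(explanation.as_list())  # (token, weight) pairs from the local surrogate

The (token, weight) pairs are the kind of local, per-prediction attributions that such LIME analyses inspect.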


Notes

  1. https://xgboost.readthedocs.io/en/stable/
  2. https://huggingface.co/
  3. https://github.com/marcotcr/lime
  4. https://github.com/slundberg/shap (a usage sketch follows this list)
  5. https://github.com/jessevig/bertviz
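
As a companion to the LIME sketch above, the following is a minimal, hedged sketch of the SHAP workflow from note 4, which can wrap a Hugging Face pipeline (note 2) directly; the checkpoint name and input sentence are again hypothetical placeholders.

    import shap
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="my-hasoc-ichcl-model",  # hypothetical fine-tuned checkpoint
        top_k=None,                    # return scores for every class
    )

    explainer = shap.Explainer(clf)    # a text masker is inferred from the pipeline
    shap_values = explainer(["yeh comment bahut bura hai"])
    shap.plots.text(shap_values)       # per-token additive contributions per class

Unlike LIME’s sampled local surrogate, SHAP distributes the prediction additively over tokens, which is one reason the two methods can rank the same tokens differently.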


Author information

Correspondence to Sargam Yadav or Abhishek Kaushik.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yadav, S., Kaushik, A., McDaid, K. (2023). Understanding Interpretability: Explainable AI Approaches for Hate Speech Classifiers. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1903. Springer, Cham. https://doi.org/10.1007/978-3-031-44070-0_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-44070-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44069-4

  • Online ISBN: 978-3-031-44070-0

  • eBook Packages: Computer Science, Computer Science (R0)
