Understanding Interpretability: Explainable AI Approaches for Hate Speech Classifiers

  • Conference paper
  • Explainable Artificial Intelligence (xAI 2023)

Abstract

Cyberbullying and hate speech are among the most significant problems in today’s cyberspace. Automated artificial intelligence models could help detect and remove online hate speech, addressing a critical problem. As artificial intelligence continues to permeate numerous industries and drive consequential decisions, a variety of explainable AI techniques are being developed to make model judgments and their justifications intelligible to people. Our study focuses on code-mixed language (a mix of Hindi and English, commonly called Hinglish) in the Indian subcontinent; this language combination is used extensively across the SAARC nations. Three transformer-based models and one machine learning model were trained and fine-tuned on the modified HASOC Identification of Conversational Hate-Speech in Code-Mixed Languages (ICHCL) dataset for hate speech classification. Several explainability techniques, namely Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and model attention, were then applied to analyze the behavior of the respective models. The analysis suggests that better-trained models and a systematic comparison of Explainable Artificial Intelligence (XAI) techniques would yield deeper insight.
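
To make the workflow concrete, here is a minimal sketch (not the authors’ code) of how LIME can be applied to a fine-tuned transformer hate speech classifier using the lime and transformers libraries: LIME perturbs the input text, queries the model’s probability function, and fits a local surrogate that weights individual tokens. The checkpoint name, label names, and example sentence are hypothetical placeholders.

    import torch
    from lime.lime_text import LimeTextExplainer
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "my-hasoc-ichcl-model"  # hypothetical fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def predict_proba(texts):
        # LIME expects a function from a list of strings to an
        # (n_samples, n_classes) array of class probabilities.
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        return torch.softmax(logits, dim=-1).numpy()

    explainer = LimeTextExplainer(class_names=["NOT", "HOF"])  # HASOC-style labels
    explanation = explainer.explain_instance(
        "yeh comment bahut bura hai",  # hypothetical code-mixed input
        predict_proba,
        num_features=6,
    )
    print(explanation.as_list())  # (token, weight) pairs from the local surrogate

The (token, weight) pairs are the kind of local, per-prediction attributions that such LIME analyses inspect.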


Notes

  1. https://xgboost.readthedocs.io/en/stable/
  2. https://huggingface.co/
  3. https://github.com/marcotcr/lime
  4. https://github.com/slundberg/shap (a usage sketch follows this list)
  5. https://github.com/jessevig/bertviz
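
As a companion to the LIME sketch above, the following is a minimal, hedged sketch of the SHAP workflow from note 4, which can wrap a Hugging Face pipeline (note 2) directly; the checkpoint name and input sentence are again hypothetical placeholders.

    import shap
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="my-hasoc-ichcl-model",  # hypothetical fine-tuned checkpoint
        top_k=None,                    # return scores for every class
    )

    explainer = shap.Explainer(clf)    # a text masker is inferred from the pipeline
    shap_values = explainer(["yeh comment bahut bura hai"])
    shap.plots.text(shap_values)       # per-token additive contributions per class

Unlike LIME’s sampled local surrogate, SHAP distributes the prediction additively over tokens, which is one reason the two methods can rank the same tokens differently.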


Author information

Correspondence to Sargam Yadav or Abhishek Kaushik.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yadav, S., Kaushik, A., McDaid, K. (2023). Understanding Interpretability: Explainable AI Approaches for Hate Speech Classifiers. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1903. Springer, Cham. https://doi.org/10.1007/978-3-031-44070-0_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-44070-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44069-4

  • Online ISBN: 978-3-031-44070-0

  • eBook Packages: Computer Science, Computer Science (R0)
