WIDAR - Weighted Input Document Augmented ROUGE

  • Conference paper
Advances in Information Retrieval (ECIR 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13185)

Abstract

The task of automatic text summarization has gained a lot of traction due to recent advancements in machine learning techniques. However, evaluating the quality of a generated summary remains an open problem. The literature has widely adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the standard evaluation metric for summarization, but ROUGE has some long-established limitations; a major one is its dependence on the availability of a good-quality reference summary. In this work, we propose the metric WIDAR which, in addition to the reference summary, also utilizes the input document to evaluate the quality of the generated summary. The proposed metric is versatile, since it is designed to adapt the evaluation score according to the quality of the reference summary. On the human judgement scores provided in the SummEval dataset, the proposed metric correlates better than ROUGE by 26%, 76%, 82%, and 15% in coherence, consistency, fluency, and relevance, respectively. The proposed metric obtains results comparable with other state-of-the-art metrics while requiring a relatively short computational time (the implementation of WIDAR can be found at https://github.com/Raghav10j/WIDAR).
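
Although the full formulation appears in the paper itself rather than on this page, the core idea in the abstract can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: it interpolates a simplified ROUGE-1-style overlap with the reference summary and with the input document via an interpolation weight (the \(\lambda \) hyper-parameter discussed in the notes below). The helper names unigram_f1 and widar_style_score, and the set-based overlap, are assumptions made for illustration.

```python
# A minimal, illustrative sketch of the idea behind WIDAR, not the authors'
# implementation: interpolate overlap with the reference summary and overlap
# with the input document.

def unigram_f1(candidate: str, target: str) -> float:
    """ROUGE-1-style unigram F1, simplified: no stemming, no clipped counts."""
    cand = set(candidate.lower().split())
    targ = set(target.lower().split())
    overlap = len(cand & targ)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(targ)
    return 2 * precision * recall / (precision + recall)

def widar_style_score(summary: str, reference: str, document: str,
                      lam: float = 0.5) -> float:
    """Weight reference-summary similarity by lam and input-document
    similarity by (1 - lam); lam = 0.5 mirrors the paper's fixed setting."""
    return (lam * unigram_f1(summary, reference)
            + (1 - lam) * unigram_f1(summary, document))

print(widar_style_score(
    summary="the cat sat on the mat",
    reference="a cat was sitting on the mat",
    document="yesterday a cat was seen sitting on the mat in the hallway"))
```

Under this reading, a summary that matches the input document well is not penalized as heavily when the reference summary is of poor quality, which reflects the adaptivity the abstract describes.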

R. Jain, V. Mavi, and A. Jangra contributed equally.


Notes

  1.

    We multiply the final weights by the number of sentences in the reference summary, |R|, to ensure that the sum of the weights remains the same as in plain ROUGE, i.e., \(\sum_i w_i = |R|\). A sketch of this normalization appears after these notes.

  2.

    Note that \(ROUGE\text{-}1\) and \(ROUGE\text{-}1_{SL}\) denote the same metric.

  3.

    \(\lambda \) is a fixed hyper-parameter, which is set to 0.5 in our final experiments. We attempted to make \(\lambda \) a data-driven parameter by setting \(\lambda = max(w_{cov_i})\) or \(\lambda = mean(w_{cov_i})\), but neither setting was able to outperform the fixed value \(\lambda = 0.5\) (refer to Sect. 4.3). A sketch of these strategies appears after these notes.

  4.

    https://github.com/Yale-LILY/SummEval.

  5.

    All the hyperparameter tuning experiments were performed using \(ROUGE\text{-}L^{f}\) unless stated otherwise.

  6.

    We also observed that \(\lambda = mean(W_{cov})\) outperforms \(\lambda = max(W_{cov})\) in fluency and consistency, while the opposite holds for coherence and relevance. This can be explained by the fact that \(mean(W_{cov}) < max(W_{cov})\); the \(\lambda = mean(W_{cov})\) variant therefore always gives more weight to input-document similarity, which yields higher fluency and consistency scores because the input document consists of informationally rich and grammatically correct sentences.

  7.

    If a metric has more than one variant, the version corresponding to the F-score was used.

  8.

    All the reported metrics in Table 3 have been computed in a multi-reference setting using 11 reference summaries per generated summary.

  9.

    This experiment was conducted on a Tyrone machine with an Intel Xeon W-2155 processor, 196 GB of DDR4 RAM, and an 11 GB Nvidia 1080Ti GPU. The GPU was used only for the BLANC, SUPERT, BERTScore, and SummaQA evaluation metrics.
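
The weight normalization in footnote 1 is easy to make concrete. A minimal sketch, assuming the raw per-sentence weights are arbitrary non-negative numbers (the values below are hypothetical):

```python
# Footnote 1: rescale per-sentence weights so they sum to |R|, matching the
# implicit uniform weighting of plain ROUGE.

def normalize_weights(raw_weights):
    """Rescale so that sum(result) == len(raw_weights), i.e. |R|."""
    total = sum(raw_weights)
    n = len(raw_weights)
    if total == 0:
        return [1.0] * n  # degenerate case: fall back to uniform weights
    return [n * w / total for w in raw_weights]

weights = normalize_weights([0.2, 0.5, 0.3])  # hypothetical weights, |R| = 3
assert abs(sum(weights) - 3.0) < 1e-9         # sum is |R|, as in plain ROUGE
```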
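
Similarly, the \(\lambda \) strategies compared in footnotes 3 and 6 can be sketched as follows, assuming \(W_{cov}\) is a list of per-sentence coverage weights in [0, 1] (the values below are hypothetical):

```python
# Footnotes 3 and 6: candidate strategies for the interpolation weight lambda.

from statistics import mean

def choose_lambda(w_cov, strategy="fixed"):
    if strategy == "fixed":
        return 0.5         # the setting reported to work best (footnote 3)
    if strategy == "max":
        return max(w_cov)  # lambda = max(W_cov)
    if strategy == "mean":
        # mean(W_cov) <= max(W_cov), so this variant always puts more weight
        # on input-document similarity (footnote 6)
        return mean(w_cov)
    raise ValueError(f"unknown strategy: {strategy}")

w_cov = [0.1, 0.4, 0.25]   # hypothetical coverage weights
for s in ("fixed", "max", "mean"):
    print(s, choose_lambda(w_cov, s))
```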

References

  1. Bhandari, M., Narayan Gour, P., Ashfaq, A., Liu, P., Neubig, G.: Re-evaluating evaluation in text summarization. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)

  2. Böhm, F., Gao, Y., Meyer, C.M., Shapira, O., Dagan, I., Gurevych, I.: Better rewards yield better summaries: learning to summarise without references. arXiv:1909.01214 (2019)

  3. Chen, P., Wu, F., Wang, T.: A semantic QA-based approach for text summarization evaluation. In: AAAI (2018)

  4. Chopra, S., Auli, M., Rush, A.M.: Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016, pp. 93–98. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1012. https://aclanthology.org/N16-1012

  5. Clark, E., Celikyilmaz, A., Smith, N.A.: Sentence mover’s similarity: automatic evaluation for multi-sentence texts. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1264. https://aclanthology.org/P19-1264

  6. Dang, H.T.: Overview of DUC 2005. In: Proceedings of the Document Understanding Conference, vol. 2005, pp. 1–12 (2005)

  7. Duan, Y., Jatowt, A.: Across-time comparative summarization of news articles. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 735–743. ACM (2019)

  8. Edmundson, H.P.: New methods in automatic extracting. J. ACM 16, 264–285 (1969)

  9. Eyal, M., Baumel, T., Elhadad, M.: Question answering as an automatic evaluation metric for news article summarization. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019, pp. 3938–3948. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1395. https://aclanthology.org/N19-1395

  10. Fabbri, A.R., Kryscinski, W., McCann, B., Socher, R., Radev, D.: SummEval: re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. 9, 391–409 (2021)

  11. Ganesan, K.A.: Rouge 2.0: updated and improved measures for evaluation of summarization tasks. arXiv:1803.01937 (2018)

  12. Gao, Y., Zhao, W., Eger, S.: SUPERT: towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/v1/2020.acl-main.124. https://aclanthology.org/2020.acl-main.124

  13. Giannakopoulos, G., Karkaletsis, V.: AutoSummENG and MeMoG in evaluating guided summaries. In: TAC (2011)

  14. Grusky, M., Naaman, M., Artzi, Y.: Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics (2018). https://aclanthology.org/N18-1065

  15. Hasan, T., et al.: XL-Sum: large-scale multilingual abstractive summarization for 44 languages. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4693–4703 (2021)

  16. Jangra, A., Jain, R., Mavi, V., Saha, S., Bhattacharyya, P.: Semantic extractor-paraphraser based abstractive summarization. arXiv preprint arXiv:2105.01296 (2021)

  17. Jangra, A., Jatowt, A., Hasanuzzaman, M., Saha, S.: Text-image-video summary generation using joint integer linear programming. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 190–198. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_24

  18. Jangra, A., Jatowt, A., Saha, S., Hasanuzzaman, M.: A survey on multi-modal summarization (2021)

  19. Jangra, A., Saha, S., Jatowt, A., Hasanuzzaman, M.: Multi-modal summary generation using multi-objective optimization. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1745–1748 (2020)

  20. Jangra, A., Saha, S., Jatowt, A., Hasanuzzaman, M.: Multi-modal supplementary-complementary summarization using multi-objective optimization. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 818–828 (2021)

  21. Kryscinski, W., Keskar, N.S., McCann, B., Xiong, C., Socher, R.: Neural text summarization: a critical evaluation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1051. https://aclanthology.org/D19-1051

  22. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1995, New York, NY, USA, pp. 68–73. Association for Computing Machinery (1995). https://doi.org/10.1145/215206.215333

  23. Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, Lille, France, 07–09 July 2015, vol. 37, pp. 957–966. PMLR (2015). https://proceedings.mlr.press/v37/kusnerb15.html

  24. Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, June 2007, pp. 228–231. Association for Computational Linguistics (2007). https://aclanthology.org/W07-0734

  25. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, Barcelona, Spain, July 2004, pp. 74–81. Association for Computational Linguistics (2004). https://aclanthology.org/W04-1013

  26. Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., Xiang, B.: Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: CoNLL (2016)

  27. Nenkova, A.: Summarization evaluation for text and speech: issues and approaches. In: INTERSPEECH (2006)

  28. Ng, J.P., Abrecht, V.: Better summarization evaluation with word embeddings for ROUGE. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015, pp. 1925–1930. Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/D15-1222. https://aclanthology.org/D15-1222

  29. Paice, C.D.: Constructing literature abstracts by computer: techniques and prospects. Inf. Process. Manag. 26, 171–186 (1990)

  30. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)

  31. Passonneau, R.J., Chen, E., Guo, W., Perin, D.: Automated pyramid scoring of summaries using distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, August 2013, pp. 143–147. Association for Computational Linguistics (2013). https://aclanthology.org/P13-2026

  32. Peyrard, M., Botschen, T., Gurevych, I.: Learning to score system summaries for better content selection evaluation. In: Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, September 2017. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-4510. https://aclanthology.org/W17-4510

  33. Popović, M.: chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September 2015, pp. 392–395. Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/W15-3049. https://aclanthology.org/W15-3049

  34. Rath, G.J., Resnick, S., Savage, T.R.: The formation of abstracts by the selection of sentences (1961)

  35. Saini, N., Saha, S., Jangra, A., Bhattacharyya, P.: Extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowl. Based Syst. 164, 45–67 (2019)

  36. Scialom, T., Dray, P.A., Lamprier, S., Piwowarski, B., Staiano, J.: MLSUM: the multilingual summarization corpus. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8051–8067 (2020)

  37. Scialom, T., Lamprier, S., Piwowarski, B., Staiano, J.: Answers unite! Unsupervised metrics for reinforced summarization models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1320. https://aclanthology.org/D19-1320

  38. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017). https://arxiv.org/abs/1704.04368

  39. ShafieiBavani, E., Ebrahimi, M., Wong, R., Chen, F.: A graph-theoretic summary evaluation for ROUGE. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October–November 2018. Association for Computational Linguistics (2018). https://aclanthology.org/D18-1085

  40. ShafieiBavani, E., Ebrahimi, M., Wong, R.K., Chen, F.: Summarization evaluation in the absence of human model summaries using the compositionality of word embeddings. In: COLING (2018)

  41. Sun, S., Nenkova, A.: The feasibility of embedding based automatic evaluation for single document summarization. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019, pp. 1216–1221. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1116. https://aclanthology.org/D19-1116

  42. Vasilyev, O., Bohannon, J.: Is human scoring the best criteria for summary evaluation? In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, August 2021. https://doi.org/10.18653/v1/2021.findings-acl.192. https://aclanthology.org/2021.findings-acl.192

  43. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575 (2015)

  44. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SkeHuCVFDr

  45. Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C.M., Eger, S.: MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019, pp. 563–578. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1053. https://aclanthology.org/D19-1053

  46. Zhou, L., Lin, C.Y., Munteanu, D.S., Hovy, E.: ParaEval: using paraphrases to evaluate summaries automatically. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, New York City, USA, June 2006, pp. 447–454. Association for Computational Linguistics (2006). https://aclanthology.org/N06-1057

Acknowledgement

Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, and implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research.

Author information

Corresponding author

Correspondence to Anubhav Jangra.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jain, R., Mavi, V., Jangra, A., Saha, S. (2022). WIDAR - Weighted Input Document Augmented ROUGE. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185. Springer, Cham. https://doi.org/10.1007/978-3-030-99736-6_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-99736-6_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-99735-9

  • Online ISBN: 978-3-030-99736-6

  • eBook Packages: Computer Science, Computer Science (R0)
