Abstract
The task of automatic text summarization has gained considerable traction due to recent advances in machine learning. However, evaluating the quality of a generated summary remains an open problem. The literature has widely adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the standard evaluation metric for summarization, but ROUGE has long-established limitations, a major one being its dependence on the availability of a good-quality reference summary. In this work, we propose WIDAR, a metric that uses the input document in addition to the reference summary to evaluate the quality of the generated summary. The proposed metric is versatile, since it is designed to adapt the evaluation score according to the quality of the reference summary. On the human judgement scores provided in the SummEval dataset, WIDAR correlates better than ROUGE by 26%, 76%, 82%, and 15% in coherence, consistency, fluency, and relevance, respectively. The proposed metric obtains results comparable to other state-of-the-art metrics while requiring relatively little computation time. (The implementation of WIDAR is available at https://github.com/Raghav10j/WIDAR.)
R. Jain, V. Mavi, and A. Jangra contributed equally.
Notes
1. We multiply the final weights by the number of sentences in the reference summary \(|R|\) to ensure that the sum of weights remains the same as in plain ROUGE, i.e., \(\sum _i w_i = |R|\).
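This normalization can be sketched as follows (a minimal illustration; `normalize_weights` and the raw weight values are hypothetical, and the paper defines how the per-sentence weights are actually computed):

```python
def normalize_weights(raw_weights):
    """Rescale per-sentence weights so that they sum to |R|,
    the number of reference-summary sentences, keeping the
    total weight identical to plain ROUGE."""
    n = len(raw_weights)      # |R|
    total = sum(raw_weights)
    return [w * n / total for w in raw_weights]

# toy raw weights for a 3-sentence reference summary
weights = normalize_weights([0.2, 0.5, 0.3])
# the rescaled weights now sum to |R| = 3
```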
2. Note that \(ROUGE\text {-} 1\) and \(ROUGE\text {-} 1_{SL}\) denote the same metric.
3. \(\lambda \) is a fixed hyper-parameter, set to 0.5 in our final experiments. We attempted to make \(\lambda \) data-driven by setting \(\lambda = max(w_{cov_i})\) or \(\lambda = mean(w_{cov_i})\), but neither setting outperformed the fixed \(\lambda =0.5\) value (refer to Sect. 4.3).
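The role of \(\lambda \) as an interpolation weight can be sketched as follows (a minimal illustration only; the function name and the linear combination shown are assumptions, not the paper's exact formulation, which is given in the main text):

```python
LAMBDA = 0.5  # fixed hyper-parameter, as in the note above

def combined_score(ref_similarity, doc_similarity, lam=LAMBDA):
    """Interpolate a reference-based similarity with an
    input-document-based similarity; lam = 1.0 ignores the
    input document, lam = 0.0 ignores the reference."""
    return lam * ref_similarity + (1 - lam) * doc_similarity
```

With the fixed lam = 0.5, both signals contribute equally; the data-driven variants mentioned above would instead set lam per summary from the coverage weights.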
4.
5. All the hyperparameter tuning experiments were performed using \(ROUGE\text {-} L^{f}\) unless stated otherwise.
6. It was also observed that \(\lambda = mean(W_{cov})\) outperforms \(\lambda = max(W_{cov})\) in fluency and consistency, while the opposite holds for coherence and relevance. This can be explained by the fact that \(mean(W_{cov}) < max(W_{cov})\); the \(\lambda = mean(W_{cov})\) variant therefore always gives more weight to the input-document similarity, yielding higher fluency and consistency scores because the input document consists of all the informationally rich and grammatically correct sentences.
7. If a metric has more than one variant, the version corresponding to the F-score was used.
8. All metrics reported in Table 3 were computed in a multi-reference setting using 11 reference summaries per generated summary.
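Such multi-reference scoring can be sketched as below, assuming max-aggregation over references and a toy unigram-recall metric (both are assumptions for illustration; the aggregation actually used in the paper may differ):

```python
def multi_reference_score(metric, candidate, references):
    """Score a candidate summary against several reference
    summaries and keep the best (max) per-candidate score."""
    return max(metric(candidate, ref) for ref in references)

def unigram_recall(cand, ref):
    """Toy metric: fraction of reference unigrams found in the candidate."""
    cand_tokens, ref_tokens = set(cand.split()), set(ref.split())
    return len(cand_tokens & ref_tokens) / max(len(ref_tokens), 1)

score = multi_reference_score(unigram_recall, "the cat sat",
                              ["the cat sat down", "a dog ran"])
# → 0.75, from the first (best-matching) reference
```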
9. This experiment was conducted on a Tyrone machine with an Intel Xeon W-2155 processor, 196 GB of DDR4 RAM, and an 11 GB Nvidia 1080Ti GPU. The GPU was used only for the BLANC, SUPERT, BERTScore, and SummaQA evaluation metrics.
References
Bhandari, M., Narayan Gour, P., Ashfaq, A., Liu, P., Neubig, G.: Re-evaluating evaluation in text summarization. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)
Böhm, F., Gao, Y., Meyer, C.M., Shapira, O., Dagan, I., Gurevych, I.: Better rewards yield better summaries: learning to summarise without references. arXiv:1909.01214 (2019)
Chen, P., Wu, F., Wang, T.: A semantic QA-based approach for text summarization evaluation. In: AAAI (2018)
Chopra, S., Auli, M., Rush, A.M.: Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016, pp. 93–98. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1012. https://aclanthology.org/N16-1012
Clark, E., Celikyilmaz, A., Smith, N.A.: Sentence mover’s similarity: automatic evaluation for multi-sentence texts. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1264. https://aclanthology.org/P19-1264
Dang, H.T.: Overview of DUC 2005. In: Proceedings of the Document Understanding Conference, vol. 2005, pp. 1–12 (2005)
Duan, Y., Jatowt, A.: Across-time comparative summarization of news articles. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 735–743. ACM (2019)
Edmundson, H.P.: New methods in automatic extracting. J. ACM 16, 264–285 (1969)
Eyal, M., Baumel, T., Elhadad, M.: Question answering as an automatic evaluation metric for news article summarization. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019, pp. 3938–3948. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1395. https://aclanthology.org/N19-1395
Fabbri, A.R., Kryscinski, W., McCann, B., Socher, R., Radev, D.: SummEval: re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. 9, 391–409 (2021)
Ganesan, K.A.: ROUGE 2.0: updated and improved measures for evaluation of summarization tasks. arXiv:1803.01937 (2018)
Gao, Y., Zhao, W., Eger, S.: SUPERT: towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/v1/2020.acl-main.124. https://aclanthology.org/2020.acl-main.124
Giannakopoulos, G., Karkaletsis, V.: AutoSummENG and MeMoG in evaluating guided summaries. In: TAC (2011)
Grusky, M., Naaman, M., Artzi, Y.: Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics (2018). https://aclanthology.org/N18-1065
Hasan, T., et al.: XL-Sum: large-scale multilingual abstractive summarization for 44 languages. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4693–4703 (2021)
Jangra, A., Jain, R., Mavi, V., Saha, S., Bhattacharyya, P.: Semantic extractor-paraphraser based abstractive summarization. arXiv preprint arXiv:2105.01296 (2021)
Jangra, A., Jatowt, A., Hasanuzzaman, M., Saha, S.: Text-image-video summary generation using joint integer linear programming. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 190–198. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_24
Jangra, A., Jatowt, A., Saha, S., Hasanuzzaman, M.: A survey on multi-modal summarization (2021)
Jangra, A., Saha, S., Jatowt, A., Hasanuzzaman, M.: Multi-modal summary generation using multi-objective optimization. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1745–1748 (2020)
Jangra, A., Saha, S., Jatowt, A., Hasanuzzaman, M.: Multi-modal supplementary-complementary summarization using multi-objective optimization. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 818–828 (2021)
Kryscinski, W., Keskar, N.S., McCann, B., Xiong, C., Socher, R.: Neural text summarization: a critical evaluation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1051. https://aclanthology.org/D19-1051
Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1995, New York, NY, USA, pp. 68–73. Association for Computing Machinery (1995). https://doi.org/10.1145/215206.215333
Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, Lille, France, 07–09 July 2015, vol. 37, pp. 957–966. PMLR (2015). https://proceedings.mlr.press/v37/kusnerb15.html
Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, June 2007, pp. 228–231. Association for Computational Linguistics (2007). https://aclanthology.org/W07-0734
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, Barcelona, Spain, July 2004, pp. 74–81. Association for Computational Linguistics (2004). https://aclanthology.org/W04-1013
Nallapati, R., Zhou, B., Santos, C.D., Gülçehre, Ç., Xiang, B.: Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: CoNLL (2016)
Nenkova, A.: Summarization evaluation for text and speech: issues and approaches. In: INTERSPEECH (2006)
Ng, J.P., Abrecht, V.: Better summarization evaluation with word embeddings for ROUGE. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015, pp. 1925–1930. Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/D15-1222. https://aclanthology.org/D15-1222
Paice, C.D.: Constructing literature abstracts by computer: techniques and prospects. Inf. Process. Manag. 26, 171–186 (1990)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
Passonneau, R.J., Chen, E., Guo, W., Perin, D.: Automated pyramid scoring of summaries using distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, August 2013, pp. 143–147. Association for Computational Linguistics (2013). https://aclanthology.org/P13-2026
Peyrard, M., Botschen, T., Gurevych, I.: Learning to score system summaries for better content selection evaluation. In: Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, September 2017. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-4510. https://aclanthology.org/W17-4510
Popović, M.: chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September 2015, pp. 392–395. Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/W15-3049. https://aclanthology.org/W15-3049
Rath, G.J., Resnick, S., Savage, T.R.: The formation of abstracts by the selection of sentences (1961)
Saini, N., Saha, S., Jangra, A., Bhattacharyya, P.: Extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowl. Based Syst. 164, 45–67 (2019)
Scialom, T., Dray, P.A., Lamprier, S., Piwowarski, B., Staiano, J.: MLSUM: the multilingual summarization corpus. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8051–8067 (2020)
Scialom, T., Lamprier, S., Piwowarski, B., Staiano, J.: Answers unite! Unsupervised metrics for reinforced summarization models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1320. https://aclanthology.org/D19-1320
See, A., Liu, P., Manning, C.: Get to the point: summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017). https://arxiv.org/abs/1704.04368
ShafieiBavani, E., Ebrahimi, M., Wong, R., Chen, F.: A graph-theoretic summary evaluation for ROUGE. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October–November 2018. Association for Computational Linguistics (2018). https://aclanthology.org/D18-1085
ShafieiBavani, E., Ebrahimi, M., Wong, R.K., Chen, F.: Summarization evaluation in the absence of human model summaries using the compositionality of word embeddings. In: COLING (2018)
Sun, S., Nenkova, A.: The feasibility of embedding based automatic evaluation for single document summarization. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019, pp. 1216–1221. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1116. https://aclanthology.org/D19-1116
Vasilyev, O., Bohannon, J.: Is human scoring the best criteria for summary evaluation? In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, August 2021. https://doi.org/10.18653/v1/2021.findings-acl.192. https://aclanthology.org/2021.findings-acl.192
Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575 (2015)
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SkeHuCVFDr
Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C.M., Eger, S.: MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019, pp. 563–578. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1053. https://aclanthology.org/D19-1053
Zhou, L., Lin, C.Y., Munteanu, D.S., Hovy, E.: ParaEval: using paraphrases to evaluate summaries automatically. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, New York City, USA, June 2006, pp. 447–454. Association for Computational Linguistics (2006). https://aclanthology.org/N06-1057
Acknowledgement
Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia) for carrying out this research.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Jain, R., Mavi, V., Jangra, A., Saha, S. (2022). WIDAR - Weighted Input Document Augmented ROUGE. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185. Springer, Cham. https://doi.org/10.1007/978-3-030-99736-6_21
DOI: https://doi.org/10.1007/978-3-030-99736-6_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99735-9
Online ISBN: 978-3-030-99736-6