Abstract
In this paper we propose a graph-based biased LexRank approach combined with topic modeling to build a topic-based extractive summarization model. The topical summarization task aims to produce a summary customized for a particular reader. We achieve this by incorporating information about topics of interest into an extractive summarization model. Topical information is derived from the aspect embedding vectors of the ABAE model and serves as the topic input to a biased LexRank model that runs on a heterogeneous graph of topics and sentences. Including this topical structure makes the summary more targeted and benefits the overall summarization model. We conduct experiments on a novel dataset for evaluating extractive summarization quality that we constructed from Wikipedia articles.
References
Belwal, R.C., Rai, S., Gupta, A.: Text summarization using topic-based vector space model and semantic measure. Inf. Process. Manag. 58(3), 102536 (2021)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
Cui, P., Hu, L., Liu, Y.: Enhancing extractive text summarization with topic-aware graph neural networks. arXiv preprint arXiv:2010.06253 (2020)
Dieng, A.B., Ruiz, F.J., Blei, D.M.: Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 8, 439–453 (2020)
Erkan, G.: Using biased random walks for focused summarization (2006)
He, R., Lee, W.S., Ng, H.T., Dahlmeier, D.: An unsupervised neural attention model for aspect extraction. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 388–397. Association for Computational Linguistics, July 2017. https://doi.org/10.18653/v1/P17-1036. https://www.aclweb.org/anthology/P17-1036
Ianina, A., Vorontsov, K.: Multimodal topic modeling for exploratory search in collective blog. J. Mach. Learn. Data Anal. 2(2), 173–186 (2016)
Iyyer, M., Guha, A., Chaturvedi, S., Boyd-Graber, J., Daumé III, H.: Feuding families and former friends: unsupervised learning for dynamic fictional relationships. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1534–1544 (2016)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
Lin, C.Y., Och, F.: Looking for a few good metrics: ROUGE and its evaluation. In: NTCIR Workshop (2004)
Miao, Y., Grefenstette, E., Blunsom, P.: Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359 (2017)
Miao, Y., Yu, L., Blunsom, P.: Neural variational inference for text processing. In: International Conference on Machine Learning, pp. 1727–1736 (2016)
Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411. Association for Computational Linguistics, July 2004. https://www.aclweb.org/anthology/W04-3252
Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents (2016)
Otterbacher, J., Erkan, G., Radev, D.R.: Biased LexRank: passage retrieval using random walks with question-based priors. Inf. Process. Manag. 45(1), 42–54 (2009)
Rehurek, R., Sojka, P.: Gensim-python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, vol. 3, no. 2 (2011)
Socher, R., Karpathy, A., Le, Q.V., Manning, C.D., Ng, A.Y.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
Vorontsov, K., Frei, O., Apishev, M., Romov, P., Dudarenko, M.: BigARTM: open source library for regularized multimodal topic modeling of large collections. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, V.G. (eds.) AIST 2015. CCIS, vol. 542, pp. 370–381. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26123-2_36
Vorontsov, K., Frei, O., Apishev, M., Romov, P., Suvorova, M., Yanina, A.: Non-Bayesian additive regularization for multimodal topic modeling of large collections. In: Proceedings of the 2015 Workshop on Topic Models: Post-processing and Applications, pp. 29–37 (2015)
Weston, J., Bengio, S., Usunier, N.: WSABIE: scaling up to large vocabulary image annotation. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)
Xu, J., Gan, Z., Cheng, Y., Liu, J.: Discourse-aware neural extractive text summarization (2020)
Yasunaga, M., Zhang, R., Meelu, K., Pareek, A., Srinivasan, K., Radev, D.: Graph-based neural multi-document summarization. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, pp. 452–462. Association for Computational Linguistics, August 2017. https://doi.org/10.18653/v1/K17-1045. https://www.aclweb.org/anthology/K17-1045
Zhang, L.: Neural topic models (2020). https://github.com/zll17/Neural_Topic_Models
Zhu, H., Dong, L., Wei, F., Qin, B., Liu, T.: Transforming wikipedia into augmented data for query-focused summarization (2019)
Acknowledgements
The work of the second author was funded by RFBR, project number 20-37-90025. The work of the last author was funded by RFBR, project number 19-37-60027.
A Appendix
A.1 Experiments with BigARTM and WikiPersons
BigARTM [7, 20] is an open-source project for topic modeling of large collections. It allows training various topic models, including multimodal, hierarchical, and temporal additively regularized variants. Several experiments on a Wikipedia corpus show that BigARTM runs faster and achieves better perplexity than other popular packages such as Vowpal Wabbit and Gensim.
The topic model was trained with the following parameters:
- number of the most frequently used words in the text corpus included in the dictionary: 2000;
- number of topics: 50;
- weight coefficient for SmoothSparsePhiRegularizer: -2;
- weight coefficient for DecorrelatorPhiRegularizer: 0.00001;
- number of collection passes during training: 35.
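With these parameters, the training setup can be sketched with BigARTM's Python API roughly as follows. This is a configuration sketch under assumptions, not the authors' exact code: the data path and regularizer names are hypothetical, and the corpus is assumed to be pre-converted into BigARTM batches.

```python
import artm

# Hypothetical path; the corpus is assumed to be already converted
# into BigARTM's batch format.
batch_vectorizer = artm.BatchVectorizer(data_path='wiki_batches',
                                        data_format='batches')
dictionary = batch_vectorizer.dictionary
# Keep only the 2000 most frequent words, as in the setup above.
dictionary.filter(max_dictionary_size=2000)

model = artm.ARTM(num_topics=50, dictionary=dictionary)
model.regularizers.add(
    artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-2.0))
model.regularizers.add(
    artm.DecorrelatorPhiRegularizer(name='decorr_phi', tau=1e-5))

model.fit_offline(batch_vectorizer=batch_vectorizer,
                  num_collection_passes=35)
```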
We also conducted a series of experiments varying the parameter d (the damping factor), testing overall model performance on the CNN/Daily Mail dataset (Table 3):
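In biased LexRank [16], the damping factor d mixes the topic bias with a random walk over the sentence-similarity graph. A minimal sketch of the iteration (the similarity matrix and bias vector below are toy values, not corpus estimates):

```python
import numpy as np

def biased_lexrank(sim, bias, d=0.85, tol=1e-8, max_iter=200):
    """Biased LexRank power iteration (sketch after Otterbacher et al.).

    sim:  (n, n) symmetric sentence-similarity matrix
    bias: (n,) relevance of each sentence to the topic/query
    d:    damping factor weighting the topic bias vs. the random walk
    """
    # Row-normalize similarities into a stochastic transition matrix.
    P = sim / sim.sum(axis=1, keepdims=True)
    b = bias / bias.sum()
    scores = np.full(len(b), 1.0 / len(b))
    for _ in range(max_iter):
        new = d * b + (1 - d) * P.T @ scores
        if np.abs(new - scores).sum() < tol:
            break
        scores = new
    return scores

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
bias = np.array([0.1, 0.1, 0.8])  # third sentence most on-topic
scores = biased_lexrank(sim, bias, d=0.5)
```

With a larger d, the ranking follows the topic bias more closely; with d close to 0, it degenerates to plain (unbiased) LexRank.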
A.2 Experiments with Neural Topic Models and WikiPersons
Neural topic models (NTMs) make topic coherence directly optimizable on a text corpus. NTMs use neural variational inference, while classical models such as LDA rely on Gibbs sampling or Bayesian variational inference, which are mathematically cumbersome and must be re-derived after even a small change in the modeling assumptions. In our experiments we considered the following neural topic models:
- NVDM-GSM [12]. The architecture is a simple VAE [9] that takes the bag-of-words (BOW) representation of a document as input. The topic vector is sampled from the distribution Q(z|x) and then normalized with a softmax function (for details see Fig. 3, taken from the "Neural Topic Models" GitHub repo [24]).
- ETM [4]. The architecture is a straightforward VAE [9], with the topic-word distribution matrix decomposed into the product of a matrix of topic vectors and a matrix of word vectors (Fig. 4). This improves the interpretability of topics by placing the topic vectors and the word vectors in the same vector space.
- WTM-GMM. An improved variant of WLDA (weakly supervised LDA). Architecturally it is a Wasserstein Auto-Encoder (WAE) [?] that uses a Gaussian mixture as the prior distribution together with a Gaussian softmax. The number of mixture components is usually the same as the number of topics in the model. A detailed scheme of WTM-GMM is given in Fig. 5 (taken from the "Neural Topic Models" GitHub repo [24]).
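As an illustration of the Gaussian softmax construction behind NVDM-GSM, here is a minimal numpy sketch: a latent code is sampled from Q(z|x) via the reparameterization trick and projected to a topic distribution. The projection matrix W and all dimensions are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def gsm_topic_vector(mu, log_sigma, W, rng):
    """Gaussian softmax step of NVDM-GSM (illustrative sketch).

    mu, log_sigma: encoder outputs parameterizing Q(z|x) for one document
    W:             (num_topics, latent_dim) projection (hypothetical)
    """
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps
    # Softmax turns the projected latent code into a topic distribution.
    logits = W @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

latent_dim, num_topics = 8, 5
theta = gsm_topic_vector(np.zeros(latent_dim), np.zeros(latent_dim),
                         rng.standard_normal((num_topics, latent_dim)), rng)
```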
All model implementations were taken from the Neural Topic Models repository [24]. For all models we set the number of topics to 300 and the batch size to 512. NVDM-GSM and WTM-GMM were trained for 150 epochs, while ETM was trained for 160 epochs.
We provide a qualitative comparison of the aforementioned neural topic models in Table 4. The NTMs are compared on the following criteria (these metrics come from the CoherenceModel class in the Gensim library [17]):
- c_v is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity;
- c_uci is based on a sliding window and the pointwise mutual information (PMI) of all word pairs among the given top words;
- c_npmi is an enhanced version of c_uci that uses normalized pointwise mutual information (NPMI).
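For reference, the NPMI value underlying c_npmi can be computed directly from co-occurrence probabilities; the probabilities below are toy values, not corpus estimates.

```python
import math

def npmi(p_ij, p_i, p_j, eps=1e-12):
    """Normalized PMI used by the c_npmi coherence (sketch).

    NPMI(wi, wj) = PMI(wi, wj) / (-log p(wi, wj)), bounded in [-1, 1].
    """
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / (-math.log(p_ij + eps))

# Words that always co-occur approach +1; independent words give ~0.
high = npmi(p_ij=0.1, p_i=0.1, p_j=0.1)    # p_ij == p_i == p_j
indep = npmi(p_ij=0.01, p_i=0.1, p_j=0.1)  # p_ij == p_i * p_j
```

The normalization by -log p(wi, wj) is what distinguishes c_npmi from c_uci, which averages the raw PMI values.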
According to Table 4, as well as manual inspection of topics by their top tokens, the NVDM-GSM model performs better than all the other models considered.
We also experimented with varying the parameter d here; however, the average precision increased only from 0.643 to 0.65 as the damping value grew.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zheltova, K., Ianina, A., Malykh, V. (2022). Topical Extractive Summarization. In: Malykh, V., Filchenkov, A. (eds) Artificial Intelligence and Natural Language. AINL 2022. Communications in Computer and Information Science, vol 1731. Springer, Cham. https://doi.org/10.1007/978-3-031-23372-2_2
DOI: https://doi.org/10.1007/978-3-031-23372-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23371-5
Online ISBN: 978-3-031-23372-2