Topic modeling combined with classification technique for extractive multi-document text summarization

  • Methodologies and Application
Soft Computing

Abstract

The human-written reference summaries available in existing datasets are often of poor quality, which makes it difficult to build an accurate text summarization model. Recent work has addressed this issue and laid a strong foundation for further improvement, but it still has many limitations. To address them, this paper proposes a novel methodology that uses topic modeling and a classification technique to generate a coherent summary from a corpus of documents. The objectives of the proposed work are highlighted below:

  • A novel heuristic approach is introduced to determine the actual number of topics present in a corpus of documents while handling the stochastic nature of latent Dirichlet allocation (LDA).

  • A large corpus of documents is handled by reducing the huge set of sentences to a small set without losing the important ones, thus providing a concise and information-rich summary.

  • The sentences in the coherent summary are arranged according to their importance.

  • The experimental results are compared with those of state-of-the-art summarization systems.

The empirical results show that the proposed model is more promising than well-known text summarization models.



Notes

  1. https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html.

  2. https://towardsdatascience.com/lda2vec-word-embeddings-in-topic-models-4ee3fc4b2843.

  3. https://radimrehurek.com/gensim/.

  4. www.encyclopediaofmath.org/index.php?title=Hellinger_distance&oldid=16453.

  5. For experimental purposes, values between 0.2 and 0.8 were tested in steps of 0.05; 0.4 performed best.

  6. Since reduction is being performed, \(2/X < 1\).

  7. http://www.nltk.org/.

  8. http://www.duc.nist.gov.
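
Notes 4 and 5 mention the Hellinger distance and a threshold sweep from 0.2 to 0.8 in steps of 0.05. The part of the article that uses them is not included in this preview, so the following is only a minimal sketch of the standard Hellinger distance between two discrete probability distributions, together with the grid of candidate values described in note 5. The function names `hellinger` and `threshold_grid` are hypothetical, and how the paper applies the distance or the threshold is an assumption left open here.

```python
import math


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions.

    p and q are sequences of non-negative probabilities of equal length.
    The result lies in [0, 1]: 0 for identical distributions and 1 for
    distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def threshold_grid(start=0.2, stop=0.8, step=0.05):
    """Candidate values from start to stop inclusive (hypothetical helper).

    The defaults mirror the range in note 5; values are rounded to two
    decimals to avoid floating-point drift.
    """
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count + 1)]


if __name__ == "__main__":
    # Two made-up distributions over three topics, for illustration only.
    topic_a = [0.7, 0.2, 0.1]
    topic_b = [0.1, 0.2, 0.7]
    print(hellinger(topic_a, topic_b))
    print(threshold_grid())
```

The grid produced with the default arguments contains 13 candidate values, including the 0.4 that note 5 reports as performing best.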

Author information

Corresponding author

Correspondence to Rajendra Kumar Roul.

Ethics declarations

Conflict of interest

The author declared that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by the author.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Roul, R.K. Topic modeling combined with classification technique for extractive multi-document text summarization. Soft Comput 25, 1113–1127 (2021). https://doi.org/10.1007/s00500-020-05207-w

  • DOI: https://doi.org/10.1007/s00500-020-05207-w
