Term Frequency and Estimating the Closeness of Short Texts to the Semantic Standard

  • ARTIFICIAL INTELLIGENCE TECHNIQUES IN PATTERN RECOGNITION AND IMAGE ANALYSIS
Pattern Recognition and Image Analysis

Abstract

This work deals with two interrelated problems: assessing the closeness of a text to the most rational (reference) form of conveying its sense, and forming the reference text collection against which the assessment is performed. The texts analyzed for closeness to the semantic standard are abstracts of scientific articles together with their titles. The solution is based on comparing the values of the 5th percentile of the empirical distribution corresponding to an array of fractions of nonzero term frequency (TF) values for separate phrases within each abstract, relative to each document considered for inclusion in the reference collection. A variant of numerically estimating the significance of an abstract is offered, which allows the mentioned percentile to be calculated for candidate documents with maximum precision when the most significant ones are selected for the reference collection.
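
The percentile-based comparison described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation: here a "fraction" is assumed to be the share of a phrase's distinct terms that have a nonzero TF in the candidate document, phrases are assumed to be split on sentence punctuation, and the percentile uses linear interpolation. The helper names (tokenize, split_phrases, nonzero_tf_fractions, percentile, closeness_score) are hypothetical, and the morphological normalization that the full article may apply is omitted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its alphabetic tokens (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def split_phrases(text):
    """Split a text into phrases on sentence punctuation (assumed definition of a phrase)."""
    return [p for p in re.split(r"[.!?;]+", text) if p.strip()]


def nonzero_tf_fractions(abstract, document):
    """For each phrase of the abstract, return the share of its distinct terms
    whose term frequency in the document is nonzero."""
    doc_tf = Counter(tokenize(document))  # raw term counts in the candidate document
    fractions = []
    for phrase in split_phrases(abstract):
        terms = set(tokenize(phrase))
        if not terms:
            continue  # skip phrases with no alphabetic tokens
        hits = sum(1 for term in terms if doc_tf[term] > 0)
        fractions.append(hits / len(terms))
    return fractions


def percentile(values, q):
    """Return the q-th percentile (0..100) of values using linear interpolation."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def closeness_score(abstract, document, q=5.0):
    """5th percentile (by default) of the per-phrase nonzero-TF fractions."""
    return percentile(nonzero_tf_fractions(abstract, document), q)


if __name__ == "__main__":
    abstract = "Term frequency measures closeness. Cats sleep."
    document = "term frequency measures closeness of texts"
    print(nonzero_tf_fractions(abstract, document))
    print(closeness_score(abstract, document))
```

Under this reading, a low percentile is dominated by the phrases of the abstract that are worst covered by the document, so comparing scores across candidate documents favors those that support every phrase rather than only most of them.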

Funding

This study was supported by the Russian Foundation for Basic Research (project no. 19-01-00006-a).

Author information

Corresponding author

Correspondence to D. V. Mikhaylov.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.

Conflict of Interest

The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.

Additional information

Dmitry V. Mikhaylov was born in 1974 and graduated from Yaroslav-the-Wise Novgorod State University, Novgorod, in 1997. He obtained his PhD and his Doctoral degrees in Physics and Mathematics in 2003 and 2013, respectively. From 2000 to 2007, he worked at the Department of Computer Software of Novgorod State University. Now he is a Professor of the Department of Information Technologies and Systems at the same university. Since 2002, he has been a member of the Russian Association for Pattern Recognition and Image Analysis. Scientific interests: computational linguistics and artificial intelligence. He is the author of 48 papers in the scientific area of pattern recognition and image analysis.

Gennady M. Emelyanov was born in 1943 and graduated from the Leningrad Institute of Electrical Engineering in 1966. He obtained his PhD and his Doctoral degrees in 1971 and 1990, respectively. From 1993 to 2003, he was the Dean of the Faculty of Mathematics and Computer Science at Yaroslav-the-Wise Novgorod State University. Now he is a Professor of the Department of Information Technologies and Systems at the same university. Scientific interests: construction of problem-oriented computing systems of image processing and analysis. He is the author of 103 publications in the field of pattern recognition and image analysis.

Translated by L. Solovyova

About this article

Cite this article

Mikhaylov, D.V., Emelyanov, G.M. Term Frequency and Estimating the Closeness of Short Texts to the Semantic Standard. Pattern Recognit. Image Anal. 33, 22–27 (2023). https://doi.org/10.1134/S1054661822040071

  • DOI: https://doi.org/10.1134/S1054661822040071
