Abstract
This work addresses two interrelated problems: assessing how close a text is to the most rational (reference) form of conveying its sense, and forming the reference text collection against which that assessment is performed. The texts analyzed for closeness to the semantic standard are abstracts of scientific articles together with their titles. The solution compares values of the 5th percentile of the empirical distribution of the fractions of nonzero term frequency (TF) values, computed for the separate phrases of each abstract relative to each document considered for inclusion in the reference collection. A variant is proposed for numerically estimating the significance of an abstract when this percentile is calculated for candidate documents, with maximum precision when selecting the most significant ones for the reference collection.
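The percentile computation described in the abstract can be illustrated with a minimal sketch. It reflects one reading of the abstract, not the authors' implementation: for each phrase of an abstract, it takes the fraction of the phrase's words that have a nonzero TF in a candidate document (that is, words that occur in the document at least once), and then takes the 5th percentile of those fractions. The function names `tokenize`, `nonzero_tf_fraction`, and `fifth_percentile_score` are hypothetical, the whitespace tokenizer is a stand-in for real morphological analysis, and the abstract's significance weighting is not modeled.

```python
import numpy as np


def tokenize(text):
    """Naive lowercase tokenizer (an assumption; no lemmatization is done)."""
    tokens = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def nonzero_tf_fraction(phrase_tokens, doc_vocab):
    """Fraction of phrase words whose TF in the document is nonzero."""
    if not phrase_tokens:
        return 0.0
    hits = sum(1 for t in phrase_tokens if t in doc_vocab)
    return hits / len(phrase_tokens)


def fifth_percentile_score(abstract_phrases, document_text):
    """5th percentile of per-phrase nonzero-TF fractions for one document."""
    # A word has nonzero TF in the document exactly when it occurs there.
    doc_vocab = set(tokenize(document_text))
    fractions = [
        nonzero_tf_fraction(tokenize(phrase), doc_vocab)
        for phrase in abstract_phrases
    ]
    return float(np.percentile(fractions, 5))


# Hypothetical toy data, for illustration only.
candidate_document = "term frequency measures how often a term occurs in a document"
abstract_phrases = ["term frequency", "a term occurs", "unrelated words here"]
score = fifth_percentile_score(abstract_phrases, candidate_document)
```

Under this reading, a higher score means that even the worst-covered phrases of the abstract are largely expressed in the candidate document's vocabulary, so candidate documents can be compared by their scores.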
Funding
This study was supported by the Russian Foundation for Basic Research (project no. 19-01-00006-a).
Ethics declarations
COMPLIANCE WITH ETHICAL STANDARDS
This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.
Conflict of Interest
The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.
Additional information
Dmitry V. Mikhaylov was born in 1974 and graduated from Yaroslav-the-Wise Novgorod State University, Novgorod, in 1997. He obtained his PhD and Doctoral degrees in Physics and Mathematics in 2003 and 2013, respectively. From 2000 to 2007, he worked at the Department of Computer Software of Novgorod State University. He is now a Professor at the Department of Information Technologies and Systems of the same university. Since 2002, he has been a member of the Russian Association for Pattern Recognition and Image Analysis. Scientific interests: computational linguistics and artificial intelligence. He is the author of 48 papers in the area of pattern recognition and image analysis.
Gennady M. Emelyanov was born in 1943 and graduated from the Leningrad Institute of Electrical Engineering in 1966. He obtained his PhD and Doctoral degrees in 1971 and 1990, respectively. From 1993 to 2003, he was Dean of the Faculty of Mathematics and Computer Science at Yaroslav-the-Wise Novgorod State University. He is now a Professor at the Department of Information Technologies and Systems of the same university. Scientific interests: construction of problem-oriented computing systems for image processing and analysis. He is the author of 103 publications in the field of pattern recognition and image analysis.
Translated by L. Solovyova
Cite this article
Mikhaylov, D.V., Emelyanov, G.M. Term Frequency and Estimating the Closeness of Short Texts to the Semantic Standard. Pattern Recognit. Image Anal. 33, 22–27 (2023). https://doi.org/10.1134/S1054661822040071
DOI: https://doi.org/10.1134/S1054661822040071