Term Frequency and Estimating the Closeness of Short Texts to the Semantic Standard

Mikhaylov, D. V.; Emelyanov, G. M.

doi:10.1134/S1054661822040071


ARTIFICIAL INTELLIGENCE TECHNIQUES IN PATTERN RECOGNITION AND IMAGE ANALYSIS
Published: 18 May 2023

Pattern Recognition and Image Analysis, Volume 33, pages 22–27 (2023)

D. V. Mikhaylov¹ &
G. M. Emelyanov¹


Abstract

This work addresses two interrelated problems: assessing how close a text is to the most rational (reference) form of conveying its sense, and forming the reference text collection against which that assessment is made. The texts analyzed for closeness to the semantic standard are abstracts of scientific articles together with their titles. The solution compares values of the 5th percentile of the empirical distribution corresponding to an array of fractions of nonzero term frequency (TF) values for separate phrases within each abstract, relative to each document under consideration for inclusion in the reference collection. A variant is offered for numerically estimating the significance of the abstract when calculating this percentile for candidate documents, with maximum precision in selecting the documents most significant for the reference collection.
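The abstract gives only an outline of the statistic, so the following Python sketch is one possible reading of it and not the authors' implementation. It assumes that, for each phrase of an abstract, the "fraction of nonzero TF values" is the share of the phrase's tokens that occur at least once in a candidate document, and that the score for the pair (abstract, document) is the 5th percentile of those fractions. The function names, the regex tokenizer (used here in place of real morphological analysis), and the linear-interpolation percentile are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for morphological analysis."""
    return re.findall(r"\w+", text.lower())


def nonzero_tf_fraction(phrase, doc_counts):
    """Share of the phrase's tokens whose TF in the document is nonzero."""
    tokens = tokenize(phrase)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if doc_counts[t] > 0)
    return hits / len(tokens)


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def closeness_score(abstract_phrases, document, q=5):
    """Assumed score: q-th percentile of per-phrase nonzero-TF fractions."""
    doc_counts = Counter(tokenize(document))
    fractions = [nonzero_tf_fraction(p, doc_counts) for p in abstract_phrases]
    return percentile(fractions, q)


if __name__ == "__main__":
    phrases = ["term frequency of words", "semantic standard"]
    candidate = "term frequency is a standard measure"
    print(closeness_score(phrases, candidate))
```

Under this reading, candidate documents could be ranked by `closeness_score` for a given abstract, with a low percentile indicating that at least some phrases of the abstract are poorly covered by the document. Whether the paper ranks candidates this way is not stated in the abstract.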



Funding

This study was supported by the Russian Foundation for Basic Research (project no. 19-01-00006-a).

Author information

Authors and Affiliations

Yaroslav-the-Wise Novgorod State University, 173003, Velikii Novgorod, Russia
D. V. Mikhaylov & G. M. Emelyanov


Corresponding author

Correspondence to D. V. Mikhaylov.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.

Conflict of Interest

The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.

Additional information

Dmitry V. Mikhaylov was born in 1974 and graduated from Yaroslav-the-Wise Novgorod State University, Novgorod, in 1997. He obtained his PhD and Doctoral degrees in Physics and Mathematics in 2003 and 2013, respectively. From 2000 to 2007, he worked at the Department of Computer Software of Novgorod State University. He is now a Professor of the Department of Information Technologies and Systems at the same university. Since 2002, he has been a member of the Russian Association for Pattern Recognition and Image Analysis. Scientific interests: computational linguistics and artificial intelligence. He is the author of 48 papers in the scientific area of pattern recognition and image analysis.

Gennady M. Emelyanov was born in 1943 and graduated from the Leningrad Institute of Electrical Engineering in 1966. He obtained his PhD and Doctoral degrees in 1971 and 1990, respectively. From 1993 to 2003, he was Dean of the Faculty of Mathematics and Computer Science at Yaroslav-the-Wise Novgorod State University. He is now a Professor of the Department of Information Technologies and Systems at the same university. Scientific interests: construction of problem-oriented computing systems for image processing and analysis. He is the author of 103 publications in the field of pattern recognition and image analysis.

Translated by L. Solovyova


About this article

Cite this article

Mikhaylov, D.V., Emelyanov, G.M. Term Frequency and Estimating the Closeness of Short Texts to the Semantic Standard. Pattern Recognit. Image Anal. 33, 22–27 (2023). https://doi.org/10.1134/S1054661822040071


Received: 20 July 2022
Revised: 20 July 2022
Accepted: 20 July 2022
Published: 18 May 2023
Issue Date: March 2023
DOI: https://doi.org/10.1134/S1054661822040071
