Abstract
This paper considers the automatic categorization of natural-language text documents. Categorization is performed by a method based on the ideas of J.S. Mill. The technique uses the general principles, but not the technical details, of the JSM method for the automatic generation of hypotheses. Experiments are described, and the performance of the system built to implement the technique is assessed. With an optimal selection of options, the suggested approach achieves higher accuracy than the techniques it was compared against.
Additional information
Original Russian Text © N.D. Lyfenko, 2015, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2, 2015, No. 11, pp. 12–23.
About this article
Cite this article
Lyfenko, N.D. An approach to text data categorization based on the ideas of J.S. Mill. Autom. Doc. Math. Linguist. 49, 202–212 (2015). https://doi.org/10.3103/S0005105515060035
DOI: https://doi.org/10.3103/S0005105515060035