An approach to text data categorization based on the ideas of J.S. Mill

Automatic Documentation and Mathematical Linguistics

Abstract

The problem of automatically categorizing natural-language text documents is considered. Categorization is performed by a method based on the ideas of J.S. Mill. The technique uses the general principles, but not the technical details, of the JSM method for the automatic generation of hypotheses. Experiments are described, and the performance of the system built to implement the technique is assessed. With an optimal selection of settings, the suggested approach shows better accuracy than other techniques.
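The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general JSM idea it refers to: similarities shared by examples of one class become hypotheses unless an example of the opposite class contains them, and a new document is categorized by which hypotheses it contains. It is not the author's system. The function names, the representation of a document as a set of terms, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def hypotheses(own, other):
    """Pairwise similarities of `own` examples that no `other` example contains.

    Assumption: a document is a frozenset of terms and a "similarity"
    is the plain intersection of two term sets.
    """
    found = set()
    for a, b in combinations(own, 2):
        common = a & b
        # Counterexample ban: drop a similarity that also occurs
        # inside an example of the opposite class.
        if common and not any(common <= ex for ex in other):
            found.add(common)
    return found


def categorize(doc, pos_hyps, neg_hyps):
    """Return "+", "-", or "?" (contradictory or no evidence)."""
    plus = any(h <= doc for h in pos_hyps)
    minus = any(h <= doc for h in neg_hyps)
    if plus and not minus:
        return "+"
    if minus and not plus:
        return "-"
    return "?"


# Hypothetical toy training data: "+" = sports, "-" = finance.
sports = [
    frozenset({"match", "goal", "team"}),
    frozenset({"goal", "team", "coach"}),
    frozenset({"team", "season", "goal"}),
]
finance = [
    frozenset({"market", "price", "team"}),
    frozenset({"price", "bank", "season"}),
]

pos_hyps = hypotheses(sports, finance)
neg_hyps = hypotheses(finance, sports)
```

With this toy data, `categorize(frozenset({"goal", "team", "fans"}), pos_hyps, neg_hyps)` falls in the positive class, while a document matching hypotheses of both classes, or of neither, is left undecided. Intersecting only pairs keeps the sketch short; it is a simplification chosen here, not something stated in the abstract.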

Author information

Correspondence to N. D. Lyfenko.

Additional information

Original Russian Text © N.D. Lyfenko, 2015, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2, 2015, No. 11, pp. 12–23.

About this article

Cite this article

Lyfenko, N.D. An approach to text data categorization based on the ideas of J.S. Mill. Autom. Doc. Math. Linguist. 49, 202–212 (2015). https://doi.org/10.3103/S0005105515060035


  • DOI: https://doi.org/10.3103/S0005105515060035
