A non-redundant feature selection method for text categorization based on term co-occurrence frequency and mutual information

Published in: Multimedia Tools and Applications

Abstract

Feature selection is a crucial preprocessing step for text categorization that helps reduce the feature space, speed up the learning process, and improve the accuracy of classification algorithms. Efficient filter-based feature selection methods based on document frequency, such as Information Gain, the Chi-Square Test, and the Improved Gini Index, are the most widely used because of their high performance and low time complexity compared to information-theoretic methods, which perform well on low-dimensional data but are less efficient on high-dimensional data. The main issue with statistical filter-based methods, however, is feature redundancy: assessing each feature's importance independently of the others leads to selecting many features that provide no additional information about the class variable, resulting in longer training time and lower classification performance. To combine the effectiveness of information-theoretic methods at handling redundancy with low time complexity, we propose a new non-sequential selection method, named Co-occurrence-level Feature Selection and Redundancy Removal (CFSRR), which uses mutual information to evaluate the importance of each feature with respect to its co-occurring features rather than with respect to already selected features, as in classical information-theoretic methods. The idea is that features co-occurring in the same context are semantically correlated, forming candidate redundant features and helping to avoid feature-by-feature redundancy evaluation. Compared with ten effective feature selection metrics, empirical results show the efficiency of CFSRR in terms of micro-F1 and macro-F1 scores obtained from Naïve Bayes and SVM classifiers on five publicly available datasets, demonstrating its robustness for balanced, unbalanced, binary, and multi-class classification problems.
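The core idea described above can be sketched in a few lines: score each term by its mutual information with the class, then treat frequently co-occurring terms as candidate redundant features and keep only the higher-scoring term of each such pair, avoiding pairwise comparison against already selected features. The sketch below is a hypothetical illustration under assumed simplifications (binary term presence, a co-occurrence-frequency threshold of 2), not the authors' CFSRR algorithm; the function names are invented.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(docs, labels, term):
    """I(term; class) in bits, over binary term presence/absence."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    p_t = Counter(term in doc for doc in docs)
    p_c = Counter(labels)
    mi = 0.0
    for (t, c), cnt in joint.items():
        p_tc = cnt / n  # only observed (term, class) cells, so p_tc > 0
        mi += p_tc * log2(p_tc / ((p_t[t] / n) * (p_c[c] / n)))
    return mi

def cfsrr_sketch(docs, labels, k, min_cooc=2):
    """Select k terms: MI relevance scoring, then co-occurrence-based pruning.

    docs: iterable of token sets; labels: class label per document.
    """
    vocab = sorted({t for doc in docs for t in doc})
    score = {t: mutual_information(docs, labels, t) for t in vocab}
    # Count pairs of terms appearing together in the same document.
    cooc = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            cooc[(a, b)] += 1
    # Frequently co-occurring terms are candidate redundant features:
    # drop the weaker term of each such pair (assumed pruning rule).
    kept = set(vocab)
    for (a, b), freq in cooc.items():
        if freq >= min_cooc:
            kept.discard(a if score[a] < score[b] else b)
    return sorted(kept, key=score.get, reverse=True)[:k]
```

Because redundancy is checked only among co-occurring pairs, the pruning pass touches each observed term pair once instead of re-evaluating every candidate against the growing selected set, which is what keeps the cost low relative to sequential information-theoretic selectors.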


[The full article includes Figs. 1–4 and Algorithms 1–2.]

Availability of supporting data

The data described in this article are publicly available at https://www.kaggle.com/datasets and https://starling.utdallas.edu/datasets/


Acknowledgements

Not Applicable.

Funding

Not Applicable.

Author information


Contributions

The authors, Farek Lazhar and Benaidja Amira, contributed equally to this work.

Corresponding author

Correspondence to Lazhar Farek.

Ethics declarations

Conflicts of interest

This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas.

Competing interests

The authors declare no conflicts of interest in preparing this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farek, L., Benaidja, A. A non-redundant feature selection method for text categorization based on term co-occurrence frequency and mutual information. Multimed Tools Appl 83, 20193–20214 (2024). https://doi.org/10.1007/s11042-023-15876-y

