Abstract
Feature selection is a crucial preprocessing step for text categorization: it reduces the feature space, speeds up the learning process, and can improve the accuracy of classification algorithms. Filter-based methods built on document frequency, such as Information Gain, the Chi-Square test, and the Improved Gini Index, are the most widely used because they combine high performance with low time complexity, whereas information-theoretic methods perform well on low-dimensional data but scale poorly to high-dimensional data. The main weakness of statistical filter-based methods, however, is feature redundancy: because each feature's importance is assessed independently of the others, they often select many features that provide no additional information about the class variable, which increases training time and degrades classification performance. To retain the effectiveness of information-theoretic methods in handling redundancy while keeping time complexity low, we propose a new non-sequential selection method, named Co-occurrence-level Feature Selection and Redundancy Removal (CFSRR), which uses mutual information to evaluate each feature's importance with respect to its co-occurring features rather than with respect to already selected features, as classical information-theoretic methods do. The underlying idea is that features co-occurring in the same context are semantically correlated and therefore form candidate redundant features, which avoids feature-by-feature redundancy evaluation. Compared with ten effective feature selection metrics, empirical results show the efficiency of CFSRR in terms of micro-F1 and macro-F1 scores obtained with Naïve Bayes and SVM classifiers on five publicly available datasets, demonstrating its robustness on balanced, unbalanced, binary, and multi-class classification problems.
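To make the co-occurrence-based redundancy idea concrete, the following is a simplified, hypothetical sketch, not the authors' exact CFSRR procedure: each term is scored by its mutual information (MI) with the class, and a term is discarded when some co-occurring term is both strongly MI-correlated with it and more class-relevant. The MI threshold, the binary presence representation, and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of co-occurrence-level redundancy removal
# (assumptions: binary term presence, an arbitrary MI threshold of 0.1 nats,
# and lexicographic tie-breaking between equally relevant terms).
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """MI in nats between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(docs, labels, vocab, k, redundancy_threshold=0.1):
    # Binary presence vector of each term over the corpus (docs are word sets).
    presence = {t: [int(t in d) for d in docs] for t in vocab}
    # Relevance of each term: MI between its presence and the class labels.
    relevance = {t: mutual_information(presence[t], labels) for t in vocab}
    kept = []
    for t in vocab:  # non-sequential: each term is judged against co-occurring
        dominated = False  # terms, not against an evolving selected set
        for u in vocab:
            if u == t:
                continue
            cooccurs = any(a and b for a, b in zip(presence[t], presence[u]))
            if (cooccurs
                    and mutual_information(presence[t], presence[u]) > redundancy_threshold
                    and (relevance[u], u) > (relevance[t], t)):
                dominated = True  # a co-occurring, more relevant near-duplicate exists
                break
        if not dominated:
            kept.append(t)
    # Keep the k most class-relevant non-redundant terms.
    return sorted(kept, key=relevance.get, reverse=True)[:k]
```

On a toy corpus where "ball" and "goal" always co-occur in one class, the sketch keeps only one of the pair: because the check runs over co-occurring terms rather than an already selected subset, the selection order does not matter, which is the non-sequential property the abstract highlights.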
Availability of supporting data
The data described in this article is publicly available at https://www.kaggle.com/datasets and https://starling.utdallas.edu/datasets/
Acknowledgements
Not Applicable.
Funding
Not Applicable.
Author information
Authors and Affiliations
Contributions
The authors, Farek Lazhar and Benaidja Amira, contributed equally to this work.
Corresponding author
Ethics declarations
Conflicts of interest
This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas.
Competing interests
The authors declare no conflicts of interest in preparing this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farek, L., Benaidja, A. A non-redundant feature selection method for text categorization based on term co-occurrence frequency and mutual information. Multimed Tools Appl 83, 20193–20214 (2024). https://doi.org/10.1007/s11042-023-15876-y