Abstract
Feature selection is a crucial preprocessing step for text categorization: it reduces the feature space, speeds up the learning process, and can improve the accuracy of classification algorithms. Filter-based methods built on document frequency, such as Information Gain, the Chi-Square test, and the Improved Gini Index, are the most widely used because they combine high performance with low time complexity, whereas information-theoretic methods perform well on low-dimensional data but scale poorly to high-dimensional data. The main weakness of statistical filter-based methods, however, is feature redundancy: because each feature's importance is assessed independently of the others, they often select many features that provide no additional information about the class variable, which increases training time and degrades classification performance. To retain the effectiveness of information-theoretic methods in handling redundancy while keeping time complexity low, we propose a new non-sequential selection method, named Co-occurrence-level Feature Selection and Redundancy Removal (CFSRR), which uses mutual information to evaluate each feature's importance with respect to its co-occurring features rather than with respect to already selected features, as classical information-theoretic methods do. The underlying idea is that features co-occurring in the same context are semantically correlated and therefore form candidate redundant features, which avoids feature-by-feature redundancy evaluation. Compared with ten effective feature selection metrics, empirical results show the efficiency of CFSRR in terms of micro-F1 and macro-F1 scores obtained with Naïve Bayes and SVM classifiers on five publicly available datasets, demonstrating its robustness on balanced, unbalanced, binary, and multi-class classification problems.
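To make the co-occurrence-based redundancy idea concrete, the following is a simplified, hypothetical sketch, not the authors' exact CFSRR procedure: each term is scored by its mutual information (MI) with the class, and a term is discarded when some co-occurring term is both strongly MI-correlated with it and more class-relevant. The MI threshold, the binary presence representation, and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of co-occurrence-level redundancy removal
# (assumptions: binary term presence, an arbitrary MI threshold of 0.1 nats,
# and lexicographic tie-breaking between equally relevant terms).
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """MI in nats between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(docs, labels, vocab, k, redundancy_threshold=0.1):
    # Binary presence vector of each term over the corpus (docs are word sets).
    presence = {t: [int(t in d) for d in docs] for t in vocab}
    # Relevance of each term: MI between its presence and the class labels.
    relevance = {t: mutual_information(presence[t], labels) for t in vocab}
    kept = []
    for t in vocab:  # non-sequential: each term is judged against co-occurring
        dominated = False  # terms, not against an evolving selected set
        for u in vocab:
            if u == t:
                continue
            cooccurs = any(a and b for a, b in zip(presence[t], presence[u]))
            if (cooccurs
                    and mutual_information(presence[t], presence[u]) > redundancy_threshold
                    and (relevance[u], u) > (relevance[t], t)):
                dominated = True  # a co-occurring, more relevant near-duplicate exists
                break
        if not dominated:
            kept.append(t)
    # Keep the k most class-relevant non-redundant terms.
    return sorted(kept, key=relevance.get, reverse=True)[:k]
```

On a toy corpus where "ball" and "goal" always co-occur in one class, the sketch keeps only one of the pair: because the check runs over co-occurring terms rather than an already selected subset, the selection order does not matter, which is the non-sequential property the abstract highlights.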
Availability of supporting data
The data described in this article is publicly available at https://www.kaggle.com/datasets and https://starling.utdallas.edu/datasets/
Acknowledgements
Not Applicable.
Funding
Not Applicable.
Author information
Authors and Affiliations
Contributions
The authors, Farek Lazhar and Benaidja Amira, contributed equally to this work.
Corresponding author
Ethics declarations
Conflicts of interest
This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas.
Competing interests
The authors declare no conflicts of interest in preparing this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farek, L., Benaidja, A. A non-redundant feature selection method for text categorization based on term co-occurrence frequency and mutual information. Multimed Tools Appl 83, 20193–20214 (2024). https://doi.org/10.1007/s11042-023-15876-y