Abstract
Text classification (TC) is the task of determining the class label of an unseen textual document. The vector representation of a document plays a crucial role in the efficiency of the classification process. Many text classification approaches use content-based features such as words to build the document vectors, and words with high discriminating power improve classification performance. Identifying such words among a very large vocabulary is therefore an essential step in text classification. This high-dimensionality problem is addressed with feature selection methods, and several such methods have been proposed in the literature based on how terms are distributed across the classes of a dataset. In this chapter, we develop an approach to TC that combines a feature selection algorithm (FSA) with term weight measures (TWMs). A new feature selection method is developed to remove redundant features and select relevant ones. The selected features are used to represent the documents as vectors, and the value of each term in the vector representation is computed with a TWM. We also develop a new TWM and compare its performance with that of several well-known TWMs. Six classification algorithms, namely support vector machine (SVM), decision tree (DT), Naïve Bayes (NB), k-nearest neighbour (KNN), logistic regression (LR), and random forest (RF), are used to build the classification models. The experiments are performed on six benchmark TC datasets. The results show that the proposed approach achieves the best accuracies on the six datasets compared with other works in the TC domain.
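The pipeline outlined in the abstract (select features, weight the selected terms, then classify the resulting vectors) can be illustrated with a minimal sketch. The abstract does not give the formulas of the proposed FSA or TWM, so this sketch substitutes well-known stand-ins: chi-square scoring for feature selection, TF-IDF for term weighting, and a 1-nearest-neighbour classifier with cosine similarity. The toy corpus and all function names are hypothetical and are not taken from the chapter.

```python
import math
from collections import Counter

# Hypothetical toy corpus; NOT one of the chapter's benchmark datasets.
TRAIN = [
    ("the team won the football match", "sport"),
    ("late goal decided the match", "sport"),
    ("the striker scored a goal for the team", "sport"),
    ("the bank raised interest rates", "finance"),
    ("markets fell as interest rates rose", "finance"),
    ("the bank reported strong market profits", "finance"),
]

def tokenize(text):
    return text.lower().split()

def chi_square_scores(docs, labels):
    """Stand-in FSA: score each term by its largest chi-square value over the classes."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    scores = {}
    for term in set().union(*doc_sets):
        best = 0.0
        for cls in set(labels):
            pairs = list(zip(doc_sets, labels))
            a = sum(1 for s, y in pairs if term in s and y == cls)      # term present, in class
            b = sum(1 for s, y in pairs if term in s and y != cls)      # term present, other class
            c = sum(1 for s, y in pairs if term not in s and y == cls)  # term absent, in class
            d = n - a - b - c                                           # term absent, other class
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def tfidf_vector(tokens, features, idf):
    """Stand-in TWM: TF-IDF weights over the selected features, L2-normalised."""
    tf = Counter(tokens)
    vec = [tf[f] * idf[f] for f in features]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]

features = select_features(chi_square_scores(docs, labels), k=6)
idf = {f: math.log(len(docs) / sum(1 for d in docs if f in d)) + 1.0 for f in features}
train_vectors = [tfidf_vector(d, features, idf) for d in docs]

def predict(text):
    """1-NN classification using cosine similarity between normalised vectors."""
    query = tfidf_vector(tokenize(text), features, idf)
    sims = [sum(q * v for q, v in zip(query, vec)) for vec in train_vectors]
    return labels[sims.index(max(sims))]

print(features)
print(predict("the team scored a goal"))
print(predict("interest rates at the bank"))
```

In this sketch the common word "the" scores low under chi-square and is discarded, while class-specific words such as "goal" and "bank" are kept. Swapping in a different scoring function or weighting scheme only requires replacing `chi_square_scores` or `tfidf_vector`, which mirrors how the chapter compares alternative TWMs under a fixed pipeline.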
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Palacharla, R.K., Vatsavayi, V.K. (2024). A Novel Approach for Text Classification Using Feature Selection Algorithm and Term Weight Measures. In: Lin, F.M., Patel, A., Kesswani, N., Sambana, B. (eds) Accelerating Discoveries in Data Science and Artificial Intelligence I. ICDSAI 2023. Springer Proceedings in Mathematics & Statistics, vol 421. Springer, Cham. https://doi.org/10.1007/978-3-031-51167-7_28
DOI: https://doi.org/10.1007/978-3-031-51167-7_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51166-0
Online ISBN: 978-3-031-51167-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)