Abstract
Extracting domain keywords from a corpus helps improve document classification. Specialized vocabularies built only from semantically similar domain keywords are inadequate for capturing the concepts of a specific domain. The proposed approach shows that a domain-specific vocabulary formed from the semantically significant, frequently occurring words of a specialized corpus classifies documents more effectively than traditional classifiers. In addition, a novel weight-computation method assigns each word a rationalized weight that reflects its applicability to a specific domain. The high classification accuracy obtained indicates the importance of combining term-document frequency with semantics in word representation, together with rationalized weights, for classification.
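The general idea of a weighted domain-specific vocabulary can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give the rationalized-weight formula, so the sketch assumes a plain TF-IDF-style score computed across domains, and it omits the semantic (word-embedding) step entirely. The function names `tokenize`, `build_domain_vocab`, and `classify`, the `top_k` parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; tokens containing digits or
    # punctuation are dropped (a deliberately crude assumption).
    return [w for w in text.lower().split() if w.isalpha()]


def build_domain_vocab(docs_by_domain, top_k=8):
    """Return {domain: {term: weight}} from a TF-IDF-style score.

    ASSUMPTION: score = tf * log(N_domains / domains_containing_term).
    This stands in for the paper's rationalized weights, which the
    abstract does not define.
    """
    domain_counts = {
        d: Counter(t for doc in docs for t in tokenize(doc))
        for d, docs in docs_by_domain.items()
    }
    n_domains = len(domain_counts)
    # Number of domains in which each term occurs at least once.
    domains_with_term = Counter(t for c in domain_counts.values() for t in c)

    vocab = {}
    for d, counts in domain_counts.items():
        scores = {
            t: tf * math.log(n_domains / domains_with_term[t])
            for t, tf in counts.items()
        }
        # Keep the top_k terms with a positive score; terms shared by
        # every domain score zero and are excluded.
        top = sorted(
            ((t, s) for t, s in scores.items() if s > 0),
            key=lambda kv: (-kv[1], kv[0]),
        )[:top_k]
        total = sum(s for _, s in top)
        # Normalise so each domain's weights sum to 1 (illustrative choice).
        vocab[d] = {t: s / total for t, s in top} if total > 0 else {}
    return vocab


def classify(text, vocab):
    # Score each domain by summing the weights of its vocabulary terms
    # that occur in the text; ties resolve to the first domain.
    tokens = tokenize(text)
    scores = {d: sum(w.get(t, 0.0) for t in tokens) for d, w in vocab.items()}
    return max(scores, key=scores.get)


# Toy corpus, invented for illustration only.
corpus = {
    "sport": ["the team won the match", "the striker scored a goal in the match"],
    "business": ["the bank raised interest rates", "shares of the bank fell as rates rose"],
}
vocab = build_domain_vocab(corpus)
print(classify("the striker won the match", vocab))  # prints: sport
```

In this sketch a word that appears in every domain, such as "the", receives a zero score and never enters a vocabulary, which is one simple way to keep the vocabulary domain-specific. A fuller treatment along the lines the abstract describes would also filter or re-weight terms by semantic significance.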
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-022-00889-x/MediaObjects/41870_2022_889_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-022-00889-x/MediaObjects/41870_2022_889_Fig2_HTML.png)
About this article
Cite this article
Kalra, V., Kashyap, I. & Kaur, H. Improving document classification using domain-specific vocabulary: hybridization of deep learning approach with TFIDF. Int. j. inf. tecnol. 14, 2451–2457 (2022). https://doi.org/10.1007/s41870-022-00889-x
DOI: https://doi.org/10.1007/s41870-022-00889-x