Improving document classification using domain-specific vocabulary: hybridization of deep learning approach with TFIDF

Kalra, Vandana; Kashyap, Indu; Kaur, Harmeet

doi:10.1007/s41870-022-00889-x

Original Research
Published: 02 March 2022

Volume 14, pages 2451–2457, (2022)

International Journal of Information Technology

323 Accesses
5 Citations

Abstract

Extracting domain keywords from a corpus helps optimize document classification. Specialized vocabularies built only from semantically similar domain keywords are inadequate for capturing the concepts of a specific domain. The proposed approach demonstrates that a domain-specific vocabulary formed from semantically significant, frequently occurring words of the specialized corpus outperforms traditional classifiers. In addition, a novel weight-computation methodology assigns each word a rationalized weight that reflects its applicability to a specific domain. The high classification accuracy obtained indicates the importance of combining term-document frequency with semantics in word representation, using rationalized weights, for classification.
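
As a rough illustration of the term-document frequency idea mentioned in the abstract, the following minimal sketch computes plain TF-IDF weights and picks the highest-weighted terms of a document as a candidate vocabulary. It is not the authors' rationalized weighting scheme, which the abstract does not specify; the function names `tfidf_weights` and `top_terms` and the toy corpus are hypothetical, and the semantic (word-embedding) component is omitted.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Uses the textbook form tf * log(N / df); this is an assumption,
    not the weighting proposed in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: one tokenized document per domain.
docs = [
    "the match ended after the striker scored late".split(),
    "the bank raised interest rates after the inflation report".split(),
]
weights = tfidf_weights(docs)
sports_vocab = top_terms(weights[0], 3)
```

Terms that occur in every document, such as "the" and "after" in this toy corpus, receive a weight of zero, so only terms specific to one document can enter its candidate vocabulary.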

Author information

Authors and Affiliations

Manav Rachna International Institute of Research Studies, Faridabad, Haryana, India
Vandana Kalra & Indu Kashyap
Hansraj College, University of Delhi, New Delhi, India
Harmeet Kaur

Corresponding author

Correspondence to Harmeet Kaur.

About this article

Cite this article

Kalra, V., Kashyap, I. & Kaur, H. Improving document classification using domain-specific vocabulary: hybridization of deep learning approach with TFIDF. Int. j. inf. tecnol. 14, 2451–2457 (2022). https://doi.org/10.1007/s41870-022-00889-x

Received: 10 November 2021
Accepted: 01 February 2022
Published: 02 March 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s41870-022-00889-x
