Improving document classification using domain-specific vocabulary: hybridization of deep learning approach with TFIDF

  • Original Research
  • International Journal of Information Technology

Abstract

Extracting domain keywords from a corpus helps optimize document classification. Specialized vocabularies built only from semantically similar domain keywords are inadequate for understanding the concepts of a specific domain. The proposed paradigm demonstrates that a domain-specific vocabulary formed from the semantically significant, frequently occurring words of a specialized corpus outperforms traditional classifiers. In addition, a novel weight-computation methodology assigns each word a rationalized weight that indicates how strongly the word applies to a specific domain. The high classification accuracy obtained shows the importance of combining term-document frequency with semantics in word representation, together with rationalized weights, for classification.
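The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of a weighted domain vocabulary, assuming a standard TF-IDF-style score computed per domain as a stand-in for the paper's rationalized weights. The semantic (word-embedding) step is omitted, and the names `domain_vocabulary`, `classify`, `STOP_WORDS`, and the toy corpus are hypothetical illustrations, not the authors' implementation.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use larger lists).
STOP_WORDS = {"the", "a", "in", "as"}

def tokenize(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def domain_vocabulary(corpus_by_domain, top_k=3):
    """Return {domain: {term: weight}} using a TF-IDF-style score.

    TF is the term's relative frequency within one domain's documents;
    IDF down-weights terms that occur in many domains.
    """
    domain_counts = {
        domain: Counter(t for doc in docs for t in tokenize(doc))
        for domain, docs in corpus_by_domain.items()
    }
    n_domains = len(domain_counts)
    # Number of domains in which each term appears.
    domain_freq = Counter(t for counts in domain_counts.values() for t in counts)

    vocab = {}
    for domain, counts in domain_counts.items():
        total = sum(counts.values())
        scores = {
            t: (c / total) * math.log(n_domains / domain_freq[t])
            for t, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vocab[domain] = {t: w for t, w in top if w > 0}
    return vocab

def classify(text, vocab):
    # Score each domain by the summed weights of its vocabulary terms in the text.
    tokens = tokenize(text)
    scores = {
        domain: sum(weights.get(t, 0.0) for t in tokens)
        for domain, weights in vocab.items()
    }
    return max(scores, key=scores.get)

corpus = {
    "sport": ["the team won the match", "the striker scored a goal in the match"],
    "business": ["the bank raised interest rates", "shares fell as the bank reported losses"],
}
vocab = domain_vocabulary(corpus)
label = classify("the match ended with a late goal", vocab)
```

In this sketch a document is assigned to the domain whose vocabulary terms contribute the largest summed weight. A hybrid approach of the kind the abstract describes would additionally filter or reweight candidate terms by embedding similarity, which is not shown here.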

Author information

Corresponding author

Correspondence to Harmeet Kaur.

About this article

Cite this article

Kalra, V., Kashyap, I. & Kaur, H. Improving document classification using domain-specific vocabulary: hybridization of deep learning approach with TFIDF. Int. j. inf. tecnol. 14, 2451–2457 (2022). https://doi.org/10.1007/s41870-022-00889-x

  • DOI: https://doi.org/10.1007/s41870-022-00889-x
