Advanced Machine Learning Techniques in Natural Language Processing for Indian Languages

Gupta, Vaishali; Joshi, Nisheeth; Mathur, Iti

doi:10.1007/978-3-030-03131-2_7


Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 374)

Abstract

This paper presents advanced NLP learning resources for two Indian languages, Hindi and Urdu. The work is based on domain-specific corpora covering health, tourism, and agriculture, with 60 k sentences. From these corpora, several NLP learning resources were developed: a stemmer, a lemmatizer, a POS tagger, and an MWE identifier. These resources are connected sequentially and are useful in information retrieval, language translation, word sense disambiguation, and many other applications. Stemming is the first step, extracting the root from a given input word, but it does not always produce a valid root word. This problem was addressed by developing a lemmatizer, which produces the exact root by applying rules to the stemmed output. A statistical POS tagger was then designed using the Indian Government (TDIL) tagset (Indian Govt. Tagset, [1]). The MWE identifier was developed from the resulting POS-tagged file: rules were created for an MWE tagset, an MWE-tagged file was produced, and MWEs were then extracted automatically from the tagged corpora using the CRF++ tool. Finally, the resources were evaluated for accuracy: the stemmer, lemmatizer, POS tagger, and MWE identifier achieved 77.0, 86.8, 73.20, and 43.50% for Hindi and 74.0, 85.4, 84.97, and 47.2% for Urdu, respectively.
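
The stemmer-to-lemmatizer step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix list, the repair rules, and the function names `stem` and `lemmatize` are hypothetical and chosen only to show how a rule applied to stemmed output can restore a valid root.

```python
# Hypothetical suffix list for illustration, longest suffixes first
# so that the longest match is stripped.
SUFFIXES = ["ियाँ", "ियों", "ों", "ें"]

# Hypothetical repair rules: stripped suffix -> ending to restore.
REPAIR = {"ियाँ": "ी", "ियों": "ी"}


def stem(word):
    """Strip the first matching suffix; return (stem, stripped_suffix)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


def lemmatize(word):
    """Stem the word, then apply a repair rule to the stemmed output."""
    root, suffix = stem(word)
    return root + REPAIR.get(suffix, "")


if __name__ == "__main__":
    for w in ["कहानियाँ", "किताबें", "घर"]:
        print(w, stem(w)[0], lemmatize(w))
```

In this toy example the stemmer alone reduces "कहानियाँ" (stories) to "कहान", which is not a valid word, while the repair rule restores the root "कहानी". A real system would need a far larger, linguistically validated rule set.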

References

Indian Govt. Tagset http://www.tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
Mishra, U., Prakash, C.: MAULIK: an effective stemmer for Hindi language. Int. J. Comput. Sci. Eng. (IJCSE) 4(5), 711–717 (2012)
Ramanathan, A., Rao, D.D.: A lightweight stemmer for Hindi. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages Workshop, Budapest, pp. 42–48 (2003)
Khan, S.A., Anwar, W., Bajwa, U.J., Wang, X.: A light weight stemmer for Urdu language. A scarce resourced language. In: 3rd Workshop on South and Southeast Asian NLP, pp. 69–78 (2012)
Ali, M., et al.: A rule based stemming method for multilingual Urdu text. Int. J. Comput. Appl. 134(8), 10–18 (2016)
Jabbar, A., Iqbal, S., Khan, M.U.G.: Analysis and development of resources for Urdu text stemming. Language and Technology, 1 (2016)
Chakrabarty, A., Choudhury, S.R., Garain, U.: An unsupervised lemmatizer for Indian languages. In: Proceedings of Forum for Information Retrieval and Evaluation (FIRE 2014) (2014)
Paul, S., Joshi, N., Mathur, I.: Development of a Hindi lemmatizer. Proceeding Int. J. Comput. Linguist. Nat. Lang. Process. 2(5), 380–384 (2013). ISSN 2279 0756
Plisson, J., Lavrac, N., Mladenić, D.: A rule based approach to word lemmatization (2004)
Dave, R., Balani, P.: Survey paper of different lemmatization approaches. In Proceedings of International Journal of Research in Advent Technology (E-ISSN: 2321-9637), 08 March 2015 (2015)
Joshi, N., Darbari, H., Mathur, I.: HMM Based POS tagger for Hindi. In: Proceedings of AISC, pp. 341–349. CS & IT-CSCP (2013)
Shrivastava, M., Bhattacharyya, P.: Hindi POS tagger using naive stemming: harnessing morphological information without extensive linguistic knowledge, Pune, India (2008)
Khanam, M.H., Madhumurthy, K.V., Khudhus, M.A.: Part-of-Speech tagging for Urdu in scarce resource: mix maximum entropy modelling system. Proc. Int. J. Adv. Res. Comput. Commun. Eng. 2(9) (2013)
Patra, B.G., Debbarma, K., Das, D., Bandyopadhyay, S.: Part of Speech (POS) tagger for Kokborok. In: Proceedings of COLING 2012: Posters, pp. 923–932 (2012)
Gadde, P., Yeleti, M.V.: Improving statistical POS tagging using linguistic feature for Hindi and Telugu. In: Proceedings of ICON-2008 (2008)
Anwar, W., Wang, X., Li, L., Wang, X.: Hidden Markov model based part of speech tagger for Urdu. Inf. Technol. J. 6(8), 1190–1198 (2007)
Kunchukuttan, A., Damani, O.P.: A system for compound noun multiword expression extraction for Hindi (2008)
Mahesh, R., Sinha, K.: Stepwise mining of multi-word expressions in Hindi. In: Proceedings of the Workshop on Multiword Expression: from Parsing and Generation to the Real World, Portland, Oregon, USA, pp. 110–115 (2011)
Eryigit, G., Adali, K., Torunoglu-Selamet, D., Sulubacak, U., Pamay, T.: Annotation and extraction of multiword expressions in Turkish treebanks. In: MWE@NAACL-HLT, pp. 70–76 (2005)
Green, S., De Marneffe, M.-C., Bauer, J., Manning, C.D.: Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 725–735. Association for Computational Linguistics (2011)
Hawwari, A., Attia, M., Diab, M.: A framework for the classification and annotation of multiword expressions in dialectal arabic. In: ANLP 2014, p. 48 (2014)
Tutin, A., Esperança-Rodier, E., Iborra, M., Reverdy, J.: Annotation of multiword expressions in French. In: European Society of Phraseology Conference (EUROPHRAS 2015), pp. 60–67 (2015)
Singh, D., Bhingardive, S., Patel, K., Bhattacharyya, P.: Detection of multiword expressions for Hindi language using word embeddings and WordNet-based features. In: 12th International Conference on Natural Language Processing, p. 291 (2015)
Lovins, J.B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11(1), 22–31 (1968)
Gupta, V., Joshi, N., Mathur, I.: POS tagger for Urdu using Stochastic approaches. In: Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, p. 56. ACM, New York (2016)
Gupta, V., Joshi, N., Mathur, I.: CRF based Part of Speech tagger for domain specific Hindi corpus. Published in Int. J. Comput. Appl. (IJCA-2016) (2016). ISBN: 0975-8887
Dalal, A., Nagaraj, K., Swant, U., Shelke, S., Bhattacharyya, P.: Building feature rich pos tagger for morphologically rich languages: experience in Hindi. In: Proceedings of International Conference on NLP (ICON-2007) (2007)
Dhanalakshmi, V., Anand Kumar, M., Rajendran, S., Soman, K.P.: POS tagger and chunker for Tamil language. In: Proceedings of International Conference, Morphological Tagger, Koeln, Germany (2009)
Saharia, N., Das, D., Sharma, U., Kalita, J.: Part of Speech tagger for Assamese text. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 33–36. Association for Computational Linguistics (2009)
Dandapat, S., Sarkar, S., Basu, A.: Automatic Part-of-Speech tagging for Bengali: an approach for morphologically rich languages in a poor resource scenario. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics (2007)
Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of 18th International Conference on Machine Learning (ICML01), Williamstown, MA, USA, pp. 282–289 (2001)
Baldwin, T., Kim, S.N.: Multiword expressions. Handbook of Natural Language Processing, vol. 2, pp. 267–292. CRC Press, Boca Raton (2010)
Nandi, M., Ramasree, R.J.: Rule based extraction of multi-word expressions for elementary sanskrit texts. Proc. Int. J. Adv. Res. Comput. Sci. Softw. Engg. 3(11), 661–667 (2013)
Kulkarni, N., Finlayson, M.A.: jMWE: a java toolkit for detecting multi-word expressions. In: Proceedings of the Workshop on Multi-word Expressions: from Parsing and Generation to the Real World (MWE 2011), Portland, Oregon, USA, pp. 122–124 (2011)
Chakraborty, T.: Identifying Bengali Multiword expressions using semantic clustering. In: Proceedings of International Journal of Linguistics and Language Resources, John Benjamins publishing company, ISSN 0378-4169 (2014)
Gayen, V., Sarkar, K.: Machine Learning approach for the identification of Bengali Noun+Noun Compound Multiword Expressions. In: Proceedings of ICON-2013: 10th International Conference on Natural Language Processing (2013)

Author information

Authors and Affiliations

Banasthali Vidyapith, Rajasthan, India
Vaishali Gupta, Nisheeth Joshi & Iti Mathur

Corresponding author

Correspondence to Vaishali Gupta.

Editor information

Editors and Affiliations

School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India
Manoj Kumar Mishra
School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India
Bhabani Shankar Prasad Mishra
Department of Computer Science and Engineering, IIT Patna, Patna, Bihar, India
Yashwant Singh Patel
Department of Computer Science and Engineering, IIT Patna, Patna, Bihar, India
Rajiv Misra

About this chapter

Cite this chapter

Gupta, V., Joshi, N., Mathur, I. (2019). Advanced Machine Learning Techniques in Natural Language Processing for Indian Languages. In: Mishra, M., Mishra, B., Patel, Y., Misra, R. (eds) Smart Techniques for a Smarter Planet. Studies in Fuzziness and Soft Computing, vol 374. Springer, Cham. https://doi.org/10.1007/978-3-030-03131-2_7

DOI: https://doi.org/10.1007/978-3-030-03131-2_7
Published: 30 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03130-5
Online ISBN: 978-3-030-03131-2
eBook Packages: Intelligent Technologies and Robotics (R0)
