Cognate Identification to Augment Lexical Resources for NLP

Kumar, Shantanu; Vaidya, Ashwini; Agarwal, Sumeet

doi:10.1007/978-981-99-3966-4_10

Abstract

Cognates are words in different languages that are known to share a common ancestral origin. For example, the English word night and the German Nacht, both meaning "night", are cognates with a common ancestral (Proto-Germanic) origin. Cognates are not always recognizably similar: their forms can change so substantially over time that they no longer resemble each other. Automatic cognate identification determines whether a given word pair is cognate or not. A cognate pair may have diverged at the surface level over time, but it shares a common ancestor and is likely to have similar meanings. This is especially true of languages that are typologically close to each other. Our system uses a character-level model with a recurrent neural network architecture and attention. We test its performance on datasets drawn from three different language families. Our results show an improvement in performance over existing models and highlight the usefulness of phonetic and conceptual features. Our model finds similar word pairs with high accuracy in a pair of closely related languages (Hindi and Marathi). One application of our work is to project linguistic annotations from a high-resource language to a typologically related low-resource language. This projection can be used to bootstrap the creation of lexical resources, e.g., predicate frame information and word sense annotations. Bootstrapping lexical resources is particularly relevant for the languages of South Asia, which are diverse but share areal and typological properties. Apart from this application, cognate identification helps improve the performance of tasks such as sentence alignment for machine translation.
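To make the notion of surface-level divergence concrete, the sketch below scores the night/Nacht pair from the abstract with a normalized edit-distance similarity. This is a simple string baseline chosen for illustration, not the character-level recurrent model with attention that the chapter describes; the function names `edit_distance` and `surface_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer word's length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# The example pair from the abstract: two substitutions (i -> a, g -> c).
print(edit_distance("night", "nacht"))       # 2
print(surface_similarity("night", "Nacht"))  # 0.6
```

A score like this captures only orthographic overlap, so cognate pairs whose forms have drifted far apart would score low under it. That limitation is the motivation for learned models over purely surface-based measures.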

At the time of writing this paper, Shantanu Kumar was a student at Indian Institute of Technology, Delhi. He is now working at Sizmek, USA.

Notes

1. http://ielex.mpi.nl/.
2. Precision and recall are computed on positive labels at a threshold of 0.5. \(\text{Precision} = \text{TP}/(\text{TP} + \text{FP})\), \(\text{Recall} = \text{TP}/(\text{TP} + \text{FN})\), where TP, FP, and FN denote true positives, false positives, and false negatives.
3. The F-scores reported here for the CNN models differ from those in the original paper (Rama, 2016). This is because we report the F-score with respect to the positive labels only, whereas the original paper reported the average of the F-scores for the positive and negative labels (as observed in the implementation in the author's code).
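The two reporting conventions contrasted in Notes 2 and 3 can be sketched as follows. The gold labels and scores are hypothetical toy values rather than results from the chapter, the helper name `prf` is an assumption, and treating a score of exactly 0.5 as a positive prediction is also an assumption, since the notes do not say how ties at the threshold are handled.

```python
from typing import Sequence, Tuple


def prf(gold: Sequence[int], pred: Sequence[int], target: int) -> Tuple[float, float, float]:
    """Precision, recall and F-score with `target` treated as the class of interest."""
    tp = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    fp = sum(1 for g, p in zip(gold, pred) if g != target and p == target)
    fn = sum(1 for g, p in zip(gold, pred) if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = cognate pair, 0 = non-cognate pair.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]

# Assumption: scores at or above the 0.5 threshold count as positive.
pred = [1 if s >= 0.5 else 0 for s in scores]

# Convention of Note 2: metrics on the positive labels only.
p_pos, r_pos, f_pos = prf(gold, pred, target=1)

# Convention described in Note 3: average of positive and negative F-scores.
_, _, f_neg = prf(gold, pred, target=0)
f_avg = (f_pos + f_neg) / 2

print(f"positive-only: P={p_pos:.3f} R={r_pos:.3f} F={f_pos:.3f}")
print(f"averaged F over both labels: {f_avg:.3f}")
```

On this toy data the positive-only F-score is about 0.667, while the averaged F-score is about 0.733. The gap arises because the more numerous negative pairs are easier to classify correctly, which suggests why figures reported under the two conventions are not directly comparable.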

References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Bhat, R., Jain, N., Vaidya, A., Palmer, M., Khan, T., Sharma, D., & Babani, J. (2014). Adapting predicate frames for Urdu PropBanking. In Proceedings of the EMNLP 2014 Workshop On Language Technology For Closely Related Languages and Language Variants (pp. 47–55). http://www.aclweb.org/anthology/W14-4206
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on EMNLP. ACL.
Brown, C., Holman, E., Wichmann, S., & Villupillai, V. (2008). Automated classification of the world’s languages: A description of the method and preliminary results. Language Typology and Universals.
Greenhill, S., Blust, R., & Gray, R. (2008). The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4, 271–283.
Hauer, B., & Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In IJCNLP (pp. 865–873). Citeseer.
Inkpen, D., Frunza, O., & Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2005) (pp. 251–257).
Kondrak, G., & Dorr, B. (2004). Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the 20th International Conference on Computational Linguistics (p. 952). Association for Computational Linguistics.
Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (Short papers, Vol. 2, NAACL-Short’03, pp. 46–48). ACL. https://doi.org/10.3115/1073483.1073499
List, J. M., Lopez, P., & Bapteste, E. (2016). Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the ACL 2016 (Vol. 2: Short Papers, pp. 599–605). Berlin. http://anthology.aclweb.org/P16-2097
Luong, M., Pham, H., & Manning, C. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412–1421). Lisbon.
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech (Vol. 2, p. 3).
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). http://www.aclweb.org/anthology/D14-1162
Rama, T. (2015). Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of NAACL: Human Language Technologies (pp. 1227–1231).
Rama, T. (2016). Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016 (pp. 1018–1027).
Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kocisky, T., & Blunsom, P. (2016). Reasoning about entailment with neural attention. In ICLR.
Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of CASCON (pp. 1071–1082). IBM Press.
Singh, A. K., & Surana, H. (2007). Study of cognates among south Asian languages for the purpose of building lexical resources. Journal of Language Technology.
Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., & Nie, J. Y. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 553–562). ACM.
Wichmann, S., & Holman, E. (2008). Languages with longer words have more lexical change. Approaches To Measuring Linguistic Differences.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference On Machine Learning (pp. 2048–2057).
Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 21–29).

Author information

Authors and Affiliations

IIT Delhi, New Delhi, India
Shantanu Kumar, Ashwini Vaidya & Sumeet Agarwal

Corresponding author

Correspondence to Ashwini Vaidya.

Editor information

Editors and Affiliations

Department of Humanities and Social Sciences, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Sumitava Mukherjee
School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, Himachal Pradesh, India
Varun Dutt
Department of Cognitive Science, Indian Institute of Technology Kanpur, Kanpur, Uttar Pradesh, India
Narayanan Srinivasan

About this chapter

Cite this chapter

Kumar, S., Vaidya, A., Agarwal, S. (2023). Cognate Identification to Augment Lexical Resources for NLP. In: Mukherjee, S., Dutt, V., Srinivasan, N. (eds) Applied Cognitive Science and Technology. Springer, Singapore. https://doi.org/10.1007/978-981-99-3966-4_10

DOI: https://doi.org/10.1007/978-981-99-3966-4_10
Published: 24 August 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-3965-7
Online ISBN: 978-981-99-3966-4
eBook Packages: Behavioral Science and Psychology, Behavioral Science and Psychology (R0)
