Abstract
Cognates are words in different languages that share a common ancestral origin. For example, the English word night and the German Nacht, both meaning "night", are cognates with a common (Proto-Germanic) ancestor. Cognates are not always recognizably similar: they can change so much over time that they no longer resemble each other in form. Automatic cognate identification determines whether a given word pair is cognate or not. A cognate pair may have diverged at the surface level over time, but the two words share a common ancestor and are likely to have similar meanings. This is especially true for languages that are typologically close to each other. Our system uses a character-level model with a recurrent neural network architecture and attention. We test its performance on datasets drawn from three different language families. Our results show an improvement in performance over existing models and highlight the usefulness of phonetic and conceptual features. Our model finds similar word pairs with high accuracy in a pair of closely related languages (Hindi and Marathi). One application of our work is to project linguistic annotations from a high-resource language to a typologically related low-resource language. This projection can be used to bootstrap the creation of lexical resources, e.g., predicate frame information and word sense annotations. Bootstrapping lexical resources is particularly relevant for the languages of South Asia, which are diverse but share areal and typological properties. Apart from this application, cognate identification helps improve the performance of tasks such as sentence alignment for machine translation.
At the time of writing this paper, Shantanu Kumar was a student at the Indian Institute of Technology, Delhi. He is now working at Sizmek, USA.
Notes
- 1.
- 2. Precision and recall are computed on the positive labels at a threshold of 0.5. Precision \(=\) TP/(TP \(+\) FP) and Recall \(=\) TP/(TP \(+\) FN), where TP, FP, and FN denote true positives, false positives, and false negatives.
- 3. The F-scores reported here for the CNN models differ from those in the original paper (Rama, 2016). We report the F-score with respect to the positive labels only, whereas the original paper reported the average of the F-scores for the positive and negative labels (as observed in the author's code).
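The difference between the two reporting conventions described in the notes can be illustrated with a minimal sketch. It assumes that the F-score is the F1 measure (the harmonic mean of precision and recall) and that a score at or above the 0.5 threshold counts as a positive prediction; neither detail is stated in the notes. The function names `prf` and `evaluate` and the toy labels and scores are hypothetical and are not taken from the chapter or from the original code.

```python
from typing import Dict, Sequence, Tuple


def prf(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 with `label` treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumption: "F-score" means F1, the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Binarise scores at `threshold`, then report both F-score conventions."""
    # Assumption: a score equal to the threshold is predicted as cognate (1).
    y_pred = [1 if s >= threshold else 0 for s in scores]
    precision, recall, f_pos = prf(y_true, y_pred, label=1)
    _, _, f_neg = prf(y_true, y_pred, label=0)
    return {
        "precision": precision,          # computed on positive labels
        "recall": recall,                # computed on positive labels
        "f_positive": f_pos,             # convention of Note 3: positive labels only
        "f_average": (f_pos + f_neg) / 2,  # average over positive and negative labels
    }


# Hypothetical toy data: 1 = cognate pair, 0 = non-cognate pair.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
for name, value in evaluate(toy_labels, toy_scores).items():
    print(f"{name}: {value:.3f}")
```

On this toy data the positive-only F-score is lower than the averaged one, because non-cognate pairs are the majority class and are easier to classify correctly. Gaps of this kind are what one would expect when the same model is scored under the two conventions.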
References
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Bhat, R., Jain, N., Vaidya, A., Palmer, M., Khan, T., Sharma, D., & Babani, J. (2014). Adapting predicate frames for Urdu PropBanking. In Proceedings of the EMNLP 2014 Workshop On Language Technology For Closely Related Languages and Language Variants (pp. 47–55). http://www.aclweb.org/anthology/W14-4206
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on EMNLP. ACL.
Brown, C., Holman, E., Wichmann, S., & Velupillai, V. (2008). Automated classification of the world’s languages: A description of the method and preliminary results. Language Typology and Universals.
Greenhill, S., Blust, R., & Gray, R. (2008). The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4, 271–283.
Hauer, B., & Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In IJCNLP (pp. 865–873). Citeseer.
Inkpen, D., Frunza, O., & Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2005) (pp. 251–257).
Kondrak, G., & Dorr, B. (2004). Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the 20th International Conference on Computational Linguistics (p. 952). Association for Computational Linguistics.
Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (Short papers, Vol. 2, NAACL-Short’03, pp. 46–48). ACL. https://doi.org/10.3115/1073483.1073499
List, J. M., Lopez, P., & Bapteste, E. (2016). Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the ACL 2016 (Vol. 2: Short Papers, pp. 599–605). Berlin. http://anthology.aclweb.org/P16-2097
Luong, M., Pham, H., & Manning, C. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412–1421). Lisbon.
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech (Vol. 2, p. 3).
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). http://www.aclweb.org/anthology/D14-1162
Rama, T. (2015). Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of NAACL: Human Language Technologies (pp. 1227–1231).
Rama, T. (2016). Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016 (pp. 1018–1027).
Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kocisky, T., & Blunsom, P. (2016). Reasoning about entailment with neural attention. In ICLR.
Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of CASCON (pp. 1071–1082). IBM Press.
Singh, A. K., & Surana, H. (2007). Study of cognates among South Asian languages for the purpose of building lexical resources. Journal of Language Technology.
Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., & Nie, J. Y. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 553–562). ACM.
Wichmann, S., & Holman, E. (2008). Languages with longer words have more lexical change. Approaches To Measuring Linguistic Differences.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (pp. 2048–2057).
Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 21–29).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Kumar, S., Vaidya, A., Agarwal, S. (2023). Cognate Identification to Augment Lexical Resources for NLP. In: Mukherjee, S., Dutt, V., Srinivasan, N. (eds) Applied Cognitive Science and Technology. Springer, Singapore. https://doi.org/10.1007/978-981-99-3966-4_10
DOI: https://doi.org/10.1007/978-981-99-3966-4_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-3965-7
Online ISBN: 978-981-99-3966-4
eBook Packages: Behavioral Science and Psychology (R0)