Cognate Identification to Augment Lexical Resources for NLP

Applied Cognitive Science and Technology

Abstract

Cognates are words in different languages that share a common ancestral origin. For example, the English word night and the German Nacht, both meaning "night", are cognates with a common Proto-Germanic origin. Cognates are not always recognizably similar: their forms can change substantially over time, to the point where they no longer resemble each other. Automatic cognate identification determines whether a given word pair is cognate or not. A cognate pair may have diverged at the surface level over time, but it shares a common ancestor and is likely to have similar meanings. This is especially true in languages that are typologically closer to each other. Our system uses a character-level model with a recurrent neural network architecture and attention. We test its performance on datasets drawn from three different language families. Our results show an improvement in performance compared to existing models and highlight the usefulness of phonetic and conceptual features. Our model finds similar word pairs with high accuracy in a pair of closely related languages (Hindi and Marathi). One application of our work is to project linguistic annotations from a high-resource language to a typologically related low-resource language. This projection can be used to bootstrap the creation of lexical resources such as predicate frame information and word sense annotations. The bootstrapping of lexical resources is particularly relevant for languages in South Asia, which are diverse but share areal and typological properties. Apart from this application, cognate identification helps improve the performance of tasks such as sentence alignment for machine translation.
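To make the notion of surface-form similarity concrete, the following minimal sketch scores a word pair by normalized edit distance. It is an illustrative baseline only and is not the character-level recurrent model described in this chapter. The function names `levenshtein` and `form_similarity` are hypothetical, and the pair wheel/chakra (a romanized form) is an assumed example of cognates whose forms have diverged.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def form_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer word's length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# night/Nacht differ by two substitutions over five characters.
print(round(form_similarity("night", "Nacht"), 2))
# A diverged pair scores much lower despite the shared ancestry.
print(round(form_similarity("wheel", "chakra"), 2))
```

A score like this captures only form similarity, which is why pairs that have diverged at the surface level are hard for purely string-based methods.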

At the time of writing this paper, Shantanu Kumar was a student at Indian Institute of Technology, Delhi. He is now working at Sizmek, USA.


Notes

  1. http://ielex.mpi.nl/.

  2. Precision and recall are computed on the positive labels at a threshold of 0.5. Precision = TP/(TP + FP), Recall = TP/(TP + FN), where TP: True Positives, FP: False Positives, FN: False Negatives.

  3. The F-scores reported here for the CNN models differ from those in the original paper, Rama (2016). This is because we report the F-score with respect to the positive labels only, whereas the original paper reported the average of the F-scores for the positive and negative labels (as observed from the implementation in the author's code).
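The difference between the two reporting conventions in Notes 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the scores and gold labels are invented toy values, the helper names `binarize` and `prf_for_label` are hypothetical, and a score equal to the threshold is assumed to count as positive.

```python
from typing import Sequence, Tuple


def binarize(scores: Sequence[float], threshold: float = 0.5) -> list:
    """Turn scores into 0/1 predictions (assumption: score >= threshold is positive)."""
    return [1 if s >= threshold else 0 for s in scores]


def prf_for_label(y_true: Sequence[int], y_pred: Sequence[int],
                  label: int) -> Tuple[float, float, float]:
    """Precision, recall and F-score, treating `label` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Toy data (invented for illustration): 1 = cognate, 0 = non-cognate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
y_pred = binarize(scores)

_, _, pos_f = prf_for_label(y_true, y_pred, label=1)  # positive labels only
_, _, neg_f = prf_for_label(y_true, y_pred, label=0)
avg_f = (pos_f + neg_f) / 2                           # average over both labels

print(f"positive-label F-score: {pos_f:.3f}")
print(f"averaged F-score:       {avg_f:.3f}")
```

When non-cognate pairs are the majority and are mostly classified correctly, the averaged figure tends to be higher than the positive-label figure, so the two conventions are not directly comparable.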

References

  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.

  • Bhat, R., Jain, N., Vaidya, A., Palmer, M., Khan, T., Sharma, D., & Babani, J. (2014). Adapting predicate frames for Urdu PropBanking. In Proceedings of the EMNLP 2014 Workshop On Language Technology For Closely Related Languages and Language Variants (pp. 47–55). http://www.aclweb.org/anthology/W14-4206

  • Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on EMNLP. ACL.

  • Brown, C., Holman, E., Wichmann, S., & Velupillai, V. (2008). Automated classification of the world’s languages: A description of the method and preliminary results. Language Typology and Universals.

  • Greenhill, S., Blust, R., & Gray, R. (2008). The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4, 271–283.

  • Hauer, B., & Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In IJCNLP (pp. 865–873). Citeseer.

  • Inkpen, D., Frunza, O., & Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2005) (pp. 251–257).

  • Kondrak, G., & Dorr, B. (2004). Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the 20th international conference on Computational Linguistics (p. 952). Association for Computational Linguistics.

  • Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (Short papers, Vol. 2, NAACL-Short’03, pp. 46–48). ACL. https://doi.org/10.3115/1073483.1073499

  • List, J. M., Lopez, P., & Bapteste, E. (2016). Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the ACL 2016 (Vol. 2: Short Papers, pp. 599–605). Berlin. http://anthology.aclweb.org/P16-2097

  • Luong, M., Pham, H., & Manning, C. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412–1421). Lisbon.

  • Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech (Vol. 2, p. 3).

  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). http://www.aclweb.org/anthology/D14-1162

  • Rama, T. (2015). Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of NAACL: Human Language Technologies (pp. 1227–1231).

  • Rama, T. (2016). Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016 (pp. 1018–1027).

  • Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kocisky, T., & Blunsom, P. (2016). Reasoning about entailment with neural attention. In ICLR.

  • Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of CASCON (pp. 1071–1082). IBM Press.

  • Singh, A. K., & Surana, H. (2007). Study of cognates among south Asian languages for the purpose of building lexical resources. Journal of Language Technology.

  • Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., & Nie, J. Y. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 553–562). ACM.

  • Wichmann, S., & Holman, E. (2008). Languages with longer words have more lexical change. Approaches To Measuring Linguistic Differences.

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference On Machine Learning (pp. 2048–2057).

  • Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 21–29).

Corresponding author

Correspondence to Ashwini Vaidya.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this chapter

Kumar, S., Vaidya, A., Agarwal, S. (2023). Cognate Identification to Augment Lexical Resources for NLP. In: Mukherjee, S., Dutt, V., Srinivasan, N. (eds) Applied Cognitive Science and Technology. Springer, Singapore. https://doi.org/10.1007/978-981-99-3966-4_10
