
Addressing the data gap: building a parallel corpus for Kashmiri language

  • Original Research
International Journal of Information Technology

Abstract

This paper advances language technology for low-resource languages by developing the first parallel corpus for the Kashmiri language, which previously lacked substantial digital resources. We compiled and refined approximately 30,000 sentence pairs through innovative data collection and processing techniques, establishing a high-quality corpus. Using this corpus, we built a Neural Machine Translation (NMT) model and demonstrated its effectiveness with comprehensive performance metrics. Our findings showcase the potential for enhancing NMT systems for languages like Kashmiri and lay the groundwork for future research in linguistic technology development without the need for extensive external resources.


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Notes

  1. https://www.mturk.com/.

  2. https://labelbox.com/.

  3. https://scale.com/.

  4. https://appen.com/.

  5. http://art.uok.edu.in/Main/Default.aspx.

  6. https://kashmirobserver.net/2015/11/25/refurbished-kitab-ghar-thrown-open-at-srinagar/.

  7. http://www.publishingnext.in/shafi-shauq/.

  8. https://en.wikipedia.org/wiki/Lalleshwari.

  9. https://en.wikipedia.org/wiki/Nund_Rishi.

  10. https://en.wikipedia.org/wiki/Agha_Shahid_Ali.

  11. https://pdf.abbyy.com/.

  12. https://github.com/manisandro/gImageReader.

  13. https://www.cisdem.com/.

  14. https://workspace.google.com/.

Acknowledgements

There are no specific acknowledgements for this work.

Funding

This research was conducted without any external funding. The authors declare that no specific grants, sponsorships, or financial support were received for this study.

Author information

Corresponding author

Correspondence to Syed Matla Ul Qumar.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest related to this work.

Human and animal participants

This research does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was not applicable for this study.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qumar, S.M.U., Azim, M. & Quadri, S.M.K. Addressing the data gap: building a parallel corpus for Kashmiri language. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01979-8

  • DOI: https://doi.org/10.1007/s41870-024-01979-8
