Abstract
This paper marks a significant step forward in language technology for low-resource languages by developing the first parallel corpus for the Kashmiri language, which previously lacked substantial digital resources. We compiled and refined approximately 30,000 sentence pairs through innovative data collection and processing techniques, establishing a high-quality corpus. Leveraging this corpus, we built a Neural Machine Translation (NMT) model, demonstrating its effectiveness with comprehensive performance metrics. Our findings not only showcase the potential for enhancing NMT systems for languages like Kashmiri but also lay the groundwork for future research in linguistic technology development without the need for extensive external resources.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig12_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs41870-024-01979-8/MediaObjects/41870_2024_1979_Fig13_HTML.png)
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
There are no specific acknowledgements for this work.
Funding
This research was conducted without any external funding. The authors declare that no specific grants, sponsorships, or financial support were received for this study.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest related to this work.
Human and animal participants
This research does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was not applicable for this study.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qumar, S.M.U., Azim, M. & Quadri, S.M.K. Addressing the data gap: building a parallel corpus for Kashmiri language. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01979-8
DOI: https://doi.org/10.1007/s41870-024-01979-8