Addressing the data gap: building a parallel corpus for Kashmiri language

Qumar, Syed Matla Ul; Azim, Muzaffar; Quadri, S. M. K.

doi:10.1007/s41870-024-01979-8

Original Research
Published: 21 June 2024

International Journal of Information Technology

Abstract

This paper marks a significant step forward in language technology for low-resource languages by developing the first parallel corpus for the Kashmiri language, which previously lacked substantial digital resources. We compiled and refined approximately 30,000 sentence pairs through innovative data collection and processing techniques, establishing a high-quality corpus. Leveraging this corpus, we built a Neural Machine Translation (NMT) model, demonstrating its effectiveness with comprehensive performance metrics. Our findings not only showcase the potential for enhancing NMT systems for languages like Kashmiri but also lay the groundwork for future research in linguistic technology development without the need for extensive external resources.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

There are no specific acknowledgements for this work.

Funding

This research was conducted without any external funding. The authors declare that no specific grants, sponsorships, or financial support were received for this study.

Author information

Authors and Affiliations

FTK-Centre for Information Technology, Jamia Millia Islamia, New Delhi, 110025, India
Syed Matla Ul Qumar & Muzaffar Azim
Department of Computer Science, Jamia Millia Islamia, New Delhi, 110025, India
S. M. K. Quadri

Corresponding author

Correspondence to Syed Matla Ul Qumar.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest related to this work.

Human and animal participants

This research does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was not applicable for this study.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qumar, S.M.U., Azim, M. & Quadri, S.M.K. Addressing the data gap: building a parallel corpus for Kashmiri language. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01979-8

Received: 16 February 2024
Accepted: 29 May 2024
Published: 21 June 2024
DOI: https://doi.org/10.1007/s41870-024-01979-8
