Corpus and Machine Translation

Utility and Application of Language Corpora

Abstract

History shows that a machine translation (MT) system supported by only a few linguistic rules is not realistic: a few rules cannot capture the wide variety that a natural language exhibits in its diverse use. This leads us to argue for a corpus-based machine translation (CBMT) system that relies on a large amount of linguistic data, information, examples, and rules retrieved from corpora. The first benefit of a CBMT system is the development of algorithms for aligning a bilingual text corpus (BTC), an essential part of an MT system. A BTC yields a new kind of translation support resource that helps in learning through trial, verification, and validation. A CBMT system begins with an analysis of translations produced by humans in order to understand and define the internal structures of the BTC, completely or partially, and to design strategies for machine learning. Analysis of the BTC contributes heavily to the development of translation aids, as we do not expect an MT system to ‘produce’ exact translations but to ‘understand’ how translations are actually produced with linguistic and extralinguistic information. The use of a BTC in CBMT is justified on the ground that the data and information acquired from a BTC are richer than those from a monolingual corpus with regard to contextual equivalence between the languages. Thus, a CBMT system earns a unique status by combining features of example-based machine translation (EBMT) and statistics-based machine translation (SBMT), keeping a mutual interface between the two.

Author information

Corresponding author

Correspondence to Niladri Sekhar Dash.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S., Ramamoorthy, L. (2019). Corpus and Machine Translation. In: Utility and Application of Language Corpora. Springer, Singapore. https://doi.org/10.1007/978-981-13-1801-6_12

  • DOI: https://doi.org/10.1007/978-981-13-1801-6_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1800-9

  • Online ISBN: 978-981-13-1801-6

  • eBook Packages: Social Sciences (R0)
