Automatic Grammar Correction of Commas in Czech Written Texts: Comparative Study

  • Conference paper
Text, Speech, and Dialogue (TSD 2022)

Abstract

Grammatical error correction is a widely studied task in natural language processing, where traditional rule-based approaches compete with machine learning methods. The rule-based approach benefits mainly from the extensive knowledge base available for a given language. In contrast, transfer learning methods, and especially pre-trained Transformers, can be trained on a huge number of texts in a given language. In this paper, we focus on the automatic correction of missing commas in Czech written texts, and we compare the rule-based approach with a Transformer-based model trained for this task.

This work was supported by the project of specific research Lexikon a gramatika češtiny II - 2022 (Lexicon and Grammar of Czech II - 2022; project No. MUNI/A/1137/2021) and by the Czech Science Foundation (GA CR), project No. GA22-27800S.


Notes

  1. You can try out the rule-based comma detection and correction at http://opravidlo.cz/.

  2. https://commoncrawl.org/.


Author information

Corresponding author

Correspondence to Jan Švec.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Machura, J., Frémund, A., Švec, J. (2022). Automatic Grammar Correction of Commas in Czech Written Texts: Comparative Study. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_10

  • DOI: https://doi.org/10.1007/978-3-031-16270-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16269-5

  • Online ISBN: 978-3-031-16270-1

  • eBook Packages: Computer Science, Computer Science (R0)
