Improving cross-lingual language understanding with consistency regularization-based fine-tuning

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Fine-tuning pre-trained cross-lingual language models alleviates the need for annotated data in different languages, as it allows the models to transfer task-specific supervision between languages, especially from high- to low-resource languages. In this work, we propose to improve cross-lingual language understanding with consistency regularization-based fine-tuning. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method (the code is available at https://github.com/bozheng-hit/xTune) achieves significant improvements across various cross-lingual language understanding tasks, including text classification, question answering, and sequence labeling. Furthermore, we extend our method to the few-shot cross-lingual transfer setting, particularly considering a more realistic setting where machine translation systems are available. Moreover, machine translation as a data augmentation combines well with our consistency regularization method. Experimental results demonstrate that our method also benefits the few-shot scenario.
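In outline, the example consistency penalty can be sketched as a symmetric KL divergence between the model's predicted distributions on an example and on its augmented version (a minimal Python sketch of the idea only; the paper's exact loss form and stop-gradient details may differ):

```python
import math

def kl(p, q):
    """KL divergence between two discrete probability distributions.
    Assumes q assigns non-zero mass wherever p does."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def example_consistency(p_orig, p_aug):
    """Symmetric KL penalty between the predictions on an example
    and on its augmented version (e.g., subword-sampled or code-switched)."""
    return 0.5 * (kl(p_orig, p_aug) + kl(p_aug, p_orig))

# identical predictions incur zero penalty; diverging predictions are penalized
zero = example_consistency([0.7, 0.3], [0.7, 0.3])
pos = example_consistency([0.9, 0.1], [0.5, 0.5])
```

During fine-tuning, this penalty is added to the task loss so that the model is trained to predict consistently under data augmentation.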


(Figs. 1–4 appear here in the original article.)

Data availability statement

All datasets in the XTREME benchmark are available at http://github.com/google-research/xtreme. The bilingual dictionaries used for code-switch substitution data augmentation are available at http://github.com/facebookresearch/MUSE. The machine translation data augmentation is available in the repository at https://github.com/bozheng-hit/xTune.

Notes

  1. We define conventional cross-lingual fine-tuning as fine-tuning the pre-trained cross-lingual model with the labeled training set in the source language only (typically English) or with labeled training sets in all languages.

  2. Implemented by .detach() in PyTorch.

  3. https://github.com/google-research/xtreme

  4. https://github.com/facebookresearch/MUSE

  5. X-STILTs [39] uses additional SQuAD v1.1 English training data for the TyDiQA-GoldP dataset, while we prefer a cleaner setting here.

  6. FILTER directly selects the best model on the test set of XQuAD and TyDiQA-GoldP. Under this setting, we can obtain 83.1/69.7 for XQuAD, 75.5/61.1 for TyDiQA-GoldP.

  7. For span extraction datasets, to align the labels, the answers are enclosed in quotes before translation, which makes it easy to extract the answers from the translated context [30]. This method can also be applied to NER tasks; however, aligning the label information requires complex post-processing, and alignment errors can occur.

  8. Paragraphs in XQuAD contain more question-answer pairs than those in MLQA.
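The quote-enclosing trick of note 7 can be sketched as a mark-then-extract round trip (an illustrative sketch; in practice the marked context is sent through an MT system, which is elided here):

```python
def mark_answer(context: str, answer: str) -> str:
    """Enclose the answer span in quotes before the context is machine-translated,
    so the span survives translation as delimited text."""
    start = context.index(answer)  # assumes the answer occurs once in the context
    end = start + len(answer)
    return context[:start] + '"' + answer + '"' + context[end:]

def extract_answer(translated_context: str) -> str:
    """Recover the (translated) answer as the text between the first quote pair."""
    first = translated_context.index('"')
    second = translated_context.index('"', first + 1)
    return translated_context[first + 1:second]
```

With an identity "translation", `extract_answer(mark_answer(c, a))` returns the original answer; a real MT system would return the translated span instead.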

References

  1. Aghajanyan A, Shrivastava A, Gupta A, et al (2020) Better fine-tuning by reducing representational collapse. CoRR. arXiv:2008.03156

  2. Artetxe M, Ruder S, Yogatama D (2020) On the cross-lingual transferability of monolingual representations. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 4623–4637. https://www.aclweb.org/anthology/2020.acl-main.421/

  3. Athiwaratkun B, Finzi M, Izmailov P, et al (2019) There are many consistent explanations of unlabeled data: why you should average. In: 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9. OpenReview.net, https://openreview.net/forum?id=rkgKBhA5Y7

  4. Carmon Y, Raghunathan A, Schmidt L, et al (2019) Unlabeled data improves adversarial robustness. In: Wallach HM, Larochelle H, Beygelzimer A, et al (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp 11190–11201. http://papers.nips.cc/paper/9298-unlabeled-data-improves-adversarial-robustness

  5. Chi Z, Dong L, Wei F, et al (2020) InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. CoRR. arXiv:2007.07834

  6. Chi Z, Dong L, Zheng B, et al (2021) Improving pretrained cross-lingual language models via self-labeled word alignment. In: Zong C, Xia F, Li W, et al (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: Long Papers), Virtual Event, August 1–6, 2021. Association for Computational Linguistics, pp 3418–3430. https://doi.org/10.18653/v1/2021.acl-long.265

  7. Chi Z, Huang S, Dong L, et al (2022) XLM-E: cross-lingual language model pre-training via ELECTRA. In: Muresan S, Nakov P, Villavicencio A (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (vol 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022. Association for Computational Linguistics, pp 6170–6182. https://doi.org/10.18653/v1/2022.acl-long.427

  8. Chung HW, Garrette D, Tan KC, et al (2020) Improving multilingual models with language-clustered vocabularies. In: Webber B, Cohn T, He Y, et al (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics, pp 4536–4546. https://doi.org/10.18653/v1/2020.emnlp-main.367

  9. Clark JH, Palomaki J, Nikolaev V, et al (2020) Tydi QA: a benchmark for information-seeking question answering in typologically diverse languages. Trans Assoc Comput Linguist 8:454–470. https://transacl.org/ojs/index.php/tacl/article/view/1929

  10. Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Wallach HM, Larochelle H, Beygelzimer A, et al (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp 7057–7067. http://papers.nips.cc/paper/8928-cross-lingual-language-model-pretraining

  11. Conneau A, Rinott R, Lample G, et al (2018) XNLI: evaluating cross-lingual sentence representations. In: Riloff E, Chiang D, Hockenmaier J, et al (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 2475–2485. https://doi.org/10.18653/v1/d18-1269

  12. Conneau A, Khandelwal K, Goyal N, et al (2020a) Unsupervised cross-lingual representation learning at scale. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 8440–8451. http://www.aclweb.org/anthology/2020.acl-main.747/

  13. Conneau A, Wu S, Li H, et al (2020b) Emerging cross-lingual structure in pretrained language models. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536/

  14. Devlin J, Chang M, Lee K, et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 conference of the North American chapter of the Association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, vol 1 (Long and Short Papers). Association for Computational Linguistics, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423

  15. Fang Y, Wang S, Gan Z, et al (2020) FILTER: an enhanced fusion method for cross-lingual language understanding. CoRR. arXiv:2009.05166

  16. Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Bouma G, Parmentier Y (eds) Proceedings of the 14th conference of the European chapter of the association for computational linguistics, EACL 2014, April 26–30, 2014, Gothenburg, Sweden. The Association for Computer Linguistics, pp 462–471. https://doi.org/10.3115/v1/e14-1049

  17. Fei H, Zhang M, Ji D (2020) Cross-lingual semantic role labeling with high-quality translated training corpus. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 7014–7026. http://www.aclweb.org/anthology/2020.acl-main.627/

  18. Gao T, Han X, Xie R, et al (2020) Neural snowball for few-shot relation learning. In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, the thirty-second innovative applications of artificial intelligence conference, IAAI 2020, the tenth AAAI symposium on educational advances in artificial intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020. AAAI Press, pp 7772–7779. http://ojs.aaai.org/index.php/AAAI/article/view/6281

  19. Guo J, Che W, Yarowsky D, et al (2015) Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31, 2015, Beijing, China, vol 1: Long Papers. The Association for Computer Linguistics, pp 1234–1244. https://doi.org/10.3115/v1/p15-1119

  20. Hou Y, Che W, Lai Y, et al (2020) Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, pp 1381–1393. https://doi.org/10.18653/v1/2020.acl-main.128

  21. Hou Y, Mao J, Lai Y, et al (2020) Fewjoint: a few-shot learning benchmark for joint language understanding. CoRR. arXiv:2009.08138

  22. Hu J, Ruder S, Siddhant A, et al (2020) XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Proceedings of the 37th international conference on machine learning, ICML 2020, 13–18 July 2020, virtual event, proceedings of machine learning research, vol 119. PMLR, pp 4411–4421. http://proceedings.mlr.press/v119/hu20b.html

  23. Hu J, Johnson M, Firat O, et al (2021) Explicit alignment objectives for multilingual bidirectional encoders. In: Toutanova K, Rumshisky A, Zettlemoyer L, et al (eds) Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2021, Online, June 6–11, 2021. Association for Computational Linguistics, pp 3633–3643. https://doi.org/10.18653/v1/2021.naacl-main.284

  24. Hu W, Miyato T, Tokui S, et al (2017) Learning discrete representations via information maximizing self-augmented training. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, proceedings of machine learning research, vol 70. PMLR, pp 1558–1567. http://proceedings.mlr.press/v70/hu17b.html

  25. Jiang H, He P, Chen W, et al (2020) SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 2177–2190. https://www.aclweb.org/anthology/2020.acl-main.197/

  26. Kudo T (2018) Subword regularization: Improving neural network translation models with multiple subword candidates. In: Gurevych I, Miyao Y (eds) Proceedings of the 56th annual meeting of the association for computational linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, vol 1: long papers. Association for Computational Linguistics, pp 66–75. https://doi.org/10.18653/v1/P18-1007. https://www.aclweb.org/anthology/P18-1007/

  27. Kudo T, Richardson J (2018) Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Blanco E, Lu W (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, EMNLP 2018: system demonstrations, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 66–71. https://doi.org/10.18653/v1/d18-2012

  28. Lample G, Conneau A, Denoyer L, et al (2018) Unsupervised machine translation using monolingual corpora only. In: 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, conference track proceedings. OpenReview.net, http://openreview.net/forum?id=rkYTTf-AZ

  29. Lauscher A, Ravishankar V, Vulic I, et al (2020) From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers. In: Webber B, Cohn T, He Y, et al (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics, pp 4483–4499. https://doi.org/10.18653/v1/2020.emnlp-main.363

  30. Lewis PSH, Oguz B, Rinott R, et al (2020) MLQA: evaluating cross-lingual extractive question answering. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 7315–7330. http://www.aclweb.org/anthology/2020.acl-main.653/

  31. Li H, Yan H, Li Y, et al (2023) Distinguishability calibration to in-context learning. CoRR. https://doi.org/10.48550/arXiv.2302.06198. arXiv:2302.06198

  32. Liu X, Cheng H, He P, et al (2020) Adversarial training for large neural language models. CoRR. arXiv:2004.08994

  33. Luo F, Wang W, Liu J, et al (2020) VECO: Variable encoder-decoder pre-training for cross-lingual understanding and generation. arXiv:2010.16046

  34. Lv X, Gu Y, Han X, et al (2019) Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 3374–3379. https://doi.org/10.18653/v1/D19-1334

  35. Mikolov T, Le QV, Sutskever I (2013) Exploiting similarities among languages for machine translation. CoRR. arXiv:1309.4168

  36. Miyato T, Maeda S, Koyama M et al (2019) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans Pattern Anal Mach Intell 41(8):1979–1993. https://doi.org/10.1109/TPAMI.2018.2858821


  37. Nivre J, Blokland R, Partanen N, et al (2018) Universal dependencies 2.2

  38. Pan X, Zhang B, May J, et al (2017) Cross-lingual name tagging and linking for 282 languages. In: Barzilay R, Kan M (eds) Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, volume 1: long papers. Association for Computational Linguistics, pp 1946–1958. https://doi.org/10.18653/v1/P17-1178

  39. Phang J, Htut PM, Pruksachatkun Y, et al (2020) English intermediate-task training improves zero-shot cross-lingual transfer too. CoRR. arXiv:2005.13013

  40. Provilkov I, Emelianenko D, Voita E (2020) BPE-dropout: simple and effective subword regularization. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 1882–1892. https://www.aclweb.org/anthology/2020.acl-main.170/

  41. Qin L, Ni M, Zhang Y, et al (2020) CoSDA-ML: multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. In: Bessiere C (eds) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020. ijcai.org, pp 3853–3860. https://doi.org/10.24963/ijcai.2020/533

  42. Shah DJ, Gupta R, Fayazi AA, et al (2019) Robust zero-shot cross-domain slot filling with example values. In: Korhonen A, Traum DR, Màrquez L (eds) Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, vol 1: long papers. Association for Computational Linguistics, pp 5484–5490. https://doi.org/10.18653/v1/p19-1547

  43. Singh J, McCann B, Keskar NS, et al (2019) XLDA: cross-lingual data augmentation for natural language inference and question answering. CoRR. arXiv:1905.11471

  44. Sun S, Sun Q, Zhou K, et al (2019) Hierarchical attention prototypical networks for few-shot text classification. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 476–485. https://doi.org/10.18653/v1/D19-1045

  45. Tarvainen A, Valpola H (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, Workshop Track Proceedings. OpenReview.net, http://openreview.net/forum?id=ry8u21rtl

  46. Wang Y, Che W, Guo J, et al (2019) Cross-lingual BERT transformation for zero-shot dependency parsing. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 5720–5726. https://doi.org/10.18653/v1/D19-1575

  47. Xie Q, Dai Z, Hovy EH, et al (2020) Unsupervised data augmentation for consistency training. In: Larochelle H, Ranzato M, Hadsell R, et al (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual. http://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html

  48. Xu H, Murray K (2022) Por qué não utiliser alla språk? mixed training with gradient optimization in few-shot cross-lingual transfer. CoRR. arXiv:2204.13869

  49. Xu R, Yang Y, Otani N, et al (2018) Unsupervised cross-lingual transfer of word embedding spaces. In: Riloff E, Chiang D, Hockenmaier J, et al (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 2465–2474. https://doi.org/10.18653/v1/d18-1268

  50. Yan H, Gui L, He Y (2022) Hierarchical interpretation of neural text classification. Comput Linguist 48(4):987–1020. https://doi.org/10.1162/coli_a_00459


  51. Yan H, Gui L, Li W, et al (2022b) Addressing token uniformity in transformers via singular value transformation. In: Cussens J, Zhang K (eds) Uncertainty in artificial intelligence, proceedings of the thirty-eighth conference on uncertainty in artificial intelligence, UAI 2022, 1–5 August 2022, Eindhoven, The Netherlands, proceedings of machine learning research, vol 180. PMLR, pp 2181–2191. http://proceedings.mlr.press/v180/yan22b.html

  52. Yan L, Zheng Y, Cao J (2018) Few-shot learning for short text classification. Multimed Tools Appl 77(22):29799–29810. https://doi.org/10.1007/s11042-018-5772-4


  53. Yang H, Chen H, Zhou H et al (2022) Enhancing cross-lingual transfer by manifold mixup. In: The 10th International conference on learning representations, ICLR 2022. Virtual Event. April 25-29, 2022

  54. Yang Y, Zhang Y, Tar C, et al (2019) PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 3685–3690. https://doi.org/10.18653/v1/D19-1382

  55. Ye M, Zhang X, Yuen PC, et al (2019) Unsupervised embedding learning via invariant and spreading instance feature. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019. Computer Vision Foundation/IEEE, pp 6210–6219. https://doi.org/10.1109/CVPR.2019.00637. http://openaccess.thecvf.com/content_CVPR_2019/html/Ye_Unsupervised_Embedding_Learning_via_Invariant_and_Spreading_Instance_Feature_CVPR_2019_paper.html

  56. Yu M, Guo X, Yi J, et al (2018) Diverse few-shot text classification with multiple metrics. In: Walker MA, Ji H, Stent A (eds) Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, vol 1 (long papers). Association for Computational Linguistics, pp 1206–1215. https://doi.org/10.18653/v1/n18-1109

  57. Zhang M, Zhang Y, Fu G (2019) Cross-lingual dependency parsing using code-mixed treebank. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 997–1006. https://doi.org/10.18653/v1/D19-1092

  58. Zhao M, Zhu Y, Shareghi E, et al (2021) A closer look at few-shot crosslingual transfer: the choice of shots matters. In: Zong C, Xia F, Li W, et al (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: long papers), virtual event, August 1–6, 2021. Association for Computational Linguistics, pp 5751–5767. https://doi.org/10.18653/v1/2021.acl-long.447

  59. Zhao W, Eger S, Bjerva J, et al (2021) Inducing language-agnostic multilingual representations. In: Nastase V, Vulic I (eds) Proceedings of *SEM 2021: the tenth joint conference on lexical and computational semantics, *SEM 2021, Online, August 5–6, 2021. Association for Computational Linguistics, pp 229–240. https://doi.org/10.18653/v1/2021.starsem-1.22

  60. Zheng B, Dong L, Huang S, et al (2021) Allocating large vocabulary capacity for cross-lingual language model pre-training. In: Moens M, Huang X, Specia L, et al (eds) Proceedings of the 2021 conference on empirical methods in natural language processing, EMNLP 2021, virtual event/Punta Cana, Dominican Republic, 7–11 November, 2021. Association for Computational Linguistics, pp 3203–3215. https://doi.org/10.18653/v1/2021.emnlp-main.257

  61. Zheng S, Song Y, Leung T, et al (2016) Improving the robustness of deep neural networks via stability training. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, pp 4480–4488. https://doi.org/10.1109/CVPR.2016.485

  62. Zhu C, Cheng Y, Gan Z, et al (2020) FreeLB: enhanced adversarial training for natural language understanding. In: 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, https://openreview.net/forum?id=BygzbyHFvB

Acknowledgements

This work was supported by the National Key R&D Program of China via Grant 2020AAA0106501 and the National Natural Science Foundation of China (NSFC) via Grants 62236004 and 61976072.

Author information

Corresponding author

Correspondence to Wanxiang Che.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Statistics of XTREME datasets

See Table 9.

Table 9 Statistics for the datasets in the XTREME benchmark. We report the number of training examples (#Train) and the number of languages (#Lang)

Appendix 2: Hyper-parameters

See Table 10.

Table 10 The best hyper-parameters used for xTune under the cross-lingual transfer setting

See Table 11.

Table 11 The best hyper-parameters used for xTune under the translate-train-all setting

1.1 Conventional cross-lingual fine-tuning

For XNLI, PAWS-X, POS, and NER, we fine-tune for 10 epochs. For XQuAD and MLQA, we fine-tune for 4 epochs. For TyDiQA-GoldP, we fine-tune for 20 and 10 epochs for the base and large models, respectively. We select \(\lambda _{1}\) from [1.0, 2.0, 5.0] and \(\lambda _{2}\) from [0.3, 0.5, 1.0, 2.0, 5.0]. For the learning rate, we select from [5e-6, 7e-6, 1e-5, 1.5e-5] for large models and from [7e-6, 1e-5, 2e-5, 3e-5] for base models. We use a batch size of 32 for all datasets and warm up over the first 10% of total training steps with a linear learning rate schedule. Our experiments are conducted on a single 32 GB Nvidia V100 GPU, and we use gradient accumulation for large-size models. The other hyper-parameters for the two-stage xTune training are shown in Tables 10 and 11.
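The selection ranges above define a small hyper-parameter grid; enumerating the candidate configurations for large models can be sketched as follows (illustrative only — the paper does not state that the full cross-product was searched):

```python
from itertools import product

# Search ranges quoted in the text (large-model learning rates shown)
lambda1_grid = [1.0, 2.0, 5.0]
lambda2_grid = [0.3, 0.5, 1.0, 2.0, 5.0]
lr_grid = [5e-6, 7e-6, 1e-5, 1.5e-5]

def candidate_configs():
    """Enumerate every (lambda1, lambda2, learning-rate) combination to try."""
    return [
        {"lambda1": l1, "lambda2": l2, "lr": lr}
        for l1, l2, lr in product(lambda1_grid, lambda2_grid, lr_grid)
    ]
```

Each configuration would then be fine-tuned and scored on the development set, keeping the best-performing one.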

1.2 Few-shot cross-lingual fine-tuning

During the source-training stage, we use the same hyper-parameters as in conventional cross-lingual fine-tuning. During the target-adapting stage, for the POS and NER tasks, we fine-tune the model for 100 epochs, selecting the model on the development set after every epoch with an early-stopping patience of 10 epochs. For the MLQA and XQuAD tasks, we fine-tune the model for 2 or 3 epochs and use the model from the last epoch. For the learning rate, we select from [2e-5, 1e-5]. We use a batch size of 32 for the POS and NER tasks and a batch size of 8 for the XQuAD and MLQA tasks. We set \(\lambda _1\) to 5.0 in UDA since it had the best performance in the previous experiments.
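The target-adapting model selection for POS and NER amounts to standard early stopping on development scores (an illustrative sketch, not the authors' code):

```python
def best_epoch_with_early_stopping(dev_scores, patience=10):
    """Return the index of the best epoch, stopping the scan once `patience`
    consecutive epochs pass without improvement on the development set."""
    best_epoch, best_score, waited = 0, float("-inf"), 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # training would be halted here
    return best_epoch
```

In the actual training loop, the checkpoint saved at the returned epoch is the one kept for evaluation.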

Appendix 3: Results for each dataset and language

We provide detailed results for each dataset and language below. We compare our method against \(\text {XLM-R}_\text {large}\) under the cross-lingual transfer setting and against FILTER [15] under the translate-train-all setting.

Appendix 4: How to select data augmentation strategies in xTune

We give instructions for selecting a proper data augmentation strategy depending on the task.

1.1 Classification

The two distributions in example consistency \(\mathcal {R}_{1}\) can always be aligned. Therefore, we recommend using machine translation as data augmentation if machine translation systems are available. Otherwise, the priority order of our data augmentation strategies is code-switch substitution, then subword sampling, then Gaussian noise.

1.2 Span extraction

The two distributions in example consistency \(\mathcal {R}_{1}\) cannot be aligned across translation pairs, so it is impossible to use machine translation as data augmentation in example consistency \(\mathcal {R}_{1}\). We prefer code-switch substitution when applying example consistency \(\mathcal {R}_{1}\) individually. However, when the training corpus is augmented with translations, bilingual dictionaries between arbitrary language pairs may not be available, so we recommend using subword sampling in example consistency \(\mathcal {R}_1\).

1.3 Sequence labeling

Similar to span extraction, the two distributions in example consistency \(\mathcal {R}_{1}\) cannot be aligned across translation pairs, so we do not use machine translation in example consistency \(\mathcal {R}_{1}\). Unlike classification and span extraction, sequence labeling requires finer-grained information and is more sensitive to noise. We found code-switch substitution to be worse than subword sampling as data augmentation in both example consistency \(\mathcal {R}_{1}\) and model consistency \(\mathcal {R}_{2}\); it can even degrade performance for certain hyper-parameters. Thus we recommend using subword sampling in example consistency \(\mathcal {R}_{1}\), and using machine translation to augment the English training corpus if machine translation systems are available, otherwise subword sampling.
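The recommendations of this appendix can be condensed into a small selection helper (a sketch; the function and flag names are illustrative, not from the paper):

```python
def choose_augmentation(task, mt_available=False, dictionaries_available=True,
                        translate_train=False):
    """Pick the data augmentation for example consistency R1, following the
    per-task recommendations above."""
    if task == "classification":
        # distributions can always be aligned, so MT is preferred when available;
        # Gaussian noise is the last resort after code-switch and subword sampling
        if mt_available:
            return "machine_translation"
        return "code_switch" if dictionaries_available else "subword_sampling"
    if task == "span_extraction":
        # translation pairs cannot be aligned, so MT is never used inside R1;
        # with a translation-augmented corpus, dictionaries may be missing
        return "subword_sampling" if translate_train else "code_switch"
    if task == "sequence_labeling":
        # fine-grained labels are noise-sensitive: always prefer subword sampling
        return "subword_sampling"
    raise ValueError(f"unknown task: {task}")
```

For sequence labeling, machine translation is still used to augment the English training corpus itself when an MT system is available; it just never enters \(\mathcal {R}_{1}\).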

Appendix 5: Results for each language

See Tables 12, 13, 14, 15, 16 and 17.

Table 12 PAWSX results (accuracy scores) for each language
Table 13 XQuAD results (F1/EM scores) for each language
Table 14 MLQA results (F1/EM scores) for each language
Table 15 TyDiQA-GolP results (F1/EM scores) for each language
Table 16 POS results (accuracy) for each language
Table 17 NER results (F1 scores) for each language

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zheng, B., Che, W. Improving cross-lingual language understanding with consistency regularization-based fine-tuning. Int. J. Mach. Learn. & Cyber. 14, 3621–3639 (2023). https://doi.org/10.1007/s13042-023-01854-1
