Speaker-Adaptive Lip Reading with User-Dependent Padding

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Lip reading aims to predict speech from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to personal lip appearance and movement, so lip reading models degrade when applied to unseen speakers due to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between train and test speakers, guiding a trained model to focus on modeling the speech content without interference from speaker variations. In contrast to the decades of effort devoted to audio-based speech recognition, speaker adaptation has not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that participates in the visual feature extraction stage of a pre-trained lip reading model. Therefore, the lip appearance and movement of each speaker can be taken into account during visual feature encoding, adaptively for individual speakers. Moreover, the proposed method needs neither 1) any additional layers, 2) modification of the learned weights of the pre-trained model, nor 3) the speaker labels of the training data used during pre-training. It can directly adapt to unseen speakers by learning the user-dependent padding only, in a supervised or unsupervised manner. Finally, to alleviate the lack of speaker information in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID. The effectiveness of the proposed method is verified on sentence- and word-level lip reading, and we show it can further improve the performance of a model already well trained with large speaker variations.
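
Below is a minimal PyTorch sketch of the core idea as stated in the abstract, not the authors' implementation: a convolutional layer from a pre-trained (and frozen) visual front-end whose usual zero-padding is replaced by a learnable, speaker-specific border value, so that adapting to a new speaker updates only that padding tensor. The names (UserDependentPadConv, user_pad) and the 2D single-frame setup are illustrative assumptions; the paper's model operates on video and its padding design may differ.

```python
# Sketch: replace zero-padding in a frozen pre-trained conv layer with a
# learnable, speaker-specific padding value. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UserDependentPadConv(nn.Module):
    """Conv layer whose border padding is a learnable, speaker-specific value.

    The conv weights are assumed to come from a pre-trained lip reading model
    and stay frozen; only `user_pad` is optimized when adapting to a speaker.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size // 2
        # Padding is applied before the conv, so the conv itself uses padding=0.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)
        # One learnable value per input channel, broadcast over the border
        # region (a simplification; a spatial padding map would also work).
        self.user_pad = nn.Parameter(torch.zeros(1, in_ch, 1, 1))
        # Freeze the pre-trained conv weights; only the padding adapts.
        for p in self.conv.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero-pad first, then add user_pad only on the border cells.
        padded = F.pad(x, [self.pad] * 4, value=0.0)
        border = 1.0 - F.pad(torch.ones_like(x), [self.pad] * 4, value=0.0)
        padded = padded + border * self.user_pad
        return self.conv(padded)


if __name__ == "__main__":
    layer = UserDependentPadConv(in_ch=1, out_ch=8)
    frames = torch.randn(2, 1, 88, 88)  # batch of mouth-region crops
    out = layer(frames)
    trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
    print(out.shape, trainable)  # torch.Size([2, 8, 88, 88]) ['user_pad']
```

Since only user_pad requires gradients, adapting to a new speaker amounts to a few optimization steps over that single tensor, supervised or unsupervised, while the pre-trained weights remain untouched.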

Acknowledgment

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities).

Author information

Corresponding author

Correspondence to Yong Man Ro.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 325 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kim, M., Kim, H., Ro, Y.M. (2022). Speaker-Adaptive Lip Reading with User-Dependent Padding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_33

  • DOI: https://doi.org/10.1007/978-3-031-20059-5_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science, Computer Science (R0)
