
Automatic phoneme recognition by deep neural networks

Published in: The Journal of Supercomputing

Abstract

This work presents a lightweight phoneme recognition model built on object detection techniques. The model is designed to run on devices with low processing power, such as tablets and mobile phones. Hardware-aware network architecture search, complemented by the NetAdapt algorithm, led to a simpler and lighter network architecture, MobileNet; specifically, the MobileNetV3 convolutional architecture was combined with the Single-Shot Detector (SSD). The model was trained on the TIMIT and LibriSpeech databases, both of which contain audio of spoken English. To obtain a graphical representation of each audio sample, its spectrogram was computed on the Mel scale, and the temporal position of each phoneme occurrence in the corresponding spectrogram was used to train the phoneme localization algorithm. Additionally, the training dataset had to be enlarged to improve the generalization of the model, so the two databases were merged and data augmentation techniques were applied to the audio. With this approach, the MobileNetV3-Large architecture achieved an accuracy of 0.72 mAP@0.5 IoU; for comparison, the MobileNetV3-Small architecture achieved 0.63 mAP@0.5 IoU.
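The first stage of the pipeline described above, converting audio into a Mel-scale spectrogram, can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation; the parameter values (16 kHz sampling, 512-point FFT, 160-sample hop, 40 Mel bands) are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale, mapped to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame and window the signal, take the power spectrum of each frame,
    # project onto the Mel filterbank, and compress with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)
```

The resulting two-dimensional log-Mel image is what the detection network consumes in place of a photograph.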
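Treating phoneme localization as object detection means each phoneme's time span becomes a bounding box on the spectrogram, and predictions are scored by intersection-over-union (IoU), with the reported mAP@0.5 IoU counting a detection as correct when IoU ≥ 0.5. The sketch below shows this mapping under assumed parameters (16 kHz audio, 160-sample hop, 40 Mel bands); it is an illustration of the metric, not the paper's evaluation code.

```python
def phoneme_to_box(start_s, end_s, sr=16000, hop=160, n_mels=40):
    # Map a phoneme's time span (seconds) to a full-height box
    # (x1, y1, x2, y2) in spectrogram frame coordinates.
    to_frame = lambda t: int(round(t * sr / hop))
    return (to_frame(start_s), 0, to_frame(end_s), n_mels)

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

Because every phoneme box spans the full frequency axis, the IoU here is effectively a measure of temporal overlap between predicted and ground-truth phoneme intervals.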
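The abstract mentions applying data augmentation to the audio to improve generalization but does not specify which transforms were used. As one plausible example, additive noise at a target signal-to-noise ratio plus a random gain is a common waveform-level augmentation; the function below is a hypothetical sketch, not the authors' method.

```python
import numpy as np

def augment(signal, rng, noise_snr_db=20.0, gain_db_range=6.0):
    # Add Gaussian noise at a target SNR, then apply a random gain.
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (noise_snr_db / 10.0))
    noisy = signal + rng.normal(0.0, np.sqrt(noise_power), len(signal))
    gain = 10.0 ** (rng.uniform(-gain_db_range, gain_db_range) / 20.0)
    return noisy * gain
```

Each pass over the training set can draw fresh noise and gain values, effectively multiplying the number of distinct spectrograms the detector sees.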


[Figures 1–9 appear in the full article.]



Acknowledgements

The authors express their gratitude to the Federal University of Maranhão, especially to the InovTec Lab.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Author information


Corresponding author

Correspondence to Bianca Valéria L. Pereira.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Pereira, B.V.L., de Carvalho, M.B.F., Alves, P.A.A.S.A.N. et al. Automatic phoneme recognition by deep neural networks. J Supercomput 80, 16654–16678 (2024). https://doi.org/10.1007/s11227-024-06098-6

