Abstract
This work presents a lightweight phoneme recognition model based on object detection techniques, designed to run on devices with low processing power, such as tablets and mobile phones. The combination of hardware-aware network architecture search complemented by the NetAdapt algorithm led to a simpler and lighter network architecture, MobileNet. The MobileNetV3 convolutional network architecture was combined with the Single Shot MultiBox Detector (SSD). The model was trained on the TIMIT and LibriSpeech databases, both of which contain audio spoken in English. To obtain a graphical representation of each audio signal, its spectrogram was computed on the Mel scale. The temporal position of each phoneme's occurrence in the corresponding spectrogram was then used to train the phoneme localization algorithm. Additionally, the training dataset had to be enlarged to improve the generalization of the model; therefore, the two databases were merged and data augmentation techniques were applied to the audio. The main goal was to achieve learning with a lightweight architecture suitable for low-power devices: the MobileNetV3-Large variant achieved 0.72 mAP@0.5 IoU, while the MobileNetV3-Small variant, used for comparison, achieved 0.63 mAP@0.5 IoU.
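The preprocessing the abstract describes can be illustrated with a minimal sketch: compute a Mel-scale spectrogram from a raw signal, then map a phoneme's start/end times onto spectrogram frames, which is the horizontal extent of the box an SSD-style detector would localize. This is not the authors' code; the sample rate, FFT size, hop length, and filter count below are illustrative assumptions (TIMIT and LibriSpeech audio is 16 kHz).

```python
import numpy as np

SR = 16000        # sample rate (assumed; TIMIT/LibriSpeech use 16 kHz)
N_FFT = 512       # FFT window size (illustrative)
HOP = 160         # 10 ms hop (illustrative)
N_MELS = 40       # number of Mel filters (illustrative)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters with centers evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(signal):
    # Frame the signal, window each frame, take the power spectrum,
    # and project it onto the Mel filterbank.
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([signal[t * HOP:t * HOP + N_FFT] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=N_FFT)) ** 2
    return mel_filterbank() @ power.T          # shape: (N_MELS, n_frames)

def phoneme_to_frames(start_s, end_s):
    # Map a phoneme's time span (seconds) to spectrogram columns,
    # i.e. the x-extent of its detection box.
    return int(start_s * SR // HOP), int(end_s * SR // HOP)

if __name__ == "__main__":
    audio = np.random.randn(SR)                # 1 s of noise as a stand-in
    spec = mel_spectrogram(audio)
    print(spec.shape)                          # (40, 97)
    print(phoneme_to_frames(0.10, 0.25))       # (10, 25)
```

With these settings, one second of 16 kHz audio yields a 40-band spectrogram with 97 frames, and a phoneme spanning 0.10–0.25 s maps to columns 10–25; the vertical extent of the box spans all Mel bands.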
Acknowledgements
The authors express their gratitude to the Federal University of Maranhão, especially to the InovTec Lab.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
About this article
Cite this article
Pereira, B.V.L., de Carvalho, M.B.F., Alves, P.A.A.S.A.N. et al. Automatic phoneme recognition by deep neural networks. J Supercomput 80, 16654–16678 (2024). https://doi.org/10.1007/s11227-024-06098-6