Abstract
One of the central problems in digital speech-signal processing is distinguishing segments of active speech from segments of background noise or silence in an input acoustic signal. This problem arises in many practical applications, such as speech analysis in voice-command systems, speech transmission over a network, and automated speech recognition. However, most available systems for automated speech analysis cannot solve this problem efficiently when the signal-to-noise ratio is small; in addition, their parameters must be tuned separately for different noise levels, which prevents fully automated segmentation of noisy speech signals. In this work, we design a system for the automated segmentation of speech signals distorted by additive noise of various types and intensities. The developed system is based on three different deep convolutional neural network models and can efficiently detect speech and silence segments in noisy signals over a wide range of signal-to-noise ratios and noise types.
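The abstract contrasts the proposed learned detectors with classical segmentation methods whose thresholds must be retuned per noise level. As a point of reference, the sketch below implements such a classical short-term-energy baseline for the frame-level speech/silence decision. It is not the paper's CNN-based system (whose architecture is not described here); the frame length, hop size, and threshold are illustrative assumptions.

```python
import numpy as np

def frame_energy_vad(signal, frame_len=400, hop=160, threshold_db=-30.0):
    """Classify each frame as active (True) or silence (False) by comparing
    its short-term log energy to a fixed threshold relative to the loudest
    frame. This is the classical baseline that degrades at low SNR, which
    motivates learned (CNN-based) detectors."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.sum(frame ** 2) + 1e-12  # avoid log(0)
    log_e = 10.0 * np.log10(energies / energies.max())
    return log_e > threshold_db

# Synthetic example: 1 s of near-silence followed by 1 s of a noisy tone
# standing in for speech (16 kHz sampling rate assumed).
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
silence = 1e-4 * rng.standard_normal(fs)
active = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.standard_normal(fs)
mask = frame_energy_vad(np.concatenate([silence, active]))
# Frames in the first half are flagged False (silence), the rest True.
```

The fixed `threshold_db` is exactly the parameter that must be retuned for each noise level; a learned classifier replaces this hand-set decision rule with one trained across noise types and SNRs.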
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Translated by L. Kartvelishvili
Protserov, S.D., Shishkin, A.G. Segmentation of Noisy Speech Signals. Sci. Tech. Inf. Proc. 49, 356–363 (2022). https://doi.org/10.3103/S0147688222050100