Segmentation of Noisy Speech Signals

Published in: Scientific and Technical Information Processing

Abstract

One of the most important problems in digital speech-signal processing is distinguishing segments of active speech from segments of background noise or silence in an input acoustic signal. The problem arises in many practical applications, such as speech analysis in voice command systems, speech transmission over a network, and automatic speech recognition. However, most available systems for automated speech analysis cannot solve it efficiently when the signal-to-noise ratio is low, and their parameters must be tuned separately for different noise levels, which prevents fully automated segmentation of noisy speech signals. In this work, we design a system for the automated segmentation of speech signals distorted by additive noise of various types and intensities. The system is based on three different deep convolutional neural network models and efficiently detects speech and silence segments in noisy signals over a wide range of signal-to-noise ratios and noise types.
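The full text is behind the subscription wall, so the three network architectures themselves are not reproduced on this page. As a rough illustration of two standard ingredients the abstract refers to, the sketch below shows (a) how clean speech is typically corrupted with additive noise at a prescribed signal-to-noise ratio and (b) a minimal frame-level convolutional speech/silence classifier over log-mel features. This is a generic sketch under assumed conventions: mix_at_snr, FrameVAD, and the layer sizes are hypothetical and do not reproduce the authors' models.

    import numpy as np
    import torch
    import torch.nn as nn

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Corrupt `speech` with additive `noise` scaled to a target SNR in dB."""
        # Tile or trim the noise so it covers the whole utterance.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[: len(speech)]
        # Pick the gain so that 10*log10(P_speech / P_noise) equals snr_db.
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise

    class FrameVAD(nn.Module):
        """Generic per-frame speech/silence classifier; a hypothetical
        stand-in, not any of the paper's three CNN models."""
        def __init__(self, n_mels: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(128, 128, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(128, 1, kernel_size=1),  # one logit per frame
            )

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # mel: (batch, n_mels, n_frames) -> speech logits (batch, n_frames)
            return self.net(mel).squeeze(1)

Training such a classifier on mixtures generated over a wide range of SNRs and several noise types, then thresholding the per-frame logits, is the usual route to the kind of noise-robust speech/silence segmentation the abstract describes.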



Author information

Correspondence to S. D. Protserov or A. G. Shishkin.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by L. Kartvelishvili

About this article

Cite this article

Protserov, S.D., Shishkin, A.G. Segmentation of Noisy Speech Signals. Sci. Tech. Inf. Proc. 49, 356–363 (2022). https://doi.org/10.3103/S0147688222050100
