Deep Learning Framework for Speaker Verification Under Multi Sensor, Multi Lingual and Multi Session Conditions

  • Conference paper
  • First Online:
ICT for Intelligent Systems (ICTIS 2023)

Abstract

Deep learning algorithms have largely supplanted i-vectors in numerous applications such as speaker recognition, diarization, segmentation, and verification. CNNs fed directly with raw speech fragments have produced promising results. Instead of relying on conventional hand-crafted features, these CNNs learn low-level speech representations from waveforms, which allows the network to better capture crucial narrow-band speaker properties such as pitch and formants; these narrow-band features are less susceptible to surrounding noise. The design of the neural network is critical to achieving this. This study proposes the use of SincNet, a CNN architecture that encourages the first convolutional layer to discover more meaningful filters. SincNet is built on parameterized sinc functions that implement band-pass filters. A standard CNN learns every element of each filter, whereas SincNet learns only the low and high cutoff frequencies directly from the raw data. This provides a highly compact and efficient way to derive a custom filter bank precisely tuned to the application at hand. Our experiments on the speaker verification task show that the proposed architecture is faster and performs better than a standard CNN.
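To make this parameterization concrete, below is a minimal sketch of a SincNet-style first convolutional layer in PyTorch. This is an illustrative reconstruction based on the description above, not the authors' implementation; the filter count, kernel length, initialization, and normalization are assumptions.

```python
import torch
import torch.nn as nn

class SincConv1d(nn.Module):
    """Band-pass filters built from parameterized sinc functions.

    Only each filter's low cutoff and bandwidth are learned, rather than
    every tap of a conventional convolutional kernel.
    """

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable parameters: low cutoff f1 and bandwidth (f2 - f1) per filter.
        # Initial values here are illustrative, spread across the spectrum.
        low = torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels)
        self.low_hz = nn.Parameter(low.unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        # Fixed (non-learned) pieces: symmetric time axis in seconds and a
        # Hamming window to truncate the ideal infinite sinc response.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("t", (n / sample_rate).unsqueeze(0))
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, samples) raw waveform
        f1 = torch.abs(self.low_hz)
        f2 = torch.clamp(f1 + torch.abs(self.band_hz), max=self.sample_rate / 2)
        # Ideal band-pass = difference of two low-pass sinc filters.
        low_pass1 = 2 * f1 * torch.special.sinc(2 * f1 * self.t)
        low_pass2 = 2 * f2 * torch.special.sinc(2 * f2 * self.t)
        filters = (low_pass2 - low_pass1) * self.window
        # Normalize so each filter has unit peak response at its center tap.
        filters = filters / (2 * (f2 - f1) + 1e-8)
        return nn.functional.conv1d(
            x, filters.unsqueeze(1), padding=self.kernel_size // 2
        )

# Usage: 80 learned band-pass filters over a batch of 1-second, 16 kHz chunks.
layer = SincConv1d()
features = layer(torch.randn(4, 1, 16000))  # -> (4, 80, 16000)
```

The compactness claim follows directly from the parameter count: a conventional first layer with 80 filters of length 251 would learn 80 × 251 = 20,080 weights, while a sinc-parameterized layer like the sketch above learns only two values per filter, 160 parameters in total.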



Author information

Corresponding author

Correspondence to Pratham Sanshi.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sanshi, P., Kuruvalli, L.R., Chikkamath, S., Nirmala, R.S. (2023). Deep Learning Framework for Speaker Verification Under Multi Sensor, Multi Lingual and Multi Session Conditions. In: Choudrie, J., Mahalle, P.N., Perumal, T., Joshi, A. (eds) ICT for Intelligent Systems. ICTIS 2023. Smart Innovation, Systems and Technologies, vol 361. Springer, Singapore. https://doi.org/10.1007/978-981-99-3982-4_17
