Abstract
Voice command-based human-computer interaction (HCI) is becoming more useful and practical day by day. Here, we present an open-source voice-command-based speech interaction system that provides hands-free mouse and keyboard interaction without any active internet connection. The usefulness of the application is demonstrated by evaluating it thoroughly, keeping in mind both motor-disabled users and able-bodied users. Several participants from different age groups evaluated the system and found that it worked reliably, allowing them to complete tasks with voice commands alone, without using a mouse or keyboard. In this research, we identify common voice tokens a person would speak to accomplish a human-computer interaction, and then program those tokens to work with major speech recognition platforms such as CMU PocketSphinx, DeepSpeech, and VOSK. Each platform was evaluated on detection rate, accuracy, inference time, CPU usage, system memory usage, and the accuracy achieved by users of different age groups. In the results section, we show that with our proposed system the VOSK speech recognition platform outperformed the other compared platforms, achieving a 91% successful task completion rate in real-time applications.
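The core idea described above — mapping a fixed set of recognized voice tokens to mouse and keyboard actions — can be sketched as follows. This is a minimal illustration only: the token names, handler functions, and returned action strings are assumptions for demonstration, not the paper's actual implementation, and a real system would call into an input-automation library and take its input from an offline recognizer such as VOSK.

```python
# Hypothetical sketch: a registry that maps recognized voice tokens to
# mouse/keyboard actions. Token names and handlers are illustrative only.

ACTIONS = {}

def command(token):
    """Decorator that registers a handler for a recognized voice token."""
    def register(fn):
        ACTIONS[token] = fn
        return fn
    return register

@command("left click")
def left_click():
    # In a real system this would trigger a mouse click via an
    # input-automation library.
    return "mouse: left click"

@command("scroll down")
def scroll_down():
    return "mouse: scroll down"

@command("press enter")
def press_enter():
    return "keyboard: enter"

def dispatch(recognized_text):
    """Normalize the recognizer's transcription and run the matching action.

    Unrecognized phrases are ignored (returns None) rather than acted on,
    so stray speech does not produce spurious input events.
    """
    handler = ACTIONS.get(recognized_text.strip().lower())
    if handler is None:
        return None
    return handler()
```

In such a design, an offline recognizer's transcription loop would simply pass each final hypothesis to `dispatch()`; keeping the grammar restricted to the registered tokens is also what lets keyword-oriented engines like PocketSphinx and VOSK run with small vocabularies offline.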
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Fuad, A.M., Ahmed, S.J., Anannya, N.J., Mridha, M.F., Nur, K. (2024). An Open-Source Voice Command-Based Human-Computer Interaction System Using Speech Recognition Platforms. In: Arefin, M.S., Kaiser, M.S., Bhuiyan, T., Dey, N., Mahmud, M. (eds) Proceedings of the 2nd International Conference on Big Data, IoT and Machine Learning. BIM 2023. Lecture Notes in Networks and Systems, vol 867. Springer, Singapore. https://doi.org/10.1007/978-981-99-8937-9_36
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8936-2
Online ISBN: 978-981-99-8937-9