Deep Learning-Based Hand Gesture Recognition System and Design of a Human–Machine Interface

Abstract

Hand gesture recognition plays an important role in developing effective human–machine interfaces (HMIs) that enable direct communication between humans and machines. In real-time scenarios, however, it is difficult to identify the correct hand gesture to control an application while the hands are moving. To address this issue, this work presents a low-cost, real-time human–computer interface (HCI) based on hand gesture recognition. The system consists of six stages: (1) hand detection, (2) gesture segmentation, (3) feature extraction and gesture classification using five pre-trained convolutional neural network (CNN) models and a vision transformer (ViT), (4) building an interactive human–machine interface (HMI), (5) development of a gesture-controlled virtual mouse, and (6) smoothing of the virtual mouse pointer using a Kalman filter. Five pre-trained CNN models (VGG16, VGG19, ResNet50, ResNet101, and Inception-V1) and a ViT have been employed to classify hand gesture images, and two multi-class datasets (one public and one custom) have been used to validate the models. Comparing the models' performances, Inception-V1 shows significantly better classification performance than the other four CNN models and the ViT in terms of accuracy, precision, recall, and F-score. We have also extended the system to control several multimedia applications (such as the VLC player, an audio player, and the 2D Super Mario Bros game) with different customized gesture commands in real-time scenarios. The average speed of the system reaches 25 fps (frames per second), which meets the requirements of real-time operation, and the average response time of each gesture control is on the order of milliseconds, making the system suitable for real-time use. This prototype will benefit physically disabled people interacting with desktops.

Data Availability

We confirm that the dataset will be made available on reasonable request.

References

  1. Berezhnoy V, Popov D, Afanasyev I, Mavridis N (2018) The hand-gesture-based control interface with wearable glove system. In: ICINCO (2), pp 458–465

  2. Abhishek KS, Qubeley LCF, Ho D (2016) Glove-based hand gesture recognition sign language translator using capacitive touch sensor. In: 2016 IEEE international conference on electron devices and solid-state circuits (EDSSC), IEEE, pp 334–337

  3. Liao C-J, Su S-F, Chen M-C (2015) Vision-based hand gesture recognition system for a dynamic and complicated environment. In: 2015 IEEE international conference on systems, man, and cybernetics, pp 2891–2895. https://doi.org/10.1109/SMC.2015.503

  4. Al Farid F, Hashim N, Abdullah J, Bhuiyan MR, Shahida Mohd Isa WN, Uddin J, Haque MA, Husen MN (2022) A structured and methodological review on vision-based hand gesture recognition system. J Imaging 8(6):153

  5. Mantecón T, del Blanco CR, Jaureguizar F, García N (2016) Hand gesture recognition using infrared imagery provided by leap motion controller. In: International conference on advanced concepts for intelligent vision systems, Springer, pp 47–57

  6. Huang D-Y, Hu W-C, Chang S-H (2011) Gabor filter-based hand-pose angle estimation for hand gesture recognition under varying illumination. Expert Syst Appl 38(5):6031–6042

  7. Singha J, Roy A, Laskar RH (2018) Dynamic hand gesture recognition using vision-based approach for human-computer interaction. Neural Comput Appl 29(4):1129–1141

  8. Yang Z, Li Y, Chen W, Zheng Y (2012) Dynamic hand gesture recognition using hidden markov models. In: 2012 7th international conference on computer science & education (ICCSE), IEEE, pp 360–365

  9. Yingxin X, Jinghua L, Lichun W, Dehui K (2016) A robust hand gesture recognition method via convolutional neural network. In: 2016 6th international conference on digital home (ICDH), IEEE, pp 64–67

  10. Oyedotun OK, Khashman A (2017) Deep learning in vision-based static hand gesture recognition. Neural Comput Appl 28(12):3941–3951

  11. Fang W, Ding Y, Zhang F, Sheng J (2019) Gesture recognition based on CNN and DCGAN for calculation and text output. IEEE Access 7:28230–28237

  12. Adithya V, Rajesh R (2020) A deep convolutional neural network approach for static hand gesture recognition. Proc Comput Sci 171:2353–2361

  13. Neethu P, Suguna R, Sathish D (2020) An efficient method for human hand gesture detection and recognition using deep learning convolutional neural networks. Soft Comput 24:15239–15248

  14. Sen A, Mishra TK, Dash R (2022) A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network. Multimed Tools Appl 81(28):40043–40066

  15. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  16. Godoy RV, Lahr GJ, Dwivedi A, Reis TJ, Polegato PH, Becker M, Caurin GA, Liarokapis M (2022) Electromyography-based, robust hand motion classification employing temporal multi-channel vision transformers. IEEE Robot Autom Lett 7(4):10200–10207

  17. Montazerin M, Zabihi S, Rahimian E, Mohammadi A, Naderkhani F (2022) ViT-HGR: vision transformer-based hand gesture recognition from high density surface EMG signals. arXiv preprint arXiv:2201.10060

  18. Rautaray SS, Agrawal A (2010) A novel human computer interface based on hand gesture recognition using computer vision techniques. In: Proceedings of the first international conference on intelligent interactive technologies and multimedia, pp 292–296

  19. Kim K-S, Jang D-S, Choi H-I (2007) Real time face tracking with pyramidal lucas-kanade feature tracker. In: Computational science and its applications–ICCSA 2007: international conference, Kuala Lumpur, Malaysia, August 26-29, 2007. Proceedings, Part I 7, Springer, pp 1074–1082

  20. Paliwal M, Sharma G, Nath D, Rathore A, Mishra H, Mondal S (2013) A dynamic hand gesture recognition system for controlling vlc media player. In: 2013 international conference on advances in technology and engineering (ICATE), IEEE, pp 1–4

  21. Shibly KH, Dey SK, Islam MA, Showrav SI (2019) Design and development of hand gesture based virtual mouse. In: 2019 1st international conference on advances in science, engineering and robotics technology (ICASERT), IEEE, pp 1–5

  22. Tsai T-H, Huang C-C, Zhang K-L (2020) Design of hand gesture recognition system for human-computer interaction. Multimed Tools Appl 79(9):5989–6007

  23. Xu P (2017) A real-time hand gesture recognition and human-computer interaction system. arXiv preprint arXiv:1704.07296

  24. Kim Y, Bang H (2018) Introduction to kalman filter and its applications. In: F. Govaers (Ed.), Introduction and Implementations of the Kalman Filter, IntechOpen, Rijeka, Ch. 2. https://doi.org/10.5772/intechopen.80600

  25. Chen Z-h, Kim J-T, Liang J, Zhang J, Yuan Y-B (2014) Real-time hand gesture recognition using finger segmentation. Sci World J. https://doi.org/10.1155/2014/267872

  26. Jamil N, Sembok TMT, Bakar ZA (2008) Noise removal and enhancement of binary images using morphological operations. In: 2008 international symposium on information technology, vol 4, IEEE, pp 1–6

  27. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  28. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  29. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  30. Vlc-ctrl. https://pypi.org/project/vlc-ctrl/

  31. Audioplayer. https://pypi.org/project/audioplayer/

  32. Kauten C (2018) Super Mario Bros for OpenAI Gym, GitHub

  33. Asaari MSM, Suandi SA (2010) Hand gesture tracking system using adaptive Kalman filter. In: 2010 10th international conference on intelligent systems design and applications, IEEE, pp 166–171

  34. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 28

  35. Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10781–10790

  36. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255

  37. Bazi Y, Bashmal L, Rahhal MMA, Dayil RA, Ajlan NA (2021) Vision transformers for remote sensing image classification. Remote Sens 13(3):516

Author information

Corresponding author

Correspondence to Abir Sen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Appendix

1.1 A.1 VGG16

The VGG16 architecture is a variant of the VGGNet [27] model, consisting of 13 convolutional layers with \((3 \times 3)\) kernels and ReLU activation functions. Each convolutional block is followed by a max-pooling layer with a \((2 \times 2)\) filter. Finally, the FC layers with a softmax activation function produce the final output class label. In this architecture, the depth of the network is increased by adding more convolution and max-pooling layers. The network is trained on the large-scale ImageNet [36] dataset, which consists of millions of images spanning more than 20,000 class labels and was developed for the large-scale visual recognition challenge. VGG16 reported a top-5 test accuracy of 92.7% on the ILSVRC-2012 ImageNet data.
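
A minimal sketch (not the authors' code), assuming PyTorch/torchvision, of how a pre-trained VGG16 backbone can be adapted to a hand-gesture dataset by replacing its final FC layer; the class count of 10 is a hypothetical placeholder.

```python
# Sketch: adapt ImageNet-pre-trained VGG16 to a gesture classification task.
import torch
import torch.nn as nn
from torchvision import models

NUM_GESTURE_CLASSES = 10  # hypothetical; set to the actual number of gesture classes

# Load VGG16 with ImageNet weights: 13 conv layers + FC head, 224x224 input.
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Optionally freeze the convolutional feature extractor.
for p in vgg16.features.parameters():
    p.requires_grad = False

# Replace the last FC layer (1000 ImageNet classes) with a gesture classifier.
vgg16.classifier[6] = nn.Linear(vgg16.classifier[6].in_features, NUM_GESTURE_CLASSES)

# Forward pass on a dummy batch of 224x224 RGB images.
dummy = torch.randn(1, 3, 224, 224)
logits = vgg16(dummy)                 # shape: (1, NUM_GESTURE_CLASSES)
probs = torch.softmax(logits, dim=1)  # softmax over gesture classes
```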

1.2 A.2 VGG19

VGG19 is also a variant of the VGGNet network [27]. It comprises 16 convolutional layers with a \((3 \times 3)\) kernel/filter size, each block followed by a max-pooling layer with a \((2 \times 2)\) filter, and three dense layers. The final FC layer uses a softmax function to deliver the predicted class label. Trained on the ImageNet [36] dataset, this architecture achieved second rank in the ILSVRC-2014 classification challenge. The model has a default input size of \((224 \times 224)\).
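
As a minimal sketch, assuming torchvision, the \((224 \times 224)\) input expected by the VGG-style backbones can be prepared as follows; the file path is a placeholder, and the normalization values are the standard ImageNet statistics.

```python
# Sketch: resize and normalize a segmented gesture image for a 224x224 backbone.
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                    # default VGG/ResNet input size
    transforms.ToTensor(),                            # HWC [0, 255] -> CHW [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("gesture.png").convert("RGB")  # placeholder path
batch = preprocess(img).unsqueeze(0)            # shape: (1, 3, 224, 224)
```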

1.3 A.3 Inception-V1

Inception-V1, or GoogLeNet [29], is a powerful 22-layer CNN architecture built from inception modules. Each module applies three independent convolutional filters in parallel, of sizes \((1 \times 1)\), \((3 \times 3)\), and \((5 \times 5)\); a \((1 \times 1)\) filter is placed before the \((3 \times 3)\) and \((5 \times 5)\) convolutional filters for dimension reduction. The module also includes one max-pooling layer with a \((3 \times 3)\) pool size. The outputs of the \((1 \times 1)\), \((3 \times 3)\), and \((5 \times 5)\) convolutional branches are concatenated and form the input to the next layer. The last part is an FC layer with a softmax function that produces the final predicted output. The input size of this model is \((224 \times 224)\). The architecture is trained on the ImageNet [36] dataset and reported a top-5 error of 6.67% in the ILSVRC-2014 challenge.
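
A minimal sketch of such an inception module in PyTorch (an illustration, not the authors' implementation); the branch channel counts are examples, not the exact GoogLeNet configuration.

```python
# Sketch: one inception module with parallel 1x1, 3x3, 5x5, and pooling branches.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)         # 1x1 branch
        self.branch3 = nn.Sequential(                               # 1x1 reduce -> 3x3
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                               # 1x1 reduce -> 5x5
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                           # 3x3 max-pool -> 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))  # -> (1, 256, 28, 28)
```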

1.4 A.4 ResNet50

The residual neural network (ResNet) [28] was developed by Microsoft Research. ResNet50 consists of 50 deep layers: 48 convolutional layers, one max-pooling layer, and a global average-pooling layer connected on top of the final residual block, followed by a dense layer with softmax activation that generates the final output class. The network has an input size of \((224 \times 224)\). The backbone of this architecture is the residual block, in which the output of one layer is added to a deeper layer in the block through so-called skip connections or shortcuts. This design also reduces the vanishing and exploding gradient problems during training. The ResNet architecture was trained on the ImageNet dataset [36] and achieved a top-5 error of 3.57% in the ILSVRC-2015 challenge.
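
A minimal sketch of the skip-connection idea in PyTorch; this is a simplified basic block for illustration, not the exact bottleneck block used inside ResNet50.

```python
# Sketch: a residual block adds its input to its convolutional output (shortcut).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # the "shortcut" path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # skip connection: add input to output
        return self.relu(out)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape preserved: (1, 64, 56, 56)
```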

1.5 A.5 ResNet101

The ResNet101 model consists of 101 deep layers. Like ResNet50, this architecture is also based on the residual building block. In our experiment, we have loaded the pre-trained version of this architecture, trained on the ImageNet dataset [36], which comprises millions of images. The model's default input image size is \((224 \times 224)\).
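
A minimal sketch, assuming torchvision, of loading ImageNet-pre-trained versions of the five CNN backbones considered in this work; torchvision's GoogLeNet is used here as the counterpart of Inception-V1.

```python
# Sketch: load ImageNet-pre-trained weights for the five CNN backbones.
from torchvision import models

backbones = {
    "VGG16":        models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1),
    "VGG19":        models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1),
    "ResNet50":     models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1),
    "ResNet101":    models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1),
    "Inception-V1": models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1),
}

for name, net in backbones.items():
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters, default input 224 x 224")
```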

1.6 A.6 Vision Transformer

A standard transformer architecture consists of two components: (1) a stack of encoders and (2) a decoder. ViT, however, does not require the decoder part and contains only the encoder. In a Vision Transformer, the image is first split into fixed-size patches, and each patch passes through the patch-embedding phase, in which it is flattened into a one-dimensional vector. After patch embedding, positional embeddings are added to the patches to retain positional information about the image patches in the sequence. Next, the sequence is passed to the transformer encoder. The transformer encoder [37] comprises two components: (1) a multi-head self-attention (MHSA) block and (2) an MLP (multilayer perceptron). The MHSA block splits the inputs into several heads so that each head can learn a different level of self-attention; the outputs of the attention heads are then concatenated and delivered to the MLP. Finally, the classification task is performed by the MLP layer.
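
A minimal sketch, assuming PyTorch, of the patch-embedding step described above; the patch size and embedding dimension follow the common ViT-Base settings and are illustrative, and the ViT class token is omitted for brevity.

```python
# Sketch: split an image into patches, project them, and add positional embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits, flattens, and linearly projects the patches.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, 14, 14) for 224 / 16
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x + self.pos_embed         # add positional information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # -> (1, 196, 768)
# The token sequence is then fed to a stack of encoder blocks
# (multi-head self-attention + MLP), e.g. torch.nn.TransformerEncoder.
```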

B Appendix

1.1 B.1 Statistical Hypothesis Testing

We have also performed a statistical analysis to check the statistical significance of our model. For Experiment-1 (Sect. 4.4), we have conducted a one-sample t-test with the help of the IBM SPSS statistical analysis tool. Under the null hypothesis, we assume that our model is not statistically significant.

To obtain the value of t, the following formula is used:

\( t=\frac{\overline{X}-\mu }{SD/\sqrt{k}} \)

Table 13 One sample statistics
Table 14 One-sample T-test result

where \(\overline{X}\) is the sample mean, \(\mu \) is the test value, SD is the sample standard deviation, and k is the sample size.

To obtain the value of \(\overline{X}\), we first applied a ten-fold cross-validation strategy on Dataset-1 with the Inception-V1 model (the best-performing model in Experiment-1, Sect. 4.4), calculated the fold-wise accuracies, and then computed their average, which is taken as the sample mean.

Table 13 shows that the sample mean (\(\overline{X}\)), sample size (k), test value (\(\mu \)), and standard deviation (SD) are 99.83, 10, 99, and 0.2907, respectively; the complete one-sample t-test analysis is reported in Table 14.

The results in Table 14 show that the p-value is less than 0.001. The p-value is used in hypothesis testing to determine whether there is evidence to reject the null hypothesis.

If p < \(\alpha \), where \(\alpha \) (the significance level) is 0.05, then the null hypothesis is rejected.

In Table 14, the p-value is much smaller than \(\alpha \), so the null hypothesis is rejected, and we can say that the mean accuracy differs from the test value in a statistically significant way.
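
As a cross-check, here is a minimal sketch (assuming SciPy, not the authors' SPSS workflow) that reproduces the t-statistic and two-tailed p-value from the summary statistics reported in Table 13.

```python
# Sketch: one-sample t-test from the reported summary statistics
# (mean 99.83, SD 0.2907, k = 10 folds, test value 99).
from math import sqrt
from scipy import stats

x_bar, mu, sd, k = 99.83, 99.0, 0.2907, 10

t_value = (x_bar - mu) / (sd / sqrt(k))           # t = (X_bar - mu) / (SD / sqrt(k))
p_value = 2 * stats.t.sf(abs(t_value), df=k - 1)  # two-tailed p-value, df = k - 1

print(f"t = {t_value:.2f}, p = {p_value:.6f}")    # p << 0.05 -> reject the null hypothesis
```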

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sen, A., Mishra, T.K. & Dash, R. Deep Learning-Based Hand Gesture Recognition System and Design of a Human–Machine Interface. Neural Process Lett 55, 12569–12596 (2023). https://doi.org/10.1007/s11063-023-11433-8
