
A resource-efficient partial 3D convolution for gesture recognition

  • Research
  • Published: 2024
  • Journal of Real-Time Image Processing

Abstract

3D convolutional neural networks (3DCNNs) have shown impressive capabilities in extracting spatiotemporal features from videos. In practical applications, however, the large number of trainable parameters in most 3DCNN models leads to high inference latency. Many models try to improve computational speed by reducing the number of floating-point operations, but this alone does not reliably reduce latency. This paper therefore proposes a partial 3D convolution that extracts features from only a portion of the channels, reducing memory access and latency. In addition, structural reparameterization is applied to simplify the inference-time structure of the partial convolution. The proposed convolution can readily replace regular convolutions and depthwise convolutions in existing models. Evaluated on three datasets, Jester, EgoGesture, and NvGesture, the proposed partial 3D convolution offers the following highlights: (i) low memory access, (ii) significantly lower latency than other models, and (iii) almost unchanged accuracy. For example, when the regular convolutions of ResNeXt101 and ResNeXt50 are replaced with the proposed partial convolutions, GPU runtime latency is reduced by 29.6% and 29.9%, respectively, with almost no change in accuracy. Computational complexity is also reduced.
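The mechanism summarized above can be illustrated with a short sketch. The following PyTorch module is a minimal, illustrative rendering of the idea described in the abstract: only a fraction of the input channels is convolved, and the remaining channels are passed through untouched, which is what cuts memory access and floating-point operations. The class name Partial3DConv and the partial_ratio hyperparameter are assumptions made for this sketch, and the structural reparameterization step mentioned above (merging training-time branches into a single convolution at inference) is not shown; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class Partial3DConv(nn.Module):
    """Sketch of a partial 3D convolution: convolve a subset of channels,
    forward the rest unchanged (class name and split ratio are illustrative)."""

    def __init__(self, channels: int, kernel_size: int = 3, partial_ratio: float = 0.25):
        super().__init__()
        # Channels that actually pass through the k x k x k convolution.
        self.conv_channels = max(1, int(channels * partial_ratio))
        self.pass_channels = channels - self.conv_channels
        self.conv = nn.Conv3d(
            self.conv_channels,
            self.conv_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        conv_part, identity_part = torch.split(
            x, [self.conv_channels, self.pass_channels], dim=1
        )
        # Only the first slice of channels pays for spatiotemporal filtering;
        # the remaining channels are concatenated back unchanged.
        return torch.cat((self.conv(conv_part), identity_part), dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 64, 16, 56, 56)   # a small video feature map
    y = Partial3DConv(channels=64)(x)
    print(y.shape)                        # torch.Size([2, 64, 16, 56, 56])
```

In this form the module keeps the same input and output channel width as a regular Conv3d, which matches the abstract's claim that the partial convolution can substitute for regular and depthwise convolutions in existing backbones such as ResNeXt50 and ResNeXt101.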


Data availability

All data used in the experiments are publicly available online.


Funding

This work received no funding.

Author information


Contributions

G.C.; Methodology, G.C.; Validation, G.C.; Investigation, Z.D., and J.W.; Writing—original draft, G.C.; Writing—review & editing, G.C.; Supervision, J.W., and L.X. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Gongzheng Chen.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file 1 (GIF 12496 KB)

Appendix


figure a

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, G., Dong, Z., Wang, J. et al. A resource-efficient partial 3D convolution for gesture recognition. J Real-Time Image Proc 21, 132 (2024). https://doi.org/10.1007/s11554-024-01509-6
