Abstract
3D CNNs have shown impressive capabilities in extracting spatiotemporal features from videos. In practical applications, however, the large number of trainable parameters in most 3D CNN models leads to high inference latency. Many models attempt to improve computational speed by reducing the number of floating-point operations, but this alone does not necessarily reduce latency. This paper therefore proposes a partial 3D convolution that extracts features from only a portion of the input channels, reducing both memory access and latency. In addition, structural reparameterization is applied to simplify the inference-time structure of the partial convolution. The proposed convolution can readily replace regular and depthwise convolutions in existing models. Evaluated on three datasets, Jester, EgoGesture, and NvGesture, the proposed partial 3D convolution demonstrates the following highlights: (i) low memory access, (ii) significantly lower latency than other models, and (iii) almost unchanged accuracy. For example, when the regular convolutions of ResNeXt101 and ResNeXt50 are replaced with the proposed partial convolutions on a GPU, runtime latency is reduced by 29.6% and 29.9%, respectively, with almost no change in accuracy. Computational complexity is also reduced.
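The channel-splitting idea behind partial convolution can be sketched in plain Python. The sketch below is illustrative only, not the paper's implementation: it applies a single-channel 3×3×3 convolution with zero ("same") padding to the first ⌈rC⌉ channels and passes the remaining channels through untouched, so memory traffic and arithmetic scale with the partial ratio r rather than with the full channel count.

```python
import math

def conv3d_same(x, k):
    """Single-channel 3D convolution with a 3x3x3 kernel and zero ('same') padding.
    x: T x H x W nested lists of floats; k: 3 x 3 x 3 kernel."""
    T, H, W = len(x), len(x[0]), len(x[0][0])

    def at(t, h, w):
        # Zero padding: out-of-bounds taps contribute 0.
        if 0 <= t < T and 0 <= h < H and 0 <= w < W:
            return x[t][h][w]
        return 0.0

    return [[[sum(at(t + dt - 1, h + dh - 1, w + dw - 1) * k[dt][dh][dw]
                  for dt in range(3) for dh in range(3) for dw in range(3))
              for w in range(W)]
             for h in range(H)]
            for t in range(T)]

def partial_conv3d(channels, kernels, ratio=0.25):
    """Partial 3D convolution: convolve only the first ceil(ratio * C) channels,
    pass the remaining channels through as an identity branch (no extra memory
    reads or writes for them)."""
    cp = math.ceil(ratio * len(channels))
    convolved = [conv3d_same(c, k) for c, k in zip(channels[:cp], kernels)]
    return convolved + channels[cp:]
```

With a delta kernel (1 at the center, 0 elsewhere) the convolved channels reproduce their inputs, which makes the channel split easy to verify; in a real network the identity branch is what cuts memory access relative to convolving all channels.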
Data availability
All data used in the experiments are publicly available online.
Funding
No funding was received for this work.
Author information
Contributions
G.C.; Methodology, G.C.; Validation, G.C.; Investigation, Z.D., and J.W.; Writing—original draft, G.C.; Writing—review & editing, G.C.; Supervision, J.W., and L.X. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Appendix
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, G., Dong, Z., Wang, J. et al. A resource-efficient partial 3D convolution for gesture recognition. J Real-Time Image Proc 21, 132 (2024). https://doi.org/10.1007/s11554-024-01509-6