Abstract
3D CNNs have shown impressive capabilities in extracting spatiotemporal features from videos. In practical applications, however, the large number of trainable parameters in most 3D CNN models leads to high inference latency. Many models attempt to improve computational speed by reducing the number of floating-point operations, but this alone does not necessarily reduce latency. This paper therefore proposes a partial 3D convolution that extracts features from only a portion of the input channels, reducing both memory access and latency. In addition, structural reparameterization is applied to simplify the inference-time structure of the partial convolution. The proposed convolution can readily replace regular and depthwise convolutions in existing models. Evaluated on three datasets, Jester, EgoGesture, and NvGesture, the proposed partial 3D convolution demonstrates the following highlights: (i) low memory access, (ii) significantly lower latency than other models, and (iii) almost unchanged accuracy. For example, when the regular convolutions of ResNeXt101 and ResNeXt50 are replaced with the proposed partial convolutions on a GPU, runtime latency is reduced by 29.6% and 29.9%, respectively, with almost no change in accuracy. Computational complexity is also reduced.
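The channel-splitting idea behind partial convolution can be sketched in plain Python. The sketch below is illustrative only, not the paper's implementation: it applies a single-channel 3×3×3 convolution with zero ("same") padding to the first ⌈rC⌉ channels and passes the remaining channels through untouched, so memory traffic and arithmetic scale with the partial ratio r rather than with the full channel count.

```python
import math

def conv3d_same(x, k):
    """Single-channel 3D convolution with a 3x3x3 kernel and zero ('same') padding.
    x: T x H x W nested lists of floats; k: 3 x 3 x 3 kernel."""
    T, H, W = len(x), len(x[0]), len(x[0][0])

    def at(t, h, w):
        # Zero padding: out-of-bounds taps contribute 0.
        if 0 <= t < T and 0 <= h < H and 0 <= w < W:
            return x[t][h][w]
        return 0.0

    return [[[sum(at(t + dt - 1, h + dh - 1, w + dw - 1) * k[dt][dh][dw]
                  for dt in range(3) for dh in range(3) for dw in range(3))
              for w in range(W)]
             for h in range(H)]
            for t in range(T)]

def partial_conv3d(channels, kernels, ratio=0.25):
    """Partial 3D convolution: convolve only the first ceil(ratio * C) channels,
    pass the remaining channels through as an identity branch (no extra memory
    reads or writes for them)."""
    cp = math.ceil(ratio * len(channels))
    convolved = [conv3d_same(c, k) for c, k in zip(channels[:cp], kernels)]
    return convolved + channels[cp:]
```

With a delta kernel (1 at the center, 0 elsewhere) the convolved channels reproduce their inputs, which makes the channel split easy to verify; in a real network the identity branch is what cuts memory access relative to convolving all channels.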
Data availability
All data used in the experiments are publicly available online.
Funding
No funding was received for this work.
Author information
Contributions
G.C.; Methodology, G.C.; Validation, G.C.; Investigation, Z.D., and J.W.; Writing—original draft, G.C.; Writing—review & editing, G.C.; Supervision, J.W., and L.X. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Appendix
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, G., Dong, Z., Wang, J. et al. A resource-efficient partial 3D convolution for gesture recognition. J Real-Time Image Proc 21, 132 (2024). https://doi.org/10.1007/s11554-024-01509-6