Dual-stage temporal perception network for continuous sign language recognition

Abstract

Continuous sign language recognition (CSLR) aims to identify a sequence of glosses from a sign language video when only a sentence-level label is provided, i.e., in a weakly supervised manner. In sign language videos, the transitions between actions are naturally fluent, and different glosses, or even the same gloss, correspond to video clips of varying temporal scales. These factors make it challenging to extract complex temporal information effectively. However, most previous deep learning-based CSLR methods adopt temporal modeling with a fixed temporal receptive field, a simple and effective solution that nevertheless copes poorly with video clips of varying temporal scales. To alleviate this problem, we propose a dual-stage temporal perception module (DTPM) that leverages the strengths of both temporal convolutions and transformers and follows a hierarchical structure with dual stages, aiming to capture richer and more comprehensive temporal features. Specifically, each stage of DTPM is composed of two parts: a multi-scale local temporal module (MS-LTM) followed by a set of global–local temporal modules (GLTMs), where each GLTM can be further decomposed into a global temporal relational module (GTRM) and a local temporal relational module (LTRM). At each stage, an MS-LTM is first employed to model multi-scale local temporal relations, and a set of GLTMs is then used to model global temporal relations and strengthen local ones. Finally, we aggregate the output features of all stages to form a video feature representation with rich semantic information. Extensive experiments on three CSLR benchmarks, PHOENIX14 (Koller et al. Comput Vis Image Underst 141:108–125, 2015), PHOENIX14-T (Camgoz et al., in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7784–7793, 2018), and CSL (Huang et al., in: Proceedings of the AAAI conference on artificial intelligence, vol 32, 2018), validate the effectiveness of our proposed method.
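
As a concrete illustration of the architecture described above, the following is a minimal PyTorch-style sketch of what one DTPM stage could look like: a multi-scale local temporal module built from parallel temporal convolutions, followed by global–local blocks that pair self-attention (global) with a depthwise temporal convolution (local). All module names, layer choices, and sizes below are illustrative assumptions, not the published implementation.

import torch
import torch.nn as nn

class MSLTM(nn.Module):
    # Multi-scale local temporal module (illustrative): parallel temporal
    # convolutions with different kernel sizes capture local motion cues at
    # several temporal scales; a 1x1 convolution fuses the branches.
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, k, padding=k // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            ) for k in kernel_sizes
        ])
        self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):  # x: (batch, channels, frames)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class GLTM(nn.Module):
    # Global-local temporal module (illustrative): self-attention stands in
    # for the global temporal relational part (GTRM), and a depthwise
    # temporal convolution strengthens local relations (LTRM).
    def __init__(self, channels, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.local = nn.Conv1d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):  # x: (batch, frames, channels)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]            # global
        h = self.local(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        return x + h                                                 # local

class DTPMStage(nn.Module):
    # One stage: an MS-LTM followed by a set of GLTMs.
    def __init__(self, channels, num_gltm=2):
        super().__init__()
        self.msltm = MSLTM(channels)
        self.gltms = nn.ModuleList([GLTM(channels) for _ in range(num_gltm)])

    def forward(self, x):  # x: (batch, frames, channels)
        x = self.msltm(x.transpose(1, 2)).transpose(1, 2)
        for blk in self.gltms:
            x = blk(x)
        return x

# Toy usage: stack two stages, aggregate their outputs, and produce per-frame
# gloss scores suitable for nn.CTCLoss on sentence-level gloss labels (the
# weak supervision described above). Feature and vocabulary sizes are placeholders.
frames = torch.randn(2, 120, 512)                   # (batch, frames, channels)
stages = nn.ModuleList([DTPMStage(512) for _ in range(2)])
outs, x = [], frames
for stage in stages:
    x = stage(x)
    outs.append(x)
logits = nn.Linear(512, 1296)(sum(outs))            # per-frame gloss scores
log_probs = logits.log_softmax(-1).transpose(0, 1)  # (frames, batch, vocab) for nn.CTCLoss

The choice of summation to aggregate stage outputs and the exact placement of normalization are likewise guesses; they stand in for whatever aggregation and design details the paper actually uses.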

Data availability

The datasets used in this study are publicly available and are described and cited in the manuscript.

References

  1. Adaloglou, N., Chatzis, T., Papastratis, I., Stergioulas, A., Papadopoulos, G.T., Zacharopoulou, V., Xydopoulos, G.J., Atzakas, K., Papazachariou, D., Daras, P.: A comprehensive study on deep learning-based methods for sign language recognition. IEEE Trans. Multimed. 24, 1750–1762 (2021)

  2. Li, H., Gao, L., Han, R., Wan, L., Feng, W.: Key action and joint ctc-attention based sign language recognition. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 2348–2352 (2020). IEEE

  3. Wei, C., Zhao, J., Zhou, W., Li, H.: Semantic boundary detection with reinforcement learning for continuous sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 31(3), 1138–1149 (2020)

  4. Xue, W., Liu, J., Yan, S., Zhou, Y., Yuan, T., Guo, Q.: Alleviating data insufficiency for Chinese sign language recognition. Vis. Intell. 1(1), 26 (2023)

  5. Xue, W., Kang, Z., Guo, L., Yang, S., Yuan, T., Chen, S.: Continuous sign language recognition for hearing-impaired consumer communication via self-guidance network. IEEE Transactions on Consumer Electronics (2023)

  6. Min, Y., Hao, A., Chai, X., Chen, X.: Visual alignment constraint for continuous sign language recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11542–11551 (2021)

  7. Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for continuous sign language recognition. In: Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 13009–13016 (2020)

  8. Cihan Camgoz, N., Hadfield, S., Koller, O., Bowden, R.: Subunets: End-to-end hand shape and continuous sign language recognition. In: Proceedings of the IEEE international conference on computer vision, pp. 3056–3065 (2017)

  9. Koller, O., Zargaran, S., Ney, H.: Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4297–4305 (2017)

  10. Niu, Z., Mak, B.: Stochastic fine-grained labeling of multi-state sign glosses for continuous sign language recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pp. 172–186 (2020). Springer

  11. Pu, J., Zhou, W., Li, H.: Iterative alignment network for continuous sign language recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4165–4174 (2019)

  12. Zhang, Z., Pu, J., Zhuang, L., Zhou, W., Li, H.: Continuous sign language recognition via reinforcement learning. In: 2019 IEEE international conference on image processing (ICIP), pp. 285–289 (2019). IEEE

  13. Wang, S., Guo, D., Zhou, W.-g., Zha, Z.-J., Wang, M.: Connectionist temporal fusion for sign language translation. In: Proceedings of the 26th ACM international conference on multimedia, pp. 1483–1491 (2018)

  14. Cui, R., Liu, H., Zhang, C.: A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimed. 21(7), 1880–1891 (2019)

  15. Hao, A., Min, Y., Chen, X.: Self-mutual distillation learning for continuous sign language recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11303–11312 (2021)

  16. Hu, L., Gao, L., Feng, W., et al.: Self-emphasizing network for continuous sign language recognition. arXiv preprint arXiv:2211.17081 (2022)

  17. Yang, T., Zhang, H., Hu, W., Chen, C., Wang, X.: Fast-parc: Position aware global kernel for convnets and vits. arXiv preprint arXiv:2210.04020 (2022)

  18. Dai, R., Das, S., Kahatapitiya, K., Ryoo, M.S., Brémond, F.: Ms-tct: multi-scale temporal convtransformer for action detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 20041–20051 (2022)

  19. Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., Ling, H.: M2det: A single-shot object detector based on multi-level feature pyramid network. In: Proceedings of the AAAI conference on artificial intelligence, vol. 33, pp. 9259–9266 (2019)

  20. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 141, 108–125 (2015)

  21. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7361–7369 (2017)

  22. Zuo, R., Mak, B.: C2slr: Consistency-enhanced continuous sign language recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5131–5140 (2022)

  23. Hu, L., Gao, L., Liu, Z., Feng, W.: Temporal lift pooling for continuous sign language recognition. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 511–527 (2022). Springer

  24. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning, pp. 369–376 (2006)

  25. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  26. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021)

  27. Tian, C., Zheng, M., Zuo, W., Zhang, S., Zhang, Y., Lin, C.-W.: A cross transformer for image denoising. Inf. Fusion 102, 102043 (2024)

  28. Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., Yan, X.: Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32 (2019)

  29. Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iterative optimization for continuous sign language recognition. In: IJCAI, 3, 7 (2018)

  30. Guo, D., Wang, S., Tian, Q., Wang, M.: Dense temporal convolution network for sign language translation. In: IJCAI, pp. 744–750 (2019)

  31. Zhou, H., Zhou, W., Li, H.: Dynamic pseudo label decoding for continuous sign language recognition. In: 2019 IEEE international conference on multimedia and expo (ICME), pp. 1282–1287 (2019). IEEE

  32. Girdhar, R., Grauman, K.: Anticipative video transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 13505–13515 (2021)

  33. Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3575–3584 (2019)

  34. Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 591–600 (2020)

  35. Wang, L., Tong, Z., Ji, B., Wu, G.: Tdn: Temporal difference networks for efficient action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1895–1904 (2021)

  36. Dai, R., Das, S., Minciullo, L., Garattoni, L., Francesca, G., Bremond, F.: Pdan: Pyramid dilated attention network for action detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2970–2979 (2021)

  37. Wu, H., **ao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31 (2021)

  38. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826 (2016)

  39. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6848–6856 (2018)

  40. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258 (2017)

  41. Ning, X., Yu, Z., Li, L., Li, W., Tiwari, P.: Dilf: Differentiable rendering-based multi-view image-language fusion for zero-shot 3d shape understanding. Inf. Fusion 102, 102033 (2024)

  42. Ning, X., Gong, K., Li, W., Zhang, L., Bai, X., Tian, S.: Feature refinement and filter network for person re-identification. IEEE Trans. Circuits Syst. Video Technol. 31(9), 3391–3402 (2020)

  43. Tian, C., Zhang, X., Zhang, Q., Yang, M., Ju, Z.: Image super-resolution via dynamic network. CAAI Transactions on Intelligence Technology (2023)

  44. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520 (2018)

  45. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  46. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  47. Fu, L., Tian, H., Zhai, X.B., Gao, P., Peng, X.: Incepformer: Efficient inception transformer with pyramid pooling for semantic segmentation. arXiv preprint arXiv:2212.03035 (2022)

  48. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)

  49. Dreuw, P., Neidle, C., Athitsos, V., Sclaroff, S., Ney, H.: Benchmark databases for video-based automatic sign language recognition. In: LREC (2008)

  50. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7784–7793 (2018)

  51. Forster, J., Schmidt, C., Koller, O., Bellgardt, M., Ney, H.: Extensions of the sign language recognition and translation corpus rwth-phoenix-weather. In: LREC, pp. 1911–1916 (2014)

  52. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255 (2009)

  53. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  54. Pu, J., Zhou, W., Hu, H., Li, H.: Boosting continuous sign language recognition via cross modality augmentation. In: Proceedings of the 28th ACM international conference on multimedia, pp. 1497–1505 (2020)

  55. Cheng, K.L., Yang, Z., Chen, Q., Tai, Y.-W.: Fully convolutional networks for continuous sign language recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 697–714 (2020). Springer

  56. Yang, Z., Shi, Z., Shen, X., Tai, Y.-W.: Sf-net: Structured feature network for continuous sign language recognition. arXiv preprint arXiv:1908.01341 (2019)

  57. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI conference on artificial intelligence, vol. 31 (2017)

  58. Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3560–3569 (2021)

  59. Guo, L., Xue, W., Guo, Q., Liu, B., Zhang, K., Yuan, T., Chen, S.: Distilling cross-temporal contexts for continuous sign language recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10771–10780 (2023)

  60. Zhao, W., Xu, L.: Weakly supervised target detection based on spatial attention. Vis. Intell. 2(1), 1–11 (2024)

  61. Wang, Y., Cao, C., Zhang, Y.: Visual-semantic network: a visual and semantic enhanced model for gesture recognition. Vis. Intell. 1(1), 25 (2023)

  62. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp. 618–626 (2017)

  63. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141 (2018)

  64. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp. 3–19 (2018)

  65. Wang, Z., She, Q., Smolic, A.: Action-net: Multipath excitation for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13214–13223 (2021)

  66. Liu, Y., Shao, Z., Teng, Y., Hoffmann, N.: Nam: Normalization-based attention module. arXiv preprint arXiv:2111.12419 (2021)

Author information

Authors and Affiliations

Information Technology Co., Ltd, No. 10 ZhangBaYi Road, Xi’an, 710075, Shaanxi, China

Authors

Corresponding authors

Correspondence to Wanli Xue or Yuxi Zhou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, Z., Xue, W., Zhou, Y. et al. Dual-stage temporal perception network for continuous sign language recognition. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03516-x
