A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14425)

Abstract

The spatio-temporal convolution model is widely recognized for its effectiveness in action recognition across various fields. Such a model typically takes video clips as input, runs inference on multiple clips, and derives a video-level prediction through an aggregation function. However, the model produces high-confidence predictions regardless of whether an input clip contains sufficient spatio-temporal information to indicate its class. These inaccurate high-confidence predictions can then degrade the accuracy of the video-level result. The current approach to mitigating this problem is to increase the number of clips, which does not address its root cause. To solve this issue, we propose a fine-tuning framework based on a fuzzy error loss, aimed at further refining a well-trained spatio-temporal convolution model that relies on dense sampling. By producing low-confidence outputs for clips with insufficient spatio-temporal information, our framework improves the accuracy of video-level action recognition. We conducted extensive experiments on two action recognition datasets, UCF101 and Kinetics-Sounds, to evaluate the effectiveness of the proposed framework. The results show a significant improvement in video-level recognition accuracy on both datasets.
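The clip-to-video pipeline the abstract describes (per-clip inference, then an aggregation function over clip scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the aggregation function, model, or the exact form of the fuzzy error loss, and the example tensors are invented.

```python
import torch
import torch.nn.functional as F

def video_level_prediction(clip_logits: torch.Tensor) -> torch.Tensor:
    """Aggregate per-clip logits of shape (num_clips, num_classes) into a
    single video-level class distribution by averaging clip softmax scores.
    Score averaging is one common choice of aggregation function; the
    abstract does not commit to a specific one."""
    clip_probs = F.softmax(clip_logits, dim=1)
    return clip_probs.mean(dim=0)

# Illustration of the motivation: a clip that emits a near-uniform
# (low-confidence) distribution barely shifts the averaged video-level
# score, so confident, informative clips dominate the final prediction.
informative = torch.tensor([[4.0, 0.0, 0.0]])    # strong evidence for class 0
uninformative = torch.tensor([[0.1, 0.0, 0.1]])  # near-uniform, low confidence
video_probs = video_level_prediction(torch.cat([informative, uninformative]))
print(video_probs.argmax().item())  # class 0 wins despite the weak clip
```

Averaging probabilities rather than logits is what makes a low-confidence clip an almost neutral contribution: its near-uniform distribution adds roughly the same mass to every class, which is why fine-tuning the model to be uncertain on uninformative clips can improve the aggregated video-level result.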



Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 62072048, and in part by Industry-University-Research Innovation Fund of Universities in China under Grant 2021ITA07005.

Author information

Correspondence to Ye Tian.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, J., Yang, M., Liu, Y., **, G., Zhang, L., Tian, Y. (2024). A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_8

  • DOI: https://doi.org/10.1007/978-981-99-8429-9_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8428-2

  • Online ISBN: 978-981-99-8429-9

  • eBook Packages: Computer Science; Computer Science (R0)
