Enhancing action discrimination via category-specific frame clustering for weakly-supervised temporal action localization

Xia, Huifen; Zhan, Yongzhao; Liu, Honglin; Ren, Xiaopeng

doi:10.1631/FITEE.2300024

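Chinese title (translated): Weakly-supervised temporal action detection that enhances action saliency through category-specific frame clustering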

Research Article
Published: 05 July 2024

Volume 25, pages 809–823, (2024)

Frontiers of Information Technology & Electronic Engineering

Abstract

Temporal action localization (TAL) is the task of detecting the start and end timestamps of action instances in an untrimmed video and classifying them. As the number of action categories per video increases, existing weakly-supervised TAL (W-TAL) methods that use only video-level labels cannot provide sufficient supervision, and single-frame supervision has therefore attracted the interest of researchers. Existing paradigms model single-frame annotations from the perspective of video snippet sequences, neglect the action discrimination of annotated frames, and do not pay sufficient attention to the correlations among annotated frames of the same category. Within a category, the annotated frames exhibit distinctive appearance characteristics or clear action patterns. Thus, a novel method to enhance action discrimination via category-specific frame clustering for W-TAL is proposed. Specifically, the K-means clustering algorithm is employed to aggregate the annotated discriminative frames of the same category, and the resulting clusters are regarded as exemplars that exhibit the characteristics of the action category. Class activation scores are then obtained by calculating the similarities between a frame and the exemplars of the various categories. This category-specific representation modeling provides complementary guidance to the snippet sequence modeling in the mainline. Accordingly, a convex combination fusion mechanism is presented for annotated frames and snippet sequences to enhance the consistency of action discrimination, which yields a robust class activation sequence for precise action classification and localization. Owing to the supplementary guidance that action discrimination enhancement gives to video snippet sequences, our method outperforms existing single-frame annotation based methods. Experiments conducted on three datasets (THUMOS14, GTEA, and BEOID) show that our method achieves high localization performance compared with state-of-the-art methods.
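
The three steps named in the abstract (per-category K-means over annotated frames, similarity-based class activation scores, and convex combination fusion) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, the number of clusters, how similarities to several exemplars are aggregated, or the fusion weight. Cosine similarity, `k=2`, max aggregation, `lam=0.5`, and all function names below are hypothetical choices made only for this example, and the feature vectors are synthetic.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct input points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def build_exemplars(frame_feats, frame_labels, num_classes, k=2):
    """Cluster the annotated frames of each category into k exemplars."""
    return [kmeans(frame_feats[frame_labels == c], k) for c in range(num_classes)]


def exemplar_scores(snippet_feats, exemplars):
    """Per-class score: best cosine similarity to that class's exemplars (assumed)."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    s = unit(snippet_feats)
    return np.stack([(s @ unit(e).T).max(axis=1) for e in exemplars], axis=1)


def fuse(cas_sequence, cas_exemplar, lam=0.5):
    """Convex combination of two class activation sequences, with 0 <= lam <= 1."""
    return lam * cas_sequence + (1.0 - lam) * cas_exemplar


# Synthetic demo: two categories whose features lie near two distinct directions.
rng = np.random.default_rng(0)
protos = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
frame_labels = np.array([0] * 6 + [1] * 6)
frame_feats = protos[frame_labels] + 0.05 * rng.standard_normal((12, 4))
exemplars = build_exemplars(frame_feats, frame_labels, num_classes=2, k=2)

snippets = protos[[0, 1, 1]] + 0.05 * rng.standard_normal((3, 4))
cas_exemplar = exemplar_scores(snippets, exemplars)
# Stand-in for the mainline snippet-sequence model's output (uninformative here).
cas_sequence = np.full((3, 2), 0.5)
cas = fuse(cas_sequence, cas_exemplar, lam=0.5)
print(cas.argmax(axis=1))
```

Because the fusion is a convex combination, the fused scores always lie between the two input scores, so neither branch can push an activation outside the range the other branch and itself already span.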

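Abstract (translated from the Chinese version)

Temporal action detection is the task of detecting the start and end times of actions in untrimmed videos and classifying the action instances. As the number of action categories in a video grows, existing weakly-supervised temporal action detection methods that rely only on video-level labels can no longer provide sufficient supervision. Single-frame annotation has therefore attracted interest. However, existing single-frame annotation methods model the annotated frames only from the perspective of video snippet sequences, ignoring the action saliency of the annotated frames and not fully considering their correlations within the same action category. Considering that annotated frames of the same action category exhibit distinctive appearance characteristics and clear action patterns, this paper proposes a novel weakly-supervised temporal action detection method that enhances action saliency through category-specific frame clustering. The method uses the K-means clustering algorithm to aggregate the frames of the same action category and takes the result as the feature representation of that category. Class activation scores are obtained by computing the similarity between each frame and each action category. Category-specific single-frame representation modeling can provide complementary guidance to the video snippet sequence modeling in the mainline. Therefore, for the annotated frames and their corresponding video snippet sequences, a convex combination fusion mechanism is proposed to enhance the consistency of action saliency, producing a more robust class activation sequence for precise action classification and localization. Owing to the complementary guidance of action saliency enhancement, the method outperforms existing single-frame annotation based action detection methods. Experiments on three datasets (THUMOS14, GTEA, and BEOID) show that the proposed method achieves higher detection performance than the latest methods.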

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Author information

Authors and Affiliations

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, China
Huifen Xia (夏惠芬), Yongzhao Zhan (詹永照), Honglin Liu (刘洪麟) & Xiaopeng Ren (任晓鹏)
Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agricultural Applications, Zhenjiang, 212013, China
Yongzhao Zhan (詹永照)
Changzhou Vocational Institute of Mechatronic Technology, Changzhou, 213164, China
Huifen Xia (夏惠芬)

Contributions

Huifen XIA and Yongzhao ZHAN designed the research. Honglin LIU provided theoretical guidance. Xiaopeng REN trained the model and processed the data. Huifen XIA drafted the paper. Yongzhao ZHAN revised and finalized the paper.

Corresponding author

Correspondence to Yongzhao Zhan (詹永照).

Ethics declarations

All the authors declare that they have no conflict of interest.

Additional information

Project supported by the National Natural Science Foundation of China (No. 61672268).

About this article

Cite this article

Xia, H., Zhan, Y., Liu, H. et al. Enhancing action discrimination via category-specific frame clustering for weakly-supervised temporal action localization. Front Inform Technol Electron Eng 25, 809–823 (2024). https://doi.org/10.1631/FITEE.2300024

Received: 13 January 2023
Accepted: 06 July 2023
Published: 05 July 2024
Issue Date: June 2024
DOI: https://doi.org/10.1631/FITEE.2300024

CLC number

TP391.4
