Abstract
Multi-frame human pose estimation is challenging because of factors such as motion blur, video defocus, and occlusion. Exploiting the temporal consistency between consecutive frames is an effective way to address these issues. Currently, most methods explore temporal consistency by refining only the final heatmaps. Heatmaps contain the semantic information of keypoints and can improve detection quality to a certain extent, but they are generated from features, and feature-level refinement is rarely considered. In this paper, we propose a human pose estimation framework that performs refinement at both the feature and semantic levels. At the feature level, we align auxiliary-frame features with the features of the current frame to reduce the loss caused by differing feature distributions, and then fuse the aligned auxiliary features with the current features through an attention mechanism. At the semantic level, we use the difference between adjacent heatmaps as auxiliary information to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
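For concreteness, the following is a minimal PyTorch sketch of the two refinement stages outlined above. The module names, channel sizes, and the use of deformable convolution for alignment are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only; module structure and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class FeatureAlignFuse(nn.Module):
    """Feature-level stage: align auxiliary-frame features to the current
    frame, then fuse the two with per-pixel attention weights."""
    def __init__(self, channels=48):
        super().__init__()
        # Offsets predicted from the concatenated (auxiliary, current) pair;
        # a 3x3 deformable kernel needs 2*3*3 = 18 offset channels.
        self.offset_conv = nn.Conv2d(2 * channels, 2 * 3 * 3, 3, padding=1)
        self.align = DeformConv2d(channels, channels, 3, padding=1)
        self.attn = nn.Conv2d(2 * channels, 2, 1)  # one weight map per frame

    def forward(self, cur_feat, aux_feat):
        pair = torch.cat([aux_feat, cur_feat], dim=1)
        aligned = self.align(aux_feat, self.offset_conv(pair))
        # Softmax over the two frames yields spatial attention weights.
        w = torch.softmax(self.attn(torch.cat([aligned, cur_feat], dim=1)), dim=1)
        return w[:, :1] * cur_feat + w[:, 1:] * aligned


class HeatmapDiffRefine(nn.Module):
    """Semantic-level stage: refine the current heatmaps using differences
    between adjacent frames' heatmaps as auxiliary input."""
    def __init__(self, num_joints=17):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 * num_joints, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_joints, 3, padding=1))

    def forward(self, cur_hm, prev_hm, next_hm):
        diffs = torch.cat([cur_hm, prev_hm - cur_hm, next_hm - cur_hm], dim=1)
        return cur_hm + self.refine(diffs)  # residual correction


# Shape check with dummy tensors (batch 2, 64x48 maps).
fused = FeatureAlignFuse()(torch.randn(2, 48, 64, 48), torch.randn(2, 48, 64, 48))
refined = HeatmapDiffRefine()(torch.randn(2, 17, 64, 48),
                              torch.randn(2, 17, 64, 48),
                              torch.randn(2, 17, 64, 48))
print(fused.shape, refined.shape)  # (2, 48, 64, 48) and (2, 17, 64, 48)
```

Aligning before fusing matters here: without alignment, the attention weights would mix spatially mismatched features from fast-moving limbs; the residual design of the heatmap stage lets the network fall back to the current heatmaps when adjacent frames add no information.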
Data availability
The code is available at https://github.com/Elvis-Aron/FaSRnet. The other data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Yuanhong ZHONG designed the research. Qianfeng XU and Daidi ZHONG processed the data. Yuanhong ZHONG, Qianfeng XU, and Daidi ZHONG drafted the paper. Xun YANG and Shanshan WANG helped organize the paper. All the authors revised and finalized the paper.
Ethics declarations
All the authors declare that they have no conflict of interest.
Additional information
Project supported by the National Key Research and Development Program of China (Nos. 2021YFC2009200 and 2023YFC3606100) and the Special Project of Technological Innovation and Application Development of Chongqing, China (No. cstc2019jscx-msxmX0167).
About this article
Cite this article
Zhong, Y., Xu, Q., Zhong, D. et al. FaSRnet: a feature and semantics refinement network for human pose estimation. Front Inform Technol Electron Eng 25, 513–526 (2024). https://doi.org/10.1631/FITEE.2200639
Key words
- Human pose estimation
- Multi-frame refinement
- Heatmap and offset estimation
- Feature alignment
- Multi-person