
SAN: Structure-aware attention network for dyadic human relation recognition in images


Abstract

We introduce a new dataset and method for Dyadic Human Relation Recognition (DHR). DHR is a new task that concerns recognizing the type (i.e., verb) and roles of a two-person interaction. Unlike past human action detection, our goal is to extract richer information about the roles of the actors, i.e., which person is the subject acting on which person as the object. For this, we introduce the DHR-WebImages dataset, which consists of 22,046 images spanning 51 DHR verb classes with per-image annotation of the verb and roles, together with a test set for evaluating generalization capability, which we refer to as DHR-Generalization. We tackle DHR with a novel network inspired by the hierarchical nature of human cognitive perception. At the core of the network lies a “structure-aware attention” module that weights and integrates hierarchical visual cues associated with the DHR instance in the image. The feature hierarchy consists of three levels, namely the union, human, and joint levels, each of which extracts visual features relevant to the participants while modeling their cross-talk. We refer to this network as the Structure-aware Attention Network (SAN). Experimental results show that SAN achieves accurate DHR that is robust to limited visibility of the actors, and outperforms prior methods by 3.04 mAP on the DHR-WebImages verb task.
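The attention-based fusion of the three feature levels can be pictured with a short sketch. The PyTorch snippet below is only a minimal illustration under assumed design choices (a single scalar attention score per level, a 1024-dimensional feature per level, and a linear verb classifier); it is not the authors' implementation, which also models cross-talk between the union, human, and joint levels.

```python
# Minimal sketch of attention-weighted fusion over three hierarchy levels
# (union, human, joint). Feature dimension, scoring MLP, and the verb head
# are illustrative assumptions, not the SAN architecture as published.
import torch
import torch.nn as nn


class StructureAwareAttentionSketch(nn.Module):
    """Weights and integrates hierarchical cues of one dyadic interaction."""

    def __init__(self, feat_dim: int = 1024, num_verbs: int = 51):
        super().__init__()
        # One scalar attention score per hierarchy level.
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 4, 1),
        )
        self.verb_head = nn.Linear(feat_dim, num_verbs)

    def forward(self, union_feat, human_feat, joint_feat):
        # Each input: (batch, feat_dim) features pooled from the union box,
        # the two person boxes, and the body-joint regions respectively.
        levels = torch.stack([union_feat, human_feat, joint_feat], dim=1)  # (B, 3, D)
        scores = self.attn(levels)                      # (B, 3, 1)
        weights = torch.softmax(scores, dim=1)          # attention over the 3 levels
        fused = (weights * levels).sum(dim=1)           # (B, D) integrated cue
        return self.verb_head(fused), weights.squeeze(-1)


if __name__ == "__main__":
    # Random features stand in for backbone outputs.
    module = StructureAwareAttentionSketch()
    u, h, j = (torch.randn(2, 1024) for _ in range(3))
    verb_logits, level_weights = module(u, h, j)
    print(verb_logits.shape, level_weights.shape)  # torch.Size([2, 51]) torch.Size([2, 3])
```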


Data availability statement

The datasets generated and analysed during the current study are available in the GitHub repository at https://github.com/kaenkogashi/DHR-WebImages.


Acknowledgements

This work was supported in part by JSPS 20H05951, JSPS 21H04893, and JST JPMJCR20G7.

Funding

Partial financial support was received from JSPS 20H05951, JSPS 21H04893, and JST JPMJCR20G7.

Author information

Corresponding author

Correspondence to Kaen Kogashi.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 2560 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kogashi, K., Nobuhara, S. & Nishino, K. SAN: Structure-aware attention network for dyadic human relation recognition in images. Multimed Tools Appl 83, 46947–46966 (2024). https://doi.org/10.1007/s11042-023-17229-1

