
QLDT: adaptive Query Learning for HOI Detection via vision-language knowledge Transfer

Published in Applied Intelligence

Abstract

Human-object interaction (HOI) detection comprises two core problems: human-object association detection and interaction understanding. For association detection, previous methods tend to detect only obvious human-object pairs while ignoring pairs with potential interaction relationships, which does not reflect real scenes. For interaction understanding, traditional methods face the challenges of long-tailed distributions and zero-shot detection and cannot flexibly handle complex, changing real-world scenarios. To this end, adaptive Query Learning for HOI Detection via vision-language knowledge Transfer (QLDT) is proposed. Specifically, a two-stage dynamic matching scoring algorithm, built on dynamically changing thresholds and scores, is designed to mine obscure human-object (H-O) pairs and label them, enlarging the sample size. In addition, the vision-language pre-trained model GLIP (Grounded Language-Image Pre-training) is introduced to enhance interaction understanding: the visual and linguistic features of the images are extracted through GLIP, the gap between these features and the predicted values is minimized with a cross-entropy loss, and the maximum of the predicted score and the obscure H-O pair score is taken as the final prediction, which keeps the model sensitive to positive interactions. The proposed method performs well on both the HICO-DET and V-COCO datasets: on HICO-DET, QLDT achieves 35.37% mAP on the full category set and 30.15% mAP on the rare categories, and it also improves on all five zero-shot metrics; on V-COCO, it achieves 62.74% mAP and 67.71% mAP under Scenario 1 and Scenario 2, respectively.
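The method details are in the full text, but the three mechanisms named in the abstract can be sketched. The following minimal PyTorch sketch is illustrative only: the function names, the linear threshold schedule, and the use of GLIP softmax outputs as soft cross-entropy targets are assumptions made for this sketch, not the authors' implementation (see the repository linked under Code Availability for the actual code).

```python
import torch
import torch.nn.functional as F

def label_obscure_pairs(pair_scores: torch.Tensor, epoch: int,
                        base_thresh: float = 0.5, decay: float = 0.02,
                        floor: float = 0.1) -> torch.Tensor:
    """Dynamic matching scoring (sketch): relax the acceptance threshold as
    training progresses so that obscure H-O pairs scoring below the initial
    threshold are gradually recruited as labeled positives. The linear
    schedule here is an assumption, not the paper's algorithm."""
    thresh = max(base_thresh - decay * epoch, floor)  # dynamically changing threshold
    return pair_scores > thresh  # mask of pairs labeled as interacting

def vl_transfer_loss(pred_logits: torch.Tensor,
                     glip_logits: torch.Tensor) -> torch.Tensor:
    """Knowledge transfer (sketch): minimize the cross-entropy between the
    detector's interaction predictions and soft targets derived from GLIP's
    vision-language features (soft targets need PyTorch >= 1.10)."""
    return F.cross_entropy(pred_logits, glip_logits.softmax(dim=-1))

def fuse_predictions(model_scores: torch.Tensor,
                     obscure_scores: torch.Tensor) -> torch.Tensor:
    """Final prediction: element-wise maximum of the model's interaction
    scores and the obscure H-O pair scores, as stated in the abstract."""
    return torch.maximum(model_scores, obscure_scores)

if __name__ == "__main__":
    n_pairs, n_verbs = 8, 117                     # e.g. 117 verb classes in HICO-DET
    scores = torch.rand(n_pairs, n_verbs)
    mask = label_obscure_pairs(scores.amax(dim=-1), epoch=5)
    loss = vl_transfer_loss(torch.randn(n_pairs, n_verbs),
                            torch.randn(n_pairs, n_verbs))
    final = fuse_predictions(scores, torch.rand(n_pairs, n_verbs))
    print(mask.shape, loss.item(), final.shape)
```

Taking the maximum rather than an average keeps a pair's score from being diluted when only one branch fires, which matches the abstract's stated goal of keeping the model sensitive to potential (obscure) interactions.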


Data Availability and Access

The data that support the findings of this study are openly available in the HICO-DET and V-COCO datasets.

Code Availability

The code for this paper is available at https://github.com/KKKarlW/QLDT.git


Author information


Contributions

Xincheng Wang performed the methodology and conceptualization; Yongbin Gao performed the review and supervision; Wenjun Yu performed the data curation; Chenmou Wu performed the validation; Mingxuan Chen performed the formal analysis; Honglei Ma performed the investigation; Zhichao Chen performed the data checks.

Corresponding author

Correspondence to Yongbin Gao.

Ethics declarations

Ethical and Informed Consent for Data Used

Informed consent for the publication of this article was obtained from all authors, the Shanghai University of Engineering Science, and COMAC Shanghai Aircraft Manufacturing Co., Ltd.

Competing Interests

The corresponding author of this paper is an associate editor of Applied Intelligence.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Gao, Y., Yu, W. et al. QLDT: adaptive Query Learning for HOI Detection via vision-language knowledge Transfer. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05653-1

