
QLDT: adaptive Query Learning for HOI Detection via vision-language knowledge Transfer

Published in Applied Intelligence

Abstract

Human-object interaction (HOI) detection comprises two core problems: human-object association detection and interaction understanding. For association detection, previous methods tend to detect only obvious human-object pairs while ignoring pairs with potential interaction relationships, which does not reflect real scenes. For interaction understanding, traditional methods face the challenges of long-tailed distributions and zero-shot detection and cannot flexibly handle complex, changing real-world scenarios. To this end, adaptive Query Learning for HOI Detection via vision-language knowledge Transfer (QLDT) is proposed. Specifically, a two-stage dynamic matching scoring algorithm, built on dynamically changing thresholds and scores, is designed to mine obscure human-object (H-O) pairs and label them, enlarging the sample size. In addition, the vision-language pre-trained model GLIP (Grounded Language-Image Pre-training) is introduced to enhance interaction understanding: the visual and linguistic features of the images are extracted through GLIP, the gap between these features and the predicted values is minimized with a cross-entropy loss, and the maximum of the predicted score and the obscure H-O pair score is taken as the final prediction, which keeps the model sensitive to positive interactions. The proposed method performs well on both the HICO-DET and V-COCO datasets: on HICO-DET, QLDT achieves 35.37% mAP on the full category set and 30.15% mAP on the rare categories, and it also improves on all five zero-shot metrics; on V-COCO, it achieves 62.74% mAP and 67.71% mAP under Scenario 1 and Scenario 2, respectively.
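The method details are in the full text, but the three mechanisms named in the abstract can be sketched. The following minimal PyTorch sketch is illustrative only: the function names, the linear threshold schedule, and the use of GLIP softmax outputs as soft cross-entropy targets are assumptions made for this sketch, not the authors' implementation (see the repository linked under Code Availability for the actual code).

```python
import torch
import torch.nn.functional as F

def label_obscure_pairs(pair_scores: torch.Tensor, epoch: int,
                        base_thresh: float = 0.5, decay: float = 0.02,
                        floor: float = 0.1) -> torch.Tensor:
    """Dynamic matching scoring (sketch): relax the acceptance threshold as
    training progresses so that obscure H-O pairs scoring below the initial
    threshold are gradually recruited as labeled positives. The linear
    schedule here is an assumption, not the paper's algorithm."""
    thresh = max(base_thresh - decay * epoch, floor)  # dynamically changing threshold
    return pair_scores > thresh  # mask of pairs labeled as interacting

def vl_transfer_loss(pred_logits: torch.Tensor,
                     glip_logits: torch.Tensor) -> torch.Tensor:
    """Knowledge transfer (sketch): minimize the cross-entropy between the
    detector's interaction predictions and soft targets derived from GLIP's
    vision-language features (soft targets need PyTorch >= 1.10)."""
    return F.cross_entropy(pred_logits, glip_logits.softmax(dim=-1))

def fuse_predictions(model_scores: torch.Tensor,
                     obscure_scores: torch.Tensor) -> torch.Tensor:
    """Final prediction: element-wise maximum of the model's interaction
    scores and the obscure H-O pair scores, as stated in the abstract."""
    return torch.maximum(model_scores, obscure_scores)

if __name__ == "__main__":
    n_pairs, n_verbs = 8, 117                     # e.g. 117 verb classes in HICO-DET
    scores = torch.rand(n_pairs, n_verbs)
    mask = label_obscure_pairs(scores.amax(dim=-1), epoch=5)
    loss = vl_transfer_loss(torch.randn(n_pairs, n_verbs),
                            torch.randn(n_pairs, n_verbs))
    final = fuse_predictions(scores, torch.rand(n_pairs, n_verbs))
    print(mask.shape, loss.item(), final.shape)
```

Taking the maximum rather than an average keeps a pair's score from being diluted when only one branch fires, which matches the abstract's stated goal of keeping the model sensitive to potential (obscure) interactions.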


Data Availability and Access

The data that support the findings of this study are openly available in the HICO-DET and V-COCO datasets.

Code Availability

The code for this paper is available at https://github.com/KKKarlW/QLDT.git


Author information


Contributions

Xincheng Wang performed the methodology and conceptualization; Yongbin Gao performed the review and supervision; Wenjun Yu performed the data curation; Chenmou Wu performed the validation; Mingxuan Chen performed the formal analysis; Honglei Ma performed the investigation; Zhichao Chen performed the data checks.

Corresponding author

Correspondence to Yongbin Gao.

Ethics declarations

Ethical and Informed Consent for Data Used

Informed consent for the publication of this article was obtained from all authors, the Shanghai University of Engineering Science, and COMAC Shanghai Aircraft Manufacturing Co., Ltd.

Competing Interests

The corresponding author of this paper is an associate editor of Applied Intelligence.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Gao, Y., Yu, W. et al. QLDT: adaptive Query Learning for HOI Detection via vision-language knowledge Transfer. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05653-1

