Video object segmentation through semantic visual words matching

Hao, Chuanyan; Chen, Yadang; Wu, Weimin; Yang, Zhi-Xin; Wu, Enhua

doi:10.1007/s11042-023-14361-w

Published: 10 January 2023

Volume 82, pages 19591–19605, (2023)

Multimedia Tools and Applications

Chuanyan Hao (ORCID: orcid.org/0000-0003-3887-5438)¹,
Yadang Chen²,
Weimin Wu¹,
Zhi-Xin Yang³ &
Enhua Wu⁴

Abstract

Video object segmentation (VOS) is widely used in computer vision. However, existing VOS algorithms struggle with object deformation, occlusion, and fast motion. We therefore propose a VOS algorithm based on semantic visual words matching. Specifically, given a support frame and its corresponding mask, the frame is first passed through an encoder with an embedding layer, and a clustering algorithm then generates a group of semantic visual words according to the mask. For a query frame to be segmented, a matching operation is performed against the words generated from the support frame, so that each pixel of the query frame can be assigned to an object category according to the resulting similarity. In addition, a self-attention mechanism is applied before the word matching to enhance the embedding features and capture global dependencies. To further handle object changes and global mismatches, our method also employs online update and correction mechanisms. Experiments show that the proposed method achieves competitive results on the DAVIS 2016 and DAVIS 2017 datasets: J&F-mean, the mean of region similarity and contour accuracy, reaches 83.2% and 72.3% on DAVIS 2016 and DAVIS 2017, respectively.
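
The word-generation and matching steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm or the similarity measure, so plain k-means and cosine similarity are assumed here, and small synthetic 2-D embeddings stand in for the encoder output. All function names (`kmeans`, `build_words`, `match_query`, `toy_frame`) and the `words_per_object` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (N, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def build_words(support_emb, support_mask, words_per_object=4):
    """Cluster the support-frame embeddings of each mask label into visual words."""
    words, labels = [], []
    for obj in np.unique(support_mask):
        centroids = kmeans(support_emb[support_mask == obj], words_per_object)
        words.append(centroids)
        labels.append(np.full(len(centroids), obj))
    return np.concatenate(words), np.concatenate(labels)


def match_query(query_emb, words, word_labels):
    """Label each query pixel with the object of its most similar word."""
    sim = l2_normalize(query_emb) @ l2_normalize(words).T  # (N, M) cosine
    return word_labels[sim.argmax(axis=1)]


# Synthetic data: label 0 = background, label 1 = one object.
rng = np.random.default_rng(0)


def toy_frame(n_bg, n_fg):
    """Flattened per-pixel embeddings (N, 2) and mask labels (N,)."""
    bg = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((n_bg, 2))
    fg = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((n_fg, 2))
    emb = np.concatenate([bg, fg])
    mask = np.concatenate([np.zeros(n_bg, dtype=int), np.ones(n_fg, dtype=int)])
    return emb, mask


support_emb, support_mask = toy_frame(60, 40)
query_emb, query_mask = toy_frame(50, 50)

words, word_labels = build_words(support_emb, support_mask)
pred = match_query(query_emb, words, word_labels)
print("pixel accuracy on the toy query frame:", (pred == query_mask).mean())
```

The sketch omits the self-attention enhancement and the online update and correction mechanisms mentioned in the abstract; it only shows how a small set of per-object words can replace pixel-to-pixel matching against the whole support frame.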

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Author information

Authors and Affiliations

School of Education Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, People’s Republic of China
Chuanyan Hao & Weimin Wu
School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, People’s Republic of China
Yadang Chen
State Key Laboratory of Internet of Things for Smart City, Department of Electromechanical Engineering, University of Macao, Macau, People’s Republic of China
Zhi-Xin Yang
State Key Laboratory of Computer Science, Institute of Software, University of Chinese Academy of Sciences, Beijing, People’s Republic of China
Enhua Wu

Corresponding author

Correspondence to Chuanyan Hao.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61802197), and was also funded in part by the Science and Technology Development Fund, Macau SAR (File Nos. SKL-IOTSC-2018-2020, 0018/2019/AKP, 0008/2019/AGJ, and FDCT/194/2017/A3) and in part by the University of Macau under Grants MYRG2018-00248-FST and MYRG2019-0137-FST.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hao, C., Chen, Y., Wu, W. et al. Video object segmentation through semantic visual words matching. Multimed Tools Appl 82, 19591–19605 (2023). https://doi.org/10.1007/s11042-023-14361-w

Received: 18 August 2021
Revised: 01 August 2022
Accepted: 02 January 2023
Published: 10 January 2023
Issue Date: May 2023
DOI: https://doi.org/10.1007/s11042-023-14361-w
