Search Results
-
VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can... -
DiveXplore at the Video Browser Showdown 2024
According to our experience from VBS2023 and the feedback from the IVR4B special session at CBMI2023, we have largely revised the diveXplore system... -
Contextual Augmentation with Bias Adaptive for Few-Shot Video Object Segmentation
Few-shot video object segmentation (FSVOS) is a challenging task that aims to segment new object classes across query videos with limited annotated... -
Face Forgery Detection via Texture and Saliency Enhancement
In recent years, AI-driven advancements have resulted in increasingly sophisticated face forgery techniques, posing a challenge in distinguishing... -
Cross-Modal Semantic Alignment Learning for Text-Based Person Search
Text-based person search aims to retrieve pedestrian images corresponding to a specific identity based on a textual description. Existing methods... -
Dive into Coarse-to-Fine Strategy in Single Image Deblurring
The coarse-to-fine approach has gained significant popularity in the design of networks for single image deblurring. Traditional methods used to... -
CLF-Net: A Few-Shot Cross-Language Font Generation Method
Designing a font library takes a lot of time and effort. Few-shot font generation aims to generate a new font library by referring to only a few... -
Advancing Incremental Few-Shot Semantic Segmentation via Semantic-Guided Relation Alignment and Adaptation
Incremental few-shot semantic segmentation aims to extend a semantic segmentation model to novel classes according to only a few labeled data, while... -
MAVAR-SE: Multi-scale Audio-Visual Association Representation Network for End-to-End Speaker Extraction
Speaker extraction to separate the target speech from the mixed audio is a problem worth studying in the speech separation field. Since human... -
A Lightweight Local Attention Network for Image Super-Resolution
For many years, deep neural networks have been used for Single Image Super-resolution (SISR) tasks. However, more extensive networks require higher... -
Find the Cliffhanger: Multi-modal Trailerness in Soap Operas
Creating a trailer requires carefully picking out and piecing together brief enticing moments out of a longer video, making it a challenging and... -
Dynamic-Static Graph Convolutional Network for Video-Based Facial Expression Recognition
Most of the current methods for video-based facial expression recognition (FER) in the wild are based on deep neural networks with attention... -
Multi-head Hashing with Orthogonal Decomposition for Cross-modal Retrieval
Recently, cross-modal hashing has become a promising line of research in cross-modal retrieval. It not only takes advantage of complementary multiple... -
Multi-scale Decomposition Dehazing with Polarimetric Vision
In this paper, the problem of simultaneous image dehazing of near and far scenes in hazy weather is addressed. We propose the multi-scale... -
Super-Resolution-Assisted Feature Refined Extraction for Small Objects in Remote Sensing Images
Despite achieving impressive results in object detection in natural scenes, the task of object detection in remote sensing images is still full of... -
Two-Stage Reasoning Network with Modality Decomposition for Text VQA
Text-based Visual Question Answering (Text VQA) is a challenging task that requires a comprehensive understanding of scene texts in an image. Scene... -
Localization and Local Motion Magnification of Pulsatile Regions in Endoscopic Surgery Videos
Localization of neurovascular bundles or vessels is critical in endoscopic surgery. It still remains challenging to identify neurovascular bundles... -
A Coarse and Fine Grained Masking Approach for Video-Grounded Dialogue
The task of Video-Grounded Dialogue involves developing a multimodal chatbot capable of answering sequential questions from humans regarding video... -
Co-speech Gesture Generation with Variational Auto Encoder
The research field of generating natural gestures from speech input is called co-speech gesture generation. Co-speech generation methods should... -
LRATNet: Local-Relationship-Aware Transformer Network for Table Structure Recognition
Table structure recognition is a challenging task due to complex background and various styles of tables. Existing methods address this challenge by...