Search Results
Enhancing Semantics-Driven Recommender Systems with Visual Features
Content-based semantics-driven recommender systems are often used in the small-scale news recommendation domain, founded on the TF-IDF measure but...
FaSRnet: a feature and semantics refinement network for human pose estimation
Due to factors such as motion blur, out-of-focus video, and occlusion, multi-frame human pose estimation is a challenging task. Exploiting temporal...
Enhancing Visual Question Answering with Generated Image Caption
Visual Question Answering (VQA) poses a formidable challenge, necessitating computer systems to proficiently execute essential computer vision tasks,...
A Cross-Modal View to Utilize Label Semantics for Enhancing Student Network in Multi-label Classification
Knowledge transfer has become a promising approach for improving the performance and efficiency of relatively lightweight networks. Previous research...
Visual and language semantic hybrid enhancement and complementary for video description
It is a fundamental task of computer vision to describe and express the visual content of a video in natural language, which not only highly...
Zero-shot image classification via Visual–Semantic Feature Decoupling
Zero-shot image classification refers to the use of labeled images to train a classification model that can correctly classify images of unseen...
Audio-Visual Segmentation by Leveraging Multi-scaled Features Learning
Audio-visual segmentation with semantics (AVSS) is an advanced approach that enriches audio-visual segmentation (AVS) by incorporating object...
Enhancing Fairness of Visual Attribute Predictors
The performance of deep neural networks for image recognition tasks such as predicting a smiling face is known to degrade with under-represented...
Mutually guided learning of global semantics and local representations for image restoration
The global semantics and the local scene representation are crucial for image restoration. Although existing methods have proposed various hybrid...
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. However, MAE solely reconstructs the...
Multi-granularity hypergraph-guided transformer learning framework for visual classification
Fine-grained single-label classification tasks aim to distinguish highly similar categories but often overlook inter-category relationships....
End-to-End Image Compression Through Machine Semantics
With the increasing demand for AI automated analysis, machine semantics have replaced signals as a new focus in visual information compression. In...
LGVC: language-guided visual context modeling for 3D visual grounding
3D visual grounding is crucial for understanding cross-modal scenes, linking visual objects to their corresponding language descriptions. Traditional...
SLOD2+WIN: semantics-aware addition and LoD of 3D window details for LoD2 CityGML models with textures
In many urban planning and visualization applications, it is crucial to have 3D window details. However, the process of acquiring and reconstructing...
Contrastive learning for unsupervised sentence embeddings using negative samples with diminished semantics
Unsupervised learning has made significant progress in recent years, driven by advancements in contrastive learning. However, current methods for...
GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering
The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity....
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate...
Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization
The audio-visual event localization (AVE) task focuses on localizing audio-visual events where event signals occur in both audio and visual modalities....
An Effective Pre-trained Visual Encoder for Medical Visual Question Answering
Medical Visual Question Answering (Med-VQA) is a domain-specific task that answers a given clinical question regarding a radiology image. It requires...
Indirect visual–semantic alignment for generalized zero-shot recognition
Our paper addresses the challenge of generalized zero-shot learning, where the label of a target image may belong to either a seen or an unseen...