Search Results
- Visual attention network
  While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by...
- MANet: Mixed Attention Network for Visual Explanation
  Various visual explanation methods, such as CAM and Grad-CAM, have been proposed to visualize and interpret predictions made by CNNs. Recent efforts...
- Multimodal attention-driven visual question answering for Malayalam
  Visual question answering is a challenging task that necessitates sophisticated reasoning over the visual elements to provide an accurate answer...
- Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering
  Visual question answering (VQA) is a multimodal task requiring a simultaneous understanding of both visual and textual content. Therefore, image and...
- GVA: guided visual attention approach for automatic image caption generation
  Automated image caption generation with attention mechanisms focuses on visual features including objects, attributes, actions, and scenes of the...
- Masked co-attention model for audio-visual event localization
  The objective of Audio-Visual Event Localization (AVEL) is to leverage audio and video cues in a combined manner to localize video segments that...
- Dual visual align-cross attention-based image captioning transformer
  Region-based features widely used in image captioning are typically extracted using object detectors like Faster R-CNN. However, the approach has a...
- IMCN: Improved modular co-attention networks for visual question answering
  Many existing Visual Question Answering (VQA) methods use traditional attention mechanisms to focus on each region of the input image and each word...
- Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion
  Video captioning has become one of the research hotspots in recent years due to its wide range of potential application scenarios. It...
- Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
  Visual question answering (VQA) has become a research hotspot in the computer vision and natural language processing fields. A core solution...
- Co-attention graph convolutional network for visual question answering
  Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual...
- Multistage attention region supplement transformer for fine-grained visual categorization
  Fine-grained image classification employs neural network models to distinguish between instances of different...
- AS-Net: active speaker detection using deep audio-visual attention
  Active Speaker Detection (ASD) aims at identifying the active speaker among multiple speakers in a video scene. Previous ASD models often seek audio...
- Hierarchical cross-modal contextual attention network for visual grounding
  This paper explores the task of visual grounding (VG), which aims to localize regions of an image through sentence queries. The development of VG has...
- Local self-attention in transformer for visual question answering
  Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have...
- Dual-feature collaborative relation-attention networks for visual question answering
  Region and grid features extracted by object detection networks, which contain abundant image information, are widely used in visual question...
- Graph attention network-optimized dynamic monocular visual odometry
  Monocular Visual Odometry (VO) is often formulated as a sequential dynamics problem that relies on the scene-rigidity assumption. One of the main...
- Hierarchical Attention Networks for Fact-based Visual Question Answering
  Fact-based Visual Question Answering (FVQA) aims to answer questions with images and facts. It requires a fine-grained and simultaneous understanding...
- Multi-scale network with shared cross-attention for audio–visual correlation learning
  Cross-modal audio–visual correlation learning has been an interesting research topic, which aims to capture and understand semantic correspondences...
- Cross-modal attention guided visual reasoning for referring image segmentation
  The goal of referring image segmentation (RIS) is to generate the foreground mask of the object described by a natural language expression. The key...