Search Results
-
Dual visual align-cross attention-based image captioning transformer
Region-based features widely used in image captioning are typically extracted using object detectors like Faster R-CNN. However, the approach has a...
-
Sentimental Visual Captioning using Multimodal Transformer
We propose a new task called sentimental visual captioning that generates captions with the inherent sentiment reflected by the input image or video....
-
ST-VQA: shrinkage transformer with accurate alignment for visual question answering
While transformer-based models have been remarkably successful in the field of visual question answering (VQA), their approaches to achieve vision...
-
RaSTFormer: region-aware spatiotemporal transformer for visual homogenization recognition in short videos
With the surge in network traffic, the homogenization of short video content is becoming increasingly prominent, resulting in low-quality...
-
Relation-wise transformer network and reinforcement learning for visual navigation
The task of object goal navigation is to drive an embodied agent to find the location of a given target using only visual observation. The map...
-
TransFGVC: transformer-based fine-grained visual classification
Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. This task is challenging owing to...
-
Visual contextual relationship augmented transformer for image captioning
The image captioning task is among the most important tasks in computer vision. Most existing methods mine more useful contextual information...
-
Multistage attention region supplement transformer for fine-grained visual categorization
Fine-grained image classification employs neural network models to distinguish between instances of different...
-
A transformer based real-time photo captioning framework for visually impaired people with visual attention
In recent years, transformer-based photo captioning frameworks have played a crucial role in improving individuals’ overall well-being, self-reliance, and...
-
Local self-attention in transformer for visual question answering
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have...
-
Multi-granularity hypergraph-guided transformer learning framework for visual classification
Fine-grained single-label classification tasks aim to distinguish highly similar categories but often overlook inter-category relationships....
-
CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion
Multimodal machine translation is a technique that leverages computer vision to improve the quality of text translation. Most recent multimodal...
-
A robust attention-enhanced network with transformer for visual tracking
Recently, Siamese-based trackers have become particularly popular. The correlation module in these trackers is responsible for fusing the feature...
-
Enhancing Indian sign language recognition through data augmentation and visual transformer
This paper introduces a novel approach to Indian Sign Language Recognition (ISLR) by integrating Keras, Visual Transformers (ViT), and sophisticated...
-
Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer
Lip-reading has attracted increasing attention in recent years and has broad application prospects in areas such as human–computer...
-
CWC-transformer: a visual transformer approach for compressed whole slide image classification
The rapid development of Artificial Intelligence (AI) technology accelerates the application of computational pathology in clinical decision-making....
-
Multi-modal transformer using two-level visual features for fake news detection
Fake news with multimedia data is ubiquitous on the Internet nowadays, and it is difficult for users to distinguish them. Therefore, it is necessary...
-
Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions
Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in the realm of audio-visual learning. AVSS...
-
A Comparative Study of CNN- and Transformer-Based Visual Style Transfer
Vision Transformer has shown impressive performance on image classification tasks. Observing that most existing visual style transfer (VST)...
-
Repformer: a robust shared-encoder dual-pipeline transformer for visual tracking
Siamese-based trackers have achieved outstanding tracking performance. However, in complex scenarios these trackers struggle to adequately integrate...