-
Chapter and Conference Paper
Descriptive Attributes for Language-Based Object Keypoint Detection
Multimodal vision and language (VL) models have recently shown strong performance in phrase grounding and object detection for both zero-shot and finetuned cases. We adapt a VL model (GLIP) for keypoint detect...
-
Chapter and Conference Paper
Text-Driven Stylization of Video Objects
We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task as the resulting video must satisfy multiple properties: (1)...
-
Chapter and Conference Paper
SITTA: Single Image Texture Translation for Data Augmentation
Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on ...
-
Article
Occluded Video Instance Segmentation: A Benchmark (Open Access)
Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, th...
-
Chapter and Conference Paper
Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorizati...
-
Chapter and Conference Paper
Visual Prompt Tuning
The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alte...
-
Chapter and Conference Paper
On Label Granularity and Object Localization
Weakly supervised object localization (WSOL) aims to learn representations that encode object location using only image-level category labels. However, many objects can be labeled at different levels of granul...
-
Article
Convolutional Networks with Adaptive Inference Graphs
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained d...
-
Chapter and Conference Paper
A Metric Learning Reality Check
Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look a...
-
Chapter and Conference Paper
Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset
In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorizatio...
-
Chapter and Conference Paper
Learning Gradient Fields for Shape Generation
In this work, we propose a novel technique to generate shapes from point cloud data. A point cloud can be viewed as samples from a distribution of 3D points whose density is concentrated near the surface of th...
-
Chapter and Conference Paper
Deep Fundamental Matrix Estimation Without Correspondences
Estimating fundamental matrices is a classic problem in computer vision. Traditional methods rely heavily on the correctness of estimated key-point correspondences, which can be noisy and unreliable. As a resu...
-
Article
Vision-based real estate price estimation
Since the advent of online real estate database companies like Zillow, Trulia and Redfin, the problem of automatic estimation of market values for houses has received considerable attention. Several real estat...
-
Chapter and Conference Paper
Learning Single-View 3D Reconstruction with Limited Pose Supervision
It is expensive to label images with 3D structure or precise camera pose. Yet, this is precisely the kind of annotation required to train single-view 3D reconstruction models. In contrast, unlabeled images or ...
-
Chapter and Conference Paper
Multimodal Unsupervised Image-to-Image Translation
Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding image...
-
Chapter and Conference Paper
Convolutional Networks with Adaptive Inference Graphs
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained d...
-
Chapter
Cross-View Image Geo-localization
The recent availability of large amounts of geo-tagged imagery has inspired a number of data-driven solutions to the image geo-localization problem. Existing approaches predict the location of a query image by...
-
Article
Editorial: Special Issue on Active and Interactive Methods in Computer Vision
-
Article
The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization
We present a visual recognition system for fine-grained visual categorization. The system is composed of a human and a machine working together and combines the complementary strengths of computer vision algor...
-
Chapter and Conference Paper
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This ...