-
Chapter and Conference Paper
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categor...
-
Chapter and Conference Paper
Single-Stream Multi-level Alignment for Vision-Language Pretraining
Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text repre...
-
Chapter and Conference Paper
Bayesian Semantic Instance Segmentation in Open Set World
This paper addresses the semantic instance segmentation task under open-set conditions, where input images can contain known and unknown object classes. The training process of existing semantic instance segm...
-
Chapter and Conference Paper
Multi-modal Cycle-Consistent Generalized Zero-Shot Learning
In generalized zero-shot learning (GZSL), the set of classes is split into seen and unseen classes, where training relies on the semantic features of the seen and unseen classes and the visual representations...