PositiveCoOp: Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
WACV 2025
abstract / project page / paper
Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefit that prompt learning offers over a simple vision-features-only baseline, observing that the baseline performs comparably to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
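The following is a minimal PyTorch sketch of the idea described in the abstract, not the authors' released implementation: each class gets learnable positive context tokens that are encoded by a frozen VLM text encoder, while the negative counterpart is a vector learned directly in the shared vision-text feature space. Names such as PositiveCoOpHead, ctx_len, and the text_encoder / class_token_embs arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositiveCoOpHead(nn.Module):
    """Sketch: one learned positive prompt per class (encoded by a frozen text
    encoder) and one negative embedding per class learned directly in the shared
    feature space, without going through the text encoder."""

    def __init__(self, num_classes, ctx_len=16, token_dim=512, feat_dim=512):
        super().__init__()
        # Learnable positive context tokens, one set per class.
        self.pos_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, token_dim) * 0.02)
        # Learnable negative embeddings, defined directly in the joint feature space.
        self.neg_emb = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(4.0))

    def forward(self, image_feats, text_encoder, class_token_embs):
        # image_feats: (B, feat_dim) from the frozen image encoder
        # class_token_embs: (C, L, token_dim) token embeddings of the class names
        prompts = torch.cat([self.pos_ctx, class_token_embs], dim=1)  # (C, ctx_len+L, token_dim)
        pos_feats = text_encoder(prompts)                             # (C, feat_dim), frozen encoder

        img = F.normalize(image_feats, dim=-1)
        pos = F.normalize(pos_feats, dim=-1)
        neg = F.normalize(self.neg_emb, dim=-1)

        scale = self.logit_scale.exp()
        pos_logits = scale * img @ pos.t()   # evidence for class presence
        neg_logits = scale * img @ neg.t()   # evidence for class absence
        # Per-class presence probability from the positive/negative pair.
        probs = torch.softmax(torch.stack([pos_logits, neg_logits], dim=-1), dim=-1)[..., 0]
        return probs
```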
Improving Multi-label Recognition using Class Co-Occurrence Probabilities
Samyak Rawlekar*, Shubhang Bhatnagar*, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja
CVPRW 2024, ICPR 2024 (Oral Top-5%)
abstract / project page / paper
Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this task, recent works have leveraged information from vision-language models (VLMs) trained on large image-text datasets. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between pairs of classes. We propose a framework that extends the independent classifiers by incorporating this pairwise co-occurrence information to improve their performance. Specifically, we use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
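As an illustration of the refinement step, here is a hedged PyTorch sketch (not the paper's code): a matrix of pairwise conditional probabilities estimated from the training labels is turned into a normalized GCN adjacency, and a small GCN refines class embeddings whose similarity to image features adjusts the independent classifiers' initial scores. The class name CoOccurrenceGCN, the additive fusion of logits, and the feature dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CoOccurrenceGCN(nn.Module):
    """Sketch: refine per-class scores with a GCN whose adjacency encodes the
    pairwise conditional probabilities P(class_j | class_i) estimated from
    training-set label co-occurrences."""

    def __init__(self, cooccurrence, feat_dim=512, hidden_dim=256):
        super().__init__()
        # cooccurrence: (C, C) conditional-probability matrix from the training labels.
        adj = cooccurrence + torch.eye(cooccurrence.size(0))       # add self-loops
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        self.register_buffer("adj_norm", deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :])
        self.gcn1 = nn.Linear(feat_dim, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, feat_dim)

    def forward(self, class_feats, initial_logits, image_feats):
        # class_feats: (C, feat_dim) initial class embeddings (e.g., from the VLM text encoder)
        # initial_logits: (B, C) per-class scores from the independent classifiers
        # image_feats: (B, feat_dim) image embeddings from the VLM image encoder
        h = torch.relu(self.gcn1(self.adj_norm @ class_feats))
        refined_class_feats = self.gcn2(self.adj_norm @ h)          # (C, feat_dim)
        refined_logits = image_feats @ refined_class_feats.t()      # (B, C)
        return initial_logits + refined_logits                      # co-occurrence-aware refinement
```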