PositiveCoOp: Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Winter Conference on Applications of Computer Vision (WACV 2025)

Abstract

Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.

Motivation

Prompt learning is a popular approach to adapt vision-language models (VLMs) for recognition tasks. It leverages the latent knowledge in the text encoder to learn prompts whose embeddings in the shared vision-language embedding space correspond to the presence of a class. Several works have extended prompt learning to multi-label recognition by learning dual prompts: a positive and a negative prompt, to detect the presence and absence of a class in the image, respectively. However, a closer analysis of VLMs reveals that they have been trained on paired image-caption datasets where captions describe the presence of objects rather than their absence. This raises questions about the quality of the guidance VLMs can provide for learning negative prompts in such approaches. To this end, we conduct a thorough empirical study to evaluate the contribution of VLM guidance in learning both positive and negative prompts.


Proposed Model

VLM-based MLR approaches like DualCoOp propose learning both positive and negative prompts using CLIP's guidance: one for class presence and one for class absence. In PositiveCoOp (NegativeCoOp), for a given class j, only the positive (negative) prompt t_{j,+} (t_{j,-}) is learned through CLIP, while the negative (positive) prompt is replaced by an embedding r_{j,-} (r_{j,+}) learned directly in the shared feature space, independent of CLIP's text encoder. For both PositiveCoOp and NegativeCoOp, the predictions p_{i,j,+} and p_{i,j,-} are obtained by computing the cosine similarity of the image features with the encoded prompt and the learned embedding (e.g., r_{j,+} and r_{j,-} for PositiveCoOp), and the resulting similarity maps are aggregated using the class-specific feature aggregation strategy following DualCoOp. Only the prompts and the learned embeddings are trained, using the widely used Asymmetric Loss.
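Below is a minimal PyTorch-style sketch of how the PositiveCoOp predictions could be formed, assuming pre-extracted CLIP spatial features and a frozen text encoder that accepts the learned context vectors; the softmax-weighted pooling is a simplified stand-in for DualCoOp's class-specific feature aggregation, and names such as PositiveCoOpHead and spatial_feats are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositiveCoOpHead(nn.Module):
    def __init__(self, num_classes, ctx_len, embed_dim, text_encoder):
        super().__init__()
        # Learnable positive prompt context per class, passed through CLIP's
        # (frozen) text encoder to obtain r_{j,+}.
        self.pos_ctx = nn.Parameter(0.02 * torch.randn(num_classes, ctx_len, embed_dim))
        # Learnable negative embedding r_{j,-} per class, defined directly in the
        # shared vision-text feature space (no text encoder involved).
        self.neg_embed = nn.Parameter(0.02 * torch.randn(num_classes, embed_dim))
        self.text_encoder = text_encoder  # assumed to map (C, L, D) contexts -> (C, D) embeddings
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # ~ln(100), as in CLIP

    def forward(self, spatial_feats):
        # spatial_feats: (B, HW, D) L2-normalized region features from CLIP's image encoder.
        r_pos = F.normalize(self.text_encoder(self.pos_ctx), dim=-1)  # (C, D)
        r_neg = F.normalize(self.neg_embed, dim=-1)                   # (C, D)

        sim_pos = spatial_feats @ r_pos.t()  # (B, HW, C) cosine similarity with r_{j,+}
        sim_neg = spatial_feats @ r_neg.t()  # (B, HW, C) cosine similarity with r_{j,-}

        # Simplified class-specific aggregation: softmax over regions, weighted sum.
        attn = sim_pos.softmax(dim=1)
        scale = self.logit_scale.exp()
        p_pos = scale * (attn * sim_pos).sum(dim=1)  # (B, C) presence logits p_{i,j,+}
        p_neg = scale * (attn * sim_neg).sum(dim=1)  # (B, C) absence logits p_{i,j,-}
        return p_pos, p_neg
```

The presence and absence logits p_{i,j,+} and p_{i,j,-} returned by this head would then be optimized on the observed labels with the Asymmetric Loss.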

Results and Analysis

A. The results of MLR with partial annotations on COCO and VOC2007 demonstrate that the performance of the prompting-based approaches follows the order: PositiveCoOp > DualCoOp ≈ Baseline > NegativeCoOp.

B. The comparison of training parameters and GPU hours for the existing methods with the Baseline, PositiveCoOp, and NegativeCoOp reveals that the Baseline uses significantly fewer parameters and GPU hours than all other setups, while PositiveCoOp and NegativeCoOp require about half the parameters compared to DualCoOp.

C. We compare the average similarity between pairs of positive prompts and pairs of positive and negative prompts across the 80 classes of the COCO dataset under two scenarios: (a) using only a single prompt template, and (b) using the 85 default ImageNet prompt templates. We observe that the positive-positive similarity is close to the positive-negative similarity, implying that CLIP projects positive and negative prompts very close together in the feature space, as sketched below.
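The following is a rough sketch of how such a prompt-similarity analysis could be reproduced with the open-source CLIP package; the hand-written templates ("a photo of a ..." / "a photo without a ...") and the truncated class list are illustrative assumptions rather than the exact prompts analyzed in the paper.

```python
import itertools
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Truncated for brevity; the analysis covers all 80 COCO classes in practice.
coco_classes = ["person", "bicycle", "car", "dog", "chair"]

@torch.no_grad()
def encode(texts):
    tokens = clip.tokenize(texts).to(device)
    return F.normalize(model.encode_text(tokens).float(), dim=-1)

pos = encode([f"a photo of a {c}" for c in coco_classes])       # presence prompts
neg = encode([f"a photo without a {c}" for c in coco_classes])  # absence prompts

# Average cosine similarity over distinct positive-positive pairs.
pp = torch.stack([pos[i] @ pos[j] for i, j in
                  itertools.combinations(range(len(coco_classes)), 2)]).mean()
# Average cosine similarity over all positive-negative pairs.
pn = (pos @ neg.t()).mean()

print(f"pos-pos similarity: {pp.item():.3f}   pos-neg similarity: {pn.item():.3f}")
```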
