Finding Distributed Object-Centric Properties in Self-Supervised Transformers

CVPR 2026 (Highlight)


Figure 1. Object-centric information is encoded in patch-level interactions. We visualize the inter-patch similarity maps computed from the Query (Q), Key (K), and Value (V) representations of patch tokens. Each component captures a complementary view of object structure. The Ensemble aggregates all three components, producing robust foreground-background separation and precise object localization.

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, resulting in poor localization. This occurs because the [CLS] token, trained on an image-level objective, summarizes the entire image rather than focusing strictly on objects, diluting the rich object-centric information present in local, patch-level interactions. We perform a systematic analysis of inter-patch similarity across all layers and attention components (query, key, and value) to uncover where object-centric information truly resides. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information to substantially improve downstream tasks such as unsupervised object discovery and hallucination mitigation in Multimodal Large Language Models (MLLMs).

Key Analyses & Insights

A core contribution of our work is to shift the focus away from the globally optimized [CLS] token and uncover how local patch-level interactions natively encode scene structure. Our systematic analysis yields two critical findings that redefine how we extract emergent object representations from self-supervised ViTs:

Insight 1: Complementary Signals in Query, Key, and Value

Prior patch-level methods (like TokenCut) relied exclusively on Key (K) features. By computing normalized self-similarity matrices for all attention components, we discover that Queries (Q), Keys (K), and Values (V) all possess strong but complementary localization properties.

  • Query similarity reveals which patches seek similar information.
  • Key similarity reveals which patches offer similar contextual cues.
  • Value similarity identifies patches with similar feature content.

Combining these into a unified Ensemble Similarity Matrix mitigates individual component artifacts, creating a highly robust, low-noise representation of object saliency.
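To make this concrete, here is a minimal sketch of an ensemble similarity computation for a single attention head, assuming the per-head Q, K, and V patch tokens are already extracted; the per-map min-max rescaling and simple averaging are illustrative choices, not necessarily the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def ensemble_similarity(q, k, v):
    """Combine Q-, K-, and V-based patch self-similarity for one head.

    q, k, v: (num_patches, head_dim) tensors of patch tokens,
    with the [CLS] token assumed to be excluded.
    Returns a (num_patches, num_patches) ensemble similarity matrix.
    """
    maps = []
    for feats in (q, k, v):
        feats = F.normalize(feats, dim=-1)   # unit-norm rows
        sim = feats @ feats.T                # cosine self-similarity
        # Min-max rescale each map so no single component dominates.
        sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
        maps.append(sim)
    return torch.stack(maps).mean(dim=0)     # average the Q/K/V maps
```

Rescaling each map to a common range before averaging is one plausible way to keep any single component's artifacts from dominating the ensemble.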

Insight 2: Object-Centricity is a Distributed Property

It is a common assumption that the strongest object representations exist exclusively in the final layer of a ViT. By characterizing every attention head across the network using our ensemble similarity matrix, we prove that object-centric information is hierarchically distributed.


Figure 2. Analysis of the object-centric head distribution. (a) The heatmap demonstrates that numerous heads in intermediate layers (e.g., Layers 8-10) are consistently selected as object-centric, proving it is a distributed phenomenon. (b) The final layer contribution plot reveals several "noisy" (non-object-centric) heads. (c) The histogram shows the number of strongly active heads per layer.

Our analysis of over 4,000 COCO images reveals that:
1. Not all final-layer heads are object-centric. Several heads in the final layer encode global context rather than objects. Naively aggregating them introduces significant noise.
2. Intermediate layers matter. Numerous highly robust object-centric heads exist in intermediate layers. Ignoring them discards valuable localization data.
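To give a sense of how such a per-head census can be run, the sketch below captures per-head Q, K, and V from every block of a DINO-style ViT using forward hooks. It assumes a timm-style implementation in which each block computes attention through a fused `attn.qkv` linear layer; the module paths and tensor shapes are assumptions about the backbone, not details from the paper.

```python
import torch

def collect_per_head_qkv(vit, image):
    """Capture per-head Q, K, V from every block of a DINO-style ViT.

    Assumes a timm-style backbone where each block computes attention
    through a fused `attn.qkv` linear layer. Returns one tensor of shape
    (3, heads, tokens, head_dim) per block, in forward order.
    """
    captured, hooks = [], []

    def make_hook(num_heads):
        def hook(module, inputs, output):
            B, N, three_dim = output.shape          # (batch, tokens, 3*dim)
            d = three_dim // (3 * num_heads)
            qkv = output.reshape(B, N, 3, num_heads, d)
            captured.append(qkv[0].permute(1, 2, 0, 3))  # (3, H, N, d)
        return hook

    for blk in vit.blocks:
        hooks.append(blk.attn.qkv.register_forward_hook(
            make_hook(blk.attn.num_heads)))
    with torch.no_grad():
        vit(image.unsqueeze(0))   # image: (C, H, W), single forward pass
    for h in hooks:
        h.remove()
    return captured
```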

The Object-DINO Method

Motivated by our analysis, we introduce Object-DINO, an entirely training-free algorithm designed to automatically discover and extract the distributed set of object-centric heads. The process operates in two phases:

  • Phase 1: Feature Extraction. For every attention head across all layers, we compute the patch similarity maps from its Q, K, and V representations. We ensemble and flatten these maps to create a feature vector that describes that specific head's localization behavior.
  • Phase 2: Head Clustering & Selection. We cluster all heads based on these behavioral features using k-means. Guided by the empirical observation that object-centric information concentrates heavily toward the end of the network, we automatically identify the "Object Cluster" as the one containing the highest prevalence of final-layer heads.

By aggregating the similarity maps from only this specialized, distributed subset of heads, Object-DINO produces high-fidelity foreground-background separation while filtering out non-object noise.
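A minimal sketch of the two phases follows, building on the per-head ensemble maps from the earlier sketch; flattening the raw maps as behavioral features and the choice of k for k-means are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_object_heads(head_maps, layer_ids, num_layers, k=3):
    """Cluster heads by similarity behavior; return the "Object Cluster".

    head_maps: one (N, N) ensemble similarity map (numpy) per head
    layer_ids: layer index of each head
    Returns the indices of heads in the cluster with the highest
    prevalence of final-layer heads.
    """
    # Phase 1: flatten each head's map into a behavioral feature vector.
    feats = np.stack([m.flatten() for m in head_maps])

    # Phase 2: cluster heads, then pick the cluster whose members are
    # most often final-layer heads.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    layer_ids = np.asarray(layer_ids)
    best, best_frac = 0, -1.0
    for c in range(k):
        members = labels == c
        if not members.any():
            continue
        frac = (layer_ids[members] == num_layers - 1).mean()
        if frac > best_frac:
            best, best_frac = c, frac
    return np.where(labels == best)[0]
```

Selecting the cluster with the highest fraction of final-layer heads operationalizes the observation that object-centricity concentrates toward, but is not confined to, the end of the network.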


Figure 3. Object-DINO Overview. Our training-free algorithm computes patch similarity maps from Query, Key, and Value representations, ensembles them, and clusters the heads to automatically identify the distributed object-centric cluster, filtering noise from non-object-centric heads.

Applications & Downstream Results

1. Unsupervised Object Discovery

We integrate Object-DINO into the graph-based TokenCut framework, replacing their naive final-layer Key aggregation with our automatically discovered distributed heads. This single, training-free swap yields substantial performance gains across all benchmarks for both DINO-v2 and DINO-v3 models. Specifically, we observe CorLoc improvements ranging from +3.6 to +12.4 points on PASCAL VOC and COCO 20k.
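For intuition, a TokenCut-style normalized cut could consume the aggregated similarity map as sketched below; the edge threshold `tau` and the mean split of the Fiedler vector follow the usual TokenCut recipe, but the constants here are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def bipartition(sim, tau=0.2):
    """Foreground/background split via a normalized cut, TokenCut-style.

    sim: (N, N) aggregated similarity matrix from the selected heads.
    Returns a boolean foreground mask over the N patches.
    """
    # Binarize edges: keep only sufficiently similar patch pairs.
    W = np.where(sim > tau, 1.0, 1e-5)
    D = np.diag(W.sum(axis=1))
    # Second-smallest generalized eigenvector (Fiedler vector) of
    # (D - W) x = lambda * D x.
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    fiedler = vecs[:, 0]
    mask = fiedler > fiedler.mean()      # mean split, as in TokenCut
    # Orient the cut: the side with the strongest eigenvector response
    # is taken as foreground.
    if not mask[np.argmax(np.abs(fiedler))]:
        mask = ~mask
    return mask
```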

2. Mitigating Object Hallucinations in MLLMs

Multimodal Large Language Models (MLLMs) frequently hallucinate objects not present in the image. We leverage Object-DINO as an open-set visual grounder. We propose a training-free, dual-branch decoding strategy: a standard branch and a visual guidance branch steered by the Object-DINO saliency map. By amplifying tokens consistent with this explicit visual evidence, our method corrects hallucinations and achieves state-of-the-art Precision and F1 scores on the POPE and CHAIR benchmarks, without the heavy computational overhead of multi-pass diffusion methods.
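One plausible instantiation of the dual-branch rule is sketched below; the `mllm(...)` interface, the saliency-masked second pass, and the mixing coefficient `alpha` are hypothetical stand-ins for illustration, not the exact implementation:

```python
import torch

def guided_decode_step(mllm, tokens, image, saliency, alpha=1.0):
    """One greedy step of dual-branch decoding.

    `mllm(tokens, image)` returning next-token logits is a hypothetical
    interface standing in for a concrete MLLM. `saliency` is the
    Object-DINO map, broadcastable over the image tensor.
    """
    logits_std = mllm(tokens, image)              # standard branch
    logits_vis = mllm(tokens, image * saliency)   # visual guidance branch
    # Amplify tokens the visually grounded branch prefers over the
    # standard branch, pushing mass toward image-supported words.
    logits = logits_vis + alpha * (logits_vis - logits_std)
    return torch.argmax(logits, dim=-1)
```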

Acknowledgements

The website template was inspired by Michaël Gharbi and Ref-NeRF.