Accepted at ECCV 2026

DnA: Denoising Attention for Visual Tasks

Ron Campos¹, Subhajit Maity¹, Xin Li¹, Srijan Das², Aritra Dutta¹

¹University of Central Florida   ²University of North Carolina at Charlotte

Abstract

The softmax activation in multihead attention (MHA) is the de facto standard for attention-based models in visual perception tasks. However, standard softmax can produce noisy attention patterns that dilute relevant features and degrade model performance. In this paper, we propose Denoising Attention (DnA). First, a positive query identifies which image features belong to the correct class, while a negative query identifies closely associated but irrelevant image features. DnA then projects these interactions into two distinct subspaces with larger principal angles, promoting subspace separation and improved discriminability. Using a ViT-B backbone, DnA achieves a 0.8% absolute gain on ImageNet-1K over the baseline. We further show improvements across multiple visual understanding tasks, including video understanding with video transformers (1.8%) and video LLMs (0.5%). Our extensive empirical analyses justify the design choices involving two interacting subspaces and the denoising effect of DnA.

Teaser comparing vanilla attention with DnA positive and negative subspaces.

In this image from ImageNet-1K, the correct class (breastplate) is surrounded by closely associated secondary objects such as a person and a helmet. Traditional softmax attention projects all of these key–query interactions onto a single value subspace \(V\), so attention is misallocated to the secondary objects rather than the relevant one.

The reason is geometric: beyond score magnitudes, the geometry of feature subspaces affects discriminability, and larger principal angles between class subspaces reduce classification error. This motivates modeling positive and negative token interactions as a two-class problem, rather than collapsing them into one subspace.
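Principal angles between two subspaces can be computed from orthonormal bases of each. The following is a minimal NumPy sketch, assuming full-column-rank inputs; the function name `principal_angles` and the toy vectors are illustrative and do not come from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    # Orthonormal bases for each subspace via reduced QR (assumes full column rank).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # The singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy example: two lines in the plane that are 90 degrees apart.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
angles_orth = principal_angles(e1, e2)

# The same line, described by a rescaled basis vector.
angles_same = principal_angles(e1, 2.0 * e1)

print(np.degrees(angles_orth), np.degrees(angles_same))
```

Larger angles mean the two subspaces share less, which is the property the argument above ties to lower two-class error.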

Our proposed DnA does exactly this. In contrast to differential attention, which models the attention matrix with a single subspace \(V\), DnA introduces two queries over the same keys: a positive query \(Q^+\) (which features belong to the correct class?) and a negative query \(Q^-\) (which features do not belong to the correct class but are closely associated with it?). Their interactions are projected into two separate subspaces, \(V^+\) and \(V^-\), with larger principal angles between them, segregating the closely associated adversarial features from the relevant ones and improving discriminability.

Method

Denoising Attention

Let \(A_h = Q_hK_h^\top/\sqrt{d} \in \mathbb{R}^{N \times N}\) be the pre-softmax attention score matrix for the \(h^\text{th}\) head in a transformer block. The softmax activation \(\boldsymbol{\sigma}\) normalizes each row, producing the standard output \(\boldsymbol{\sigma}(A_h)V_h\). This emphasizes dominant positive interactions, but strongly negative interactions are suppressed even when they encode informative contrastive structure.
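To make the suppression concrete, here is a small NumPy sketch of single-head softmax attention and of what happens to a strongly negative score in one row. The function names and the toy scores are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(Q, K, V):
    """Standard single-head output: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d)          # pre-softmax score matrix, shape (N, N)
    return softmax(A) @ V

# One row of pre-softmax scores: two positive entries, one strongly negative.
scores = np.array([[2.0, 1.0, -4.0]])
row_weights = softmax(scores)
print(np.round(row_weights, 3))
```

The strongly negative entry receives a weight of roughly 0.002, so whatever its value vector carries barely reaches the output.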

Instead of one query and one value projection, DnA uses two paired query–value branches: the positive branch asks which tokens support the target class, the negative branch asks which tokens are visually associated but irrelevant. With a learnable head-wise scaling \(\alpha_h\), the denoising attention for head \(h\) is:

\begin{equation*} \mathcal{A}^{Q^{\pm}V^{\pm}}_h = \boldsymbol{\sigma}\left(\frac{Q_h^{+} K_h^\top}{\sqrt{d}}\right)V_h^{+}+\alpha_h\boldsymbol{\hat{\sigma}}\left(\frac{Q_h^{-} K_h^\top}{\sqrt{d}}\right)V_h^{-}. \end{equation*}

Here \(Q_h^+\) and \(Q_h^-\) are separate positive and negative queries, \(K_h\) is shared, and \(V_h^+\), \(V_h^-\) define separate value subspaces. The negative branch uses the softmin \(\boldsymbol{\hat{\sigma}}(z) = \boldsymbol{\sigma}(-z)\), the principled mirror of softmax: it maximizes Shannon entropy under a mean constraint of opposite sign, so it emphasizes exactly the strongly negative interactions that softmax suppresses.
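The equation for one head can be sketched in NumPy as follows. This is an illustrative reading, not the authors' implementation: it assumes that both queries, the shared key, and both values are bias-free linear projections of the same input tokens `X`, and every name here (`dna_head`, `Wq_pos`, `Wv_neg`, the toy sizes) is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmin(z, axis=-1):
    """Softmin as defined in the text: softmax of the negated scores."""
    return softmax(-z, axis=axis)

def dna_head(X, Wq_pos, Wq_neg, Wk, Wv_pos, Wv_neg, alpha):
    """softmax(Q+ K^T / sqrt(d)) V+  +  alpha * softmin(Q- K^T / sqrt(d)) V-."""
    d = Wk.shape[1]                               # per-head dimension
    K = X @ Wk                                    # keys shared by both branches
    pos_scores = (X @ Wq_pos) @ K.T / np.sqrt(d)  # positive-query scores
    neg_scores = (X @ Wq_neg) @ K.T / np.sqrt(d)  # negative-query scores
    pos_branch = softmax(pos_scores) @ (X @ Wv_pos)
    neg_branch = softmin(neg_scores) @ (X @ Wv_neg)
    return pos_branch + alpha * neg_branch

rng = np.random.default_rng(0)
N, D, d = 5, 16, 8                                # tokens, model width, head width (toy sizes)
X = rng.standard_normal((N, D))
Wq_pos, Wq_neg, Wk, Wv_pos, Wv_neg = (
    rng.standard_normal((D, d)) / np.sqrt(D) for _ in range(5)
)

out = dna_head(X, Wq_pos, Wq_neg, Wk, Wv_pos, Wv_neg, alpha=0.5)
print(out.shape)
```

In this sketch, setting `alpha` to zero recovers ordinary softmax attention with \(Q_h^+\) and \(V_h^+\), which is one way to see the negative branch as an additive correction.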

DnA architecture diagram with positive and negative branches.

Separate Value Subspaces

DnA uses two value subspaces, \(V_h^+\) and \(V_h^-\), rather than a single shared \(V\). Unlike differential attention, which subtracts two branches in one value space and can cancel useful signal along with shared noise, separate subspaces keep the branches apart. Pushing them toward larger principal angles lowers the two-class classification error, sharpening discriminability rather than only preventing cancellation.
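For comparison, differential attention for one head can be written, as we understand its original formulation, in the following form. The branch indices \(1, 2\) and the scalar \(\lambda\) are notation assumed here rather than taken from this page.

\begin{equation*} \left[\boldsymbol{\sigma}\left(\frac{Q_{h,1} K_{h,1}^\top}{\sqrt{d}}\right)-\lambda\,\boldsymbol{\sigma}\left(\frac{Q_{h,2} K_{h,2}^\top}{\sqrt{d}}\right)\right]V_h. \end{equation*}

Both attention maps multiply the same \(V_h\), so wherever the two maps agree their contributions cancel as \(\lambda\) approaches 1, regardless of whether the shared content is noise or signal. In DnA the two maps multiply \(V_h^+\) and \(V_h^-\) respectively, so the amount of cancellation depends on how much the two value subspaces overlap.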

Minimal Architectural Change

DnA modifies only the attention operation and leaves the rest of the transformer block intact. It works as a drop-in replacement across settings: it replaces self-attention in ViT-B, both spatial and temporal attention in TimeSformer, and the cross-attention in the VisCoP adapter, which probes video LLMs.

Quantitative Results

Image Classification

On ImageNet-1K, DnA reaches 81.9% Top-1 accuracy on both validation and test, outperforming the baseline and competing attention variants. Inference throughput is unchanged at 909.1 img/s, while parameters rise from 86.6M to 100.7M and GFLOPs from 17.6 to 21.1.

| Model | Val. Acc.@1 | Val. Acc.@5 | Test Acc.@1 | Test Acc.@5 | Params | GFLOPs | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | 81.1 | 95.6 | 81.1 | 95.6 | 86.6M | 17.6 | 909.1 img/s |
| + Differential Attention | 81.4 (↑0.3) | 95.7 (↑0.1) | 81.5 (↑0.4) | 95.6 | 86.6M | 17.6 | 909.1 img/s |
| + Cog Attention | 81.4 (↑0.3) | 95.7 (↑0.1) | 81.5 (↑0.4) | 95.7 (↑0.1) | 86.6M | 17.6 | 909.1 img/s |
| + DnA | 81.9 (↑0.8) | 95.9 (↑0.3) | 81.9 (↑0.8) | 95.8 (↑0.2) | 100.7M | 21.1 | 909.1 img/s |
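As a plausibility check on the parameter column, the gap between 86.6M and 100.7M can be compared with a back-of-envelope estimate. The sketch below assumes standard ViT-B hyperparameters (width 768, 12 layers, 12 heads) and assumes that DnA adds one extra query projection and one extra value projection with bias per layer, plus one scalar \(\alpha_h\) per head. None of these details are stated on this page, so treat the result as a rough consistency check only.

```python
# Assumed ViT-B hyperparameters (standard values, not stated on this page).
embed_dim = 768
num_layers = 12
num_heads = 12

# Parameters of one extra embed_dim x embed_dim linear projection with bias.
proj_params = embed_dim * embed_dim + embed_dim

# Assumption: each layer gains one query and one value projection,
# plus one learnable scalar alpha per head.
extra_per_layer = 2 * proj_params + num_heads
extra_total = num_layers * extra_per_layer

# Gap between the "+ DnA" and "ViT-B" rows of the table, in millions.
table_gap_millions = 100.7 - 86.6

print(f"estimated extra parameters: {extra_total / 1e6:.2f}M")
print(f"gap reported in the table:  {table_gap_millions:.1f}M")
```

Under these assumptions the estimate is about 14.17M against a reported gap of 14.1M, which agrees to within the rounding of the table entries.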

Video Transformer

Replacing TimeSformer's spatial and temporal attention with DnA improves performance on Toyota Smarthome and NTU RGB+D 60.

| Model | Toyota CS | Toyota CV2 | NTU60 CS | NTU60 CV |
| --- | --- | --- | --- | --- |
| TimeSformer | 67.5 | 59.5 | 81.2 | 88.6 |
| + Differential Attention | 66.6 | 59.4 | 81.5 | 89.1 |
| + DnA | 68.8 | 63.5 | 82.4 | 89.2 |
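The abstract quotes a 1.8% improvement for video transformers. One reading, which is an assumption on our part, is that this is the mean gain of DnA over TimeSformer across the four splits in the table. The arithmetic below checks that reading using only the table values.

```python
# Accuracies copied from the table above.
baseline = {"Toyota CS": 67.5, "Toyota CV2": 59.5, "NTU60 CS": 81.2, "NTU60 CV": 88.6}
dna = {"Toyota CS": 68.8, "Toyota CV2": 63.5, "NTU60 CS": 82.4, "NTU60 CV": 89.2}

# Per-split gain, rounded to the table's one-decimal precision.
gains = {split: round(dna[split] - baseline[split], 1) for split in baseline}

# Unweighted mean over the four splits (assumed aggregation).
mean_gain = sum(gains.values()) / len(gains)

print(gains)
print(f"mean gain: {mean_gain:.3f}")
```

The per-split gains are 1.3, 4.0, 1.2, and 0.6 points, so most of the average comes from the Toyota Smarthome cross-view split.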

Video LLM

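Adding DnA to the VisCoP adapter improves VideoLLaMA3-style egocentric video understanding on Ego-in-Exo PerceptionMCQ, raising average accuracy from 78.1 to 78.6. The gains come from the Act. and Task categories, while HOI and Hand are essentially unchanged.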

| Model | Act. | Task | HOI | Hand | Avg. |
| --- | --- | --- | --- | --- | --- |
| VideoLLaMA3 | 74.9 | 75.8 | 75.2 | 65.3 | 72.8 |
| VisCoP | 81.8 | 86.1 | 79.3 | 65.1 | 78.1 |
| + DnA | 83.1 | 87.1 | 79.2 | 65.1 | 78.6 |

Understanding the Attention Subspaces

DnA is designed around the idea that positive and negative token interactions should not collapse into the same representational space. The analyses below test whether the two branches actually behave as separate subspaces after training.

We summarize two measurements from the paper: intruder dimensions between branch outputs, and the normalized trace similarity between \(V^+\) and \(V^-\). Together, they show whether DnA is separating useful and distracting interactions rather than learning two nearly identical signals.

Intruder dimension counts comparing DnA and differential attention.

(a) Intruder Dimension Analysis

To see how DnA's two branches relate, we compare them against the two branches of the differential transformer. We take the SVD of each branch's output and, among the top-k left singular vectors, count the intruders: directions whose cosine similarity to the other branch's singular vectors falls below a threshold, meaning the two branches span near-orthogonal directions. More intruders indicate better-separated subspaces, which implies lower classification error. DnA's intruder counts peak around 8 to 9, compared with 6 to 7 for differential attention, allowing it to suppress noise where differential attention's overlapping branches risk discarding real signal.
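The counting procedure can be sketched as below. This is one plausible reading rather than the paper's exact protocol: it treats a top-k left singular vector of one branch as an intruder when its largest absolute cosine similarity to any top-k left singular vector of the other branch is below the threshold. The function name, `k`, and `threshold` values are hypothetical.

```python
import numpy as np

def count_intruders(out_a, out_b, k=10, threshold=0.5):
    """Count top-k left singular vectors of out_a with no close match in out_b."""
    Ua, _, _ = np.linalg.svd(out_a, full_matrices=False)
    Ub, _, _ = np.linalg.svd(out_b, full_matrices=False)
    Ua, Ub = Ua[:, :k], Ub[:, :k]
    # Singular vectors are unit norm, so dot products are cosine similarities.
    cos = np.abs(Ua.T @ Ub)
    # An intruder has even its best match in the other branch below the threshold.
    return int(np.sum(cos.max(axis=1) < threshold))

rng = np.random.default_rng(0)
basis = np.eye(6)
# Branch outputs living in disjoint coordinate subspaces of R^6 (toy data).
branch_a = basis[:, :3] @ rng.standard_normal((3, 4))
branch_b = basis[:, 3:] @ rng.standard_normal((3, 4))

separated = count_intruders(branch_a, branch_b, k=3)
identical = count_intruders(branch_a, branch_a, k=3)
print(separated, identical)
```

In the toy data, orthogonal branches give the maximum count of `k` and identical branches give zero, which are the two extremes between which the reported counts fall.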

Distribution of normalized trace similarities between DnA value subspaces.

(b) Similarity between \(V^+\) and \(V^-\)

We use the normalized Frobenius inner product to measure how aligned the two value subspaces are, where 1 means fully aligned and 0 means orthogonal. Plotting it for every sample, head, and layer on the ImageNet-1K validation set gives 7.2M values. The mean is 0.22 and the median is 0.18: \(V^+\) and \(V^-\) are quasi-orthogonal for most heads, with most of the mass near 0. The same separation appears in the outputs, where the cosine similarity between DnA's two branch outputs is just 0.32 against 0.96 for differential attention, whose branches share a single value space. The two branches of DnA learn different signals rather than near-identical ones that cancel.
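A minimal sketch of this similarity measure follows. It assumes the score is the absolute Frobenius inner product \(\mathrm{tr}({V^+}^\top V^-)\) divided by the product of the two Frobenius norms, which matches the stated range from 0 to 1. The function name is hypothetical. The last lines check the 7.2M count under the assumption of 50,000 validation images and a ViT-B with 12 layers of 12 heads.

```python
import numpy as np

def normalized_trace_similarity(V_pos, V_neg):
    """|tr(V+^T V-)| / (||V+||_F * ||V-||_F): 1 is fully aligned, 0 is orthogonal."""
    inner = np.trace(V_pos.T @ V_neg)              # Frobenius inner product
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return abs(inner) / (np.linalg.norm(V_pos) * np.linalg.norm(V_neg))

# Toy value matrices: disjoint supports are orthogonal, rescaled copies are aligned.
V_a = np.array([[1.0, 0.0], [0.0, 0.0]])
V_b = np.array([[0.0, 0.0], [0.0, 1.0]])

sim_orthogonal = normalized_trace_similarity(V_a, V_b)
sim_aligned = normalized_trace_similarity(V_a, 3.0 * V_a)

# One value per (sample, layer, head); the sizes below are assumptions.
num_values = 50_000 * 12 * 12
print(sim_orthogonal, sim_aligned, num_values)
```

On this scale, the reported mean of 0.22 and median of 0.18 sit much closer to the orthogonal end than to the aligned end.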

Qualitative Results

Softmax attention is densely spread around the image, with limited focus on the class of interest. Differential attention shows improvement but retains substantial background noise. In DnA, the positive activations stay focused on the object, while the negative branch \(\mathcal{A}^{Q^-V^-}_h\) suppresses irrelevant background areas, creating a clear contrast between class features and noisy distractors.

The negative branch (blue) denoises the background semantics, which in turn sharpens the attention from the positive branch (red). This is most evident where the object sits among visually prominent distractors, producing cleaner attention than both ViT-B and differential attention.

Attention for ViT-B under softmax, differential, and DnA attention: red marks positive activations and blue marks the negative branch.

BibTeX

@inproceedings{campos2026dna,
  title     = {DnA: Denoising Attention for Visual Tasks},
  author    = {Campos, Ron and Maity, Subhajit and Li, Xin and Das, Srijan and Dutta, Aritra},
  booktitle = {Proceedings of the European Conference on Computer Vision},
  year      = {2026}
}