In this ImageNet-1K image, the correct class (breastplate) is surrounded by closely associated secondary objects such as a person and a helmet. Traditional softmax attention projects all of these key–query interactions onto a single value subspace \(V\), so attention is misallocated to the secondary objects instead of the relevant one.
The reason is geometric: beyond score magnitudes, the geometry of feature subspaces affects discriminability, and larger principal angles between class subspaces reduce classification error. This motivates modeling positive and negative token interactions as a two-class problem, rather than collapsing them into one subspace.
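To make the notion of principal angles concrete, the following is a minimal NumPy sketch of the standard computation: orthonormalize a basis for each subspace, then take the arccosine of the singular values of the product of the two bases. The function name `principal_angles` and the toy subspaces are illustrative assumptions and are not taken from the method itself.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Assumes A and B have full column rank and the same number of rows.
    """
    # Orthonormal bases for the two subspaces (reduced QR).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off slightly outside [0, 1].
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Toy example in R^3: span{e1, e2} versus span{e1, e3}.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
angles = principal_angles(A, B)
```

In this toy case the two planes share the direction \(e_1\), so the smallest principal angle is zero, while the remaining directions are orthogonal, so the largest is \(\pi/2\). Subspaces that are better separated have larger angles throughout.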
Our proposed DnA does exactly this. In contrast to differential attention, which models the attention matrix with a single subspace \(V\), DnA introduces two queries over the same keys: a positive query \(Q^+\), which asks which features belong to the correct class, and a negative query \(Q^-\), which asks which features do not belong to the correct class but are closely associated with it. Their interactions are projected into two separate subspaces, \(V^+\) and \(V^-\), with larger principal angles between them, which segregates the closely associated adversarial features from the relevant ones and improves discriminability.
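The two-query structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes single-head scaled dot-product attention with a row-wise softmax, and the function name `dual_query_attention` and all weight names are hypothetical. The text does not say how the positive and negative branches are combined, so the sketch returns them separately.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_query_attention(X, Wq_pos, Wq_neg, Wk, Wv_pos, Wv_neg):
    """Hypothetical sketch: two queries over shared keys, two value projections.

    X has shape (n_tokens, d_model). All weight matrices are assumptions
    introduced for illustration.
    """
    K = X @ Wk                      # shared keys for both queries
    scale = np.sqrt(K.shape[-1])    # assumed scaled dot-product scores
    A_pos = softmax((X @ Wq_pos) @ K.T / scale)  # positive-query attention
    A_neg = softmax((X @ Wq_neg) @ K.T / scale)  # negative-query attention
    out_pos = A_pos @ (X @ Wv_pos)  # projected into the V+ subspace
    out_neg = A_neg @ (X @ Wv_neg)  # projected into the V- subspace
    # The combination rule is not specified here, so both branches are returned.
    return out_pos, out_neg, A_pos, A_neg
```

Keeping `Wv_pos` and `Wv_neg` as separate matrices is what lets the two branches occupy different subspaces. Any constraint or loss that encourages large principal angles between them would be an additional design choice outside this sketch.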