March 31, 2026 · 6 min read

Beyond Sequential Distance: Rethinking Position Encoding in Multimodal Language Models

As Multimodal Large Language Models (MLLMs) evolve from single-turn question-answering systems into sophisticated agents capable of processing lengthy documents, extended video sequences, and complex interleaved conversations, a subtle but critical failure mode has emerged. Researchers from CASIA, UCAS, and the Tencent Hunyuan Team identify this phenomenon in their paper "Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding" as visual fading: the progressive degradation of attention to visual tokens as generated text sequences lengthen. This degradation is not merely a consequence of softmax dilution or capacity limitations; it stems from a fundamental architectural mismatch between how we encode positions in language and how we process persistent visual information.

The Sequential Bias in Multimodal Attention

To understand visual fading, we must examine the dominant position encoding mechanism in modern MLLMs: Multimodal Rotary Position Embedding (MRoPE). MRoPE extends standard RoPE by decomposing token positions into temporal, height, and width components, thereby unifying text and visual tokens within a single sequential coordinate system. While elegant, this unification introduces a problematic inductive bias.
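To make the decomposition concrete, here is an illustrative sketch of MRoPE-style position assignment. The helper name and the exact index layout are assumptions for illustration, not any specific model's implementation: text tokens repeat one index across all three axes, while image patches share a temporal slot and keep their 2D grid coordinates.

```python
# Illustrative MRoPE-style position assignment (a sketch, not an exact
# reproduction of any model): each token receives a (temporal, height,
# width) triple. Text tokens use the same index on all three axes; image
# patches share one temporal index and keep 2D spatial offsets.

def mrope_positions(text_prefix_len, img_h, img_w, text_suffix_len):
    positions = []
    # Leading text: (t, t, t)
    for t in range(text_prefix_len):
        positions.append((t, t, t))
    # Image patches: one temporal slot, 2D spatial coordinates.
    t_img = text_prefix_len
    for h in range(img_h):
        for w in range(img_w):
            positions.append((t_img, t_img + h, t_img + w))
    # Text after the image resumes past the largest index used so far.
    next_t = t_img + max(img_h, img_w)
    for t in range(next_t, next_t + text_suffix_len):
        positions.append((t, t, t))
    return positions

pos = mrope_positions(text_prefix_len=2, img_h=2, img_w=3, text_suffix_len=2)
print(pos)
```

The point of the triple is that all tokens still live on one shared temporal axis, which is exactly where the sequential bias enters.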

Standard RoPE possesses an inherent long-term decay property, where attention weights diminish as relative distances between tokens increase. For unimodal text, this property is beneficial; it encodes linguistic locality, ensuring that nearby words influence each other more strongly than distant ones. However, when applied to multimodal sequences, this decay penalizes inter-modal attention based purely on sequential distance. As the model generates text autoregressively, the relative distance between the initial visual tokens and the current text position grows monotonically. Consequently, MRoPE treats visual information as increasingly distant history, forcing the model to detach from visual constraints even when the image remains the primary subject of discourse.
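The decay can be seen in a toy calculation. If query and key are set to the same unit-magnitude vector, the rotary attention logit at relative distance d reduces to a sum of cosines over the frequency bands, which is maximal at d = 0 and shrinks as d grows. This is an illustration of the tendency, not a proof; real queries and keys are learned.

```python
import math

# Toy illustration of RoPE's long-term decay: with q and k equal and of
# unit magnitude per rotary pair, the attention logit at relative
# distance d is a sum of cosines over the frequency bands.

def rope_score(d, dim=64, base=10000.0):
    """Attention logit between identical q/k at relative distance d."""
    n_pairs = dim // 2
    return sum(math.cos(d * base ** (-2 * i / dim)) for i in range(n_pairs))

for d in [0, 1, 8, 64, 512]:
    print(f"distance {d:4d}: score {rope_score(d):7.2f}")
```

The score peaks at 32.0 (one per frequency pair) at zero distance and falls off as the rotation angles decorrelate, which is precisely the penalty visual tokens accumulate during long generations.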

The paper demonstrates this empirically through controlled probing experiments. As inter-modal distance increases, the proportion of attention allocated to visual tokens exhibits sharp, consistent decay. In practical terms, this means an MLLM analyzing a conference registration document might correctly extract the fee from the image in a short query, but hallucinate plausible-sounding fees based on parametric knowledge when asked to generate a lengthy analysis of the same document. The visual signal fades not because the information is absent, but because the positional geometry treats it as temporally remote.
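A toy simulation reproduces the qualitative shape of this effect (this is my own sketch under simplifying assumptions, not the paper's probing setup): visual tokens sit at the front of the sequence, the newest text query attends over everything, and the softmax share given to the visual block shrinks as the generated text lengthens.

```python
import math

# Toy simulation of visual fading: logits follow the identical-q/k RoPE
# decay curve, visual tokens occupy the sequence prefix, and we measure
# the softmax mass the newest text query assigns to them.

def rope_logit(d, dim=64, base=10000.0):
    return sum(math.cos(d * base ** (-2 * i / dim)) for i in range(dim // 2))

def visual_attention_share(n_visual, n_text):
    # Distances from the newest text query to each key.
    visual_d = [n_text - 1 + n_visual - i for i in range(n_visual)]
    text_d = [n_text - 1 - j for j in range(n_text)]  # includes d=0 (self)
    logits = [rope_logit(d) for d in visual_d + text_d]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return sum(weights[:n_visual]) / sum(weights)

for n_text in [8, 64, 512]:
    share = visual_attention_share(16, n_text)
    print(f"{n_text:4d} generated tokens -> visual share {share:.6f}")
```

The numbers themselves are meaningless; what matters is the trend: the visual share collapses as text length grows, even though the visual keys never change.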

Disentangling Perceptual Geometry

The proposed solution, Inter-Modal Distance Invariant Position Encoding (DIPE), addresses this by fundamentally rethinking how spatial and temporal relationships should be modeled across modalities. Rather than enforcing a unified distance metric, DIPE orthogonally decomposes the attention mechanism into two distinct pathways.

For intra-modal interactions, DIPE retains standard MRoPE. This preserves the essential structural properties of each modality: the local syntactic dependencies within text and the two-dimensional spatial relationships within images. A word should attend more strongly to its immediate neighbors, and pixels should maintain their geometric relationships to adjacent regions.

For inter-modal interactions, however, DIPE introduces a crucial modification: an anchored query mechanism that constrains the perceptual distance between text and visual tokens to remain constant, regardless of sequence length. By anchoring the position calculation such that visual tokens maintain a fixed perceptual proximity to the generation process, DIPE effectively neutralizes the distance-based penalty. The image remains "in front of" the model throughout generation, mirroring how human visual working memory maintains persistent reference to a scene during extended cognitive tasks.
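Schematically, the anchoring amounts to routing the relative-distance computation by modality pair. The constant offset and the routing function below are hypothetical simplifications; the paper's actual formulation operates inside the rotary embedding and may differ in detail.

```python
# Schematic sketch of DIPE's anchored-query idea (hypothetical
# simplification of the paper's mechanism): intra-modal pairs keep their
# true relative position, while text-to-visual pairs use a clamped,
# constant distance so the image never recedes as generation continues.

ANCHOR_DISTANCE = 1  # hypothetical fixed inter-modal offset

def relative_distance(query_pos, key_pos, query_is_text, key_is_visual):
    if query_is_text and key_is_visual:
        # Inter-modal: perceptual distance is invariant to sequence length.
        return ANCHOR_DISTANCE
    # Intra-modal: standard (M)RoPE relative distance.
    return query_pos - key_pos

# A text token at position 5 and one at position 5000 both "see" an image
# patch at position 2 from the same perceptual distance:
print(relative_distance(5, 2, True, True))         # anchored
print(relative_distance(5000, 2, True, True))      # still anchored
print(relative_distance(5000, 4999, True, False))  # text-to-text keeps locality
```

Because the routing only changes which relative distance feeds the rotary rotation, intra-modal behavior (and hence short-context performance) is left untouched.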

Notably, DIPE achieves this without introducing additional trainable parameters, and it maintains full compatibility with FlashAttention and KV cache optimization. This engineering efficiency suggests the mechanism could see rapid adoption; it modifies position calculation logic rather than model architecture, making it a drop-in replacement for existing MRoPE implementations.

Architectural Implications and Cognitive Parallels

The success of DIPE, yielding a 4.10% average accuracy improvement in long context scenarios without degrading short context performance, reveals a deeper insight about multimodal architecture design. Current MLLMs inherit assumptions from unimodal language modeling that may be inappropriate for perception. Text is inherently sequential and ephemeral; words flow past and recede in importance. Visual information, conversely, functions more like a persistent workspace or external memory. The paper’s findings suggest that forcing both modalities into a sequential distance framework imposes an artificial cognitive load analogous to human working memory degradation under high cognitive demands.

This observation connects to broader questions about how biological systems process cross-modal information. The authors briefly note that human visual attention does not recede as discourse lengthens; the image stays visually present. We might extend this analogy further. The human visual system employs distinct processing streams: the dorsal stream for spatial localization and the ventral stream for object recognition. Similarly, DIPE effectively creates specialized pathways for cross-modal versus intra-modal attention patterns. Just as the brain processes "where" and "what" through different circuits, MLLMs may require differentiated geometric frameworks for "how text relates to text" versus "how text relates to vision."

However, DIPE’s current formulation presents limitations worth examining. The anchored proximity assumption works well for static image understanding tasks like DocVQA, where the visual context remains constant throughout generation. Yet it remains unclear how this mechanism should adapt to dynamic scenarios involving multiple images, video sequences, or interleaved conversational turns where visual context legitimately changes over time. The current implementation assumes visual tokens occupy the beginning of the sequence; extending this to arbitrary interleaving patterns would require dynamic anchor updating or hierarchical visual position encodings.

Furthermore, the paper’s analysis reveals that DIPE restores visual attention primarily in shallow layers. This suggests that visual fading originates early in the processing hierarchy, not as a deep layer abstraction failure. Future work might investigate whether different anchoring strategies for different layer depths could yield further improvements, or whether visual tokens require distinct rotational frequencies in RoPE to better maintain their signal through deep networks.

Future Directions

The DIPE mechanism opens several avenues for future research. If inter-modal distance invariance proves beneficial for vision-language tasks, we might ask whether other modality pairings require similar treatments. Audio-language models, for instance, face analogous challenges where acoustic context should persist across lengthy transcriptions or analyses. Similarly, robotic foundation models processing continuous sensorimotor streams may need position encodings that distinguish between persistent environmental state and sequential action history.

More fundamentally, DIPE challenges the assumption that position encoding must be monolithic. Perhaps future architectures will employ entirely separate coordinate systems for different modalities, with learned cross-modal warping functions that dynamically adjust perceptual distance based on task requirements rather than maintaining fixed invariance. The 4.10% improvement from a relatively simple geometric adjustment suggests that we have only begun to explore how positional biases shape multimodal reasoning.

As MLLMs advance toward true long context understanding spanning thousands of tokens across multiple modalities, mechanisms like DIPE represent necessary evolutions in how we architect attention. The transition from sequential distance to perceptual proximity is not merely a technical fix for visual fading; it is a conceptual shift toward treating multimodal context as a persistent, structured workspace rather than a fading tape of history.