March 30, 2026 · 7 min read

Do VLMs Need Vision Transformers? A Controlled Analysis of State Space Models as Visual Backbones

The Tyranny of the Transformer in Multimodal Systems

Vision-Language Models have settled into a comfortable architectural monoculture. The standard recipe is deceptively simple: take a Vision Transformer pretrained on ImageNet, freeze it, attach a lightweight projection layer, and feed the resulting visual tokens into a Large Language Model. This design, popularized by LLaVA and its successors, has become so ubiquitous that we rarely question whether the vision encoder itself is optimal, or merely optimal for historical reasons. The prevailing assumption has been that scale and pretraining data matter more than architectural family; if your ViT is large enough and trained on enough images, it will serve the multimodal system adequately.

A new study by Kuo and Cascante-Bonilla challenges this assumption through rigorous controlled experimentation. In their paper, Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders, the authors systematically replace the vision backbone while holding every other component constant. Their findings suggest that State Space Models, specifically the VMamba architecture, may offer superior spatial reasoning capabilities compared to traditional Vision Transformers, particularly in localization and grounding tasks, while operating at substantially reduced computational scales.

The Methodological Necessity of Frozen Backbones

The central methodological contribution of this work lies not in novel architecture but in experimental hygiene. The authors adopt a LLaVA-style architecture consisting of a frozen vision encoder, a lightweight connector, and Vicuna-7B as the language model. Crucially, they keep the connector design, training data, optimization hyperparameters, and resolution settings identical across all experiments. By freezing the vision backbone during instruction tuning, they isolate architectural effects from the instabilities of joint multimodal optimization.
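The frozen-backbone setup described above can be sketched in a few lines. This is a toy NumPy stand-in, not the authors' code: the dimensions (576 visual tokens, encoder width 1024, a Vicuna-7B-like embedding width of 4096) and the names `frozen_encoder` and `W_proj` are illustrative assumptions, and random matrices stand in for real network weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 576 visual tokens (a 24x24 patch grid),
# encoder width 1024, LLM embedding width 4096 (Vicuna-7B-like).
N_TOKENS, D_VIS, D_LLM = 576, 1024, 4096

W_enc = rng.standard_normal((64, D_VIS)) * 0.02      # frozen backbone weights
W_proj = rng.standard_normal((D_VIS, D_LLM)) * 0.02  # trainable connector

def frozen_encoder(patches):
    """Stand-in for the frozen vision backbone (a ViT or VMamba checkpoint);
    its weights receive no gradient updates during instruction tuning."""
    return patches @ W_enc

patches = rng.standard_normal((N_TOKENS, 64))     # flattened image patches
visual_tokens = frozen_encoder(patches) @ W_proj  # shape (576, 4096)
# The projected tokens are concatenated with text embeddings and consumed
# by the language model; only the connector (and the LLM) are trained.
```

Because only `W_proj` changes during training, any downstream difference between backbones must come from the representations the frozen encoder emits, which is precisely the isolation the authors are after.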

This approach reveals how previous comparisons have been confounded. When vision encoders are finetuned alongside the rest of the system, differences in performance may stem from optimization dynamics rather than representational capacity. Some architectures may simply be easier to optimize under standard instruction-tuning recipes, masking true differences in the quality of visual representations. By freezing encoders initialized from matched ImageNet-1K checkpoints, the authors can finally pose a clean question: what does each architecture family actually deliver to the language model?

State Space Models and Spatial Reasoning

The specific SSM architecture evaluated is VMamba, which employs 2D-Selective-Scan (SS2D) mechanisms to process images. Unlike Vision Transformers, which rely on global self-attention with quadratic complexity relative to sequence length, VMamba builds representations through structured state-space updates implemented as multi-directional scans over the 2D grid. This structural difference has profound implications for how spatial information is encoded and transmitted to the language model.
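The scan-based alternative to attention can be made concrete with a toy example. The sketch below is an illustrative simplification, not VMamba's implementation: it shows the four SS2D-style traversal orders over a tiny grid and a fixed-coefficient linear recurrence, whereas real selective scans make the recurrence coefficients input-dependent and merge all four directional outputs after re-aligning them to grid positions.

```python
import numpy as np

# A 4x4 grid of token indices stands in for a feature map.
H, W = 4, 4
grid = np.arange(H * W).reshape(H, W)

# SS2D flattens the same grid along four directions; each 1D sequence is
# processed by a state-space recurrence and the outputs are later merged.
scan_paths = [
    grid.reshape(-1),          # row-major, forward
    grid.T.reshape(-1),        # column-major, forward
    grid.reshape(-1)[::-1],    # row-major, backward
    grid.T.reshape(-1)[::-1],  # column-major, backward
]

def linear_recurrence(x, a=0.9, b=0.1):
    """Toy fixed-coefficient recurrence h_t = a*h_{t-1} + b*x_t.
    One pass over the sequence, so cost is linear in its length;
    selective scans make a and b functions of the input."""
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return np.array(out)

out = linear_recurrence(scan_paths[0].astype(float))  # 16 steps, not 16^2
```

The key contrast with self-attention is visible in the loop: each token is visited once per scan direction, so the cost grows linearly with the number of tokens instead of quadratically.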

The results are striking. Under matched conditions, VMamba achieves the strongest overall performance across both Visual Question Answering and grounding/localization benchmarks. Perhaps more surprisingly, VMamba outperforms substantially larger ViT variants on visual grounding tasks despite exhibiting lower linear probing accuracy on ImageNet. This decoupling of classification accuracy from downstream multimodal performance suggests that the inductive biases of state space models, specifically their inherent handling of spatial locality through selective scanning, capture visual priors more efficiently than global self-attention for dense prediction tasks.
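For readers unfamiliar with the metric being decoupled here, a linear probe freezes the encoder and fits only a linear classifier on its features. The sketch below uses synthetic features and a closed-form least-squares fit to one-hot labels as a stand-in for the usual logistic-regression probe; everything in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: "frozen encoder features" and 4 class labels.
feats = rng.standard_normal((200, 32))
labels = rng.integers(0, 4, size=200)
onehot = np.eye(4)[labels]

# Fit only a linear map on top of the frozen features.
W, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
train_acc = ((feats @ W).argmax(axis=1) == labels).mean()
# A high probe accuracy certifies linearly separable class information,
# but says little about the fine-grained spatial detail grounding needs.
```

This is exactly why the decoupling is plausible: a probe scores globally pooled, class-discriminative structure, which is not the same resource a localization head draws on.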

The authors further explore whether these advantages persist after adaptation with dense task pretraining. When both SSM and ViT backbones undergo additional training on detection or segmentation objectives, performance improves across both families. However, the SSM backbone remains competitive while operating at a markedly smaller model scale. This efficiency gain is not merely about parameter count; it speaks to how effectively each architecture family converts visual input into tokens that retain fine-grained spatial relationships necessary for precise localization.

The Paradox of Scale and Pretraining Objectives

One of the most counterintuitive findings concerns the relationship between backbone capacity and VLM performance. The authors observe that higher ImageNet accuracy or larger backbone architectures do not reliably translate into better downstream VLM performance. This challenges the implicit assumption that vision encoders should be selected based on their performance on classification benchmarks or sheer parameter count.

Equally important is the analysis of pretraining objectives. The study reveals that dense-task tuning, whether detection or segmentation, generally improves VLM performance more than continued classification pretraining. This suggests that the visual representations most useful for language models are those trained to identify and localize objects precisely, rather than those optimized for global image categorization. For practitioners, this implies that the path to better multimodal systems may lie in diversifying pretraining objectives rather than simply scaling up classification models.

The paper also documents instability patterns in certain backbone-interface combinations, particularly under specific resolution and geometry settings. Some architectures exhibit collapse when visual token geometries change, requiring stabilization strategies that the authors propose and validate. These findings highlight that robustness to interface variations is an underappreciated property of vision encoders, one that becomes critical when deploying VLMs across diverse input resolutions or aspect ratios.
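The paper's specific stabilization strategies are not reproduced here, but one common ViT-pipeline mitigation for changed token geometry is resizing the positional-embedding grid. The nearest-neighbor version below is a minimal illustrative sketch (`resize_pos_embed` and the grid sizes are hypothetical); production code typically uses bilinear or bicubic interpolation.

```python
import numpy as np

def resize_pos_embed(pos, new_h, new_w):
    """Nearest-neighbor resize of an (H, W, D) positional-embedding grid,
    so a checkpoint trained at one token geometry can serve another."""
    h, w, _ = pos.shape
    rows = np.arange(new_h) * h // new_h   # map new rows onto old rows
    cols = np.arange(new_w) * w // new_w   # map new cols onto old cols
    return pos[rows][:, cols]

# Hypothetical example: a 16x16 grid stretched to 24x24 tokens.
pos = np.random.default_rng(2).standard_normal((16, 16, 64))
pos_24 = resize_pos_embed(pos, 24, 24)    # shape (24, 24, 64)
```

Scan-based encoders sidestep part of this problem because their notion of position comes from traversal order rather than a learned grid of embeddings, which may be one reason geometry changes hit some backbones harder than others.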

Architectural Implications and Efficiency Considerations

Beyond the specific empirical results, this work invites us to reconsider the computational geometry of visual representation for language models. Vision Transformers, despite their successes, impose memory and compute costs that scale quadratically with the number of visual tokens, and thus steeply with image resolution. As VLMs push toward higher resolution inputs to capture fine details, this complexity becomes a binding constraint. State space models offer sub-quadratic complexity while potentially encoding richer spatial information through their selective scan mechanisms.
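A back-of-the-envelope calculation makes the constraint concrete. Assuming a typical ViT patch size of 14 pixels (an assumption for illustration, not a figure from the paper), token count grows with the square of the input side length, attention cost with the square of the token count, and a scan-style mixer only linearly in tokens:

```python
def n_tokens(side, patch=14):
    """Visual tokens for a square input: one token per patch."""
    return (side // patch) ** 2

for side in (224, 448, 896):
    n = n_tokens(side)
    print(f"{side}px: {n} tokens, ~{n * n} attention pairs, ~{n} scan steps")
```

Going from 224px to 896px multiplies the token count by 16 but the attention-pair count by 256, which is why resolution scaling is so much more punishing for global self-attention than for a linear-cost scan.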

My own analysis suggests that the advantage of SSMs in this context stems from their native handling of sequential dependencies across spatial dimensions. While ViTs must learn positional embeddings and attention patterns to reconstruct spatial relationships from flattened patches, SSMs maintain structured state transitions that inherently preserve locality. This architectural prior aligns well with the needs of grounding tasks, where the model must reference specific image regions with high precision.

Furthermore, the finding that dense pretraining outperforms classification pretraining indicates a fundamental mismatch between how we typically train vision encoders and how they are used in multimodal systems. Classification objectives encourage invariance to spatial transformations and aggregation of global features, while grounding and VQA require the preservation of discriminative local details. SSMs appear to bridge this gap more naturally than transformers, possibly due to their selective mechanisms that can maintain fine-grained information across long sequences without the dilution effects of global averaging.

Limitations and Open Questions

The study is not without limitations. The experiments focus primarily on VMamba as the representative SSM architecture, leaving open whether other state space variants, such as MambaVision or hybrid approaches, would exhibit similar characteristics. Additionally, the analysis is conducted under the specific constraint of frozen backbones. While this enables controlled comparison, it does not address whether SSMs would maintain their advantages under joint finetuning regimes, where optimization dynamics might favor different architectural properties.

The evaluation, though comprehensive, focuses on standard academic benchmarks for VQA and grounding. How these findings translate to more complex reasoning tasks, video understanding, or long-document analysis remains an open question. The sub-quadratic complexity of SSMs suggests potential advantages for processing high-resolution video frames or extremely large images, but these scenarios are not explicitly tested in the current work.

Conclusion: Toward a Pluralistic Vision Architecture

The work of Kuo and Cascante-Bonilla serves as a necessary corrective to the architectural convergence in multimodal AI. By demonstrating that State Space Models can serve as strong vision encoders, particularly for spatially demanding tasks, they open avenues for efficiency and performance that do not rely on simply scaling up transformer backbones.

The broader implication is that the field may be approaching a bifurcation in vision encoder design. For applications requiring coarse semantic understanding, traditional ViTs may remain sufficient. However, for tasks demanding precise spatial reasoning, grounding, or operation under constrained compute budgets, SSM-based encoders present a compelling alternative that challenges the hegemony of the transformer.

Future research should explore hybrid architectures that combine the global context modeling of attention with the efficiency and spatial fidelity of state space mechanisms. Additionally, the development of pretraining objectives specifically designed for multimodal compatibility, rather than repurposing classification or detection models, may yield further gains. As we move toward more sophisticated vision-language systems, the question is no longer whether transformers are necessary, but rather which architectural primitives best serve the specific cognitive demands of joint visual and linguistic reasoning.
