March 28, 2026 · 5 min read

Rethinking Action Spaces in Vision-Language-Action Models: The Case for Structured Decomposition in Active Manipulation

Introduction

Current Vision-Language-Action (VLA) models have achieved impressive results on manipulation tasks, yet they operate under a critical constraint that limits their deployment in complex environments. Most existing models, including recent heavyweights like GR00T N1 and π0, are trained on fixed, near-optimal head-camera views. This design choice, while simplifying the learning problem, creates fundamental brittleness. When viewpoints shift or occlusions occur, these systems fail to adapt because they lack the capacity for active perception. The ability to strategically move the camera to reveal task-critical information, then immediately ground those new observations into motor actions, remains a significant gap in robotic learning.

The paper "SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics" addresses this gap by challenging a deeply held assumption in the VLA community: that camera control and manipulation actions must share a unified action space and be trained jointly end-to-end. Instead, the authors propose a framework that decouples these modalities and employs a bottom-up training strategy, achieving substantial gains in both simulation and real-world deployment.

The Architectural Bet: Decoupling Perception from Action

SaPaVe (Semantic active Perception and active-View execution) rests on a simple but powerful insight: camera movement is embodiment-agnostic, while manipulation is deeply embodiment-specific. The logic of shifting viewpoint to see inside a cabinet or around an obstacle generalizes across different robot morphologies, whereas motor control requires precise calibration between visual input and end-effector dynamics. By separating these streams into a decoupled action space, SaPaVe allows each to benefit from different data sources and learning curricula.
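The paper does not publish its interfaces, but the idea of a decoupled action space can be made concrete with a small sketch. All names here (`CameraAction`, `ManipAction`, `DecoupledAction`) are hypothetical illustrations, not the authors' code; the point is that the camera stream carries no embodiment-specific state, while the manipulation stream does:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CameraAction:
    # Embodiment-agnostic: a viewpoint change expressed in the camera's own frame.
    pan: float    # radians
    tilt: float   # radians
    dolly: float  # metres along the optical axis

@dataclass
class ManipAction:
    # Embodiment-specific: a joint-space command for one particular arm.
    joint_deltas: List[float]
    gripper: float  # 0.0 = open, 1.0 = closed

@dataclass
class DecoupledAction:
    # The two streams stay separate rather than interleaved in one vector,
    # so each head can be trained on different data sources and curricula.
    camera: CameraAction
    manip: ManipAction

# A camera policy trained on one robot can be reused on another: only the
# ManipAction schema depends on the embodiment (here, a 7-DoF arm).
act = DecoupledAction(
    camera=CameraAction(pan=0.1, tilt=-0.05, dolly=0.0),
    manip=ManipAction(joint_deltas=[0.0] * 7, gripper=0.0),
)
```

Under this framing, swapping robot morphologies means replacing only the manipulation schema and its policy head, while the perception stream transfers unchanged.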

This stands in stark contrast to the prevailing paradigm. Most VLA models employ a shared action space where camera movements and manipulator commands are interleaved in a single token stream or vector output. The authors argue that this forces the model to simultaneously learn information-seeking behavior (where should I look?) and goal-achievement behavior (how do I grasp?), which may create conflicting gradients during training. When the camera moves to reduce uncertainty, the manipulation policy must adapt to a changing visual field, a non-stationarity that monolithic models struggle to handle.

To resolve this, SaPaVe adopts a staged, bottom-up learning strategy. First, the system learns semantic camera control using ActiveViewPose-200K, a curated dataset of 200,000 image-language-camera movement pairs enriched with detailed semantic annotations. This stage isolates the perceptual policy, training it to understand high-level semantic goals (e.g., "look inside the cabinet to find the bowl") without the complexity of simultaneous motor control. Only in the second stage does the model integrate manipulation learning, using hybrid datasets that combine the pretrained camera adapter with action trajectories. This curriculum respects the distinct computational requirements of perception and action, allowing the model to build reliable, task-directed perception before attempting to ground it in motor control.
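The two-stage curriculum can be summarized schematically. This is a stand-in for the training procedure, not the authors' implementation; the classes and the `frozen` flag are assumptions made for illustration:

```python
class CameraAdapter:
    """Stand-in for the perceptual policy (the real module is a VLA adapter
    pretrained on ActiveViewPose-200K-style image-language-camera pairs)."""
    def __init__(self):
        self.steps, self.frozen = 0, False

    def update(self, batch):
        if not self.frozen:          # stage-1 learning only
            self.steps += 1

    def predict(self, batch):
        return f"view-for-{batch}"   # placeholder for a camera movement

class ManipHead:
    """Stand-in for the embodiment-specific manipulation policy."""
    def __init__(self):
        self.steps = 0

    def update(self, batch, view):
        self.steps += 1

def bottom_up_train(cam, manip, perception_data, hybrid_data):
    # Stage 1: semantic camera control in isolation, no motor learning.
    for batch in perception_data:
        cam.update(batch)
    cam.frozen = True  # lock perception before manipulation learning begins
    # Stage 2: manipulation grounded in the pretrained perception.
    for batch in hybrid_data:
        manip.update(batch, cam.predict(batch))

cam, manip = CameraAdapter(), ManipHead()
bottom_up_train(cam, manip, ["b1", "b2", "b3"], ["h1", "h2"])
```

The ordering is the substance of the curriculum: the manipulation head never sees gradients from an unreliable, still-learning perception module.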

Technical Components and Geometric Robustness

Beyond the high-level architecture, SaPaVe introduces specific technical innovations to handle the challenges of active-view execution. When a camera moves dynamically during a task, the visual input changes non-linearly, creating a distribution shift that breaks standard VLA policies trained on static views. To address this, the authors implement a 3D geometry-aware module that injects spatial knowledge into the policy representation.

This module serves a critical function in maintaining robustness under dynamic viewpoints. For instance, when locating a range hood handle, the system need only shift the camera briefly upward to confirm the handle's presence; it does not need to center the object perfectly before initiating the grasping motion. The geometry-aware representations enable the policy to infer spatial relationships and execute actions from partial or off-center observations. This capability distinguishes SaPaVe from systems that treat camera movement merely as a preprocessing step for a fixed-view manipulation backbone, which would fail if the target drifts toward the image edge during camera motion.
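The underlying geometric fact can be shown with a toy pinhole setup (pan only, no tilt or roll; all functions here are illustrative, not the paper's module): when the camera pose is known, a target's world coordinates are recoverable from any viewpoint, so an off-center observation is just as actionable as a centered one.

```python
import math

def rotate_y(p, theta):
    """Rotate a 3D point about the y-axis (a pure camera pan)."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def world_to_camera(p_world, cam_pos, cam_pan):
    """Express a world point in a camera frame, given the camera's
    position and pan angle."""
    t = tuple(pw - cw for pw, cw in zip(p_world, cam_pos))
    return rotate_y(t, -cam_pan)

def camera_to_world(p_cam, cam_pos, cam_pan):
    """Invert world_to_camera: recover the world point from a camera-frame
    observation plus the known camera pose."""
    r = rotate_y(p_cam, cam_pan)
    return tuple(rc + cw for rc, cw in zip(r, cam_pos))

# The same handle seen from two viewpoints: its camera-frame coordinates
# shift as the camera pans, but the world point they imply does not.
handle = (0.5, 1.6, 2.0)
view_a = world_to_camera(handle, cam_pos=(0.0, 1.5, 0.0), cam_pan=0.0)
view_b = world_to_camera(handle, cam_pos=(0.0, 1.5, 0.0), cam_pan=0.3)
recovered = camera_to_world(view_b, (0.0, 1.5, 0.0), 0.3)
```

A policy with access to this kind of geometric grounding need not re-center the target before acting, which is the behavior the brief-upward-glance example describes.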

To evaluate these capabilities rigorously, the authors introduce ActiveManip-Bench, the first simulated benchmark specifically designed for active manipulation. Existing benchmarks typically assume fixed camera positions, an assumption that fails to capture the reality of cluttered household environments, where occlusions are inevitable. ActiveManip-Bench spans 12 tasks across 100 objects and 20 diverse scenes, providing rich annotations that enable evaluation of both perception and execution components. On this benchmark, SaPaVe achieves a 75.2% success rate, significantly outperforming fixed-view baselines and even surpassing Gemini 2.5 Pro on semantic active perception tasks by 16% despite using only 2B parameters.

Commentary: Beyond the Local Optimum of Shared Action Spaces

The success of SaPaVe suggests that the field may have mistaken a local optimum for a fundamental truth. The shared action space paradigm in VLA models emerged from legitimate empirical successes in fixed-view manipulation, where the distinction between looking and acting is less critical. However, the community appears to have overgeneralized these results, assuming that what works for static cameras must generalize to active perception.

There is a deeper theoretical issue at play. Active perception introduces a temporal dependency between sensing and acting that fundamentally differs from the conditional independence assumptions underlying many VLA architectures. When the camera moves, the policy must reason about information gain, not just state estimation. This requires learning exploration strategies that maximize the reduction of task-relevant uncertainty, a distinct objective from the state-conditioned motor control required for manipulation. Forcing both into the same optimization objective creates a representational tension. The gradients for camera control and manipulation may conflict, particularly in early training when the model has not yet disentangled these modalities.
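The "information gain, not just state estimation" objective can be made concrete with a toy belief-update sketch (entirely illustrative, not part of SaPaVe): the camera policy holds a discrete belief over where a target might be, and a one-step greedy planner picks the view that most reduces expected uncertainty.

```python
import math

def entropy(belief):
    """Shannon entropy (bits) of a discrete belief over locations."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def expected_posterior_entropy(belief, visible):
    """Expected entropy after moving to a view that covers `visible`:
    if the target is seen there, the search ends (zero entropy);
    otherwise the belief renormalizes over the unseen locations."""
    p_out = 1.0 - sum(belief[loc] for loc in visible)
    if p_out <= 0:
        return 0.0
    post = {loc: p / p_out for loc, p in belief.items() if loc not in visible}
    return p_out * entropy(post)

def best_view(belief, candidate_views):
    """Greedy one-step information gain: choose the view whose expected
    posterior uncertainty about the target's location is smallest."""
    return min(candidate_views,
               key=lambda v: expected_posterior_entropy(belief, candidate_views[v]))

belief = {"cabinet": 0.5, "counter": 0.3, "shelf": 0.2}
views = {"look_cabinet": {"cabinet"}, "look_shelf": {"shelf"}}
```

Note that nothing in this objective mentions motor control: it is purely about reducing task-relevant uncertainty, which is why folding it into the same loss as state-conditioned manipulation can produce the conflicting gradients described above.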

SaPaVe's bottom-up training can be viewed as a form of curriculum learning that respects these representational differences. By first teaching the model what to look for and how to look, and only then grounding that perception in embodiment-specific motor control, the curriculum sidesteps the conflicting gradients that arise when both objectives are optimized jointly from scratch.
