
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Abstract

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.

One-sentence Summary

Sber Robotics Center’s Green-VLA introduces a staged Vision-Language-Action framework that unifies heterogeneous robot data via semantic action alignment, DATAQA filtering, and RL refinement, enabling zero-shot generalization to humanoids and manipulators while boosting long-horizon success through episode-progress prediction and joint-guidance modules.

Key Contributions

  • Green-VLA introduces a quality-driven data pipeline with DATAQA filters and temporal alignment, processing 3,000 hours of heterogeneous robot demonstrations to ensure stable, high-fidelity training across diverse embodiments and control types.
  • It proposes a five-stage training curriculum (L0–R2) that progressively bridges web-scale vision-language models to embodiment-specific RL alignment, enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms without architectural changes.
  • Evaluated on the high-DoF Green humanoid and other platforms, Green-VLA achieves state-of-the-art performance in long-horizon tasks, with RL alignment (R2) yielding significant gains in success rate, robustness, and precise object targeting via joint-prediction guidance.

Introduction

The authors leverage a staged Vision-Language-Action (VLA) framework called Green-VLA to address the gap between scalable foundation models and real-world robotic deployment. While prior VLAs benefit from large datasets and unified architectures, they suffer from heterogeneous, noisy data and reliance on behavior cloning—which fails to optimize for long-horizon success or generalize across robot embodiments. Green-VLA overcomes these by introducing a five-stage training curriculum that progressively builds semantic grounding, unifies action spaces across platforms, and refines policies via reinforcement learning. It also integrates a DATAQA pipeline for data quality filtering and a joint-prediction module for precise targeting, enabling zero-shot generalization to new robots—including the high-DoF Green humanoid—while improving robustness, success rate, and task efficiency.

Dataset

The authors use a two-stage training pipeline for Green-VLA, combining web-scale multimodal data (L1) and large-scale robotics data (R0), totaling over 200M samples and 3,000+ hours of robot trajectories. Here’s how the dataset is composed, processed, and used:

  • Dataset Composition & Sources:

    • L1 (24M samples): Web multimodal data from 10+ sources including RefSpatial, AgibotWorld, RoboPoint, ShareRobot, Robo2VLM, PixMo-Points, MS COCO, A-OKVQA, OpenSpaces, and Sun RGB-D. Tasks cover VQA, pointing, captioning, spatial reasoning, and pixel-wise trajectory prediction.
    • R0 (184M samples): Robotics data from 12+ open-source and 2 internally collected datasets, including AgiBotWorld, Droid, Galaxea, Action_net, Fractal, Robomind, RDT, Bridge, BiPlay, Green Humanoid, and ALOHA any_pick. Covers humanoid and manipulator platforms with diverse skills and environments.
  • Key Subset Details:

    • Web Data (L1): Samples are weighted during training for task balance. RefSpatial and PixMo-Points map spatial points to PaliGemma tokens. MS COCO is limited to object detection; Sun RGB-D uses only 2D annotations.
    • Robotics Data (R0): AgiBotWorld_twofinger (774h) and Droid (512h) are largest. Green Humanoid (48h real) is synthetically expanded to 167h via mirroring (flipping cameras/joints, swapping left/right) and time-reversal (only for reversible tasks like pick/place). ALOHA any_pick adds 11.2h focused on pick/place tasks.
  • Data Processing & Filtering:

    • All robot data is temporally normalized to align physical progress across sources.
    • DATAQA pipeline filters episodes using four metrics: jitter (motion smoothness), diversity (visual/state-space), sharpness (Laplacian-based image clarity), and state variance.
    • Additional filters: remove episodes with missing cameras/frames, extreme durations, low motion activity, or invalid gripper patterns (e.g., “open-closed-open” for pick-and-place).
    • Visual diversity is measured via DINOv3 feature variance over time; state diversity via the Frobenius norm of the state covariance (a minimal sketch of these quality metrics appears after this list).
  • Training Usage:

    • L1 trains general visual-language priors; R0 trains embodiment-agnostic policies.
    • Mixture ratios are weighted by dataset quality metrics (Table 1). AgiBot twofinger and Galaxea receive highest weights due to size and visual diversity. Droid leads among single-arm datasets for smoothness and task diversity.
    • All trajectories are stored with RGB, proprioception, and language instructions. Augmented mirrored/reversed episodes are added to the R0 mixture.
  • Cropping & Metadata:

    • No explicit cropping is mentioned; processing focuses on temporal alignment, state normalization, and spatial token mapping.
    • Metadata includes task text, camera views, joint/Cartesian states, and filtered quality scores per episode.
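
As referenced in the filtering notes above, the DATAQA metrics reduce to a few standard signal- and image-processing quantities. Below is a minimal sketch of how they could be computed; the function names, normalizations, and the use of OpenCV plus precomputed DINOv3 features are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import cv2  # OpenCV, used here only for the Laplacian sharpness measure

def jitter_score(joint_traj: np.ndarray) -> float:
    """Mean squared second difference of a (T, D) joint trajectory; lower means smoother motion."""
    accel = np.diff(joint_traj, n=2, axis=0)
    return float(np.mean(accel ** 2))

def sharpness_score(image_bgr: np.ndarray) -> float:
    """Variance of the Laplacian of a grayscale frame; higher means sharper."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def visual_diversity(dino_features: np.ndarray) -> float:
    """Mean per-dimension variance of precomputed DINOv3 frame features (T, F) over time."""
    return float(dino_features.var(axis=0).mean())

def state_diversity(states: np.ndarray) -> float:
    """Frobenius norm of the covariance matrix of robot states (T, D)."""
    return float(np.linalg.norm(np.cov(states, rowvar=False), ord="fro"))
```

Episodes whose scores fall outside dataset-specific thresholds would then be dropped or down-weighted in the mixture, alongside the structural filters (missing frames, invalid gripper patterns) listed above.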

Method

The authors leverage a staged training pipeline and a unified transformer-based architecture to build Green-VLA, a vision-language-action model capable of multi-embodiment control and real-world deployment. The overall framework is structured around five sequential training stages—L0 through R2—each designed to address distinct bottlenecks in generalization, embodiment adaptation, and long-horizon robustness. As shown in the figure below, the pipeline begins with a base vision-language model (L0), progresses through web-scale multimodal pretraining (L1) and general robotics pretraining (R0), then undergoes embodiment-specific fine-tuning (R1), and finally integrates reinforcement learning alignment (R2) to close the last-mile performance gap.

At the architectural level, Green-VLA employs a single transformer backbone that fuses multimodal inputs—RGB observations, proprioceptive state, and natural language instructions—into a shared token sequence. This sequence is augmented with an embodiment-specific control prompt $c_e$, which encodes the robot’s physical configuration (e.g., number of arms, control type, mobility). The model then predicts actions in a fixed 64-dimensional unified action space $\mathcal{A}_u$, where each index corresponds to a semantically consistent physical degree of freedom across all supported embodiments. This design replaces naive padding of heterogeneous action spaces with a masked behavior cloning objective:

$$\mathcal{L}_{\mathrm{uni}}(\theta) = \mathbb{E} \Big[ \big\| m_{e} \odot \big( \pi_{\theta}(x_{t}^{e}, c_{e}) - \Phi_{e}(a_{t}^{e}) \big) \big\|_{2}^{2} \Big],$$

where $m_e$ is a binary mask indicating active slots for embodiment $e$, and $\Phi_e$ maps native actions into the unified layout. The model’s output is then retargeted to the target robot’s native control space via $\Phi_e^{-1}$, enabling cross-embodiment transfer without semantic drift.
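
The masked objective above is simple to implement once native actions have been mapped into the unified 64-slot layout. A minimal sketch, assuming zero-padded targets and a per-sample normalization by the number of active slots (the shapes and that normalization choice are assumptions, not details from the paper):

```python
import torch

def unified_bc_loss(pred_actions: torch.Tensor,    # (B, 64) policy output in the unified layout
                    target_actions: torch.Tensor,  # (B, 64) native actions mapped via Phi_e, zero-padded
                    embodiment_mask: torch.Tensor  # (B, 64) 1.0 where the slot exists for this embodiment
                    ) -> torch.Tensor:
    """Masked L2 behavior-cloning loss: inactive slots contribute no error and no gradient."""
    sq_err = embodiment_mask * (pred_actions - target_actions) ** 2
    # Normalizing by the number of active slots keeps low-DoF embodiments from being
    # down-weighted relative to high-DoF ones (an assumption, not stated in the paper).
    return (sq_err.sum(dim=-1) / embodiment_mask.sum(dim=-1).clamp(min=1.0)).mean()
```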

To support real-time inference, the architecture incorporates efficiency optimizations: SDPA attention kernels, lightweight prediction heads, and reduced denoising steps in the flow-matching action expert. The model also predicts an episode progress scalar $\hat{\rho}_t \in [0,1]$, trained to reflect normalized time-to-completion, which enables downstream planners to dynamically adjust subgoals during long-horizon execution.

A key innovation is the integration of a high-level task planner based on GigaVision, which operates independently and remains frozen during training. This planner parses user instructions into atomic subtasks (e.g., “pick [item] with [left] hand”) and conditions Green-VLA via structured prompts. During execution, Green-VLA signals task completion via an episode-end probability; if this exceeds a threshold, the planner queries a feedback module to validate success or trigger replanning. Refer to the framework diagram for the full inference loop, which includes guidance and safety modules.

To handle novel objects specified in language, Green-VLA incorporates a Joint Prediction Module (JPM) that infers a 3D target point $p^{\star}$ in the robot’s workspace from the instruction and visual observation. This point is derived by first predicting a 2D affordance location via a vision-language model, then lifting it to 3D using camera intrinsics and pose:

$$\begin{bmatrix} p^{\star} \\ 1 \end{bmatrix} = T_{c}^{w} \begin{bmatrix} d(u,v)\, K^{-1} [u, v, 1]^{\top} \\ 1 \end{bmatrix}.$$

The resulting point is used to compute a feasible joint configuration $q^{\star}$ via inverse kinematics, which then guides the flow-matching policy via pseudoinverse guidance (PiGDM), biasing the velocity field toward trajectories that reach the target.
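
The back-projection in the equation above is a standard pinhole-camera lift. A minimal sketch, assuming a depth value d(u, v), intrinsics K, and a 4x4 camera-to-world transform are available at inference time:

```python
import numpy as np

def lift_pixel_to_world(u: float, v: float, depth: float,
                        K: np.ndarray,               # (3, 3) camera intrinsics
                        T_cam_to_world: np.ndarray,  # (4, 4) camera pose in the world frame
                        ) -> np.ndarray:
    """Back-project pixel (u, v) with depth d(u, v) into a 3D world point p*."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # K^{-1} [u, v, 1]^T
    p_cam_h = np.append(depth * ray_cam, 1.0)           # homogeneous point in the camera frame
    p_world_h = T_cam_to_world @ p_cam_h                 # apply T_c^w
    return p_world_h[:3]
```

The returned point would then be passed to an inverse-kinematics solver to obtain the joint configuration $q^{\star}$ used for guidance.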

For safety and stability, Green-VLA includes an online OOD detector modeled as a Gaussian Mixture Model fitted on training robot states. If the predicted trajectory leads to a low-density state $s$ (i.e., $p_{\text{train}}(s) < \tau_{\text{ood}}$), the model corrects the action by nudging the state toward higher density using the gradient of the GMM:

$$s \gets s + \alpha \nabla p_{\text{train}}(s), \quad \alpha = 0.2.$$

This correction mechanism, illustrated in the inference diagram, ensures the policy remains within the training distribution during long-horizon tasks.
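
A minimal sketch of this correction, using scikit-learn's GaussianMixture as the density model and a finite-difference estimate of the density gradient for brevity (a closed-form GMM gradient would also work). The number of mixture components and the density threshold are assumptions; only alpha = 0.2 comes from the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_state_gmm(train_states: np.ndarray, n_components: int = 16) -> GaussianMixture:
    """Fit a GMM density model on robot states seen during training (n_components is an assumption)."""
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(train_states)

def ood_correct(state: np.ndarray, gmm: GaussianMixture,
                tau_ood: float = 1e-4, alpha: float = 0.2, eps: float = 1e-3) -> np.ndarray:
    """If p_train(s) < tau_ood, nudge s along an estimate of grad p_train(s); otherwise return it unchanged."""
    density = float(np.exp(gmm.score_samples(state[None]))[0])
    if density >= tau_ood:                 # tau_ood is a placeholder threshold, not from the paper
        return state
    grad = np.zeros_like(state)
    for i in range(state.size):            # finite-difference gradient, one dimension at a time
        bumped = state.copy()
        bumped[i] += eps
        grad[i] = (float(np.exp(gmm.score_samples(bumped[None]))[0]) - density) / eps
    return state + alpha * grad
```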

Training stability across diverse embodiments is achieved through a scheduled dataset sampling curriculum. Instead of sampling datasets according to their target weights $w_i$ from the start, Green-VLA uses a power-law schedule:

$$W_{i}^{(t)} = \frac{w_{i}^{\alpha_{t}}}{\sum_{j} w_{j}^{\alpha_{t}}}, \quad \alpha_{t} \in [0,1], \; \alpha_{0} = 0, \; \alpha_{T} = 1,$$

which begins with uniform sampling and gradually converges to the target distribution. This prevents early dominance by large datasets and allows the model to first learn shared structure before specializing.
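
A minimal sketch of the schedule, assuming that alpha_t ramps linearly with training progress (the text only states that it goes from 0 at the start to 1 at the end):

```python
import numpy as np

def scheduled_weights(target_weights: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Power-law annealed sampling weights: uniform at step 0, the target distribution at the end."""
    alpha_t = min(step / total_steps, 1.0)   # linear ramp of alpha_t in [0, 1] (an assumption)
    powered = target_weights ** alpha_t      # w_i^{alpha_t}; alpha_t = 0 yields equal weights
    return powered / powered.sum()
```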

To standardize heterogeneous demonstrations, Green-VLA performs temporal alignment using optical flow magnitude as a proxy for motion speed. Trajectories are resampled via monotonic cubic splines to normalize motion per step, ensuring consistent dynamics across datasets. Additionally, a speed-conditioned modulation is introduced during training: for each sample, a scalar $v \sim p(v)$ is drawn, and the target trajectory is warped to operate at different temporal resolutions. The model’s hidden states are modulated via:

$$\tilde{h}_{t} = \mathrm{RMSNorm}(h_{t}), \qquad \hat{h}_{t} = \gamma(v)\, \tilde{h}_{t} + \beta(v),$$

where $\gamma(v)$ and $\beta(v)$ are learned functions. This enables Green-VLA to operate on multiple temporal “zoom levels,” supporting both fine-grained manipulation and coarse long-horizon motion.
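
One way to realize this modulation is a FiLM-style layer whose scale and shift are produced by a small MLP over the sampled speed v. The sketch below assumes PyTorch 2.4+ (for nn.RMSNorm) and an arbitrary conditioning width; both the MLP shape and the width are assumptions.

```python
import torch
import torch.nn as nn

class SpeedModulation(nn.Module):
    """RMSNorm followed by a learned, speed-conditioned scale gamma(v) and shift beta(v)."""
    def __init__(self, hidden_dim: int, cond_dim: int = 32):
        super().__init__()
        self.norm = nn.RMSNorm(hidden_dim)
        self.embed = nn.Sequential(nn.Linear(1, cond_dim), nn.SiLU(),
                                   nn.Linear(cond_dim, 2 * hidden_dim))  # outputs [gamma, beta]

    def forward(self, h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # h: (B, T, hidden_dim) hidden states, v: (B, 1) sampled playback speed
        gamma, beta = self.embed(v).unsqueeze(1).chunk(2, dim=-1)
        return gamma * self.norm(h) + beta
```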

The final R2 stage employs two RL fine-tuning strategies that preserve the base model’s weights. The first, trajectory optimization with native fine-tuning, trains a separate critic using Implicit Q-Learning and uses its gradient $\nabla_a Q(s,a)$ to iteratively refine actions:

$$a \gets a + \eta\, \frac{\nabla_{a} Q(s, a)}{\| \nabla_{a} Q(s, a) \|},$$

where $\eta$ is a hyperparameter. Optimized trajectories are validated in the environment before being added to the training set. The second approach, optimization of the source distribution, trains a small actor network to sample noise vectors that, when fed into the flow-matching model, yield higher returns. This method enables online RL without directly modifying the base policy’s parameters.
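
A minimal sketch of the critic-gradient refinement step for the first strategy; the step size, the number of refinement iterations, and the critic interface are placeholders rather than the authors' settings.

```python
import torch

def refine_actions(actions: torch.Tensor, states: torch.Tensor, critic,
                   eta: float = 0.05, n_steps: int = 5) -> torch.Tensor:
    """Iteratively push sampled actions along the normalized critic gradient dQ/da."""
    a = actions.detach()
    for _ in range(n_steps):
        a = a.clone().requires_grad_(True)
        q = critic(states, a).sum()              # summing gives per-sample gradients in one backward pass
        (grad,) = torch.autograd.grad(q, a)
        a = (a + eta * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)).detach()
    return a
```

The refined actions would then be rolled out and validated in the environment before being added back to the training set, as described above.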

Experiment

  • Green-VLA demonstrates strong task-following and cross-embodiment generalization after R0 pretraining, outperforming prior models despite using substantially less training data, validating the value of unified action representation and quality-aligned datasets.
  • In R1 fine-tuning, Green-VLA adapts effectively to new environments like CALVIN and e-commerce shelves without prior exposure, showing robust compositional generalization and improved performance under fine-grained instruction following, especially with JPM guidance for SKU-level accuracy.
  • R2 RL alignment significantly boosts long-horizon task consistency and physical reliability, increasing success rates and average chain length across benchmarks, particularly in challenging manipulation scenarios involving deformable or slippery objects.
  • Humanoid evaluations confirm the system’s ability to execute complex, multi-step instructions with precise hand selection, object disambiguation, and full upper-body coordination—even under out-of-distribution scene layouts—while maintaining efficiency in table cleaning and sorting workflows.
  • Across all phases and embodiments, Green-VLA consistently improves with staged training, showing that combining unified pretraining, embodiment-specific tuning, and RL refinement yields robust, scalable, and instruction-following capable robotic policies.

The authors evaluate Green-VLA across training phases R0, R1, and R2, showing progressive gains in both pick accuracy and task success across diverse objects. Results show that R2 RL alignment delivers the largest performance boost, particularly in success rate, outperforming prior models including fine-tuned baselines. Green-VLA consistently improves with each stage, demonstrating that staged training enhances both precision and robustness in manipulation tasks.

The authors evaluate Green-VLA against several baselines on object picking and table-cleaning tasks, finding that it achieves higher success rates and faster execution times across multiple object categories. Results show Green-VLA outperforms prior models even without embodiment-specific fine-tuning, indicating strong generalization from multi-embodiment pretraining. The model’s efficiency and accuracy improvements are consistent across both single-item and full-task scenarios.

The authors evaluate Green-VLA across training phases R0, R1, and R2, measuring performance on visual matching and variant aggregation tasks. Results show that Green-VLA consistently improves across phases, with R2 achieving the highest overall score, indicating that reinforcement learning alignment enhances both task precision and robustness. The model outperforms baselines like π₀ and OpenVLA, particularly in fine-grained object selection and handling variant conditions.

The authors use a unified multi-embodiment dataset to train Green-VLA, achieving strong task-following and efficiency despite using substantially less data than prior systems. Results show that their model generalizes well across diverse robotic platforms and task types, with performance gains further amplified through embodiment-specific fine-tuning and RL alignment. The architecture’s separation of perception and control enables robust execution even under out-of-distribution conditions and complex multi-step workflows.

