StaKe

Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision

IROS 2026

Peiyan Li1,2, Qisen Ma1,2, Yan Huang1,2,3, Liang Wang1,2
Corresponding Author

TL;DR:

We propose StaKe, a plug-in auxiliary supervision framework for VLA fine-tuning that automatically derives stage and keyframe signals from demonstration gripper states. It consistently improves long-horizon manipulation success without changing the base policy architecture or the inference loop.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event target should be. This causes failures to concentrate around challenging gripper-event transitions. To address this, we propose StaKe, a plug-in auxiliary supervision framework that automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage, and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are modeled as lightweight auxiliary heads that enrich the learned representations during training, while leaving the base VLA policy architecture and inference loop unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks show that StaKe consistently improves success rates (relative gains of 14% and 56%, respectively), with larger improvements on longer-horizon tasks that involve more gripper-event transitions. Ablation studies validate each design choice, and qualitative analysis confirms that the learned representations faithfully track manipulation stages. These results indicate that structured supervision is an effective and general strategy for enhancing VLA fine-tuning in long-horizon manipulation.

Method

When fine-tuning VLA models, a continuous action loss is applied uniformly over the whole trajectory. Manipulation failures, however, are not uniformly distributed: they concentrate around gripper-state transitions and take one of two forms, a motion-fail (failing to reach and grasp during free-space motion) or a skill-fail (grasping successfully but erring during post-grasp execution such as handover, stacking, or rotation).

Figure 1: Manipulation failures concentrate around gripper-state transitions. A motion-fail occurs when the robot fails to reach and grasp the target during free-space motion, while a skill-fail occurs when the robot grasps successfully but fails during post-grasp skill execution.

A manipulation trajectory is not homogeneous: it alternates between free-space motion and contact-constrained skill execution, with gripper open/close events marking the boundaries. StaKe injects this structure into VLA fine-tuning through two auxiliary signals that are derived automatically from demonstration gripper states, with no manual annotation and no change to inference:

  • Stage Supervision (SS): a lightweight head classifies each timestep into the motion or skill stage, encouraging the backbone to encode stage-aware representations.
  • Keyframe Supervision (KS): a second head predicts the joint action at the next gripper-transition keyframe, anchoring the policy to physically meaningful transition targets.
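
To make the label derivation concrete, here is a minimal sketch of how both signals could be read off a demonstration's gripper states. It is an illustration under stated assumptions, not the authors' implementation: the gripper state is taken to be a per-timestep boolean, an open gripper is mapped to the motion stage and a closed gripper to the skill stage, and timesteps with no remaining transition fall back to the final timestep. The function names and the label encoding are hypothetical.

```python
from typing import List, Sequence

# Hypothetical label encoding for the two stages.
MOTION, SKILL = 0, 1


def derive_stage_labels(gripper_closed: Sequence[bool]) -> List[int]:
    """Assumption: open gripper -> motion stage, closed gripper -> skill stage."""
    return [SKILL if closed else MOTION for closed in gripper_closed]


def next_keyframe_indices(gripper_closed: Sequence[bool]) -> List[int]:
    """For each timestep t, return the index of the next timestep whose
    gripper state differs from the state at t.

    Assumption: when no transition remains, fall back to the last timestep.
    """
    n = len(gripper_closed)
    indices = [n - 1] * n
    next_change = n - 1
    # Scan backwards so each timestep sees the nearest future transition.
    for t in range(n - 2, -1, -1):
        if gripper_closed[t] != gripper_closed[t + 1]:
            next_change = t + 1
        indices[t] = next_change
    return indices


def keyframe_targets(
    joint_actions: Sequence[Sequence[float]],
    gripper_closed: Sequence[bool],
) -> List[Sequence[float]]:
    """Keyframe target at t = joint action recorded at the next transition."""
    return [joint_actions[i] for i in next_keyframe_indices(gripper_closed)]


# Toy demonstration: reach (open), grasp and carry (closed), release (open).
gripper = [False, False, True, True, True, False]
joint_actions = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5]]

stage_labels = derive_stage_labels(gripper)
keyframe_idx = next_keyframe_indices(gripper)
targets = keyframe_targets(joint_actions, gripper)
```

In this toy trajectory the gripper closes at timestep 2 and reopens at timestep 5, so the first two timesteps are supervised toward the action at timestep 2 and the remaining ones toward the action at timestep 5.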

Both signals are implemented as auxiliary heads attached to dedicated learnable query tokens, jointly optimized with the flow-matching policy loss during training. At inference, the auxiliary heads are not invoked, so the base policy and its inference loop remain entirely unchanged.
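
One plausible way to write the joint training objective is sketched below. The overview figure caption specifies a binary cross-entropy loss for the stage head and an L1 loss for the keyframe head; the weighted-sum form and the weights \lambda_s and \lambda_k are assumptions made here for illustration, since this page does not state how the terms are balanced. The symbols are also illustrative: y_t is the stage label at timestep t, p_t the predicted probability of the skill stage, a^{\ast}_{k(t)} the joint action at the next gripper-transition keyframe, and \hat{a}_t the keyframe head's prediction.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{flow}}
  + \lambda_s \, \mathcal{L}_{\text{stage}}
  + \lambda_k \, \mathcal{L}_{\text{key}},
\qquad
\mathcal{L}_{\text{stage}}
  = -\bigl[\, y_t \log p_t + (1 - y_t) \log (1 - p_t) \,\bigr],
\qquad
\mathcal{L}_{\text{key}}
  = \bigl\lVert \hat{a}_t - a^{\ast}_{k(t)} \bigr\rVert_1
```

Only the flow-matching term is tied to the policy's action output; the two auxiliary terms shape the shared representation during training, which is consistent with the heads being skipped at inference.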

Figure 2: Overview of the StaKe framework. Two learnable query tokens are appended to the pre-trained VLM backbone, each feeding a lightweight auxiliary head: a stage classification head (binary cross-entropy loss) and a keyframe prediction head (L1 loss). Both heads are active only during training and jointly optimized with the flow-matching policy loss; at inference the query tokens remain but the heads are not invoked.

Citation

@article{xu2026stake,
    author  = {Xu, Yuan and Chen, Yixiang and Wang, Kai and Yang, Jiabing and Li, Peiyan and Ma, Qisen and Huang, Yan and Wang, Liang},
    title   = {Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision},
    journal = {arXiv preprint arXiv:2606.26801},
    year    = {2026}
}