# pi05-build-block-tower-rlt-6mix-retain-alpha05

RL Token (RLT) encoder-decoder trained on the 6-dataset build-block-tower mixture, on top of the `retain/step_49999/alpha_0.5` checkpoint from `pi05-build-block-tower-6mix`.
## What is this?
This model is a lightweight transformer encoder-decoder that takes inputs from a frozen Pi-05 VLA backbone. The encoder compresses the VLA's final-layer prefix embeddings into a single RL token via a learned query. The decoder autoregressively reconstructs the original embeddings from only this token, forcing it to act as an information bottleneck. See Xu et al. (2026), *Precise Manipulation with Efficient Online RL*, for the method.
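As a rough illustration (not the openpi implementation, which is in JAX), here is a minimal PyTorch sketch of the bottleneck. The attention-pooling encoder, teacher-forced decoding, and MSE reconstruction loss are assumptions beyond the description above; the dimensions come from the Training section below.

```python
import torch
import torch.nn as nn

class RLTokenBottleneck(nn.Module):
    """Compress VLA prefix embeddings into one RL token, then reconstruct them."""

    def __init__(self, embed_dim=2048, num_heads=8, mlp_dim=8192, num_layers=2):
        super().__init__()
        # Learned query that pools the frozen prefix into a single token.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.pool = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        layer = nn.TransformerDecoderLayer(
            embed_dim, num_heads, dim_feedforward=mlp_dim, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, prefix_emb):  # prefix_emb: (B, T, D) from the frozen VLA
        B, T, _ = prefix_emb.shape
        # Encoder: cross-attend the learned query over the prefix -> (B, 1, D).
        rl_token, _ = self.pool(self.query.expand(B, -1, -1), prefix_emb, prefix_emb)
        # Decoder: teacher-forced autoregressive reconstruction conditioned
        # only on the RL token (shifted inputs + causal mask).
        shifted = torch.cat([rl_token, prefix_emb[:, :-1]], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        recon = self.decoder(shifted, memory=rl_token, tgt_mask=causal)
        loss = ((recon - prefix_emb) ** 2).sum(-1).mean()  # MSE reconstruction
        return rl_token, loss
```

Because the decoder sees nothing but the single token, driving the reconstruction loss down forces task-relevant prefix information through the bottleneck, which is what makes the RL token useful as a compact state for downstream RL.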
This variant uses the `retain/alpha_0.5` backbone: a checkpoint produced by applying representation-level retention (alpha=0.5 interpolation toward the pre-fine-tuning weights) to the 6mix baseline at step 49999.
## Training
- Config: `pi05_rlt_build_block_tower_6mix`
- VLA backbone: `pravsels/pi05-build-block-tower-6mix` at `retain/step_49999/alpha_0.5` (frozen, `rl_vla_loss_weight=0.0`)
- Encoder-decoder: 2-layer transformer, 8 heads, 8192 MLP dim, 2048 embedding dim
- Dataset: 6 LeRobot v2.1 datasets (build_block_tower + DAgger 1.0.0–1.4.0)
- Batch size: 36
- LR: 5e-5, cosine decay with 1k warmup (see the sketch after this list)
- Steps: 50,000 (initial 20k + resumed 30k)
- Runtime: ~14h total on 4x GH200 (Isambard)
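A minimal sketch of the stated schedule (5e-5 peak, 1k linear warmup, cosine decay over the remaining steps) using PyTorch schedulers; the real run uses openpi's training stack, and the warmup shape and final LR are assumptions.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in; the real model is the RLT encoder-decoder
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # peak LR from the config

sched = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=1e-3, total_iters=1_000),  # 1k linear warmup
        CosineAnnealingLR(opt, T_max=49_000),  # cosine decay toward 0 by step 50k
    ],
    milestones=[1_000],  # hand off from warmup to cosine at step 1,000
)

for step in range(50_000):
    ...  # forward/backward and opt.step() would go here
    sched.step()
```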
## Loss progression
| Step | Train Loss | Val Loss | Gap |
|---|---|---|---|
| 0 | – | 11354.7 | – |
| 1,000 | – | 2507.0 | – |
| 5,000 | 640.8 | 701.6 | 60.8 |
| 10,000 | 472.8 | 536.0 | 63.2 |
| 15,000 | 399.9 | 487.8 | 87.9 |
| 19,999 | 356.3 | 464.8 | 108.5 |
| 25,000 | 326.6 | 446.8 | 120.2 |
| 30,000 | 304.9 | 439.3 | 134.4 |
| 35,000 | 288.4 | 432.7 | 144.3 |
| 40,000 | 275.6 | 423.3 | 147.7 |
| 45,000 | 265.4 | 414.3 | 148.9 |
| 49,000 | 259.8 | 425.4 | 165.6 |
| 49,900 | 256.4 | – | – |
Val loss decreased steadily, reaching a minimum of 414.3 at step 45,000 before beginning to rise, while the train/val gap widened throughout, consistent with mild overfitting late in training. The step-45,000 checkpoint is recommended for deployment since it has the lowest validation loss.
## Checkpoints
| Step | Val Loss | Recommended | Params SHA256 |
|---|---|---|---|
| 19999 | 464.8 | – | `493ae11e5c95be5340e9106e54cac3f2219f6d1407a9081fc1c35595f5143cdb` |
| 45000 | 414.3 | ✓ | `1af50d87b765942801fd6be6afb5df3bcc69065636b614702f0b1f34fd3daec1` |
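To pull just the recommended checkpoint locally, a hedged sketch using `huggingface_hub`; the repo id is an assumption based on the model name above.

```python
from huggingface_hub import snapshot_download

# Repo id assumed from the model name; adjust patterns to fetch other steps.
local_dir = snapshot_download(
    repo_id="pravsels/pi05-build-block-tower-rlt-6mix-retain-alpha05",
    allow_patterns=["checkpoints/45000/params/*", "assets/*", "*.md"],
)
print(local_dir)
```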
### Verifying checkpoint hashes

```sh
cd checkpoints/<step> && find params -type f | sort | xargs sha256sum | sha256sum
```

This hashes every file under `params/` in sorted order, then hashes the list of per-file digests, so the result is independent of filesystem traversal order.
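The same check in Python, for convenience; this sketch assumes `sha256sum`'s two-space line format and a plain lexicographic path sort, which is what makes the digests line up.

```python
import hashlib
from pathlib import Path

def checkpoint_digest(step_dir: str) -> str:
    """Mirror `find params -type f | sort | xargs sha256sum | sha256sum`."""
    root = Path(step_dir)
    lines = []
    for f in sorted(p for p in (root / "params").rglob("*") if p.is_file()):
        digest = hashlib.sha256(f.read_bytes()).hexdigest()  # whole-file read
        # sha256sum prints "<hash>  <path>" (two spaces), path relative to step_dir.
        lines.append(f"{digest}  {f.relative_to(root)}\n")
    return hashlib.sha256("".join(lines).encode()).hexdigest()

print(checkpoint_digest("checkpoints/45000"))  # compare with the table above
```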
## Repo layout

```
assets/                     # Norm stats, valid indices, episode split
checkpoints/19999/params/   # Step 19999 model weights
checkpoints/45000/params/   # Step 45000 model weights (recommended)
TRAINING_LOG.md             # Training log
```
## W&B
Training curves: https://wandb.ai/pravsels/pi05-build-block-tower-rlt-6mix-retain-alpha05/runs/g5myo76p
## Usage

```python
import numpy as np

import openpi.models.model as _model
import openpi.training.config as _config

config = _config.get_config("pi05_rlt_build_block_tower_6mix")
params = _model.restore_params("checkpoints/45000/params", restore_type=np.ndarray)
model = config.model.load(params)
```