SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Model Details
Model Description
SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It pioneers a paradigm shift from continuous to sparse video generation for longer prediction horizons. By guiding trajectory inference with a generated sparse future spanning a 20-second horizon, it achieves sub-second inference (a 27× speed-up). It also marks the first realization of beyond-the-view navigation in challenging night scenes.
- Developed by: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
- Shared by: The University of Hong Kong & OpenDriveLab
- Model type: Video Generation-based Model for Vision-Language Navigation
- Language(s) (NLP): English (Instruction prompts)
- License: CC BY-NC-SA 4.0
- Finetuned from model: Built on the UMT5-XXL text encoder and the Wan2.1 VAE.
Model Sources
- Repository: https://github.com/OpenDriveLab/SparseVideoNav
- Paper: arXiv:2602.05827
- Project Page: https://opendrivelab.com/SparseVideoNav
Uses
Direct Use
The model is designed for generating sparse future video frames based on a current visual observation (video) and a natural language instruction (e.g., "turn right"). It is primarily intended for research in Embodied AI, specifically Vision-Language Navigation (VLN) in real-world environments.
Out-of-Scope Use
The model is a research prototype and is not intended for deployment in safety-critical real-world autonomous driving or robotic navigation systems without further extensive testing, safety validation, and fallback mechanisms.
How to Get Started with the Model
Use the code below to get started with the model using our custom pipeline.
Ensure you have cloned the GitHub repository and installed the requirements.
from omegaconf import OmegaConf
from inference import SVNPipeline
# Load configuration
cfg = OmegaConf.load("config/inference.yaml")
cfg.ckpt_path = "/path/to/models/SparseVideoNav-Models" # Path to your downloaded weights
cfg.inference.device = "cuda:0"
# Initialize pipeline
pipeline = SVNPipeline.from_pretrained(cfg)
# Run inference (Returns np.ndarray (T, H, W, C) uint8)
video = pipeline(video="/path/to/input.mp4", text="turn right")
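The pipeline returns the generated sparse frames as a NumPy array, so they can be sanity-checked or post-processed with standard tools. A minimal sketch of inspecting the documented `(T, H, W, C)` uint8 output; here the array is mocked with random data (the `frame_stats` helper is illustrative, not part of the repository):

```python
import numpy as np

# Mock output standing in for pipeline(...) — shape (T, H, W, C), dtype uint8.
video = np.random.randint(0, 256, size=(8, 480, 640, 3), dtype=np.uint8)

def frame_stats(frames: np.ndarray) -> dict:
    """Basic sanity checks on a generated sparse-frame sequence."""
    t, h, w, c = frames.shape
    return {
        "num_frames": t,
        "resolution": (w, h),   # (width, height)
        "channels": c,
        "mean_intensity": float(frames.mean()),
    }

stats = frame_stats(video)
print(stats["num_frames"], stats["resolution"])
```

Individual frames (`video[i]`) are plain HWC uint8 images and can be saved or displayed with any standard image library.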
BibTeX
@article{zhang2026sparse,
  title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
  author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
  journal={arXiv preprint arXiv:2602.05827},
  year={2026}
}