TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
Abstract
We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong videos holistically, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in identity, background, or temporal coherence. TokenDial builds on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction: adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness across diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
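The core mechanism described in the abstract — adding a learned attribute direction to intermediate patch tokens and scaling it with a slider value — can be sketched as follows. This is a minimal illustration under stated assumptions: the tensor shapes, the function name `apply_token_offset`, and the unit-normalization of the direction are all hypothetical choices for exposition, not the paper's actual parameterization.

```python
import numpy as np

def apply_token_offset(tokens, offset_dir, strength):
    """Additively shift spatiotemporal patch tokens along an attribute direction.

    tokens:     (T, N, D) array of intermediate patch tokens
                (T frames, N patches per frame, D channels) -- assumed layout.
    offset_dir: (D,) learned attribute direction; normalized here so that
                `strength` alone sets the edit magnitude (an assumption).
    strength:   scalar slider value; larger |strength| -> stronger edit.
    """
    offset_dir = offset_dir / np.linalg.norm(offset_dir)
    # Broadcast the (D,) offset over all frames and patches.
    return tokens + strength * offset_dir

# Sweeping the slider yields a family of edits from one direction.
tokens = np.random.randn(8, 16, 64).astype(np.float32)
direction = np.random.randn(64).astype(np.float32)
edits = [apply_token_offset(tokens, direction, s) for s in (-1.0, 0.0, 1.0)]
```

Because the offset is purely additive and shared across frames and patches, strength 0 leaves the tokens untouched and intermediate values interpolate the edit, which is what makes continuous, slider-style control possible.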
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing (2026)
- The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering (2026)
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls (2026)
- PISCO: Precise Video Instance Insertion with Sparse Control (2026)
- Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades (2026)
- RegionRoute: Regional Style Transfer with Diffusion Model (2026)
- VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents (2026)
Get this paper in your agent:
hf papers read 2603.27520
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

