WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
Abstract
Waypoint Diffusion Transformers address trajectory conflicts in pixel-space flow matching by using semantic waypoints from pre-trained vision models to disentangle generation paths and accelerate training convergence.
While recent flow matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the pixel manifold's lack of semantic continuity entangles optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing the issue via information-lossy latent representations, we untangle the pixel-space trajectories directly by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution toward the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT outperforms strong pixel-space baselines and accelerates JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
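The two-segment sampling loop described above can be sketched in miniature. This is a toy illustration under stated assumptions, not the paper's implementation: `waypoint_generator` and `velocity_field` are hypothetical stand-ins for the learned lightweight generator and the waypoint-conditioned diffusion transformer, and plain concatenation stands in for the Just-Pixel AdaLN conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not the paper's sizes).
D_PIX, D_WAY = 8, 4

# Hypothetical stand-ins for the learned networks.
W_gen = rng.standard_normal((D_WAY, D_PIX)) * 0.1          # lightweight waypoint generator
W_vel = rng.standard_normal((D_PIX, D_PIX + D_WAY)) * 0.1  # waypoint-conditioned velocity net

def waypoint_generator(x_t):
    """Infer a semantic waypoint from the current noisy state."""
    return np.tanh(W_gen @ x_t)

def velocity_field(x_t, w, t):
    """Velocity conditioned on the inferred waypoint (concatenation here
    stands in for the paper's Just-Pixel AdaLN modulation)."""
    return W_vel @ np.concatenate([x_t, w]) * (1.0 - t)

def sample(steps=10):
    """Euler integration of the waypoint-conditioned ODE from noise to pixels."""
    x = rng.standard_normal(D_PIX)               # draw from the prior
    for i in range(steps):
        t = i / steps
        w = waypoint_generator(x)                # prior-to-waypoint: infer the anchor
        x = x + velocity_field(x, w, t) / steps  # waypoint-to-pixel: follow the field
    return x

print(sample().shape)  # (8,)
```

Because the waypoint is re-inferred at every denoising step, the conditioning tracks the evolving state rather than being fixed once at the start.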
Community
Recent flow matching models avoid VAE reconstruction bottlenecks by operating directly in pixel space, but the pixel manifold lacks semantic continuity. Optimal transport paths for different semantic endpoints overlap and intersect, causing severe trajectory conflict and slow convergence.
WiT introduces discriminative semantic waypoints projected from pretrained vision models, then factors the transport into two easier mappings: prior-to-waypoint and waypoint-to-pixel. A lightweight waypoint generator predicts these semantic anchors from the current noisy state, and the primary diffusion transformer consumes them via Just-Pixel AdaLN for dense spatial modulation.
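To make "dense spatial modulation" concrete, here is a minimal adaptive-LayerNorm sketch in which the scale and shift are regressed per spatial token from waypoint features. All names and shapes here are assumptions for illustration; the paper's actual Just-Pixel AdaLN design may differ.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(tokens, waypoint_feats, W_scale, W_shift):
    """Adaptive LayerNorm: scale/shift are predicted per token from the
    waypoint features, rather than from one global conditioning vector,
    giving the dense spatial modulation the summary describes."""
    scale = waypoint_feats @ W_scale   # (N, D) per-token scale
    shift = waypoint_feats @ W_shift   # (N, D) per-token shift
    return layer_norm(tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
N, D, D_W = 16, 8, 4                       # tokens, channel dim, waypoint dim (toy sizes)
tokens = rng.standard_normal((N, D))
wp = rng.standard_normal((N, D_W))         # one waypoint feature per spatial token
W_s = rng.standard_normal((D_W, D)) * 0.1
W_b = rng.standard_normal((D_W, D)) * 0.1

out = adaln_modulate(tokens, wp, W_s, W_b)
print(out.shape)  # (16, 8)
```

The contrast with standard class-conditional AdaLN is that the modulation varies across spatial positions, so the waypoint can steer different image regions differently.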
On ImageNet 256x256, WiT outperforms strong pixel-space baselines, matches JiT-L/16 at 600 epochs after only 265 epochs, and pushes pure pixel-space generation closer to or beyond heavyweight latent-space diffusion models.