Beyond Language Modeling: An Exploration of Multimodal Pretraining Paper • 2603.03276 • Published 3 days ago • 69
Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models Paper • 2602.24264 • Published 7 days ago • 14
Enhancing Spatial Understanding in Image Generation via Reward Modeling Paper • 2602.24233 • Published 7 days ago • 47
BitDance: Scaling Autoregressive Generative Models with Binary Tokens Paper • 2602.14041 • Published 20 days ago • 52
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Paper • 2602.08683 • Published 26 days ago • 50
Autoregressive Image Generation with Masked Bit Modeling Paper • 2602.09024 • Published 25 days ago • 6
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss Paper • 2602.02493 • Published Feb 2 • 44
DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models Paper • 2512.24165 • Published Dec 30, 2025 • 51
Next-Embedding Prediction Makes Strong Vision Learners Paper • 2512.16922 • Published Dec 18, 2025 • 87
Towards Scalable Pre-training of Visual Tokenizers for Generation Paper • 2512.13687 • Published Dec 15, 2025 • 106
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images Paper • 2511.22805 • Published Nov 27, 2025 • 4
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer Paper • 2511.22699 • Published Nov 27, 2025 • 240
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models Paper • 2511.22625 • Published Nov 27, 2025 • 47