Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions (arXiv:2402.17485)
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (arXiv:2403.00522)
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (arXiv:2403.04692)
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (arXiv:2311.12793)
FlashFace: Human Image Personalization with High-fidelity Identity Preservation (arXiv:2403.17008)
An Introduction to Vision-Language Modeling (arXiv:2405.17247)
arXiv:2406.09414
Vision Language Models Are Blind (arXiv:2407.06581)
SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)
MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800)
arXiv:2408.07009
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
Building and Better Understanding Vision-Language Models: Insights and Future Directions (arXiv:2408.12637)
CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation (arXiv:2409.12576)
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation (arXiv:2412.07589)
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (arXiv:2502.01061)
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions (arXiv:2501.10020)
Qwen2.5-VL Technical Report (arXiv:2502.13923)
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (arXiv:2502.14786)
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model (arXiv:2411.19108)
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer (arXiv:2511.22699)