Adapting Large Language Models via Reading Comprehension
Paper
• 2309.09530
• Published • 82
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Paper
• 2309.09958
• Published • 20
Noise-Aware Training of Layout-Aware Language Models
Paper
• 2404.00488
• Published • 10
Streaming Dense Video Captioning
Paper
• 2404.01297
• Published • 13
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Paper
• 2404.00399
• Published • 42
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper
• 2404.03413
• Published • 27
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
Paper
• 2404.03118
• Published • 25
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
Paper
• 2404.03411
• Published • 10
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper
• 2404.02258
• Published • 107
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Paper
• 2404.02905
• Published • 74
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
Paper
• 2404.02733
• Published • 22
FlowMind: Automatic Workflow Generation with LLMs
Paper
• 2404.13050
• Published • 34
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Paper
• 2404.13686
• Published • 29
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Paper
• 2404.14396
• Published • 19
LAMBDA: A Large Model Based Data Agent
Paper
• 2407.17535
• Published • 37
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
• 2407.17490
• Published • 31
Very Large-Scale Multi-Agent Simulation in AgentScope
Paper
• 2407.17789
• Published • 40
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper
• 2407.18121
• Published • 17
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published • 41
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person
Paper
• 2407.16224
• Published • 29
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper
• 2407.15841
• Published • 39
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
Paper
• 2408.03178
• Published • 40
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Paper
• 2408.02629
• Published • 15
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published • 93
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
Paper
• 2409.00447
• Published • 3
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Paper
• 2505.16175
• Published • 42