FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
TL;DR: The first vision-centric, image-in, image-out generation model.
Homepage | Code | Paper | Dataset | Benchmark | Model
About
We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow: all inputs are converted into visual prompts, yielding a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. Extensive experiments show that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, and it establishes a foundation for fully vision-centric generative modeling in which perception and creation share a single continuous visual space.
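To make the flow matching idea concrete: sampling integrates a learned velocity field that carries a noise image at t=0 to a data image at t=1. The sketch below is a toy illustration of that sampling loop, not FlowInOne's actual implementation; the `velocity` callable stands in for the trained network, and all names and shapes here are assumptions.

```python
import numpy as np

def sample_flow(x0, velocity, steps=50):
    """Integrate the flow-matching ODE dx/dt = v(x, t) with Euler steps,
    carrying a starting image x0 at t=0 to a sample at t=1."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one Euler step along the flow
    return x

# Toy stand-in for the learned network: a linear pull toward a fixed target.
target = np.full((8, 8), 0.5)
v = lambda x, t: target - x

x0 = np.random.default_rng(0).standard_normal((8, 8))  # "noise image"
out = sample_flow(x0, v)  # ends up much closer to the target than x0
```

With a real model, `velocity` would be a neural network conditioned on the visual prompt, and `x0` would be sampled noise at the image resolution.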
🧪 Usage
You can download the model weights and preparation files:
```bash
# model weights
wget -O /path/to/download https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/flowinone_256px.pth

# model preparation
wget -O /path/to/download https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/preparation.tar.gz

# extract the archive
tar -xzvf "preparation.tar.gz" -C "/path/to/preparation"
```
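If you prefer scripting the setup, the wget and tar steps above can be mirrored in Python using only the standard library. This is a convenience sketch, not part of the official tooling; the function name and example paths are my own assumptions.

```python
import tarfile
import urllib.request
from pathlib import Path

def fetch_and_extract(url, archive, dest):
    """Download a .tar.gz if it is not already present, then unpack it
    into dest -- the same effect as the wget + tar commands above."""
    archive = Path(archive)
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)  # download once
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)  # like `tar -xzvf archive -C dest`

# e.g. fetch_and_extract(
#     "https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/preparation.tar.gz",
#     "preparation.tar.gz", "preparation")
```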
You can download the dataset examples:
```bash
wget -O /path/to/download https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/flowinone_demo_dataset.tar.gz

# extract the archive
tar -xzvf "flowinone_demo_dataset.tar.gz" -C "/path/to/flowinone_demo_dataset"
```
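Before extracting, you may want to peek at what the demo dataset archive contains. A small stdlib-only sketch (the function name is my own, and the archive path is an assumption about where you saved the download):

```python
import tarfile

def list_archive(path, limit=10):
    """Return the first `limit` member names of a .tar.gz
    without extracting anything to disk."""
    with tarfile.open(path, "r:gz") as tar:
        return [member.name for member in tar][:limit]

# e.g. list_archive("flowinone_demo_dataset.tar.gz")
```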
Our training and inference scripts are now available on GitHub!