JoyAI-Image-Edit
Awakening Spatial Intelligence in Unified Multimodal Understanding and Generation
[Paper](https://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image.pdf)
[GitHub](https://github.com/jd-opensource/JoyAI-Image)
[Hugging Face](https://huggingface.co/jdopensource/JoyAI-Image-Edit)
[License](LICENSE)
## 🐶 JoyAI-Image-Edit
JoyAI-Image-Edit is a multimodal foundation model specialized in instruction-guided image editing. It relies on strong spatial understanding (scene parsing, relational grounding, and instruction decomposition) to apply complex modifications precisely and controllably to the specified regions.
## 🚀 Quick Start
### 1. Environment Setup
**Requirements**: Python >= 3.10, CUDA-capable GPU
Clone the repository, create a conda environment, and install the package in editable mode:
```bash
git clone https://github.com/jd-opensource/JoyAI-Image
cd JoyAI-Image
conda create -n joyai python=3.10 -y
conda activate joyai
pip install -e .
```
> **Note on Flash Attention**: `flash-attn >= 2.8.0` is listed as a dependency for best performance.
#### Core Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `torch` | >= 2.8 | Deep learning framework (PyTorch) |
| `transformers` | >= 4.57.0, < 4.58.0 | Text encoder |
| `diffusers` | >= 0.34.0 | Pipeline utilities |
| `flash-attn` | >= 2.8.0 | Fast attention kernel |
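Before running inference it can help to confirm that the installed packages fall inside the ranges in the table. The following is a minimal sketch that uses only the standard library; the helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical and not part of the repository, and the version parsing is deliberately simplified (local tags such as `+cu128` and pre-release suffixes are ignored).

```python
import re
from importlib import metadata

# Bounds copied from the Core Dependencies table:
# (distribution name, inclusive minimum, exclusive upper bound or None).
REQUIREMENTS = [
    ("torch", "2.8", None),
    ("transformers", "4.57.0", "4.58.0"),
    ("diffusers", "0.34.0", None),
    ("flash-attn", "2.8.0", None),
]


def parse_version(text):
    """Turn '2.8.0+cu128' into (2, 8, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, minimum, upper=None):
    """Return True if minimum <= installed < upper (upper is optional)."""
    def pad(version):
        # Compare on exactly three numeric components, padding with zeros.
        return (parse_version(version) + (0, 0, 0))[:3]

    if pad(installed) < pad(minimum):
        return False
    return upper is None or pad(installed) < pad(upper)


def check_environment():
    """Return a list of (package, installed version or None, ok) tuples."""
    report = []
    for name, minimum, upper in REQUIREMENTS:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report.append((name, None, False))
            continue
        report.append((name, installed, satisfies(installed, minimum, upper)))
    return report


if __name__ == "__main__":
    for name, installed, ok in check_environment():
        status = "OK" if ok else "missing or out of range"
        print(f"{name:14s} {installed or '-':16s} {status}")
```

Running the script inside the `joyai` environment prints one line per package; any line not marked `OK` points at a dependency to reinstall.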
### 2. Inference
#### Image Editing
```bash
python inference.py \
--ckpt-root /path/to/ckpts_infer \
--prompt "Turn the plate blue" \
--image test_images/test_1.jpg \
--output outputs/result.png \
--seed 123 \
--steps 30 \
--guidance-scale 5.0 \
--basesize 1024
```
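To apply edits to several images, the command above can be wrapped in a small driver script. The sketch below is one possible approach, not something shipped with the repository: `build_edit_command` and `run_batch` are hypothetical helpers, and they only use flags that appear in this README. It assumes the script is run from the repository root so that `inference.py` resolves.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def build_edit_command(ckpt_root, image, prompt, output,
                       seed=123, steps=30, guidance_scale=5.0, basesize=1024):
    """Build the argument list for one inference.py call (flags from this README)."""
    return [
        sys.executable, "inference.py",
        "--ckpt-root", str(ckpt_root),
        "--prompt", prompt,
        "--image", str(image),
        "--output", str(output),
        "--seed", str(seed),
        "--steps", str(steps),
        "--guidance-scale", str(guidance_scale),
        "--basesize", str(basesize),
    ]


def run_batch(ckpt_root, jobs, out_dir="outputs", dry_run=True):
    """Run one edit per (image_path, prompt) pair and return the commands built.

    With dry_run=True the commands are only printed, which is useful for
    checking paths and prompts before spending GPU time.
    """
    out_dir = Path(out_dir)
    commands = []
    for index, (image, prompt) in enumerate(jobs):
        # Name each result after its input image plus a running index.
        output = out_dir / f"{Path(image).stem}_{index:03d}.png"
        cmd = build_edit_command(ckpt_root, image, prompt, output)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_batch(
        "/path/to/ckpts_infer",
        [("test_images/test_1.jpg", "Turn the plate blue")],
    )
```

Each job starts a fresh Python process, so the checkpoint is presumably reloaded for every image. That keeps the sketch simple; for large batches a long-lived process would likely be faster.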
---
### CLI Reference (`inference.py`)
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--ckpt-root` | str | *required* | Checkpoint root |
| `--prompt` | str | *required* | Edit instruction or text-to-image (T2I) prompt |
| `--image` | str | None | Input image path (required for editing, omit for T2I) |
| `--output` | str | `example.png` | Output image path |
| `--steps` | int | 50 | Denoising steps |
| `--guidance-scale` | float | 5.0 | Classifier-free guidance scale |
| `--seed` | int | 42 | Random seed for reproducibility |
| `--neg-prompt` | str | `""` | Negative prompt |
| `--basesize` | int | 1024 | Bucket base size for input image resizing (256/512/768/1024) |
| `--config` | str | auto | Config path; defaults to `/infer_config.py` |
| `--rewrite-prompt` | flag | off | Enable LLM-based prompt rewriting |
| `--rewrite-model` | str | `gpt-5` | Model name for prompt rewriting |
| `--hsdp-shard-dim` | int | 1 | FSDP shard dimension for multi-GPU (set to GPU count) |
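The table can also be read as an `argparse` definition. The sketch below mirrors the documented names, types, and defaults for illustration only; it is not the actual parser in `inference.py`. Treating the listed `--basesize` values as a strict `choices` set and representing the `auto` config default as `None` are assumptions, as is the `describe` helper.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the CLI reference table above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented inference.py options"
    )
    parser.add_argument("--ckpt-root", type=str, required=True)
    parser.add_argument("--prompt", type=str, required=True)
    parser.add_argument("--image", type=str, default=None)
    parser.add_argument("--output", type=str, default="example.png")
    parser.add_argument("--steps", type=int, default=50)
    parser.add_argument("--guidance-scale", type=float, default=5.0)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--neg-prompt", type=str, default="")
    # Assumption: only the four listed bucket sizes are accepted.
    parser.add_argument("--basesize", type=int, default=1024,
                        choices=[256, 512, 768, 1024])
    # Assumption: None stands in for the "auto" default in the table.
    parser.add_argument("--config", type=str, default=None)
    parser.add_argument("--rewrite-prompt", action="store_true")
    parser.add_argument("--rewrite-model", type=str, default="gpt-5")
    parser.add_argument("--hsdp-shard-dim", type=int, default=1)
    return parser


def describe(argv):
    """Parse argv and report the mode implied by the presence of --image."""
    args = build_parser().parse_args(argv)
    mode = "edit" if args.image else "t2i"
    return mode, args


if __name__ == "__main__":
    mode, args = describe([
        "--ckpt-root", "/path/to/ckpts_infer",
        "--prompt", "Turn the plate blue",
        "--image", "test_images/test_1.jpg",
    ])
    print(mode, args.steps, args.guidance_scale, args.seed)
```

Note that the quick-start command overrides two defaults: it passes `--steps 30` and `--seed 123`, whereas the table lists 50 and 42.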
### Spatial Editing Reference
JoyAI-Image supports three spatial editing prompt patterns: **Object Move**, **Object Rotation**, and **Camera Control**. For the most stable behavior, we recommend following the prompt templates below as closely as possible.
#### 1. Object Move
Use this pattern when you want to move a target object into a specified region.
**Prompt template:**
```text
Move the