Simple diffusion

Community Article Published April 6, 2026

Hello everyone! We are building a simple, fast, and compact diffusion model that can be trained and run on consumer GPUs while maintaining high quality. Simple Diffusion (sdxs-1b) is the first result of our experiments; we are releasing it as an alpha version under the Apache-2.0 license, along with open-source code for data preparation and training. https://huggingface.co/AiArtLab/sdxs-1b

How the model was created

The work began in December 2024, when we started studying linear transformers based on Sana. (We are AiArtLab, a small Telegram community where a group of enthusiasts experiments with models; Stas and I are responsible for the diffusion side.) After an unsuccessful fine-tune and a subsequent rework of that model, we concluded that it would be better to build everything from scratch.

In February 2025, I switched to an evolutionary development path based on the UNet architecture, while Stas chose a more radical approach and started developing DiT architectures and fast autoencoders. Over the course of six months, many architectural variants were trained, unfortunately without success: models either diverged, failed to generalize anatomy, or captured anatomy well but could not render fine details.

From August to September 2025, I decided to practice on simpler models and ran experiments with different VAE variants, which eventually resulted in creating a state-of-the-art VAE (at that time) and a collection of fine-tuned improvements for various VAEs.

In December 2025, the first version of sdxs-1b was trained (then ~0.8B parameters), featuring a UNet in the style of SD1.5, Long CLIP, a 16-channel Simple VAE, and a flow matching target.
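The flow matching target mentioned above can be sketched in a few lines. This toy snippet (plain Python lists instead of tensors, hypothetical helper name) shows the interpolation and velocity target of a rectified-flow-style objective; the repo's actual loss may differ in details such as time sampling and weighting.

```python
def flow_matching_pair(x0, noise, t):
    """Rectified-flow-style training pair: linearly interpolate data and
    noise, and return the constant velocity the network learns to predict.
    Hypothetical sketch; the exact objective in the repo may differ."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    v_target = [b - a for a, b in zip(x0, noise)]  # d(x_t)/dt
    return x_t, v_target

# Toy example on a 2-element "image":
x_t, v = flow_matching_pair(x0=[0.5, -0.2], noise=[1.0, 0.4], t=0.25)
```

The network is then trained with a simple MSE between its prediction and `v_target`.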

It might have been reasonable to stop at a small model — its capacity was sufficient for anime, for example — but we aimed for higher quality and expansion toward photorealistic detail.

In January–February 2026, all key ideas from SDXL were tested one by one, but ultimately we returned to a simpler, classical architecture. Surprisingly, the tests showed either degradation or significant slowdown with only marginal improvements across all five main hypotheses proposed by Stability in SDXL (despite it being one of the best models, in my opinion).

Around that time, Stas convinced me to revisit Flux.2 experiments. It turned out the model was more complex than expected and effectively 128-channel. However, tests showed that a small model could not be trained effectively with 128 channels. As a result, the VAE was converted to 32 channels (where the extended 1.5 model was already training successfully), then further hacked and fine-tuned into a full-fledged 32-channel asymmetric VAE. Fine-tuning took a surprisingly short time (about 2 days) and was done on a single GPU. Training a VAE “from scratch” tends to blur details, so we used five simultaneous targets along with a custom normalization algorithm.

VAE quality

At 16× compression, our model achieves record PSNR and LPIPS values by a large margin and surpasses the FLUX.1 VAE even when compared against models with 8× compression. Of course, much of this is thanks to Flux.2, which served as the foundation.

8x scale factor
SDXL             | MSE=1.925e-03 PSNR=30.00 LPIPS=0.123 Edge=0.181 KL=32.113
FLUX.1           | MSE=4.098e-04 PSNR=36.06 LPIPS=0.033 Edge=0.083 KL=13.127
FLUX.2           | MSE=2.425e-04 PSNR=38.33 LPIPS=0.023 Edge=0.065 KL=2.160

16x scale factor
Wan2.2-TI2V-5B (2 GB)  | MSE=7.034e-04 PSNR=34.65 LPIPS=0.050 Edge=0.115 KL=9.429
sdxs-1b (200 MB)       | MSE=2.655e-04 PSNR=37.83 LPIPS=0.026 Edge=0.066 KL=2.170

To the best of our knowledge, this VAE achieves SOTA for architectures with a 16x scale factor, and is highly competitive overall.
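For reference, two of the table's metrics (MSE and PSNR) follow simple textbook formulas. The exact normalization behind the numbers above (pixel range, per-channel averaging) is not specified, so the snippet below shows the standard definitions rather than a guaranteed reproduction of the table.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(mse_value, peak=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, peak]."""
    return 10.0 * math.log10(peak ** 2 / mse_value)
```

LPIPS, by contrast, is a learned perceptual distance and requires a pretrained network (e.g. the `lpips` package).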

Later, we switched to Qwen3.5 as the text encoder, since I found its training speed remarkable compared to non-multimodal models.

April 2026 — present day: alpha version


Data and preprocessing

For test training, we assembled a dataset of ~1–2 million images from open datasets such as Midjourney, Nijourney, etc. These are mostly drawings and illustrations, with a small portion of photographs. Captions are provided both in danbooru format (short tags) and as natural descriptions (up to 250 tokens). Before training, images were resized to resolutions between 768 and 1408 pixels (in steps of 64), which lets the model handle different aspect ratios. Training itself is performed at half that resolution.
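The resize rule above can be sketched as a simple bucketing function. `snap_to_bucket` is a hypothetical name; the repo's actual preprocessing scripts may round differently or preserve aspect ratio by total area.

```python
def snap_to_bucket(width, height, lo=768, hi=1408, step=64):
    """Snap each side to the nearest multiple of `step`, clamped to the
    [lo, hi] range described above. Hypothetical sketch of the resize
    rule, not the repo's exact code."""
    def snap(x):
        return max(lo, min(hi, round(x / step) * step))
    return snap(width), snap(height)

# A 1000x700 source lands in the 1024x768 bucket:
print(snap_to_bucket(1000, 700))  # (1024, 768)
```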

Final architecture

SDXS-1B consists of three components:

UNet — 1.6B parameters (about twice as many as SD1.5). Architecturally, the model is closer to Stable Diffusion 1.5 than to SDXL. The number of blocks is relatively small and evenly distributed, which proved optimal for balanced attention across different parts of the image (anatomy vs. details).

Text encoder Qwen3.5–2B. We tested various encoders (CLIP, LongCLIP, SigLIP, MexMaSigLip, Qwen-0.6B, etc.) and found that Qwen3.5 provides the best quality after LongCLIP, while adding multilingual and multimodal capabilities along with typical LLM advantages (refinement, chat, image understanding). Embeddings are taken from the penultimate layer, enabling more effective extraction of structural information.
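Extracting the penultimate layer is a one-liner once the encoder returns all hidden states; with Hugging Face transformers, that list is `outputs.hidden_states` when the model is called with `output_hidden_states=True` (the last element is the final layer, so index -2 is the penultimate one). A minimal sketch with a hypothetical function name:

```python
def penultimate_hidden_state(hidden_states):
    """Pick the penultimate layer's output as the conditioning embedding.
    `hidden_states` is the per-layer list a transformer encoder returns
    (e.g. outputs.hidden_states when output_hidden_states=True)."""
    return hidden_states[-2]

# With a 3-layer toy "model", index -2 selects the middle layer:
print(penultimate_hidden_state(["layer0", "layer1", "layer2"]))  # layer1
```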

Asymmetric VAE (32 latent channels, encoder 8×, decoder 16×). The encoder compresses the image by 8×, while the decoder upsamples the latent by 16× (instead of the usual 8×), so the reconstruction comes out at twice the input resolution. During inference, this acts as an integrated latent 2× upscaler, increasing resolution within the trained range (512–768px) without hallucinating new details. This approach improves generation quality and scalability while significantly reducing training cost. For users, the main benefit is faster inference: generating at 512 is disproportionately faster than at 1024, with similar results.

Upscaler

A key feature of the VAE is the built-in 2× upscaler. It doubles image resolution, removes pixelation, and preserves style and details (VAEs are inherently trained to reconstruct images as faithfully as possible).

Examples show that the upscaled image remains essentially the same, just at twice the resolution. This approach works well even for video and can be used independently of the generative pipeline (you can use only the VAE weights). Of course, there are specialized upscalers for specific tasks (JPEG artifact removal, TV compression artifacts, etc.), but there is a class of problems where fidelity is more important than creativity, such as enlarging X-rays or reproductions of paintings.

(Figure: original image vs. 2× upscale)

We also implemented prompt refinement: short or tag-based prompts are automatically enhanced by an internal refiner to help the model better understand the task (text processing differs between CLIP-like models and language models). It is also possible to feed the encoder an image, or even audio, instead of text (an experimental image-to-image mode).

Training process and hyperparameters

Training was staged: first teaching the model basic forms (low-level features), then overall composition, and finally fine details (faces, textures). We used AdamW8bit with a base LR of 4e-5 and a minimum of 4e-6 or lower, which proved quite forgiving of tuning errors. We experimented with learning-rate scaling across stages (the 0.5, 1.0, 3.0, and 5.0 scenarios described in the README), but generally stayed within standard values. Experiments with Muon showed degradation in anatomy. Overall, the role of hyperparameters is greatly overestimated: the model either learns or it does not.
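Only the base (4e-5) and minimum (4e-6) learning rates come from the text; the schedule shape below is a hypothetical cosine decay, shown just to illustrate how such a range might be traversed over a training stage.

```python
import math

def cosine_lr(step, total_steps, base_lr=4e-5, min_lr=4e-6):
    """Cosine decay from base_lr down to min_lr over total_steps.
    Hypothetical schedule; only the endpoint values come from the text."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```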

Training used between 1 and 8 RTX 5090 GPUs and took about 2–3 months in total (the model was continuously refined and retrained). For text preprocessing, we limited sequence length to 250 tokens and applied a relatively high dropout (10%) on the text encoder output to improve robustness to noise.
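The 10% dropout on the text-encoder output can be sketched as standard inverted dropout. This toy version (plain lists, hypothetical name) only shows the mechanism; the repo may apply it per token or on whole embeddings instead.

```python
import random

def embedding_dropout(emb, p=0.10, rng=random):
    """Inverted dropout on a text embedding: zero each element with
    probability p and rescale survivors by 1/(1-p) so the expected value
    is unchanged. Hypothetical sketch of the 10% dropout described above."""
    if p <= 0.0:
        return list(emb)
    if p >= 1.0:
        return [0.0] * len(emb)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in emb]
```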

All code is open-source: the repository includes dataset conversion scripts and a simple monolithic train.py. Detailed commands are provided in the README.

Usage instructions

You can load SDXS-1B easily via the Hugging Face Diffusers library.
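A minimal loading sketch follows. Untested assumptions: the exact pipeline class, whether `trust_remote_code` is required, and the call arguments all depend on the model card, so check it before running.

```python
def generate(prompt, model_id="AiArtLab/sdxs-1b", steps=20):
    """Load the pipeline from the Hub and render one image.
    Hypothetical sketch: verify the pipeline class and arguments
    against the model card before running."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16, trust_remote_code=True
    )
    pipe.to("cuda")
    return pipe(prompt, num_inference_steps=steps).images[0]
```

Calling `generate("a watercolor fox")` would then download the weights and return a PIL image.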

It is also possible to enhance prompts or upscale images directly within the pipeline.

Additional materials include training code, dataset preparation tools, an online demo, and model weights.

The average quality of generations on random prompts (texts unfamiliar to the model) will be lower due to dataset limitations.

Limitations

Despite the work done, SDXS-1B remains an alpha version. Due to the relatively small dataset, it does not cover all domains — for example, photographs are weaker than cartoon-style illustrations. For now, we recommend adding "photo" to the negative prompt to suppress photorealism.

The model is released under the Apache-2.0 license, allowing free use in projects.

Future work and outlook

In the future, a fine-tuned version of this model could potentially restore vision to blind people, alter reality in real time, or run locally on mobile devices. At the moment, however, it is mainly useful for generating (sometimes amusing) images.

Training is currently paused, and we have released sdxs-1b “as is.”

Next steps include expanding the training dataset and completing training, as well as exploring Turbo LoRA and ControlNet with a gradual transition toward a video model.

The model is available on Hugging Face, and we welcome community contributions to its training and development.

I would be glad to hear any collaboration proposals — contact details are provided in the model card.
