OSDFace – Pretrained Weights (Mirror)
This is an unofficial mirror. All credit goes to the original authors. The weights are mirrored here from the official OSDFace repository for convenience, as the original download is hosted on OneDrive/Google Drive, which can be slow or inaccessible in some regions. Please cite the original paper and star the original repo if you use these weights.
Overview
OSDFace (One-Step Diffusion Model for Face Restoration) is a single-step diffusion model that restores degraded, low-quality face images into high-fidelity, identity-consistent outputs. It was accepted at CVPR 2025.
Unlike multi-step diffusion approaches, OSDFace requires only one forward pass through a modified Stable Diffusion 2.1 UNet, making it significantly faster at inference while achieving state-of-the-art results on both synthetic (CelebA-Test) and real-world (Wider-Test, LFW-Test, WebPhoto-Test) benchmarks.
The key innovations are:
- Visual Representation Embedder (VRE): A VQ-VAE encoder that tokenizes the low-quality input face and produces visual prompt embeddings via a vector-quantized dictionary. These embeddings replace the text encoder's output and are fed directly into the UNet's cross-attention layers.
- Facial Identity Loss: A face-recognition-derived loss that enforces identity consistency between the restored and ground-truth faces.
- GAN Guidance: A generative adversarial network guides the one-step diffusion to align the output distribution with the ground truth.
Usage
Prerequisites
- Base model: stabilityai/stable-diffusion-2-1-base
- Python 3.10, PyTorch 2.4.0, diffusers 0.27.2
Quick Start
# Clone the official repo
git clone https://github.com/jkwang28/OSDFace.git
cd OSDFace
# Download these weights into pretrained/
# Place: associate_2.ckpt, embedding_change_weights.pth, pytorch_lora_weights.safetensors
# Run inference (with LoRA merging for speed)
python infer.py \
  --input_image data/WebPhoto-Test \
  --output_dir results/WebPhoto-Test \
  --pretrained_model_name_or_path "stabilityai/stable-diffusion-2-1-base" \
  --img_encoder_weight "pretrained/associate_2.ckpt" \
  --ckpt_path pretrained \
  --merge_lora \
  --mixed_precision fp16 \
  --gpu_ids 0
**Note on the pretrained model:** although the project is based on `stabilityai/stable-diffusion-2-1-base`, we use `Manojb/stable-diffusion-2-1-base` because the former can't be downloaded from Hugging Face.
Files in This Repository
associate_2.ckpt (1.87 GB)
The VQ-VAE image encoder (referred to as the Visual Representation Embedder in the paper). This is the core component that understands the degraded input face.
It contains a multi-head encoder with downsampling blocks, a mid-block with attention, and a vector quantizer with a learned 1024-entry codebook (embedding dim 512). At inference, the encoder processes a 512×512 low-quality face, extracts spatial features, quantizes them against the codebook, and selects the 77 closest (non-duplicate) codebook entries, producing a (batch, 77, 512) tensor that acts as a drop-in replacement for CLIP text embeddings in the UNet's cross-attention.
Loaded via: --img_encoder_weight associate_2.ckpt
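The quantize-and-select step described above can be sketched as follows. This is a hypothetical reconstruction, not the repo's implementation: it ranks the 1024 codebook entries by their minimum distance to any encoder feature and keeps the 77 nearest distinct entries.

```python
import torch

def select_visual_tokens(features: torch.Tensor,
                         codebook: torch.Tensor,
                         num_tokens: int = 77) -> torch.Tensor:
    """Sketch of the VRE token selection (hypothetical, for illustration).

    features: (N, 512) flattened spatial features from the VQ-VAE encoder
    codebook: (1024, 512) learned embedding dictionary
    Returns a (num_tokens, 512) visual prompt of distinct codebook entries.
    """
    dists = torch.cdist(features, codebook)        # (N, 1024) pairwise distances
    per_entry = dists.min(dim=0).values            # closest feature per entry, (1024,)
    idx = per_entry.topk(num_tokens, largest=False).indices  # 77 distinct indices
    return codebook[idx]                           # (77, 512)
```

Batched over B images, this yields the (B, 77, 512) tensor that stands in for CLIP text embeddings.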
embedding_change_weights.pth (1.58 MB)
A lightweight embedding projection module (TwoLayerConv1x1) that maps the VRE output from 512 dimensions to 1024 dimensions, matching the hidden size expected by Stable Diffusion 2.1's UNet cross-attention layers.
Architecture: two 1×1 Conv1d layers with SiLU activation (512 → 256 → 1024), operating over the 77-token sequence.
This module is used in the default configuration (without --cat_prompt_embedding). When --cat_prompt_embedding is enabled, the VRE instead outputs 154 tokens at 512-dim which are reshaped to 77 tokens at 1024-dim, bypassing this module entirely.
Loaded from: <ckpt_path>/embedding_change_weights.pth
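A minimal sketch of this projection, assuming a single SiLU between the two 1×1 convolutions (the exact layout in the repo may differ):

```python
import torch
import torch.nn as nn

class TwoLayerConv1x1(nn.Module):
    """Hypothetical reconstruction of the 512 -> 256 -> 1024 projection."""
    def __init__(self, in_dim=512, mid_dim=256, out_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(in_dim, mid_dim, kernel_size=1),   # 512 -> 256
            nn.SiLU(),
            nn.Conv1d(mid_dim, out_dim, kernel_size=1),  # 256 -> 1024
        )

    def forward(self, x):
        # x: (B, 77, 512); Conv1d expects (B, channels, length)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (B, 77, 1024)
```

A 1×1 convolution over the token axis is just a per-token linear map, so this is equivalent to applying a small MLP independently to each of the 77 tokens.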
pytorch_lora_weights.safetensors (67.9 MB)
LoRA (Low-Rank Adaptation) weights for the Stable Diffusion 2.1 UNet. These adapt the frozen SD2.1 UNet to perform one-step face restoration conditioned on the VRE embeddings.
Default LoRA configuration: rank 16, alpha 16 (effective scaling factor alpha/rank = 1.0). The weights cover both standard LoRA layers (lora_A/lora_B) and some additional lora.up/lora.down layers.
These can be loaded in two ways:
- Dynamic loading (default): loaded at runtime via diffusers' `load_lora_weights()`
- Merged loading (`--merge_lora`): pre-merged into the UNet weights before inference for slightly faster execution
Loaded from: <ckpt_path>/pytorch_lora_weights.safetensors
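The two loading modes are mathematically equivalent: with rank 16 and alpha 16 the scaling factor is alpha/rank = 1.0, and merging simply folds the low-rank update into the base weight. A toy numerical check (plain tensors, not the actual UNet):

```python
import torch

torch.manual_seed(0)
d, rank, alpha = 8, 16, 16            # paper's rank/alpha config
scale = alpha / rank                  # effective scaling = 1.0

W = torch.randn(d, d)                 # frozen base weight
A = torch.randn(rank, d) * 0.01       # lora_A (down-projection)
B = torch.randn(d, rank) * 0.01       # lora_B (up-projection)
x = torch.randn(3, d)

# Dynamic path: base output plus the low-rank correction at runtime
y_dynamic = x @ W.T + scale * (x @ A.T @ B.T)

# Merged path: fold the correction into the weight once, then run plain
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

assert torch.allclose(y_dynamic, y_merged, atol=1e-5)
```

Merging trades the small per-layer overhead of the dynamic path for a one-time weight update, which is why `--merge_lora` is slightly faster at inference.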
Key Inference Arguments
| Argument | Default | Description |
|---|---|---|
| `--merge_lora` | off | Merge LoRA into UNet weights (recommended) |
| `--mixed_precision` | `fp32` | Use `fp16` for faster inference / lower VRAM |
| `--gpu_ids` | `[0]` | Multi-GPU support, e.g. `--gpu_ids 0 1 2 3` |
| `--cat_prompt_embedding` | off | Alternative embedding strategy (skips the embedding_change module) |
| `--lora_rank` | 16 | LoRA rank (must match training) |
| `--lora_alpha` | 16 | LoRA alpha (must match training) |
Inference Pipeline (Summary)
- Input image resized to 512×512
- VRE encodes the LQ face → `(B, 77, 512)` visual prompt
- Embedding projection maps to `(B, 77, 1024)` (or the concatenation path)
- VAE encodes the LQ face to latent space
- UNet performs a single denoising step at timestep 399, conditioned on the visual prompt
- Predicted clean latent is decoded by the VAE → restored face
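The steps above can be summarized as a shape-level sketch. All component signatures here are hypothetical stand-ins (the real VRE, projection, VAE, and UNet come from the OSDFace repo and diffusers), and the UNet is assumed to return the clean-latent prediction directly:

```python
import torch

def osdface_one_step(lq, vre, proj, vae, unet, t=399):
    """Shape-level sketch of the one-step restoration pipeline.

    lq: (B, 3, 512, 512) low-quality face batch
    """
    prompt = proj(vre(lq))                    # (B, 77, 512) -> (B, 77, 1024)
    z = vae.encode(lq)                        # LQ latent, e.g. (B, 4, 64, 64)
    ts = torch.full((lq.shape[0],), t)        # single fixed timestep 399
    z0 = unet(z, ts, prompt)                  # one denoising pass -> clean latent
    return vae.decode(z0)                     # restored face, (B, 3, 512, 512)
```

No iterative sampling loop appears anywhere: the UNet is queried exactly once per image, which is the source of OSDFace's speed advantage over multi-step diffusion restorers.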
Citation
@InProceedings{wang2025osdface,
author = {Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
title = {{OSDFace}: One-Step Diffusion Model for Face Restoration},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {12626-12636}
}
Links
- Paper (arXiv)
- Official Repository
- Project Page