LuxDiT
This is the luxdit_video checkpoint for LuxDiT: Lighting Estimation with Video Diffusion Transformer. It is finetuned on video data and includes a LoRA adapter for real scenes.
Model description
LuxDiT is a generative lighting estimation model that predicts high-quality HDR environment maps from visual input. It produces accurate lighting while preserving scene semantics, enabling realistic virtual object insertion under diverse lighting conditions. This model is ready for non-commercial use.
- Checkpoint: Video model (video-finetuned)
- LoRA: Included for real-scene generalization
- Paper: LuxDiT: Lighting Estimation with Video Diffusion Transformer
- Project page: https://research.nvidia.com/labs/toronto-ai/LuxDiT/
Use case: LuxDiT supports research and prototyping in video lighting estimation. This release is an open-source implementation of our research paper, intended for AI research, development, and benchmarking in lighting estimation.
Model architecture
- Architecture type: Transformer (based on CogVideoX)
- Parameters: 5B
- Input: RGB video frames; shape [batch_size, num_frames, height, width, 3]; recommended resolution 480×720
- Output: RGB video frames (dual tonemapped LDR and log); output resolution 256×512; use the HDR merger to obtain .exr HDR envmaps
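As a quick sanity check on the input layout above, the following is a minimal sketch and not part of the LuxDiT codebase. The helper name `check_video_shape` is hypothetical. It only verifies that a shape tuple follows the channel-last `[batch_size, num_frames, height, width, 3]` convention and reports whether the spatial size matches the recommended 480×720 resolution.

```python
RECOMMENDED_HW = (480, 720)  # recommended (height, width) stated in this card


def check_video_shape(shape):
    """Validate a channel-last video batch shape.

    Hypothetical helper for illustration only. Returns True when the spatial
    size matches the recommended resolution and False otherwise; raises
    ValueError when the layout itself is wrong.
    """
    if len(shape) != 5:
        raise ValueError(f"expected 5 dims [B, T, H, W, 3], got {len(shape)}")
    batch_size, num_frames, height, width, channels = shape
    if channels != 3:
        raise ValueError(f"expected 3 RGB channels in the last dim, got {channels}")
    if min(batch_size, num_frames, height, width) < 1:
        raise ValueError("all dimensions must be positive")
    return (height, width) == RECOMMENDED_HW
```

A return value of False does not mean the input is unusable (the object-scan example in this card runs at 512×512); it only flags a departure from the recommended size.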
Software and hardware
- Runtime: Python and PyTorch
- Supported hardware: NVIDIA Ampere (e.g. A100 GPUs)
- Operating system: Linux
How to use
Download from Hugging Face
From the LuxDiT repository root:
python download_weights.py --repo_id <HF_ORG>/luxdit_video
This saves the checkpoint to checkpoints/luxdit_video by default. Use --local_dir to override.
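If you would rather fetch the checkpoint from Python, the sketch below uses `snapshot_download` from the `huggingface_hub` package. Whether `download_weights.py` works this way internally is an assumption, the helper names `default_local_dir` and `download_checkpoint` are hypothetical, and `<HF_ORG>` stays a placeholder for the actual organization.

```python
import os


def default_local_dir(repo_id, root="checkpoints"):
    """Mirror the default described above: checkpoints/<repo name>."""
    return os.path.join(root, repo_id.split("/")[-1])


def download_checkpoint(repo_id, local_dir=None):
    """Download every file in the repo into local_dir and return the local path.

    ASSUMPTION: this is an illustrative stand-in, not the logic of
    download_weights.py.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir or default_local_dir(repo_id)
    return snapshot_download(repo_id=repo_id, local_dir=target)
```

Calling `download_checkpoint("<HF_ORG>/luxdit_video")` with the placeholder replaced would place the files under `checkpoints/luxdit_video`, matching the default of the command-line script.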
Inference: synthetic videos (in-domain)
Requires --data_type video:
DIT_PATH=checkpoints/luxdit_video
INPUT_DIR=examples/input_demo/synthetic_videos
OUTPUT_DIR=test_output/synthetic_videos
python inference_luxdit.py \
--config configs/luxdit_base.yaml \
--transformer_path $DIT_PATH \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--resolution 480 720 \
--guidance_scale 2.5 \
--num_inference_steps 40 \
--seed 33 \
--data_type video
python hdr_merger.py \
--input_dir $OUTPUT_DIR/ldr_log \
--output_dir $OUTPUT_DIR/hdr
Inference: real scenes from video (with LoRA)
Use the LoRA adapter in this checkpoint for better generalization to real video:
DIT_PATH=checkpoints/luxdit_video
LORA_PATH=checkpoints/luxdit_video/lora
INPUT_DIR=examples/input_demo/scene_videos
OUTPUT_DIR=test_output/scene_videos
python inference_luxdit.py \
--config configs/luxdit_base.yaml \
--transformer_path $DIT_PATH \
--lora_dir $LORA_PATH \
--lora_scale 0.8 \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--resolution 480 720 \
--guidance_scale 2.5 \
--num_inference_steps 40 \
--seed 33 \
--data_type video
python hdr_merger.py \
--input_dir $OUTPUT_DIR/ldr_log \
--output_dir $OUTPUT_DIR/hdr
Adjust lora_scale (e.g. 0.0–1.0) to control how much the input scene is merged into the estimated envmap.
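To compare several `lora_scale` values side by side, one option is to generate one command per scale. The sketch below only reuses the flags from the real-scene command above. The function names, the chosen scale values, and the `lora_<scale>` output subdirectory naming are assumptions made for illustration.

```python
import shlex


def build_inference_cmd(input_dir, output_dir, lora_scale,
                        dit_path="checkpoints/luxdit_video",
                        lora_path="checkpoints/luxdit_video/lora"):
    """Assemble the real-scene inference command shown above as an argv list."""
    return [
        "python", "inference_luxdit.py",
        "--config", "configs/luxdit_base.yaml",
        "--transformer_path", dit_path,
        "--lora_dir", lora_path,
        "--lora_scale", str(lora_scale),
        "--input_dir", input_dir,
        "--output_dir", output_dir,
        "--resolution", "480", "720",
        "--guidance_scale", "2.5",
        "--num_inference_steps", "40",
        "--seed", "33",
        "--data_type", "video",
    ]


def sweep_commands(input_dir, output_root, scales=(0.0, 0.4, 0.8, 1.0)):
    """Return one command per scale, each writing to its own subdirectory."""
    cmds = []
    for scale in scales:
        out_dir = f"{output_root}/lora_{scale:.1f}"  # hypothetical naming scheme
        cmds.append(build_inference_cmd(input_dir, out_dir, scale))
    return cmds


if __name__ == "__main__":
    for cmd in sweep_commands("examples/input_demo/scene_videos",
                              "test_output/scene_videos"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Each run still needs its own `hdr_merger.py` pass over the corresponding `ldr_log` directory to produce `.exr` files.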
Inference: object video scan with camera poses
For multi-view object captures, you can optionally provide camera poses (--camera_pose_file, OpenCV format) to align the estimated envmap to a canonical layout:
DIT_PATH=checkpoints/luxdit_video
LORA_PATH=checkpoints/luxdit_video/lora
INPUT_DIR=examples/input_demo/object_scans/antman
OUTPUT_DIR=test_output/object_scans/antman
CAM_FILE=examples/input_demo/object_scans/antman/antman.camera.json
python inference_luxdit.py \
--config configs/luxdit_base.yaml \
--transformer_path $DIT_PATH \
--lora_dir $LORA_PATH \
--lora_scale 0.0 \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--resolution 512 512 \
--guidance_scale 2.5 \
--num_inference_steps 40 \
--seed 33 \
--data_type video \
--camera_pose_file $CAM_FILE
python hdr_merger.py \
--input_dir $OUTPUT_DIR/ldr_log \
--output_dir $OUTPUT_DIR/hdr
Related checkpoints
| Checkpoint | Description |
|---|---|
| luxdit_base | Image-finetuned, with LoRA for real scenes |
| luxdit_video (this model) | Video-finetuned, with LoRA for real scenes |
For single-image inputs, use luxdit_base instead.
Training data
This checkpoint is video-finetuned on the SyntheticScenes dataset:
- Modality: Video
- Scale: ~108,000 rendered videos; each video has 57 frames at 704×1280 resolution
- Collection: Synthetic data generated with an OptiX-based physically based path tracer
- Labels: Produced by the renderer (no manual labeling)
- Per sample: Input RGB (LDR) video and HDR environment lighting
Testing and evaluation use held-out 10% splits of the same dataset.
Output format
The model outputs dual tonemapped environment maps (LDR and log); use the HDR merger to get .exr HDR envmaps. By default, the camera pose of the input image defines the world frame. See the main README for the exact layout of LDR/log vs merged HDR.
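To give some intuition for why two tonemapped maps are enough to recover HDR values, here is a per-pixel sketch. It is not the logic of `hdr_merger.py`: the tonemapping curves (a Reinhard-style curve for the LDR map, a normalized log curve for the log map), the `LOG_MAX` constant, the saturation threshold, and all function names are assumptions chosen for illustration.

```python
import math

LOG_MAX = 10000.0  # ASSUMPTION: intensity that the log curve maps to 1.0


def inverse_reinhard(ldr):
    """Invert y = x / (1 + x); clamp just below 1 to avoid division by zero."""
    ldr = min(max(ldr, 0.0), 0.999)
    return ldr / (1.0 - ldr)


def inverse_log(log_val):
    """Invert y = log(1 + x) / log(1 + LOG_MAX)."""
    log_val = max(log_val, 0.0)
    return math.expm1(log_val * math.log1p(LOG_MAX))


def merge_pixel(ldr, log_val, threshold=0.9):
    """Use the LDR branch for dim pixels and the log branch near saturation.

    ASSUMPTION: a hard threshold is the simplest possible merge rule; the
    actual merger may blend the two branches differently.
    """
    if ldr < threshold:
        return inverse_reinhard(ldr)
    return inverse_log(log_val)
```

The design point is that the LDR map keeps precision in the dim and mid range, while the log map stays informative for very bright sources such as the sun, where the LDR curve is nearly flat.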
Ethical considerations
NVIDIA believes Trustworthy AI is a shared responsibility. When using this model in accordance with the terms of service, ensure it meets requirements for your use case and addresses potential misuse. You are responsible for having proper rights and permissions for all input image and video content; if content includes people, personal health information, or intellectual property, generated outputs will not blur or preserve proportions of subjects. Users are responsible for model inputs and outputs and for implementing appropriate guardrails and safety mechanisms before deployment. To report model quality, risk, security vulnerabilities, or other concerns, see NVIDIA AI Concerns.
License
NVIDIA OneWay Noncommercial License. See the LICENSE in the LuxDiT repository.
Citation
@article{liang2025luxdit,
title={Luxdit: Lighting estimation with video diffusion transformer},
author={Liang, Ruofan and He, Kai and Gojcic, Zan and Gilitschenski, Igor and Fidler, Sanja and Vijaykumar, Nandita and Wang, Zian},
journal={arXiv preprint arXiv:2509.03680},
year={2025}
}
