# GaussianFormer-V10b
GaussianFormer is a proof-of-concept neural renderer that takes a 3D Gaussian Splatting scene as input and synthesizes novel views without per-scene optimization at inference time. It adapts RenderFormer (SIGGRAPH 2025) — a transformer-based neural renderer designed for triangle meshes — by replacing its mesh input encoder with a Gaussian-native module that maps each Gaussian's 14-dimensional parameter vector to a scene token. The two-stage transformer (view-independent scene encoder + view-dependent ray decoder) is otherwise unchanged.
This checkpoint is V10b epoch 26: an LPIPS-VGG perceptual fine-tune (loss weight 0.2) of V9 epoch 60.
Code & docs: github.com/SVLwoof/gaussianformer
## Lineage
- Trained from the RenderFormer architecture (`microsoft/renderformer-v1.1-swin-large`) with the mesh input encoder replaced by a learned linear projection over 14-d Gaussian parameters: [pos(3), scale(3), quat(4), rgb(3), opacity(1)].
- V9 (pretrain). Single-object 3DGS scans from Objaverse_Splats (2,667 train / 183 val), pure log-HDR L1 loss, 60 epochs on 3 GPUs.
- V10b (this checkpoint). Fine-tuned from V9 ep60 with a combined log-HDR L1 + LPIPS-VGG loss on tonemapped LDR output, 26 epochs on 4 GPUs (cosine LR 5e-5 → 5e-7).
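The V10b objective described above can be sketched as follows. This is a minimal sketch, not the repo's training code: the exact tonemap, the LPIPS input range, and the reduction are assumptions, and `lpips_fn` stands in for something like an `lpips.LPIPS(net='vgg')` instance.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_hdr, gt_hdr, lpips_fn, lpips_weight=0.2):
    # log-HDR L1: L1 in log(1 + x) space, stable across the HDR dynamic range
    l1 = F.l1_loss(torch.log1p(pred_hdr), torch.log1p(gt_hdr))

    def tonemap(x):
        # assumed tonemap: Reinhard + gamma to [0, 1], then rescale to [-1, 1]
        # (the range LPIPS-VGG expects by default)
        ldr = (x / (1.0 + x)).clamp(0.0, 1.0) ** (1.0 / 2.2)
        return ldr * 2.0 - 1.0

    # perceptual term on tonemapped LDR output
    perceptual = lpips_fn(tonemap(pred_hdr), tonemap(gt_hdr)).mean()
    return l1 + lpips_weight * perceptual
```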
## How to use
```python
import torch
import h5py
import numpy as np

from gaussianformer.pipelines.rendering_pipeline import GaussianFormerRenderingPipeline

device = torch.device("cuda")
pipeline = GaussianFormerRenderingPipeline.from_pretrained("shahafvl/gaussianformer-v10b")
pipeline.to(device)

# Load an HDF5 scene with fields: means [N,3], scales [N,3], rotations [N,4]
# (w,x,y,z), colors [N,3], opacities [N,1], c2w [V,4,4], fov [V].
with h5py.File("scene.h5", "r") as f:
    g = torch.cat([
        torch.from_numpy(np.array(f["means"]).astype(np.float32)),
        torch.from_numpy(np.array(f["scales"]).astype(np.float32)),
        torch.from_numpy(np.array(f["rotations"]).astype(np.float32)),
        torch.from_numpy(np.array(f["colors"]).astype(np.float32)),
        torch.from_numpy(np.array(f["opacities"]).astype(np.float32)).reshape(-1, 1),
    ], dim=-1).unsqueeze(0).to(device)  # [1, N, 14]
    mask = torch.ones(g.shape[:2], dtype=torch.bool, device=device)
    c2w = torch.from_numpy(np.array(f["c2w"]).astype(np.float32)).unsqueeze(0).to(device)
    fov = torch.from_numpy(np.array(f["fov"]).astype(np.float32))[..., None].unsqueeze(0).to(device)

imgs = pipeline(gaussians=g, mask=mask, c2w=c2w, fov=fov,
                resolution=512, torch_dtype=torch.float16)
# imgs: [1, V, H, W, 3], linear HDR
```
For a complete CLI inference script (with tone mapping + EXR/PNG output) see `infer_gaussian.py` in the project repo.
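Since the pipeline outputs linear HDR, a quick LDR conversion is needed for display. Below is a minimal sketch (Reinhard tonemap + gamma encoding, an assumption standing in for the repo's tonemapper in `infer_gaussian.py`):

```python
import numpy as np

def hdr_to_uint8(hdr, gamma=2.2):
    # hdr: float array [H, W, 3], linear HDR values >= 0 from the pipeline
    ldr = hdr / (1.0 + hdr)                     # Reinhard: compress to [0, 1)
    ldr = np.clip(ldr, 0.0, 1.0) ** (1.0 / gamma)  # gamma-encode for display
    return (ldr * 255.0 + 0.5).astype(np.uint8)    # round to 8-bit
```

The resulting array can be written to PNG with any image library.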
## Headline results
PSNR (dB, vs full-gsplat ground truth) on a real-world object scan (Tomatoes) across token budgets N:
| Model | N=5k | N=10k | N=20k | N=30k |
|---|---|---|---|---|
| V6 (multi-obj) | 21.78 | 21.95 | 22.54 | 23.00 |
| V9 ep60 (L1) | 25.90 | 26.94 | 27.74 | 28.24 |
| V10b ep26 | 25.80 | 26.70 | 27.41 | 27.81 |
V10b is the best perceptual model (LPIPS-trained); V9 ep60 is the best raw PSNR model. The slight PSNR regression vs V9 is expected — LPIPS gradients optimize for feature-space similarity, not pixel fidelity, and the visual sharpness gain is the goal.
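For reference, the PSNR figures above follow the standard definition, 10·log10(MAX²/MSE). A minimal sketch on [0, 1] images (the exact evaluation tonemap and averaging are in the repo, not shown here):

```python
import torch

def psnr(pred, gt, max_val=1.0):
    # standard peak signal-to-noise ratio in dB for images in [0, max_val]
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```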
## Limitations
- The transformer's self-attention is O(N²), so input is capped to N≤30k Gaussians per scene via a LightGaussian-style importance score. Higher fidelity needs an attention scheme that scales beyond this.
- Trained on isolated single objects (Objaverse_Splats). Multi-object scenes, large-scale captures, and unbounded backgrounds degrade significantly.
- The model is a proof of concept. It still trails standard rasterized 3DGS in output quality on the Tomatoes evaluation.
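The token-budget capping in the first limitation can be sketched as a top-k selection by a per-Gaussian significance score. The score below (opacity × scale volume) is a simplified proxy; the actual LightGaussian-style score used in the repo is more involved.

```python
import torch

def cap_gaussians(g, budget=30_000):
    # g: [N, 14] rows of [pos(3), scale(3), quat(4), rgb(3), opacity(1)]
    if g.shape[0] <= budget:
        return g
    scales, opacity = g[:, 3:6], g[:, 13]
    # proxy importance: opacity times ellipsoid volume (product of scales)
    score = opacity * scales.prod(dim=-1).abs()
    idx = score.topk(budget).indices
    return g[idx]  # keep the `budget` most significant Gaussians
```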
## Training data license
Trained on the Objaverse_Splats subset of Objaverse, whose terms restrict commercial use. The CC-BY-NC-4.0 license on this checkpoint inherits that restriction. For research/non-commercial use only.
## Citation
GaussianFormer is built on RenderFormer; if you use this checkpoint in academic work, please cite the original RenderFormer paper:
```bibtex
@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}
```
The GaussianFormer adaptation, training pipeline, and this checkpoint are documented at github.com/SVLwoof/gaussianformer.