GaussianFormer-V10b

GaussianFormer is a proof-of-concept neural renderer that takes a 3D Gaussian Splatting scene as input and synthesizes novel views without per-scene optimization at inference time. It adapts RenderFormer (SIGGRAPH 2025) — a transformer-based neural renderer designed for triangle meshes — by replacing its mesh input encoder with a Gaussian-native module that maps each Gaussian's 14-dimensional parameter vector to a scene token. The two-stage transformer (view-independent scene encoder + view-dependent ray decoder) is otherwise unchanged.
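
As a rough sketch of that input-side change, a per-Gaussian token projection could look like the snippet below; the class name, token dimension, and exact layer layout are illustrative assumptions, not the actual gaussianformer module.

import torch
import torch.nn as nn

class GaussianTokenProjection(nn.Module):
    # Illustrative sketch only: maps each Gaussian's 14-d parameter vector
    # [pos(3), scale(3), quat(4), rgb(3), opacity(1)] to one scene token.
    def __init__(self, token_dim: int = 768):  # token_dim is an assumption
        super().__init__()
        self.proj = nn.Linear(14, token_dim)

    def forward(self, gaussians: torch.Tensor) -> torch.Tensor:
        # gaussians: [B, N, 14] -> tokens: [B, N, token_dim]
        return self.proj(gaussians)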

This checkpoint is V10b epoch 26: an LPIPS-VGG perceptual fine-tune (loss weight 0.2) of V9 epoch 60.

Code & docs: github.com/SVLwoof/gaussianformer

Lineage

  • Trained from the RenderFormer architecture (microsoft/renderformer-v1.1-swin-large) with the mesh input encoder replaced by a learned linear projection over 14-d Gaussian parameters: [pos(3), scale(3), quat(4), rgb(3), opacity(1)].
  • V9 (pretrain). Single-object 3DGS scans from Objaverse_Splats (2,667 train / 183 val), pure log-HDR L1 loss, 60 epochs on 3 GPUs.
  • V10b (this checkpoint). Fine-tuned from V9 ep60 with a combined log-HDR L1 + LPIPS-VGG loss on tonemapped LDR output, 26 epochs on 4 GPUs (cosine LR 5e-5 → 5e-7); a sketch of the combined objective follows this list.
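
A minimal sketch of what that combined objective could look like, assuming the lpips package and a simple gamma tonemap; the exact tonemapping curve and loss reduction used in training are not documented here, so treat every detail below as illustrative.

import torch
import lpips

lpips_vgg = lpips.LPIPS(net="vgg")  # perceptual metric; expects [B, 3, H, W] in [-1, 1]

def combined_loss(pred_hdr, gt_hdr, lpips_weight=0.2):
    # pred_hdr, gt_hdr: [B, H, W, 3] linear HDR renders.
    # Log-HDR L1 term on the linear HDR prediction.
    l1 = torch.abs(torch.log1p(pred_hdr) - torch.log1p(gt_hdr)).mean()
    # LPIPS-VGG term on tonemapped LDR images (simple gamma curve as a stand-in).
    def tonemap(x):
        return x.clamp(0.0, 1.0) ** (1.0 / 2.2)
    pred_ldr = tonemap(pred_hdr).permute(0, 3, 1, 2) * 2 - 1
    gt_ldr = tonemap(gt_hdr).permute(0, 3, 1, 2) * 2 - 1
    perceptual = lpips_vgg(pred_ldr, gt_ldr).mean()
    return l1 + lpips_weight * perceptual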

How to use

import torch
import h5py, numpy as np
from gaussianformer.pipelines.rendering_pipeline import GaussianFormerRenderingPipeline

device = torch.device("cuda")
pipeline = GaussianFormerRenderingPipeline.from_pretrained("shahafvl/gaussianformer-v10b")
pipeline.to(device)

# Load an HDF5 scene with fields: means [N,3], scales [N,3], rotations [N,4]
# (w,x,y,z), colors [N,3], opacities [N,1], c2w [V,4,4], fov [V].
with h5py.File("scene.h5", "r") as f:
    g = torch.cat([
        torch.from_numpy(np.array(f["means"]).astype(np.float32)),
        torch.from_numpy(np.array(f["scales"]).astype(np.float32)),
        torch.from_numpy(np.array(f["rotations"]).astype(np.float32)),
        torch.from_numpy(np.array(f["colors"]).astype(np.float32)),
        torch.from_numpy(np.array(f["opacities"]).astype(np.float32)).reshape(-1, 1),
    ], dim=-1).unsqueeze(0).to(device)
    mask = torch.ones(g.shape[:2], dtype=torch.bool, device=device)
    c2w = torch.from_numpy(np.array(f["c2w"]).astype(np.float32)).unsqueeze(0).to(device)
    fov = torch.from_numpy(np.array(f["fov"]).astype(np.float32))[..., None].unsqueeze(0).to(device)

imgs = pipeline(gaussians=g, mask=mask, c2w=c2w, fov=fov,
                resolution=512, torch_dtype=torch.float16)
# imgs: [1, V, H, W, 3], linear HDR

For a complete CLI inference script (with tone mapping + EXR/PNG output) see infer_gaussian.py in the project repo.
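
Because the pipeline returns linear HDR, frames need tone mapping before being saved as PNG. A minimal sketch, assuming a simple gamma curve rather than the exact tonemap used by infer_gaussian.py:

import numpy as np
import imageio.v3 as iio

# imgs: [1, V, H, W, 3] linear HDR tensor from the pipeline
hdr = imgs[0].float().cpu().numpy()
for i, frame in enumerate(hdr):
    ldr = np.clip(frame, 0.0, 1.0) ** (1.0 / 2.2)  # simple gamma tonemap (assumption)
    iio.imwrite(f"view_{i:03d}.png", (ldr * 255).astype(np.uint8))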

Headline results

PSNR (dB, vs full-gsplat ground truth) on a real-world object scan (Tomatoes) across Gaussian token budgets N:

Model           N=5k    N=10k   N=20k   N=30k
V6 (multi-obj)  21.78   21.95   22.54   23.00
V9 ep60 (L1)    25.90   26.94   27.74   28.24
V10b ep26       25.80   26.70   27.41   27.81

V10b is the best perceptual model (LPIPS-trained); V9 ep60 is the best raw PSNR model. The slight PSNR regression vs V9 is expected — LPIPS gradients optimize for feature-space similarity, not pixel fidelity, and the visual sharpness gain is the goal.

Limitations

  • The transformer's self-attention is O(N²), so the input is capped at N ≤ 30k Gaussians per scene via a LightGaussian-style importance score (see the sketch after this list). Higher fidelity needs an attention scheme that scales beyond this cap.
  • Trained on isolated single objects (Objaverse_Splats). Multi-object scenes, large-scale captures, and unbounded backgrounds degrade significantly.
  • The model is a proof of concept. It still trails standard rasterized 3DGS in output quality on the Tomatoes evaluation.
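
A simplified sketch of importance-based pruning to stay under the token budget; the score used here (opacity weighted by Gaussian volume) is only a rough proxy for LightGaussian's full significance score, not the exact criterion used in this repo.

import torch

def prune_gaussians(g: torch.Tensor, max_n: int = 30_000) -> torch.Tensor:
    # g: [N, 14] with layout [pos(3), scale(3), quat(4), rgb(3), opacity(1)]
    if g.shape[0] <= max_n:
        return g
    scales, opacity = g[:, 3:6], g[:, 13]
    # Rough importance proxy: opacity times Gaussian volume.
    score = opacity * scales.prod(dim=-1).abs()
    keep = torch.topk(score, k=max_n).indices
    return g[keep]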

Training data license

Trained on the Objaverse_Splats subset of Objaverse, whose terms restrict commercial use. The CC-BY-NC-4.0 license on this checkpoint inherits that restriction. For research/non-commercial use only.

Citation

GaussianFormer is built on RenderFormer; if you use this checkpoint in academic work, please cite the original RenderFormer paper:

@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}

The GaussianFormer adaptation, training pipeline, and this checkpoint are documented at github.com/SVLwoof/gaussianformer.
