GaussianFormer-V10b

GaussianFormer is a proof-of-concept neural renderer that takes a 3D Gaussian Splatting scene as input and synthesizes novel views without per-scene optimization at inference time. It adapts RenderFormer (SIGGRAPH 2025) — a transformer-based neural renderer designed for triangle meshes — by replacing its mesh input encoder with a Gaussian-native module that maps each Gaussian's 14-dimensional parameter vector to a scene token. The two-stage transformer (view-independent scene encoder + view-dependent ray decoder) is otherwise unchanged.
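
As a rough sketch of that input-side change, a per-Gaussian token projection could look like the snippet below; the class name, token dimension, and exact layer layout are illustrative assumptions, not the actual gaussianformer module.

import torch
import torch.nn as nn

class GaussianTokenProjection(nn.Module):
    # Illustrative sketch only: maps each Gaussian's 14-d parameter vector
    # [pos(3), scale(3), quat(4), rgb(3), opacity(1)] to one scene token.
    def __init__(self, token_dim: int = 768):  # token_dim is an assumption
        super().__init__()
        self.proj = nn.Linear(14, token_dim)

    def forward(self, gaussians: torch.Tensor) -> torch.Tensor:
        # gaussians: [B, N, 14] -> tokens: [B, N, token_dim]
        return self.proj(gaussians)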

This checkpoint is V10b epoch 26: an LPIPS-VGG perceptual fine-tune (loss weight 0.2) of V9 epoch 60.

Code & docs: github.com/SVLwoof/gaussianformer

Lineage

  • Trained from the RenderFormer architecture (microsoft/renderformer-v1.1-swin-large) with the mesh input encoder replaced by a learned linear projection over 14-d Gaussian parameters: [pos(3), scale(3), quat(4), rgb(3), opacity(1)].
  • V9 (pretrain). Single-object 3DGS scans from Objaverse_Splats (2,667 train / 183 val), pure log-HDR L1 loss, 60 epochs on 3 GPUs.
  • V10b (this checkpoint). Fine-tuned from V9 ep60 with a combined log-HDR L1 + LPIPS-VGG loss on tonemapped LDR output, 26 epochs on 4 GPUs (cosine LR 5e-5 → 5e-7); a sketch of the combined objective follows this list.
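
A minimal sketch of what that combined objective could look like, assuming the lpips package and a simple gamma tonemap; the exact tonemapping curve and loss reduction used in training are not documented here, so treat every detail below as illustrative.

import torch
import lpips

lpips_vgg = lpips.LPIPS(net="vgg")  # perceptual metric; expects [B, 3, H, W] in [-1, 1]

def combined_loss(pred_hdr, gt_hdr, lpips_weight=0.2):
    # pred_hdr, gt_hdr: [B, H, W, 3] linear HDR renders.
    # Log-HDR L1 term on the linear HDR prediction.
    l1 = torch.abs(torch.log1p(pred_hdr) - torch.log1p(gt_hdr)).mean()
    # LPIPS-VGG term on tonemapped LDR images (simple gamma curve as a stand-in).
    def tonemap(x):
        return x.clamp(0.0, 1.0) ** (1.0 / 2.2)
    pred_ldr = tonemap(pred_hdr).permute(0, 3, 1, 2) * 2 - 1
    gt_ldr = tonemap(gt_hdr).permute(0, 3, 1, 2) * 2 - 1
    perceptual = lpips_vgg(pred_ldr, gt_ldr).mean()
    return l1 + lpips_weight * perceptual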

How to use

import torch
import h5py, numpy as np
from gaussianformer.pipelines.rendering_pipeline import GaussianFormerRenderingPipeline

device = torch.device("cuda")
pipeline = GaussianFormerRenderingPipeline.from_pretrained("shahafvl/gaussianformer-v10b")
pipeline.to(device)

# Load an HDF5 scene with fields: means [N,3], scales [N,3], rotations [N,4]
# (w,x,y,z), colors [N,3], opacities [N,1], c2w [V,4,4], fov [V].
with h5py.File("scene.h5", "r") as f:
    g = torch.cat([
        torch.from_numpy(np.array(f["means"]).astype(np.float32)),
        torch.from_numpy(np.array(f["scales"]).astype(np.float32)),
        torch.from_numpy(np.array(f["rotations"]).astype(np.float32)),
        torch.from_numpy(np.array(f["colors"]).astype(np.float32)),
        torch.from_numpy(np.array(f["opacities"]).astype(np.float32)).reshape(-1, 1),
    ], dim=-1).unsqueeze(0).to(device)
    mask = torch.ones(g.shape[:2], dtype=torch.bool, device=device)
    c2w = torch.from_numpy(np.array(f["c2w"]).astype(np.float32)).unsqueeze(0).to(device)
    fov = torch.from_numpy(np.array(f["fov"]).astype(np.float32))[..., None].unsqueeze(0).to(device)

imgs = pipeline(gaussians=g, mask=mask, c2w=c2w, fov=fov,
                resolution=512, torch_dtype=torch.float16)
# imgs: [1, V, H, W, 3], linear HDR

For a complete CLI inference script (with tone mapping + EXR/PNG output) see infer_gaussian.py in the project repo.
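
Because the pipeline returns linear HDR, frames need tone mapping before being saved as PNG. A minimal sketch, assuming a simple gamma curve rather than the exact tonemap used by infer_gaussian.py:

import numpy as np
import imageio.v3 as iio

# imgs: [1, V, H, W, 3] linear HDR tensor from the pipeline
hdr = imgs[0].float().cpu().numpy()
for i, frame in enumerate(hdr):
    ldr = np.clip(frame, 0.0, 1.0) ** (1.0 / 2.2)  # simple gamma tonemap (assumption)
    iio.imwrite(f"view_{i:03d}.png", (ldr * 255).astype(np.uint8))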

Headline results

PSNR (dB, vs full-gsplat ground truth) on a real-world object scan (Tomatoes) across Gaussian token budgets N:

Model           N=5k    N=10k   N=20k   N=30k
V6 (multi-obj)  21.78   21.95   22.54   23.00
V9 ep60 (L1)    25.90   26.94   27.74   28.24
V10b ep26       25.80   26.70   27.41   27.81

V10b is the best perceptual model (LPIPS-trained); V9 ep60 is the best raw PSNR model. The slight PSNR regression vs V9 is expected — LPIPS gradients optimize for feature-space similarity, not pixel fidelity, and the visual sharpness gain is the goal.

Limitations

  • The transformer's self-attention is O(N²), so the input is capped at N ≤ 30k Gaussians per scene via a LightGaussian-style importance score (see the sketch after this list). Higher fidelity needs an attention scheme that scales beyond this cap.
  • Trained on isolated single objects (Objaverse_Splats). Multi-object scenes, large-scale captures, and unbounded backgrounds degrade significantly.
  • The model is a proof of concept. It still trails standard rasterized 3DGS in output quality on the Tomatoes evaluation.
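
A simplified sketch of importance-based pruning to stay under the token budget; the score used here (opacity weighted by Gaussian volume) is only a rough proxy for LightGaussian's full significance score, not the exact criterion used in this repo.

import torch

def prune_gaussians(g: torch.Tensor, max_n: int = 30_000) -> torch.Tensor:
    # g: [N, 14] with layout [pos(3), scale(3), quat(4), rgb(3), opacity(1)]
    if g.shape[0] <= max_n:
        return g
    scales, opacity = g[:, 3:6], g[:, 13]
    # Rough importance proxy: opacity times Gaussian volume.
    score = opacity * scales.prod(dim=-1).abs()
    keep = torch.topk(score, k=max_n).indices
    return g[keep]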

Training data license

Trained on the Objaverse_Splats subset of Objaverse, whose terms restrict commercial use. The CC-BY-NC-4.0 license on this checkpoint inherits that restriction. For research/non-commercial use only.

Citation

GaussianFormer is built on RenderFormer; if you use this checkpoint in academic work, please cite the original RenderFormer paper:

@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}

The GaussianFormer adaptation, training pipeline, and this checkpoint are documented at github.com/SVLwoof/gaussianformer.
