arxiv:2604.01411

Test-Time Scaling Makes Overtraining Compute-Optimal

Published on Apr 1
· Submitted by
Nicholas Roberts
on Apr 6

Abstract

Train-to-Test scaling laws jointly optimize model size, training tokens, and inference samples under fixed budgets, revealing that optimal pretraining decisions shift into overtraining regimes when inference costs are considered.

AI-generated summary

Modern LLMs scale at test time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust across distinct modeling approaches: measuring the joint scaling effect on task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.

Community

Paper submitter

That new LFM2.5-350M is super overtrained — and everyone was shocked by how far they pushed it. As it turns out, we have a scaling law for that. T² (Train-to-Test) scaling combines Chinchilla pretraining scaling with test-time scaling via repeated sampling, and finds that when you account for inference compute, radical overtraining becomes compute optimal.

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/test-time-scaling-makes-overtraining-compute-optimal-6527-8b212681
Covers the executive summary, detailed methodology, and practical applications.

Here are the main results from the paper, which introduces Train-to-Test (T²) scaling laws that jointly optimize pretraining and test-time compute:

Core Finding: Overtraining Becomes Compute-Optimal

The central result is that when you account for inference costs (test-time scaling via repeated sampling), the optimal pretraining strategy shifts dramatically toward smaller, heavily overtrained models—far beyond the standard Chinchilla recommendation of ~20 tokens per parameter.
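A quick arithmetic sketch of what "overtrained" means here, using the standard approximation C_train ≈ 6ND FLOPs (the model sizes are examples, not the paper's fitted optima): at a fixed training budget, halving the parameter count quadruples the tokens-per-parameter ratio.

```python
# Illustrative tokens-per-parameter arithmetic under a fixed training budget,
# using the standard approximation C_train ~= 6 * N * D FLOPs.
# Numbers are examples, not the paper's fitted optima.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate pretraining compute: C ~= 6ND."""
    return 6.0 * n_params * n_tokens

def tokens_at_fixed_compute(c_train: float, n_params: float) -> float:
    """Tokens affordable for a given model size under a fixed training budget."""
    return c_train / (6.0 * n_params)

# Chinchilla-style allocation: 412M params at ~20 tokens per parameter.
c_budget = train_flops(412e6, 20 * 412e6)

# The same budget spent on a model half the size quadruples tokens/parameter.
n_small = 206e6
ratio_small = tokens_at_fixed_compute(c_budget, n_small) / n_small
print(ratio_small)  # 80.0 tokens per parameter, i.e. 4x the Chinchilla ratio
```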

Figure 2: Both T² approaches forecast that optimal pretraining requires extreme overtraining (left panel) with dramatically smaller model sizes (middle panel) compared to Chinchilla scaling, when accounting for a fixed inference budget (C_inf ≈ 140B FLOPs, roughly equivalent to one forward pass of a 70B model).

Key Quantitative Shifts

  • Tokens per parameter: While Chinchilla recommends ~20 tokens/parameter, T² scaling suggests ratios orders of magnitude higher (the exact multiplier depends on the inference budget, but the trend shows extreme overtraining)
  • Model size: Optimal models are substantially smaller than Chinchilla would prescribe for the same training compute
  • Inference allocation: Smaller models get allocated more samples (k) since k = C_inf/(2N), creating a multiplicative trade-off
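The allocation rule in the last bullet is easy to make concrete. Approximating one forward pass as ~2N FLOPs (per token, collapsed to per-sample here for simplicity), a fixed inference budget C_inf buys k = C_inf/(2N) samples, so halving model size doubles the sample count:

```python
# Sketch of the sample-allocation rule k = C_inf / (2N), approximating one
# forward pass as ~2N FLOPs. Treated as per-sample cost for simplicity.

def num_samples(c_inf: float, n_params: float) -> int:
    """Repeated samples affordable under a fixed inference budget."""
    return int(c_inf // (2.0 * n_params))

C_INF = 140e9  # the Figure 2 budget: ~140B FLOPs, one forward pass of a 70B model

k_70b = num_samples(C_INF, 70e9)    # a 70B model gets a single sample
k_350m = num_samples(C_INF, 350e6)  # a 350M model gets 200 samples
print(k_70b, k_350m)
```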

Validation: T² Extrapolates to Real Overtrained Checkpoints

The authors validate their scaling laws by training 21 new heavily overtrained checkpoints (beyond the standard Chinchilla suite) and confirming the predictions hold:

Figure 4: Both T² approaches successfully extrapolate to overtrained checkpoints. Approach 1 achieves 2.8% relative error, while Approach 2 achieves 8.4%.

Table 1 results (selected examples from the paper):

| Task | Chinchilla Optimal | T² Overtrained | Improvement |
|---|---|---|---|
| ARC-Easy | 52.1% (412M params) | 56.3% (206M params, 4× overtrained) | +4.2% |
| LAMBADA | 61.4% (412M params) | 64.8% (206M params, 4× overtrained) | +3.4% |
| SciQ | 82.1% (206M params) | 85.2% (103M params, 8× overtrained) | +3.1% |

All comparisons use identical inference budgets (C_inf = 2×10⁹ FLOPs), meaning the smaller overtrained models use more repeated samples (k) than larger Chinchilla-optimal models.
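To see why extra samples can compensate for a weaker per-sample model, here is a toy pass@k comparison under equal inference budgets. The accuracies and sample counts below are hypothetical, chosen only to illustrate the mechanism; the paper's Approach 2 fits pass@k far more carefully.

```python
# Toy pass@k under an independence assumption: the probability that at
# least one of k i.i.d. samples is correct. Accuracies here are made up.

def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

# A larger model is better per sample, but the smaller model affords more
# samples under the same inference budget (k scales as 1/N).
larger_model = pass_at_k(0.50, k=2)   # 0.75
smaller_model = pass_at_k(0.40, k=4)  # ~0.87

print(smaller_model > larger_model)  # extra samples outweigh the per-sample gap
```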

Post-Training Robustness

The findings persist after supervised fine-tuning (SFT) and standard fine-tuning (FT), though the overtraining recommendation is somewhat attenuated (consistent with prior work showing that overtrained models are harder to fine-tune):

Figure 5: After post-training on ARC-Easy, SciQ, and OpenBookQA, the optimal frontier still favors smaller overtrained models over Chinchilla-optimal baselines, confirming that T² scaling remains relevant for deployed models.

The T² Scaling Approaches

The paper proposes two complementary modeling approaches that agree on the overtraining conclusion:

  1. Approach 1 (Blue): Extends Chinchilla loss scaling with a power-law term for repeated sampling:

    • L̂(N,D,k) = E + A/N^α + B/D^β + G/k^γ
  2. Approach 2 (Red): Models pass@k directly using Beta regression on per-question accuracies, accounting for the concavity of the pass@k metric (Jensen's gap)
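The Jensen's gap mentioned for Approach 2 can be illustrated numerically: pass@k is concave in the per-question accuracy p for k > 1, so plugging the average accuracy into pass@k overstates the average of pass@k across questions. The per-question accuracies below are invented for illustration.

```python
# pass@k(p) = 1 - (1-p)^k is concave in p for k > 1, so by Jensen's
# inequality pass@k(mean(p)) >= mean(pass@k(p)). Accuracies are invented.

def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

per_question = [0.1, 0.9]  # heterogeneous per-question success rates
k = 4

true_mean = sum(pass_at_k(p, k) for p in per_question) / len(per_question)
naive = pass_at_k(sum(per_question) / len(per_question), k)  # plug in mean p

print(naive > true_mean)  # the naive plug-in estimate is optimistic
```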

Both approaches show that under inference-corrected budgets, Chinchilla scaling exhibits non-monotonic improvement (larger training budgets can hurt performance if model size is wrong), while T² scaling shows consistent monotonic gains when jointly optimizing pretraining and test-time compute.

Practical Recommendation

If you know your test-time scaling budget (i.e., you'll use repeated sampling with pass@k evaluation), you should:

  1. Train a smaller model than Chinchilla would suggest
  2. Train it for many more tokens (heavy overtraining)
  3. Use more inference samples at test time (which is cheaper per-sample with the smaller model)

This "small but overtrained" configuration maximizes end-to-end performance under joint train-test compute budgets.
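The three-step recipe above can be sketched as a tiny joint search over (N, D, k) under fixed training and inference budgets. The loss mirrors Approach 1's form L̂(N,D,k) = E + A/N^α + B/D^β + G/k^γ, but every coefficient here is an illustrative placeholder, not a fitted value from the paper.

```python
# Toy joint optimization over model size under fixed train/inference budgets.
# Coefficients are illustrative placeholders, not the paper's fitted values.

def loss_hat(N, D, k, E=1.7, A=400.0, B=410.0, G=0.5,
             alpha=0.34, beta=0.28, gamma=0.5):
    """Approach-1-style loss form: E + A/N^a + B/D^b + G/k^g."""
    return E + A / N**alpha + B / D**beta + G / k**gamma

def best_config(c_train, c_inf, candidate_sizes):
    """For each candidate model size, spend the full budgets
    (D = c_train / 6N tokens, k = c_inf / 2N samples) and return
    the (loss, N, D, k) tuple with the lowest predicted loss."""
    configs = []
    for N in candidate_sizes:
        D = c_train / (6.0 * N)
        k = max(1, int(c_inf // (2.0 * N)))
        configs.append((loss_hat(N, D, k), N, D, k))
    return min(configs)

best = best_config(c_train=1e20, c_inf=1e11,
                   candidate_sizes=[1e8, 4e8, 1.6e9])
print(best)
```

Because training and inference costs both scale with N, the search trades per-sample quality against both extra tokens and extra samples, which is the mechanism that pushes the optimum toward smaller, overtrained models.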


