Pair 3 Baseline: 7.5B Standard Transformer
What This Model Is (And Isn't)
This is NOT a general-purpose language model. This is the baseline control in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement (arXiv:2601.21766).
This model exists solely to provide an apples-to-apples comparison target. It was trained on 50B tokens with a 7.5B parameter model — massively overparameterized for the data budget (Chinchilla optimal would be ~150B tokens for this model size). As a result, the final checkpoint is catastrophically overfit (train loss 0.008, WikiText-2 PPL 2.95M). This is by design — both the baseline and its CoFrGeNet-F counterpart face the same data constraint, making the comparison fair.
Checkpoints
Two checkpoints are provided:
| Checkpoint | Step | Tokens Seen | Purpose |
|---|---|---|---|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | Best generalizing model (lowest val loss) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | Final checkpoint (for head-to-head comparison with CoFrGeNet-F) |
Why Two Checkpoints?
The best language model and the experiment's endpoint are different checkpoints:
- Step 10K saw only 10% of the data but has the lowest validation loss (2.94) and best benchmark scores. This is the point before overfitting erodes generalization. If you want to actually use this as an LLM, use this checkpoint.
- Step 95K completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.
Evaluation Results
| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|---|---|---|---|
| WikiText-2 PPL | 39.52 | 52.21 | 2,952,579 |
| WikiText-103 PPL | 39.52 | 52.21 | 2,952,579 |
| LAMBADA PPL | 51.48 | 76.88 | 5,240,843 |
| LAMBADA Acc | 15.89% | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |
Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding-window perplexity.
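The stride-512 sliding-window evaluation loop can be sketched as follows. This is an illustrative reconstruction, not the actual logic of `scripts/04_evaluate.py`; `score_span` is a stand-in for a model forward pass:

```python
import math

def sliding_window_ppl(score_span, n_tokens, max_len=1024, stride=512):
    """Sliding-window perplexity (sketch).

    score_span(begin, end, target_begin) must return the summed negative
    log-likelihood (in nats) of tokens [target_begin, end), conditioned on
    the tokens [begin, target_begin). In a real evaluation this is one
    model forward pass per window.
    """
    total_nll, total_targets, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        target_begin = prev_end  # score each token exactly once
        total_nll += score_span(begin, end, target_begin)
        total_targets += end - target_begin
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(total_nll / total_targets)

# Toy scorer assigning a flat 3.0 nats per token -> perplexity exp(3.0)
ppl = sliding_window_ppl(lambda b, e, t: 3.0 * (e - t), n_tokens=5000)
print(round(ppl, 2))  # 20.09
```

The stride controls the quality/speed trade-off: stride 512 with a 1,024-token window gives every scored token at least 512 tokens of left context while roughly doubling the number of forward passes relative to non-overlapping windows.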
Context: Pair 1 (450M) vs Pair 3 (7.5B)
| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|---|---|---|---|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| Pair 3 Baseline (step 10K) | 7.5B | 39.52 | 15.89% |
| Pair 3 Baseline (final) | 7.5B | 2,952,579 | 6.31% |
The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param vs 6.7 tok/param), while the 7.5B model at step 10K has only seen 5.2B tokens — not enough for a model this large to learn effectively. This illustrates the Chinchilla scaling law: a smaller model with adequate data beats a larger model with insufficient data.
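The ratios quoted above follow directly from the token and parameter budgets (using ~20 tokens per parameter as the Chinchilla-optimal rule of thumb, an assumption of this sketch):

```python
# tokens seen / parameter count for each checkpoint discussed above
ratios = {
    "Pair 1 baseline, final":    50e9 / 450e6,     # ~111 tok/param
    "Pair 3 baseline, final":    50e9 / 7.458e9,   # ~6.7 tok/param
    "Pair 3 baseline, step 10K": 5.2e9 / 7.458e9,  # ~0.7 tok/param
}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} tokens/param")

# Chinchilla rule of thumb: ~20 tokens per parameter
chinchilla_tokens = 20 * 7.458e9  # ~150B tokens for a 7.5B model
```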
Architecture
Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).
| Parameter | Value |
|---|---|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |
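The parameter total in the table can be reproduced from the config above, assuming bias-free linear layers and weight-only LayerNorms (a convention inferred here, not stated in the card):

```python
vocab, d, n_layer, d_ff, ctx = 50257, 4096, 36, 16384, 1024

tok_emb = vocab * d    # tied with lm_head, so counted once
pos_emb = ctx * d
attn    = 4 * d * d    # Q, K, V, and output projections
ffn     = 2 * d * d_ff # up and down projections
norms   = 2 * d        # two LayerNorm weight vectors per block

per_layer = attn + ffn + norms
total = tok_emb + pos_emb + n_layer * per_layer + d  # + final LayerNorm

print(f"{total:,}")  # 7,458,103,296 — matches the table under these assumptions
```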
Training Details
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu 50BT (educational web text) |
| Tokenizer | GPT-2 (tiktoken, gpt2 encoding) |
| Hardware | 8x NVIDIA B200 (179 GB each) |
| Parallelism | FSDP FULL_SHARD |
| Precision | bfloat16 |
| Optimizer | AdamW (fused), beta1=0.9, beta2=0.95 |
| Learning rate | 3e-4 peak, cosine decay to 0 |
| Warmup | 2,000 steps |
| Weight decay | 0.1 (2D weight tensors only) |
| Gradient clipping | 1.0 max norm |
| Batch size | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| Total steps | 95,367 (1 epoch over 50B tokens) |
| Throughput | ~132,800 tok/s |
| Wall time | ~5.5 days |
| torch.compile | Disabled (dtype mismatch crash at 7B+ scale) |
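Several table entries are mutually consistent, as a quick sanity check shows (sequences per update, and the step count for one epoch):

```python
tokens_per_update = 524_288
ctx_len = 1024
gpus, micro_batch = 8, 64

seqs_per_update = tokens_per_update // ctx_len
assert seqs_per_update == gpus * micro_batch  # 512 seqs -> no grad accumulation

total_tokens = 50_000_000_000
steps = total_tokens // tokens_per_update  # 95,367 steps for one epoch
print(steps)
```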
Validation Loss Trajectory
The model's best validation loss occurred early in training. After ~step 10K, val loss monotonically increases while train loss continues dropping — classic overfitting from an overparameterized model on limited data.
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| 10,000 | ~2.9 | 2.95 | 19.0 |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |
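The Val PPL column is simply exp(val loss) in nats; the rows with non-approximate PPL entries agree with that relation to within 1%:

```python
import math

# (val loss, val PPL) pairs from the trajectory table above
rows = [(3.01, 20.3), (2.94, 18.8), (2.95, 19.0), (3.05, 21.2),
        (3.33, 27.8), (7.13, 1251), (11.60, 109_013)]
for loss, ppl in rows:
    assert abs(math.exp(loss) - ppl) / ppl < 0.01
print("ppl == exp(loss) within 1% for all listed rows")
```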
The CoFrGeNet-F Experiment
This model is one half of Pair 3 in a series of experiments testing IBM Research's CoFrGeNet-F architecture (arXiv:2601.21766). CoFrGeNet-F replaces standard FFN layers with continued fraction networks — a mathematically rich function approximator that achieves the same expressiveness with fewer parameters.
Experiment Design
Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical hyperparameters. The only difference is the FFN layer.
| Baseline (this model) | CoFrGeNet-F | |
|---|---|---|
| Params | 7.5B | ~4.8B (35% fewer) |
| Architecture | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| Data | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| LR / Schedule | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| Batch size | 524,288 tokens | 524,288 tokens |
The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1B+ params). Pair 3 tests at 7.5x that scale. Results for the CoFrGeNet-F counterpart will be published at cahlen/pair3-cofrgenet-5b when training completes.
Prior Pairs
| Pair | Baseline | CoFrGeNet-F | Result |
|---|---|---|---|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| Pair 3 | 7.5B (this model) | 4.8B (training next) | Pending |
Usage
```python
import torch
from safetensors.torch import load_file

# Model definition lives in the project repo:
#   git clone https://github.com/cahlen/cofrgenet-f
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")
model.load_state_dict(state_dict, strict=False)  # strict=False for weight tying
model.eval()
```
Project Links
- GitHub: cahlen/cofrgenet-f
- HuggingFace (all models): cahlen/cofrgenet-f
- CoFrGeNet-F paper: arXiv:2601.21766
- Project Wiki: GitHub Wiki
License
MIT