Pair 3 Baseline: 7.5B Standard Transformer

What This Model Is (And Isn't)

This is NOT a general-purpose language model. This is the baseline control in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement (arXiv:2601.21766).

This model exists solely to provide an apples-to-apples comparison target. It was trained on 50B tokens with a 7.5B parameter model — massively overparameterized for the data budget (Chinchilla optimal would be ~150B tokens for this model size). As a result, the final checkpoint is catastrophically overfit (train loss 0.008, WikiText-2 PPL 2.95M). This is by design — both the baseline and its CoFrGeNet-F counterpart face the same data constraint, making the comparison fair.

Checkpoints

Two checkpoints are provided:

| Checkpoint | Step | Tokens Seen | Purpose |
|---|---|---|---|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | Best generalizing model (lowest val loss) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | Final checkpoint (for head-to-head comparison with CoFrGeNet-F) |

Why Two Checkpoints?

The best language model and the best experiment endpoint are different checkpoints:

  • Step 10K saw only ~10% of the data but has essentially the lowest validation loss of the run (2.95) and the best benchmark scores. This is the point just before overfitting begins to erode generalization. If you want to actually use this as an LLM, use this checkpoint.
  • Step 95K completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.

Evaluation Results

| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|---|---|---|---|
| WikiText-2 PPL | 39.52 | 52.21 | 2,952,579 |
| WikiText-103 PPL | 39.52 | 52.21 | 2,952,579 |
| LAMBADA PPL | 51.48 | 76.88 | 5,240,843 |
| LAMBADA Acc | 15.89% | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |

Evaluated with scripts/04_evaluate.py on a single NVIDIA B200 GPU using stride-512 sliding window perplexity.
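The stride-512 sliding-window scheme can be sketched as pure index bookkeeping (an illustration of the windowing logic, not the actual `scripts/04_evaluate.py`): each window sees up to 1,024 tokens of context, but only the tokens beyond the overlap with the previous window contribute to the loss, so every token is scored exactly once.

```python
# Illustrative sketch of stride-512 sliding-window perplexity indexing
# (assumed logic, not the project's evaluation script).

def sliding_windows(n_tokens, context=1024, stride=512):
    """Yield (start, end, n_scored) spans over a token sequence."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        n_scored = end - prev_end  # only newly covered tokens are scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(2000)
# Every token is scored exactly once across windows:
assert sum(s[2] for s in spans) == 2000
```

Each window after the first re-reads 512 tokens as context but scores only the new ones, which is what makes stride-512 PPL lower (and slower) than naive non-overlapping chunking.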

Context: Pair 1 (450M) vs Pair 3 (7.5B)

| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|---|---|---|---|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| Pair 3 Baseline (step 10K) | 7.5B | 39.52 | 15.89% |
| Pair 3 Baseline (final) | 7.5B | 2,952,579 | 6.31% |

The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param vs 6.7 tok/param), while the 7.5B model at step 10K has only seen 5.2B tokens — not enough for a model this large to learn effectively. This illustrates the Chinchilla scaling law: a smaller model with adequate data beats a larger model with insufficient data.
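The tokens-per-parameter arithmetic behind this comparison is easy to check (figures taken from the tables above; "Chinchilla optimal" here is the common ~20 tokens-per-parameter rule of thumb):

```python
# Data-budget arithmetic for the two baselines (all inputs from the
# model card; ~20 tok/param stands in for "Chinchilla optimal").

params_450m = 450e6
params_7b5 = 7.458e9
tokens_full = 50e9
tokens_at_10k = 5.2e9

print(tokens_full / params_450m)   # ~111 tok/param: a healthy ratio for 450M
print(tokens_full / params_7b5)    # ~6.7 tok/param: far below optimal
print(tokens_at_10k / params_7b5)  # ~0.7 tok/param at the step-10K checkpoint
print(20 * params_7b5 / 1e9)       # ~149B tokens: the "~150B" Chinchilla optimum
```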

Architecture

Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).

| Parameter | Value |
|---|---|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |
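The parameter count follows from the table. A back-of-envelope check, assuming a standard GPT-2 layout (tied embeddings, fused QKV with biases, LayerNorm gain and bias, learned positional embeddings); the exact bias/positional bookkeeping accounts for the small residual versus the reported total:

```python
# Back-of-envelope parameter count for the architecture table above.
# The layout details (biases, positional embeddings) are assumptions;
# they explain the ~0.02% residual vs the reported 7,458,103,296.

n_layer, d, n_ctx, vocab = 36, 4096, 1024, 50_257
ffn = 4 * d  # 16,384

per_layer = (
    3 * d * d + 3 * d   # fused QKV projection (weight + bias)
    + d * d + d         # attention output projection
    + d * ffn + ffn     # FFN up-projection
    + ffn * d + d       # FFN down-projection
    + 4 * d             # two LayerNorms (gain + bias each)
)
total = (
    vocab * d           # token embedding (lm_head is tied, so counted once)
    + n_ctx * d         # learned positional embedding
    + n_layer * per_layer
    + 2 * d             # final LayerNorm
)
print(f"{total:,}")     # within ~0.02% of the reported total
```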

Training Details

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu 50BT (educational web text) |
| Tokenizer | GPT-2 (tiktoken, gpt2 encoding) |
| Hardware | 8x NVIDIA B200 (179 GB each) |
| Parallelism | FSDP FULL_SHARD |
| Precision | bfloat16 |
| Optimizer | AdamW (fused), beta1=0.9, beta2=0.95 |
| Learning rate | 3e-4 peak, cosine decay to 0 |
| Warmup | 2,000 steps |
| Weight decay | 0.1 (2D weight tensors only) |
| Gradient clipping | 1.0 max norm |
| Batch size | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| Total steps | 95,367 (1 epoch over 50B tokens) |
| Throughput | ~132,800 tok/s |
| Wall time | ~5.5 days |
| torch.compile | Disabled (dtype mismatch crash at 7B+ scale) |
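The stated schedule (linear warmup for 2,000 steps to a 3e-4 peak, then cosine decay to 0 at step 95,367) can be sketched as follows; the training script's exact implementation may differ:

```python
# Minimal sketch of the warmup + cosine-to-zero schedule from the
# table above (assumed form, not the project's training code).
import math

PEAK_LR, WARMUP, TOTAL = 3e-4, 2_000, 95_367

def lr_at(step):
    if step < WARMUP:
        return PEAK_LR * step / WARMUP            # linear warmup
    progress = (step - WARMUP) / (TOTAL - WARMUP) # 0 at peak, 1 at end
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

print(lr_at(2_000))   # peak: 3e-4
print(lr_at(95_367))  # end of run: 0
```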

Validation Loss Trajectory

The model's best validation loss occurred early in training. After ~step 10K, val loss monotonically increases while train loss continues dropping — classic overfitting from an overparameterized model on limited data.

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| 10,000 | ~2.9 | 2.95 | 19.0 |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |
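The Val PPL column is just the exponential of the val loss: perplexity is exp(cross-entropy loss in nats). Small differences versus the table come from rounding of the reported losses.

```python
# Perplexity = exp(loss in nats); reproduces the Val PPL column
# above up to rounding of the reported losses.
import math

for loss in (2.94, 3.01, 3.33, 7.13):
    print(f"loss {loss} -> PPL {math.exp(loss):.1f}")
```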

The CoFrGeNet-F Experiment

This model is one half of Pair 3 in a series of experiments testing IBM Research's CoFrGeNet-F architecture (arXiv:2601.21766). CoFrGeNet-F replaces the standard FFN layers with continued fraction networks, a function-approximator family that aims to match an MLP's expressiveness with fewer parameters.

Experiment Design

Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical hyperparameters. The only difference is the FFN layer.

| | Baseline (this model) | CoFrGeNet-F |
|---|---|---|
| Params | 7.5B | ~4.8B (35% fewer) |
| Architecture | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| Data | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| LR / Schedule | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| Batch size | 524,288 tokens | 524,288 tokens |
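To give a feel for the "Cffn (L=3, d=5)" entry: a continued fraction of depth L has the scalar form f(x) = a0(x) + b1(x) / (a1(x) + b2(x) / (a2(x) + ...)). The toy below is purely illustrative (it is NOT the paper's Cffn implementation, and all coefficients are made up); it just evaluates a depth-3 continued fraction with affine terms.

```python
# Toy scalar continued fraction with affine terms a_i(x) = w*x + c.
# Illustrative only: NOT the CoFrGeNet-F Cffn layer; coefficients
# below are invented for the example.

def continued_fraction(x, a, b, eps=1e-6):
    """Evaluate a depth-L continued fraction.

    a: list of L+1 (w, c) pairs for the a_i terms
    b: list of L   (w, c) pairs for the b_i terms
    eps guards against division by zero.
    """
    assert len(a) == len(b) + 1
    # Evaluate from the innermost term outward.
    value = a[-1][0] * x + a[-1][1]
    for (wa, ca), (wb, cb) in zip(reversed(a[:-1]), reversed(b)):
        value = (wa * x + ca) + (wb * x + cb) / (value + eps)
    return value

# Depth L=3: four a-terms, three b-terms (made-up coefficients).
a = [(0.5, 0.1), (1.0, 0.2), (-0.3, 1.5), (0.7, 0.4)]
b = [(0.2, 1.0), (0.9, -0.1), (0.4, 0.3)]
print(continued_fraction(1.0, a, b))
```

In the actual architecture such terms are vector-valued and learned; the point here is only the nested-division structure that replaces the usual two-matrix FFN.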

The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1B+ params). Pair 3 tests at 7.5x that scale. Results for the CoFrGeNet-F counterpart will be published at cahlen/pair3-cofrgenet-5b when training completes.

Prior Pairs

| Pair | Baseline | CoFrGeNet-F | Result |
|---|---|---|---|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| Pair 3 | 7.5B (this model) | 4.8B (training next) | Pending |

Usage

```python
from safetensors.torch import load_file
import torch

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")

# You'll need the model definition from the project repo:
#   git clone https://github.com/cahlen/cofrgenet-f
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)
model.load_state_dict(state_dict, strict=False)  # strict=False: lm_head is tied to tok_emb
model.eval()
```


License

MIT
