Pair 3 Baseline: 7.5B Standard Transformer
What This Model Is (And Isn't)
This is NOT a general-purpose language model. This is the baseline control in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement (arXiv:2601.21766).
This model exists solely to provide an apples-to-apples comparison target. It was trained on 50B tokens with a 7.5B parameter model — massively overparameterized for the data budget (Chinchilla optimal would be ~150B tokens for this model size). As a result, the final checkpoint is catastrophically overfit (train loss 0.008, WikiText-2 PPL 2.95M). This is by design — both the baseline and its CoFrGeNet-F counterpart face the same data constraint, making the comparison fair.
Checkpoints
Two checkpoints are provided:
| Checkpoint | Step | Tokens Seen | Purpose |
|---|---|---|---|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | Best generalizing model (lowest val loss) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | Final checkpoint (for head-to-head comparison with CoFrGeNet-F) |
Why Two Checkpoints?
The best language model and the experiment's endpoint are different checkpoints:
- Step 10K saw only 10% of the data but has the lowest validation loss (2.94) and best benchmark scores. This is the point before overfitting erodes generalization. If you want to actually use this as an LLM, use this checkpoint.
- Step 95K completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.
Evaluation Results
| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|---|---|---|---|
| WikiText-2 PPL | 39.52 | 52.21 | 2,952,579 |
| WikiText-103 PPL | 39.52 | 52.21 | 2,952,579 |
| LAMBADA PPL | 51.48 | 76.88 | 5,240,843 |
| LAMBADA Acc | 15.89% | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |
Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding-window perplexity.
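The stride-512 sliding-window evaluation loop can be sketched as follows. This is an illustrative reconstruction, not the actual logic of `scripts/04_evaluate.py`; `score_span` is a stand-in for a model forward pass:

```python
import math

def sliding_window_ppl(score_span, n_tokens, max_len=1024, stride=512):
    """Sliding-window perplexity (sketch).

    score_span(begin, end, target_begin) must return the summed negative
    log-likelihood (in nats) of tokens [target_begin, end), conditioned on
    the tokens [begin, target_begin). In a real evaluation this is one
    model forward pass per window.
    """
    total_nll, total_targets, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        target_begin = prev_end  # score each token exactly once
        total_nll += score_span(begin, end, target_begin)
        total_targets += end - target_begin
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(total_nll / total_targets)

# Toy scorer assigning a flat 3.0 nats per token -> perplexity exp(3.0)
ppl = sliding_window_ppl(lambda b, e, t: 3.0 * (e - t), n_tokens=5000)
print(round(ppl, 2))  # 20.09
```

The stride controls the quality/speed trade-off: stride 512 with a 1,024-token window gives every scored token at least 512 tokens of left context while roughly doubling the number of forward passes relative to non-overlapping windows.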
Context: Pair 1 (450M) vs Pair 3 (7.5B)
| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|---|---|---|---|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| Pair 3 Baseline (step 10K) | 7.5B | 39.52 | 15.89% |
| Pair 3 Baseline (final) | 7.5B | 2,952,579 | 6.31% |
The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param vs 6.7 tok/param), while the 7.5B model at step 10K has only seen 5.2B tokens — not enough for a model this large to learn effectively. This illustrates the Chinchilla scaling law: a smaller model with adequate data beats a larger model with insufficient data.
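The ratios quoted above follow directly from the token and parameter budgets (using ~20 tokens per parameter as the Chinchilla-optimal rule of thumb, an assumption of this sketch):

```python
# tokens seen / parameter count for each checkpoint discussed above
ratios = {
    "Pair 1 baseline, final":    50e9 / 450e6,     # ~111 tok/param
    "Pair 3 baseline, final":    50e9 / 7.458e9,   # ~6.7 tok/param
    "Pair 3 baseline, step 10K": 5.2e9 / 7.458e9,  # ~0.7 tok/param
}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} tokens/param")

# Chinchilla rule of thumb: ~20 tokens per parameter
chinchilla_tokens = 20 * 7.458e9  # ~150B tokens for a 7.5B model
```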
Architecture
Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).
| Parameter | Value |
|---|---|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |
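The parameter total in the table can be reproduced from the config above, assuming bias-free linear layers and weight-only LayerNorms (a convention inferred here, not stated in the card):

```python
vocab, d, n_layer, d_ff, ctx = 50257, 4096, 36, 16384, 1024

tok_emb = vocab * d    # tied with lm_head, so counted once
pos_emb = ctx * d
attn    = 4 * d * d    # Q, K, V, and output projections
ffn     = 2 * d * d_ff # up and down projections
norms   = 2 * d        # two LayerNorm weight vectors per block

per_layer = attn + ffn + norms
total = tok_emb + pos_emb + n_layer * per_layer + d  # + final LayerNorm

print(f"{total:,}")  # 7,458,103,296 — matches the table under these assumptions
```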
Training Details
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu 50BT (educational web text) |
| Tokenizer | GPT-2 (tiktoken, gpt2 encoding) |
| Hardware | 8x NVIDIA B200 (179 GB each) |
| Parallelism | FSDP FULL_SHARD |
| Precision | bfloat16 |
| Optimizer | AdamW (fused), beta1=0.9, beta2=0.95 |
| Learning rate | 3e-4 peak, cosine decay to 0 |
| Warmup | 2,000 steps |
| Weight decay | 0.1 (2D weight tensors only) |
| Gradient clipping | 1.0 max norm |
| Batch size | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| Total steps | 95,367 (1 epoch over 50B tokens) |
| Throughput | ~132,800 tok/s |
| Wall time | ~5.5 days |
| torch.compile | Disabled (dtype mismatch crash at 7B+ scale) |
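Several table entries are mutually consistent, as a quick sanity check shows (sequences per update, and the step count for one epoch):

```python
tokens_per_update = 524_288
ctx_len = 1024
gpus, micro_batch = 8, 64

seqs_per_update = tokens_per_update // ctx_len
assert seqs_per_update == gpus * micro_batch  # 512 seqs -> no grad accumulation

total_tokens = 50_000_000_000
steps = total_tokens // tokens_per_update  # 95,367 steps for one epoch
print(steps)
```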
Validation Loss Trajectory
The model's best validation loss occurred early in training. After ~step 10K, val loss monotonically increases while train loss continues dropping — classic overfitting from an overparameterized model on limited data.
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| 10,000 | ~2.9 | 2.95 | 19.0 |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |
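The Val PPL column is simply exp(val loss) in nats; the rows with non-approximate PPL entries agree with that relation to within 1%:

```python
import math

# (val loss, val PPL) pairs from the trajectory table above
rows = [(3.01, 20.3), (2.94, 18.8), (2.95, 19.0), (3.05, 21.2),
        (3.33, 27.8), (7.13, 1251), (11.60, 109_013)]
for loss, ppl in rows:
    assert abs(math.exp(loss) - ppl) / ppl < 0.01
print("ppl == exp(loss) within 1% for all listed rows")
```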
The CoFrGeNet-F Experiment
This model is one half of Pair 3 in a series of experiments testing IBM Research's CoFrGeNet-F architecture (arXiv:2601.21766). CoFrGeNet-F replaces standard FFN layers with continued fraction networks — a mathematically rich function approximator that achieves the same expressiveness with fewer parameters.
Experiment Design
Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical hyperparameters. The only difference is the FFN layer.
| Baseline (this model) | CoFrGeNet-F | |
|---|---|---|
| Params | 7.5B | ~4.8B (35% fewer) |
| Architecture | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| Data | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| LR / Schedule | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| Batch size | 524,288 tokens | 524,288 tokens |
The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1B+ params). Pair 3 tests at 7.5x that scale. Results for the CoFrGeNet-F counterpart will be published at cahlen/pair3-cofrgenet-5b when training completes.
Prior Pairs
| Pair | Baseline | CoFrGeNet-F | Result |
|---|---|---|---|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| Pair 3 | 7.5B (this model) | 4.8B (training next) | Pending |
Usage
```python
import torch
from safetensors.torch import load_file

# Model definition lives in the project repo:
#   git clone https://github.com/cahlen/cofrgenet-f
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")
model.load_state_dict(state_dict, strict=False)  # strict=False for weight tying
model.eval()
```
Project Links
- GitHub: cahlen/cofrgenet-f
- HuggingFace (all models): cahlen/cofrgenet-f
- CoFrGeNet-F paper: arXiv:2601.21766
- Project Wiki: GitHub Wiki
License
MIT