ChessTransformer200M
A 204M-parameter chess-native transformer trained on Stockfish-labeled positions.
Architecture
- Encoder: FusedBoardEncoder (256d) — learned piece-color + square + context embeddings
- Backbone: 16-layer Transformer (1024d, 16 heads, FFN 4096, GELU, norm_first)
- Policy Head: SpatialPolicyHead (from×to square features, 512d)
- Value Head: WDL (win/draw/loss) classification
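The ~204M total is consistent with a back-of-envelope count for the backbone alone. A minimal sketch, assuming a standard Transformer encoder layer with biases (embedding and head parameters are not broken out here and account for the remainder):

```python
# Back-of-envelope parameter count for the 16-layer backbone
# (d_model=1024, d_ffn=4096), assuming standard layers with biases.
d, ffn, layers = 1024, 4096, 16

attn = 4 * d * d + 4 * d   # Q, K, V, output projections (+ biases)
mlp = 2 * d * ffn + ffn + d  # two FFN linears (+ biases)
norms = 2 * 2 * d            # two LayerNorms (weight + bias each)

backbone = layers * (attn + mlp + norms)
print(f"{backbone / 1e6:.1f}M backbone params")  # ~201.5M; embeddings
                                                 # and heads bring the
                                                 # total to ~204M
```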
Training
- Dataset: avewright/chess-positions-lichess-sf (10.2M positions seen out of 48M available)
- Steps: 10,000 optimizer steps (effective batch 1024)
- Final Policy Loss: ~2.5 (estimated from loss curve)
- Top-1 Accuracy: 18.4% (on 5K eval positions vs Stockfish best moves)
- GPU: NVIDIA A40 46GB, FP16 + torch.compile
- Training time: ~6 hours to step 10,000
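As a sanity check, the reported figures are mutually consistent (simple arithmetic from the numbers above, no other assumptions):

```python
# 10,000 optimizer steps at an effective batch of 1024
# over the 48M-position dataset.
steps, eff_batch, dataset = 10_000, 1024, 48_000_000

seen = steps * eff_batch
print(f"{seen / 1e6:.2f}M positions seen")          # 10.24M, i.e. the 10.2M above
print(f"{100 * seen / dataset:.0f}% of one epoch")  # ~21%, as noted in Known Issues
```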
Usage
```python
import torch
import chess

from play import load_model, get_model_move

model = load_model("best_model.pt", torch.device("cpu"))
board = chess.Board()
move, info = get_model_move(model, board, torch.device("cpu"))
print(f"Best move: {move.uci()}, Top 5: {info['top_moves']}")
```
Files
- best_model.pt — Model weights only (816 MB)
- training_log.json — Loss curve data
- config.json — Architecture config
Known Issues
- Training hit FP16 NaN at step ~13,800. Best checkpoint is step 10,000.
- Model has seen only ~21% of one epoch over the 48M-position dataset.
- Opens with 1.d4 as White. Plays reasonable chess but still early in training.
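The FP16 NaN at step ~13,800 is a common mixed-precision failure mode. One standard mitigation, sketched minimally below (this is not the repo's actual training code), is to skip the optimizer update whenever the loss is non-finite, as AMP loss scalers do on overflow:

```python
import math

def guarded_updates(losses):
    """Apply updates only for finite losses; return (applied, skipped) counts."""
    applied = skipped = 0
    for loss in losses:
        if math.isfinite(loss):
            applied += 1  # optimizer.step() would run here
        else:
            skipped += 1  # skip the step; optionally reduce the loss scale
    return applied, skipped

print(guarded_updates([2.7, 2.5, float("nan"), 2.4]))  # → (3, 1)
```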