ChessTransformer200M

A 204M-parameter chess-native transformer trained on Stockfish-labeled positions.

Architecture

  • Encoder: FusedBoardEncoder (256d) — learned piece-color + square + context embeddings
  • Backbone: 16-layer Transformer (1024d, 16 heads, FFN 4096, GELU, norm_first)
  • Policy Head: SpatialPolicyHead (from×to square features, 512d)
  • Value Head: WDL (win/draw/loss) classification
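As a sanity check, the backbone dimensions above account for nearly all of the 204M parameters. This is a back-of-the-envelope count only, assuming standard transformer layers with biases and two LayerNorms per layer; it excludes the encoder embeddings and the two heads, which make up the remaining few million parameters.

```python
# Rough parameter count for the 16-layer backbone (1024d, FFN 4096).
# Dimensions are from this card; the exact layer composition is an assumption.
D_MODEL, N_LAYERS, D_FFN = 1024, 16, 4096

attn = 4 * (D_MODEL * D_MODEL + D_MODEL)                  # Q, K, V, output projections
ffn = (D_MODEL * D_FFN + D_FFN) + (D_FFN * D_MODEL + D_MODEL)
norms = 2 * 2 * D_MODEL                                    # two LayerNorms (scale + bias)
per_layer = attn + ffn + norms
backbone = N_LAYERS * per_layer

print(f"per layer: {per_layer:,}")   # 12,596,224
print(f"backbone:  {backbone:,}")    # 201,539,584 (~201.5M)
```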

Training

  • Dataset: avewright/chess-positions-lichess-sf (10.2M positions seen out of 48M available)
  • Steps: 10,000 optimizer steps (effective batch 1024)
  • Final Policy Loss: ~2.5 (estimated from loss curve)
  • Top-1 Accuracy: 18.4% (on 5K eval positions vs Stockfish best moves)
  • GPU: NVIDIA A40 46GB, FP16 + torch.compile
  • Training time: ~6 hours to step 10,000
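The numbers above are mutually consistent: 10,000 steps at an effective batch of 1024 works out to the 10.2M positions quoted, about a fifth of one epoch over the 48M available.

```python
# Sanity check of the training figures quoted above.
steps, batch = 10_000, 1024
positions_seen = steps * batch                 # 10,240,000 ~ the 10.2M quoted
epoch_fraction = positions_seen / 48_000_000   # fraction of one epoch covered
print(f"{positions_seen:,} positions, {epoch_fraction:.1%} of one epoch")
# → 10,240,000 positions, 21.3% of one epoch
```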

Usage

import torch
import chess
from play import load_model, get_model_move

model = load_model("best_model.pt", torch.device("cpu"))
board = chess.Board()
move, info = get_model_move(model, board, torch.device("cpu"))
print(f"Best move: {move.uci()}, Top 5: {info['top_moves']}")
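A common way to collapse the WDL head's three-way classification into a single scalar evaluation is the expected score E = P(win) + 0.5 · P(draw). The sketch below assumes three raw logits ordered (win, draw, loss) from the side to move; the logit values are illustrative, not model output.

```python
import math

def wdl_to_score(logits):
    """Softmax three WDL logits, then return the expected score in [0, 1]."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    total = sum(exps)
    w, d, l = (e / total for e in exps)
    return w + 0.5 * d                          # expected score for side to move

print(wdl_to_score([1.2, 0.3, -0.8]))           # clearly better than 0.5 (winning)
print(wdl_to_score([0.0, 0.0, 0.0]))            # uniform WDL → exactly 0.5
```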

Files

  • best_model.pt — Model weights only (816 MB)
  • training_log.json — Loss curve data
  • config.json — Architecture config

Known Issues

  • Training hit FP16 NaN at step ~13,800. Best checkpoint is step 10,000.
  • Model is only ~21% through one epoch of the 48M-position dataset.
  • Opens with 1.d4 as White; plays reasonable chess but is still early in training.
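One guard against the FP16 NaN failure above is to skip any optimizer step whose loss is non-finite, so NaNs never propagate into the weights (torch's GradScaler applies the same idea to non-finite gradients). The loop below is a minimal pure-Python sketch of that check, not the project's actual training code.

```python
import math

# Simulated per-batch losses; the NaN stands in for the FP16 overflow at step ~13,800.
losses = [2.7, 2.5, float("nan"), 2.4]

applied = 0
for loss in losses:
    if math.isfinite(loss):
        applied += 1   # optimizer.step() would go here
    # else: drop the batch; weights stay at the last good values

print(applied)  # → 3 (the NaN batch is skipped)
```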