ChessTransformer200M

A 204M-parameter chess-native transformer trained on Stockfish-labeled positions.

Architecture

  • Encoder: FusedBoardEncoder (256d) — learned piece-color + square + context embeddings
  • Backbone: 16-layer Transformer (1024d, 16 heads, FFN 4096, GELU, norm_first)
  • Policy Head: SpatialPolicyHead (from×to square features, 512d)
  • Value Head: WDL (win/draw/loss) classification
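As a sanity check, the backbone dimensions above account for nearly all of the 204M parameters. This is a back-of-the-envelope count only, assuming standard transformer layers with biases and two LayerNorms per layer; it excludes the encoder embeddings and the two heads, which make up the remaining few million parameters.

```python
# Rough parameter count for the 16-layer backbone (1024d, FFN 4096).
# Dimensions are from this card; the exact layer composition is an assumption.
D_MODEL, N_LAYERS, D_FFN = 1024, 16, 4096

attn = 4 * (D_MODEL * D_MODEL + D_MODEL)                  # Q, K, V, output projections
ffn = (D_MODEL * D_FFN + D_FFN) + (D_FFN * D_MODEL + D_MODEL)
norms = 2 * 2 * D_MODEL                                    # two LayerNorms (scale + bias)
per_layer = attn + ffn + norms
backbone = N_LAYERS * per_layer

print(f"per layer: {per_layer:,}")   # 12,596,224
print(f"backbone:  {backbone:,}")    # 201,539,584 (~201.5M)
```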

Training

  • Dataset: avewright/chess-positions-lichess-sf (10.2M positions seen out of 48M available)
  • Steps: 10,000 optimizer steps (effective batch 1024)
  • Final Policy Loss: ~2.5 (estimated from loss curve)
  • Top-1 Accuracy: 18.4% (on 5K eval positions vs Stockfish best moves)
  • GPU: NVIDIA A40 46GB, FP16 + torch.compile
  • Training time: ~6 hours to step 10,000
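The numbers above are mutually consistent: 10,000 steps at an effective batch of 1024 works out to the 10.2M positions quoted, about a fifth of one epoch over the 48M available.

```python
# Sanity check of the training figures quoted above.
steps, batch = 10_000, 1024
positions_seen = steps * batch                 # 10,240,000 ~ the 10.2M quoted
epoch_fraction = positions_seen / 48_000_000   # fraction of one epoch covered
print(f"{positions_seen:,} positions, {epoch_fraction:.1%} of one epoch")
# → 10,240,000 positions, 21.3% of one epoch
```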

Usage

import torch
import chess
from play import load_model, get_model_move

model = load_model("best_model.pt", torch.device("cpu"))
board = chess.Board()
move, info = get_model_move(model, board, torch.device("cpu"))
print(f"Best move: {move.uci()}, Top 5: {info['top_moves']}")
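A common way to collapse the WDL head's three-way classification into a single scalar evaluation is the expected score E = P(win) + 0.5 · P(draw). The sketch below assumes three raw logits ordered (win, draw, loss) from the side to move; the logit values are illustrative, not model output.

```python
import math

def wdl_to_score(logits):
    """Softmax three WDL logits, then return the expected score in [0, 1]."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    total = sum(exps)
    w, d, l = (e / total for e in exps)
    return w + 0.5 * d                          # expected score for side to move

print(wdl_to_score([1.2, 0.3, -0.8]))           # clearly better than 0.5 (winning)
print(wdl_to_score([0.0, 0.0, 0.0]))            # uniform WDL → exactly 0.5
```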

Files

  • best_model.pt — Model weights only (816 MB)
  • training_log.json — Loss curve data
  • config.json — Architecture config

Known Issues

  • Training hit FP16 NaN at step ~13,800. Best checkpoint is step 10,000.
  • Model is only ~21% through one epoch of the 48M-position dataset.
  • Opens with 1.d4 as White; plays reasonable chess but is still early in training.
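One guard against the FP16 NaN failure above is to skip any optimizer step whose loss is non-finite, so NaNs never propagate into the weights (torch's GradScaler applies the same idea to non-finite gradients). The loop below is a minimal pure-Python sketch of that check, not the project's actual training code.

```python
import math

# Simulated per-batch losses; the NaN stands in for the FP16 overflow at step ~13,800.
losses = [2.7, 2.5, float("nan"), 2.4]

applied = 0
for loss in losses:
    if math.isfinite(loss):
        applied += 1   # optimizer.step() would go here
    # else: drop the batch; weights stay at the last good values

print(applied)  # → 3 (the NaN batch is skipped)
```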