March 31, 2026
Computers have been superhuman at chess for almost thirty years. But modern chess engines achieve this performance in a way that is extremely hard for humans to replicate. First, they use hyper-optimized search techniques to look 20-30 moves ahead across many different combinations, reasoning near-perfectly about complex tactical positions. Second, they distill billions of games into sophisticated neural networks that quickly evaluate the subtleties of a position using features that are hard for humans to interpret.
On the other hand, if you watch an LLM play chess, it seems to reason in a way much more congruous with how humans approach the game: thinking through simple positional features, or trying to calculate lines and often getting lost or making small computational errors. This is reminiscent of AI coding tools, where it is pretty easy to follow how the models reason about a problem, even though they exploit a combination of extreme endurance, often-superhuman insight, and encyclopedic knowledge of the situation.
To understand the state of the art on these tasks, I built a series of chess benchmarks to measure where the current suite of frontier models excels and where it falls short. This is by no means the first chess benchmark, but I think it is comprehensive, easy to understand, and documents a moment in time in model capability, as these models appear to be moving past my own skill level (I peaked around 1800 Elo). Traditional chess engines have been imperfect teachers, but it is my hope that LLMs will eventually help us learn chess better by distilling insights about the game into human-understandable reasoning on demand.
Chess Bench breaks down chess-playing ability into full games, puzzles, and endgames. The model is treated just like a human: it is given the board state at every turn and asked to reason about the correct move. I'll go into more detail below about the exact mechanics of each challenge.
| Model | Endgame Win % | Puzzle Elo | Full Game Elo |
|---|---|---|---|
| Gemini 3.1 Pro | 70% | 2141 | 1920 |
| GPT 5.4 | 65% | 2054 | — |
| Opus 4.6 | 5% | 1027 | — |
All models were run with maximum thinking configured.
To make this as realistic as possible, I pass the previous move and the current board state into the model at every turn. This imitates how a human would play chess: they do not need to mentally recreate the board from the move sequence; they can simply observe the board. I also maintain some reasoning history between moves so that the model can complete nuanced tactical sequences coherently.
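The per-turn loop looks something like the sketch below, written with python-chess; `query_model`, the prompt wording, and the three-item reasoning window are hypothetical stand-ins for the real harness:

```python
import chess

def play_model_turn(board, query_model, last_move, reasoning_window):
    """One turn of the harness: show the model the fresh board, the previous
    move, and a short window of its own recent reasoning, then apply its move.

    `query_model` is a hypothetical callable: prompt -> (uci_move, reasoning).
    """
    prompt = (
        f"Previous move: {last_move}\n"
        f"You are {'White' if board.turn == chess.WHITE else 'Black'}. Board:\n{board}\n"
        f"FEN: {board.fen()}\n"
        "Your recent reasoning:\n" + "\n".join(reasoning_window[-3:]) + "\n"
        "Think it through, then answer with a single UCI move."
    )
    uci, reasoning = query_model(prompt)
    move = chess.Move.from_uci(uci)
    if move not in board.legal_moves:
        raise ValueError(f"illegal move {uci}")  # in practice: re-prompt with the error
    board.push(move)
    reasoning_window.append(reasoning)
    return uci
```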
The standard formula for winning a chess game is to gain an advantage in the opening, convert it to a material gain in the middle game, and then use the material advantage to win the endgame. Because of this, early in a chess player's career they are taught how to win a variety of endgames. Some are simple (King + Queen vs. King) and some are quite complex (King + Rook + Pawn vs. King + Rook).
To test the state-of-the-art LLMs, I set them up with 20 theoretically won endgame positions across 4 difficulty tiers and had them play against Stockfish at maximum strength. The model must convert the winning position into checkmate; a draw or loss counts as a failure.
| Position | Stockfish | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| Tier 1: Elementary | | | | |
| KQ vs K, Central King | Win (10) | Win (19) | Draw | Draw |
| KQ vs K, Corner Defense | Win (8) | Win (13) | Win (13) | Win (15) |
| KR vs K, Central King | Win (16) | Win (35) | Win (37) | Draw |
| KR vs K, Edge Defense | Win (16) | Win (23) | Draw | Draw |
| Tier 2: Intermediate | | | | |
| KBB vs K, Central | Win (19) | Draw | Win (35) | Draw |
| KP vs K, Advanced Passed Pawn | Win (15) | Win (17) | Win (25) | Draw |
| KP vs K, King Outflanks | Win (17) | Win (37) | Win (31) | Draw |
| KP vs K, King Supports Pawn | Win (15) | Win (29) | Win (23) | Draw |
| KP vs K, Opposition Critical | Win (12) | Win (21) | Win (37) | Draw |
| Tier 3: Advanced | | | | |
| KBN vs K, Drive to Correct Corner | Win (33) | Draw | Draw | Draw |
| KBN vs K, Wrong Corner Start | Win (31) | Win (57) | Draw | Draw |
| KQ vs KR, Central | Win (67) | Loss | Win (21) | Loss |
| KQ vs KR, Rook Defending | Win (22) | Draw | Draw | Loss |
| KRP vs KR, Advanced Rook Pawn | Win (19) | Win (39) | Win (77) | Draw |
| KRP vs KR, Lucena Position | Win (14) | Win (55) | Win (31) | Loss |
| KRP vs KR, Pawn on 6th with Support | Win (33) | Draw | Draw | Draw |
| Tier 4: Complex | | | | |
| KQP vs KQ, Advanced Pawn | Win (21) | Win (23) | Win (25) | Draw |
| KQP vs KQ, Pawn on 7th | Win (10) | Win (33) | Win (29) | Loss |
| KRBP vs KRB, Passed Pawn | Win (31) | Draw | Win (47) | Loss |
| KRR vs KR, Two Rooks Dominate | Win (40) | Win (67) | Draw | Loss |
| Total | 100% | 70% | 65% | 5% |
Win (N) = checkmate in N moves.
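The conversion harness itself is easy to sketch with python-chess and a local Stockfish binary; `model_move_fn` is a hypothetical hook into the LLM side, and the move cap is an arbitrary cutoff:

```python
import chess
import chess.engine

def run_endgame(fen, model_move_fn, stockfish_path="stockfish", max_model_moves=100):
    """Play out a theoretically won position: the model has the winning side,
    full-strength Stockfish defends. Anything short of checkmate is a failure."""
    board = chess.Board(fen)
    winning_side = board.turn  # the model moves first, from the winning side
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        while not board.is_game_over() and len(board.move_stack) < 2 * max_model_moves:
            if board.turn == winning_side:
                board.push(model_move_fn(board))  # hypothetical LLM hook -> chess.Move
            else:
                board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
    outcome = board.outcome()
    if outcome is not None and outcome.winner == winning_side:
        return "win"
    if outcome is not None and outcome.winner is not None:
        return "loss"
    return "draw"  # stalemate, repetition, or hitting the move cap all count as failures
```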
Reading the traces, Gemini reasons about complex endgames in a way that feels very familiar to me. Below is a Tier 4 endgame where Gemini converts a Queen + Pawn vs. Queen position. It checks the king, skewers the queen, promotes the pawn, and then methodically mates with King and Queen.
To benchmark tactical ability, I curated 100 puzzles from Lichess spanning ratings from 500 to 2500. Each puzzle presents a critical position where there is one clearly best move or forcing sequence. The model sees only the board state and must find the winning continuation.
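Scoring an attempt is mechanical. Here is a minimal sketch, assuming the solution is stored as a UCI move list with the model to move first (real Lichess puzzle data actually begins with an opponent setup move, which a full harness would play before the first query):

```python
import chess

def check_puzzle(fen, solution_uci, model_move_fn):
    """Score one puzzle: the model must reproduce every move of the stored
    solution; the opponent's replies are played from the solution line."""
    board = chess.Board(fen)
    for i, uci in enumerate(solution_uci):
        expected = chess.Move.from_uci(uci)
        if i % 2 == 0 and model_move_fn(board) != expected:  # model to move
            return False
        board.push(expected)
    return True
```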
Puzzle Elo is estimated using a Glicko-style rating system: each puzzle the model attempts adjusts its rating based on whether it found the solution and the puzzle's difficulty. This is the same approach Lichess uses to rate human puzzle performance.
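As a simplified illustration of the mechanics, here is a plain Elo-style update; Glicko additionally tracks a rating deviation that scales the step size, but the logistic core is the same:

```python
def update_rating(model_rating, puzzle_rating, solved, k=32):
    """One sequential rating update after a puzzle attempt. The expected score
    follows the standard logistic curve, so the rating moves more when the
    result is surprising."""
    expected = 1 / (1 + 10 ** ((puzzle_rating - model_rating) / 400))
    return model_rating + k * ((1.0 if solved else 0.0) - expected)

# A 1500-rated model failing a 2000-rated puzzle barely moves (about -1.7);
# failing a 1200-rated puzzle costs far more (about -27).
rating = 1500.0
rating = update_rating(rating, 2000, solved=False)
rating = update_rating(rating, 1200, solved=False)
```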
| Puzzle Rating | Count | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| 500–700 | 10 | 10/10 | 9/10 | 7/10 |
| 700–900 | 10 | 10/10 | 10/10 | 6/10 |
| 900–1100 | 10 | 10/10 | 10/10 | 6/10 |
| 1100–1300 | 10 | 9/10 | 10/10 | 3/10 |
| 1300–1500 | 10 | 8/10 | 10/10 | 2/10 |
| 1500–1700 | 10 | 6/10 | 3/10 | 1/10 |
| 1700–1900 | 10 | 7/10 | 8/10 | 1/10 |
| 1900–2100 | 10 | 9/10 | 9/10 | 1/10 |
| 2100–2300 | 10 | 7/10 | 5/10 | 1/10 |
| 2300–2500 | 10 | 5/10 | 2/10 | 1/10 |
| Total | 100 | 81/100 | 76/100 | 29/100 |
| Estimated Elo | | 2141 | 2054 | 1027 |
Below is a 1910-rated puzzle, a mate in 2 that requires spotting a queen sacrifice. Toggle between models to see how each one reasons about the position. Gemini and GPT both find the winning Qxg6+, while Opus considers it but talks itself out of it, playing f5 instead.
Gemini and GPT 5.4 both reason conditionally through the tactic. The key insight is that the bishop on a2 pins the f7 pawn along the diagonal, which means after Qxg6+ the pawn cannot recapture and the king is forced to move into a mating net. Opus investigates Qxg6+ but misses the bishop's role in the pin, concludes the queen capture is unsound, and settles for the much less interesting f5 instead.
Puzzles and endgames test isolated skills, but full games require sustained play across all phases: opening preparation, middlegame tactics, and endgame technique. To measure this, I had Gemini 3.1 Pro climb the Elo ladder: 16 games (8 openings × 2 colors) at each Stockfish skill level, from 0 through 8.
Each Stockfish skill level maps to a CCRL Elo rating, giving us a performance curve. A BayesElo analysis of the full 144 games estimates Gemini at 1920 Elo (95% CI: 1831–2010), consistent with the eyeball estimate from the win-rate crossover. Interestingly, the analysis reveals a large white advantage (129 Elo vs the typical 30–40 in human chess). LLMs seem to play significantly better with the initiative.
| Stockfish Elo | W | L | D | Score % |
|---|---|---|---|---|
| 1320 | 16 | 0 | 0 | 100% |
| 1444 | 13 | 3 | 0 | 81% |
| 1566 | 10 | 6 | 0 | 63% |
| 1729 | 9 | 7 | 0 | 56% |
| 1953 | 7 | 9 | 0 | 44% |
| 2204 | 7 | 9 | 0 | 44% |
| 2363 | 3 | 12 | 1 | 22% |
| 2500 | 2 | 14 | 0 | 13% |
| 2596 | 1 | 15 | 0 | 6% |
| Total | 68 | 75 | 1 | 48% |
16 games per level (8 openings × 2 colors). BayesElo estimate: 1920 (95% CI: 1831–2010).
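The point estimate is easy to sanity-check with a one-parameter maximum-likelihood fit over the table above; unlike BayesElo, this sketch ignores the color term and gives no confidence interval:

```python
import math

# (Stockfish CCRL Elo, wins, losses, draws) from the table above
ladder = [
    (1320, 16, 0, 0), (1444, 13, 3, 0), (1566, 10, 6, 0),
    (1729, 9, 7, 0), (1953, 7, 9, 0), (2204, 7, 9, 0),
    (2363, 3, 12, 1), (2500, 2, 14, 0), (2596, 1, 15, 0),
]

def log_likelihood(r):
    """Log-likelihood of the observed results under the Elo logistic model,
    treating a draw as half a win and half a loss."""
    ll = 0.0
    for opp, w, l, d in ladder:
        p = 1 / (1 + 10 ** ((opp - r) / 400))  # expected score vs this level
        ll += (w + d / 2) * math.log(p) + (l + d / 2) * math.log(1 - p)
    return ll

# Crude 1-D maximization by scanning; fine for a single parameter.
best = max(range(1500, 2400), key=log_likelihood)
print(best)  # lands within a few points of the BayesElo estimate of 1920
```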
At skill 8, Gemini managed just 1 win out of 16 games, but what a win. Below is Gemini's only victory against the strongest Stockfish level tested (~2596 Elo). It's a 67-move Italian Game where Gemini slowly builds a passed g-pawn, promotes it to a queen, and delivers checkmate.
Gemini is clearly the model most optimized for chess. It slightly outperforms GPT 5.4 on Elo but massively outperforms it on speed: GPT 5.4 performs significantly worse with thinking set to high instead of xhigh and can take up to 30 minutes per move, while Gemini can easily play a full game in 30 minutes or less. It almost appears as if GPT 5.4 is deriving chess from first principles while Gemini has been explicitly trained to reason about chess (which would make sense given how much time DeepMind has historically spent on games). Opus has no understanding of the geometry of the board. I have noticed this in other settings as well, where Opus struggles with spatial reasoning. This may be related to Anthropic investing far less in multimodal capabilities and mathematical reasoning.
I would have expected the models to be a bit stronger at playing endgames. Endgames are reasonably heuristic-based: once you know the meta-strategy for how to trap the king in the corner, you can convert a lot of similar-looking endgames. All of this intuition should be available to the models, yet in certain scenarios they still struggle to fully convert endgame strategy into wins. In a future post I will explore some techniques for improving language models at chess while preserving this reasoning-trace style of gameplay.