March 31, 2026
Computers have been superhuman at chess for almost thirty years. But modern chess engines achieve this performance in a way that is extremely hard for humans to replicate. First, they use hyper-optimized search techniques to look 20-30 moves ahead across many different combinations, reasoning near-perfectly about complex tactical positions. Second, they distill billions of games into sophisticated neural networks that quickly evaluate the subtleties of a position using features that are hard for humans to interpret.
On the other hand, if you watch an LLM play chess, it seems to reason in a way much more congruous with how humans approach the game: thinking through simple positional features, or trying to calculate lines and often getting lost or making small computational errors. This is reminiscent of AI coding tools, where it is pretty easy to follow how the models reason about a problem, even though they exploit a combination of extreme endurance, often-superhuman insight, and encyclopedic knowledge of the situation.
To understand the state of the art on these tasks, I built a series of chess benchmarks to measure where the current suite of frontier models excels and where it falls short. This is by no means the first chess benchmark, but I think it is comprehensive, easy to understand, and documents a moment in time in model capability, as these models appear to be moving past my own skill level (I peaked around 1800 Elo). Traditional chess engines have been imperfect teachers, but it is my hope that LLMs will eventually help us learn chess better by distilling insights about the game into human-understandable reasoning on demand.
Chess Bench breaks down chess-playing ability into full games, puzzles, and endgames. The model is treated just like a human: it is given the board state at every turn and asked to reason about the correct move. I'll go into more detail below about the exact mechanics of each challenge.
| Model | Endgame Win % | Puzzle Elo | Full Game Elo |
|---|---|---|---|
| Gemini 3.1 Pro | 70% | 2141 | 1920 |
| GPT 5.4 | 65% | 2054 | — |
| Opus 4.6 | 5% | 1027 | — |
All models were run with maximum thinking configured.
To make this as realistic as possible, I pass the previous move and the current board state into the model at every turn. This imitates how a human would play chess: they do not need to mentally recreate the board from the move sequence; they can simply observe the board. I also maintain some reasoning history between moves so that the model can complete nuanced tactical sequences coherently.
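The per-turn loop looks something like the sketch below, written with python-chess; `query_model`, the prompt wording, and the three-item reasoning window are hypothetical stand-ins for the real harness:

```python
import chess

def play_model_turn(board, query_model, last_move, reasoning_window):
    """One turn of the harness: show the model the fresh board, the previous
    move, and a short window of its own recent reasoning, then apply its move.

    `query_model` is a hypothetical callable: prompt -> (uci_move, reasoning).
    """
    prompt = (
        f"Previous move: {last_move}\n"
        f"You are {'White' if board.turn == chess.WHITE else 'Black'}. Board:\n{board}\n"
        f"FEN: {board.fen()}\n"
        "Your recent reasoning:\n" + "\n".join(reasoning_window[-3:]) + "\n"
        "Think it through, then answer with a single UCI move."
    )
    uci, reasoning = query_model(prompt)
    move = chess.Move.from_uci(uci)
    if move not in board.legal_moves:
        raise ValueError(f"illegal move {uci}")  # in practice: re-prompt with the error
    board.push(move)
    reasoning_window.append(reasoning)
    return uci
```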
The standard formula for winning a chess game is to gain an advantage in the opening, convert it to a material gain in the middle game, and then use the material advantage to win the endgame. Because of this, early in a chess player's career they are taught how to win a variety of endgames. Some are simple (King + Queen vs. King) and some are quite complex (King + Rook + Pawn vs. King + Rook).
To test the state-of-the-art LLMs, I set them up with 20 theoretically won endgame positions across 4 difficulty tiers and had them play against Stockfish at maximum strength. The model must convert the winning position into checkmate; a draw or loss counts as a failure.
| Position | Stockfish | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| Tier 1: Elementary | | | | |
| KQ vs K, Central King | Win (10) | Win (19) | Draw | Draw |
| KQ vs K, Corner Defense | Win (8) | Win (13) | Win (13) | Win (15) |
| KR vs K, Central King | Win (16) | Win (35) | Win (37) | Draw |
| KR vs K, Edge Defense | Win (16) | Win (23) | Draw | Draw |
| Tier 2: Intermediate | | | | |
| KBB vs K, Central | Win (19) | Draw | Win (35) | Draw |
| KP vs K, Advanced Passed Pawn | Win (15) | Win (17) | Win (25) | Draw |
| KP vs K, King Outflanks | Win (17) | Win (37) | Win (31) | Draw |
| KP vs K, King Supports Pawn | Win (15) | Win (29) | Win (23) | Draw |
| KP vs K, Opposition Critical | Win (12) | Win (21) | Win (37) | Draw |
| Tier 3: Advanced | | | | |
| KBN vs K, Drive to Correct Corner | Win (33) | Draw | Draw | Draw |
| KBN vs K, Wrong Corner Start | Win (31) | Win (57) | Draw | Draw |
| KQ vs KR, Central | Win (67) | Loss | Win (21) | Loss |
| KQ vs KR, Rook Defending | Win (22) | Draw | Draw | Loss |
| KRP vs KR, Advanced Rook Pawn | Win (19) | Win (39) | Win (77) | Draw |
| KRP vs KR, Lucena Position | Win (14) | Win (55) | Win (31) | Loss |
| KRP vs KR, Pawn on 6th with Support | Win (33) | Draw | Draw | Draw |
| Tier 4: Complex | | | | |
| KQP vs KQ, Advanced Pawn | Win (21) | Win (23) | Win (25) | Draw |
| KQP vs KQ, Pawn on 7th | Win (10) | Win (33) | Win (29) | Loss |
| KRBP vs KRB, Passed Pawn | Win (31) | Draw | Win (47) | Loss |
| KRR vs KR, Two Rooks Dominate | Win (40) | Win (67) | Draw | Loss |
| Total | 100% | 70% | 65% | 5% |
Win (N) = checkmate in N moves.
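The conversion harness itself is easy to sketch with python-chess and a local Stockfish binary; `model_move_fn` is a hypothetical hook into the LLM side, and the move cap is an arbitrary cutoff:

```python
import chess
import chess.engine

def run_endgame(fen, model_move_fn, stockfish_path="stockfish", max_model_moves=100):
    """Play out a theoretically won position: the model has the winning side,
    full-strength Stockfish defends. Anything short of checkmate is a failure."""
    board = chess.Board(fen)
    winning_side = board.turn  # the model moves first, from the winning side
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        while not board.is_game_over() and len(board.move_stack) < 2 * max_model_moves:
            if board.turn == winning_side:
                board.push(model_move_fn(board))  # hypothetical LLM hook -> chess.Move
            else:
                board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
    outcome = board.outcome()
    if outcome is not None and outcome.winner == winning_side:
        return "win"
    if outcome is not None and outcome.winner is not None:
        return "loss"
    return "draw"  # stalemate, repetition, or hitting the move cap all count as failures
```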
Reading the traces, Gemini reasons about complex endgames in a way that feels very familiar to me. Below is a Tier 4 endgame where Gemini converts a Queen + Pawn vs. Queen position. It checks the king, skewers the queen, promotes the pawn, and then methodically mates with King and Queen.
To benchmark tactical ability, I curated 100 puzzles from Lichess spanning ratings from 500 to 2500. Each puzzle presents a critical position where there is one clearly best move or forcing sequence. The model sees only the board state and must find the winning continuation.
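Scoring an attempt is mechanical. Here is a minimal sketch, assuming the solution is stored as a UCI move list with the model to move first (real Lichess puzzle data actually begins with an opponent setup move, which a full harness would play before the first query):

```python
import chess

def check_puzzle(fen, solution_uci, model_move_fn):
    """Score one puzzle: the model must reproduce every move of the stored
    solution; the opponent's replies are played from the solution line."""
    board = chess.Board(fen)
    for i, uci in enumerate(solution_uci):
        expected = chess.Move.from_uci(uci)
        if i % 2 == 0 and model_move_fn(board) != expected:  # model to move
            return False
        board.push(expected)
    return True
```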
Puzzle Elo is estimated using a Glicko-style rating system: each puzzle the model attempts adjusts its rating based on whether it found the solution and the puzzle's difficulty. This is the same approach Lichess uses to rate human puzzle performance.
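As a simplified illustration of the mechanics, here is a plain Elo-style update; Glicko additionally tracks a rating deviation that scales the step size, but the logistic core is the same:

```python
def update_rating(model_rating, puzzle_rating, solved, k=32):
    """One sequential rating update after a puzzle attempt. The expected score
    follows the standard logistic curve, so the rating moves more when the
    result is surprising."""
    expected = 1 / (1 + 10 ** ((puzzle_rating - model_rating) / 400))
    return model_rating + k * ((1.0 if solved else 0.0) - expected)

# A 1500-rated model failing a 2000-rated puzzle barely moves (about -1.7);
# failing a 1200-rated puzzle costs far more (about -27).
rating = 1500.0
rating = update_rating(rating, 2000, solved=False)
rating = update_rating(rating, 1200, solved=False)
```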
| Puzzle Rating | Count | Gemini 3.1 Pro | GPT 5.4 | Opus 4.6 |
|---|---|---|---|---|
| 500–700 | 10 | 10/10 | 9/10 | 7/10 |
| 700–900 | 10 | 10/10 | 10/10 | 6/10 |
| 900–1100 | 10 | 10/10 | 10/10 | 6/10 |
| 1100–1300 | 10 | 9/10 | 10/10 | 3/10 |
| 1300–1500 | 10 | 8/10 | 10/10 | 2/10 |
| 1500–1700 | 10 | 6/10 | 3/10 | 1/10 |
| 1700–1900 | 10 | 7/10 | 8/10 | 1/10 |
| 1900–2100 | 10 | 9/10 | 9/10 | 1/10 |
| 2100–2300 | 10 | 7/10 | 5/10 | 1/10 |
| 2300–2500 | 10 | 5/10 | 2/10 | 1/10 |
| Total | 100 | 81/100 | 76/100 | 29/100 |
| Estimated Elo | | 2141 | 2054 | 1027 |
Below is a 1910-rated puzzle, a mate in 2 that requires spotting a queen sacrifice. Toggle between models to see how each one reasons about the position. Gemini and GPT both find the winning Qxg6+, while Opus considers it but talks itself out of it, playing f5 instead.
Gemini and GPT 5.4 both reason conditionally through the tactic. The key insight is that the bishop on a2 pins the f7 pawn along the diagonal, which means after Qxg6+ the pawn cannot recapture and the king is forced to move into a mating net. Opus investigates Qxg6+ but misses the bishop's role in the pin, concludes the queen capture is unsound, and settles for the much less interesting f5 instead.
Puzzles and endgames test isolated skills, but full games require sustained play across all phases: opening preparation, middlegame tactics, and endgame technique. To measure this, I had Gemini 3.1 Pro climb the Elo ladder: 16 games (8 openings × 2 colors) at each Stockfish skill level, from 0 through 8.
Each Stockfish skill level maps to a CCRL Elo rating, giving us a performance curve. A BayesElo analysis of the full 144 games estimates Gemini at 1920 Elo (95% CI: 1831–2010), consistent with the eyeball estimate from the win-rate crossover. Interestingly, the analysis reveals a large white advantage (129 Elo vs the typical 30–40 in human chess). LLMs seem to play significantly better with the initiative.
| Stockfish Elo | W | L | D | Score % |
|---|---|---|---|---|
| 1320 | 16 | 0 | 0 | 100% |
| 1444 | 13 | 3 | 0 | 81% |
| 1566 | 10 | 6 | 0 | 63% |
| 1729 | 9 | 7 | 0 | 56% |
| 1953 | 7 | 9 | 0 | 44% |
| 2204 | 7 | 9 | 0 | 44% |
| 2363 | 3 | 12 | 1 | 22% |
| 2500 | 2 | 14 | 0 | 13% |
| 2596 | 1 | 15 | 0 | 6% |
| Total | 68 | 75 | 1 | 48% |
16 games per level (8 openings × 2 colors). BayesElo estimate: 1920 (95% CI: 1831–2010).
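The point estimate is easy to sanity-check with a one-parameter maximum-likelihood fit over the table above; unlike BayesElo, this sketch ignores the color term and gives no confidence interval:

```python
import math

# (Stockfish CCRL Elo, wins, losses, draws) from the table above
ladder = [
    (1320, 16, 0, 0), (1444, 13, 3, 0), (1566, 10, 6, 0),
    (1729, 9, 7, 0), (1953, 7, 9, 0), (2204, 7, 9, 0),
    (2363, 3, 12, 1), (2500, 2, 14, 0), (2596, 1, 15, 0),
]

def log_likelihood(r):
    """Log-likelihood of the observed results under the Elo logistic model,
    treating a draw as half a win and half a loss."""
    ll = 0.0
    for opp, w, l, d in ladder:
        p = 1 / (1 + 10 ** ((opp - r) / 400))  # expected score vs this level
        ll += (w + d / 2) * math.log(p) + (l + d / 2) * math.log(1 - p)
    return ll

# Crude 1-D maximization by scanning; fine for a single parameter.
best = max(range(1500, 2400), key=log_likelihood)
print(best)  # lands within a few points of the BayesElo estimate of 1920
```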
At skill 8, Gemini managed just 1 win out of 16 games, but what a win. Below is Gemini's only victory against the strongest Stockfish level tested (~2596 Elo). It's a 67-move Italian Game where Gemini slowly builds a passed g-pawn, promotes it to a queen, and delivers checkmate.
Gemini is clearly the model most optimized for chess. It slightly outperforms GPT 5.4 on Elo but massively outperforms it on speed: GPT 5.4 performs significantly worse with thinking set to high instead of xhigh and can take up to 30 minutes per move, while Gemini can easily play a full game in 30 minutes or less. It almost appears as if GPT 5.4 is deriving chess from first principles while Gemini has been explicitly trained to reason about chess (which would make sense given how much time DeepMind has historically spent on games). Opus has no understanding of the geometry of the board. I have noticed this in other settings as well, where Opus struggles with spatial reasoning. This may be related to Anthropic investing far less in multimodal capabilities and mathematical reasoning.
I would have expected the models to be a bit stronger at playing endgames. Endgames are reasonably heuristic-based: once you know the meta-strategy for how to trap the king in the corner, you can convert a lot of similar-looking endgames. All of this intuition should be available to the models, yet in certain scenarios they still struggle to fully convert endgame strategy into wins. In a future post I will explore some techniques for improving language models at chess while preserving this reasoning-trace style of gameplay.