
Benchmarking Frontier LLMs on Chess

March 31, 2026

Computers Can Already Play Chess

Computers have been superhuman at chess for almost thirty years. But modern chess engines achieve this performance in a way that is extremely hard for humans to replicate. First, they use hyper-optimized search techniques to look 20–30 moves ahead across many different lines, reasoning near-perfectly about complex tactical positions. Second, they distill billions of games into sophisticated neural networks that quickly evaluate the subtleties of a position using features that are hard for humans to understand.

[Diagram: Search (20–30 moves deep) + Evaluation (+1.7)]

LLMs Play Chess Differently

On the other hand, if you watch an LLM play chess, it seems to reason in a way that is much more congruous with how humans approach the game: thinking through simple positional features, or trying to calculate lines and often getting lost or making small computational errors. This is reminiscent of using AI coding tools, where it is pretty easy to follow how the models reason about the problem, even though they exploit a combination of extreme endurance and often superhuman insight or encyclopedic knowledge of the situation.

To understand the state of the art on these tasks, I built a series of chess benchmarks to evaluate the current suite of frontier models, where they excel, and where they fall short. This is by no means the first chess benchmark, but I think it is comprehensive, easy to understand, and documents well a moment in time in model capability, as these models appear to be moving past my skill level (I peaked around 1800 Elo). Traditional chess engines have been imperfect teachers, but it is my hope that LLMs will eventually help us learn chess better by distilling insights about the game into human-understandable reasoning on demand.

[Board diagram]
Gemini 3.1 Pro: "The Knight on g5 is staring directly at f7. If my Knight leaps to f7, it places the Black King in check. The King can't go to g8 because my Bishop controls the square. The g7 pawn stops it from going to g7. There's nowhere for the King to run. Checkmate!"
Nf7#

Frontier Model Chess Performance

Chess Bench breaks down chess-playing ability into full games, puzzles, and endgames. The model is treated just like a human: it is given the board state at every turn and asked to reason about the correct move. I'll go into more detail below about the exact mechanics of each challenge.

Model            Endgame Win %   Puzzle Elo   Full Game Elo
Gemini 3.1 Pro   70%             2141         1920
GPT 5.4          65%             2054         n/a
Opus 4.6         5%              1027         n/a

All models were run with maximum thinking configured.

To make this as realistic as possible, I pass the previous move and the current board state into the model at every turn. This imitates how a human plays chess: they do not need to mentally recreate the board from the move sequence; they can just observe the board. I also maintain some reasoning history between moves so that the model can complete nuanced tactical sequences in a coherent manner.
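For concreteness, here is a minimal sketch of what that per-turn loop could look like, using python-chess for board handling. The `query_model` call, the prompt wording, and the three-entry reasoning window are illustrative assumptions, not the exact harness used for the benchmark.

```python
import chess

def play_turn(board: chess.Board, last_move: str | None, reasoning_history: list[str]) -> chess.Move:
    # Show the model the previous move and the rendered board, not the raw move list.
    prompt = (
        f"Previous move: {last_move or 'none'}\n"
        f"Current position:\n{board}\n"
        f"FEN: {board.fen()}\n"
        "Your earlier notes:\n" + "\n".join(reasoning_history[-3:]) + "\n"
        "Reason about the position, then give your move in UCI notation (e.g. e2e4)."
    )
    reasoning, move_text = query_model(prompt)  # hypothetical wrapper around the LLM API
    reasoning_history.append(reasoning)         # carried forward to later turns

    move = chess.Move.from_uci(move_text.strip())
    if move not in board.legal_moves:           # a real harness would retry; here we just fail loudly
        raise ValueError(f"Illegal move from model: {move_text}")
    return move
```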

Endgames

The standard formula for winning a chess game is to gain an advantage in the opening, convert it to a material gain in the middle game, and then use the material advantage to win the endgame. Because of this, early in a chess player's career they are taught how to win a variety of endgames. Some are simple (King + Queen vs. King) and some are quite complex (King + Rook + Pawn vs. King + Rook).

To test the state of the art LLMs, I set them up with 20 theoretically won endgame positions across 4 difficulty tiers and had them play against Stockfish at maximum strength. The model must convert the winning position into checkmate. A draw or loss counts as a failure.
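A single endgame trial might look like the sketch below, assuming a Stockfish binary on the PATH and reusing a hypothetical `model_move` wrapper around the per-turn loop above. The move cap and per-move time limit are arbitrary choices for illustration.

```python
import chess
import chess.engine

def run_endgame(fen: str, model_plays_white: bool, max_moves: int = 100) -> tuple[bool, int]:
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        while not board.is_game_over() and board.fullmove_number <= max_moves:
            if (board.turn == chess.WHITE) == model_plays_white:
                board.push(model_move(board))                         # the LLM's turn
            else:
                result = engine.play(board, chess.engine.Limit(time=1.0))
                board.push(result.move)                               # Stockfish's turn
    # Only a delivered checkmate counts; draws and losses are both failures.
    model_won = board.is_checkmate() and (board.turn == chess.BLACK) == model_plays_white
    return model_won, board.fullmove_number
```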

Endgame Results by Position

Position                               Stockfish   Gemini 3.1 Pro   GPT 5.4    Opus 4.6

Tier 1: Elementary
KQ vs K, Central King                  Win (10)    Win (19)         Draw       Draw
KQ vs K, Corner Defense                Win (8)     Win (13)         Win (13)   Win (15)
KR vs K, Central King                  Win (16)    Win (35)         Win (37)   Draw
KR vs K, Edge Defense                  Win (16)    Win (23)         Draw       Draw

Tier 2: Intermediate
KBB vs K, Central                      Win (19)    Draw             Win (35)   Draw
KP vs K, Advanced Passed Pawn          Win (15)    Win (17)         Win (25)   Draw
KP vs K, King Outflanks                Win (17)    Win (37)         Win (31)   Draw
KP vs K, King Supports Pawn            Win (15)    Win (29)         Win (23)   Draw
KP vs K, Opposition Critical           Win (12)    Win (21)         Win (37)   Draw

Tier 3: Advanced
KBN vs K, Drive to Correct Corner      Win (33)    Draw             Draw       Draw
KBN vs K, Wrong Corner Start           Win (31)    Win (57)         Draw       Draw
KQ vs KR, Central                      Win (67)    Loss             Win (21)   Loss
KQ vs KR, Rook Defending               Win (22)    Draw             Draw       Loss
KRP vs KR, Advanced Rook Pawn          Win (19)    Win (39)         Win (77)   Draw
KRP vs KR, Lucena Position             Win (14)    Win (55)         Win (31)   Loss
KRP vs KR, Pawn on 6th with Support    Win (33)    Draw             Draw       Draw

Tier 4: Complex
KQP vs KQ, Advanced Pawn               Win (21)    Win (23)         Win (25)   Draw
KQP vs KQ, Pawn on 7th                 Win (10)    Win (33)         Win (29)   Loss
KRBP vs KRB, Passed Pawn               Win (31)    Draw             Win (47)   Loss
KRR vs KR, Two Rooks Dominate          Win (40)    Win (67)         Draw       Loss

Total                                  100%        70%              65%        5%

Win (N) = checkmate in N moves.

Reading the traces, Gemini reasons about complex endgames in a way that feels very familiar to me. Below is a Tier 4 endgame where Gemini converts a Queen + Pawn vs. Queen position. It checks the king, skewers the queen, promotes the pawn, and then methodically mates with King and Queen.

KQP vs KQ: Advanced Pawn
[Interactive board: step through Gemini 3.1 Pro's conversion of the position.]

Puzzles

To benchmark tactical ability, I curated 100 puzzles from Lichess spanning ratings from 500 to 2500. Each puzzle presents a critical position where there is one clearly best move or forcing sequence. The model sees only the board state and must find the winning continuation.
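Scoring a puzzle is essentially a matter of replaying the solution line and checking the model's choices against it. The sketch below assumes a Lichess-style record (a FEN plus a solution given as UCI moves, with the model to move first) and reuses the hypothetical `model_move` wrapper from above; it also ignores the edge case where an alternate move is equally winning.

```python
import chess

def solve_puzzle(fen: str, solution_uci: list[str]) -> bool:
    board = chess.Board(fen)
    for i, expected in enumerate(solution_uci):
        if i % 2 == 0:
            move = model_move(board)              # the model must find this move
            if move.uci() != expected:
                return False
        else:
            move = chess.Move.from_uci(expected)  # the opponent's forced reply is played for it
        board.push(move)
    return True
```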

Puzzle Elo is estimated using a Glicko-style rating system: each puzzle the model attempts adjusts its rating based on whether it found the solution and the puzzle's difficulty. This is the same approach Lichess uses to rate human puzzle performance.
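The sequential update itself is simple. The snippet below uses a plain Elo update with a fixed K-factor as a stand-in for the Glicko machinery (Glicko additionally tracks a rating deviation that shrinks as more puzzles are attempted); the starting rating and K value are illustrative.

```python
def update_rating(model_rating: float, puzzle_rating: float, solved: bool, k: float = 32.0) -> float:
    # Expected score under the logistic Elo model, then a step toward the observed result.
    expected = 1.0 / (1.0 + 10 ** ((puzzle_rating - model_rating) / 400.0))
    return model_rating + k * ((1.0 if solved else 0.0) - expected)

# Example: solving a 1900-rated puzzle, missing a 2100, then solving a 1700.
rating = 1500.0
for puzzle_rating, solved in [(1900, True), (2100, False), (1700, True)]:
    rating = update_rating(rating, puzzle_rating, solved)
```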

Puzzle Accuracy by Rating Tier

Puzzle Rating   Count   Gemini 3.1 Pro   GPT 5.4   Opus 4.6
500–700         10      10/10            9/10      7/10
700–900         10      10/10            10/10     6/10
900–1100        10      10/10            10/10     6/10
1100–1300       10      9/10             10/10     3/10
1300–1500       10      8/10             10/10     2/10
1500–1700       10      6/10             3/10      1/10
1700–1900       10      7/10             8/10      1/10
1900–2100       10      9/10             9/10      1/10
2100–2300       10      7/10             5/10      1/10
2300–2500       10      5/10             2/10      1/10
Total           100     81/100           76/100    29/100
Estimated Elo           2141             2054      1027

Below is a 1910-rated puzzle, a mate in 2 that requires spotting a queen sacrifice. Toggle between models to see how each one reasons about the position. Gemini and GPT both find the winning Qxg6+, while Opus considers it but talks itself out of it, playing f5 instead.

Puzzle: White to Move (Mate in 2)
[Board diagram]
"The first thing I see is a juicy target: Black’s pawn on g6. My Queen on g4 is staring it down. Black’s King is on g8. My a2 Bishop is pointing right at f7. If I play Qxg6+, the King has to move — and if Kf8, then Qxf7 is checkmate! The Bishop on a2 covers f7 along the diagonal, and the King has nowhere to run."
Qxg6+

Gemini and GPT 5.4 both reason conditionally through the tactic. The key insight is that the bishop on a2 pins the f7 pawn along the diagonal, which means after Qxg6+ the pawn cannot recapture and the king is forced to move into a mating net. Opus investigates Qxg6+ but misses the bishop's role in the pin, concludes the queen capture is unsound, and settles for the much less interesting f5 instead.

Full Games

Puzzles and endgames test isolated skills, but full games require sustained play across all phases: opening preparation, middlegame tactics, and endgame technique. To measure this, I had Gemini 3.1 Pro climb the Elo ladder: 16 games (8 openings × 2 colors) at each Stockfish skill level, from 0 through 8.

Each Stockfish skill level maps to a CCRL Elo rating, giving us a performance curve. A BayesElo analysis of the full 144 games estimates Gemini at 1920 Elo (95% CI: 1831–2010), consistent with the eyeball estimate from the win-rate crossover. Interestingly, the analysis reveals a large white advantage (129 Elo vs the typical 30–40 in human chess). LLMs seem to play significantly better with the initiative.
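As a rough, self-contained stand-in for the BayesElo fit, one can treat each game as a draw-adjusted Bernoulli trial under the logistic Elo curve and maximize the likelihood over the model's rating. Unlike BayesElo, this sketch ignores the colour split and the draw model, so it will not reproduce the 1920 figure or its confidence interval exactly.

```python
import math

# (Stockfish Elo, wins, losses, draws) per skill level, taken from the table below.
RESULTS = [
    (1320, 16, 0, 0), (1444, 13, 3, 0), (1566, 10, 6, 0),
    (1729, 9, 7, 0), (1953, 7, 9, 0), (2204, 7, 9, 0),
    (2363, 3, 12, 1), (2500, 2, 14, 0), (2596, 1, 15, 0),
]

def log_likelihood(model_elo: float) -> float:
    ll = 0.0
    for opp_elo, wins, losses, draws in RESULTS:
        p_win = 1.0 / (1.0 + 10 ** ((opp_elo - model_elo) / 400.0))
        # Count each draw as half a win and half a loss.
        ll += (wins + 0.5 * draws) * math.log(p_win)
        ll += (losses + 0.5 * draws) * math.log(1.0 - p_win)
    return ll

# Coarse one-point grid search over plausible ratings.
best_elo = max(range(1500, 2400), key=log_likelihood)
```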

Gemini 3.1 Pro vs. Stockfish 18

Stockfish Elo   W    L    D   Win %
1320            16   0    0   100%
1444            13   3    0   81%
1566            10   6    0   63%
1729            9    7    0   56%
1953            7    9    0   44%
2204            7    9    0   44%
2363            3    12   1   22%
2500            2    14   0   13%
2596            1    15   0   6%
Total           68   75   1   47%

16 games per level (8 openings × 2 colors). BayesElo estimate: 1920 (95% CI: 1831–2010).

At skill 8, Gemini managed just 1 win out of 16 games, but what a win. Below is Gemini's only victory against the strongest Stockfish level tested (~2596 Elo). It's a 67-move Italian Game where Gemini slowly builds a passed g-pawn, promotes it to a queen, and delivers checkmate.

Italian Game: Gemini (White) vs. Stockfish (~2596 Elo)
[Interactive board: step through the full game.]

Observations

Gemini is clearly the model most optimized for chess. It slightly outperforms GPT 5.4 on Elo but massively outperforms it on speed. GPT 5.4 performs significantly worse with thinking set to high instead of xhigh and can take up to 30 minutes per move, while Gemini can easily play a full game in 30 minutes or less. It almost appears as if GPT 5.4 is deriving chess from first principles while Gemini has been explicitly trained to reason about chess (which would make sense given how much time DeepMind has historically spent on games). Opus has no understanding of the geometry of the board. I have noticed this in other settings as well, where Opus struggles with spatial reasoning. This may be related to the fact that Anthropic has invested far less in multimodal capabilities and mathematical reasoning.

I would have expected the models to be a bit stronger at playing endgames. Endgames are reasonably heuristic-based: once you know the meta-strategy for trapping the king in the corner, you can convert a lot of similar-looking endgames. All of this intuition should be available to the models, yet they still struggle in certain scenarios to fully convert endgame strategy into wins. In a future post I will explore some techniques for improving language models at chess while preserving this reasoning-trace style of gameplay.