
A Review of Opus 4.7

April 17, 2026

Over the past few months I've built up a small repository of private frontier-model benchmarks that I've published on this blog (with a few more still unreleased). Naturally, with a new foundation model release, I wanted to revisit these evaluations to understand the direction of progress.

Overall, I found 4.7 to be a strict improvement over 4.6 on every benchmark. The largest gains came in 2D vision tasks, negotiation, and long-horizon strategic thinking. This tracks closely with the categories Anthropic highlighted in their announcement post.

Much of the discourse has centered on the character of the model. I have some ideas for how to measure this that I'll publish in the future; for now I'll refrain from commenting beyond a short section at the end.

Chess Perception Bench

From Analyzing Chess Input Modalities. The table grades 30 one-move puzzles per cell across five input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG render, and a pair of photos of a real board. Opus 4.7 is a strict improvement over 4.6, although the vision gains are much larger on the 2D PNG render than on the photos of a real board.

Puzzle Accuracy by Model and Modality
Model          | UCI | PGN  | FEN  | PNG Image | Photos
Gemini 3.1 Pro | 97% | 100% | 100% | 77%       | 73%
GPT 5.4        | 93% | 100% | 100% | 93%       | 40%
Opus 4.7       | 53% | 50%  | 50%  | 57%       | 20%
Opus 4.6       | 33% | 33%  | 40%  | 20%       | 17%
Qwen 3.5 27B   | 13% | 7%   | 23%  | 37%       | 17%

30 one-move puzzles per cell, thinking effort set to “low” for every model.
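All five modalities encode the same underlying position; the FEN string is the most compact. As an illustration (this is not the benchmark harness), a minimal pure-Python sketch that expands the board field of a standard FEN into the 8x8 text grid that a 2D render conveys:

```python
def fen_to_grid(fen: str) -> list[str]:
    """Expand the board field of a FEN string into 8 ranks of text.

    Digits in FEN encode runs of empty squares; letters are pieces
    (uppercase = White, lowercase = Black).
    """
    board_field = fen.split()[0]          # first FEN field is the board
    ranks = []
    for rank in board_field.split("/"):   # ranks are listed 8th to 1st
        row = ""
        for ch in rank:
            row += "." * int(ch) if ch.isdigit() else ch
        ranks.append(row)
    return ranks

# Starting position, for illustration
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
for rank in fen_to_grid(start):
    print(rank)
```

The gap between the FEN and PNG columns above is interesting precisely because this transformation is lossless: the image contains no information the string lacks.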

Chess Bench

From Benchmarking Frontier LLMs on Chess. I reran the endgame benchmark (20 theoretically won positions against Stockfish at skill level 20) and the puzzle benchmark (100 Lichess puzzles, 500–2500 Elo) with thinking effort set to maximum. Opus 4.7 is a huge improvement over 4.6 on endgames, and a minor improvement on puzzles.

Chess Bench Summary
Model          | Endgame Wins | Puzzles Solved | Puzzle Elo
Gemini 3.1 Pro | 75%          | 81/100         | 2141
GPT 5.4        | 55%          | 76/100         | 2054
Opus 4.7       | 50%          | 33/100         | 1110
Opus 4.6       | 5%           | 29/100         | 1027
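The puzzle Elo column is a point estimate derived from pass/fail results against rated puzzles. A hedged sketch of one way to compute such an estimate, assuming the standard logistic expected-score model and a simple grid-search maximum-likelihood fit (the benchmark's actual estimator may differ):

```python
import math

def estimate_elo(results, lo=0, hi=3000, step=1):
    """Estimate a player's Elo from (puzzle_rating, solved) pairs.

    Uses the standard logistic model
    P(solve) = 1 / (1 + 10^((r_puzzle - r) / 400))
    and picks the rating r maximizing the log-likelihood.
    """
    def log_likelihood(r):
        ll = 0.0
        for puzzle_rating, solved in results:
            p = 1.0 / (1.0 + 10 ** ((puzzle_rating - r) / 400))
            ll += math.log(p if solved else 1.0 - p)
        return ll
    return max(range(lo, hi + 1, step), key=log_likelihood)

# Hypothetical results: solves the easy puzzles, fails the hard ones
results = [(800, True), (1000, True), (1200, True),
           (1400, False), (1600, False), (1800, False)]
print(estimate_elo(results))  # lands between the solved and failed bands
```

A grid search is crude but transparent; with only 100 puzzles per model the confidence interval around any such estimate is wide regardless of the fitting method.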

Opus 4.7 is approaching GPT 5.4 level on endgames, as you can see in the table breakdown below:

Endgame Results by Position
Position                             | Stockfish | Gemini 3.1 Pro | Opus 4.7  | Opus 4.6
Tier 1: Elementary
KQ vs K, Central King                | Win (7)   | Win (10)       | Win (23)  | Draw
KQ vs K, Corner Defense              | Win (7)   | Win (7)        | Win (15)  | Win (8)
KR vs K, Central King                | Win (16)  | Win (18)       | Draw      | Draw
KR vs K, Edge Defense                | Win (12)  | Win (12)       | Win (89)  | Draw
Tier 2: Intermediate
KBB vs K, Central                    | Win (27)  | Win (19)       | Win (91)  | Draw
KP vs K, Advanced Passed Pawn        | Win (8)   | Win (9)        | Win (15)  | Draw
KP vs K, King Outflanks              | Win (13)  | Win (19)       | Win (35)  | Draw
KP vs K, King Supports Pawn          | Win (12)  | Win (15)       | Draw      | Draw
KP vs K, Opposition Critical         | Win (11)  | Win (11)       | Win (29)  | Draw
Tier 3: Advanced
KBN vs K, Drive to Correct Corner    | Win (35)  | Draw           | Draw      | Draw
KBN vs K, Wrong Corner Start         | Win (31)  | Win (29)       | Draw      | Draw
KQ vs KR, Central                    | Win (20)  | Loss           | Draw      | Loss
KQ vs KR, Rook Defending             | Win (31)  | Draw           | Draw      | Loss
KRP vs KR, Advanced Rook Pawn        | Win (14)  | Win (20)       | Win (59)  | Draw
KRP vs KR, Lucena Position           | Win (13)  | Win (28)       | Win (49)  | Loss
KRP vs KR, Pawn on 6th with Support  | Win (22)  | Draw           | Loss      | Draw
Tier 4: Complex
KQP vs KQ, Advanced Pawn             | Win (11)  | Win (12)       | Loss      | Draw
KQP vs KQ, Pawn on 7th               | Win (9)   | Win (17)       | Draw      | Loss
KRBP vs KRB, Passed Pawn             | Win (14)  | Draw           | Loss      | Loss
KRR vs KR, Two Rooks Dominate        | Win (20)  | Win (34)       | Win (119) | Loss
Total                                | 100%      | 75%            | 50%       | 5%

Win (N) = delivered checkmate in N moves. Draw / Loss = failed to convert a theoretically won position.

Negotiation Bench

From A Negotiation Benchmark for Frontier Models. I added Opus 4.7 to the round-robin and reran the cross-provider tournament. Opus 4.7 made a huge leap over the already dominant 4.6. This lines up with research from Andon Labs showing that 4.7 can be even more aggressive than 4.6.

Negotiation Skill
Model          | Elo  | Avg Score | Words/msg
Opus 4.7       | 1688 | 0.68      | 24
Opus 4.6       | 1512 | 0.62      | 36
Gemini 3.1 Pro | 1439 | 0.61      | 27
GPT 5.4        | 1361 | 0.58      | 34

150 games per model across the round-robin; 100% deal rate for every model.

Opus 4.7 is the clear new top of the table, picking up about +175 Elo over 4.6 and beating every opponent head-to-head. Against 4.6 directly, it wins 40 of 50 games and outscores it 0.65 to 0.56.

In head-to-head games, 4.7's final offer was the one accepted in 70% of the ending splits. It opened stronger and held firmer, leveraging the stochastic deadline to talk 4.6 into accepting worse deals.
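For context on the Elo column, here is a minimal sketch of the standard Elo update rule, assuming K = 32 and treating each game's pot share as the game score. This is an illustration of the rating mechanics, not necessarily the tournament's exact procedure:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update after a game between A and B.

    score_a is A's result in [0, 1] (e.g. share of the negotiated pot);
    B's result is 1 - score_a.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A takes 65% of the pot
a, b = elo_update(1500, 1500, 0.65)
print(round(a, 1), round(b, 1))  # 1504.8 1495.2
```

Using pot share as a fractional score (rather than binary win/loss) means a model that consistently takes 0.65 of every pot keeps gaining rating even against equally rated opponents, which is consistent with the gap in the table above.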

Optimization Arena

We have three open-ended trading challenges on Optimization Arena that I thought would make for interesting benchmarks. However, after running them a few times through the Claude Code and Codex harnesses, I saw too much variance in scores to make a reliable report.

I have heard reports from top contestants that 4.7 was unable to make additional progress on any top-tier solutions. This matched what I saw locally. Early reports suggest this model is helpful for getting started but not a game changer for any of our trading problems.

Overall Vibes

On balance, this new model clearly outperforms on well-structured tasks. It is too early to give a vibes assessment, and historically, first impressions of a model release correlate only loosely with the eventual consensus.

The thing I like most is the speed and smoothness with which it is being served. Maybe this is a first-day effect, maybe this model is genuinely easier to serve for some architectural reason, or maybe it just uses fewer tokens per unit of insight.

The thing I like least is the overly literal instruction following and a subtle lack of humanity. In one case, while making a distribution chart for an article, I asked it for a figure caption. It wrote a two-sentence caption describing the smoothing function used to create the chart. This is information a reader obviously does not care about, and I suspect 4.6 would have handled it correctly.