April 17, 2026
Over the past few months I’ve built up a small repository of private frontier model benchmarks that I’ve published on my blog (and I have a few more still unreleased). Naturally, with a new foundation model release I wanted to revisit these evaluations to understand the direction of progress.
Overall, I found 4.7 to be a strict improvement over 4.6 on every benchmark. The largest gains came in 2D vision tasks, negotiation, and long-horizon strategic thinking. This tracks closely with the categories Anthropic highlighted in their announcement post.
Much of the discourse has centered on the character of the model. I have some ideas for how to measure this that I will release in the future; for now I will hold off on commenting beyond a short section at the end.
From Analyzing Chess Input Modalities. The table grades 30 one-move puzzles per cell across five input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG render, and a pair of photos of a real board. Opus 4.7 is a strict improvement over 4.6, although the vision gains are much larger on the 2D PNG render than on the photos of a real board.
| Model | UCI | PGN | FEN | PNG Image | Photos |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 97% | 100% | 100% | 77% | 73% |
| GPT 5.4 | 93% | 100% | 100% | 93% | 40% |
| Opus 4.7 | 53% | 50% | 50% | 57% | 20% |
| Opus 4.6 | 33% | 33% | 40% | 20% | 17% |
| Qwen 3.5 27B | 13% | 7% | 23% | 37% | 17% |
30 one-move puzzles per cell, thinking effort set to “low” for every model.
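For readers who want to reproduce the modality comparison, here is a minimal sketch of how a single puzzle position can be serialized into the three text modalities with `python-chess`. This is my own reconstruction rather than the benchmark's actual harness, and the image modalities need an extra rendering step on top of it (e.g. `chess.svg` plus a rasterizer for the PNG, a camera for the photos).

```python
import chess
import chess.pgn


def text_modalities(moves_san: list[str]) -> dict[str, str]:
    """Serialize one puzzle position as UCI history, PGN, and FEN.

    `moves_san` is the move history leading to the puzzle position in
    standard algebraic notation (illustrative input, not the benchmark's
    real data format).
    """
    board = chess.Board()
    game = chess.pgn.Game()
    node = game
    uci_moves = []

    for san in moves_san:
        move = board.parse_san(san)      # parse relative to the current position
        uci_moves.append(move.uci())
        node = node.add_variation(move)  # extend the PGN main line
        board.push(move)

    return {
        "uci": " ".join(uci_moves),  # e.g. "e2e4 e7e5 f1c4 ..."
        "pgn": str(game),            # full PGN with headers and move list
        "fen": board.fen(),          # final position only, no history
    }


# Example: a short opening sequence; the model is then asked for the best move.
prompts = text_modalities(["e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6"])
for modality, text in prompts.items():
    print(f"--- {modality} ---\n{text}\n")
```

One design point worth keeping in mind when reading the table: FEN carries no move history, so the three text modalities test slightly different things even though they encode the same position.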
From Benchmarking Frontier LLMs on Chess. I reran the endgame benchmark (20 theoretically won positions against Stockfish at skill level 20) and the puzzle benchmark (100 Lichess puzzles, 500–2500 Elo) with maximum thinking effort. Opus 4.7 is a huge improvement over 4.6 on endgames and a minor improvement on puzzles.
| Model | Endgame Wins | Puzzles Solved | Puzzle Elo |
|---|---|---|---|
| Gemini 3.1 Pro | 75% | 81/100 | 2141 |
| GPT 5.4 | 55% | 76/100 | 2054 |
| Opus 4.7 | 50% | 33/100 | 1110 |
| Opus 4.6 | 5% | 29/100 | 1027 |
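The Puzzle Elo column condenses 100 pass/fail results on rated puzzles into a single number. One plausible way to do that, sketched below, is a maximum-likelihood fit under the standard logistic Elo model; this is a reconstruction of the general idea, not necessarily the exact formula the original benchmark uses.

```python
import math


def estimate_puzzle_elo(results: list[tuple[int, bool]]) -> float:
    """Maximum-likelihood Elo from (puzzle_rating, solved) pairs.

    Assumes the standard logistic model: a model rated R_m solves a
    puzzle rated R_p with probability 1 / (1 + 10 ** ((R_p - R_m) / 400)).
    """
    def log_likelihood(rating: float) -> float:
        ll = 0.0
        for puzzle_rating, solved in results:
            p = 1.0 / (1.0 + 10.0 ** ((puzzle_rating - rating) / 400.0))
            ll += math.log(p if solved else 1.0 - p)
        return ll

    # The log-likelihood is unimodal in the rating, so a 1-Elo grid
    # search over a generous range is good enough here.
    return float(max(range(0, 3501), key=log_likelihood))


# Placeholder results, not real benchmark data.
print(estimate_puzzle_elo([(800, True), (1500, True), (1900, False), (2400, False)]))
```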
Opus 4.7's 50% endgame conversion is approaching GPT 5.4's 55%. The per-position breakdown below shows where the wins came from:
| Position | Stockfish | Gemini 3.1 Pro | Opus 4.7 | Opus 4.6 |
|---|---|---|---|---|
| Tier 1: Elementary | | | | |
| KQ vs K, Central King | Win (7) | Win (10) | Win (23) | Draw |
| KQ vs K, Corner Defense | Win (7) | Win (7) | Win (15) | Win (8) |
| KR vs K, Central King | Win (16) | Win (18) | Draw | Draw |
| KR vs K, Edge Defense | Win (12) | Win (12) | Win (89) | Draw |
| Tier 2: Intermediate | | | | |
| KBB vs K, Central | Win (27) | Win (19) | Win (91) | Draw |
| KP vs K, Advanced Passed Pawn | Win (8) | Win (9) | Win (15) | Draw |
| KP vs K, King Outflanks | Win (13) | Win (19) | Win (35) | Draw |
| KP vs K, King Supports Pawn | Win (12) | Win (15) | Draw | Draw |
| KP vs K, Opposition Critical | Win (11) | Win (11) | Win (29) | Draw |
| Tier 3: Advanced | | | | |
| KBN vs K, Drive to Correct Corner | Win (35) | Draw | Draw | Draw |
| KBN vs K, Wrong Corner Start | Win (31) | Win (29) | Draw | Draw |
| KQ vs KR, Central | Win (20) | Loss | Draw | Loss |
| KQ vs KR, Rook Defending | Win (31) | Draw | Draw | Loss |
| KRP vs KR, Advanced Rook Pawn | Win (14) | Win (20) | Win (59) | Draw |
| KRP vs KR, Lucena Position | Win (13) | Win (28) | Win (49) | Loss |
| KRP vs KR, Pawn on 6th with Support | Win (22) | Draw | Loss | Draw |
| Tier 4: Complex | | | | |
| KQP vs KQ, Advanced Pawn | Win (11) | Win (12) | Loss | Draw |
| KQP vs KQ, Pawn on 7th | Win (9) | Win (17) | Draw | Loss |
| KRBP vs KRB, Passed Pawn | Win (14) | Draw | Loss | Loss |
| KRR vs KR, Two Rooks Dominate | Win (20) | Win (34) | Win (119) | Loss |
| Total | 100% | 75% | 50% | 5% |
Win (N) = checkmate in N moves. Draw / Loss = failed to convert.
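For reference, each endgame game follows the setup from the original post: the model plays the winning side from a theoretically won position while Stockfish defends at skill level 20, and a game that is not converted counts as a draw or loss. Below is a minimal sketch of such a game loop with `python-chess`; `model_move_fn` stands in for the actual LLM call, and the forfeit-on-illegal-move rule is my assumption rather than the original harness's.

```python
import chess
import chess.engine


def play_endgame(start_fen: str, model_move_fn, max_plies: int = 200) -> str:
    """Play one endgame: the model has the winning side, Stockfish defends.

    `model_move_fn(board)` is a placeholder for the LLM call and should
    return a move in UCI notation (hypothetical interface). Returns
    "win", "draw", or "loss" from the model's point of view.
    """
    board = chess.Board(start_fen)
    model_color = board.turn  # assume the model moves first from the won position
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 20})  # defend at full strength

    try:
        while not board.is_game_over() and board.ply() < max_plies:
            if board.turn == model_color:
                move = chess.Move.from_uci(model_move_fn(board))
                if move not in board.legal_moves:
                    return "loss"  # one possible rule: an illegal move forfeits
            else:
                move = engine.play(board, chess.engine.Limit(time=0.1)).move
            board.push(move)
    finally:
        engine.quit()

    outcome = board.outcome()
    if outcome is not None and outcome.winner == model_color:
        return "win"
    if outcome is None or outcome.winner is None:
        return "draw"  # move cap reached or drawn position: failed to convert
    return "loss"
```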
From A Negotiation Benchmark for Frontier Models. I added Opus 4.7 to the round-robin and reran the cross-provider tournament. Opus 4.7 made a huge leap over the already dominant 4.6. This lines up with research from Andon Labs showing that 4.7 can be even more aggressive than 4.6.
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.7 | 1688 | 0.68 | 24 |
| Opus 4.6 | 1512 | 0.62 | 36 |
| Gemini 3.1 Pro | 1439 | 0.61 | 27 |
| GPT 5.4 | 1361 | 0.58 | 34 |
150 games per model across the round-robin; 100% deal rate for every model.
Opus 4.7 is the clear new top of the table, picking up about +175 Elo over 4.6 and beating every opponent head-to-head. Directly against 4.6, it wins 40 of 50 games and outscores it 0.65 to 0.56.
In those head-to-heads, it was 4.7's offer that was accepted in 70% of the final splits: 4.7 opened stronger and held firmer, leveraging the stochastic deadline to talk 4.6 into accepting worse deals.
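For context on how the ratings move, here is a minimal sketch of a sequential Elo update over round-robin results. The K-factor and the mapping from negotiated splits to a game score are my assumptions; the original benchmark may fit ratings differently (e.g. with a batch Bradley-Terry fit).

```python
def update_elo(ratings: dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update two models' ratings after one negotiation game.

    `score_a` is model A's result in [0, 1], e.g. 1.0 if its offer was
    accepted for the larger share, 0.5 for an even split (assumed mapping).
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))


ratings = {m: 1500.0 for m in ["Opus 4.7", "Opus 4.6", "Gemini 3.1 Pro", "GPT 5.4"]}
# Hypothetical single game: Opus 4.7 takes the larger share against 4.6.
update_elo(ratings, "Opus 4.7", "Opus 4.6", score_a=1.0)
print(ratings)
```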
We have three open-ended trading challenges on Optimization Arena that I thought would make for interesting benchmarks. However, after running them a few times through the Claude Code and Codex harnesses, I saw too much variance in scores to make a reliable report.
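To put a number on "too much variance": I looked at the spread of scores across repeated runs of the same challenge with the same model, roughly as sketched below. The scores are placeholders and the cutoff for "reliable" is a judgment call, but when the run-to-run spread rivals the model-to-model gap there is no signal worth reporting.

```python
import statistics


def run_to_run_spread(scores: list[float]) -> dict[str, float]:
    """Summarize variability across repeated runs of one challenge."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Placeholder scores from five runs of the same challenge and model.
print(run_to_run_spread([0.42, 0.61, 0.38, 0.55, 0.47]))
```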
I have heard from top contestants that 4.7 was unable to make additional progress on any top-tier solutions, which matches what I saw locally. Early impressions suggest this model is helpful for getting started but not a game changer for any of our trading problems.
On balance, this new model clearly outperforms its predecessor on well-structured tasks. It is too early to give a vibes assessment, and historically first impressions of a model release have not correlated well with the eventual consensus.
The thing I like most is the speed and smoothness with which it is being served. Maybe this is a first-day effect, maybe this model is genuinely easier to serve for some architectural reason, or maybe it just uses fewer tokens per unit of insight.
The thing I like least is the overly literal instruction following and subtle lack of humanity. In one case, while making a distribution chart for an article, I asked it for a figure caption. It wrote out a two-sentence caption that described what smoothing function was used to create the chart. This is obviously information a reader does not care about, and I suspect 4.6 would have handled this correctly.