April 17, 2026
Over the past few months I’ve built up a small repository of private frontier model benchmarks that I’ve published on my blog (and I have a few more still unreleased). Naturally, with a new foundation model release I wanted to revisit these evaluations to understand the direction of progress.
Overall, I found 4.7 to be a strict improvement over 4.6 on every benchmark. The largest gains came in 2D vision tasks, negotiation, and long-horizon strategic thinking. This tracks closely with the categories Anthropic highlighted in their announcement post.
Much of the discourse has centered on the character of the model. I have some ideas for how to measure this that I will release in the future; for now I will hold off on commenting beyond a short section at the end.
From Analyzing Chess Input Modalities. The table grades 30 one-move puzzles per cell across five input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG render, and a pair of photos of a real board. Opus 4.7 is a strict improvement over 4.6, although the vision gains are much larger on the 2D PNG render than on the photos of a real board.
| Model | UCI | PGN | FEN | PNG Image | Photos |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 97% | 100% | 100% | 77% | 73% |
| GPT 5.4 | 93% | 100% | 100% | 93% | 40% |
| Opus 4.7 | 53% | 50% | 50% | 57% | 20% |
| Opus 4.6 | 33% | 33% | 40% | 20% | 17% |
| Qwen 3.5 27B | 13% | 7% | 23% | 37% | 17% |
30 one-move puzzles per cell, thinking effort set to “low” for every model.
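For readers who want to reproduce the modality comparison, here is a minimal sketch of how a single puzzle position can be serialized into the three text modalities with `python-chess`. This is my own reconstruction rather than the benchmark's actual harness, and the image modalities need an extra rendering step on top of it (e.g. `chess.svg` plus a rasterizer for the PNG, a camera for the photos).

```python
import chess
import chess.pgn


def text_modalities(moves_san: list[str]) -> dict[str, str]:
    """Serialize one puzzle position as UCI history, PGN, and FEN.

    `moves_san` is the move history leading to the puzzle position in
    standard algebraic notation (illustrative input, not the benchmark's
    real data format).
    """
    board = chess.Board()
    game = chess.pgn.Game()
    node = game
    uci_moves = []

    for san in moves_san:
        move = board.parse_san(san)      # parse relative to the current position
        uci_moves.append(move.uci())
        node = node.add_variation(move)  # extend the PGN main line
        board.push(move)

    return {
        "uci": " ".join(uci_moves),  # e.g. "e2e4 e7e5 f1c4 ..."
        "pgn": str(game),            # full PGN with headers and move list
        "fen": board.fen(),          # final position only, no history
    }


# Example: a short opening sequence; the model is then asked for the best move.
prompts = text_modalities(["e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6"])
for modality, text in prompts.items():
    print(f"--- {modality} ---\n{text}\n")
```

One design point worth keeping in mind when reading the table: FEN carries no move history, so the three text modalities test slightly different things even though they encode the same position.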
From Benchmarking Frontier LLMs on Chess. I reran the endgame benchmark (20 theoretically won positions against Stockfish at skill level 20) and the puzzle benchmark (100 Lichess puzzles, 500–2500 Elo) with maximum thinking effort. Opus 4.7 is a huge improvement over 4.6 on endgames and a minor improvement on puzzles.
| Model | Endgame Wins | Puzzles Solved | Puzzle Elo |
|---|---|---|---|
| Gemini 3.1 Pro | 75% | 81/100 | 2141 |
| GPT 5.4 | 55% | 76/100 | 2054 |
| Opus 4.7 | 50% | 33/100 | 1110 |
| Opus 4.6 | 5% | 29/100 | 1027 |
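The Puzzle Elo column condenses 100 pass/fail results on rated puzzles into a single number. One plausible way to do that, sketched below, is a maximum-likelihood fit under the standard logistic Elo model; this is a reconstruction of the general idea, not necessarily the exact formula the original benchmark uses.

```python
import math


def estimate_puzzle_elo(results: list[tuple[int, bool]]) -> float:
    """Maximum-likelihood Elo from (puzzle_rating, solved) pairs.

    Assumes the standard logistic model: a model rated R_m solves a
    puzzle rated R_p with probability 1 / (1 + 10 ** ((R_p - R_m) / 400)).
    """
    def log_likelihood(rating: float) -> float:
        ll = 0.0
        for puzzle_rating, solved in results:
            p = 1.0 / (1.0 + 10.0 ** ((puzzle_rating - rating) / 400.0))
            ll += math.log(p if solved else 1.0 - p)
        return ll

    # The log-likelihood is unimodal in the rating, so a 1-Elo grid
    # search over a generous range is good enough here.
    return float(max(range(0, 3501), key=log_likelihood))


# Placeholder results, not real benchmark data.
print(estimate_puzzle_elo([(800, True), (1500, True), (1900, False), (2400, False)]))
```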
Opus 4.7's 50% endgame conversion is approaching GPT 5.4's 55%. The per-position breakdown below shows where the wins came from:
| Position | Stockfish | Gemini 3.1 Pro | Opus 4.7 | Opus 4.6 |
|---|---|---|---|---|
| Tier 1: Elementary | | | | |
| KQ vs K, Central King | Win (7) | Win (10) | Win (23) | Draw |
| KQ vs K, Corner Defense | Win (7) | Win (7) | Win (15) | Win (8) |
| KR vs K, Central King | Win (16) | Win (18) | Draw | Draw |
| KR vs K, Edge Defense | Win (12) | Win (12) | Win (89) | Draw |
| Tier 2: Intermediate | | | | |
| KBB vs K, Central | Win (27) | Win (19) | Win (91) | Draw |
| KP vs K, Advanced Passed Pawn | Win (8) | Win (9) | Win (15) | Draw |
| KP vs K, King Outflanks | Win (13) | Win (19) | Win (35) | Draw |
| KP vs K, King Supports Pawn | Win (12) | Win (15) | Draw | Draw |
| KP vs K, Opposition Critical | Win (11) | Win (11) | Win (29) | Draw |
| Tier 3: Advanced | | | | |
| KBN vs K, Drive to Correct Corner | Win (35) | Draw | Draw | Draw |
| KBN vs K, Wrong Corner Start | Win (31) | Win (29) | Draw | Draw |
| KQ vs KR, Central | Win (20) | Loss | Draw | Loss |
| KQ vs KR, Rook Defending | Win (31) | Draw | Draw | Loss |
| KRP vs KR, Advanced Rook Pawn | Win (14) | Win (20) | Win (59) | Draw |
| KRP vs KR, Lucena Position | Win (13) | Win (28) | Win (49) | Loss |
| KRP vs KR, Pawn on 6th with Support | Win (22) | Draw | Loss | Draw |
| Tier 4: Complex | | | | |
| KQP vs KQ, Advanced Pawn | Win (11) | Win (12) | Loss | Draw |
| KQP vs KQ, Pawn on 7th | Win (9) | Win (17) | Draw | Loss |
| KRBP vs KRB, Passed Pawn | Win (14) | Draw | Loss | Loss |
| KRR vs KR, Two Rooks Dominate | Win (20) | Win (34) | Win (119) | Loss |
| Total | 100% | 75% | 50% | 5% |
Win (N) = checkmate in N moves. Draw / Loss = failed to convert.
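For reference, each endgame game follows the setup from the original post: the model plays the winning side from a theoretically won position while Stockfish defends at skill level 20, and a game that is not converted counts as a draw or loss. Below is a minimal sketch of such a game loop with `python-chess`; `model_move_fn` stands in for the actual LLM call, and the forfeit-on-illegal-move rule is my assumption rather than the original harness's.

```python
import chess
import chess.engine


def play_endgame(start_fen: str, model_move_fn, max_plies: int = 200) -> str:
    """Play one endgame: the model has the winning side, Stockfish defends.

    `model_move_fn(board)` is a placeholder for the LLM call and should
    return a move in UCI notation (hypothetical interface). Returns
    "win", "draw", or "loss" from the model's point of view.
    """
    board = chess.Board(start_fen)
    model_color = board.turn  # assume the model moves first from the won position
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 20})  # defend at full strength

    try:
        while not board.is_game_over() and board.ply() < max_plies:
            if board.turn == model_color:
                move = chess.Move.from_uci(model_move_fn(board))
                if move not in board.legal_moves:
                    return "loss"  # one possible rule: an illegal move forfeits
            else:
                move = engine.play(board, chess.engine.Limit(time=0.1)).move
            board.push(move)
    finally:
        engine.quit()

    outcome = board.outcome()
    if outcome is not None and outcome.winner == model_color:
        return "win"
    if outcome is None or outcome.winner is None:
        return "draw"  # move cap reached or drawn position: failed to convert
    return "loss"
```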
From A Negotiation Benchmark for Frontier Models. I added Opus 4.7 to the round-robin and reran the cross-provider tournament. Opus 4.7 made a huge leap over the already dominant 4.6. This lines up with research from Andon Labs showing that 4.7 can be even more aggressive than 4.6.
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.7 | 1688 | 0.68 | 24 |
| Opus 4.6 | 1512 | 0.62 | 36 |
| Gemini 3.1 Pro | 1439 | 0.61 | 27 |
| GPT 5.4 | 1361 | 0.58 | 34 |
150 games per model across the round-robin; 100% deal rate for every model.
Opus 4.7 is the clear new top of the table, picking up about +175 Elo over 4.6 and beating every opponent head-to-head. Directly against 4.6, it wins 40 of 50 games and outscores it 0.65 to 0.56.
In those head-to-heads, it was 4.7's offer that was accepted in 70% of the final splits: 4.7 opened stronger and held firmer, leveraging the stochastic deadline to talk 4.6 into accepting worse deals.
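For context on how the ratings move, here is a minimal sketch of a sequential Elo update over round-robin results. The K-factor and the mapping from negotiated splits to a game score are my assumptions; the original benchmark may fit ratings differently (e.g. with a batch Bradley-Terry fit).

```python
def update_elo(ratings: dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update two models' ratings after one negotiation game.

    `score_a` is model A's result in [0, 1], e.g. 1.0 if its offer was
    accepted for the larger share, 0.5 for an even split (assumed mapping).
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))


ratings = {m: 1500.0 for m in ["Opus 4.7", "Opus 4.6", "Gemini 3.1 Pro", "GPT 5.4"]}
# Hypothetical single game: Opus 4.7 takes the larger share against 4.6.
update_elo(ratings, "Opus 4.7", "Opus 4.6", score_a=1.0)
print(ratings)
```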
We have three open-ended trading challenges on Optimization Arena that I thought would make for interesting benchmarks. However, after running them a few times through the Claude Code and Codex harnesses, I saw too much variance in scores to make a reliable report.
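To put a number on "too much variance": I looked at the spread of scores across repeated runs of the same challenge with the same model, roughly as sketched below. The scores are placeholders and the cutoff for "reliable" is a judgment call, but when the run-to-run spread rivals the model-to-model gap there is no signal worth reporting.

```python
import statistics


def run_to_run_spread(scores: list[float]) -> dict[str, float]:
    """Summarize variability across repeated runs of one challenge."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Placeholder scores from five runs of the same challenge and model.
print(run_to_run_spread([0.42, 0.61, 0.38, 0.55, 0.47]))
```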
I have heard from top contestants that 4.7 was unable to make additional progress on any top-tier solutions, which matches what I saw locally. Early impressions suggest this model is helpful for getting started but not a game changer for any of our trading problems.
On balance, this new model clearly outperforms its predecessor on well-structured tasks. It is too early to give a vibes assessment, and historically first impressions of a model release have not correlated well with the eventual consensus.
The thing I like most is the speed and smoothness with which it is being served. Maybe this is a first-day effect, maybe this model is genuinely easier to serve for some architectural reason, or maybe it just uses fewer tokens per unit of insight.
The thing I like least is the overly literal instruction following and subtle lack of humanity. In one case, while making a distribution chart for an article, I asked it for a figure caption. It wrote out a two-sentence caption that described what smoothing function was used to create the chart. This is obviously information a reader does not care about, and I suspect 4.6 would have handled this correctly.