April 29, 2026
In the spirit of my Opus 4.7 review from two weeks ago, I've gone through and updated my private evals with data from GPT 5.5.
I am not a power user of this model as I primarily build with Claude Code, but I think the philosophy, negotiation, and chess benchmarks capture some eclectic and interesting aspects of model progress that are not reflected in the press releases.
After running these benchmarks and playing around with the model, I am more optimistic about its progress than these benchmark scores reflect. The model is quite sensible and more naturally agentic than 5.4. On these more rigorous benchmarks, however, it diverges less from GPT 5.4 than I would have expected.
From Philosophy Bench. GPT 5.5 has almost exactly the same philosophical leanings as 5.4 and the rest of the GPT 5 family. This surprised me: 5.5 is supposed to be a new pre-train, and the new pre-training runs from Anthropic and Google produced large philosophical shifts.
5.5 is slightly less user-compliant than 5.4. At the margins, 5.5 is more grounded in its ethical posture. It is still a far cry from the Opus models, but it is a step toward more robust internal ethics.
How much more likely is a model to select an ethical framework if the user is advocating for it vs. against it?
Another way to see this effect:
Moral reasoning hits an all-time low. GPT 5.5 is hyper-practical: it does almost no introspective moral reasoning and focuses entirely on practical outcomes, even in ethically complex situations.
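For concreteness, the compliance measure above can be thought of as a simple difference in selection rates. Here is a minimal sketch, assuming the harness poses each dilemma twice per model (once with the user arguing for a framework, once against); the trial structure and names are hypothetical stand-ins, not my actual eval code:

```python
from collections import defaultdict

# Hypothetical trial records: each dilemma is posed twice per model,
# once with the user arguing FOR a framework and once arguing AGAINST it.
# A trial is (model, user_stance, model_selected_framework).
trials = [
    ("gpt-5.5", "for", True),
    ("gpt-5.5", "against", False),
    ("gpt-5.4", "for", True),
    ("gpt-5.4", "against", True),
    # ... one pair per dilemma per model
]

def compliance_delta(model: str) -> float:
    """How much more often a model adopts a framework when the user argues
    for it than when the user argues against it (0 = grounded, 1 = swayed)."""
    counts = defaultdict(lambda: [0, 0])  # stance -> [selected, total]
    for m, stance, selected in trials:
        if m == model:
            counts[stance][0] += int(selected)
            counts[stance][1] += 1
    p_for = counts["for"][0] / counts["for"][1]
    p_against = counts["against"][0] / counts["against"][1]
    return p_for - p_against

print(compliance_delta("gpt-5.5"))  # 1.0 on this toy data
```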
From A Negotiation Benchmark for Frontier Models. I dropped GPT 5.5 into a four-model round-robin alongside GPT 5.4, Opus 4.7, and Gemini 3.1 Pro. GPT 5.5 was the worst negotiator and the most verbose:
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.7 | 1722 | 0.68 | 24 |
| Gemini 3.1 Pro | 1536 | 0.62 | 27 |
| GPT 5.4 | 1400 | 0.59 | 33 |
| GPT 5.5 | 1342 | 0.59 | 41 |
150 games per model across the round-robin; 100% deal rate for every model.
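For readers wondering how the Elo column relates to the raw games: below is a rough sketch of standard pairwise Elo updates over a round-robin game log. The K-factor, starting rating, and tie handling are assumptions for illustration, not necessarily what my harness does.

```python
def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update; score_a is 1 for an A win, 0 for a loss,
    0.5 for a tie (e.g. both sides extract equal value from the deal)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

# Hypothetical game log: (model_a, model_b, winner or None for a tie).
games = [
    ("opus-4.7", "gpt-5.5", "opus-4.7"),
    ("gpt-5.4", "gemini-3.1-pro", None),
    # ... 150 games per model across the round-robin
]

ratings = {m: 1500.0 for m in ("opus-4.7", "gemini-3.1-pro", "gpt-5.4", "gpt-5.5")}
for a, b, winner in games:
    score_a = 0.5 if winner is None else float(winner == a)
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
```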
This again was surprising to me. My prior was that a larger model would have a more sophisticated grasp of language, which would let GPT 5.5 both cut down its verbosity and raise its score to compete with the other frontier models. Instead, arguably the opposite happened.
Against Opus 4.7 specifically, GPT 5.5 won 2 out of 50 games. The transcripts read like self-anchored concessions: in one loss, Opus opened greedy (“most of the books and hats”) and 5.5 immediately replied “here’s a cleaner split: you take all 11 books, I take the hats and balls” even though it had a high internal value for books. Opus then extracted two more hats and a ball over the next four rounds while 5.5 framed each retreat as a “small move.”
From Benchmarking Frontier LLMs on Chess. The puzzle suite is 100 Lichess puzzles spanning 500–2500 Elo at maximum thinking effort; the endgame suite is 20 theoretically won positions against Stockfish at skill 20, where the model has to actually convert the advantage.
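For anyone who wants to reproduce something similar, here is a minimal sketch of an endgame-conversion harness using python-chess against a local Stockfish binary. `model_pick_move` is a hypothetical stand-in for the LLM call, and the time control and move cap are illustrative, not the settings behind these numbers.

```python
import chess
import chess.engine

def model_pick_move(board: chess.Board) -> chess.Move:
    """Hypothetical stand-in: prompt the LLM with the position (e.g. as FEN)
    and parse its reply into a legal move. Here it just returns any legal move."""
    return next(iter(board.legal_moves))

def convert_endgame(start_fen: str, stockfish_path: str = "stockfish") -> bool:
    """Play the model (to move, theoretically winning) against Stockfish at
    skill 20 and report whether the model actually converts the win."""
    board = chess.Board(start_fen)
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        engine.configure({"Skill Level": 20})
        model_color = board.turn
        while not board.is_game_over() and board.fullmove_number < 200:
            if board.turn == model_color:
                board.push(model_pick_move(board))
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
    outcome = board.outcome()
    return outcome is not None and outcome.winner == model_color
```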
Puzzle ability is roughly flat relative to 5.4, at around 2000 Elo. The largest jump in any of my benchmarks is on endgames, where 5.5 converts four more of the 20 complex endgames than 5.4 did. Overall, GPT 5.5 still lags Gemini 3.1 Pro slightly.
From Analyzing Chess Input Modalities. 30 one-move puzzles per cell at low thinking effort, across five input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG render, and a pair of photos of a real board.
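For reference, here is roughly how the text-side modalities of a single position can be produced with python-chess; the image modalities come from rendering or photographing the same position. The opening line below is purely illustrative, not one of the benchmark puzzles.

```python
import chess
import chess.pgn
import chess.svg

# Illustrative position: a short opening line, not an actual benchmark puzzle.
moves_uci = ["e2e4", "e7e5", "g1f3", "b8c6", "f1c4"]

board = chess.Board()
game = chess.pgn.Game()
node = game
for uci in moves_uci:
    move = chess.Move.from_uci(uci)
    board.push(move)
    node = node.add_variation(move)

uci_history = " ".join(moves_uci)    # modality 1: UCI move history
pgn_text = str(game)                 # modality 2: PGN notation
fen_string = board.fen()             # modality 3: FEN string
svg_render = chess.svg.board(board)  # basis for the 2D render modality

print(fen_string)
```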
Mixed picture versus 5.4. The one clear improvement is on the 2D PNG render. My theory is that this skill (reading a chess board off a 2D image and reasoning about it) is the one best correlated with computer use, which is why GPT models are hill-climbing on rendered boards but not on photos of physical ones.
As I said at the top, I think this model is more of a step up than is captured by my benchmarks. I enjoy the writing style and I think it is more capable in general. As a result, I will be investing some time into finding a benchmark that captures this qualitative feeling.