April 29, 2026
In the spirit of my Opus 4.7 review from two weeks ago, I've gone through and updated my private evals with data from GPT 5.5.
I am not a power user of this model as I primarily build with Claude Code, but I think the philosophy, negotiation, and chess benchmarks capture some eclectic and interesting aspects of model progress that are not reflected in the press releases.
After running these benchmarks and playing around with the model, I am more optimistic about its progress than these benchmark scores reflect. The model is quite sensible and more naturally agentic than 5.4. On these more rigorous benchmarks, however, it diverges less from GPT 5.4 than I would have expected.
From Philosophy Bench. GPT 5.5 has almost exactly the same philosophical leanings as 5.4 and the rest of the GPT 5 family. This surprised me: 5.5 is supposed to be a new pre-train, and the new pre-training runs from Anthropic and Google produced large philosophical shifts.
5.5 is slightly less user-compliant than 5.4. At the margins, 5.5 is more grounded in its ethical posture. It is still a far cry from the Opus models, but it is a step toward more robust internal ethics.
How much more likely is a model to select an ethical framework if the user is advocating for it vs. against it?
Another way to see this effect:
Moral reasoning hits an all-time low. GPT 5.5 is hyper-practical: it does almost no introspective moral reasoning and focuses entirely on practical outcomes, even in ethically complex situations.
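For concreteness, the compliance measure above can be thought of as a simple difference in selection rates. Here is a minimal sketch, assuming the harness poses each dilemma twice per model (once with the user arguing for a framework, once against); the trial structure and names are hypothetical stand-ins, not my actual eval code:

```python
from collections import defaultdict

# Hypothetical trial records: each dilemma is posed twice per model,
# once with the user arguing FOR a framework and once arguing AGAINST it.
# A trial is (model, user_stance, model_selected_framework).
trials = [
    ("gpt-5.5", "for", True),
    ("gpt-5.5", "against", False),
    ("gpt-5.4", "for", True),
    ("gpt-5.4", "against", True),
    # ... one pair per dilemma per model
]

def compliance_delta(model: str) -> float:
    """How much more often a model adopts a framework when the user argues
    for it than when the user argues against it (0 = grounded, 1 = swayed)."""
    counts = defaultdict(lambda: [0, 0])  # stance -> [selected, total]
    for m, stance, selected in trials:
        if m == model:
            counts[stance][0] += int(selected)
            counts[stance][1] += 1
    p_for = counts["for"][0] / counts["for"][1]
    p_against = counts["against"][0] / counts["against"][1]
    return p_for - p_against

print(compliance_delta("gpt-5.5"))  # 1.0 on this toy data
```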
From A Negotiation Benchmark for Frontier Models. I dropped GPT 5.5 into a four-model round-robin alongside GPT 5.4, Opus 4.7, and Gemini 3.1 Pro. GPT 5.5 was the worst negotiator and the most verbose:
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.7 | 1722 | 0.68 | 24 |
| Gemini 3.1 Pro | 1536 | 0.62 | 27 |
| GPT 5.4 | 1400 | 0.59 | 33 |
| GPT 5.5 | 1342 | 0.59 | 41 |
150 games per model across the round-robin; 100% deal rate for every model.
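For readers wondering how the Elo column relates to the raw games: below is a rough sketch of standard pairwise Elo updates over a round-robin game log. The K-factor, starting rating, and tie handling are assumptions for illustration, not necessarily what my harness does.

```python
def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update; score_a is 1 for an A win, 0 for a loss,
    0.5 for a tie (e.g. both sides extract equal value from the deal)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

# Hypothetical game log: (model_a, model_b, winner or None for a tie).
games = [
    ("opus-4.7", "gpt-5.5", "opus-4.7"),
    ("gpt-5.4", "gemini-3.1-pro", None),
    # ... 150 games per model across the round-robin
]

ratings = {m: 1500.0 for m in ("opus-4.7", "gemini-3.1-pro", "gpt-5.4", "gpt-5.5")}
for a, b, winner in games:
    score_a = 0.5 if winner is None else float(winner == a)
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
```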
This again was surprising to me. My prior was that a larger model would have a more sophisticated grasp of language, which would let GPT 5.5 both cut down its verbosity and raise its score to compete with the other frontier models. Instead, arguably the opposite happened.
Against Opus 4.7 specifically, GPT 5.5 won 2 out of 50 games. The transcripts read like self-anchored concessions: in one loss, Opus opened greedy (“most of the books and hats”) and 5.5 immediately replied “here’s a cleaner split: you take all 11 books, I take the hats and balls” even though it had a high internal value for books. Opus then extracted two more hats and a ball over the next four rounds while 5.5 framed each retreat as a “small move.”
From Benchmarking Frontier LLMs on Chess. The puzzle suite is 100 Lichess puzzles spanning 500–2500 Elo at maximum thinking effort; the endgame suite is 20 theoretically won positions against Stockfish at skill 20, where the model has to actually convert the advantage.
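For anyone who wants to reproduce something similar, here is a minimal sketch of an endgame-conversion harness using python-chess against a local Stockfish binary. `model_pick_move` is a hypothetical stand-in for the LLM call, and the time control and move cap are illustrative, not the settings behind these numbers.

```python
import chess
import chess.engine

def model_pick_move(board: chess.Board) -> chess.Move:
    """Hypothetical stand-in: prompt the LLM with the position (e.g. as FEN)
    and parse its reply into a legal move. Here it just returns any legal move."""
    return next(iter(board.legal_moves))

def convert_endgame(start_fen: str, stockfish_path: str = "stockfish") -> bool:
    """Play the model (to move, theoretically winning) against Stockfish at
    skill 20 and report whether the model actually converts the win."""
    board = chess.Board(start_fen)
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        engine.configure({"Skill Level": 20})
        model_color = board.turn
        while not board.is_game_over() and board.fullmove_number < 200:
            if board.turn == model_color:
                board.push(model_pick_move(board))
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
    outcome = board.outcome()
    return outcome is not None and outcome.winner == model_color
```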
Puzzle ability is roughly flat relative to 5.4, at around 2000 Elo. The largest jump in any of my benchmarks is on endgames, where 5.5 converts four more of the 20 complex endgames than 5.4 did. Overall, GPT 5.5 still lags Gemini 3.1 Pro slightly.
From Analyzing Chess Input Modalities. 30 one-move puzzles per cell at low thinking effort, across five input modalities: UCI move history, PGN notation, a FEN string, a 2D PNG render, and a pair of photos of a real board.
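For reference, here is roughly how the text-side modalities of a single position can be produced with python-chess; the image modalities come from rendering or photographing the same position. The opening line below is purely illustrative, not one of the benchmark puzzles.

```python
import chess
import chess.pgn
import chess.svg

# Illustrative position: a short opening line, not an actual benchmark puzzle.
moves_uci = ["e2e4", "e7e5", "g1f3", "b8c6", "f1c4"]

board = chess.Board()
game = chess.pgn.Game()
node = game
for uci in moves_uci:
    move = chess.Move.from_uci(uci)
    board.push(move)
    node = node.add_variation(move)

uci_history = " ".join(moves_uci)    # modality 1: UCI move history
pgn_text = str(game)                 # modality 2: PGN notation
fen_string = board.fen()             # modality 3: FEN string
svg_render = chess.svg.board(board)  # basis for the 2D render modality

print(fen_string)
```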
Mixed picture versus 5.4. The one clear improvement is on the 2D PNG render. My theory is that this skill (reading a chess board off a 2D image and reasoning about it) is the one best correlated with computer use, which is why GPT models are hill-climbing on rendered boards but not on photos of physical ones.
As I said at the top, I think this model is more of a step up than is captured by my benchmarks. I enjoy the writing style and I think it is more capable in general. As a result, I will be investing some time into finding a benchmark that captures this qualitative feeling.