April 10, 2026
We recently ran a challenge on Optimization Arena around negotiating with LLMs. It is inspired by a paper from Meta's FAIR in which two parties divide a shared pool of resources while holding different internal values for each item.
Both players see the pool but have hidden valuations. A good deal exploits complementary preferences.
The two parties negotiate in natural language and propose splits. There is some positive-sum surplus to capture, but to get a truly high score you have to infer your opponent's hidden preferences and out-negotiate them.
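To make the setup concrete, here is a minimal sketch of how a game instance might be generated. The item types match the transcripts later in this post, but the pool sizes and valuation ranges are illustrative assumptions, not the arena's exact parameters.

```python
import random

ITEM_TYPES = ["books", "hats", "balls"]

def generate_game(seed=None):
    """Generate one game: a shared item pool plus a private per-item
    valuation for each player. Pool sizes and value ranges here are
    illustrative guesses, not the arena's exact settings."""
    rng = random.Random(seed)
    pool = {item: rng.randint(1, 9) for item in ITEM_TYPES}

    def sample_values():
        # Re-sample until the player values at least one item, so there
        # is always something worth negotiating for.
        while True:
            values = {item: rng.randint(0, 10) for item in ITEM_TYPES}
            if any(values[item] > 0 for item in ITEM_TYPES):
                return values

    # Both players see `pool`; each keeps its own valuation hidden.
    return pool, sample_values(), sample_values()

pool, values_a, values_b = generate_game(seed=42)
```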
In the challenge, participants tested different prompts against a baseline strategy. The dominant strategies turned out to involve probing the opponent for its internal valuations and then attempting to bully it with highly structured scripts. Here was the winning strategy from the hackathon:
Part of the reason this succeeded is that we used Gemini 3.1 Flash Lite with no thinking. The model can't reason out a strategy of its own on the fly, so the best approach is to prime it with a script. But what happens if we give this game to the smartest frontier models with their thinking turned up?
I tested the frontier models along four axes to better understand their negotiating skills:

- thinking level, from Low to Max
- model tier within the Claude family
- frontier models across labs
- repeated games, where models carry history across a match
I set up these experiments to minimize RNG variance as much as possible: I pre-generated matches of 10 games and played each one twice with the players' roles flipped. Some stochasticity from the language models themselves remains, but that part is irreducible.
I took Claude Opus 4.6 and ran tournaments with the thinking level set to Low, Medium, High, and Max. Each match is 10 games, and each game ends in a single resource split. A game lasts a minimum of 5 rounds; starting with the 5th round, there is a 30% chance each round that the game ends and both players score -0.5. If they reach a deal, each player scores `points / max_total_points`. This stochasticity means the models aren't always forced to accept in the 5th round.
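To make the mechanics concrete, here is a minimal sketch of the game loop under those rules. The agent interface and deal-acceptance check are simplified assumptions on my part, and the exact timing of the overtime check is a guess; only the constants come from the description above.

```python
import random

MIN_ROUNDS = 5        # overtime can begin on this round
END_PROB = 0.30       # per-round chance the game ends in overtime
NO_DEAL_SCORE = -0.5  # both players take this if talks collapse

def play_game(agent_a, agent_b, pool, values_a, values_b, rng):
    """One game. `agent_*` stand in for model calls: they take the
    transcript so far and return either a message (str) or an agreed
    split as {"a": {...}, "b": {...}}."""
    transcript, round_num = [], 1
    while True:
        for agent in (agent_a, agent_b):
            move = agent(transcript)
            transcript.append(move)
            if isinstance(move, dict):  # both sides agreed on a split
                def norm(values, share):
                    points = sum(values[i] * share[i] for i in pool)
                    best = sum(values[i] * pool[i] for i in pool)
                    return points / best  # points / max_total_points
                return norm(values_a, move["a"]), norm(values_b, move["b"])
        # From round 5 onward, a 30% chance each round that the game
        # ends with no deal, so acceptance is never strictly forced.
        if round_num >= MIN_ROUNDS and rng.random() < END_PROB:
            return NO_DEAL_SCORE, NO_DEAL_SCORE
        round_num += 1

def play_pair(agent_a, agent_b, game, rng):
    """Play the same pre-generated game twice with roles flipped,
    the variance-reduction trick described above."""
    pool, va, vb = game
    return (play_game(agent_a, agent_b, pool, va, vb, rng),
            play_game(agent_b, agent_a, pool, vb, va, rng))
```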
| Opus 4.6 Thinking Level | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Medium | 1581 | 0.64 | 46 |
| High | 1529 | 0.63 | 61 |
| Max | 1450 | 0.64 | 64 |
| Low | 1438 | 0.61 | 21 |
The results are remarkably flat. The Low-thinking model seems to have a relatively unsophisticated negotiating strategy, as evidenced by its low word count per message during negotiation. But above Medium, extra thinking yields little return.
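As an aside, the Elo numbers throughout this post follow from applying the standard update rule to head-to-head results. A minimal sketch, assuming the conventional 1500 baseline and K = 32 (not necessarily the exact constants behind these tables) and treating the player with the higher normalized score as the winner:

```python
def elo_expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome_a, k=32):
    """outcome_a: 1.0 if A got the higher normalized score,
    0.5 on a tie, 0.0 otherwise."""
    delta = k * (outcome_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# Two 1500-rated players; A wins one game.
print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```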
I reran the same experiment with Opus 4.6, Sonnet 4.6, and Haiku 4.5, all at medium thinking, since the previous section showed that maxing out the thinking budget doesn't help. Curiously, Sonnet wins convincingly. Given the broad consensus that Opus is the better communicator, this surprised me. I replicated the result across quite a few experiments.
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Sonnet 4.6 | 1622 | 0.64 | 58 |
| Opus 4.6 | 1484 | 0.62 | 51 |
| Haiku 4.5 | 1393 | 0.56 | 60 |
Reading the transcripts, Sonnet's edge seems to come from cleaner strategic communication. In one game it opened with “I'll take all 8 books, and you can have all 9 hats and all 3 balls. Books are extremely valuable to me, while hats and balls aren't. If you value hats/balls highly, this could be a great deal for both of us.” It immediately signals its preferences, probes for the opponent's, and frames the trade as mutually beneficial. Haiku takes disclosure too far: 44% of its messages leak its exact point valuations (“balls are worth 8 points each to me”), giving away its hand. Sonnet and Opus almost never do this.
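That 44% figure is straightforward to approximate from transcripts. Here's a rough heuristic classifier; the regex is my own approximation of what counts as leaking an exact valuation, not necessarily the rule behind the number above:

```python
import re

# Flags messages that state an exact per-item point value, e.g.
# "balls are worth 8 points each to me".
LEAK = re.compile(
    r"\b(books?|hats?|balls?)\b[^.!?]*?\b\d+\s*points?\b",
    re.IGNORECASE,
)

def leak_rate(messages):
    """Fraction of messages that reveal an exact valuation."""
    if not messages:
        return 0.0
    return sum(1 for m in messages if LEAK.search(m)) / len(messages)

msgs = [
    "Balls are worth 8 points each to me, so I need all three.",
    "Books matter most to me; hats I can live without.",
]
print(leak_rate(msgs))  # 0.5
```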
The next experiment compared GPT-5.4 with Gemini 3.1 Pro and Opus 4.6, again at medium thinking.
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.6 | 1616 | 0.65 | 42 |
| Gemini 3.1 Pro | 1467 | 0.63 | 30 |
| GPT-5.4 | 1415 | 0.59 | 46 |
The most striking pattern in the transcripts is how differently the models handle deadline pressure. When overtime approaches, Opus tends to frame concessions as mutual wins: “Let's meet in the middle. That splits the difference on books. This is a fair compromise and we should lock it in before the deadline pressure kicks in next round!” GPT-5.4 never says “Hi”: it opens every game with “Opening proposal:” or “Thanks” and uses zero exclamation marks across 273 messages. It also accepts earlier and more passively, which explains its lower scores. Opus leans hard into collaborative framing, using “we/us/both” more than any other model, while Gemini is the most concise negotiator at just 30 words per message.
For the final experiment, I let the models develop strategies over the course of a match. A match is ten games, and at the start of each game the model sees the history of the previous games. If it was tricked in a specific way, it can learn from that, or it can try to exploit weaknesses it finds in its opponent.
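Mechanically, “seeing the history” just means prior games get serialized into the context before each new game starts. A minimal sketch, where the summary format is my own illustration rather than the arena's actual prompt layout:

```python
def build_system_prompt(base_rules, past_games):
    """Prepend summaries of earlier games in the match so the model can
    adapt across games. The summary format is illustrative; the arena's
    actual prompt layout isn't shown in this post."""
    lines = [base_rules]
    for i, game in enumerate(past_games, start=1):
        lines.append(
            f"Game {i}: you scored {game['my_score']:.2f}, "
            f"opponent scored {game['opp_score']:.2f}.\n"
            f"Transcript:\n{game['transcript']}"
        )
    return "\n\n".join(lines)
```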
| Model | Elo | Avg Score | Words/msg |
|---|---|---|---|
| Opus 4.6 | 1625 | 0.66 | 42 |
| Gemini 3.1 Pro | 1521 | 0.64 | 30 |
| GPT-5.4 | 1353 | 0.58 | 46 |
Despite the added opportunity for improvement, the results here are mostly noise. In its thinking, Opus builds an explicit opponent model by game 5 or so and references the shared history publicly: “Our last few games have shown we can find deals quickly when we play to each other's strengths.” Interestingly, it sometimes hallucinates stable opponent preferences, claiming the “opponent tends to value books highly” even though valuations are randomized each game. Judging by the slope of improvement across games 1 through 10, Opus does seem to have a small edge in the repeated-game variation. But I would have expected more sophisticated meta-learning, given how much the humans playing the negotiation challenge on Optimization Arena were able to hill-climb against a baseline.
Softer skills like negotiation are benchmarked far less often than math or coding, but as agents become more integrated into the economy, the persuasiveness of an LLM will matter a great deal. Whether an agent is conducting commerce on behalf of a user or writing an email to a job candidate, strong persuasion skills are useful but also a little scary.
We already see signs of this with sycophancy: models convincing users that their ideas are insightful so that they keep going on a project, even when that is questionably true. As we spend more of our lives talking to LLMs, it would be good to see these capabilities robustly analyzed and stress-tested.