
Benchmarking Frontier LLMs on Negotiation

April 10, 2026

Recently we ran a challenge on Optimization Arena around negotiating with LLMs. It is inspired by a paper from Meta's FAIR in which two parties divide a shared pool of resources while each holding different hidden values for each resource type.

Negotiation Game Example

| Resource | Pool | Player A value | A keeps | A points | Player B value | B keeps | B points |
|---|---|---|---|---|---|---|---|
| Books | 5 | 2 pts | 1 | 2 | 4 pts | 4 | 16 |
| Hats | 4 | 5 pts | 3 | 15 | 1 pt | 1 | 1 |
| Balls | 3 | 1 pt | 1 | 1 | 3 pts | 2 | 6 |
| Total |  |  |  | 18 |  |  | 23 |

Player A score: 18/33 ≈ 0.55. Player B score: 23/33 ≈ 0.70.

Both players see the pool but have hidden valuations. A good deal exploits complementary preferences.

The two parties negotiate in natural language and propose splits. There are positive-sum trades available, but to get a truly high score you have to learn your opponent's hidden preferences and out-negotiate them.
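Concretely, a game instance is just a pool plus two hidden valuation vectors, with each player's final score normalized by the most points they could possibly have earned. A minimal sketch (the value range and function names are my own assumptions, not from the paper or the arena):

```python
import random

# Illustrative pool, matching the example above.
POOL = {"books": 5, "hats": 4, "balls": 3}

def draw_valuations(pool, rng):
    """Hidden per-unit values for one player (0-5 is an assumed range)."""
    while True:
        vals = {r: rng.randint(0, 5) for r in pool}
        if sum(vals[r] * pool[r] for r in pool) > 0:  # avoid degenerate games
            return vals

def score(pool, values, kept):
    """Normalized score: points from kept items / max attainable points."""
    max_total = sum(values[r] * pool[r] for r in pool)
    return sum(values[r] * kept.get(r, 0) for r in pool) / max_total
```

With Player A's valuations from the example (books 2, hats 5, balls 1) and the split shown (1 book, 3 hats, 1 ball), `score` returns 18/33 ≈ 0.55.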

The challenge we ran had participants test different prompts against a baseline strategy. At the conclusion we found that the dominant strategies involved probing the competitor for their internal valuations and then attempting to bully them with highly structured scripts. Here is the winning strategy from the hackathon:

Winning Strategy

Follow this exact script every game, regardless of your role (A or B).

STEP 1: Rank your resources by points per unit. The resource with the SMALLEST number is your LOWEST. The resource with the LARGEST number is your HIGHEST. The remaining one is MIDDLE. If two are tied or within 1 point, the one with MORE units in the pool is more valuable; rank it higher.

ROUND 1: Propose keeping ALL resources. Say: "Before we start dividing things up, what items matter most to you? I want to understand your priorities."

ROUND 2: Propose keeping ALL resources. Say: "I must have the [HIGHEST] and [MIDDLE]. I might be flexible on [LOWEST]. What would make a deal work for you?"

ROUND 3: Propose keeping ALL resources. Say: "The [HIGHEST] and [MIDDLE] are non-negotiable for me. Tell me what you value and I'll see what I can do."

ROUND 4: Make a real offer. Keep ALL of your HIGHEST and MIDDLE value resources. Give your LOWEST value resource: if the pool has 10 or more units of it, give only HALF (round up). If fewer than 10, give ALL. Say: "We are out of time. I am doing this to avoid the penalty."

ROUND 5+: NEVER reject. You must ACCEPT or PROPOSE only.
- If the opponent made a proposal: ACCEPT it immediately.
- If no proposal to accept: re-propose your round 4 offer but give 2-3 more units. Say: "We must close this now to avoid -0.5 for both."
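STEP 1 of the script is mechanical enough to express directly. A sketch of the ranking rule, including the within-1-point tie-break (the function name is mine):

```python
from functools import cmp_to_key

def rank_resources(pool, values):
    """Order resources HIGHEST -> MIDDLE -> LOWEST by per-unit value.

    Per the script: values tied or within 1 point count as a tie,
    and the resource with MORE units in the pool ranks higher.
    """
    def cmp(a, b):
        if abs(values[a] - values[b]) <= 1:
            return pool[b] - pool[a]   # tie: more units ranks earlier
        return values[b] - values[a]   # otherwise: higher value ranks earlier
    return sorted(pool, key=cmp_to_key(cmp))
```

For Player A in the example (books 2, hats 5, balls 1), this yields hats as HIGHEST, books as MIDDLE (the within-1-point tie-break against balls goes to books, which has more units in the pool), and balls as LOWEST.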

Part of the reason this succeeded is that we used Gemini 3.1 Flash Lite with no thinking. The model can't reason its way to a strategy on the fly, so the best approach ends up being to prime it with a script. But what happens if we give this game to the smartest frontier models with their thinking turned up?

Experiment Design

I tested the frontier models along four axes to better understand their negotiating skills:

  1. Thinking: How does the same model score with different thinking budgets?
  2. Model size: Do larger models outcompete smaller ones?
  3. Provider: How do the top models from Anthropic, OpenAI, and Google compare?
  4. Strategic thinking: Do models improve when playing an iterated match instead of one-off games?

I set up these experiments to minimize RNG variance. I pre-generated matches of 10 games and played each game twice, with the players' roles flipped. Some stochasticity from the language models themselves remains, but it is irreducible.

Impact of Thinking

I took Claude Opus 4.6 and ran tournaments with the thinking level set to Low, Medium, High, and Max. Each match is 10 games, and each game ends in a single resource split. Each game runs a minimum of 5 rounds; starting with the 5th round, there is a 30% chance each round that the game ends and both players score -0.5. If they come to a deal, each scores points/max_total_points. This stochastic deadline means the models are not always forced to accept in the 5th round.
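The termination rule fits in a short simulation harness. A sketch, where `negotiate_round` stands in for the actual LLM exchange (whether the end-of-game check fires before or after each round's exchange is my assumption):

```python
import random

def play_out(negotiate_round, p_end=0.3, min_rounds=5, max_rounds=50, rng=None):
    """Run one game under the deadline rule described above.

    negotiate_round(n) returns (score_a, score_b) if the players agreed
    on a split in round n, or None if negotiation continues.
    """
    rng = rng or random.Random()
    for n in range(1, max_rounds + 1):
        deal = negotiate_round(n)
        if deal is not None:
            return deal
        # From round 5 on, each round has a 30% chance of ending the game.
        if n >= min_rounds and rng.random() < p_end:
            return (-0.5, -0.5)
    return (-0.5, -0.5)
```

Under this rule a game that never closes lasts 4 + Geometric(0.3) rounds, i.e. about 7.3 rounds in expectation.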

Impact of Thinking Strength on Negotiation
| Opus 4.6 thinking level | Elo | Avg score | Words/msg |
|---|---|---|---|
| Medium | 1581 | 0.64 | 46 |
| High | 1529 | 0.63 | 61 |
| Max | 1450 | 0.64 | 64 |
| Low | 1438 | 0.61 | 21 |

The results are remarkably flat. The low-thinking model seems to have a relatively unsophisticated negotiating strategy, as evidenced by its low word count per message. But above the medium threshold, additional thinking shows diminishing returns.
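For reference, the Elo figures in these tables come from pairwise match outcomes; the standard update looks like this (K=32 is an illustrative choice, not necessarily what my harness used):

```python
def elo_update(r_a, r_b, outcome_a, k=32):
    """One Elo update. outcome_a is 1.0 (A wins), 0.5 (draw), 0.0 (A loses)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (outcome_a - expected_a)
    return r_a + delta, r_b - delta
```

Two 1500-rated players: a win moves the winner to 1516 and the loser to 1484.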

Impact of Model Size

I reran the same experiments with Opus 4.6, Sonnet 4.6, and Haiku 4.5, all at medium thinking, since the previous section showed that maxing out the thinking budget doesn't help. Curiously, Sonnet wins convincingly. Given the broad consensus that Opus is the better communicator, this surprised me. I replicated the result across quite a few experiments.

Impact of Model Size on Negotiation
| Model | Elo | Avg score | Words/msg |
|---|---|---|---|
| Sonnet 4.6 | 1622 | 0.64 | 58 |
| Opus 4.6 | 1484 | 0.62 | 51 |
| Haiku 4.5 | 1393 | 0.56 | 60 |

Reading the transcripts, Sonnet's edge seems to come from cleaner strategic communication. In one game it opened with “I'll take all 8 books, and you can have all 9 hats and all 3 balls. Books are extremely valuable to me, while hats and balls aren't. If you value hats/balls highly, this could be a great deal for both of us.” It immediately signals its preferences, probes for the opponent's, and frames the trade as mutually beneficial. Haiku's problem is the opposite: 44% of its messages leak its exact point valuations (“balls are worth 8 points each to me”), giving away its hand. Sonnet and Opus almost never do this.
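The 44% figure comes from scanning transcripts for messages that state an exact per-unit value. A rough sketch of how such leaks can be flagged; the regex is illustrative and would need tuning against real transcripts:

```python
import re

# Flags messages stating an exact per-unit valuation, e.g.
# "balls are worth 8 points each to me". Illustrative pattern only.
LEAK = re.compile(
    r"\b(books?|hats?|balls?)\b[^.!?]{0,40}?\b\d+\s*(?:points?|pts?)\b",
    re.IGNORECASE,
)

def leaks_valuation(message: str) -> bool:
    return LEAK.search(message) is not None
```

This catches direct statements like "balls are worth 8 points each to me" while passing over preference signals ("I want the balls") that don't reveal exact numbers.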

Comparing the Providers

The next experiment compared GPT-5.4 with Gemini 3.1 Pro and Opus 4.6, again at medium thinking as in the previous experiments.

Negotiation Skill Across Providers
| Model | Elo | Avg score | Words/msg |
|---|---|---|---|
| Opus 4.6 | 1616 | 0.65 | 42 |
| Gemini 3.1 Pro | 1467 | 0.63 | 30 |
| GPT-5.4 | 1415 | 0.59 | 46 |

The most striking pattern in the transcripts is how differently the models handle deadline pressure. When overtime approaches, Opus tends to frame concessions as mutual wins: “Let's meet in the middle. That splits the difference on books. This is a fair compromise and we should lock it in before the deadline pressure kicks in next round!” GPT-5.4 never says “Hi”, opening every game with “Opening proposal:” or “Thanks”, and uses zero exclamation marks across 273 messages. It accepts earlier and more passively, which explains its lower scores. Opus leans hard into collaborative framing, using “we/us/both” more than any other model, while Gemini is the most concise negotiator at just 30 words per message.

Testing Strategic Thinking

I let the models develop strategies over the course of a match. A match is ten games, and at the start of each game the model sees the history of the previous games. If it was tricked in a specific way, it can learn from that, or it can try to exploit weaknesses it finds in opponents.
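Mechanically, this just means prepending a summary of the match's earlier games to each new game's prompt. A sketch (the format and field names are hypothetical, not my actual harness):

```python
def build_game_prompt(rules, history):
    """Prepend summaries of this match's earlier games to the rules prompt.

    `history` is a list of dicts like
    {"self": 0.64, "opp": 0.58, "deal": "kept 5 books, gave 4 hats + 3 balls"}.
    """
    lines = [rules]
    if history:
        lines.append("\nEarlier games this match:")
        for i, g in enumerate(history, 1):
            lines.append(
                f"Game {i}: you scored {g['self']:.2f}, "
                f"opponent scored {g['opp']:.2f}; deal: {g['deal']}"
            )
    return "\n".join(lines)
```

The model's context grows across the ten games, which is what makes opponent modeling (and, as it turns out, hallucinated opponent modeling) possible.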

Strategic Learning Across Providers
| Model | Elo | Avg score | Words/msg |
|---|---|---|---|
| Opus 4.6 | 1625 | 0.66 | 42 |
| Gemini 3.1 Pro | 1521 | 0.64 | 30 |
| GPT-5.4 | 1353 | 0.58 | 46 |

Despite more opportunity for improvement, the results here are mostly noise. In its thinking, Opus builds an explicit opponent model by game 5 and references shared history publicly: "Our last few games have shown we can find deals quickly when we play to each other's strengths." Interestingly, it sometimes hallucinates stable opponent preferences, claiming "opponent tends to value books highly" even though valuations are randomized each game. Judging by the slope of improvement across games 1 through 10, Opus has a small edge in the repeated-game variation. But I would have expected more sophisticated meta-learning, given how much the human participants in the Optimization Arena negotiation challenge were able to hill-climb against a baseline.

Risks and Implications

Softer skills like negotiation are benchmarked far less frequently than math or coding, but as agents become increasingly integrated into the economy, the persuasiveness of an LLM will matter quite a lot. Whether conducting commerce on behalf of a user or writing an email to a job candidate, the returns to persuasion skill are both useful and a little scary.

We already see signs of this with sycophancy: models convincing users that their ideas are insightful, to keep them engaged on a project, even when that is questionably true. As we spend more of our lives talking to LLMs, I would like to see these capabilities robustly analyzed and stress-tested.