
Learnings from Optimization Arena

This article is adapted from a talk I gave at the Paradigm Automated Research Hackathon. Thanks to Dan Robinson for helping build and run Optimization Arena, and to Josu San Martin, Adrian le Bas, and the rest of the autoresearch community for sharing insights about what it takes to win these challenges.

April 9, 2026

We launched Optimization Arena in February as a platform to explore the future of research, where models and humans work together to solve complex problems. After running a series of competitions with well over 1000 players and nearly 20k submissions, we've observed that skill on these challenges transfers across domains, and that smart, determined generalists with a large token budget tend to eventually outcompete actual domain experts. This suggests that agent orchestration and harness building is a transferable skill that can be practiced. Below is an explanation of how the competition works, the tricks that I've seen work, and what I learned from the top solutions.

How Optimization Arena Works

Every challenge has an objective distilled to one number (e.g. make the most money in this trading simulation, or write a kernel with the fewest clock cycles). The user is given a GitHub repo with an environment that mirrors the submission environment; often we vary something small, like the random seeds selected for the simulation. The user then submits a block of code and we run it in a controlled environment. Humans and LLMs work well together on these challenges because humans give high-level guidance and LLMs can trivially implement the algorithms to test.
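To make this concrete, here is a minimal sketch of that shape; the function names, the toy objective, and the seed split are all hypothetical stand-ins, not taken from any real challenge. A verifier distills a submission to one number while holding back its own seeds:

```python
import random
import statistics

def run_simulation(strategy, seed):
    """Toy stand-in for a challenge environment: score how close the
    strategy's parameter lands to a hidden, seed-dependent optimum."""
    rng = random.Random(seed)
    target = rng.gauss(0.5, 0.1)             # the optimum varies per seed
    return -abs(strategy["param"] - target)  # closer to 0 is better

def score_submission(strategy, seeds):
    """Distill the objective to one number: the mean score across seeds.
    The public repo would ship different seeds than the grader uses."""
    return statistics.mean(run_simulation(strategy, s) for s in seeds)

public_score = score_submission({"param": 0.5}, seeds=range(10))
hidden_score = score_submission({"param": 0.5}, seeds=range(100, 110))
```

Varying only the seeds keeps the local environment an honest mirror of the grader while making it impossible to hard-code the winning answer.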

In practice, if you leave the execution environment unconstrained, someone will submit a block of code that breaks the rules (often unintentionally; the agent just does it autonomously). Because of this, we have found that blockchain VMs actually work surprisingly well as environments, because they aggressively constrain the kinds of execution that can be run. Other formats that work are raw JSON, ONNX files, a stripped-down competition subset of Python, or even plain text.
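As a sketch of the most aggressive version of this, a raw-JSON submission format makes rule-breaking structurally impossible because the grader only ever accepts data, never code. The schema and parameter names below are hypothetical:

```python
import json

# Hypothetical whitelist: the only knobs a submission may set, with bounds.
SCHEMA = {"base_fee_bps": (0, 500), "skew_gain": (0.0, 10.0)}

def load_submission(raw):
    """Parse a JSON submission, rejecting unknown or out-of-range
    parameters. Nothing executable can slip through this interface."""
    params = json.loads(raw)
    for key, value in params.items():
        if key not in SCHEMA:
            raise ValueError(f"unknown parameter: {key}")
        lo, hi = SCHEMA[key]
        if not lo <= value <= hi:
            raise ValueError(f"{key}={value} outside [{lo}, {hi}]")
    return params

params = load_submission('{"base_fee_bps": 30, "skew_gain": 2.5}')
```

The tradeoff is expressiveness: JSON constrains the search space as hard as a blockchain VM does, but only works when the strategy space can be parameterized up front.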

Tricks to Succeeding with Automated Research

After building challenges, competing, and talking to top participants, I've observed a few patterns that generalize across challenges.

Identify the Harness Bottleneck

When you run a system in a hot loop, some stage will be the bottleneck. You desperately want this to be the model's thinking phase. If your harness spends 5% of its time thinking and 95% running a single-threaded CPU-bound loop, you will do much worse than if it spends 95% of its time thinking and farming out experiments.

Experiment Pipeline
[Figure: three harness configurations, from bad to best, showing the share of time spent thinking vs. running the verifier]

Try to run as many experiments as possible

The practical implication is that everything surrounding the thinking step needs to be fast and parallelizable. Farm out evaluation to GPU and CPU clusters. Have the agent generate multiple hypotheses, run them in parallel, and then analyze the aggregate results. The model should always be the limiting factor.
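A minimal sketch of that shape (the toy objective and worker count are stand-ins): batch the model's hypotheses and fan them out in parallel so evaluation never serializes behind a single core:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(candidate):
    """Stand-in for an expensive verifier run; in a real harness this
    would dispatch to a subprocess or a remote CPU/GPU worker."""
    return -(candidate - 3) ** 2  # toy objective with its optimum at 3

def run_batch(candidates):
    """Fan a batch of hypotheses out in parallel, then return the best
    (score, candidate) pair for the model to analyze in aggregate."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(evaluate, candidates))
    return max(zip(scores, candidates))

best_score, best_candidate = run_batch([1, 2, 3, 4, 5])  # → (0, 3)
```

The point is the interface: the model proposes a whole batch per thinking step and gets back aggregate results, rather than blocking on one experiment at a time.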

This gets even more interesting as automated research starts to move beyond software (a hot trend recently among startups). If you are running materials science or lab experiments, one of the main engineering hurdles is building a scalable, parallelizable laboratory so that your agent gets maximum information throughput. As we push into more complex domains, experiment throughput will become a major limit on research progress.

Inner Loop vs. Outer Loop

Language models have trouble with long-horizon introspection due to context limitations. Because of this, a popular pattern is to build an actor agent and a critic agent. The inner loop is the optimization process: think about the problem, propose a candidate, evaluate it, and repeat. The outer loop is about improving how that inner loop operates. The most elegant version of this comes from a research team at Stanford. Anthropic also writes about this pattern in their article on building effective agents.

Outer Loop Optimization
[Figure: a critic agent watches the thinking/verifier loop and intervenes, e.g. "collect more data during verification" or "cache this repeated operation"]

The critic agent watches from a distance and periodically improves the process

I have yet to see an implementation of this that works out of the box across many problems, and in practice one of the most effective critics given current model capabilities is a smart human watching the agent reason and suggesting improvements to its research process.
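Stripped of the model itself, the actor/critic shape can be sketched like this; the stagnation rule and step-size knob are illustrative stand-ins for whatever interventions a real critic (or a watching human) would make:

```python
import random

def inner_loop(propose, evaluate, start, steps):
    """Inner loop: propose a candidate near the current best, evaluate
    it, keep it if it improves. This is the optimization process itself."""
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def outer_loop(evaluate, rounds=10, steps=30):
    """Outer loop: a (very dumb) critic adjusts *how* the inner loop
    searches -- here just its step size -- whenever progress stalls."""
    step_size = 1.0
    best, best_score = 0.0, evaluate(0.0)
    for _ in range(rounds):
        propose = lambda cur: cur + random.uniform(-step_size, step_size)
        cand, cand_score = inner_loop(propose, evaluate, best, steps)
        if cand_score > best_score:
            best, best_score = cand, cand_score
        else:
            step_size *= 2  # intervention: search stagnated, widen exploration
    return best, best_score

best, score = outer_loop(lambda x: -(x - 7) ** 2)
```

In a real harness the critic's levers are richer (rewriting prompts, changing what gets logged, caching repeated work), but the structure is the same: the outer loop edits the inner loop's process, not its candidates.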

Optimize for Breakthroughs

I like to think about hill climbing these challenges as similar to evolutionary algorithms. You can divide score jumps into minor optimizations and breakthroughs. The model has a strong tendency to get stuck at local optima and then start to grind out small edges. Often you will leave the model running overnight and it will spend six hours sweeping hyperparameters that barely matter.

Breakthroughs Drive Outcomes
[Figure: score vs. time, with long flatlines punctuated by breakthrough jumps]

Most progress comes from a few key breakthroughs, not the grind between them

Because of this, it is important to constantly nudge the system and introduce entropy. Some ideas I have seen work well are letting multiple agents talk to each other, inserting literature review and web search steps, asking the model to introspect, and explaining to it concepts like AlphaEvolve or genetic algorithms to encourage it to look for bigger leaps. However, without new sources of entropy, the harness will inevitably flatline.
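A toy version of the evolutionary framing (the objective and constants are mine): mostly small mutations, with periodic entropy injection so the population can jump between hills instead of grinding forever on the first one it finds:

```python
import random

def evolve(evaluate, generations=200, pop_size=20, entropy_every=25):
    """Toy evolutionary loop. Most generations apply small mutations
    (grinding out minor optimizations); every `entropy_every` generations
    a fresh random candidate is injected so the search can escape a
    local optimum."""
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for gen in range(generations):
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: pop_size // 2]                          # keep the top half
        children = [p + random.gauss(0, 0.1) for p in parents]  # minor tweaks
        if gen % entropy_every == 0:
            children[-1] = random.uniform(-10, 10)  # entropy: a wild new hypothesis
        pop = parents + children
    return max(pop, key=evaluate)

# A deceptive objective: a small local hill at x = -5, the real peak at x = 5.
objective = lambda x: max(1 - abs(x + 5), 3 - abs(x - 5), 0)
best = evolve(objective)
```

With only the small mutations, a population that starts on the wrong hill stays there; the injected candidates play the same role as literature review, multi-agent debate, or an AlphaEvolve-style prompt in a real harness.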

Learning from Top Submissions

Despite all the participants using the same models under the same rules, they produce submissions that look wildly different. Among the top 8 strategies in the AMM challenge we had two black-box neural networks implemented in Solidity; a few compact, elegantly designed mathematical models of 200-300 lines of code; a few kitchen-sink approaches with over 100 tuned constants; and one submission that attempted to fully model the competitor AMM in Solidity and use that estimate to make decisions.

However, despite the variety of the approaches, all the tech trees end in the same spot.

Edge by Simulation (sorted by difficulty)
[Figure: edge in bps (roughly 370-650) per simulation, sorted by difficulty, for adrianleb, houseofjiao, artemis, unhedged21, afinkek, josusanmartin, basedfk, and frok_ai]

All 8 strategies produce nearly identical results across 20 simulations

Digging deeper, we can see that during critical moments in the challenge all the top strategies essentially move in lockstep.

Top Strategies Move in Lockstep
[Figure: ask fee in bps over time for all 8 strategies, with the fair price on the right axis]

The fair price drops and all 8 strategies simultaneously flip the protected side to the ask

I think of these coding agents kind of like water filling every available crack. With a good harness setup you can essentially perfectly fit a submission to the observable reward, even if the problem has an irregular or abstract shape.

Extracting Elegant Ideas

There are a few key ideas contained in this challenge: estimating the probability of arb trades and skewing based on the information those arbs reveal, dynamically adjusting widths based on the estimated volatility of the simulation, suppressing the skew from outlier trades, and so on. But while I could not have built any of these solutions without an LLM, I learned less than I expected about the core ideas of AMM design.
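As a toy distillation (the constants and function names are mine, not drawn from any submission), the shared core of those ideas might look like this: track an EMA of flow toxicity, widen quotes with volatility, and skew fees toward the side facing adverse flow:

```python
def quote_fees(base_bps, vol, tox_ema, c_vol=50.0, c_skew=100.0):
    """Toy distilled AMM fee rule: widen both sides with volatility,
    then skew fees toward the side facing toxic (informed) flow.
    tox_ema in [-1, 1]; positive means arbs have been lifting the ask."""
    width = base_bps + c_vol * vol           # dynamic width from volatility
    ask = width + c_skew * max(tox_ema, 0)   # protect the ask side
    bid = width + c_skew * max(-tox_ema, 0)  # or the bid side
    return bid, ask

def update_tox(tox_ema, trade_side, was_arb, decay=0.9):
    """EMA of flow toxicity: +1 observations when an arb lifts the ask,
    -1 when an arb hits the bid, decayed toward zero otherwise."""
    signal = trade_side if was_arb else 0.0
    return decay * tox_ema + (1 - decay) * signal

bid, ask = quote_fees(base_bps=30, vol=0.2, tox_ema=0.5)
```

Whether a dozen lines like these actually capture what the 8 optimized submissions are doing is exactly the kind of question the introspection step below should answer.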

Distilling Insights from Optimized Code
[Figure: snippets from the optimized Solidity submissions (toxicity EMAs, surge fee bumps, fair-price anchors, even a small neural net), annotated with the shared insights: estimate the fair price (EMA, dual anchor, or virtual reserves); measure the toxicity of flow (deviation, arb probability, or neural net); set asymmetric fees (widen the side facing adverse flow); ...what else is hiding in here?]

8 wildly different implementations should distill down to a few shared insights

The best ideas in science are simple, elegant, and memetic. The research frontier is advanced by distilling down complex experimental data into the underlying truths about the world. From that perspective, these black box optimizers seem to be leaving something on the table. There is a great paper I love about extracting frontier chess ideas from AlphaZero and teaching them to the best chess players in the world. I feel like we need something of that shape here. The model solves the challenge, and then introspects and teaches us something about the world.

Building Challenges

It is harder to build these challenges than it is to solve them. The best challenges have elegant verification functions that mirror real-world problems, but not so much complexity as to invite overfitting.

Challenge Design Spectrum
[Figure: a spectrum from simple (elegant verifier, mirrors the real world) to complex (many parameters, invites overfitting), with a sweet spot in between; example challenges range from an optimal AMM for a stylized price series, to a fee function competing against a benchmark AMM, to a fully replicated environment from the Solana VM]

Every additional parameter is a new vector for overfitting

More people should try to design challenges instead of just playing them. There are much higher returns to domain expertise on the challenge creation side than the challenge submission side. We struggled to account for all the nuances of problems that were outside our core area of expertise. Take what you understand well and distill it into a hill that can be climbed.