Numerical Optimization — Part II: Derivative-Free Methods
Black Box Optimization
Part I outlined the structure of our optimization effort before concluding with a simple benchmark of SciPy’s minimize_scalar methods.
To understand how I landed on Brent as the default solver in degenbot, we need to understand the major categories that classify optimization methods.
Calculus is the branch of mathematics concerning rates of change. The most familiar operation in calculus is the derivative, which expresses the rate of change of some function with respect to its input.
I studied mechanical engineering, so I often think of calculus operations in terms of a mapped physical system.
Let’s consider a function that maps the displacement of some object from its origin over time:

f(t) = c · t

This is a simple linear displacement function whereby the object is displaced at a constant rate c.
The velocity of the object over time is modeled by the first derivative of the original function:

v(t) = d/dt f(t)
The d/dt notation tells us that the derivative of the enclosed function f is taken with respect to the variable t. The “with respect to” clause is somewhat redundant for a function with only one variable. The distinction becomes important when evaluating systems with multiple variables.
The mechanics of taking derivatives are not relevant here, but the expression above resolves to:

v(t) = c

Or more plainly, the velocity of the object is a constant value.
The second derivative can be expressed over the original function or its first derivative:

a(t) = d²/dt² f(t) = d/dt v(t)

The second derivative of displacement represents acceleration. We know that an object moving at constant velocity is not accelerating, so that value is zero.
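Even without symbolic calculus, these quantities can be estimated numerically. A minimal sketch using finite differences on a hypothetical displacement function with rate 5 (the rate is illustrative, not from the example above):

```python
# Finite-difference estimates of velocity and acceleration for a
# hypothetical displacement function f(t) = 5 * t.

def f(t: float) -> float:
    return 5.0 * t  # displacement grows at a constant rate

def first_derivative(func, t: float, h: float = 1e-6) -> float:
    # Central-difference approximation of df/dt
    return (func(t + h) - func(t - h)) / (2 * h)

def second_derivative(func, t: float, h: float = 1e-4) -> float:
    # Central-difference approximation of d²f/dt²
    return (func(t + h) - 2 * func(t) + func(t - h)) / (h * h)

velocity = first_derivative(f, 2.0)       # ≈ 5.0 at any t
acceleration = second_derivative(f, 2.0)  # ≈ 0.0 at any t
```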
There are names for higher-order derivatives, but the first and second are usually all that you need to work through most problems.
These concepts will appear later when we study Newtonian methods.
Fermat’s Theorem
The study of the first derivative allows us to determine some of the system’s behavior. For systems that exhibit variable behavior, we can identify certain interesting points where the change becomes zero. This may indicate that the system has ceased to change (an object falls from a height and lands without bouncing), or that it has momentarily changed direction (that same object falls from a height, then bounces in the reverse direction).
In either case, the first derivative becomes zero at one or more points called local extrema. These indicate that the function has been maximized or minimized within some range. This is the substance of Fermat’s theorem: an interior extremum of a differentiable function can only occur where the first derivative is zero. A local extremum may also be a global extremum, but not necessarily.
Black Box Functions
The example above is intentionally simplistic. For functions like these, we can solve for the minimum or maximum directly by taking the first derivative and solving that equation for all inputs that result in zero. Then we can inspect the original function at those values and easily identify the global extrema.
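A toy illustration of the direct approach (the function here is hypothetical): take f(x) = (x − 3)² + 1, set its derivative 2(x − 3) to zero, and inspect the single critical point:

```python
# Hypothetical closed-form function: f(x) = (x - 3)^2 + 1.
# Its first derivative, f'(x) = 2 * (x - 3), is zero only at x = 3.

def f(x: float) -> float:
    return (x - 3) ** 2 + 1

critical_point = 3.0  # solved directly from f'(x) = 2 * (x - 3) = 0

# The critical point is the global minimum: f(3) = 1, and the
# function is larger at neighboring inputs.
assert f(critical_point) == 1
assert f(critical_point) < f(critical_point - 0.1)
assert f(critical_point) < f(critical_point + 0.1)
```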
But life is not always easy, and we often want to optimize a system that can only be observed from the outside. These systems are “black boxes”: we can see their inputs and outputs, but have incomplete visibility into how they work. We can measure how the system changes empirically and approximate the first derivative, but the absence of a formal expression for that derivative makes the direct solution infeasible.
Derivative-Free Methods
Several methods have been developed to handle the task of optimizing black box functions where the derivative cannot be calculated, or where that calculation is unacceptably costly.
These are commonly known as derivative-free algorithms, and they operate iteratively, varying the input and studying the change in output until converging on an optimum within some tolerance.
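A minimal example of the iterative idea, using golden-section search (one of the simplest derivative-free schemes) on a hypothetical single-variable objective. Note that the method only ever compares outputs at probed inputs; it never computes a derivative:

```python
# Golden-section search: a derivative-free minimizer for a
# single-variable function on a bracketing interval [lo, hi].

def golden_section_minimize(func, lo: float, hi: float, tol: float = 1e-8) -> float:
    invphi = (5 ** 0.5 - 1) / 2  # inverse golden ratio, ≈ 0.618
    a, b = lo, hi
    while abs(b - a) > tol:
        # Probe two interior points and discard the worse end of
        # the interval, shrinking it by a constant factor each pass.
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2

# Hypothetical objective: (x - 2)^2 has its minimum at x = 2
x_min = golden_section_minimize(lambda x: (x - 2) ** 2, 0.0, 5.0)
```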
These methods are generally well understood and documented, which implies they are likely part of the AI training data used to build popular models.
Using Kimi K2.5, I described the test setup and asked it to summarize derivative-free methods available in SciPy and how they could be used to solve the problem.
This is the result:
Nelder-Mead Simplex Method
Nelder-Mead maintains a geometric simplex of n+1 vertices (a triangle in 2D, tetrahedron in 3D) where each vertex represents a candidate solution vector of trade amounts across pools. At each iteration, it evaluates the profit at all vertices by delegating to pool classes, identifies the worst-performing vertex, and attempts to improve it by reflecting it through the centroid of the remaining vertices. Depending on how the reflected point compares to the current best and second-worst values, the method may expand (take a larger step), contract (take a smaller step toward the centroid), or shrink the entire simplex toward the best vertex. This approach requires only objective comparisons—no gradients or smoothness assumptions—making it robust to the piecewise-constant regions and sudden jumps created by trading fees and liquidity limits in AMM pools.
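A sketch of invoking Nelder-Mead through SciPy’s minimize. The two-variable quadratic below is a hypothetical stand-in for the pool profit function; we minimize the negative profit, since SciPy minimizes by convention:

```python
import numpy as np
from scipy.optimize import minimize

def negative_profit(x: np.ndarray) -> float:
    # Hypothetical smooth stand-in for pool profit across two trade
    # amounts; profit peaks at x = (1.0, 2.0), so its negative
    # bottoms out there.
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

result = minimize(
    negative_profit,
    x0=[0.5, 0.5],  # initial guess for the trade amount vector
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-8},
)
# result.x ≈ [1.0, 2.0]
```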
Powell’s Method (Conjugate Direction Method)
Powell’s method performs sequential line minimizations along a set of directions that are iteratively updated to become conjugate (non-interfering) based on the observed improvement patterns. Starting with coordinate axes as initial search directions, it minimizes the profit function along each direction in turn by finding the optimal step size, then constructs a new conjugate direction from the overall displacement achieved during the cycle. The method replaces the direction that contributed least to the total improvement with this new conjugate direction, gradually learning the local curvature of the objective without ever computing derivatives. While efficient for smooth, locally quadratic objectives, Powell’s method can stall when encountering flat regions or discontinuities common in arbitrage problems where profits remain constant until crossing a fee threshold or liquidity boundary.
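A sketch of Powell’s method via SciPy, here on the classic Rosenbrock test function (a smooth standard benchmark, not an arbitrage objective) whose curved valley exercises the conjugate-direction updates:

```python
from scipy.optimize import minimize, rosen

# Rosenbrock's "banana" function: smooth, with a narrow curved
# valley leading to the global minimum at (1, 1).
result = minimize(rosen, x0=[0.0, 0.0], method="Powell")
# converges to the minimum at (1, 1) without any gradient evaluations
```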
Differential Evolution (DE)
Differential Evolution is a population-based evolutionary algorithm that maintains a fixed-size swarm of candidate solution vectors, each representing a complete set of trade amounts across pools. For each generation, every individual in the population creates a mutant vector by adding the weighted difference of two randomly selected population members to a third base individual, effectively exploring directions suggested by the population’s diversity. The mutant then undergoes crossover with the original individual to create a trial vector, which is evaluated against the current individual by delegating both to pool classes for profit calculation; the better performer survives to the next generation. This stochastic, parallel search naturally explores multimodal landscapes with multiple local profit peaks (arising from different pool path combinations), handles discontinuities without issue, and can be embarrassingly parallelized across available compute resources.
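A sketch using SciPy’s differential_evolution on a hypothetical one-dimensional multimodal objective (a Rastrigin-style function with many local minima and a global minimum at zero); the seed is fixed for repeatability:

```python
import math
from scipy.optimize import differential_evolution

def objective(x):
    # Hypothetical multimodal stand-in for a profit landscape:
    # a 1-D Rastrigin-style function with many local minima and
    # the global minimum at x = 0.
    return x[0] ** 2 + 10.0 * (1.0 - math.cos(2.0 * math.pi * x[0]))

result = differential_evolution(objective, bounds=[(-5.12, 5.12)], seed=42)
# the population search escapes the local minima and settles near x = 0
```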
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
CMA-ES maintains a multivariate Gaussian distribution over the search space, where the mean represents the current best estimate of optimal trade amounts and the covariance matrix captures the pairwise correlations between amounts across different pools. At each iteration, it samples a population of candidate solutions from this distribution, evaluates their profit by delegating to pool classes, and updates the distribution parameters by re-weighting samples based on their ranked performance (not absolute values). The covariance matrix adaptation learns the local structure of the objective—such as the inherent coupling between trade amounts in multi-hop paths dictated by constant product formulas—allowing the algorithm to make elongated steps along valleys and navigate ill-conditioned problems efficiently. While highly sample-efficient and self-adaptive regarding step sizes, CMA-ES assumes local quadratic behavior and may stagnate if the profit landscape contains large truly flat regions where all sampled points yield identical values.
Constrained Optimization by Linear Approximation (COBYLA)
COBYLA is a trust-region algorithm designed for constrained black-box optimization that maintains both the current iterate and a set of “trust region” points used to construct linear approximations of the objective and constraint functions. At each iteration, it solves a linear programming subproblem using these approximations to find a step direction that improves profit while respecting constraints (such as slippage limits, available liquidity, or gas bounds), then evaluates the true objective and constraints at the proposed point by delegating to pool classes. The approximations are updated by replacing the oldest trust region point or adjusting the trust region radius based on how well the linear model predicted the actual function values. Unlike penalty-based methods, COBYLA maintains feasibility throughout the optimization (when started from a feasible point) and handles hard constraints natively without requiring derivative information, making it ideal for arbitrage scenarios where constraints like maximum slippage are non-negotiable safety limits.
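A sketch of COBYLA via SciPy with a single hypothetical inequality constraint standing in for a slippage-style cap. In SciPy’s convention, an "ineq" constraint function must be non-negative at feasible points:

```python
from scipy.optimize import minimize

def negative_profit(x):
    # Hypothetical objective: profit peaks at x = (1.0, 1.0)
    return (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2

# Hypothetical slippage-style cap: x0 + x1 <= 1.5, expressed in
# SciPy's "ineq" form, where fun(x) >= 0 means feasible.
constraints = [{"type": "ineq", "fun": lambda x: 1.5 - x[0] - x[1]}]

result = minimize(
    negative_profit,
    x0=[0.0, 0.0],
    method="COBYLA",
    constraints=constraints,
)
# The unconstrained optimum (1, 1) violates the cap, so COBYLA
# settles on the constraint boundary near (0.75, 0.75).
```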
Bayesian Optimization (Gaussian Process-Based)
Bayesian optimization builds a probabilistic surrogate model—typically a Gaussian Process—over the profit function, treating each evaluation of the black-box pool delegation as an observation that updates the posterior belief about the objective landscape. The Gaussian Process provides both a mean prediction of profit and an uncertainty estimate at any untested point, enabling the algorithm to use an acquisition function (such as Expected Improvement or Upper Confidence Bound) to intelligently select the next trade amount vector that optimally balances exploitation (high predicted profit) against exploration (high uncertainty). After each pool evaluation, the surrogate model is refit and the acquisition function optimized to suggest the next candidate, creating a highly sample-efficient search that minimizes expensive calls to slow or rate-limited pool simulations. While exceptionally efficient when evaluations are costly, the computational overhead of fitting Gaussian Processes scales cubically with the number of observations, making this approach suitable for offline path planning or high-stakes arbitrage validation rather than millisecond-latency production trading.
Hooke-Jeeves Pattern Search
Hooke-Jeeves performs a two-phase iteration consisting of exploratory moves and pattern moves: the exploratory phase samples points along each coordinate direction from the current base point by taking fixed-size steps, accepting any move that improves profit (even marginally), and uses the resulting point as the new base if any improvement was found. The pattern move then attempts to exploit the discovered trend by jumping from the original base point in the direction of the net displacement achieved during exploration, essentially accelerating along promising directions. If pattern moves continue to succeed, the step size is maintained or increased; if exploration fails to find improvement at a given resolution, the step size is reduced for finer local search. This deterministic, direction-based approach requires no derivatives and minimal bookkeeping, but its reliance on fixed exploratory step sizes and coordinate-aligned searches can lead to slow convergence on rotated or highly correlated objective landscapes like those arising from coupled pool interactions.
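SciPy does not ship Hooke-Jeeves itself, but the two-phase loop described above is compact enough to sketch in pure Python (the quadratic objective is a toy, illustrative only):

```python
def hooke_jeeves(func, x0, step=0.5, shrink=0.5, tol=1e-8):
    def explore(point, s):
        # Exploratory phase: probe +/- s along each coordinate,
        # keeping any move that improves the objective.
        best = list(point)
        for i in range(len(best)):
            for delta in (s, -s):
                trial = list(best)
                trial[i] += delta
                if func(trial) < func(best):
                    best = trial
                    break
        return best

    base = list(x0)
    while step > tol:
        candidate = explore(base, step)
        if func(candidate) < func(base):
            # Pattern phase: jump from the old base through the new
            # point, then explore around the landing spot.
            pattern = [2 * c - b for c, b in zip(candidate, base)]
            improved = explore(pattern, step)
            base = improved if func(improved) < func(candidate) else candidate
        else:
            step *= shrink  # no improvement: refine the resolution
    return base

# Toy quadratic with its minimum at (1, 2)
best = hooke_jeeves(lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2, [0.0, 0.0])
```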
Basin-Hopping with Local Refinement
Basin-hopping is a two-stage global optimization strategy that combines random perturbations with local optimization to escape local profit maxima and find the global optimum across disconnected feasible regions. The algorithm starts from an initial trade amount vector and applies a local optimizer (such as Nelder-Mead or Powell’s method) to find the nearest local maximum, then randomly perturbs this solution by adding noise scaled to the problem dimensionality and accepts or rejects the new starting point based on a Metropolis criterion or strict improvement requirement. This process iterates, effectively allowing the solver to “hop” between basins of attraction in the profit landscape, which is essential for arbitrage problems where different pool paths create multiple distinct local optima. The method is particularly effective when combined with aggressive local optimizers that quickly refine each basin to its peak, trading off the number of expensive pool evaluations per hop against the probability of discovering the globally optimal arbitrage path.
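A sketch using SciPy’s basinhopping, pairing Nelder-Mead as the per-basin local refiner with random hops. The one-dimensional objective is adapted from SciPy’s own basinhopping example: many local minima, with the global minimum near x ≈ −0.195:

```python
import math
from scipy.optimize import basinhopping

def objective(x):
    # Toy multimodal objective adapted from SciPy's basinhopping
    # documentation: a rippled parabola with many local minima.
    return math.cos(14.5 * x[0] - 0.3) + (x[0] + 0.2) * x[0]

result = basinhopping(
    objective,
    x0=[1.0],
    niter=200,                                   # random hops to attempt
    seed=42,                                     # repeatable hop sequence
    minimizer_kwargs={"method": "Nelder-Mead"},  # local refiner per basin
)
# hops out of the starting basin and finds the global minimum
```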
Simulated / Dual Annealing
Simulated annealing is a probabilistic meta-heuristic inspired by the metallurgical annealing process, where the algorithm starts with high “temperature” allowing large random steps and gradually cools, reducing both step sizes and the probability of accepting worsening moves according to a Boltzmann-like acceptance criterion. At high temperatures, the solver explores broadly across the trade amount space, delegating to pools and accepting occasional profit decreases to escape local optima; as temperature decreases, it increasingly rejects inferior solutions and converges toward the global maximum. Dual annealing extends this with an adaptive strategy that combines classical annealing with local search visits and uses a fast annealing schedule for global exploration. While theoretically guaranteed to find the global optimum given infinite time and logarithmic cooling, simulated annealing typically requires many more objective evaluations than population-based methods to achieve comparable convergence, making it less suitable for time-sensitive arbitrage scenarios unless profit evaluations are extremely cheap.
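A sketch using SciPy’s dual_annealing on a hypothetical two-dimensional Rastrigin-style landscape with its global minimum at the origin:

```python
import math
from scipy.optimize import dual_annealing

def objective(x):
    # Hypothetical rugged landscape: a 2-D Rastrigin-style function
    # with a global minimum of 0 at the origin and many local minima.
    return sum(xi ** 2 + 10.0 * (1.0 - math.cos(2.0 * math.pi * xi)) for xi in x)

result = dual_annealing(objective, bounds=[(-5.12, 5.12)] * 2, seed=42)
# the annealing schedule plus local search settles near the origin
```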
Just Put My Benchmark in the Bag Bro
That’s nice and all, but what we really care about is speed and accuracy. So let’s ask the model to implement and evaluate them:

