Degen Code

Numerical Optimization — Part IV: Newtonian Methods

Standing on the Shoulders of Giants

May 13, 2026

The derivative-free methods covered in Parts II and III treat the objective function as a black box, probing inputs and observing outputs without requiring knowledge of the internal structure. This flexibility is valuable, but it comes at a cost: without information about how the function changes, these methods may converge slowly, may fail to converge at all, and offer few guarantees that the result is globally optimal.

Part II covered the derivative as a measure of rate of change. From a physical landscape point of view, the derivative tells us the direction of steepest ascent or descent, and the second derivative reveals how quickly that direction is changing. Algorithms that exploit this information can make better choices during iteration, which can result in better performance compared to black box methods.

Derivative-free methods typically exhibit linear (geometric) convergence — each iteration shrinks the error by roughly a constant factor. Methods using derivatives can exhibit quadratic convergence — each iteration roughly squares the error, doubling the number of correct digits.

Thus, derivative methods can converge faster than black-box methods.

Why “Newtonian”?

Isaac Newton developed a method for finding roots of equations — values of x where f(x) = 0. The method converges on a root by repeatedly stepping from the current point (x0) to an adjacent point (x1):

\(x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}\)
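This root-finding iteration can be sketched in a few lines of Python; the toy function f(x) = x² − 2, with positive root √2, is my own illustrative example:

```python
# Minimal sketch of Newton's root-finding iteration.
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)  # f(x_k) / f'(x_k)
        x -= step                 # x_{k+1} = x_k - step
        if abs(step) < tol:
            break
    return x

# Find the positive root of f(x) = x^2 - 2.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

Starting from x0 = 1.0, the iterate converges to √2 in a handful of steps, illustrating the quadratic convergence described above.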

The function f itself is arbitrary, so let’s apply a bit of mathematical bookkeeping to consider how this applies to derivatives. Define a new function g to represent the first derivative of f:

\(g(x) = f'(x)\)

Then apply Newton’s Method to g(x):

\(x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}\)

We know from Part II that local optima of f(x) occur where f’(x) = 0. Optimizing f(x) is equivalent to using Newton’s Method on g(x), which finds the points where g(x) = f’(x) = 0.
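To minimize f, the same iteration simply runs on g(x) = f′(x). A sketch, using the assumed toy objective f(x) = (x − 3)² + 1, so that g(x) = 2(x − 3) and g′(x) = 2:

```python
# Sketch: minimizing f by applying Newton's method to g(x) = f'(x).
def newton_minimize(g, g_prime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)  # Newton step on the derivative
        x -= step
        if abs(step) < tol:
            break
    return x

# Minimizer of f(x) = (x - 3)^2 + 1 is x = 3.
x_star = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
```

Because the toy objective is quadratic, a single Newton step lands exactly on the optimum — the quadratic model is the function.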

Many variations on Newton’s original algorithm have been developed. The term “Newtonian methods” now encompasses all optimization algorithms that use derivatives:

  • Newton’s Method uses the first and second derivative directly

  • Quasi-Newton Methods approximate the second derivative by tracking previous values

  • Gradient Descent Methods use only the first derivative

  • Secant Methods approximate both derivatives

Multivariate Problems

Newton’s Method can be applied directly as shown above to single variable optimization. But with some generalization, it can be used to find optima for multivariate functions.

To keep multivariate expressions concise, we introduce two new terms:

  • Gradient — a column vector of first-order partial derivatives, one per variable

  • Hessian — a square matrix of second-order partial derivatives, one per pair of variables

The Gradient

For a multivariate function f(x) where x is a vector, its gradient is the column vector of the first partial derivative of f for each element:

\(\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\[6pt] \frac{\partial f}{\partial x_2} \\[6pt] \vdots \\[6pt] \frac{\partial f}{\partial x_n} \end{bmatrix}\)

The gradient points in the direction of steepest ascent. At a critical point (local maximum or minimum):

\(\nabla f({\mathbf{x}}^{*}) = \begin{bmatrix} 0 \\[6pt] 0 \\[6pt] \vdots \\[6pt] 0 \end{bmatrix}\)

Arbitrage Example: For a profit function P(x_1, x_2) over two paths, each partial derivative ∂P/∂x_i represents the marginal profit from increasing that path’s allocation. The gradient points in the direction of allocation changes that increase profit the fastest.
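One way to obtain such a gradient is a central-difference approximation; the two-path profit function below is a made-up quadratic stand-in for illustration, not the arbitrage function from this series:

```python
# Hedged sketch: central-difference gradient of a toy profit function
# P(x1, x2) = 10*x1 - x1^2 + 6*x2 - x2^2 (illustrative only).
def grad(P, x, h=1e-6):
    g = []
    for i in range(len(x)):
        up = list(x); up[i] += h   # nudge variable i up
        dn = list(x); dn[i] -= h   # nudge variable i down
        g.append((P(up) - P(dn)) / (2 * h))
    return g

P = lambda x: 10 * x[0] - x[0] ** 2 + 6 * x[1] - x[1] ** 2
g = grad(P, [1.0, 1.0])  # each entry is the marginal profit of its path
```

At (1, 1) the analytic gradient is (8, 4): both paths still have positive marginal profit, so allocating more capital to either increases P.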

The Hessian

For a multivariate function f(x) where x is a vector, its Hessian is an n × n square matrix containing the second partial derivatives of f for each pair of elements:

\(H(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\[6pt] \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\[6pt] \vdots & \vdots & \ddots \end{bmatrix}\)

From the Hessian, we can calculate a set of scalar components called eigenvalues. These “compressed” values classify the curvature of the function at the given point.

Inspecting the signs of the eigenvalues classifies the critical point:

  • All positive — we have found a local minimum (movement to an adjacent region would increase f)

  • All negative — we have found a local maximum (movement to an adjacent region would decrease f)

  • Mixed sign — we have found a saddle point (a movement to an adjacent region may decrease or increase f)

  • Some zero — we have found a flat region (movement to an adjacent region will not change f)

The ratio of the largest to smallest eigenvalue is called the condition number, and indicates the “width” of the local region. Large condition numbers indicate a narrow valley which can be difficult to evaluate for gradient descent methods that ignore the Hessian.
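For a 2 × 2 symmetric Hessian the eigenvalues have a closed form, so the classification and condition number above can be sketched directly (the matrices and thresholds here are illustrative):

```python
import math

# Eigenvalues of a 2 x 2 symmetric Hessian [[a, b], [b, c]].
def eigenvalues_2x2(a, b, c):
    mean = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - spread, mean + spread

def classify(a, b, c, eps=1e-12):
    lo, hi = eigenvalues_2x2(a, b, c)
    if lo > eps:
        return "local minimum"        # all eigenvalues positive
    if hi < -eps:
        return "local maximum"        # all eigenvalues negative
    if lo < -eps and hi > eps:
        return "saddle point"         # mixed signs
    return "flat / degenerate"        # some eigenvalue near zero

# A stretched quadratic valley: eigenvalues 1 and 100.
lo, hi = eigenvalues_2x2(100, 0, 1)
condition_number = hi / lo  # large ratio -> narrow valley
```

A condition number of 100 here means the valley is 100× steeper in one direction than the other — exactly the geometry that makes pure gradient descent zig-zag.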

Multivariate General Form

The steps above can be expressed for multivariate problems as:

\(\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k)\)

which parallels the Newton’s Method form for g(x) above: the gradient and Hessian take the place of the first and second derivatives of f(x).
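A sketch of one such step in two dimensions, solving H·Δx = −∇f (via Cramer's rule for the 2 × 2 case) instead of explicitly forming the inverse; the quadratic test function is my own example:

```python
# One multivariate Newton step: solve H * delta = -grad for delta.
def newton_step_2d(grad, hess, x):
    g1, g2 = grad
    (a, b), (c, d) = hess
    det = a * d - b * c          # Hessian must be nonsingular
    rhs1, rhs2 = -g1, -g2        # right-hand side of H * delta = -grad
    d1 = (rhs1 * d - b * rhs2) / det   # Cramer's rule, first component
    d2 = (a * rhs2 - rhs1 * c) / det   # Cramer's rule, second component
    return [x[0] + d1, x[1] + d2]

# f(x, y) = (x - 1)^2 + (y - 2)^2: gradient at (0, 0) is (-2, -4),
# Hessian is 2I, and one Newton step lands exactly on the optimum.
x_new = newton_step_2d([-2.0, -4.0], [[2, 0], [0, 2]], [0.0, 0.0])
```

In higher dimensions a linear solver replaces Cramer's rule, but the structure of the step is identical.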

Strategy

Newtonian methods share a common control flow:

  1. Model the function using derivatives

  2. Predict where ∇f = 0

  3. Move toward that prediction

  4. Repeat until convergence

Convergence

The convergence criteria for a Newtonian method generally follow this structure:

  1. The Hessian must be nonsingular (invertible) at each iteration

  2. The Hessian must be positive definite (all eigenvalues positive) near the optimum

  3. The initial guess must be “sufficiently close” to the optimum

Newtonian methods can fail to converge if they encounter a saddle point or a flat region, where the local model gives no useful direction. They can also fail if the initial guess is too far from the true optimum — early steps can overshoot the optimum and never settle into refinement.

Newtonian methods are therefore not guaranteed to converge. Several techniques have been developed to handle this:

  • Damped Newton Method — apply a damping factor to clamp the maximum step size, which increases the likelihood of convergence

  • Hessian Regularization — apply a transformation to the Hessian if negative or near-zero eigenvalues are found

  • Trust-region Methods — constrain the step size within some bound where the quadratic model (three terms representing the function, its first derivative, and its second derivative) is known to be accurate
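The first of these mitigations is simple enough to sketch: clamp the raw Newton step to a maximum size before taking it (the objective and constants below are illustrative):

```python
# Hedged sketch of a damped Newton iteration in one dimension:
# the raw Newton step is clamped to a maximum step size.
def damped_newton(g, g_prime, x0, max_step=0.5, tol=1e-10, max_iter=200):
    x = x0
    for _ in range(max_iter):
        step = -g(x) / g_prime(x)                    # raw Newton step
        step = max(-max_step, min(max_step, step))   # damping clamp
        x += step
        if abs(step) < tol:
            break
    return x

# Minimize f(x) = (x - 5)^2, i.e. g(x) = 2(x - 5), g'(x) = 2.
# The raw first step would be +5; damping walks there in 0.5 increments.
x_star = damped_newton(lambda x: 2 * (x - 5), lambda x: 2.0, x0=0.0)
```

Damping trades a few extra iterations for a much lower chance of overshooting past the optimum.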

Quasi-Newtonian Methods

The Newton step Δx for the quadratic model satisfies this constraint:

\(H \Delta \mathbf{x} = -\nabla f\)

But to determine the next step for x means solving:

\(\Delta \mathbf{x} = -H^{-1} \nabla f\)

This can become a heavy operation, since the size of the Hessian grows with the square of the number of decision variables. A problem with a 10-dimensional decision variable requires forming a 10 x 10 Hessian and solving a linear system with it on every step.

Several methods have been proposed to approximate the Hessian and its inverse instead of calculating it directly.

The most basic form uses the secant equation to assume constant curvature between two points instead of allowing it to be fully variable. This relationship is expressed as:

\(H \mathbf{s}_k = \mathbf{y}_k\)

where s_k is the step vector (x_{k+1} − x_k) and y_k is the corresponding change in the gradient (∇f(x_{k+1}) − ∇f(x_k)).

Using this relationship, the approximated Hessian still provides useful curvature feedback without the steep calculation penalty.
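In one dimension the secant equation reduces to estimating curvature from two gradient evaluations. A sketch, with an assumed toy objective, never computing the second derivative:

```python
# Sketch of the 1-D secant equation B * s = y: curvature is
# approximated from gradient differences instead of computing g'.
def secant_minimize(g, x0, x1, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        s = x1 - x0                   # step vector s_k
        y = g(x1) - g(x0)             # gradient change y_k
        B = y / s                     # curvature estimate satisfying B*s = y
        x0, x1 = x1, x1 - g(x1) / B   # quasi-Newton step, B in place of g'
        if abs(x1 - x0) < tol:
            break
    return x1

# Minimize f(x) = (x - 4)^2, i.e. g(x) = 2(x - 4).
x_star = secant_minimize(lambda x: 2 * (x - 4), x0=0.0, x1=1.0)
```

Because the toy gradient is linear, the secant estimate B = 2 is exact and the iteration converges almost immediately; on a real objective B is only a local approximation, refreshed each step.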

Thus, Quasi-Newtonian methods are all dedicated to finding some general approximation for H:

\(B_k \approx H\)

Several methods have been developed for this purpose:

  • BFGS — builds the Hessian approximation from low-rank updates formed by outer products of the step s and gradient change y

  • L-BFGS — a limited-memory variation on BFGS that does not store Bk, instead keeping only the last m step/gradient-change pairs and calculating intermediate products on demand

  • L-BFGS-B — an extension that supports bound constraints on variables, which can be very useful for problems where some ranges of values would be nonsensical

  • DFP — a method that tracks the Hessian’s inverse directly, instead of calculating the Hessian and inverting it after

Gradient Descent Methods

These methods improve calculation speed by ignoring the Hessian completely. There are several ways to proceed from a given point, knowing only the gradient:

  • Steepest Descent — at each point, move in the direction of steepest descent, backtracking on the step size until a sufficient-decrease condition is satisfied

  • Conjugate Gradient — at each step, constrain the next direction to be conjugate to the previous ones (orthogonal with respect to the Hessian, rather than simply at a right angle)

  • Momentum — accelerate gradient descent by accumulating a “velocity” term across multiple steps

Steepest descent methods are prone to slow convergence or overshooting errors (zig-zagging) if the step size is not tuned. Conjugate gradient methods mitigate this by moving along mutually conjugate directions, which minimizes backtracking over multiple steps. These methods are particularly appealing for large problems with memory limitations. Momentum methods suit noisy functions with many flat or saddle regions, since the accumulated velocity carries them quickly through those areas toward the global extrema.
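The momentum variant is compact enough to sketch; the learning rate and decay factor below are arbitrary illustrative choices, as is the quadratic test function:

```python
# Sketch of gradient descent with momentum: a velocity term
# accumulates past gradients, smoothing progress through flat regions.
def momentum_descent(grad, x0, lr=0.1, beta=0.9, iters=500):
    x, v = x0, 0.0
    for _ in range(iters):
        v = beta * v - lr * grad(x)  # accumulate "velocity"
        x += v                       # move along the velocity
    return x

# Minimize f(x) = (x - 2)^2, whose gradient is 2(x - 2).
x_star = momentum_descent(lambda x: 2 * (x - 2), x0=10.0)
```

With beta = 0 this reduces to plain gradient descent; larger beta values carry more history, which speeds travel across plateaus but can cause oscillation near the optimum.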

Secant Methods

These methods bridge the gap between the derivative-free and Newtonian approaches. They approximate derivatives from function values, which allows them to gain some of Newton's convergence speed without requiring explicit gradient computation.

Several methods are developed using this approach:

  • Secant Approximation (Univariate) — assume a linear gradient, similar to the quasi-Newtonian Hessian approximation discussed above

  • Broyden’s Method — a multivariate generalization of the secant approximation that uses a rank-1 modification to determine the approximate Hessian across steps

Broyden’s Method has two variants: one approximates H, the other approximates its inverse.
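The univariate case is the classic secant root-finder: a finite-difference slope from the last two iterates replaces f′ in Newton's update (the toy function is my own example):

```python
# Sketch of the classic secant iteration for f(x) = 0: no derivative
# is ever computed, only a slope estimated from two recent points.
def secant_root(f, x0, x1, tol=1e-12, max_iter=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1 - f0) < 1e-30:                  # slope vanished; stop
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)      # secant slope replaces f'
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
        if abs(x1 - x0) < tol:
            break
    return x1

# Find the positive root of f(x) = x^2 - 2.
root = secant_root(lambda x: x * x - 2, x0=1.0, x1=2.0)
```

Convergence is superlinear — slower than Newton's quadratic rate, but each iteration costs only one function evaluation and no derivative.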

Profit Function Derivatives

All this theory is neat, but how do we actually express the arbitrage function and take its derivatives?

Simple Case

Start with the most basic example — Uniswap V2 arbitrage between two equivalent pairs (WETH-USDC) at some fee (gamma):

The structure of the arbitrage is that we sell WETH at Pool A and buy it at Pool B. We maximize profit in USDC by choosing the USDC forward amount x:
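Since the post's own formulation continues past this point, here is a hedged sketch of how such a profit function might look, using the standard V2 constant-product output formula with gamma = 1 − fee; the pool reserves and numbers are entirely made-up illustrative assumptions:

```python
# Hedged sketch (illustrative reserves, not the post's example).
# Uniswap V2 swap output for a constant-product pool with fee factor gamma.
def v2_out(amount_in, reserve_in, reserve_out, gamma=0.997):
    return (amount_in * gamma * reserve_out) / (reserve_in + amount_in * gamma)

# Forward x USDC into Pool B to buy WETH, sell that WETH at Pool A.
def profit(x, pool_a, pool_b):
    weth = v2_out(x, pool_b["usdc"], pool_b["weth"])     # buy WETH at Pool B
    usdc = v2_out(weth, pool_a["weth"], pool_a["usdc"])  # sell WETH at Pool A
    return usdc - x

# Toy reserves with a price gap between the pools (WETH is cheaper at B):
pool_a = {"weth": 100.0, "usdc": 210_000.0}
pool_b = {"weth": 100.0, "usdc": 190_000.0}
```

Under these assumed reserves, small forward amounts yield positive profit; P(x) is smooth and concave in x, which is exactly the shape the Newtonian machinery above is built to exploit.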
