Gradient-Free Methods
Gradient-free optimization methods tackle problems where derivative information is unavailable, unreliable, or prohibitively expensive to compute. Unlike gradient-based approaches that follow the steepest descent/ascent direction, these methods rely on direct sampling of the objective function to guide the search process. This makes them particularly valuable for black-box optimization scenarios, highly non-smooth functions, and problems where only function evaluations are possible.
These methods leverage diverse strategies to explore solution spaces effectively without gradient information—from physics-inspired processes like simulated annealing to direct search techniques that systematically probe the neighborhood of current solutions. While often requiring more function evaluations than gradient-based methods, they offer remarkable robustness across a wide range of problem types.
Gradient-free approaches shine particularly in situations with noisy function evaluations, discrete or mixed variables, and multi-modal landscapes with many local optima. Their ability to handle these challenging scenarios makes them essential tools in the optimization toolkit, especially for real-world problems where theoretical assumptions of smoothness and differentiability rarely hold.
Simulated Annealing (SA) draws inspiration from the physical process of annealing in metallurgy, where metals are heated and then slowly cooled to reduce defects and increase strength through controlled crystallization. The algorithm mimics this thermodynamic process to escape local optima and find near-globally-optimal solutions.
The algorithm begins with an initial solution and a high 'temperature' parameter. At each iteration, it randomly proposes a neighboring solution and decides whether to accept it based on both its quality and the current temperature. Better solutions are always accepted, but importantly, worse solutions may also be accepted with a probability that depends on how much worse they are and on the current temperature: for minimization, a move that worsens the objective by Δf is accepted with probability exp(-Δf/T), the Metropolis criterion.
This probabilistic acceptance of suboptimal moves allows the algorithm to escape local optima by occasionally moving 'uphill' in the early stages when the temperature is high. As the temperature gradually decreases according to a cooling schedule, the algorithm becomes increasingly selective, eventually converging toward a local optimum—but ideally after having explored enough of the solution space to find a high-quality region.
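To make the procedure concrete, here is a minimal Python sketch. The neighbor function, the geometric cooling schedule, and all parameter values here are illustrative assumptions rather than part of any standard; the core of the method is the Metropolis acceptance test inside the loop.

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=1.0, t_min=1e-4,
                        cooling=0.95, iters_per_temp=100):
    """Minimize f by simulated annealing with geometric cooling."""
    x, fx = x0, f(x0)
    best, f_best = x, fx
    t = t0
    while t > t_min:
        for _ in range(iters_per_temp):
            x_new = neighbor(x)          # propose a random neighbor
            f_new = f(x_new)
            delta = f_new - fx
            # Metropolis criterion: always accept improvements; accept
            # worse moves with probability exp(-delta / t).
            if delta <= 0 or random.random() < math.exp(-delta / t):
                x, fx = x_new, f_new
                if fx < f_best:
                    best, f_best = x, fx
        t *= cooling                     # lower the temperature
    return best, f_best

# Toy multimodal objective with a Gaussian-perturbation neighborhood.
f = lambda x: x**2 + 10 * math.sin(3 * x)
step = lambda x: x + random.gauss(0.0, 0.5)
x_best, f_best = simulated_annealing(f, x0=5.0, neighbor=step)
```

Early on, when t is large, exp(-delta / t) is close to 1 even for substantially worse moves, which is exactly the 'uphill' freedom described above; as t shrinks, the acceptance test degenerates into pure greedy descent.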
Simulated annealing has proven remarkably effective for combinatorial optimization problems like circuit design, job shop scheduling, and graph partitioning. Its simplicity of implementation, combined with a theoretical guarantee of convergence in probability to the global optimum under a sufficiently slow (logarithmic) cooling schedule, makes it a popular choice for problems with complex, multimodal optimization landscapes.
The Nelder-Mead method (also known as the downhill simplex method, not to be confused with the simplex method for linear programming) represents one of the most widely used direct search techniques for multidimensional unconstrained optimization without derivatives. Unlike population-based methods, it maintains just a single geometric figure, a simplex with n+1 vertices in n-dimensional space, and evolves this shape to explore the objective function landscape.
The algorithm iteratively transforms the simplex through a series of geometric operations—reflection, expansion, contraction, and shrinking—based on function evaluations at the vertices. These operations adaptively reshape and move the simplex to follow the landscape's contours, generally flowing toward better solutions while adjusting its shape to match the local geometry of the function being optimized.
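The sketch below shows one way these operations fit together, using the standard coefficients (reflection α=1, expansion γ=2, contraction ρ=0.5, shrink σ=0.5) and, for brevity, a single inside-contraction step rather than the separate inside/outside cases found in some variants; the initial-simplex construction and stopping test are illustrative assumptions.

```python
import numpy as np

def nelder_mead(f, x0, max_iter=1000, tol=1e-8,
                alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Minimize f with the Nelder-Mead simplex method (standard coefficients)."""
    n = len(x0)
    # Initial simplex: x0 plus a small perturbation along each coordinate axis.
    simplex = [np.asarray(x0, dtype=float)]
    for i in range(n):
        v = simplex[0].copy()
        v[i] += 0.05 if v[i] == 0 else 0.05 * v[i]
        simplex.append(v)
    fvals = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Order vertices from best (lowest f) to worst.
        order = np.argsort(fvals)
        simplex = [simplex[i] for i in order]
        fvals = [fvals[i] for i in order]
        if abs(fvals[-1] - fvals[0]) < tol:
            break
        centroid = np.mean(simplex[:-1], axis=0)   # centroid excluding the worst

        # Reflection: mirror the worst vertex through the centroid.
        xr = centroid + alpha * (centroid - simplex[-1])
        fr = f(xr)
        if fvals[0] <= fr < fvals[-2]:
            simplex[-1], fvals[-1] = xr, fr
        elif fr < fvals[0]:
            # Expansion: reflection found a new best point, so push further out.
            xe = centroid + gamma * (xr - centroid)
            fe = f(xe)
            simplex[-1], fvals[-1] = (xe, fe) if fe < fr else (xr, fr)
        else:
            # Contraction: pull the worst vertex in toward the centroid.
            xc = centroid + rho * (simplex[-1] - centroid)
            fc = f(xc)
            if fc < fvals[-1]:
                simplex[-1], fvals[-1] = xc, fc
            else:
                # Shrink: move every vertex toward the best vertex.
                simplex = [simplex[0]] + [simplex[0] + sigma * (v - simplex[0])
                                          for v in simplex[1:]]
                fvals = [fvals[0]] + [f(v) for v in simplex[1:]]

    i = int(np.argmin(fvals))
    return simplex[i], fvals[i]
```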
This elegant approach makes remarkably efficient use of function evaluations, typically requiring far fewer calls to the objective function than many other gradient-free methods. Its adaptive behavior allows it to handle varying scales and correlations between different dimensions, naturally stretching along promising directions and contracting in others.
Despite its age (developed in the 1960s), the Nelder-Mead method remains a workhorse optimization technique, particularly well-suited for problems with up to 10-20 variables where function evaluations are expensive. It excels at finding local optima of non-differentiable functions and is widely implemented in scientific computing environments due to its reliability and relative simplicity.
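In practice, most users call a library implementation rather than writing their own; SciPy, for example, exposes the method through scipy.optimize.minimize:

```python
from scipy.optimize import minimize

# Rosenbrock function: a classic non-convex test problem with a curved valley.
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
result = minimize(rosen, x0=[-1.2, 1.0], method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun)  # converges to approximately [1, 1], f near 0
```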
Bayesian Optimization represents a sophisticated approach to black-box optimization particularly suited to expensive-to-evaluate objective functions. Rather than spending a large budget of function evaluations on direct search, it fits a probabilistic surrogate model (typically a Gaussian process) to the evaluations gathered so far and uses that model to select the most promising point to evaluate next.
The method operates through a sequential strategy that balances exploration and exploitation. First, it builds a surrogate model of the objective function based on previous evaluations. This model captures both the estimated function value at any point and the uncertainty in that estimate. Then, it uses an acquisition function that combines information about predicted values and uncertainties to determine the next most informative point to evaluate.
Common acquisition functions include Expected Improvement (which balances the value of exploring uncertain regions against exploiting regions with high predicted performance) and Upper Confidence Bound (which explicitly manages the exploration-exploitation tradeoff through a tunable parameter).
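For minimization, Expected Improvement at a candidate x is EI(x) = E[max(f_best - f(x), 0)], which has a closed form under a Gaussian posterior. The sketch below runs a small Bayesian optimization loop using scikit-learn's GaussianProcessRegressor; the toy objective, candidate grid, Matern kernel, and xi exploration margin are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, f_best, xi=0.01):
    """Closed-form Expected Improvement for minimization under a GP posterior."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    imp = f_best - mu - xi             # predicted improvement over the best so far
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

# Toy objective standing in for an expensive black box.
f = lambda x: np.sin(3 * x) + 0.5 * x**2

X = np.array([[-2.0], [0.0], [2.0]])   # a few initial design points
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)                                     # refit the surrogate
    cand = np.linspace(-3, 3, 500).reshape(-1, 1)    # dense candidate grid
    ei = expected_improvement(cand, gp, y.min())
    x_next = cand[np.argmax(ei)]                     # most informative point
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print(X[np.argmin(y)], y.min())
```

Each outer iteration mirrors the loop described above: fit the surrogate, score candidates with the acquisition function, evaluate the winner, and fold the result back into the data. A lower-confidence-bound variant would simply replace expected_improvement with mu - kappa * sigma and pick the candidate minimizing it.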
This approach has become the method of choice for hyperparameter tuning in machine learning, where each evaluation might require training a neural network for hours or days. It is also valuable in experimental design, drug discovery, materials science, and other domains where each function evaluation is time-consuming or expensive. By making intelligent decisions about which points to evaluate, Bayesian optimization can often find high-quality solutions with one to two orders of magnitude fewer function evaluations than undirected global search methods.