Introduction to Model-Based Derivative-Free Optimization
Abstract
The field of derivative-free optimization (DFO) studies algorithms for nonlinear optimization that do not rely on the availability of gradient or Hessian information. It is primarily designed for settings when functions are black-box, expensive to evaluate and/or noisy. A widely used and studied class of DFO methods for local optimization is model-based DFO, where the general principles from derivative-based nonlinear optimization algorithms are followed, but local Taylor-type approximations are replaced with alternative local models constructed by interpolation. This document provides an overview of the basic algorithms and analysis for model-based DFO, covering worst-case complexity, approximation theory for polynomial interpolation models, and extensions to constrained and noisy problems.
Contents
1 Introduction
This article is an introduction to the algorithmic ideas and analysis for model-based derivative-free optimization (MBDFO). As the name suggests, derivative-free optimization (DFO)—model-based or otherwise—refers to optimization in the absence of derivative information, whether that be for the objective function and/or constraint functions.111DFO is sometimes called zeroth order optimization (ZOO), particularly in machine learning. Our focus here is nonlinear optimization, where we aim to minimize a (usually assumed smooth) function of several real variables, possibly with constraints, which may be specified either explicitly (e.g. ) or via other nonlinear function(s) (e.g. ). Most general-purpose nonlinear optimization algorithms, such as gradient descent or (quasi-)Newton methods, require some derivative information, such as the gradient or Hessian of the objective (and any nonlinear constraints). DFO methods become relevant in situations where no derivative information is available, and MBDFO methods form an important sub-class of DFO methods.
1.1 Overview of Derivative-Free Optimization
In order to understand the uses/benefits of DFO, we first have to consider how we may obtain derivative information. If we have a function , such as our objective function, there are three main ways to evaluate or approximate its derivatives [111, Chapter 8]:
- Explicit calculation
-
If the mathematical form of is known, we can directly compute the relevant derivatives by hand (or using a symbolic computation package);
- Automatic (aka algorithmic) differentiation
-
If we have computer code to compute , automatic differentiation software can be used to create code that computes derivatives of by analyzing the code for and repeatedly applying the chain rule.222There are two ‘modes’ of automatic differentiation, forward and reverse. If has many inputs and few outputs, which is typically the case for optimization, the reverse mode is usually more efficient. In machine learning, the reverse mode is often called backpropagation. This produces exact derivatives, typically for the cost (in both time and memory) of a small—that is, and independent of —number of evaluations of [76, Chapter 4].
- Finite differencing
-
We can approximate derivatives of by comparing values of at nearby points. For example, forward finite differencing estimates first derivatives via
(1.1) for some small , where is the -th coordinate vector.333If can be extended to complex-valued inputs, the complex step approximation can be used [138]. This is less susceptible than (1.1) to rounding errors when is very small. To evaluate a full gradient , we would need to evaluate at and for all (i.e. evaluations of ). This only requires the ability to evaluate , but is only an approximation.
In most circumstances, at least one of these three approaches can be successfully used without significant effort. DFO methods are useful in the situations where none of these approaches are possible or practical. This typically is in situations where at least two of the following apply:
- Black-box
-
That is, the underlying mathematical structure of is not available. This could be because the details are unknown (e.g. legacy or proprietary software is used) or a situation where a clear mathematical description does not exist (e.g. results of a real-world experiment). This immediately rules out both explicit calculation and automatic differentiation.
- Expensive to evaluate
-
If computing is costly to evaluate—whether that cost is time, effort or money—then any algorithms relying on many evaluations of are likely to be impractical. Finite differencing requires at least evaluations of at each iteration, and so may not be suitable in this case.
- Noisy
-
In essence, this means that evaluating at nearby inputs leads to non-negligible differences in outputs. This noise may be deterministic or stochastic, depending on whether evaluating repeatedly at the same point gives the same result or not. This would usually not refer to rounding errors, but more significant differences such as the results of a Monte Carlo simulation. Here, finite differencing cannot be used without significant care, as it relies on comparing function values at nearby points.
A large survey of practical examples where these situations arise and DFO methods are useful is [2]. However, some specific examples include the following.
Example 1.1 (Calibrating Climate Models [141]).
In many areas of science, mathematical models of real-world phenomena are constructed to help predict outcomes and make decisions. Often, these models have parameters (e.g. coefficients of different terms in a differential equation) that cannot be directly measured, and instead must be calibrated using indirect measurements. This most commonly leads to least-squares minimization problems
| (1.2) |
where are the model parameters, are observations, and represents evaluating the model with parameters and other inputs to produce a predicted value for . In atmospheric physics, can involve a global climate simulation, incorporating coupled ocean, atmospheric and sea ice dynamics, and so its mathematical description is too complex to allow for a clear description (i.e. is black-box). Because of this complexity, evaluating can be computationally expensive, e.g. 8 hours on a high-performance computing system to simulate 18 months of climate forecasts. Additionally, climate dynamics can exhibit chaotic behavior, which in practice means that very similar parameters can produce very different model results, which effectively mean that is noisy.
Example 1.2 (Quantum Optimization [1]).
One promising use of quantum computers is to solve combinatorial optimization problems such as the graph max-cut problem. In the Quantum Approximate Optimization Algorithm, a particular combinatorial problem is converted into a minimum eigenvalue problem
| (1.3) |
for some (Hermitian) matrix and complex vector . However, is described implicitly, and must be evaluated by performing specific calculations on a quantum computer. However, quantum computers are inherently stochastic, and so we can only ever evaluate random approximations to . Given the limited availability of quantum computing hardware, the cost of evaluating may also be high.
Types of DFO Methods
There are many different classes of DFO methods, both for local and global optimization. DFO methods are very common in global optimization (since converging to a stationary point is not sufficient), and the resources [8, 103, 129] provide an overview of different global optimization methods, such as Bayesian optimization, genetic algorithms, branch-and-bound, and Lipschitz optimization (such as DIRECT [89]). Our focus here is on local optimization, where we do aim to find (approximate) stationary points, just like many popular nonlinear optimization methods such as (quasi-)Newton methods. Indeed, global optimization based only on local problem information (such as objective values and/or derivatives) cannot succeed unless an algorithm densely samples the feasible region or some global information is used, such as convexity or global derivative bounds [139]. The most common DFO methods for local optimization are:
- Model-based DFO (MBDFO)
-
The focus of this introduction. Here, the goal is to mimic derivative-based optimization methods, substituting local gradient-based approximations such as Taylor series with local models constructed by interpolation.
- Direct search methods
-
Such methods attempt to iteratively improve on a candidate solution by sampling nearby points, but without using any gradient approximations. This definition is broad—see a discussion of this in [93, Section 1.4]—but typically nearby points are selected from a small number of perturbations of the current iterate. A famous example is the Nelder–Mead method, where a simplex of points is iteratively modified to find improved solutions, but more widely studied are Generating Set Search [93] and Mesh Adaptive Direct Search [5] methods, where the perturbations are chosen in an predictable, structured way.
- Finite differencing/implicit filtering
-
Although finite differencing-based derivative approximations are typically used within standard (derivative-based) optimization algorithms without modification, some works explicitly consider the management of the perturbation size in (1.1) the algorithm. If is large, this is sometimes called implicit filtering. Building gradient approximations by finite differencing-type approximations along randomly generated directions [109] is particularly popular in machine learning [68].
For readers interested in other classes of local DFO, we refer to the books [51, 91, 8] and survey papers [116, 93, 56, 15, 96, 61]. There are different reasons to prefer these classes—for example, direct search methods are much more developed than MBDFO at handling complex problem structures such as nonsmoothness and discrete variables. However, the similarity of MBDFO to widely recognized and successful (derivative-based) optimization algorithms is appealing, and they often perform well in practice, when measured by the total number of objective evaluations required to solve a problem (see below).
In direct search methods, the search–poll paradigm [25] allows for very flexible ‘search steps’, which allow for any procedure that produces potentially good points, to be combined with the rigorous ‘poll step’ from direct search methods. Search steps using MBDFO ideas is one popular approach in this framework which significantly improves the practical performance of direct search methods [55, 48]. Global surrogate models for the objective such as radial basis functions and Gaussian Processes can also enable a search step [25, 10, 13]. Here, a surrogate does not need to accurately approximate the objective, only accurately rank the quality of suggested points [8, 94].
Cheap vs. Expensive Evaluations
An extremely important issue in DFO is whether a given problem has functions (objective, constraints, etc.) that are ‘cheap’ or ‘expensive’ to evaluate. These terms are meant in a (somewhat) relative sense, as different people/contexts have different understandings of cost.444As above, ‘cost’ could be financial, computational effort or time, for example. This can vary from seconds (e.g. valve train design [43] and financial model calibration [112]) to hours (e.g. algorithm tuning [14]) to days (e.g. hydrofoil noise reduction [104]).
In the case of expensive evaluations, we typically mean that the cost of evaluating the objective (or other such function) is the dominant cost of the optimization process. If this is true, the effort required to determine the next candidate point to evaluate is essentially negligible, and so little attention need be given to topics such as efficient linear algebra or subproblem implementations, or memory management. Performing a significant amount of work within the algorithm in order to avoid even one extra evaluation is generally considered worthwhile. In this setting, some MBDFO software will store the full history of objective evaluations, which implicitly assumes the number of evaluations is not too large (e.g. the IBCDFO software tries to construct its models primarily by using already-evaluated points; see Section 4.4).
When evaluations are cheap, we need to pay more attention to these other factors impacting the performance of the optimization. However, as in many computational tasks, the level of effort that should be devoted to more efficient algorithm implementations depends on the context; whether solving an optimization problem takes 1 second or 10 seconds on a laptop is often not a critical distinction.
As described above, DFO is particularly important/useful in the expensive setting, but may also be used in the cheap setting if the situation requires it. In line with the expensive regime, numerical comparisons of DFO algorithms/implementations in the research literature are most commonly performed by comparing the number of evaluations required to solve a given problem, rather than other metrics such as number of iterations or runtime [105]. General advice on comparing optimization algorithms can be found in [19].
1.2 Model-Based DFO
In MBDFO, our broad goal is to use the algorithmic structures found in derivative-based algorithms, but replace local, Taylor series-based approximations for the objective (and/or constraints) with local models constructed by interpolation. Hence, a MBDFO algorithm maintains a small set of points in (of which one is usually the current iterate), and iteratively moves this set towards a solution. Although in principle this is a generic framework, in practice the underlying derivative-based algorithms used are trust-region methods [47]. Here, we control the maximum stepsize we are willing to take in any given iteration (balancing accurate approximations with fast convergence), and minimize Taylor-like models in a ball around the current iterate. Trust-region methods are a widely used class of methods for derivative-based optimization, alongside other approaches such as linesearches and regularization methods. In MBDFO, trust-region methods are much more widely used than these alternatives, because we always know in advance a region where our next iterate will be found, and it is much more natural to construct interpolation models when we know the exact region in which they need to be a good approximation.
A simple one-dimensional illustration of MBDFO is given in Figure 1.1. At the start of iteration , we have three interpolation points— (illustrated with a large circle) and two others (small circles)—which we use to construct a quadratic approximation (dashed line) to approximate the true objective (solid line). In practice, of course, the full function is not fully known to the algorithm and can only be sampled. We minimize inside the (shaded) trust region, for some , to get a tentative new iterate (illustrated with a star). We compute , and in this case observe that , and so we set as the next iterate. After accepting the step (and setting to help speed up convergence), in Figure 1(b), we have added as an interpolation point—and removed the left-most point from the interpolation set (illustrated with a cross)—to build a new model , which will subsequently be minimized inside .
Of course, in this article we will outline exactly how we select/update interpolation points, construct interpolation models, compute steps, and choose and . In particular, in higher dimensions we will need to ensure that the interpolation points are well-spaced, while remaining close to the current iterate, to ensure that a suitable model exists and is a sufficiently good approximation to the objective. Much of our theoretical discussion will be on interpolation set management.
1.3 Scope of this work
The goal of this work is to provide an accessible introduction to MBDFO. The intended audience is graduate students with an interest in nonlinear optimization, and researchers hoping to learn the fundamentals of this topic. Although the end of each section includes a discussion of important references and related works, it is not intended to be a comprehensive survey (see [96] for this), but rather to provide an overview of the fundamental algorithms, approximation techniques, and convergence theory used in the field. It aims to provide a deeper introduction than found in the books [111, 8]. The most similar existing work to this article is the book [51], but we include more recent advances (e.g. complexity analysis of algorithms, approximation theory in constrained regions, stochastic algorithms), and provide a more targeted analysis of core concepts (for example, we do not cover interpolation theory for higher order polynomial models), including some novel proofs of existing results. Some familiarity with nonlinear optimization techniques (e.g. from [111]), especially trust-region methods, would be helpful here but is not essential.
The largest part of our efforts is devoted to:
-
•
Introducing the basic (trust region) MBDFO algorithm and proving first- and second-order convergence and worst-case complexity bounds; and
-
•
Showing how linear and quadratic interpolation models can be constructed for use in MBDFO methods, and proving results quantifying their accuracy. We consider structured interpolation set choices yielding optimal algorithm bounds and their practical extension where points are selected from a database of existing evaluations (Section 4) and geometric approaches based on incremental/minimal updating of interpolation points (Section 5). These two approaches are both theoretically interesting and form the basis of practical implementations.
For both of these topics, we primarily consider the simplest case of unconstrained optimization problems with access to exact objective values. These two topics form the core from which more advanced algorithms can be built. For example, in later sections, we show how these ideas can be extended to handle constrained problems, and problems with inexact/noisy objective evaluations. At the end of each section, we provide a short overview of key references and related works, including more recent research directions.
Our focus here is on the key theoretical underpinnings of MBDFO. We largely avoid detailed discussions of practical considerations for software implementations, providing only occasional notes on this important topic (most notably on termination conditions, Section 3.3) and a list of some notable open-source MBDFO packages in Section 8. In particular, we give minimal attention to the numerical linear algebra required to solve the subproblems that arise in MBDFO (e.g. step calculation and interpolation model building), in line with the regime of expensive evaluations described in Section 1.1. All the methods describe here, except for Algorithm 7.2 in Section 7.2, are suitable for the expensive evaluation regime (see Remark 7.17 for a discussion of this point).
For brevity, we also avoid discussion of and comparison with other local DFO methods or global optimization methods, which often avoid using derivative information (see Section 1.1 and references therein for good resources on these methods). Although there are a wide range of techniques for global optimization, we note that Bayesian and surrogate optimization [133, 25, 126, 8] have similar principles to MBDFO, namely iteratively minimizing an easy-to-evaluate approximation to the objective built from evaluations at known points.
Structure
In Section 2 we give some preliminary technical results that we will use throughout, and introduces standard (i.e. derivative-based) trust-region methods for unconstrained optimization, including the general algorithmic framework, step calculation and convergence guarantees. In Section 3, we show how this framework extends to the MBDFO case, outlining what changes to the algorithmic framework are required, the requirements on the interpolation models, and the convergence guarantees. We then introduce the polynomial interpolation theory required to construct the models, and show how they satisfy the requirements from Section 3. In particular, Section 4 considers structured interpolation set construction and Section 5 extends these ideas to allow for incremental interpolation set updating. We then consider two important extensions of this theory: Section 6 covers problems with simple (e.g. bounds) and general nonlinear constraints, and Section 7 covers problems with deterministic and stochastic noise. In our summary, Section 8, we point to a selection of state-of-the-art software implementations of MBDFO.
Notation
Throughout, vectors and matrices will be written in bold (to distinguish from scalars), and functions will be written in bold if they have multiple outputs, e.g. but . The vector norm refers to the Euclidean 2-norm, and the matrix norm refers to the operator 2-norm (i.e. largest singular value). Other norms will be indicated explicitly, such as or . We will use to refer to the closed Euclidean ball centered at with radius , that is . The standard coordinate vectors in will be written as , where has a 1 in the -th entry, is the vector of all ones, will be the identity matrix, and will be a vector or matrix of zeros (depending on the context). For a matrix , we denote its Moore-Penrose pseudoinverse by . We use to denote the (2-norm) condition number of a matrix, .
2 Technical Preliminaries
2.1 Stationarity Conditions and Smoothness Assumptions
The largest focus of this work is to solve the unconstrained nonlinear optimization problem
| (2.1) |
where is a smooth (we will mostly assume continuously differentiable) but possibly nonconvex (i.e. not necessarily convex555The function is convex if for all and . An important consequence of convexity is that all local minimizers are global minimizers. If is known to be convex, (2.1) is typically much easier to solve and there are a plethora of specialized methods for this case [110].) objective function. In general, finding a global minimizer of a nonconvex function is very difficult and suffers from the curse of dimensionality666That is, the cost of finding a solution grows exponentially with the problem dimension . [36, Chapter 1.2.6], so we will restrict our consideration to finding local minima for (2.1).
Definition 2.1.
A point is a local minimizer of if there exists such that for all . The value is called the local minimum of .
For general problems such as (2.1), we need to specify what information about is known to any algorithm we develop, as this clearly constrains our available algorithmic choices. In nonlinear optimization, we typically use the oracle model for optimization algorithms: information about is available only via a given set of oracles, most commonly some/all of:
-
(a)
The zeroth-order oracle for is the map ;
-
(b)
The first-order oracle for is the map (provided is differentiable); and
-
(c)
The second-order oracle for is the map (provided is twice differentiable).
When we first explore classical (derivative-based) trust-region methods in Section 2.2, we will assume access to at least the zeroth- and first-order oracles for , with the second-order oracle often considered optional. When in Section 3 we begin to consider MBDFO algorithms, we will only assume access to a zeroth-order oracle.
Remark 2.2.
For the purposes of the theoretical analysis, we will assume is differentiable. However, all our algorithms are built assuming we cannot evaluate , for one (or more) of the reasons outlined above.
The limited problem information available in the oracle model means we need necessary or sufficient conditions for a point to be a local minimizer which can be checked using the oracles available to us.
Proposition 2.3 (Theorems 2.2–2.4, [111]).
Suppose and is continuously differentiable. Then,
-
(a)
If is a local minimizer of , then . (first-order necessary condition);
-
(b)
If is twice continuously differentiable and is a local minimizer of , then is positive semidefinite. (second-order necessary condition); and
-
(c)
If is twice continuously differentiable, and is positive definite, then is a local minimizer of . (second-order sufficient conditions).
In light of Proposition 2.3(a) and (b), we call points such that stationary or first-order optimal, and if both and is positive semidefinite then we say is second-order optimal.
In practice, our algorithms will be able to find points that approximately satisfy the first- or second-order necessary conditions. In the first-order case, this means finding a point with for some (small) tolerance . Since a second-order optimal point is defined by two necessary conditions and , our measure of approximate second-order optimality will be , where
| (2.2) |
That is, we hope to find a point with , which implies both and hold, the latter condition being equivalent to . For an algorithm with current iterate at iteration , we will denote and .
We will consider two different assumptions on our objective in (2.1), depending on whether we wish to prove convergence to first- or second-order optimal points. Our main discussion will focus on first-order convergence, in which case we will assume the following.
Assumption 2.4.
The function in (2.1) satisfies:
-
(a)
is continuously differentiable, and is Lipschitz continuous with constant ; and
-
(b)
is bounded below, there exists an such that for all .
Remark 2.5.
The global Lipschitz continuity of in Assumption 2.4(a) is very strong, excluding simple functions such as . As in [36], we make this assumption for ease of exposition, but in practice it can be weakened to being -Lipschitz continuous on any set containing all trust regions . For example, if for all iterates (i.e. our algorithm is monotone) and we have an upper bound on all trust-region radii, , then we can take . For problems with simple constraints (Section 6.1), we can restrict to the feasible region. Moreover, if is continuous and is bounded, then the existence of a suitable is automatic. This issue is discussed more in, for example, [47, Chapter 6.2.1] and [51, Chapter 10.2] for deterministic problems, and [41, Remark 4.2] for stochastic problems.
The most important consequence of Assumption 2.4 is a bound on the error in a first-order Taylor series for , specifically:
In the case where we are interested in convergence to second-order optimal points, we will use the following alternative smoothness assumption.
Assumption 2.7.
The function in (2.1) satisfies:
-
(a)
is twice continuously differentiable, and is Lipschitz continuous with constant ; and
-
(b)
is bounded below, there exists an such that for all .
Where Assumption 2.4(a) allows us to bound the first-order Taylor series error using Lemma 2.6, Assumption 2.7(a) allows us to bound the second-order Taylor series error via the bound [36, Corollary A.8.4]
| (2.4) |
Another consequence of Assumption 2.7(a) is that itself has a Lipchitz continuous gradient, and so a result similar to Lemma 2.6 applies, namely [111, Appendix A]
| (2.5) |
Although the main focus of this work is the solution of unconstrained problems (2.1), we will also consider problems with nonlinear constraints, of the form
| (2.6a) | ||||
| s.t. | (2.6b) | |||
| (2.6c) | ||||
The first-order optimality conditions for (2.6) are the Karush–Kuhn–Tucker (KKT) conditions: defining the Lagrangian as
| (2.7) |
where and , provided a suitable constraint qualification holds (see e.g. [111, Chapters 12.2 & 12.6]), a necessary condition for to be a local minimizer of (2.6) is that there exists such that
| (2.8a) | ||||
| (2.8b) | ||||
| (2.8c) | ||||
| (2.8d) | ||||
| (2.8e) | ||||
For an overview of introductory ideas from nonlinear optimization and more details on the above, see [111].
2.2 Trust-Region Methods
Trust-region methods are a popular class of algorithms for nonconvex optimization which have proven very successful in practice, featuring in the state-of-the-art codes GALAHAD [70] and KNitro [31], as well as in many of the methods from MATLAB’s Optimization Toolbox and SciPy’s optimization library, for example. Other common classes of algorithm are linesearch and regularization methods, which primarily differ in how global convergence777That is, convergence to a stationary point is guaranteed regardless of how far the algorithm starts from a solution, in contrast with local convergence, which assumes a starting point sufficiently close to a solution. is guaranteed. All of these classes are iterative, requiring the user to provide a starting point and generating a sequence which we hope will converge to a minimizer. They are also all based on iterative minimization of local approximations to the objective , usually Taylor series (or approximations thereof). In trust-region methods, the crucial ingredient that distinguishes it from the other classes is a positive scalar called the trust-region radius, updated at each iteration, that constrains the distance between consecutive iterates.
If we have zeroth- and first-order oracles for the objective , the key steps in one iteration of a trust-region method for solving (2.1) are:
-
1.
Build a (usually quadratic) model for that we expect to be accurate near the current iterate. This typically uses a second-order Taylor series for based at , with the Hessian potentially replaced with a quasi-Newton approximation such as the symmetric rank-1 update (see [111, Chapter 6]);
-
2.
Minimize the model in a neighborhood of the current iterate (the trust region, i.e. the region in which we trust the model to be accurate);
-
3.
Evaluate at the minimizer found in the previous step and decide whether or not to make this the new iterate, and update the trust-region radius. We accept a new iterate if it sufficiently decreases the objective, and increase the trust-region radius if we have been making good progress (to enable larger steps, hence faster progress to the minimizer), or decrease if we are not making progress (since our model is more accurate in a smaller neighborhood of the iterate).
This is formalized in Algorithm 2.1. The two most important aspects of Algorithm 2.1 are the model construction and step calculation in (2.9) and (2.10) respectively. Typical values for the parameters might be , , and , but these should not be viewed as prescriptive.
| (2.9) |
| (2.10) |
| (2.11) |
The core calculation in Algorithm 2.1 is the (approximate) solution of the trust-region subproblem (2.10). Ignoring the specific choice of model and centering the model at the current iterate, this subproblem has the general form
| (2.12) |
At first glance, solving (2.12) appears daunting: instead of our original problem of minimizing , now at each iteration we need to minimize a different function, but with constraints!
However, the special structure of (2.12)—minimizing a (possibly nonconvex) quadratic objective subject to single a Euclidean ball constraint—allows the efficient calculation of exact (global) minimizers via a one-dimensional search over the Lagrange mulitplier for the (squared) constraint .
Implementing an efficient algorithm to do this requires some effort; see [47, Chapter 7.3] or the newer [71] for details. To find a global minimizer, the dimension must not be too large, as the subproblem solver requires computing Cholesky factorizations of for several different values of , with a cost of operations each time.
Approximate Subproblem Solutions
However, we do not necessarily need the global solution to the trust-region subproblem for Algorithm 2.1 to converge. A very simple approximate solution can be found by performing 1 iteration of gradient descent with exact linesearch. That is, our approximate solution to (2.12) is of the form for . Restricting the objective to this ray, we get subject to . This is easy to globally minimize in , yielding the so-called Cauchy point
| (2.13) |
By direct computation, we can get a lower bound on how much the Cauchy point decreases the quadratic model.
It turns out that (2.14) is in fact sufficient to achieve first-order convergence of Algorithm 2.1. Hence, after re-introducing into the model as per (2.9), we make the following assumption regarding (2.10).
Assumption 2.9.
If is large and computation of the global minimizer of (2.12) is impractical, there are still alternative methods for improving on the Cauchy point. The most common is an adaptation of the conjugate gradient (CG) method for solving symmetric positive definite linear systems [111, Chapter 5.1]. In this method, sometimes called the Steihaug-Toint method, we use CG to solve starting from initial iterate . If any iteration would take us outside the feasible region, we truncate the step so we remain on the boundary of the feasible region and terminate. Similarly, if, at any iteration, we compute a search direction such that , then is not positive definite, and we move from our current CG iterate in the direction until we reach the boundary of the feasible region. This method has the advantage of only requiring via Hessian-vector products, and so is suitable for large-scale problems, and the first iterate is always the Cauchy point, so Assumption 2.9 is guaranteed. More details of this and other approximate subproblem solvers can be found in [47, Chapter 7.5].
To achieve second-order convergence of Algorithm 2.1, we first note that the quadratic model (2.9) has its own second-order optimality measure, namely
| (2.16) |
although here we will assume and , and so and . Here, we need a stronger assumption on our trust-region subproblem solution, to handle the case where is not positive semidefinite.888For example, if and then satisfies Assumption 2.9. Suppose that (i.e. ), and let be a normalized eigenvector corresponding to that eigenvalue, which is also a first-order descent direction; that is,
| (2.17) |
The last condition may be achieved by judicious sign choice in the eigenvector calculation. Similar to the definition of the Cauchy point (2.13), the eigenstep is the point in the direction that minimizes the model (subject to ). Since we have
| (2.18) |
we conclude that is decreasing as increases, so we have since . We then compute
| (2.19) |
This motivates the following assumption on our step calculation for second-order convergence, where we now require our step to be at least as good as the Cauchy step and, if (i.e. ), the eigenstep.
2.3 Convergence of Trust-Region Methods
We conclude this section by summarizing the convergence of Algorithm 2.1 to first- and second-order optimal points. Our results will also show the worst-case complexity of Algorithm 2.1, which essentially is a rate of convergence (i.e. how many iterations are needed to achieve or ?). We assume the following about the quadratic model (2.9).
Assumption 2.11.
Our main first-order convergence result is the following.
Theorem 2.12 (Theorem 2.3.7, [36]).
This tells us that there is a subsequence of iterations whose gradients converge to zero, and that we get for the first time after at most iterations. In practice, we typically terminate Algorithm 2.1 when is sufficiently small or we exceed some computational budget (e.g. maximum number of iterations).
Algorithm 2.1 to converge to second-order optimal solutions, we have to strengthen Assumption 2.11 to also require .
Theorem 2.13 (Theorem 3.2.6, [36]).
In summary, if we want to achieve optimality level , then trust-region methods require iterations to achieve first-order optimality101010This follows from Theorem 2.12, but the same result holds if we assume exact second-order models, [36, Theorem 3.2.1]. and iterations to achieve second-order optimality, assuming sufficient problem smoothness, model accuracy and subproblem solution quality.
Notes and References
3 Derivative-Free Trust-Region Methods
We now consider how to adapt the generic trust-region method from Section 2.2 to the case where we only have a zeroth order oracle for the objective. That is, we are solving (2.1), where is still differentiable (e.g. Assumption 2.4), but where we do not have access to in our algorithm.
As before, at each iteration we will construct a local quadratic model around , namely
| (3.1) |
noting that, unlike (2.9), we do not require (i.e. is allowed). For now, we will not define a specific way to construct the local quadratic model (3.1) using only function values. Instead, we will focus on simply defining some practical assumptions on the model accuracy (i.e. replacing Assumption 2.11). In Sections 4 and 5 we will outline concrete ways our model accuracy requirements can be specified.
Motivation: finite difference gradients
In the simplest case, imagine we do not have access to , and simply build in (3.1) from (forward) finite differences:
| (3.2) |
for some small , where is the -th coordinate vector. From standard analysis of finite differencing (e.g. [111, Chapter 8.1]), if is twice continuously differentiable with for all in , then . That is, the gradient error is of size as . Although derivative-based trust-region methods can handle inexact gradients [47, Chapter 8.4], they require a bound on the relative gradient error, whereas finite differencing gives an absolute error in the model gradient. We can only control the model error by choosing the value of , perhaps differently at each iteration. It is not obvious how we could pick if we instead wanted to control the relative gradient error, at least not without already having a good estimate of and (i.e. unless we already have the derivative information we are trying to calculate!).
Now, suppose we build a quadratic model (3.1) using this and any bounded Hessian (c.f. Assumption 2.119). Given , and hence , the error in the model (3.1) at a point is
| (3.3) | |||
| (3.4) |
So, if we want to approximate the objective within the trust region, , it makes sense to balance both error terms and set , leading to a model error of size .
Fully linear models
In our general MBDFO approach, we will use the trust-region radius to control the size of the model error (e.g. at iteration , use in (3.2), as suggested above). This suggests to us the following notion of model accuracy.
Definition 3.1.
Suppose we have and . A local model approximating is fully linear in if there exist constants , independent of , and , such that
| (3.5a) | ||||
| (3.5b) | ||||
for all . Sometimes we will use for notational convenience.
Remark 3.2.
‘Fully linear’ here refers not to the model being linear, but the model being as accurate an approximation (up to constants) as a linear Taylor series. Quadratic models as in (3.1) can be fully linear. However, in Sections 4 and 5 we will sometimes consider the case where (3.1) is indeed linear (i.e. ). For simplicity, we will generically refer to as a ‘quadratic model’ even if , unless we specifically wish to draw attention to being linear.
If we do indeed have access to a first-order oracle (i.e. not in the MBDFO case), then we can satisfy Definition 3.1 using the approaches from Section 2.2. For example, if Assumption 2.4(a) holds (and so we have Lemma 2.6):
-
•
The first-order Taylor series is fully linear in with and .
-
•
More generally, any quadratic model satisfying Assumption 2.11 is fully linear in with and .
In Sections 4 and 5 we will show how models satisfying Definition 3.1 may be constructed using only function values.
Assumption 3.3.
It is clear from (3.5) that we have to contend with absolute model errors, controlled by the size of (which will be set to the trust-region radius in our algorithm). This introduces several difficulties that are not present in the derivative-based case:
- •
-
•
If our model suggests we are close to a first-order solution (i.e. ), this does not mean that we are actually near a solution (i.e. ). By comparison, if we have relative errors in the model gradient, then if and only if [47, Lemma 8.4.1].
-
•
As a consequence of the above point, it is no longer clear when to terminate the algorithm in practice. If we had gradients (or relative gradient errors), then terminating when is sufficiently small guarantees we are close to a (first-order) solution. We specifically discuss termination in Section 3.3.
Unfortunately, this additional complexity is a consequence of having a weaker notion of model accuracy, one which can be practically achieved using only function evaluations.
3.1 First-Order Convergence
We are now ready to state our MBDFO variant of Algorithm 2.1, where our Taylor-based models (i.e. (2.9) satisfying Assumption 2.11) are replaced with fully linear models (i.e. (3.1) satisfying Assumption 3.3). It is almost identical to the derivative-based version Algorithm 2.1, including allowing inexact subproblem solutions (satisfying the Cauchy decrease condition, Assumption 2.9).
Aside from our new model accuracy condition (Assumption 3.3 instead of Assumption 2.11), the only difference is that we need to explicitly check our criticality measure121212That is, our measure of optimality, in this case (for first-order convergence)., and require for a step to be declared (very) successful. This is an important feature of MBDFO methods, and aims to address the decoupling of our measured distance to a solution from our true distance to solution . This mechanism ensures that, for successful iterations, if is large (i.e. we are far from first-order optimality) then so is (i.e. the model knows we are far from optimality); see (3.24) below.
Remark 3.4.
If at any iteration, then that iteration must be unsuccessful regardless of the value of . So, if this check is performed as soon as the model is constructed, we may save our efforts by not solving the trust-region subproblem or evaluating .
We now present our analysis of Algorithm 3.1.
Lemma 3.5.
Proof.
The crucial consequence of Lemma 3.5 is that remains large provided we have not yet converged.
Lemma 3.6.
Proof.
We proceed by induction. The result holds trivially for , so suppose for some . To find a contradiction assume that . Then , which by the mechanism for updating the trust-region radius means iteration was unsuccessful, and . From Lemma 3.5, this means we must have
| (3.12) |
But since is fully linear (Assumption 3.3), we get
| (3.13) |
which contradicts . ∎
Our main convergence result is the following. The arguments here can largely be used to prove the equivalent result for derivative-based methods, Theorem 2.12.
Theorem 3.7.
Proof.
We first partition the iterations , where is the set of successful or very successful iterations and is the set of unsuccessful iterations. From Lemma 3.6 we have for all .
Since if and only if , we have
| (3.15) | ||||
| (3.16) |
where the last inequality uses for all . If , then Assumptions 2.9 and 3.3(b), together with the criticality requirement , imply
| (3.17) | ||||
| (3.18) |
We conclude that
| (3.19) |
or
| (3.20) |
Separately, the mechanism for updating ensures that if and if . Hence
| (3.21) |
which gives
| (3.22) |
The result then follows from . ∎
We again summarize our result in terms of the key quantities of interest. Here, we recall (see Definition 3.1) and note that as and , to get the following first-order convergence result.
Corollary 3.8.
In the derivative-based case (Theorem 2.12), using corresponding to a Taylor model, the worst-case complexity was iterations. The bound in Corollary 3.8 is worse by a factor of , which arises from using to get (3.18) (see Remark 3.9).
We also have to assess the impact of using interpolation-based models on . In Section 4, we show that is usually of size , the same as for Taylor models, but with a constant that depends explicitly on the problem dimension , e.g. in Theorem 4.2. In some special cases, though, we can recover dimension-independent (Corollary 4.5). However, we will always get an explicit dimension dependency of at least if we count the number of (possibly expensive) objective evaluations.131313In the derivative-based setting, this does not arise because every evaluation of gives pieces of information.
Remark 3.9.
There are multiple ways to establish a model decrease of at least on successful iterations, as in (3.18) above. In derivative-based trust-region methods, when , we have
| (3.23) |
immediately from the assumption , and we never need to check (i.e. we may set ). In the DFO setting, if we know that is fully linear then we may instead use
| (3.24) |
to get . The approach in (3.18) is the most general, as it only requires the and does not require anything regarding model accuracy. The same flexibility appears in the second-order convergence theory (below).
3.2 Second-Order Convergence
Similar to Theorem 2.13, we now consider how to extend the above results to converge to second-order critical points. Just as in the derivative-based case, we need to assume is twice continuously differentiable (specifically, Assumption 2.7), and our trust-region subproblem solver needs to achieve at least the eigenstep decrease (Assumption 2.10).
However, motivated by (2.4), we need a stricter requirement on our model accuracy than fully linear. Where fully linear models match (up to constants) the error from a first-order Taylor series, we now require models which match the error from a second-order Taylor series.
Definition 3.10.
Suppose we have and . A local model approximating is fully quadratic in if there exist constants , independent of , and , such that
| (3.25a) | ||||
| (3.25b) | ||||
| (3.25c) | ||||
for all . Sometimes we will use for notational convenience.
If Assumption 2.7 and hence (2.4) and (2.5) hold, it is not hard to verify that the second-order Taylor series is fully quadratic in with , and .
Our model assumptions mimic Assumption 3.3 closely.
Assumption 3.11.
Then, our second-order algorithm is given in Algorithm 3.2. Compared to the first-order algorithm Algorithm 3.1, we require fully quadratic models rather than fully linear models, assume the eigenstep decrease from the trust-region subproblem solver, and compare with (2.16) to declare a step successful, instead of comparing just with . Additionally, we cap at a maximum level , a technicality needed to bound the error between the true and estimated criticality measures (Lemma 3.12). Ultimately, this comes from the need to control the size of the gradient error (3.25b): reflecting the ‘overloaded’ role of the trust-region radius, we have to cap the radius (and hence the size of the step ) solely because the radius also acts as the measure of model accuracy.
First, we show that fully quadratic models yield sufficiently good estimates of all the different criticality measures under consideration (see (2.2) and (2.16)).
Lemma 3.12.
Proof.
Note that the trust-region updating mechanism in Algorithm 3.2 and ensures for all iterations .
First, we have . Secondly, if is a normalized eigenvector corresponding to , then
| (3.26) |
and instead taking to be a normalized eigenvector for we get the reverse result , and so all together we get . The result then follows from the identity141414To show this, suppose . Then and a similar argument for the opposite direction. . ∎
We now state the second-order equivalent of Lemma 3.5, which is split into two results for the two criticality measures and .
Lemma 3.13.
Proof.
Lemma 3.14.
We now get our usual result, that is bounded away from zero provided we are not close to optimality.
Lemma 3.15.
Proof.
We again proceed by induction. The result holds trivially for , so suppose for some . To find a contradiction assume that . Then , which by the mechanism for updating the trust-region radius means iteration was unsuccessful, and .
We now get our main second-order complexity result.
Theorem 3.16.
Proof.
We again partition the iterations , where is the set of successful or very successful iterations and is the set of unsuccessful iterations. From Lemma 3.15, we have for all .
If , then the criticality requirement implies that either or . First, if , then Assumptions 2.10 and 3.11(b) give
| (3.38) |
Instead, if , then Assumption 2.10 gives
| (3.39) |
In either case, we always have
| (3.40) |
By the same reasoning as (3.16) in the proof of first-order convergence (Theorem 3.7), we have
| (3.41) |
or
| (3.42) |
The trust-region updating mechanism again gives us (3.22), and the result follows from . ∎
Recalling here the notation (see Definition 3.10) and noting that and as and , we get the following second-order convergence result.
Corollary 3.17.
This bound again matches from the derivative-based case (Theorem 2.13), but is again larger by a factor of arising from the same issue as the first-order case (Remark 3.9). Again, although —and now also —will usually depend on explicitly (Theorem 4.6), they can be made dimension-independent (Corollary 4.9), but we will always need at least objective evaluations (to compensate for the loss of information from the first- and second-order oracles).
3.3 Termination
We end this section by briefly discussing how to terminate MBDFO algorithms in practice.
In all nonlinear optimization methods, in practice it is common to terminate after some fixed budget. For derivative-based methods, this may be a maximum number of iterations or (particularly for large-scale problems) a maximum runtime. In MBDFO, where the objective may be very expensive to evaluate, it is more common to have a maximum number of objective evaluations (since we may have multiple evaluations per iteration, depending on how exactly the model (3.1) is constructed).
Separately, it is important to have a termination condition based on optimality; i.e. the algorithm terminates because it has reached a sufficiently accurate solution. In the derivative-based case, we have direct access to optimality measures such as , and so we can simply terminate if for some user-defined tolerance . Of course, this is not possible in the MBDFO case, as we no longer have .
Although we have access to at each iteration, a small value of does not always imply a small value of (the condition we actually want to be true). If the model is fully linear and , then we do have from (3.24), and so (again where is a user-defined tolerance) would imply . For this to be an appropriate termination condition, we require a certification that is fully linear.
In practice, however, MBDFO methods usually terminate with ‘optimality’ if is sufficiently small. This is theoretically justified, as we know that never gets too small if we remain far from optimality (e.g. Lemma 3.6). We also have the following result, to provide further comfort that we can terminate on small (noting that all our main results above are related to convergence over a subsequence of iterations, not the full sequence of iterations).
Proof.
We first consider the case where there are finitely many (very) successful iterations. This means that iteration is unsuccessful for all sufficiently large, say for fixed . That is, for for some . So, as .
Instead, suppose there are infinitely many (very) successful iterations. The below reasoning applies equally to either algorithm. For any such iteration , from Assumptions 3.3 and 2.9, the same reasoning as used to get (3.19)—without replacing with the lower bound —gives
| (3.43) |
where here is the (infinite) set of all successful or very successful iterations. That is, we have . Since is infinite, it must be that . For any sufficiently large (i.e. so that there was at least one (very) successful iteration before ), the trust-region updating mechanism guarantees , where is the last (very) successful iteration before , and so the result holds. ∎
The below result additionally tells us that terminating after the first iteration when drops below some termination threshold guarantees that the current iterate is a good solution. Specifically, we get that .
Lemma 3.19.
Proof.
Since , we know that . Moreover, since is only decreased on unsuccessful iterations, we know that iteration is unsuccessful and . From Lemma 3.6, for iteration to be unsuccessful we must have
| (3.45) |
Combined with we get
| (3.46) |
Therefore, since is fully linear, we get
| (3.47) | ||||
| (3.48) |
as required. ∎
Of course, Lemma 3.18 ensures that we will always terminate in finite time, for any choice of termination value .
Remark 3.20.
Notes and References
The first convergence theory for an algorithm similar to Algorithm 3.1 was given in [49], but the framework studied here using fully linear/fully quadratic models was introduced later in [50] and expanded in [51]. The extension of these results to include worst-case complexity bounds was first given in the PhD thesis [90], with the first-order complexity later published in [67]. To the best of the author’s knowledge, the first use of the simplified algorithmic framework assuming fully linear models at all iterations (and checking criticality within the trust-region updating) was in [17].
Currently, one of the most active research directions in MBDFO is adapting this framework to be better suited to large-scale problems (i.e. where the dimension is large, roughly ). The most promising direction seems to be methods that construct models in low-dimensional subspaces at each iteration, where the subspaces may be selected randomly [38, 60, 39] or using deterministic methods [151, 159].
4 Interpolation Model Construction
In Section 3, we introduced a basic MBDFO trust-region method. To make this method concrete, we need a way to construct quadratic models (3.1) at each iteration that are sufficiently accurate to guarantee convergence (i.e. satisfying Assumptions 3.3 or 3.11). In particular we focus on the fully linear/fully quadratic requirements Assumptions 3.3(a) and 3.11(a). Of course, we must be able to form these models using only zeroth-order information about the objective , which we will achieve by constructing models that interpolate over carefully chosen collections of points.
We first consider generating good collections of points for linear/quadratic interpolation to guarantee fully linear/fully quadratic models. We derive generic error bounds in terms of the linear algebra associated with the interpolation system, and consider special cases where points are sampled in a regular way around the current iterate (e.g. perturbations along coordinate axes). These special cases are useful to understand the theoretical capabilities of MBDFO methods, but this process of choosing a structured interpolation set is also practically useful. As outlined in Section 4.4, in the IBCDFO software collection (see Section 8), an interpolation set is built by appending suitable points one-by-one from the full history of the solver; the error bounds we derive are used to ensure ‘suitability’ and to complete the set of points if not enough historical points are suitable.
To simplify the derivation of error bounds, we will use the following result, which says that fully linear/fully quadratic models can be achieved by ensuring sufficiently accurate approximations to a Taylor series.
Lemma 4.1.
Suppose we have a quadratic model approximating near a point ,
| (4.1) |
Then, for any , the following results hold.
- (a)
- (b)
Proof.
We will use the technical results in Appendix A, which say that if two linear/quadratic functions are close in , then their coefficients are also close.
(a) Fix any , and use Lemma 2.6 to get
| (4.6) | ||||
| (4.7) |
and we get the value for after . Next, for any we have
| (4.8) |
and so Lemma A.1 gives . So, for an arbitrary , we use to get
| (4.9) | ||||
| (4.10) |
and we get the value of .
(b) First, to get , we follow the same reasoning as for (4.7) but replacing Lemma 2.6 with (2.4) to get
| (4.11) |
for any . Next, we use Lemma A.2 on (4.4) to get
| (4.12) |
So, for any we have
| (4.13) | ||||
| (4.14) |
where the last inequality uses (2.5), and we get the value for . Lastly, we have
| (4.15) |
and we get . ∎
Our goal is now to show that linear/quadratic interpolation models for can provide sufficiently good approximations to a Taylor series for , from which Lemma 4.1 tells us that the model is fully linear/fully quadratic, as needed for our algorithms.
4.1 Linear Interpolation
In the first instance, we will try to construct fully linear models (Definition 3.1) using linear interpolation. Here, for our objective function , we will build a local linear model around a base point (which will be at the -th iteration of our algorithm),
| (4.16) |
for some and ; i.e. taking in (3.1).151515This then satisfies Assumption 3.3(b) with . We choose the unknowns ( and ) to interpolate known values of at points, . That is, we pick and such that for all . This reduces to solving the following linear system:
| (4.17) |
We shall see that, for the resulting and to give a fully linear model (Definition 3.1) in a ball around the base point, —where at iteration we will usually have , our current iterate and , our current trust-region radius—the interpolation points will have to be close to the base point, (i.e. for a constant not much larger than 1). However, as we saw in Lemma 3.18, the trust-region radius as , and so the matrix in (4.17) becomes increasingly ill-conditioned as the algorithm progresses (as each column of except the first approaches zero).
To avoid this ill-conditioning, we can rescale the second through last columns161616This is a form of equilibration [69, Chapter 3.5.2]. of the linear system (4.17) by to replace with , to get
| (4.18) |
Now, we expect that all , and we avoid ill-conditioning resulting from . This rescaling does not guarantee that the new linear system (4.18) is well-conditioned—we will ensure this by judicious choice of the —but it does ensure that the magnitude of is not a source of ill-conditioning.
In this setting, since all , we know that is not too large, since . Hence if is invertible we have , where denotes the (2-norm) matrix condition number, so any ill-conditioning in is entirely driven by the size of . This is reflected in the following interpolation error bound, where we assume for some , and both and appear explicitly.
Theorem 4.2.
Proof.
Remark 4.3.
Theorem 4.2 gives the error bounds in terms of , as opposed to the 2-norm which occurs in the corresponding bounds in earlier works, e.g. [51]. This does not significantly change our results, but the proof approach where we control the model error relative to a Taylor series via (4.25) generalizes to interpolation theory in constrained regions (Section 6.1) and naturally yields bounds in terms of .
All that remains is to find a set of interpolation points so that and is not too large. The simplest way to do this is to use and perturbations around of size . The most natural way to do this is via coordinate perturbations around , i.e. taking our interpolation set to be
| (4.27) |
This gives us the following worst-case complexity bound for Algorithm 3.1, building on the general result Corollary 3.8.
Corollary 4.4.
Proof.
Given the interpolation points (4.27), we have
| (4.28) |
so and hence the fully linear constants from Theorem 4.2 satisfy . We then apply Corollary 3.8 with in Assumption 3.3(b) to get the iteration bound. The objective evaluation bound comes from observing that our interpolation mechanism requires evaluating at no more than points per iteration. ∎
Using the interpolation set (4.27) is equivalent to building a linear Taylor model using finite differencing (3.2) with the dynamically adjusted stepsize to estimate . However, it is this dynamic adjustment that is at the core of MBDFO, reflecting that we need to adjust our interpolation set depending on the required accuracy in the model (controlled by ).
The dimension dependency in Corollary 4.4 can be improved by considering coordinate perturbations of size .
Corollary 4.5.
Proof.
So, at least in terms of the explicit dependence on , Corollary 4.5 gives an iteration bound that matches the derivative-based case, with a evaluation bound that comes from the fact that we need objective evaluations to match the same amount of problem information as one gradient evaluation (i.e. real numbers).
However, taking an even smaller perturbation size cannot reduce to a smaller value (e.g. ) to further improve the worst-case complexity, since always has a component of size from Assumption 2.4 that is independent of the model construction.
4.2 Quadratic Interpolation
Now that we have a method for constructing fully linear models, which can be used to achieve convergence to first-order stationary points, we now consider the construction of fully quadratic models (Definition 3.10) using quadratic interpolation, which will allow us to find second-order stationary points using Algorithm 3.2.
In this case, we have an objective and we now build a local quadratic model around a base point (which again will usually be in the -th iteration of our algorithm),
| (4.29) |
for some , and symmetric. This model has degrees of freedom171717Note we always take to be the number of interpolation points, so the value of here is different to the linear case in Section 4.1., and so we again construct to interpolate at points .
To write this interpolation problem efficiently, we introduce the natural basis for quadratic functions , which we denote , defined as
| (4.30) |
With this basis, and noting , the model (4.29) can be written as
| (4.31) |
where contains the upper triangular elements of , arranged as
| (4.32) |
With these definitions, the interpolation problem is given by the linear system
| (4.33) |
Just like the linear case, the matrix can become very ill-conditioned if , a typical value for , gets very small. To address this, we define and instead solve
| (4.34) |
where and .
As before, since , we know that is not too large, and so—provided is invertible—the only potential source of ill-conditioning is if is large. As such, the size of determines the fully quadratic error bounds.
Theorem 4.6.
Proof.
For ease of notation, define as the second-order Taylor series for at . For , the th row of (4.34) gives and so
| (4.36) |
using (2.4). Now we choose and define , giving . Since is invertible, so is , and so we let be the unique solution to . Scaling rows 2 through of this equation by and rows through by , we get
| (4.37) |
We then get
| (4.38) | ||||
| (4.39) | ||||
| (4.40) | ||||
| (4.41) |
where the inequality comes from (4.36). We then use , and it remains to bound . To do this, we use the definition of (4.30) to write
| (4.42) |
Using Young’s inequality, for , we get
| (4.43) | ||||
| (4.44) | ||||
| (4.45) | ||||
| (4.46) |
and so combining with and , we get , which gives
| (4.47) |
The result then follows from Lemma 4.1(b). ∎
One way to choose the sample points to ensure that is not too large is to use along with the positive and negative perturbations along the coordinate axes for all , and for all combinations . That is,
| (4.48) |
Lemma 4.7.
For the interpolation set given by (4.48), we have .
Proof.
Suppose we solve for some right-hand side with and , and where . Then and so it suffices to find an upper bound on .
The choice of interpolation points (4.48) gives
| (4.49) |
in the linear system (4.34). The first row of (4.34) immediately gives and so since .
Next, for , the values and (corresponding to ) completely determine and via rows and of (4.34), namely
| (4.50) |
which gives and , and so and from .
Lastly, in , the column corresponding to for only has one nonzero entry, a 1 in the -th row corresponding to . Hence we get and so . ∎
All together, if we run our MBDFO algorithm Algorithm 3.2 with quadratic models given by solving (4.33) with points (4.48) at each iteration, we get the below second-order complexity result.
Corollary 4.8.
Under the assumptions of Theorem 3.16 and the sequence is uniformly bounded181818This is required for the derivative-based theory, via Assumption 2.119 and in Theorem 2.13., if the local quadratic model at each iteration of Algorithm 3.2 is generated by quadratic interpolation to the points (4.48) (with and ), then the iterates of Algorithm 3.2 achieve second-order optimality for the first time after at most iterations and objective evaluations.
Proof.
We note that all the interpolation points (4.48) satisfy (i.e. in the assumptions of Theorem 4.6). From Theorem 4.6 and Lemma 4.7, the model at each iteration is fully quadratic with . We also get from (3.25c) and so . We then apply Corollary 3.17 to get the iteration bound. The objective evaluation bound comes from observing that our interpolation mechanism requires evaluating at no more than points per iteration. ∎
Just as in the linear interpolation case (Corollary 4.5), we can improve the explicit dependence on dimension in the worst-case complexity bound by making the coordinate perturbations depend on the dimension. However, this construction does not align with a single interpolation set, rather it uses different objective evaluations to construct as in iteration . Specifically, we use the central difference approximation to the gradient with stepsize , and forward difference approximation to the Hessian with stepsize :
| (4.51a) | ||||
| (4.51b) | ||||
for all , which requires objective evaluations because the calculation of cannot re-use any objective evaluations used to construct .
Corollary 4.9.
Under the assumptions of Theorem 3.16 and the sequence is uniformly bounded, if the local quadratic model at each iteration of Algorithm 3.2 is generated as in (4.51), then the iterates of Algorithm 3.2 achieve second-order optimality for the first time after at most iterations and objective evaluations.
Proof.
From [21, Theorem 2.2] or [59, Lemma 5] we have and from [32, Proposition 2.7] or [59, Lemma 6] we have . By arguments similar to the proof of Lemma 4.1(b), the model is fully quadratic with (i.e. no explicit dependence on ). The remainder of the proof follows that of Corollary 4.8, but with the slightly larger objective evaluations per iteration. ∎
In terms of the dependency on , this objective evaluation bound again matches what we may expect from a derivative-based method using (genuine small-) finite differencing to approximate gradients and Hessians.
4.3 Underdetermined Quadratic Interpolation
So far, we have the option of constructing either fully linear models with linear interpolation (using points), or constructing fully quadratic models with quadratic interpolation (using points). In most cases, quadratic models give much better practical performance because they can adapt to the local curvature of , and so should be preferred over linear models, but here they come with the downside of requiring significantly more interpolation points—and hence objective evaluations—at each iteration. In this section, we outline an approach for constructing quadratic models (4.29) with interpolation points. In this case we have more degrees of freedom in the model than available evaluations, and so the interpolation problem is underdetermined (i.e. there are infinitely many quadratic functions satisfying the interpolation conditions). We describe one particular approach, minimum Frobenius norm underdetermined quadratic interpolation.
To that end, suppose we again have interpolation points , but now with . In general, there are infinitely many quadratic models (4.29) satisfying for all . Of these, we pick the model for which the Hessian has minimal Frobenius norm:
| (4.52) |
Since is linear in the coefficients of , and , this is a convex quadratic program with equality constraints, and so can be solved via a linear system.
Lemma 4.10 (Section 2, [119]).
Remark 4.11.
If , then we can still get a model by solving (4.52) via (4.65). However, since we have sufficient degrees of freedom that the interpolation conditions uniquely define the model, there is only one valid (corresponding to fully quadratic interpolation), and so we are better off solving the smaller system (4.33). However, this does mean that all the below results also apply to fully quadratic models.
Once again, if is a typical size of and we define for all , we get the scaled interpolation system
| (4.79) |
where and has -th row (c.f. (4.18)) We get the coefficients and , and so gives .
Remark 4.12.
We note that and are symmetric, and both and are positive semidefinite. This can be seen, e.g. for , by observing that for any we have
| (4.80) |
Hence and yield saddle point linear systems, which have been widely studied [20].
Since we choose to ensure is small, we can provide an upper bound on the size of the model Hessian. This will be necessary for us to apply Lemma 4.1(a) to show that the model is fully linear.
Lemma 4.13.
Proof.
By suitably modifying and , the model Hessian does not change if the objective function is changed by adding a linear function to it.202020From changing and suitably, any Hessian that interpolates also interpolates plus a linear function, so the minimizer must be the same. Hence, we consider interpolating to the function , and so by Lemma 2.6 we have , and so all terms in the right-hand side of (4.79) are bounded by . This means that for all we have
| (4.82) |
Finally, we have
| (4.83) |
and the result follows. ∎
We can now establish that our minimum Frobenius norm quadratic interpolation model is indeed fully linear.
Theorem 4.14.
Proof.
Fix and define . Since is invertible, has full column rank [20, Theorem 3.3], so the rows of span . Hence, the system is consistent. So, we take to be the minimal norm solution, with by similar reasoning as in the proof of Theorem 4.2. Multiplying the last rows of this (consistent) equation by , we again get (4.21) for our new . By following the same argument as for (4.24) and using the interpolation condition we get
| (4.85) | ||||
| (4.86) | ||||
| (4.87) |
and so
| (4.88) |
which finally gives
| (4.89) | ||||
| (4.90) |
The result then follows from Lemma 4.1(a). ∎
A popular structured interpolation set used for minimum Frobenius norm interpolation is , and for , with . This gives the associated interpolation matrix
| (4.101) |
and so . Applying Lemma 4.13, we get the bound . Combining with (which can be verified by direct computation), Theorem 4.14 tells us that the model is fully linear with . Applying these results to Corollary 3.8, we get a worst-case complexity for Algorithm 3.1 of iterations and objective evaluations. This is significantly worse than linear and even (fully) quadratic interpolation!
However, luckily for us it turns out that the bound from Lemma 4.13 is overly conservative in this case, because only the first row of has an absolute row sum of size , and this row is only used to calculate the model coefficient . Moreover, only contributes to the model Hessian via , so since , it does not affect the Hessian. Indeed, using the explicit form of (4.101), we compute that for , and so the model Hessian is
| (4.102) |
As per the proof of Lemma 4.13, we may without loss of generality assume , and so we get the improved bound . Hence we may take , a significant improvement over Lemma 4.13.
Taking from Theorem 4.14 with our tighter bound in Corollary 3.8, we arrive at the same worst-case complexity as linear interpolation.
Corollary 4.15.
Again, by scaling the perturbations in terms of the problem dimension, we can get improved dependence on dimension in the worst-case complexity bounds.
Corollary 4.16.
Under the assumptions of Theorem 3.7, and if for all , if the local quadratic model at each iteration of in Algorithm 3.1 is generated by minimum Frobenius norm quadratic interpolation to the points , then the iterates of Algorithm 3.1 achieve for the first time after at most iterations and objective evaluations.
Proof.
Remark 4.17.
A simple modification of this approach, which can be beneficial in practice, is to look at the minimum change in the Hessian between successive iterations of our main algorithm. That is, given some old Hessian approximation , we choose where solves
| (4.103) |
The corresponding linear system is the same as (4.65) but where the entries in the right-hand side are replaced with , and where . Following the proof of Lemma 4.13, we get and so our model Hessian can potentially grow rapidly,
| (4.104) |
Unfortunately this rate of growth in the model Hessian is sufficient that global convergence results do not apply.212121Trust-region methods can converge in the case of unbounded model Hessians, but the growth cannot exceed ; see [142, 58] for more details.
4.4 Practical Model Construction
We have seen that choosing interpolation sets based on structured perturbations around can yield fully linear/fully quadratic models. Although these can yield very good worst-case complexity bounds (e.g. Corollary 4.5), they do not allow us to reuse any existing objective evaluations during model construction. Given we often turn to MBDFO methods when objective evaluations are expensive, a pragmatic approach would allow us to construct models based on existing objective evaluations, while controlling the size of the relevant matrix norms.
To do this, given , and a collection of existing evaluations that are sufficiently close to , we first try to find a subset of that (together with ) is suitable for linear interpolation. If we cannot find points in that yield a sufficiently good linear interpolation set, we augment this set based on the null space of selected perturbations around (i.e. for selected points ). By maintaining a QR factorization of the selected perturbations, we can control (4.18) directly based on the diagonal entries of the matrix [147, Lemma 2.4]. This procedure is formalized in [146, Figure 4.2], and extended to bound-constrained problems in [53].
We can then optionally choose to extend the selected interpolation set by adding more points (that are sufficiently close to ) while ensuring that (4.79) is not too large. Two approaches for doing this are:
-
•
Ensuring the new point added does not significantly increase by using a QR factorization of and a Cholesky-like factorization of , as described in [148, Algorithm 4.2];
-
•
Ensuring the new point added does not significantly increase by controlling the Schur complement of the resulting after the new point is added [130, Algorithm 5.2].
The first of these approaches is used in the IBCDFO software collection (see Section 8).
In both these approaches, points are appended to the interpolation set from the set of existing evaluations, with the model quality controlled via the norm of the (inverse) interpolation matrix as per the error bounds derived in this section. In Section 5, we will instead consider geometric approaches for incrementally improving an interpolation set, where we take an existing interpolation set and determine how to change a small number of points to improve its quality.
4.5 Composite Models
Another important construction is for models where the objective function has some known structure. For example, this may be where results from composing a known function with a function for which derivatives are unavailable, namely
| (4.105) |
where (for some ) is smooth but has only a zeroth order oracle, and the smooth function and its derivatives are completely known. Perhaps the most common example of (4.105) is nonlinear least-squares minimization, where .
In such settings, because we receive more information with every zeroth order oracle query (the full vector instead of the scalar ), it can often be sufficient to build linear interpolation models for , and extend these to a quadratic model for using (4.105). For example, in the nonlinear least-squares case, if we sample for and perform linear interpolation on each component of , we get a model
| (4.106) |
for some and . We then naturally get a quadratic model for via
| (4.107) |
In effect, by using the structure of the problem (4.105) we get an approximate quadratic model for the cost of a linear interpolation model. If each component of is a fully linear model for the corresponding component of , then it can be shown that the quadratic model (4.107) is a fully linear model for [37, Lemma 3.3]. Of course, if a quadratic model for is available, then a more complex quadratic (or potentially up to quartic) model for can be derived in a similar way [155, 149].
This composite approach, where we receive more oracle information from and directly exploit some known structure, can significantly improve the practical performance of MBDFO methods. These ideas also extend to the case where is nonsmooth (e.g. or ), although solving the resulting trust-region step calculation and convergence theory is more complex [97, 98].
Notes and References
The first fully linear/fully quadratic error bounds for linear and quadratic interpolation were given in [46], and for linear regression and underdetermined quadratic interpolation in [45], although minimum Frobenius norm interpolation had been proposed earlier in [119]. These results are collected in the reference book [51] and summarized in [132]. The proofs presented here generalize to constrained interpolation (Section 6.1), and are based on the theory for linear interpolation and minimum Frobenius norm interpolation given in [86, 130].
The interpolation constructions here can also be thought of as a way to approximate the gradient or Hessian of a function just with function evaluations. Such approximations, known as simplex gradients/Hessians, have been studied in [127, 85, 84]. Using perturbations of size to improve the dimension dependence of the worst-case complexity bound for linear models (as in Corollary 4.5) was originally proposed for quadratic regularization methods in [73], and was adapted to MBDFO in [40, 57]. The extension to fully quadratic models is based on the constructions in [32, 59]. In particular, [59] achieves a better bound of objective evaluations by only re-sampling the Hessian every iterations. Although [59] uses a cubic regularization method, it is likely a similar improvement would be possible for a trust-region method.
How to best construct models in the underdetermined quadratic case is still actively being studied. Early alternatives to (4.52) or (4.103) were proposed in [52, 158], but recent years have seen several alternatives proposed [152, 153, 150]. The method in [150], for example, attempts to optimize the parameters of a specific model construction problem over the course of the algorithm.
Composite models for nonlinear least-squares problems were first introduced in [155] with the resulting worst-case complexity studied in [37]. Similar ideas can also be used when is nonsmooth [72, 97, 98, 101], although the resulting trust-region subproblem becomes significantly more difficult to solve. However, MBDFO for general nonsmooth problems is a relatively under-studied topic (see e.g. [90, 9]), in contrast to the relatively established approaches for direct search DFO methods [5, 11].
We conclude by noting that other forms of interpolation model have been used in MBDFO. The most notable example of non-polynomial models is the use of radial basis function (RBF) models, which take the form
| (4.108) |
where is some predetermined scalar function, such as the Gaussian function for some [30]. Because of the inclusion of a linear term in the model, such models can also be made fully linear through proper selection of the interpolation points [146, 147]. RBF models are primarily designed to globally approximate a function (rather than locally as here), and so are more commonly used in global optimization algorithms [23, 77]. Such models have many similarities with statistically motivated global function approximations such as Gaussian Processes [125], which are used in Bayesian (global) optimization algorithms [133] but again have also been used in MBDFO [16].
5 Incremental Geometry Improvement
In Section 4, we described how to generate a set of interpolation points in order to be fully linear/fully quadratic in . There, we had full control over where to place the interpolation points (e.g. using coordinate perturbations around ) to ensure the model was sufficiently accurate. This approach is of theoretical interest, and does motivate procedures to find a collection of suitable interpolation points from a database (and determine any extra evaluations required to get a good model). However, in many practically successful MBDFO algorithms, an incremental approach is used, where interpolation points for the next iteration are taken to be the points from the current iteration , with minimal changes to the set, such as adding the new iterate . This ensures we only ask for a very small number of new (possibly expensive) objective evaluations at each iteration. Here, we describe how to assess the quality of an interpolation set using geometric conditions, which determine how to minimally alter a given interpolation set to ensure its quality.
The primary tool we will use to assess the quality of an interpolation set, and determine which points to move (and where to move them to), are Lagrange polynomials. For a set of points , the associated Lagrange polynomials are a collection of polynomials satisfying the conditions , where is the Kronecker delta (i.e. if and 0 otherwise). The degree of the Lagrange polynomials will match the degree of the interpolation model we are trying to assess.
Lagrange polynomials are an important concept from polynomial approximation theory. For example, suppose we have a continuous function which we will approximate by a degree- polynomial via interpolation to points . The associated Lagrange polynomials determine the Lebesgue constant for the interpolation set, . We have the following result, which says that the interpolant has error within of the best possible error.
Proposition 5.1 (Theorem 15.1, [144]).
The error between the true (continuous) function and degree- polynomial interpolant satisfies
| (5.1) |
where is the supremum norm and is an optimal degree- polynomial interpolant to .
That is, choosing interpolation sets which make the associated Lagrange polynomials small in magnitude are associated with better polynomial approximations. We will see that a similar property holds in the multi-dimensional case.
We begin by re-deriving the fully linear/fully quadratic constants from Section 4, replacing all matrix norms with quantities related to the magnitude of the relevant Lagrange polynomials. We will then use this to build a MBDFO algorithm with self-correcting interpolation sets.
5.1 Interpolation Error Bounds
To begin, consider the case of linear interpolation to build a model (4.16) by solving the system (4.17) or (4.18). In this case, given the same base point as the model (4.16), the Lagrange polynomials are the linear functions
| (5.2) |
where , defined by the interpolation conditions for . That is, from (4.17) or (4.18) we can construct by solving
| (5.3) |
provided that and are invertible. Comparing to (4.17) or (4.18), one important consequence of this is that the interpolation model can be written as a linear combination of Lagrange polynomials,
| (5.4) |
If is constant, for all , then since the linear model is unique, the model must perfectly match the original function . Applying (5.4), we conclude that Lagrange polynomials form a partition of unity,
| (5.5) |
Separately, from (5.2) and (5.3) we observe that
| (5.6) |
This means that we can evaluate all Lagrange polynomials at a single point by solving a single linear system, without calculating any or :
| (5.7) |
We have seen a similar relationship before, in (4.21) in the proof of Theorem 4.2. Specifically, the vector in (4.21) is equal to , and so (4.25) can be written as
| (5.8) |
for all .
The bound (5.8) tells us that, just as for higher-order polynomial interpolation in the scalar case (Proposition 5.1), the Lebesgue constant is a useful way to measure the quality of an interpolation model. However, we will not be able to use the Lebesgue constant to determine how to incrementally improve an interpolation set. Instead, we consider in the following definition, which will apply equally to quadratic interpolation as well as linear.
Definition 5.2.
Suppose we have an interpolation set such that its Lagrange polynomials exist. Given and , we say that is -poised in for some if
| (5.9) |
There are two noteworthy observations that can be made at this point:
-
•
We can construct the Lagrange polynomials (i.e. exists) if and only if the relevant interpolation linear system is invertible;
-
•
Since the Lagrange polynomials always form a partition of unity (5.5) we have , and so . Moreover, if any , which is usually the case, the condition implies .
For example, the set from Corollary 4.4 has Lagrange polynomials
| (5.10) |
The maximizers of over are for and for , yielding .
Theorem 5.3.
We get similar results in the case of (fully) quadratic interpolation (4.29). If we have interpolation points with , then the associated Lagrange polynomials are
| (5.12) |
which we can construct by solving the same systems as (4.33) and (4.34), namely
| (5.13) |
This again gives us the relationship (5.4) and similar reasoning also gives (5.5). Recalling the natural quadratic basis (4.30) and (4.31), we also have
| (5.14) |
where , and combining with (5.13) we get the analog of (5.7), namely
| (5.15) |
Comparing with the proof of Theorem 4.6, we see that in (4.37) is equal to , and so (4.41) can be replaced with
| (5.16) |
for all .
Hence we get the following version of Theorem 4.6.
Theorem 5.4.
Considering again the structured interpolation points (4.48) for fully quadratic interpolation, in Appendix B we show that, for this set, .
Lastly, consider the case of minimum Frobenius norm quadratic interpolation, for an interpolation set with . Here, the associated Lagrange polynomials are again quadratic, (5.12), but are constructed by solving (c.f. (4.52))
| (5.18) |
which gives us (c.f. (4.65) and (4.79))
| (5.29) |
with associated Hessians . Once again, we get (5.4) and (5.5).
We now define to be the polynomial vector
| (5.30) |
defined so that the -th row/column of is , i.e. for . We also define analogously, replacing with and with , so . We can then observe that
| (5.31) |
or , and, recalling that and are symmetric, once again we can evaluate all Lagrange polynomials at a single point without constructing any explicitly, via
| (5.32) |
where refers to the first entries of the vector.
We now get our new version of Lemma 4.13.
Lemma 5.5.
Proof.
Our new version of Theorem 4.14 is the following.
Theorem 5.6.
Proof.
By following the same argument as used to derive (4.87) in the proof of Theorem 4.14, we get
| (5.36) |
for any , where is any vector satisfying
| (5.37) |
We now show that satisfies (5.37). Since is a global minimizer of (4.52), if is linear then we have exact interpolation, for all . Applying this to for and writing the resulting models in the form (5.4), we see
| (5.38) |
which, together with (5.5), is equivalent to satisfying (5.37). Hence, (5.36) becomes
| (5.39) |
and so
| (5.40) | ||||
| (5.41) |
The result then follows from Lemma 4.1(a) and . ∎
Now consider the set , as in Corollary 4.15. The associated Lagrange polynomials are:
| (5.42a) | ||||
| (5.42b) | ||||
| (5.42c) | ||||
for . Each of these takes values in for all and so .
Remark 5.7.
For the example interpolation sets considered here, applying the value of in the fully linear/fully quadratic constants gives a worse bound (in terms of ) than Section 4. These can be improved by calculating the constants in terms of the Lebesgue constant, but this does not allow for the procedures in Section 5.2 to be used.
5.2 Model Improvement
As mentioned previously, the main reason for using Lagrange polynomials in our model error bounds is that it allows us both check if a model is fully linear, and to decide how to make small changes to the interpolation set to make it fully linear. Essentially, we wish to find an interpolation set for which is sufficiently small, below some given value.
Firstly, we can determine by maximizing each Lagrange polynomial and its negative within the trust-region. Since the Lagrange polynomials are themselves linear or quadratic, this reduces to solving trust-region subproblems (2.12). In theory, this requires finding the global trust-region solution for each, but in practice solving these trust-region subproblems approximately can give a sufficiently good estimate [118].
Now, suppose we have an interpolation set which is -poised, but we want to change some points to reduce its value of below some threshold (recalling that if any ). The procedure for changing the interpolation set to improve its poisedness constant is quite simple, and given in Algorithm 5.1. At each iteration of Algorithm 5.1 we find and such that and replace with . We do not specify a particular interpolation model type; Algorithm 5.1 applies equally to linear, quadratic or minimum Frobenius norm quadratic interpolation without modification.
To analyze Algorithm 5.1, we consider the impact on the determinant of the relevant interpolation matrix (i.e. (4.17), (4.33) or (4.65) depending on the type of interpolation model) from changing a single point in the interpolation set.
The case for linear or quadratic interpolation is quite straightforward.
Lemma 5.8.
Suppose the set is used for either linear or quadratic interpolation, with interpolation matrix (i.e. or ), where . If we replace in with another point , then the resulting interpolation matrix satisfies
| (5.43) |
where the Lagrange polynomial is given by the original interpolation set (before the replacement is made).
Proof.
If is singular, the result holds trivially, so assume that is invertible. Replacing with changes the th row of from to in , where in the linear interpolation case or the natural quadratic basis (4.30) for quadratic interpolation. Equivalently, is the same as , but with its th column changed from to . However, from either (5.7) or (5.15), we have that is the th entry of . By Cramer’s rule for linear systems, this means , giving the result. ∎
For minimum Frobenius norm updating, a similar result holds, but it is more complicated to prove because of the more complicated structure of the interpolation matrix (4.65). In particular, replacing a single interpolation point changes both a row and column of , so Cramer’s rule cannot be used as in the proof of Lemma 5.8.
Lemma 5.9.
Suppose the set is used for minimum Frobenius norm interpolation, with interpolation matrix , where . If we replace in with another point , then the resulting interpolation matrix satisfies
| (5.44) |
where the Lagrange polynomial is given by the original interpolation set (before the replacement is made).
Proof.
See [130, Theorem 5.2]. ∎
We are now ready to show that Algorithm 5.1 achieves the desired outcome.
Theorem 5.10.
Algorithm 5.1 terminates in finite time, and the resulting interpolation set is -poised in , and we have for all .
Proof.
The termination condition for Algorithm 5.1 implies that is -poised in if it terminates. All points in the resulting are either in (if they were in initially) or are in (if they come from an update in line 4 of Algorithm 5.1). Hence all points satisfy since we assume . It remains to show that Algorithm 5.1 terminates in finite time.
Since the initial set is -poised, the corresponding interpolation matrix, say (either , or depending on the type of interpolation) is invertible, so initially. Call this value . However, since all points from the initial and any new points generated are all in the compact set , there is a finite maximum value, say , of over all interpolation sets contained in , From either Lemma 5.8 or Lemma 5.9, whenever we make the update in line 4 of Algorithm 5.1, we increase by a factor of at least (since ). Hence Algorithm 5.1 must terminate after at most iterations. ∎
Three iterations of Algorithm 5.1 for linear interpolation in two dimensions are illustrated in Figure 5.1. Initially, two interpolation points and are very close, yielding a large value of .222222The value of is still not unreasonably large here, simply so the locations of and can be illustrated clearly. After two iterations the value of has been decreased by more than an order of magnitude, and is very close to 1. Note that the center of the region is no longer chosen as an interpolation point, so is possible for this final interpolation set. However, after one iteration of Algorithm 5.1 we already have a small value of while retaining , which may be desirable in practice.
5.3 Incremental Geometry Improvement Algorithm
We conclude this section by describing a variant of Algorithm 3.1 which avoids the requirement that every model is fully linear, and instead handles non-fully linear models by incrementally improving the geometry of the current interpolation set. This core principle—only worry about the geometry of the interpolation set if the algorithm is not progressing, and only make minimal changes to the set if so—is used in many state-of-the-art MBDFO codes, such as Powell’s methods (see Section 8).
At each iteration of this algorithm, we have an interpolation set with , which can be used for linear, quadratic or minimum Frobenius norm quadratic interpolation. The algorithm ensures that the interpolation linear system is invertible at every iteration. After the model is constructed and a tentative step determined, a slightly different the trust-region mechanism to Algorithm 3.1 is used: if a step is not accepted and the interpolation set is not guaranteed to be fully linear, we perform a single iteration of the geometry-improving Algorithm 5.1 (and do not reduce ) instead of having an unsuccessful step (where is reduced).
Formally, the ‘replace distant point’ and ‘replace bad point’ steps in Algorithm 5.2 have the minimal requirements that the new geometry-improving point must satisfy, in practice a good choice—motivated by Lemma 5.8 or 5.9—is to find this point by maximizing the (magnitude of the) relevant Lagrange polynomial inside the current trust region . More details are given in Remark 5.16 below.
The full algorithm is given in Algorithm 5.2. It is written in a generic way, without specifying a specific interpolation type, which can be linear, fully quadratic232323Note that a fully quadratic model with is fully linear with constants and , or use minimum Frobenius interpolation with ; see Remark 4.11. or minimum Frobenius quadratic.
We begin with some basic results.
Lemma 5.11.
Regarding Algorithm 5.2, we have:
-
(a)
The interpolation linear system is invertible at every iteration (and hence Lagrange polynomials exist at every iteration);
-
(b)
If and for all except if , then the model is fully linear in with constants possibly depending on , and ;
-
(c)
We can only have finitely many consecutive iterations of types ‘replace distant point’ or ‘replace bad point’ before a (very) successful or unsuccessful iteration must occur; and
Proof.
First, we show (a) by induction. The case holds by construction of , so suppose it is invertible for some iteration . If iteration is (very) successful, then the next linear system is invertible by definition of . If iteration replaces a distant or bad interpolation point, then for , so the system is invertible by either Lemma 5.8 or 5.9. Lastly, if iteration is unsuccessful, then the linear system is unchanged.
To show (b), denote to be the index such that . Since the Lagrange polynomials always form a partition of unity, we have, for any ,
| (5.45) |
Hence , and the model is fully linear in with suitable constants by Theorem 5.3 or 5.6 (which also covers fully quadratic interpolation, per Remark 4.11). Finally, (c) is an immediate consequence of Theorem 5.10. ∎
The properties in Lemma 5.11 are sufficient to prove global convergence for Algorithm 5.2 (i.e. ), but to get a worst-case complexity bound, we need to slightly strengthen Lemma 5.11(c) to a uniform bound.
Assumption 5.12.
The number of consecutive iterations of types ‘replace distant point’ or ‘replace bad point’ cannot exceed some , independent of the starting interpolation set.
It was recently shown that Assumption 5.12 holds with for linear and fully quadratic interpolation [40, Theorem 5.1]. No such bound is known for minimum Frobenius quadratic interpolation, but it is likely that one exists.
Lemma 5.13.
Lemma 5.14.
Proof.
Our final worst-case complexity result is the following.
Theorem 5.15.
Proof.
The bounds on the number of (very) successful (3.20) and unsuccessful (3.22) iterations from the proof of Theorem 3.7 hold here, where we only need to note that is unchanged for ‘replace distant point’ and ‘replace bad point’ iterations to recover (3.22). Lastly, the total number of ‘replace distant point’ and ‘replace bad point’ iterations is at most by Assumption 5.12, so the total number of iterations of all types is at most , i.e. a factor of worse than Theorem 3.7. The objective evaluation bound comes from noting that Algorithm 5.2 requires at most 2 evaluations per iteration (once the first interpolation set is chosen). ∎
If we could construct linear interpolation models to give independent of as in Corollary 4.5, Theorem 5.15 with gives an evaluation complexity bound that is slightly worse than Corollary 4.5 by a factor of . We can both make independent of and remove the factor from the complexity bound by excluding the Lagrange polynomial associated with the current iterate in the -poisedness requirement (i.e. by modifying Definition 5.2) [40]. In general, if the interpolation set is updated well on (very) successful iterations—see Remark 5.16—then and are usually small and the approach of Algorithm 5.2 is efficient in practice.
Remark 5.16.
On (very) successful iterations, we have a flexible choice in how to update the interpolation set (line 9 of Algorithm 5.2). A simple way to do this would be , where is any index such that , which must exist by (5.5). Usually, all Lagrange polynomials will be nonzero at , and so, motivated by the geometry improving procedure, a good choice might be to replace a point that is far from and/or improves the poisedness of the interpolation set (i.e. is large, as in Algorithm 5.1). For example, in [118], is selected as
| (5.47) |
For ‘replace distant point’ and ‘replace bad point’ iterations, typically the point removed from the interpolation set is the maximizer of the relevant quantity (distance from or poisedness), and is replaced with a point that maximizes the magnitude of the Lagrange polynomial of the point being removed, as in Algorithm 5.1.
Powell’s Methods
The framework of Algorithm 5.2 broadly aligns with the approach in Powell’s highly regarded MBDFO software (see Section 8). The most notable distinction is that two trust-region radii are used in a complex way that partially decouples the step size constraint from the radius used to assess -poisedness. It also uses a simplified interpolation set management procedure:
-
•
On (very) successful steps, update the interpolation set as per Remark 5.16;
-
•
Otherwise, check that all interpolation points are sufficiently close to . If yes, decrease both trust-region radii. If no, replace the furthest interpolation point with the approximate maximizer of (the magnitude of) the associated Lagrange polynomial, and possibly decrease one trust-region radius.
Additionally, the implementation pays careful attention to efficient linear algebra and subproblem solvers. This approach has proven extremely successful in practice, despite the simplified interpolation set management having limited convergence guarantees (e.g. [81]).242424The use of two trust-region radii on its own still allows for convergence theory, e.g. [155]. An accessible overview of these algorithms can be found in [123].
Notes and References
More information on Lebesgue constants in approximation theory may be found in resources such as [144]. The theory of Lagrange polynomials to construct fully linear/fully quadratic interpolation models was originally developed in [46] for linear/quadratic interpolation and [45] for minimum Frobenius norm interpolation, with these results collected in [51]. We again note that minimum Frobenius norm quadratic models were originally studied in [119].
A more concise alternative to (5.8) and (5.16) are the bounds [117, 33]
| (5.48) |
for all , for linear and (fully) quadratic interpolation under Assumptions 2.4(a) and 2.7(a) respectively.
The incremental geometry improving algorithm (Algorithm 5.2) is based on the approach from [131], which also includes an example where simply updating by removing the furthest point from and replacing it with does not converge, demonstrating that consideration of interpolation set geometry is necessary for a convergent MBDFO method. The presentation in Section 5.3 draws on the complexity bounds from the more recent analysis [40], which also proves that Assumption 5.12 can be satisfied for polynomial interpolation over an arbitrary basis.
6 Constrained Optimization
In this section we consider how the above algorithmic ideas and approximation theory can be adapted to constrained optimization problems. We first consider simple convex constraints (not involving any derivative-free functions), such as bounds, and then consider the case of general, potentially derivative-free, constraints.
6.1 Simple Convex Constraints
We first consider the case of MBDFO in the presence of simple constraints on the decision variables. We will focus on the case where the feasible set is an easy-to-describe, convex set. Specifically, we will extend the above ideas to solve problems of the form
| (6.1) |
where is, as usual, smooth and nonconvex but with only a zeroth order oracle available, and the feasible set satisfies the following.
Assumption 6.1.
The feasible set in (6.1) is closed, convex and has nonempty interior.
This covers some of the most important constraint types, such as lower/upper bounds and linear inequality constraints. To allow us to apply our method to many choices of feasible sets, the only information about available to our algorithm will be its Euclidean projection operator,
| (6.2) |
which returns the closest point in to . This function is well-defined (i.e. there is a unique minimizer) whenever satisfies Assumption 6.1 [18, Theorem 6.25]. Since if and only if , the projection operator also gives us a test for feasibility. There are many simple sets for which is easy to compute (see [18, Table 6.1] for others), for example:
-
•
For bound constraints (including unbounded constraints, and/or ), we have for ; and
-
•
If for , then .
If we have many sets, each of which has their own projection operator, the projection onto the intersection of these sets can be computed using Dykstra’s algorithm [28].
A good criticality measure for (6.1), to assess how close we are to a solution, is the following252525Other criticality measures can also be used, with another common choice being .:
| (6.3a) | ||||
| s.t. | (6.3b) | |||
| (6.3c) | ||||
In the unconstrained case, , this reduces to , as we might hope. The suitability of as a criticality measure comes from the following.
Proposition 6.2 (Theorem 12.1.6, [47]).
If is twice continuously differentiable, then is continuous in , for all , and if and only if is a first-order critical point of (6.1).262626A point is first-order critical for (6.1) if , where is the normal cone for at . This is a first-order necessary condition for a constrained local minimizer [111, Theorem 12.8].
Our goal in this section is to develop a strictly feasible trust-region method for solving (6.1); that is, where we can only evaluate at feasible points. This may be necessary, for example, if we have a term in with bounds .
If at iteration we have a quadratic model (3.1) for the objective, we naturally get an estimate of the criticality measure (6.3c), namely
| (6.4a) | ||||
| s.t. | (6.4b) | |||
| (6.4c) | ||||
Again, if then . Assuming strict feasibility, our trust-region subproblem now looks like: set to be an approximate minimizer of
| (6.5) |
The presence of the feasibility condition means that the theory and algorithms in Section 2.2 for solving the trust-region subproblem no longer apply. In this setting, the analog of the Cauchy point is to consider the projected gradient path, . A suitable linesearch applied to can yield [47, Theorem 12.2.2] a step satisfying the following assumption, the constrained version of Assumption 2.9. Importantly, instead of the sufficient decrease condition depending on , it now depends on .
Assumption 6.3.
Assumption 6.3 is almost identical to Assumption 2.9, with the unconstrained criticality measure replaced with . Hence, as we shall see, this new assumption will not have a significant impact on the convergence of our algorithm.
The biggest change comes from the model accuracy requirements. Our fully linear definition (Definition 3.1) is based on comparing the model with the objective at all points inside the trust region. However, our strict feasibility requirement means that our interpolation model can only be constructed using points in . This limits our ability to have accurate models outside , which is required for Definition 3.1 to be satisfied.
Example 6.4.
Suppose we have problem dimension and our feasible region is defined by the simple bound constraints for some small . Now suppose we have the base point and we wish to perform linear interpolation to build a fully linear model in (i.e. ). Following the approach in Section 4.1, a natural choice of feasible interpolation points is . The resulting interpolation model can be shown to be fully linear with , and so our fully linear constants can be made arbitrarily large just by changing the feasible region. So, if we continued to use Definition 3.1 as our measure of model accuracy, the worst-case complexity of our algorithm could grow like for accuracy level .
However this definition is stronger than we need. Requiring strictly feasible interpolation points only limits our ability to approximate outside the feasible region, but that is exactly the region which is not relevant to our optimization. So, we now introduce a generalized definition of fully linear interpolation models, capturing the notion that we only care about model accuracy within the feasible region.
Definition 6.5.
Suppose we have and . A local model approximating is -feasible fully linear in if there exist constants , independent of , and , such that
| (6.7a) | ||||
| (6.7b) | ||||
Again, we will sometimes use for notational convenience.
In particular, we note the distinction in (6.7a) but in (6.7b) (as in (6.4c)). Also, compared to Definition 3.1, the gradient accuracy condition (6.7b) only considers the accuracy of , not at any other points in the trust region.272727Indeed, our convergence analysis in Section 3 only requires (3.5b) to hold at .
We are now ready to state our main MBDFO algorithm for solving (6.1), given in Algorithm 6.1. It is essentially the same as Algorithm 3.1, but replacing with , and using our new sufficient decrease and fully linear conditions. Specifically, our new model assumptions are given below.
Assumption 6.6.
We begin by stating the obvious result that all iterates of Algorithm 6.1 are feasible.
Proof.
This holds by induction on because by definition, and , with whenever from Assumption 6.3. ∎
The -feasible fully linear requirement (6.7b) is required principally to ensure the following, which essentially replaces (3.5b) in measuring the error in the criticality measure.
Lemma 6.8 (Lemma 2.4, [86]).
Now, the worst-case complexity of Algorithm 6.1 can be proven using essentially identical arguments to the complexity of Algorithm 3.1, replacing and with and respectively.
Theorem 6.9.
Proof.
See [86, Theorem 3.14] for details. ∎
We now have a worst-case complexity bound for Algorithm 6.1 which matches the unconstrained version, Corollary 3.8 for Algorithm 3.1. What remains is to specify how to construct -feasible fully linear models using only feasible points.
6.1.1 Constructing Feasible Interpolation Models
We wish to construct interpolation models that are -feasible fully linear (Definition 6.5) using only feasible interpolation points. To do this, we will use the notions of Lebesgue measure and -poisedness (Definition 5.2), since these more naturally extend to the constrained case than the matrix norms in Section 4.
Since we care about the accuracy of our model only at feasible points, the natural generalization of Definition 5.2 to the constrained case is to measure the size of Lagrange polynomials only within the feasible region.
Definition 6.10.
Suppose we an interpolation set such that its Lagrange polynomials exist. Given and , we say that is -poised in for some if
| (6.8) |
Aside from only considering , the main difference in Definition 6.10 compared to Definition 5.2 is that we also only consider rather than . This is required to handle the differing constraints and in (6.7).
Linear Interpolation Models
First, suppose we construct linear interpolation models using the interpolation set , for by solving (4.17) or (4.18), as usual. We first note that we have an alternative version of (5.8), namely
| (6.9) |
for all , where the proof is identical to the unconstrained case, just replacing with . If is -poised, then we have . By treating the cases and separately, we get a constrained version of Theorem 5.3.
Theorem 6.11.
Suppose satisfies Assumption 2.4(a) and we construct a linear model (4.16) for by solving (4.18), where we assume is invertible. If , with and for some and all (where is the number of interpolation points), and the interpolation set is -poised in , then the model is -feasible fully linear in with constants
| (6.10) |
Proof.
See [86, Theorem 4.4]. ∎
The fully linear constants in Theorem 6.11 are of size , which matches the values in the unconstrained case from Theorem 5.3.
Example 6.12 (Example 6.4 revisited).
Consider again the case with , with linear interpolation in using . By considering the enlarged trust region instead of , we see that the interpolation set is -poised in with and hence and are independent of from Theorem 6.11, a significant improvement over by applying the unconstrained approach from Definition 3.1 and Theorem 4.2.
Minimum Frobenius Norm Models
A similar approach works if we wish to use minimum Frobenius norm quadratic interpolation (4.52), with the same definition of Lebesgue measure and -poisedness Definition 6.10.
To begin, we state a bound on the size of the model Hessian, similar to (5.33) in the unconstrained case. Unfortunately we cannot bound entirely; we instead bound certain Rayleigh quotient-type expressions.
Lemma 6.13.
Proof.
The constrained version of Theorem 5.6 is the following.
Theorem 6.15.
Model Improvement
As in Section 5.2, we can use our new notion of -poisedness (Definition 6.10) to build algorithms to improve the geometry of an interpolation set. Just like the unconstrained case, this works equally well in the linear and minimum Frobenius norm quadratic interpolation cases. Indeed, our definitions ensure that Algorithm 5.1 extends trivially to the constrained case, as shown in Algorithm 6.2. The only changes are that the initial interpolation set must all be feasible, and our new point must also be feasible.
Theorem 6.16.
Algorithm 6.2 terminates in finite time, and the resulting interpolation set is -poised in , and we have and for all .
Proof.
However, the assumption in Algorithm 6.2 that the initial interpolation set is -poised in for some (with only feasible points) is not so easy to ensure. In the unconstrained case, we saw example sets in Section 4 that ensure the relevant interpolation linear system is invertible, but these do not necessarily work here as we cannot guarantee they are feasible. Instead, we can use Algorithm 6.3 to construct a strictly feasible set with invertible interpolation linear system.
Theorem 6.17.
Algorithm 6.3 terminates in finite time, and the resulting interpolation set is contained in and has an invertible interpolation linear system.
6.2 General Constraints
Since MBDFO differs most from derivative-based trust-region methods in its model construction, our focus in this section has been on building accurate models using strictly feasible points, and hence we have restricted ourselves to problems with simple constraints (6.1), rather than problems with general nonlinear constraints (for which we also may only have zeroth order oracles). To conclude this section, we briefly outline one approach for solving general constrained problems of the form (2.6). This form is often augmented with explicit bound constraints, although we do not do this here for simplicity. We assume that the objective and all constraint functions for only have zeroth order oracles, and so their derivatives must be approximated.
The algorithm we outline will be a sequential quadratic programming (SQP) MBDFO method, based on the COBYQA (Constrained Optimization BY Quadratic Approximation) algorithm available in recent versions of Python’s widely used SciPy optimization library. At each iteration of an SQP method, we construct a local quadratic model for the objective ,
| (6.17) |
and local linear models for all constraints ,
| (6.18) |
The main difference in these approximations is that we want to choose the model Hessian in to achieve (recalling the Lagrangian (2.7)) rather than , and so we actually need to approximate Hessians for each and have suitable Lagrange multiplier estimates. With these approximations, the trust-region subproblem becomes
| (6.19a) | ||||
| s.t. | (6.19b) | |||
| (6.19c) | ||||
| (6.19d) | ||||
In the unconstrained case, we decided if a step was good by measuring the decrease in the objective (2.11). Here, we measure the quality of a step by using a merit function, which combines the objective value and size of any constraint violations into a single scalar. For example, the merit function used in COBYQA is defined as
| (6.20) |
where the penalty parameter controls the relative weight applied to reductions in compared to improvements in feasibility.282828An alternative but common choice of merit function uses the norm of the constraint violation. Note that if is feasible. Given our models (6.17) and (6.18), we can derive our approximate merit function
| (6.21) |
The choice of accepting/rejecting a step and updating the trust-region radius is similar to the unconstrained case.
A prototypical algorithm is given in Algorithm 6.4. We note that increase the merit penalty parameter as ensures we encourage to gradually become feasible. The condition , where is a Lagrange multiplier estimate is to ensure that the merit function is exact; that is, the true solution is also a minimizer of for all sufficiently large. In the case of the merit function, ‘sufficiently large’ is [124, Theorem 4.2.1]. Motivated by (2.8), the Lagrange multiplier estimates are computed by solving the (bound-constrained) linear least-squares problem
| (6.22a) | ||||
| s.t. | (6.22b) | |||
| (6.22c) | ||||
| (6.23) |
Aside from the management of an interpolation set, which can use the techniques described in previous sections (but the interpolation problem is solved for and all constraints ), the main practical consideration is the calculation of the step (6.19). The largest difficulty is that the linearized constraints may be infeasible. A common solution to this issue is to decompose the computed step into a normal and tangential component, . The normal step aims to reduce constraint violation and the tangential step aims to decrease the objective without worsening the constraint violation. There are several ways to do this, but COBYQA uses the following approach. First, calculate the normal step by (approximately) solving
| (6.24a) | ||||
| s.t. | (6.24b) | |||
for some scalar (e.g. ), and then calculate the tangential step by approximately solving
| (6.25a) | ||||
| s.t. | (6.25b) | |||
| (6.25c) | ||||
| (6.25d) | ||||
Both of these subproblems (6.24) and (6.25) can be reformulated to be trust-region-like subproblems with linear constraints which can be solved using algorithms similar to the conjugate gradient-based method developed in [122]; see [124, Chapter 6] for details.
We also note that the requirement that is chosen to ensure can be satisfied relatively easily, under mild assumptions on the step components and .
Lemma 6.18.
Proof.
First, if , then we have , and the result follows by noting that is linear and increasing in . Instead, if then so and , so for all . ∎
Algorithm 6.4 does not have any convergence guarantees as written, and was developed purely as a practical method. Other MBDFO SQP methods such as [145, 82] also do not have convergence guarantees. However, stochastic SQP methods (extending the ideas from Section 7.2 to the constrained setting) with convergence guarantees [66, 65, 64] can be used in the deterministic setting, and there does not appear to be any fundamental issue preventing the convergence theory for derivative-based SQP trust-region methods [47, Chapter 15] being adapted to the MBDFO setting.
6.2.1 Evaluation Failures and Hidden Constraints
In some situations, we may also have to concern ourselves with objective evaluations failing (i.e. our procedure for evaluating fails or crashes unexpectedly). For example, the objective evaluation for a helicopter rotor blade design problem in [25] failed to compute over 60% of the time. More generally, large-scale computations (such as many objective evaluations in DFO contexts) can be exposed to hardware failure; for one high-performance computing system, hardware failures occurred on average once every 7.5 days [137].
This can be thought of as an example of hidden constraints [42, 100], where our problem takes the form
| (6.26) |
for some feasible set which is unknown to the algorithm (and where represents ‘evaluating succeeds’). This can be rigorously handled with unconstrained algorithms if we let and define whenever . Here, we must prove convergence to Clarke stationary points, although to the author’s knowledge this has not been applied to MBDFO, only direct search [42]. In practice, this can instead be handled by setting to some large (finite) value if [123].
Notes and References
Section 6.1 is based primarily on [86, 130]. These works draw on the derivative-based trust-region theory outlined in [47, Chapter 12], the complexity analysis for convex-constrained cubic regularization [34] and convergence theory for convex-constrained MBDFO using the original notion (Definition 3.1) of fully linear models [44]. A good resource for the theory of convex sets and convex optimization is [18].
The description of COBYQA (Algorithm 6.4) for general constraints is taken from [124]. More detail about derivative-based SQP methods, including convergence theory, alternative approaches for calculating and alternative globalization mechanisms such as filters (instead of a merit function), see [47, Chapter 15]. The specific step calculation given by (6.24) and (6.25) is a variant of the Byrd–Omojokun method [47, Chapters 15.4.2 & 15.4.4], originally from [113] (a thesis by Omojokun, supervised by Byrd). There are many other works which describe MBDFO methods for handling general constraints (2.6) described in the survey [96, Section 7]. However, we note in particular [145, 82], which propose SQP methods very similar to COBYQA. See [111] for an overview of theory and algorithms for general constrained optimization, including quadratic programming.
MBDFO for nonlinearly constrained problems is not as well-developed as for unconstrained problems. There are few implementations outside of COBYQA, and to the author’s knowledge none with both good practical performance and theoretical guarantees. By contrast, constrained optimization within the Mesh Adaptive Direct Search framework is well-studied in the sense of proving convergence to Clarke stationary points (e.g. [4, 5, 6, 3, 12]).
7 Optimization for Noisy Problems
A key use case of MBDFO methods is optimization for problems with some degree of noise present. In general, this refers to situations where an exact zeroth order oracle is not available (i.e. we can only evaluate , but not exactly). This comprises situations, for example, where the calculation of the objective requires performing a Monte Carlo simulation, or involves a physical/real-world experiment subject to some inherent noise or uncertainty.
In this section we discuss the cases of both deterministic and stochastic noise. By ‘deterministic noise’, we mean the following: if we evaluate the noisy function repeatedly at the same point, we get the same answer (e.g. roundoff errors, finite termination of an iterative process). Stochastic noise refers to the case where repeated evaluations at the same point can produce different value, and so the noisy value of can be treated as a random variable (e.g. Monte Carlo simulation, experimental errors).
In both cases, we outline realistic model accuracy assumptions (in place of ‘fully linear at every iteration’), and analyze the resulting variants of Algorithm 3.1. In practice, a simple heuristic (with theoretical justification for deterministic noise, see Remark C.3) for handling noisy problems is to use a generic algorithm such as Algorithm 3.1, but set very close to 1, for example , rather than more typical values such as [35].
7.1 Deterministic Noise
In the case of deterministic noise, we essentially have a zeroth order oracle that reliably returns an incorrect value. We will assume that the noise is uniformly bounded.
Assumption 7.1.
Given objective (2.1), we only have access to the noisy zeroth order oracle , satisfying for all , for some .
We illustrate the difficulty that the presence of noise introduces to optimization algorithms by first considering the case of finite differencing. The below result says that, without due care, finite differencing a noisy zeroth order oracle can give very inaccurate derivative estimates.
Lemma 7.2.
Proof.
This demonstrates that, in the noisy case, the error in finite differencing can increase significantly as . In fact, to minimize the error bound in Lemma 7.2 we should take to get gradient error .292929For this reason, when estimating the gradient of an objective using (7.1), both Python’s scipy.optimize.minimize and MATLAB’s fminunc use the default value , where is the machine epsilon. The advantage of MBDFO is that interpolation models are constructed from well-spaced points, typically of distance from each other, and so we avoid the case until is very small.
For example, suppose we construct our model (3.1) using linear interpolation to points for (4.17). With a noisy zeroth order oracle, we now solve
| (7.3) |
This gives us the following version of Theorem 4.2.
Theorem 7.3.
Suppose satisfies Assumption 2.4(a) and satisfies Assumption 7.1 and we construct a linear model (4.16) for by solving (7.3), where we assume is invertible. If for some and all (where is the number of interpolation points), then the model satisfies
| (7.4a) | ||||
| (7.4b) | ||||
for all , where and are the same as in Theorem 4.2 (i.e. (4.19)), and
| (7.5) |
Proof.
A similar result holds for minimum Frobenius norm quadratic models.
Theorem 7.4.
Proof.
Theorems 7.3 and 7.4 motivate the following assumption on our interpolation model, replacing Assumption 3.3.
Assumption 7.5.
We now consider the adaptation of Algorithm 3.1 to the deterministic noise setting. The only other change to the stated algorithm is that the numerator in the ratio test (7.12) is now the estimated predicted decrease, based on querying (satisfying Assumption 7.1) instead of . Our new algorithm is given in Algorithm 7.1.
| (7.12) |
The worst-case complexity analysis of Algorithm 7.1 is broadly similar to the noise-free case, but more complicated. The iteration complexity bound matches the noise-free case (Corollary 3.8) but only when the desired first-order accuracy is sufficiently large, .
Theorem 7.6.
Proof.
See Appendix C. ∎
7.2 Stochastic Noise
If the noise in our objective is stochastic, then, unlike the deterministic noise case, we can decrease the noise level by averaging many noisy values of to achieve, in principle, arbitrarily good estimates of the true objective. This means that, with care, we can achieve any desired level of optimality, , rather than being limited to optimality levels as in the deterministic case.
Formally, we will assume that our stochastic noise is unbiased, and so optimizing a function which has stochastic noise in its evaluations can be viewed as the stochastic optimization problem
| (7.14) |
where is a source of random noise. Here, we assume that we only have zeroth order stochastic oracles , for realizations of the noise . To make the problem tractable, we will assume that the randomness has bounded variance.
Assumption 7.7.
There exists such that for all .
The most important consequence of Assumption 7.7 is that we can easily construct estimates of in (7.14) which are arbitrarily good, with high probability.
Proposition 7.8 (Chebyshev’s inequality, e.g. Chapter 2.1 of [27]).
Suppose Assumption 7.7 holds. For any , suppose we generate i.i.d. realizations from and calculate the sample average, . Then for any .
Roughly speaking, the Central Limit Theorem says that the typical error between the sample mean from samples and the true mean is of size as . Chebyshev’s inequality gives a ‘finite ’ version of this idea. Specifically, motivated by the fully linear assumption, the below result confirms that to get a sample error of the desirable size requires averaging samples.
Corollary 7.9.
Suppose Assumption 7.7 holds. For any , and and , if then .
Another consequence of Chebyshev’s inequality is that, again by choosing the number of samples sufficiently large, we can guarantee that an interpolation model is fully linear with high probability.
Theorem 7.10.
Suppose satisfies Assumption 2.4(a), Assumption 7.7 holds, and we construct a linear model (4.16) for by solving (7.3) with noisy estimates (i.i.d. for each ), where we assume is invertible. Assume also that for some and all (where is the number of interpolation points). If for some and , then with probability at least the model is fully linear with constants
| (7.15) |
Proof.
An analogous result can be derived for minimum Frobenius norm models, by the same probabilistic argument as above and the use of Theorem 7.4, where we again need samples for each objective estimate.
The above results motivate Algorithm 7.2, a stochastic version of Algorithm 3.1. Algorithm 7.2 has very similar algorithmic structure to Algorithm 3.1, but with some simplifications to aid the analysis. Specifically,
-
•
We always increase on successful iterations, i.e. setting in Algorithm 3.1;
-
•
We maintain the same ratio for increasing and decreasing , i.e. setting in Algorithm 3.1;
-
•
We impose a maximum trust-region radius, for all , similar to Algorithm 3.2. We assume for some again for simplicity.
More importantly, in Algorithm 7.2 we now assume that the model is fully linear with high probability, and the approximate evaluations of and in the calculation of are accurate with high probability. In particular, we care about the likelihood of the events
| (7.18a) | ||||
| (7.18b) | ||||
where is the indicator function of an event (i.e. taking values 1 if the event occurs and 0 otherwise). Formally, all the randomness in Algorithm 7.2 comes from the models and the estimates . So, we define the filtration to be the -algebra generated by , representing all randomness up to the start of iteration . Similarly, we define to be generated by , representing all randomness up to the calculation of (i.e. from and ).
Given this filtration structure, we make the following assumptions about the model.
Assumption 7.11.
We also require the following assumption about the function value estimates used to calculate .
Assumption 7.12.
At each iteration of Algorithm 7.2, the estimates and satisfy for some and some .
Under Assumption 7.7, Corollary 7.9 and Theorem 7.10 show that Assumptions 7.11 and 7.12 can be satisfied by using sample averages for samples.303030We need rather than as in Corollary 7.9 because the condition requires we have accurate estimates for both and . However, Assumptions 7.11 and 7.12 are more general and may be satisfied under other assumptions on the randomness in the objective.
The worst-case complexity of Algorithm 7.2 is derived from the following general probabilistic result. This considers a non-negative random process which (before stopping) typically decreases in proportion to another non-negative random process . It says that, provided is unlikely to get small before stopping (specifically, increases in occur when a biased event occurs), we can bound how long it takes for stopping to occur.
Proposition 7.13 (Theorem 2, [24]).
Suppose we have random processes with , and
| (7.20) |
for some , where is the -algebra generated by and . For any , let be a stopping time with respect to the filtrations . If, for all ,
-
(a)
There exists and such that ;
-
(b)
There exists with such that implies , where and is from (a); and
-
(c)
There exists and non-increasing function such that implies ;
then
| (7.21) |
Proposition 7.13 can be thought of as a stochastic version of Theorem 3.7, where condition (b) is a requirement similar to the conclusion of Lemma 3.6.313131We will use here, but can be used to get second-order complexity results.
To apply Proposition 7.13 to analyze Algorithm 7.2, we will take to be a measure of algorithm progress based on (see Lemma 7.15 below) and the stopping time to be
| (7.22) |
and define the event to be when all probabilistic estimates are accurate
| (7.23) |
and so from Assumptions 7.11(a) and 7.12 we get (7.20) with (hence we will need to assume that ). Condition (a) in Proposition 7.13 is also automatically satisfied for Algorithm 7.2 by setting . We now need to establish that conditions (b) and (c) hold, for suitable choices of , and .
Lemma 7.14.
Proof.
Fix such that , and so . If (i.e. or ), then and the result holds. So now suppose that (i.e. ).
Since for some with (by the trust-region updating mechanism), if then and so , which means and the result holds.
The only remaining case to consider is where and . Since and , we have , and so since is fully linear (),
| (7.25) |
and so and
| (7.26) |
From Assumption 2.9 and we have . Separately, since the model is fully linear, and since we have . Thus
| (7.27) | ||||
| (7.28) | ||||
| (7.29) |
and so , or , and iteration is successful. Hence from , using since . ∎
Lemma 7.15.
Proof.
We now get our main complexity bound as an immediate consequence of Proposition 7.13.
Theorem 7.16.
Proof.
In general, we know that at each iteration we need to sample the stochastic objective times to satisfy our probabilistic Assumptions 7.11 and 7.12. From Lemma 7.14 we know that can typically get as small as . So, broadly speaking, we may require up to stochastic objective evaluations per iteration, or evaluations in total. A rigorous evaluation complexity bound for a variant of Algorithm 7.2 is proven in [88, Theorem 5].
Remark 7.17.
Using Corollary 7.9 and Theorem 7.10 to satisfy Assumptions 7.11 and 7.12 relies on repeatedly sampling the objective at every iterate and interpolation point, potentially a large number of times. As such, Algorithm 7.2 is the only method in this work that is not well-suited to the regime where objective evaluations are expensive. However, this is the price of seeking convergence to a stationary point (which is not generically possible in the case of deterministic noise, as the true objective value can never be known for any point). Convergence to a neighborhood, as in the case of deterministic noise, can be established for Algorithm 7.1 [32], which is more realistic for the expensive evaluation regime.
Notes and References
Lemma 7.2 is based on [111, Lemma 9.1]. Algorithm 7.1 and the associated worst-case complexity analysis (Theorem 7.6) is new, but broadly follows the approach from [32, 40]. In particular, [32] considers very flexible model and evaluation accuracy assumptions, including allowing for probabilistically accurate models and sufficient decrease estimates, similar to Section 7.2. It also includes second-order complexity theory. This requires modifying the ratio test (7.12) based on knowledge of . If the noise were stochastic, we could estimate based on the standard deviation of for different values of . For deterministic noise, must be estimated in a different way (e.g. [106]).
A procedure to estimate the size of (especially deterministic) noise in a function was developed in [106]. Accurate finite differencing schemes (i.e. picking the perturbation size based on the estimated noise level) are developed in [107, 136]. Several recent works have shown how to adapt existing (derivative-based) algorithms to appropriately handle deterministic noise in objective and gradient evaluations, such as [135, 140, 114]. They typically also rely on a modified ratio test, similar to the use of (7.12) in Algorithm 7.1. A modified ratio test is also used in the (derivative-based) trust-region solver TRU in the GALAHAD package [70] to handle roundoff errors. If the level of deterministic noise can be controlled (i.e. can be computed for any , but not ), then Algorithm 3.1 with minor adjustments can converge to any accuracy level [63].
A more widely studied setting for stochastic optimization is where stochastic gradient estimates, are available, which is the core of many algorithms for training machine learning models [26]. We also note that stricter assumptions about the distribution of , e.g. bounds on higher-order moments, can yield tighter bounds on than Chebyshev’s inequality (e.g. Hoeffding’s or Bernstein’s inequality), see [27].
Algorithm 7.2 was introduced in [41], where almost-sure convergence of to zero was shown, and the worst-case complexity analysis shown here was originally given in [24]. The work [24] also includes an extension of Algorithm 7.2 that converges to second-order critical points, with suitable worst-case complexity bounds. This approach has been extended to constrained problems via stochastic SQP methods (primarily aimed at the stochastic gradient setting but also applicable to MBDFO) in [66, 65, 64]. Adaptive sample averaging can improve performance, where the number of samples is chosen dynamically at every evaluation point, based on the sample standard deviation and , to ensure that the current estimate is sufficiently accurate [134, 80]. The sample complexity (i.e. total number of stochastic oracle calls) of stochastic MBDFO algorithms can be reduced under stronger assumptions on the stochastic oracle, such as bounds on tail probabilities [128] or the availability of common random numbers [78].
Instead of constructing interpolation models using averaged estimates for large , an alternative is to use regression models (i.e. but with very large ), based on the interpolation theory from [45, 51, 22] and discussed briefly in Section 4.3. In this case, the almost-sure convergence of to zero is shown in [95]. Regression models are useful for both deterministic and stochastic noise (unlike sample averaging), but sample averaging allows for more accurate estimates for deciding whether to accept a step or not. Radial basis function models can also cope with large numbers of interpolation points , and hence construct regression-type models suitable for noisy problems [16]. Error bounds for linear regression models as the number of points , relevant for problems with stochastic noise, are derived in [83].
A growing body of recent work extends the results from [41, 24] to more general (probabilistic) conditions on function value and model accuracy [32, 88, 66, 65], and to direct search methods [7, 62]. This is improving our understanding of how much progress can be made in MBDFO and similar methods for stochastic problems with given levels of accuracy. A greater understanding is useful here, since MBDFO algorithms for deterministic problems with minimal/no sample averaging can already perform quite well in some cases [35, Section 5].
8 Summary and Software
The creation and analysis of MBDFO algorithms requires an interesting combination of optimization and approximation theory, which has been extensively developed over the last 30 years in particular. We have introduced the most important tools used in these algorithms, in particular trust-region methods and the construction of fully linear/fully quadratic interpolation models, and have demonstrated how to extend these ideas to important settings such as constrained problems and noisy objective evaluations. As mentioned in the introduction, the intention was never to provide a comprehensive overview of developments in the field; for this, we direct the reader to [96], and also note the more detailed discussions of algorithm and interpolation theory in [51].
8.1 Software
We finish by providing a list of high-quality open-source software implementations of MBDFO methods, suitable for use in practical applications as well as a starting point for further research.323232Disclosure: The author is the primary developer of two packages listed here, namely Py-BOBYQA and DFO-LS.
- Software of M. J. D. Powell
-
A collection of several Fortran packages for MBDFO. This includes COBYLA [115] (general constrained problems, linear interpolation), UOBYQA [118] & NEWUOA [120] (unconstrained problems, fully and minimum Frobenius norm quadratic models respectively), BOBYQA [121] (bound-constrained problems, minimum Frobenius norm quadratic models) and LINCOA [122] (linearly constrained problems, minimum Frobenius norm quadratic models). They are most easily accessed through the PRIMA package [157], which also includes C, MATLAB, Python and Julia interfaces. Py-BOBYQA [35] is a separate, pure Python re-implementation of BOBYQA with additional heuristics to improve performance for noisy problems. An accessible overview of these algorithms can be found in [123].333333Mike Powell (1936–2015) was a pioneer of both (derivative-based) trust-region and MBDFO methods, among many other significant contributions to optimization and approximation theory [29].
- COBYQA [124]
-
A general-purpose Python package which can handle bound and nonlinear constraints. Nonlinear constraints are also assumed to be derivative-free. It is based on minimum Frobenius norm quadratic models. It is most readily available through SciPy’s optimization module.343434See https://scipy.org/.
- DFO-LS [35]
-
A Python package for nonlinear least-squares problems. It can handle simple bound and convex constraints, and nonsmooth regularizers, and it includes heuristics to improve performance for noisy problems. It is based on linear interpolation models for each term in the least-squares sum (see Section 4.5).
- IBCDFO
-
A collection of MATLAB and Python packages for MBDFO with composite models (see Section 4.5). Includes POUNDerS [149] (nonlinear least-squares with bound constraints using minimum Frobenius norm quadratic models), manifold sampling [97] & GOOMBAH [98] (both for unconstrained problems with general composite objectives (4.105) with possibly nonsmooth, using minimum Frobenius norm quadratic models)
- ASTRO-DF [79]
-
A Python package aimed at stochastic MBDFO problems. It can handle constraints and is well-suited to simulation-based optimization, where common random numbers can be used to evaluate the objective at different points in a consistent way. It is based on underdetermined quadratic interpolation models (using diagonal Hessians as per Corollary 4.15).
- NOMAD [11]
-
A widely used direct search code, written in C++ with MATLAB and Python interfaces. It constructs quadratic models with minimum Frobenius norm interpolation (or an alternative surrogate) to accelerate the direct search process via a model-based search step, and/or by ordering the points to be checked. It can handle nonsmooth problems, constraints, discrete variables, multi-objective problems and more.
Table 8.1 provides a list of URLs for the above packages, accurate at the time of writing. This list is not intended to be exhaustive, but hopefully can provide a starting point for the interested reader.
| Package | Location |
|---|---|
| PRIMA (Powell’s software) | https://github.com/libprima/prima |
| Py-BOBYQA | https://github.com/numericalalgorithmsgroup/pybobyqa |
| COBYQA | https://github.com/cobyqa/cobyqa |
| DFO-LS | https://github.com/numericalalgorithmsgroup/dfols |
| IBCDFO | https://github.com/POptUS/IBCDFO |
| ASTRO-DF | https://github.com/simopt-admin/simopt |
| NOMAD | https://github.com/bbopt/nomad |
8.2 Outlook
MBDFO theory and algorithms have matured greatly over the last 30 years, but there is still much ongoing work to be done. Compared to many other areas of optimization, the gap between theoretically studied algorithms and efficient, practical software is relatively large, and it remains to be seen how closely aligned these can be made. As local MBDFO methods mature, there is more scope to incorporate them into global optimization techniques such as multistart methods [99, 87], since MBDFO methods tend to outperform global optimization methods when good estimates of minimizers are available [129]. Another consequence of the increasing maturity is that MBDFO can now start to be applied to optimization problems with more complex structure, such as multi-objective optimization [54, 102], optimization on manifolds [108] and mixed integer problems [143, 92].
Acknowledgments
This work was supported by the Australian Research Council Discovery Early Career Award DE240100006. Thanks to Warren Hare for the encouragement to pursue this project. Thanks to Nicole Felice, Warren Hare, Jeffrey Larson, Fangyu Liu, Matt Menickelly, Clément Royer and Stefan Wild for spotting errors and providing helpful feedback. Thanks to the editor and anonymous referees for their suggestions on the manuscript.
References
- [1] (2024) Challenges and opportunities in quantum optimization. Nature Reviews Physics 6, pp. 718–735. External Links: Document Cited by: Example 1.2.
- [2] (2021) Two decades of blackbox optimization applications. EURO Journal on Computational Optimization 9, pp. 100011. External Links: Document Cited by: §1.1.
- [3] (2010) Globalization strategies for mesh adaptive direct search. Computational Optimization and Applications 46 (2), pp. 193–215. External Links: Document Cited by: §6.2.
- [4] (2004) A pattern search filter method for nonlinear programming without derivatives. SIAM Journal on Optimization 14 (4), pp. 980–1010. External Links: Document Cited by: §6.2.
- [5] (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization 17 (1), pp. 188–217. External Links: Document Cited by: item Direct search methods, §4.5, §6.2.
- [6] (2009) A progressive barrier for derivative-free nonlinear programming. SIAM Journal on Optimization 20 (1), pp. 445–472. External Links: Document Cited by: §6.2.
- [7] (2021) Stochastic mesh adaptive direct search for blackbox optimization using probabilistic estimates. Computational Optimization and Applications 79 (1), pp. 1–34. External Links: Document Cited by: §7.2.
- [8] (2017) Derivative-free and blackbox optimization. Springer Series in Operations Research and Financial Engineering, Springer, Cham, Switzerland. Cited by: §1.1, §1.1, §1.1, §1.3, §1.3.
- [9] (2020) Model-based methods in derivative-free nonsmooth optimization. In Numerical Nonsmooth Optimization: State of the Art Algorithms, A. M. Bagirov, M. Gaudioso, N. Karmitsa, M. M. Mäkelä, and S. Taheri (Eds.), External Links: Document Cited by: §4.5.
- [10] (2018) Order-based error for managing ensembles of surrogates in mesh adaptive direct search. Journal of Global Optimization 70 (3), pp. 645–675. External Links: Document Cited by: §1.1.
- [11] (2022) Algorithm 1027: NOMAD version 4: nonlinear optimization with the MADS algorithm. ACM Transactions on Mathematical Software 48 (3), pp. 1–22. External Links: Document Cited by: §4.5, item NOMAD [11], item NOMAD [11].
- [12] (2015) Linear equalities in blackbox optimization. Computational Optimization and Applications 61 (1), pp. 1–23. External Links: Document Cited by: §6.2.
- [13] (2022) Quantifying uncertainty with ensembles of surrogates for blackbox optimization. Computational Optimization and Applications 83 (1), pp. 29–66. External Links: Document Cited by: §1.1.
- [14] (2006) Finding optimal algorithmic parameters using derivative-free optimization. SIAM Journal on Optimization 17 (3), pp. 642–664. External Links: Document Cited by: §1.1.
- [15] (2014) A survey on direct search methods for blackbox optimization and their applications. In Mathematics Without Boundaries, P. M. Pardalos and T. M. Rassias (Eds.), pp. 31–56. External Links: Document Cited by: §1.1.
- [16] (2017) A trust-region method for derivative-free nonlinear constrained stochastic optimization. Note: arXiv preprint arXiv:1703.04156 External Links: 1703.04156 Cited by: §4.5, §7.2.
- [17] (2014) Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24 (3), pp. 1238–1264. External Links: Document Cited by: §3.3.
- [18] (2017) First-order methods in optimization. SIAM. Cited by: §6.1, §6.2.
- [19] (2017) Best practices for comparing optimization algorithms. Optimization and Engineering 18 (4), pp. 815–848. External Links: Document Cited by: §1.1.
- [20] (2005) Numerical solution of saddle point problems. Acta Numerica 14, pp. 1–137. External Links: Document Cited by: §4.3, Remark 4.12.
- [21] (2022) A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Foundations of Computational Mathematics 22, pp. 507–560. External Links: 1910.04055, Document Cited by: §4.2.
- [22] (2013) Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM Journal on Optimization 23 (1), pp. 27–53. External Links: Document Cited by: §7.2.
- [23] (2000) Global optimization of costly nonconvex functions using radial basis functions. Optimization and Engineering 1, pp. 373–397. External Links: Document Cited by: §4.5.
- [24] (2019) Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization 1 (2), pp. 92–119. External Links: Document Cited by: §7.2, §7.2, §7.2, Proposition 7.13.
- [25] (1999) A rigorous framework for optimization of expensive functions by surrogates. Structural Optimization 13, pp. 1–13. External Links: Document Cited by: §1.1, §1.3, §6.2.1.
- [26] (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. External Links: Document Cited by: §7.2.
- [27] (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford. Cited by: §7.2, Proposition 7.8.
- [28] (1986) A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, R. Dykstra, T. Robertson, and F. T. Wright (Eds.), New York, NY, pp. 28–47. External Links: ISBN 978-1-4613-9940-7 Cited by: §6.1.
- [29] (2017) Michael James David Powell. Note: https://www.damtp.cam.ac.uk/user/na/NA_papers/NA2017_04.pdf Cited by: footnote 33.
- [30] (2000) Radial basis functions. Acta Numerica 9, pp. 1–38. External Links: Document Cited by: §4.5.
- [31] (2006) Knitro: an integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, P. Pardalos, G. Di Pillo, and M. Roma (Eds.), Vol. 83, pp. 35–59. External Links: Document Cited by: §2.2.
- [32] (2024) First- and second-order high probability complexity bounds for trust-region methods with noisy oracles. Mathematical Programming 207, pp. 55–106. External Links: Document Cited by: §4.2, §4.5, §7.2, §7.2, Remark 7.17.
- [33] (2023) The error in multivariate linear extrapolation with applications to derivative-free optimization. Note: arXiv preprint arXiv:2307.00358 Cited by: §5.3, §5.3.
- [34] (2012) An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis 32 (4), pp. 1662–1695. External Links: ISSN 0272-4979, 1464-3642, Document Cited by: §6.2.
- [35] (2019) Improving the flexibility and robustness of model-based derivative-free optimization solvers. ACM Transactions on Mathematical Software 45 (3), pp. 32:1–32:41. External Links: Document Cited by: §7.2, §7, item Software of M. J. D. Powell, item DFO-LS [35], item DFO-LS [35].
- [36] (2022) Evaluation complexity of algorithms for nonconvex optimization: theory, computation and perspectives. MOS-SIAM Series on Optimization, MOS/SIAM, Philadelphia. Cited by: §2.1, §2.1, §2.3, Theorem 2.12, Theorem 2.13, Remark 2.5, Lemma 2.6, footnote 10.
- [37] (2019) A derivative-free Gauss–Newton method. Mathematical Programming Computation 11 (4), pp. 631–674. External Links: Document Cited by: §4.5, §4.5.
- [38] (2023) Scalable subspace methods for derivative-free nonlinear least-squares optimization. Mathematical Programming 199 (1-2), pp. 461–524. External Links: Document Cited by: §3.3.
- [39] (2024) Randomized subspace derivative-free optimization with quadratic models and second-order convergence. arXiv. Note: arXiv preprint arXiv:2412.14431 External Links: 2412.14431, Document Cited by: §3.3.
- [40] (2025) On complexity of model-based derivative-free methods. arXiv. Note: arXiv preprint arXiv:2510.14935 External Links: 2510.14935 Cited by: §4.5, §5.3, §5.3, §5.3, §7.2.
- [41] (2018) Stochastic optimization using a trust-region method and random models. Mathematical Programming 169 (2), pp. 447–487. External Links: Document Cited by: Remark 2.5, §7.2, §7.2, §7.2.
- [42] (2016) Optimization with hidden constraints and embedded Monte Carlo computations. Optimization and Engineering 17 (1), pp. 157–175. External Links: Document Cited by: §6.2.1, §6.2.1.
- [43] (2000) Optimization of automotive valve train components with implicit filtering. Optimization and Engineering 1, pp. 9–27. External Links: Document Cited by: §1.1.
- [44] (2013) Global convergence of trust-region algorithms for convex constrained minimization without derivatives. Applied Mathematics and Computation 220, pp. 324–330. External Links: Document Cited by: §6.2.
- [45] (2008) Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA Journal of Numerical Analysis 28 (4), pp. 721–748. External Links: Document Cited by: §4.5, §5.3, §7.2.
- [46] (2007) Geometry of interpolation sets in derivative free optimization. Mathematical Programming 111 (1-2), pp. 141–172. External Links: Document Cited by: §4.5, §5.3.
- [47] (2000) Trust-region methods. MPS-SIAM Series on Optimization, Vol. 1, MPS/SIAM, Philadelphia. Cited by: §1.2, §2.2, §2.2, §2.3, Remark 2.5, Lemma 2.8, 2nd item, §3, §6.1, §6.2, §6.2, §6.2, Proposition 6.2.
- [48] (2013) Use of quadratic models with mesh-adaptive direct search for constrained black box optimization. Optimization Methods and Software 28 (1), pp. 139–158. External Links: Document Cited by: §1.1.
- [49] (1997) On the convergence of derivative-free methods for unconstrained optimization. In Approximation Theory and Optimization: Tributes to M. J. D. Powell, M. D. Buhmann and A. Iserles (Eds.), Cited by: §3.3.
- [50] (2009) Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM Journal on Optimization 20 (1), pp. 387–415. External Links: Document Cited by: §3.3.
- [51] (2009) Introduction to derivative-free optimization. MPS-SIAM Series on Optimization, Vol. 8, MPS/SIAM, Philadelphia. Cited by: §1.1, §1.3, Remark 2.5, §3.3, §4.5, Remark 4.3, §5.3, §7.2, §8.
- [52] (1996) An algorithm using quadratic interpolation for unconstrained derivative free optimization. In Nonlinear Optimization and Applications, G. Di Pillo and F. Giannessi (Eds.), pp. 27–47. External Links: Document Cited by: §4.5.
- [53] (2024) Derivative-free bound-constrained optimization for solving structured problems with surrogate models. Optimization Methods and Software 39 (4), pp. 845–873. External Links: Document Cited by: §4.4.
- [54] (2011) Direct multisearch for multiobjective optimization. SIAM Journal on Optimization 21 (3), pp. 1109–1140. External Links: Document Cited by: §8.2.
- [55] (2010) Incorporating minimum Frobenius norm models in direct search. Computational Optimization and Applications 46 (2), pp. 265–278. External Links: Document Cited by: §1.1.
- [56] (2012) Recent developments in derivative-free multiobjective optimisation. Computational Technology Reviews 5, pp. 1–30. External Links: ISSN 2044-8430, Document Cited by: §1.1.
- [57] (2025) TRFD: a derivative-free trust-region method based on finite differences for composite nonsmooth optimization. SIAM Journal on Optimization 35 (3), pp. 1792–1821. External Links: Document Cited by: §4.5.
- [58] (2024) Complexity of trust-region methods in the presence of unbounded Hessian approximations. arXiv. Note: arXiv preprint arXiv:2408.06243 External Links: 2408.06243 Cited by: footnote 21.
- [59] (2025) First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians. Journal of Scientific Computing 103 (1), pp. 32. External Links: Document Cited by: §4.2, §4.5.
- [60] (2024) Stochastic trust-region algorithm in random subspaces with convergence and expected complexity analyses. SIAM Journal on Optimization 34 (3), pp. 2671–2699. External Links: Document Cited by: §3.3.
- [61] (2025) Direct-search methods in the year 2025: theoretical guarantees and algorithmic paradigms. EURO Journal on Computational Optimization 13, pp. 100110. External Links: Document Cited by: §1.1.
- [62] (2023) Constrained stochastic blackbox optimization using a progressive barrier and probabilistic estimates. Mathematical Programming 198 (1), pp. 675–732. External Links: Document Cited by: §7.2.
- [63] (2021) Inexact derivative-free optimization for bilevel learning. Journal of Mathematical Imaging and Vision 63 (5), pp. 580–600. External Links: Document Cited by: §7.2.
- [64] (2026) A trust-region interior-point stochastic sequential quadratic programming method. arXiv preprint arXiv:2603.10230. Cited by: §6.2, §7.2.
- [65] (2026) High probability complexity bounds of trust-region stochastic sequential quadratic programming with heavy-tailed noise. Mathematical Programming. Cited by: §6.2, §7.2, §7.2.
- [66] (2024) Trust-region sequential quadratic programming for stochastic optimization with random models. arXiv. Note: arXiv preprint arXiv:2409.15734 External Links: 2409.15734 Cited by: §6.2, §7.2, §7.2.
- [67] (2016) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM Journal on Optimization 26 (4), pp. 1987–2011. External Links: Document Cited by: §3.3.
- [68] (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. External Links: Document Cited by: item Finite differencing/implicit filtering.
- [69] (1996) Matrix computations. 3rd ed edition, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore. Cited by: footnote 16.
- [70] (2003) GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization. ACM Transactions on Mathematical Software 29 (4), pp. 353–372. External Links: Document Cited by: §2.2, §7.2.
- [71] (2010) On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation 2 (1), pp. 21–57. External Links: Document Cited by: §2.2.
- [72] (2016) A derivative-free trust-region algorithm for composite nonsmooth optimization. Computational and Applied Mathematics 35 (2), pp. 475–499. External Links: Document Cited by: §4.5.
- [73] (2023) Quadratic regularization methods with finite-difference gradient approximations. Computational Optimization and Applications 85 (3), pp. 683–703. External Links: Document Cited by: §4.5.
- [74] (2020) A decoupled first/second-order steps technique for nonconvex nonlinear unconstrained optimization with improved complexity bounds. Mathematical Programming 179 (1-2), pp. 195–222. External Links: Document Cited by: §2.3.
- [75] (2005) Toeplitz and circulant matrices: a review. Foundations and Trends® in Communications and Information Theory 2 (3), pp. 155–239. External Links: Document Cited by: 1st item, footnote 35.
- [76] (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation. 2nd edition, SIAM, Philadelphia. External Links: Document Cited by: item Automatic (aka algorithmic) differentiation.
- [77] (2001) A radial basis function method for global optimization. Journal of Global Optimization 19, pp. 201–227. External Links: Document Cited by: §4.5.
- [78] (2025) Complexity of zeroth- and first-order stochastic trust-region algorithms. SIAM Journal on Optimization 35 (3), pp. 2098–2127. External Links: Document Cited by: §7.2.
- [79] (2021) Improved complexity of trust-region optimization for zeroth-order stochastic oracles with adaptive sampling. In 2021 Winter Simulation Conference (WSC), Phoenix, AZ, USA, pp. 1–12. External Links: Document Cited by: item ASTRO-DF [79], item ASTRO-DF [79].
- [80] (2024) Iteration complexity and finite-time efficiency of adaptive sampling trust-region methods for stochastic derivative-free optimization. IISE Transactions, pp. 1–15. External Links: Document Cited by: §7.2.
- [81] (2004) On the convergence of the UOBYQA method. Journal of Applied Mathematics and Computing 16 (1-2), pp. 125–142. External Links: Document Cited by: §5.3.
- [82] (2024) A modified derivative-free SQP-filter trust-region method for uncertainty handling: application in gas-lift optimization. Optimization and Engineering. External Links: Document Cited by: §6.2, §6.2.
- [83] (2023) Limiting behaviour of the generalized simplex gradient as the number of points tends to infinity on a fixed shape in . Set-Valued and Variational Analysis 31, pp. 1:1–1:31. External Links: Document Cited by: §7.2.
- [84] (2024) A matrix algebra approach to approximate Hessians. IMA Journal of Numerical Analysis 44 (4), pp. 2220–2250. External Links: Document Cited by: §4.5.
- [85] (2020) Calculus identities for generalized simplex gradients: rules and applications. SIAM Journal on Optimization 30 (1), pp. 853–884. External Links: Document Cited by: §4.5.
- [86] (2022) Model-based derivative-free methods for convex-constrained optimization. SIAM Journal on Optimization 32 (4), pp. 2552–2579. External Links: Document Cited by: §4.5, §5.3, §6.1, §6.1.1, §6.2, Lemma 6.8.
- [87] (2024) Multistart algorithm for identifying all optima of nonconvex stochastic functions. Optimization Letters 18 (6), pp. 1335–1360. External Links: ISSN 1862-4472, 1862-4480, Document Cited by: §8.2.
- [88] (2025) Sample complexity analysis for adaptive optimization algorithms with stochastic oracles. Mathematical Programming 209 (1-2), pp. 651–679. External Links: Document Cited by: §7.2, §7.2.
- [89] (1993) Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications 79 (1), pp. 157–181. External Links: Document Cited by: §1.1.
- [90] (2015) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. Ph.D. Thesis, Universidade de Coimbra. Cited by: §3.3, §4.5.
- [91] (2011) Implicit filtering. SIAM, Philadelphia. Cited by: §1.1.
- [92] (2025) MATRS: heuristic methods for noisy derivative-free bound-constrained mixed-integer optimization. Mathematical Programming Computation. External Links: Document Cited by: §8.2.
- [93] (2003) Optimization by direct search: new perspectives on some classical and modern methods. SIAM Review 45 (3), pp. 385–482. External Links: Document Cited by: item Direct search methods, §1.1.
- [94] (2022) Use of static surrogates in hyperparameter optimization. Operations Research Forum 3 (1), pp. 11. External Links: Document Cited by: §1.1.
- [95] (2016) Stochastic derivative-free optimization using a trust region framework. Computational Optimization and Applications 64 (3), pp. 619–645. External Links: Document Cited by: §7.2.
- [96] (2019) Derivative-free optimization methods. Acta Numerica 28, pp. 287–404. Cited by: §1.1, §1.3, §6.2, §8.
- [97] (2021) Manifold sampling for optimizing nonsmooth nonconvex compositions. SIAM Journal on Optimization 31 (4), pp. 2638–2664. External Links: Document Cited by: §4.5, §4.5, item IBCDFO.
- [98] (2024) Structure-aware methods for expensive derivative-free nonsmooth composite optimization. Mathematical Programming Computation 16 (1), pp. 1–36. External Links: Document Cited by: §4.5, §4.5, item IBCDFO.
- [99] (2018) Asynchronously parallel optimization solver for finding multiple minima. Mathematical Programming Computation 10 (3), pp. 303–332. External Links: Document Cited by: §8.2.
- [100] (2024) A taxonomy of constraints in black-box simulation-based optimization. Optimization and Engineering 25 (2), pp. 1125–1143. External Links: Document Cited by: §6.2.1.
- [101] (2024) Black-box optimization algorithms for regularized least-squares problems. arXiv. Note: arXiv preprint 2407.14915 External Links: 2407.14915 Cited by: §4.5.
- [102] (2025) Worst-case complexity analysis of derivative-free methods for multi-objective optimization. arXiv. External Links: 2505.17594, Document Cited by: §8.2.
- [103] (2013) Global optimization: theory, algorithms, and applications. MOS-SIAM Series on Optimization, SIAM, Philadelphia. Cited by: §1.1.
- [104] (2007) Trailing-edge noise reduction using derivative-free optimization and large-eddy simulation. Journal of Fluid Mechanics 572, pp. 13–36. External Links: Document Cited by: §1.1.
- [105] (2009) Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization 20 (1), pp. 172–191. External Links: Document Cited by: §1.1.
- [106] (2011) Estimating computational noise. SIAM Journal on Scientific Computing 33 (3), pp. 1292–1314. External Links: Document Cited by: §7.2, §7.2.
- [107] (2012) Estimating derivatives of noisy simulations. ACM Transactions on Mathematical Software 38 (3), pp. 1–21. External Links: Document Cited by: §7.2.
- [108] (2026) Derivative-free optimization on riemannian manifolds using simplex gradient approximations. Journal of Optimization Theory and Applications 208 (1), pp. 3. External Links: Document Cited by: §8.2.
- [109] (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566. External Links: Document Cited by: item Finite differencing/implicit filtering.
- [110] (2004) Introductory lectures on convex optimization. Springer US. External Links: ISBN 978-1-4020-7553-7 Cited by: footnote 5.
- [111] (2006) Numerical optimization. 2nd edition, Springer Series in Operations Research and Financial Engineering, Springer, New York. Cited by: §1.1, §1.3, item 1, §2.1, §2.1, §2.1, §2.2, §2.3, Proposition 2.3, §3, §6.2, §7.1, §7.2, footnote 26.
- [112] (2019) Derivative-free optimization solver for calibration problems. Note: https://nag.com/derivative-free-optimization-dfo/ Cited by: §1.1.
- [113] (1989) Trust region algorithms for optimization with nonlinear equality and inequality constraints. Ph.D. Thesis, University of Colorado at Boulder. Cited by: §6.2.
- [114] (2023) Constrained optimization in the presence of noise. SIAM Journal on Optimization 33 (3), pp. 2118–2136. External Links: Document Cited by: §7.2.
- [115] (1994) A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis, S. Gomez and J. Hennart (Eds.), pp. 51–67. External Links: Document Cited by: item Software of M. J. D. Powell.
- [116] (1998) Direct search algorithms for optimization calculations. Acta Numerica 7, pp. 287–336. External Links: Document Cited by: §1.1.
- [117] (2001) On the Lagrange functions of quadratic models that are defined by interpolation. Optimization Methods and Software 16 (1-4), pp. 289–309. External Links: Document Cited by: §5.3.
- [118] (2002) UOBYQA: unconstrained optimization by quadratic approximation. Mathematical Programming 92 (3), pp. 555–582. External Links: Document Cited by: §5.2, Remark 5.16, item Software of M. J. D. Powell, footnote 11.
- [119] (2004) Least Frobenius norm updating of quadratic models that satisfy interpolation conditions. Mathematical Programming 100 (1), pp. 183–215. External Links: Document Cited by: §4.5, Lemma 4.10, §5.3.
- [120] (2006) The NEWUOA software for unconstrained optimization without derivatives. In Large-Scale Nonlinear Optimization, P. Pardalos, G. Di Pillo, and M. Roma (Eds.), Vol. 83, pp. 255–297. External Links: Document Cited by: item Software of M. J. D. Powell.
- [121] (2009) The BOBYQA algorithm for bound constrained optimization without derivatives. Technical report Technical Report DAMTP 2009/NA06, University of Cambridge. Cited by: item Software of M. J. D. Powell.
- [122] (2015) On fast trust region methods for quadratic models with linear constraints. Mathematical Programming Computation 7 (3), pp. 237–267. External Links: Document Cited by: §6.2, item Software of M. J. D. Powell.
- [123] (2024-12) PDFO: a cross-platform package for Powell’s derivative-free optimization solvers. Mathematical Programming Computation 16 (4), pp. 535–559. External Links: Document Cited by: §5.3, §6.2.1, item Software of M. J. D. Powell.
- [124] (2022) Model-based derivative-free optimization methods and software. Ph.D. Thesis, Hong Kong Polytechnic University. Cited by: §6.2, §6.2, §6.2, item COBYQA [124], item COBYQA [124].
- [125] (2006) Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, Massachusetts. External Links: ISBN 978-0-262-18253-9 Cited by: §4.5.
- [126] (2007) A stochastic radial basis function method for the global optimization of expensive functions. INFORMS Journal on Computing 19 (4), pp. 497–509. External Links: Document Cited by: §1.3.
- [127] (2015) The calculus of simplex gradients. Optimization Letters 9 (5), pp. 845–865. External Links: Document Cited by: §4.5.
- [128] (2024) Stochastic trust-region and direct-search methods: a weak tail bound condition and reduced sample sizing. SIAM Journal on Optimization 34 (2), pp. 2067–2092. External Links: Document Cited by: §7.2.
- [129] (2013) Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56 (3), pp. 1247–1293. External Links: Document Cited by: §1.1, §8.2.
- [130] (2025) Model construction for convex-constrained derivative-free optimization. SIAM Journal on Optimization 35 (2), pp. 622–650. External Links: Document Cited by: 2nd item, §4.5, §5.2, §5.3, §6.1.1, §6.1.1, §6.2.
- [131] (2010) Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization. SIAM Journal on Optimization 20 (6), pp. 3512–3532. External Links: Document Cited by: §5.3.
- [132] (2024) On complexity constants of linear and quadratic models for derivative-free trust-region algorithms. Optimization Letters. External Links: Document Cited by: §4.5.
- [133] (2016) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. External Links: Document Cited by: §1.3, §4.5.
- [134] (2018) ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM Journal on Optimization 28 (4), pp. 3145–3176. External Links: Document Cited by: §7.2.
- [135] (2022) A noise-tolerant quasi-Newton algorithm for unconstrained optimization. SIAM Journal on Optimization 32 (1), pp. 29–55. External Links: Document Cited by: §7.2.
- [136] (2022) Adaptive finite-difference interval estimation for noisy derivative-free optimization. SIAM Journal on Scientific Computing 44 (4), pp. A2302–A2321. External Links: Document Cited by: §7.2.
- [137] (2014) Addressing failures in exascale computing. The International Journal of High Performance Computing Applications 28 (2), pp. 129–173. External Links: Document Cited by: §6.2.1.
- [138] (1998) Using complex variables to estimate derivatives of real functions. SIAM Review 40 (1), pp. 110–112. Cited by: footnote 3.
- [139] (1998) Global optimization requires global information. Journal of Optimization Theory and Applications 96 (3), pp. 575–588. External Links: Document Cited by: §1.1.
- [140] (2023) A trust region method for noisy unconstrained optimization. Mathematical Programming 202 (1-2), pp. 445–472. External Links: Document Cited by: §7.2.
- [141] (2022) Does model calibration reduce uncertainty in climate projections?. Journal of Climate 35 (8), pp. 2585–2602. External Links: Document Cited by: Example 1.1.
- [142] (1988) Global convergence of a a of trust-region methods for nonconvex minimization in Hilbert space. IMA Journal of Numerical Analysis 8 (2), pp. 231–252. External Links: Document Cited by: footnote 21.
- [143] (2024) A trust-region framework for derivative-free mixed-integer optimization. Mathematical Programming Computation. External Links: Document Cited by: §8.2.
- [144] (2020) Approximation theory and approximation practice. SIAM, Philadelphia. Cited by: §5.3, Proposition 5.1.
- [145] (2016) A sequential quadratic programming algorithm for equality-constrained optimization without derivatives. Optimization Letters 10 (2), pp. 383–399. External Links: Document Cited by: §6.2, §6.2.
- [146] (2008) ORBIT: optimization by radial basis function interpolation in trust-regions. SIAM Journal on Scientific Computing 30 (6), pp. 3197–3219. External Links: Document Cited by: §4.4, §4.5.
- [147] (2013) Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Review 55 (2), pp. 349–371. External Links: Document Cited by: §4.4, §4.5.
- [148] (2008) MNH: a derivative-free optimization algorithm using minimal norm hessians. In Tenth Copper Mountain Conference on Iterative Methods, Cited by: 1st item.
- [149] (2017) POUNDERS in TAO: solving derivative-free nonlinear least-squares problems with POUNDERS. In Advances and Trends in Optimization with Engineering Applications, T. Terlaky, M. F. Anjos, and S. Ahmed (Eds.), MOS-SIAM Book Series on Optimization, Vol. 24, pp. 529–539. Cited by: §4.5, item IBCDFO.
- [150] (2025) ReMU: regional minimal updating for model-based derivative-free optimization. arXiv. Note: arXiv preprint arXiv:2504.03606 External Links: 2504.03606, Document Cited by: §4.5.
- [151] (2024) A new two-dimensional model-based subspace method for large-scale unconstrained derivative-free optimization: 2D-MoSub. arXiv. Note: arXiv preprint arXiv:2309.14855 External Links: 2309.14855 Cited by: §3.3.
- [152] (2025) A derivative-free method using a new underdetermined quadratic interpolation model. SIAM Journal on Optimization 35 (2), pp. 1110–1133. External Links: Document Cited by: §4.5.
- [153] (2025) Least norm updating of quadratic interpolation models for derivative-free trust-region algorithms. IMA Journal of Numerical Analysis, pp. drae106. External Links: Document Cited by: §4.5.
- [154] (2015) Recent advances in trust region algorithms. Mathematical Programming 151 (1), pp. 249–281. External Links: Document Cited by: §2.3.
- [155] (2010) A derivative-free algorithm for least-squares minimization. SIAM Journal on Optimization 20 (6), pp. 3555–3576. External Links: Document Cited by: §4.5, §4.5, footnote 24.
- [156] (2024) On the relationship between -poisedness in derivative-free optimization and outliers in local outlier factor. arXiv. External Links: 2407.17529 Cited by: §5.3.
- [157] (2023) PRIMA: reference implementation for Powell’s methods with modernization and amelioration. Note: available at http://www.libprima.net, DOI: 10.5281/zenodo.8052654 Cited by: item Software of M. J. D. Powell.
- [158] (2014) Sobolev seminorm of quadratic functions with applications to derivative-free optimization. Mathematical Programming 146 (1-2), pp. 77–96. External Links: Document Cited by: §4.5.
- [159] (2025) Scalable derivative-free optimization algorithms with low-dimensional subspace techniques. arXiv. Note: arXiv preprint arXiv:2501.04536 External Links: 2501.04536, Document Cited by: §3.3.
Appendix A Technical Results
Here, we collect some technical results used in the main text.
Lemma A.1.
Suppose we have two linear functions for , defined as , such that for all , for some and . Then
| (A.1) |
Proof.
First, we have . Next, if , choose , to ensure that and is parallel to . This gives
| (A.2) |
If instead then the bound on is trivial. ∎
Lemma A.2.
Suppose we have two quadratic functions for , defined as , such that for all , for some and . Then
| (A.3) |
Proof.
First, we have . Hence for any we have
| (A.4) |
Now, define if , or any unit vector otherwise. Similarly, define to be a unit eigenvector corresponding to the largest eigenvalue in magnitude of . Hence we have and .
Now applying (A.4) to and , we get
| (A.5) | ||||
| (A.6) |
respectively. The first inequality (A.5), gives us
| (A.7) | ||||
| (A.8) |
where the second inequality follows from the reverse triangle inequality. For the second condition (A.6), we first suppose , so . In that case,
| (A.9) |
and in the other case , for which , we reach the same conclusion via
| (A.10) |
Since by Cauchy-Schwarz and from Rayleigh quotients, we ultimately conclude
| (A.11) |
The first of these conditions implies , and so . This gives . Lastly, we apply , and we get the desired result. ∎
Appendix B Poisedness of Structured Fully Quadratic Models
Here, we explicitly estimate the poisedness constant for the structured fully quadratic interpolation set (4.48), namely
| (B.1) |
We may without loss of generality assume and , since the values of the Lagrange polynomials are invariant to shifts and scalings. That is, we take our interpolation set to be
| (B.2) |
We recall the proof of Lemma 4.7, which explicitly computes the coefficients for the associated interpolation linear system: if we wish to interpolate values for all , then (with our normalization and ) we get , and for , and for with . From these explicit formulae, the Lagrange polynomials for our interpolation set are:
-
•
For the interpolation point ,
(B.3) where we note that the Hessian has entries if and if , and is a circulant matrix [75];
-
•
For the interpolation points , ,
(B.4) -
•
For the interpolation points , ,
(B.5) -
•
For the interpolation points , with ,
(B.6)
We now maximize the magnitude of each Lagrange polynomial over individually for all .
For , we observe that and so
| (B.7) |
The Gershgorin circle theorem allows us to estimate353535The exact value (for sufficiently large) may be calculated using the explicit formula for eigenvalues of circulant matrices [75], but this is not needed for our estimate here. , and so . For , we note that for all , and for this constraint the term is maximized/minimized if all () are equal, . So, for a given value of ,
| (B.8) |
(but where the last term is zero if ). Since and for all we may estimate
| (B.9) |
For , we note that we get the same Lagrange polynomial as the structured minimum Frobenius norm quadratic interpolation set in (5.42), which gives . Lastly, for , we can observe that for , and so .
All together, we have determined that this interpolation set has poisedness constant . This bound is tight: for example, we can compute , which implies .
Appendix C Proof of Theorem 7.6
Lemma C.1.
Proof.
That holds follows by assumption on , so it remains to show .
Lemma C.2.
Proof.
By assumption on , the result holds for . By induction, suppose that for some and to find a contradiction suppose that .
Since , iteration must have unsuccessful, and so . Thus, .
The proof of Theorem 7.6 is then identical to that of Theorem 3.7, using Lemma C.2 in place of Lemma 3.6.
Remark C.3.
Taking decreases and increases (and hence decreases the worst-case complexity bound, ). That is, using allows for higher-accuracy solutions and decreases the iteration complexity bound.