Introduction to Model-Based Derivative-Free Optimization

Lindon Roberts lindon.roberts@unimelb.edu.au
School of Mathematics and Statistics & ARC Training Centre in Optimisation Technologies, Integrated Methodologies, and Applications (OPTIMA)
University of Melbourne

Abstract

The field of derivative-free optimization (DFO) studies algorithms for nonlinear optimization that do not rely on the availability of gradient or Hessian information. It is primarily designed for settings where functions are black-box, expensive to evaluate and/or noisy. A widely used and studied class of DFO methods for local optimization is model-based DFO, where the general principles from derivative-based nonlinear optimization algorithms are followed, but local Taylor-type approximations are replaced with alternative local models constructed by interpolation. This document provides an overview of the basic algorithms and analysis for model-based DFO, covering worst-case complexity, approximation theory for polynomial interpolation models, and extensions to constrained and noisy problems.


1 Introduction

This article is an introduction to the algorithmic ideas and analysis for model-based derivative-free optimization (MBDFO). As the name suggests, derivative-free optimization (DFO), model-based or otherwise, refers to optimization in the absence of derivative information, whether that be for the objective function and/or constraint functions. (Footnote 1: DFO is sometimes called zeroth order optimization (ZOO), particularly in machine learning.) Our focus here is nonlinear optimization, where we aim to minimize a (usually assumed smooth) function of several real variables, possibly with constraints, which may be specified either explicitly (e.g. $x_{1}\geq 0$) or via other nonlinear function(s) (e.g. $c(\bm{x})\geq 0$). Most general-purpose nonlinear optimization algorithms, such as gradient descent or (quasi-)Newton methods, require some derivative information, such as the gradient or Hessian of the objective (and any nonlinear constraints). DFO methods become relevant in situations where no derivative information is available, and MBDFO methods form an important sub-class of DFO methods.

1.1 Overview of Derivative-Free Optimization

In order to understand the uses/benefits of DFO, we first have to consider how we may obtain derivative information. If we have a function $f:\mathbb{R}^{n}\to\mathbb{R}$ , such as our objective function, there are three main ways to evaluate or approximate its derivatives [111, Chapter 8]:

Explicit calculation

If the mathematical form of $f$ is known, we can directly compute the relevant derivatives by hand (or using a symbolic computation package);

Automatic (aka algorithmic) differentiation

If we have computer code to compute $\bm{x}\mapsto f(\bm{x})$, automatic differentiation software can be used to create code that computes derivatives of $f$ by analyzing the code for $f$ and repeatedly applying the chain rule. (Footnote 2: There are two ‘modes’ of automatic differentiation, forward and reverse. If $f$ has many inputs and few outputs, which is typically the case for optimization, the reverse mode is usually more efficient. In machine learning, the reverse mode is often called backpropagation.) This produces exact derivatives, typically for the cost (in both time and memory) of a small number of evaluations of $f$, that is, $\mathcal{O}(1)$ evaluations, independent of $n$ [76, Chapter 4].

Finite differencing

We can approximate derivatives of $f$ by comparing values of $f$ at nearby points. For example, forward finite differencing estimates first derivatives via

\displaystyle\frac{\partial f}{\partial x_{i}}(\bm{x})\approx\frac{f(\bm{x}+h\bm{e}_{i})-f(\bm{x})}{h},

(1.1)

for some small $h>0$, where $\bm{e}_{i}\in\mathbb{R}^{n}$ is the $i$-th coordinate vector. (Footnote 3: If $f$ can be extended to complex-valued inputs, the complex step approximation $\frac{\partial f}{\partial x_{i}}(\bm{x})\approx\operatorname{Im}(f(\bm{x}+\mathrm{i}h\bm{e}_{i}))/h$, where $\mathrm{i}$ denotes the imaginary unit, can be used [138]. This is less susceptible than (1.1) to rounding errors when $h$ is very small.) To evaluate a full gradient $\nabla f(\bm{x})$, we would need to evaluate $f$ at $\bm{x}$ and $\bm{x}+h\bm{e}_{i}$ for all $i=1,\ldots,n$ (i.e. $n+1$ evaluations of $f$). This only requires the ability to evaluate $f$, but is only an approximation.
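As a minimal sketch of the forward difference formula (1.1) and the complex step alternative, the following code approximates a gradient using only function values. The function names, the test function `example_f`, and the step sizes are illustrative assumptions, not choices made in this article.

```python
import numpy as np


def forward_difference_gradient(f, x, h=1e-7):
    """Approximate the gradient of f at x via forward differences, as in (1.1).

    Uses n + 1 evaluations of f: one at x and one at x + h * e_i for each i.
    """
    x = np.asarray(x, dtype=float)
    fx = f(x)  # the single evaluation at the base point
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += h  # x + h * e_i
        grad[i] = (f(x_step) - fx) / h
    return grad


def complex_step_gradient(f, x, h=1e-20):
    """Complex step approximation; requires f to accept complex inputs."""
    x = np.asarray(x, dtype=complex)
    grad = np.zeros(x.size)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += 1j * h  # perturb along the imaginary axis
        grad[i] = np.imag(f(x_step)) / h
    return grad


# Illustrative test function (an assumption, not from the text).
def example_f(x):
    return x[0] ** 2 + np.sin(x[1])


x0 = np.array([1.0, 0.5])
exact = np.array([2.0, np.cos(0.5)])
fd_grad = forward_difference_gradient(example_f, x0)
cs_grad = complex_step_gradient(example_f, x0)
```

In this sketch the forward difference result is only accurate to roughly the size of $h$, whereas the complex step result avoids the subtraction of nearby function values and so tolerates a much smaller $h$.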

In most circumstances, at least one of these three approaches can be successfully used without significant effort. DFO methods are useful in the situations where none of these approaches are possible or practical. This is typically the case when at least two of the following apply:

Black-box: That is, the underlying mathematical structure of $f$ is not available. This could be because the details are unknown (e.g. legacy or proprietary software is used) or a situation where a clear mathematical description does not exist (e.g. results of a real-world experiment). This immediately rules out both explicit calculation and automatic differentiation.
Expensive to evaluate: If evaluating $\bm{x}\mapsto f(\bm{x})$ is costly, whether that cost is time, effort or money, then any algorithm relying on many evaluations of $f$ is likely to be impractical. Finite differencing requires at least $n+1$ evaluations of $f$ at each iteration, and so may not be suitable in this case.
Noisy: In essence, this means that evaluating $f$ at nearby inputs leads to non-negligible differences in outputs. This noise may be deterministic or stochastic, depending on whether evaluating $f$ repeatedly at the same point gives the same result or not. This would usually not refer to rounding errors, but more significant differences such as the results of a Monte Carlo simulation. Here, finite differencing cannot be used without significant care, as it relies on comparing function values at nearby points.
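To make the difficulty with noise concrete, the following worked bound is a standard sketch rather than a result stated in this article. It assumes that we can only evaluate a perturbed function $\tilde{f}$ with $|\tilde{f}(\bm{x})-f(\bm{x})|\leq\epsilon_{f}$ for all $\bm{x}$ (the symbols $\tilde{f}$ and $\epsilon_{f}$ are introduced here purely for illustration), and that $\nabla f$ is Lipschitz continuous with constant $L_{1}$, as formalized later in Assumption 2.4.

```latex
\left|\frac{\tilde{f}(\bm{x}+h\bm{e}_{i})-\tilde{f}(\bm{x})}{h}-\frac{\partial f}{\partial x_{i}}(\bm{x})\right|
\leq \underbrace{\frac{L_{1}h}{2}}_{\text{truncation}}+\underbrace{\frac{2\epsilon_{f}}{h}}_{\text{noise}},
\qquad
\text{minimized at } h=2\sqrt{\epsilon_{f}/L_{1}} \text{ with value } 2\sqrt{L_{1}\epsilon_{f}}.
```

Shrinking $h$ reduces the truncation term but amplifies the noise term, so under these assumptions the best achievable gradient accuracy scales like $\sqrt{\epsilon_{f}}$ rather than $\epsilon_{f}$.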

A large survey of practical examples where these situations arise and DFO methods are useful is given in [2]. Some specific examples include the following.

Example 1.1 (Calibrating Climate Models [141]).

In many areas of science, mathematical models of real-world phenomena are constructed to help predict outcomes and make decisions. Often, these models have parameters (e.g. coefficients of different terms in a differential equation) that cannot be directly measured, and instead must be calibrated using indirect measurements. This most commonly leads to least-squares minimization problems

\displaystyle\min_{\bm{x}\in\mathbb{R}^{n}}f(\bm{x}):=\sum_{i=1}^{N}(\operatorname{model}(\bm{x},\bm{y}_{i})-z_{i})^{2},

(1.2)

where $\bm{x}$ are the model parameters, $z_{1},\ldots,z_{N}\in\mathbb{R}$ are observations, and $\operatorname{model}(\bm{x},\bm{y}_{i})$ represents evaluating the model with parameters $\bm{x}$ and other inputs $\bm{y}_{i}$ to produce a predicted value for $z_{i}$. In atmospheric physics, $\operatorname{model}(\bm{x},\bm{y})$ can involve a global climate simulation, incorporating coupled ocean, atmospheric and sea ice dynamics, and so its mathematical description is too complex to allow for a clear description (i.e. $f$ is black-box). Because of this complexity, evaluating $\operatorname{model}(\bm{x},\bm{y})$ can be computationally expensive, e.g. 8 hours on a high-performance computing system to simulate 18 months of climate forecasts. Additionally, climate dynamics can exhibit chaotic behavior, which in practice means that very similar parameters $\bm{x}$ can produce very different model results, which effectively means that $\operatorname{model}(\bm{x},\bm{y})$ is noisy.
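The structure of the calibration objective (1.2) can be sketched as follows. The function `toy_model` is a hypothetical, cheap stand-in for an expensive simulation, and the synthetic data are generated purely for illustration; neither comes from the climate application above.

```python
import numpy as np


def toy_model(x, y):
    """Hypothetical stand-in for an expensive simulation: x[0] * exp(-x[1] * y)."""
    return x[0] * np.exp(-x[1] * y)


def calibration_objective(x, inputs, observations, model=toy_model):
    """Least-squares misfit (1.2): sum_i (model(x, y_i) - z_i)^2."""
    residuals = np.array([model(x, y_i) - z_i
                          for y_i, z_i in zip(inputs, observations)])
    return float(np.sum(residuals ** 2))


# Synthetic observations generated from known parameters (purely illustrative).
true_params = np.array([2.0, 0.5])
inputs = np.linspace(0.0, 4.0, 9)
observations = np.array([toy_model(true_params, y) for y in inputs])

f_at_truth = calibration_objective(true_params, inputs, observations)
f_elsewhere = calibration_objective(np.array([1.0, 1.0]), inputs, observations)
```

Each call to `calibration_objective` requires $N$ model runs, which is why a single objective evaluation can dominate the cost when the model is a large simulation.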

Example 1.2 (Quantum Optimization [1]).

One promising use of quantum computers is to solve combinatorial optimization problems such as the graph max-cut problem. In the Quantum Approximate Optimization Algorithm, a particular combinatorial problem is converted into a minimum eigenvalue problem

\displaystyle\min_{\bm{x}\in\mathbb{R}^{n}}f(\bm{x}):=\bm{\psi}(\bm{x})^{*}\bm{H}\bm{\psi}(\bm{x}),

(1.3)

for some (Hermitian) matrix $\bm{H}$ and complex vector $\bm{\psi}(\bm{x})$. However, $\bm{\psi}$ is described implicitly, and $f(\bm{x})$ must be evaluated by performing specific calculations on a quantum computer. Since quantum computers are inherently stochastic, we can only ever evaluate random approximations to $f(\bm{x})$. Given the limited availability of quantum computing hardware, the cost of evaluating $f(\bm{x})$ may also be high.
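The stochastic nature of evaluating (1.3) can be illustrated with a classically simulated toy problem. Everything specific here is an assumption made for illustration: a single-qubit state `psi` with one parameter, the Pauli-Z matrix as $\bm{H}$, and a simulated measurement that returns $\pm 1$. It is not the Quantum Approximate Optimization Algorithm itself.

```python
import numpy as np

# Pauli-Z matrix as a toy Hermitian H (illustrative assumption).
H = np.array([[1.0, 0.0], [0.0, -1.0]])


def psi(x):
    """Hypothetical single-qubit state parameterized by one angle x."""
    return np.array([np.cos(x / 2.0), np.sin(x / 2.0)], dtype=complex)


def exact_objective(x):
    """f(x) = psi(x)^* H psi(x); equals cos(x) for this toy choice."""
    state = psi(x)
    return float(np.real(np.vdot(state, H @ state)))


def sampled_objective(x, shots, rng):
    """Estimate f(x) by averaging simulated +1/-1 measurement outcomes."""
    prob_plus = np.cos(x / 2.0) ** 2  # probability of observing eigenvalue +1
    outcomes = np.where(rng.random(shots) < prob_plus, 1.0, -1.0)
    return float(np.mean(outcomes))


rng = np.random.default_rng(0)
x_test = 0.7
f_exact = exact_objective(x_test)
f_noisy = sampled_objective(x_test, shots=10_000, rng=rng)
```

Only `sampled_objective` mimics what an optimizer would see in practice: its error decays slowly with the number of shots, so reducing the noise is itself expensive.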

Types of DFO Methods

There are many different classes of DFO methods, both for local and global optimization. DFO methods are very common in global optimization (since converging to a stationary point is not sufficient), and the resources [8, 103, 129] provide an overview of different global optimization methods, such as Bayesian optimization, genetic algorithms, branch-and-bound, and Lipschitz optimization (such as DIRECT [89]). Our focus here is on local optimization, where we do aim to find (approximate) stationary points, just like many popular nonlinear optimization methods such as (quasi-)Newton methods. Indeed, global optimization based only on local problem information (such as objective values and/or derivatives) cannot succeed unless an algorithm densely samples the feasible region or some global information is used, such as convexity or global derivative bounds [139]. The most common DFO methods for local optimization are:

Model-based DFO (MBDFO): The focus of this introduction. Here, the goal is to mimic derivative-based optimization methods, substituting local gradient-based approximations such as Taylor series with local models constructed by interpolation.
Direct search methods: Such methods attempt to iteratively improve on a candidate solution by sampling nearby points, but without using any gradient approximations. This definition is broad (see the discussion in [93, Section 1.4]), but typically nearby points are selected from a small number of perturbations of the current iterate. A famous example is the Nelder–Mead method, where a simplex of points is iteratively modified to find improved solutions, but more widely studied are Generating Set Search [93] and Mesh Adaptive Direct Search [5] methods, where the perturbations are chosen in a predictable, structured way.
Finite differencing/implicit filtering: Although finite differencing-based derivative approximations are typically used within standard (derivative-based) optimization algorithms without modification, some works explicitly consider the management of the perturbation size $h$ in (1.1) within the algorithm. If $h$ is large, this is sometimes called implicit filtering. Building gradient approximations by finite differencing-type approximations along randomly generated directions [109] is particularly popular in machine learning [68].

For readers interested in other classes of local DFO, we refer to the books [51, 91, 8] and survey papers [116, 93, 56, 15, 96, 61]. There are different reasons to prefer these classes—for example, direct search methods are much more developed than MBDFO at handling complex problem structures such as nonsmoothness and discrete variables. However, the similarity of MBDFO to widely recognized and successful (derivative-based) optimization algorithms is appealing, and they often perform well in practice, when measured by the total number of objective evaluations required to solve a problem (see below).

In direct search methods, the search–poll paradigm [25] allows for very flexible ‘search steps’, which allow for any procedure that produces potentially good points, to be combined with the rigorous ‘poll step’ from direct search methods. Using MBDFO ideas in the search step is one popular approach in this framework, and it significantly improves the practical performance of direct search methods [55, 48]. Global surrogate models for the objective such as radial basis functions and Gaussian Processes can also enable a search step [25, 10, 13]. Here, a surrogate does not need to accurately approximate the objective, only accurately rank the quality of suggested points [8, 94].

Cheap vs. Expensive Evaluations

An extremely important issue in DFO is whether a given problem has functions (objective, constraints, etc.) that are ‘cheap’ or ‘expensive’ to evaluate. These terms are meant in a (somewhat) relative sense, as different people/contexts have different understandings of cost. (Footnote 4: As above, ‘cost’ could be financial, computational effort or time, for example.) This can vary from seconds (e.g. valve train design [43] and financial model calibration [112]) to hours (e.g. algorithm tuning [14]) to days (e.g. hydrofoil noise reduction [104]).

In the case of expensive evaluations, we typically mean that the cost of evaluating the objective (or other such function) is the dominant cost of the optimization process. If this is true, the effort required to determine the next candidate point to evaluate is essentially negligible, and so little attention need be given to topics such as efficient linear algebra or subproblem implementations, or memory management. Performing a significant amount of work within the algorithm in order to avoid even one extra evaluation is generally considered worthwhile. In this setting, some MBDFO software will store the full history of objective evaluations, which implicitly assumes the number of evaluations is not too large (e.g. the IBCDFO software tries to construct its models primarily by using already-evaluated points; see Section 4.4).

When evaluations are cheap, we need to pay more attention to these other factors impacting the performance of the optimization. However, as in many computational tasks, the level of effort that should be devoted to more efficient algorithm implementations depends on the context; whether solving an optimization problem takes 1 second or 10 seconds on a laptop is often not a critical distinction.

As described above, DFO is particularly important/useful in the expensive setting, but may also be used in the cheap setting if the situation requires it. In line with the expensive regime, numerical comparisons of DFO algorithms/implementations in the research literature are most commonly performed by comparing the number of evaluations required to solve a given problem, rather than other metrics such as number of iterations or runtime [105]. General advice on comparing optimization algorithms can be found in [19].

1.2 Model-Based DFO

In MBDFO, our broad goal is to use the algorithmic structures found in derivative-based algorithms, but replace local, Taylor series-based approximations for the objective (and/or constraints) with local models constructed by interpolation. Hence, an MBDFO algorithm maintains a small set of points in $\mathbb{R}^{n}$ (of which one is usually the current iterate), and iteratively moves this set towards a solution. Although in principle this is a generic framework, in practice the underlying derivative-based algorithms used are trust-region methods [47]. Here, we control the maximum stepsize we are willing to take in any given iteration (balancing accurate approximations with fast convergence), and minimize Taylor-like models in a ball around the current iterate. Trust-region methods are a widely used class of methods for derivative-based optimization, alongside other approaches such as linesearches and regularization methods. In MBDFO, trust-region methods are much more widely used than these alternatives, because we always know in advance a region where our next iterate will be found, and it is much more natural to construct interpolation models when we know the exact region in which they need to be a good approximation.

A simple one-dimensional illustration of MBDFO is given in Figure 1.1. At the start of iteration $k$, we have three interpolation points, $x_{k}$ (illustrated with a large circle) and two others (small circles), which we use to construct a quadratic approximation $m_{k}(x)$ (dashed line) to approximate the true objective $f(x)$ (solid line). In practice, of course, the function $f(x)$ is not fully known to the algorithm and can only be sampled. We minimize $m_{k}(x)$ inside the (shaded) trust region, $B(x_{k},\Delta_{k})$ for some $\Delta_{k}>0$, to get a tentative new iterate $x_{k}+s_{k}$ (illustrated with a star). We compute $f(x_{k}+s_{k})$, and in this case observe that $f(x_{k}+s_{k})<f(x_{k})$, and so we set $x_{k+1}=x_{k}+s_{k}$ as the next iterate. After accepting the step (and setting $\Delta_{k+1}=2\Delta_{k}$ to help speed up convergence), in Figure 1.1(b), we have added $x_{k+1}$ as an interpolation point, and removed the left-most point from the interpolation set (illustrated with a cross), to build a new model $m_{k+1}$, which will subsequently be minimized inside $B(x_{k+1},\Delta_{k+1})$.

[Figure 1.1(a): Initial interpolation model and step calculation.]
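The one-dimensional iteration described above can be sketched in code. This is a simplified illustration under stated assumptions, not the algorithm analyzed later in this article: the test function `example_f`, the rule of discarding the interpolation point furthest from the new iterate, and the halving of the radius after an unsuccessful step are all choices made here for illustration, and the sketch ignores degenerate cases such as a trial point coinciding with an existing interpolation point.

```python
import numpy as np


def quadratic_interpolant(points, values):
    """Coefficients (c0, c1, c2) of m(x) = c0 + c1*x + c2*x^2 through 3 points."""
    vandermonde = np.vander(np.asarray(points, dtype=float), 3, increasing=True)
    return np.linalg.solve(vandermonde, np.asarray(values, dtype=float))


def minimize_in_interval(coeffs, lower, upper):
    """Minimize the quadratic model over [lower, upper] (a 1D trust region)."""
    c0, c1, c2 = coeffs
    candidates = [lower, upper]
    if c2 > 0:  # convex model: the unconstrained minimizer may lie inside
        stationary = -c1 / (2.0 * c2)
        if lower <= stationary <= upper:
            candidates.append(stationary)
    return min(candidates, key=lambda x: c0 + c1 * x + c2 * x ** 2)


def mbdfo_iteration_1d(f, evals, x_k, delta_k):
    """One simplified iteration; evals maps interpolation points to f values."""
    points = list(evals)
    coeffs = quadratic_interpolant(points, [evals[p] for p in points])
    x_trial = minimize_in_interval(coeffs, x_k - delta_k, x_k + delta_k)
    f_trial = f(x_trial)  # the only new objective evaluation in this iteration
    if f_trial < evals[x_k]:
        # Accept: swap out the point furthest from the new iterate, expand radius.
        furthest = max(points, key=lambda p: abs(p - x_trial))
        new_evals = {p: v for p, v in evals.items() if p != furthest}
        new_evals[x_trial] = f_trial
        return x_trial, 2.0 * delta_k, new_evals
    # Reject: keep the iterate and shrink the radius (an assumed, common choice).
    return x_k, 0.5 * delta_k, evals


# Illustrative objective (an assumption, not from the text).
def example_f(x):
    return x ** 4 - 3.0 * x


initial_points = [-0.5, 0.0, 0.5]
initial_evals = {p: example_f(p) for p in initial_points}
x_next, delta_next, next_evals = mbdfo_iteration_1d(
    example_f, initial_evals, x_k=0.5, delta_k=0.5
)
```

In this run the model is minimized at the trust-region boundary, the trial point decreases the objective, the left-most interpolation point is discarded, and the radius doubles, mirroring the narrative of Figure 1.1.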

Of course, in this article we will outline exactly how we select/update interpolation points, construct interpolation models, compute steps, and choose $x_{k+1}$ and $\Delta_{k+1}$ . In particular, in higher dimensions we will need to ensure that the interpolation points are well-spaced, while remaining close to the current iterate, to ensure that a suitable model exists and is a sufficiently good approximation to the objective. Much of our theoretical discussion will be on interpolation set management.

1.3 Scope of this work

The goal of this work is to provide an accessible introduction to MBDFO. The intended audience is graduate students with an interest in nonlinear optimization, and researchers hoping to learn the fundamentals of this topic. Although the end of each section includes a discussion of important references and related works, it is not intended to be a comprehensive survey (see [96] for this), but rather to provide an overview of the fundamental algorithms, approximation techniques, and convergence theory used in the field. It aims to provide a deeper introduction than found in the books [111, 8]. The most similar existing work to this article is the book [51], but we include more recent advances (e.g. complexity analysis of algorithms, approximation theory in constrained regions, stochastic algorithms), and provide a more targeted analysis of core concepts (for example, we do not cover interpolation theory for higher order polynomial models), including some novel proofs of existing results. Some familiarity with nonlinear optimization techniques (e.g. from [111]), especially trust-region methods, would be helpful here but is not essential.

The largest part of our efforts is devoted to:

• Introducing the basic (trust region) MBDFO algorithm and proving first- and second-order convergence and worst-case complexity bounds; and
• Showing how linear and quadratic interpolation models can be constructed for use in MBDFO methods, and proving results quantifying their accuracy. We consider structured interpolation set choices yielding optimal algorithm bounds and their practical extension where points are selected from a database of existing evaluations (Section 4) and geometric approaches based on incremental/minimal updating of interpolation points (Section 5). These two approaches are both theoretically interesting and form the basis of practical implementations.

For both of these topics, we primarily consider the simplest case of unconstrained optimization problems with access to exact objective values. These two topics form the core from which more advanced algorithms can be built. For example, in later sections, we show how these ideas can be extended to handle constrained problems, and problems with inexact/noisy objective evaluations. At the end of each section, we provide a short overview of key references and related works, including more recent research directions.

Our focus here is on the key theoretical underpinnings of MBDFO. We largely avoid detailed discussions of practical considerations for software implementations, providing only occasional notes on this important topic (most notably on termination conditions, Section 3.3) and a list of some notable open-source MBDFO packages in Section 8. In particular, we give minimal attention to the numerical linear algebra required to solve the subproblems that arise in MBDFO (e.g. step calculation and interpolation model building), in line with the regime of expensive evaluations described in Section 1.1. All the methods described here, except for Algorithm 7.2 in Section 7.2, are suitable for the expensive evaluation regime (see Remark 7.17 for a discussion of this point).

For brevity, we also avoid discussion of and comparison with other local DFO methods or global optimization methods, which often avoid using derivative information (see Section 1.1 and references therein for good resources on these methods). Although there are a wide range of techniques for global optimization, we note that Bayesian and surrogate optimization [133, 25, 126, 8] have similar principles to MBDFO, namely iteratively minimizing an easy-to-evaluate approximation to the objective built from evaluations at known points.

Structure

In Section 2 we give some preliminary technical results that we will use throughout, and introduce standard (i.e. derivative-based) trust-region methods for unconstrained optimization, including the general algorithmic framework, step calculation and convergence guarantees. In Section 3, we show how this framework extends to the MBDFO case, outlining what changes to the algorithmic framework are required, the requirements on the interpolation models, and the convergence guarantees. We then introduce the polynomial interpolation theory required to construct the models, and show how they satisfy the requirements from Section 3. In particular, Section 4 considers structured interpolation set construction and Section 5 extends these ideas to allow for incremental interpolation set updating. We then consider two important extensions of this theory: Section 6 covers problems with simple (e.g. bounds) and general nonlinear constraints, and Section 7 covers problems with deterministic and stochastic noise. In our summary, Section 8, we point to a selection of state-of-the-art software implementations of MBDFO.

Notation

Throughout, vectors and matrices will be written in bold (to distinguish from scalars), and functions will be written in bold if they have multiple outputs, e.g. $f:\mathbb{R}^{n}\to\mathbb{R}$ but $\bm{r}:\mathbb{R}^{n}\to\mathbb{R}^{m}$ . The vector norm $\|\bm{x}\|$ refers to the Euclidean 2-norm, and the matrix norm $\|\bm{A}\|$ refers to the operator 2-norm (i.e. largest singular value). Other norms will be indicated explicitly, such as $\|\bm{x}\|_{1}$ or $\|\bm{A}\|_{\infty}$ . We will use $B(\bm{x},r)$ to refer to the closed Euclidean ball centered at $\bm{x}\in\mathbb{R}^{n}$ with radius $r>0$ , that is $B(\bm{x},r):=\{\bm{y}\in\mathbb{R}^{n}:\|\bm{y}-\bm{x}\|\leq r\}$ . The standard coordinate vectors in $\mathbb{R}^{n}$ will be written as $\bm{e}_{1},\ldots,\bm{e}_{n}$ , where $\bm{e}_{i}$ has a 1 in the $i$ -th entry, $\bm{e}$ is the vector of all ones, $\bm{I}$ will be the identity matrix, and $\bm{0}$ will be a vector or matrix of zeros (depending on the context). For a matrix $\bm{A}\in\mathbb{R}^{m\times n}$ , we denote its Moore-Penrose pseudoinverse by $\bm{A}^{\dagger}\in\mathbb{R}^{n\times m}$ . We use $\kappa(\bm{A})$ to denote the (2-norm) condition number of a matrix, $\kappa(\bm{A}):=\|\bm{A}\|\>\|\bm{A}^{\dagger}\|$ .

2 Technical Preliminaries

2.1 Stationarity Conditions and Smoothness Assumptions

The largest focus of this work is to solve the unconstrained nonlinear optimization problem

\displaystyle\min_{\bm{x}\in\mathbb{R}^{n}}\>f(\bm{x}),

(2.1)

where $f:\mathbb{R}^{n}\to\mathbb{R}$ is a smooth (we will mostly assume continuously differentiable) but possibly nonconvex (i.e. not necessarily convex) objective function. (Footnote 5: The function $f$ is convex if $f(t\bm{x}+(1-t)\bm{y})\leq tf(\bm{x})+(1-t)f(\bm{y})$ for all $\bm{x},\bm{y}\in\mathbb{R}^{n}$ and $t\in[0,1]$. An important consequence of convexity is that all local minimizers are global minimizers. If $f$ is known to be convex, (2.1) is typically much easier to solve and there are a plethora of specialized methods for this case [110].) In general, finding a global minimizer of a nonconvex function is very difficult and suffers from the curse of dimensionality [36, Chapter 1.2.6], so we will restrict our consideration to finding local minima for (2.1). (Footnote 6: That is, the cost of finding a solution grows exponentially with the problem dimension $n$.)

Definition 2.1.

A point $\bm{x}^{*}\in\mathbb{R}^{n}$ is a local minimizer of $f:\mathbb{R}^{n}\to\mathbb{R}$ if there exists $\epsilon>0$ such that $f(\bm{x}^{*})\leq f(\bm{x})$ for all $\bm{x}\in B(\bm{x}^{*},\epsilon)$ . The value $f(\bm{x}^{*})$ is called the local minimum of $f$ .

For general problems such as (2.1), we need to specify what information about $f$ is known to any algorithm we develop, as this clearly constrains our available algorithmic choices. In nonlinear optimization, we typically use the oracle model for optimization algorithms: information about $f$ is available only via a given set of oracles, most commonly some/all of:

(a) The zeroth-order oracle for $f$ is the map $\bm{x}\mapsto f(\bm{x})$;
(b) The first-order oracle for $f$ is the map $\bm{x}\mapsto\nabla f(\bm{x})$ (provided $f$ is differentiable); and
(c) The second-order oracle for $f$ is the map $\bm{x}\mapsto\nabla^{2}f(\bm{x})$ (provided $f$ is twice differentiable).
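The oracle model can be mirrored in code by wrapping the objective so that an algorithm only ever sees function values. The class below is a hypothetical illustration (its name and the example objective are assumptions); it also counts evaluations, which is the cost measure most often used when comparing DFO methods.

```python
class ZerothOrderOracle:
    """Expose only the values f(x) of a function, and count the evaluations."""

    def __init__(self, f):
        self._f = f
        self.num_evaluations = 0

    def __call__(self, x):
        self.num_evaluations += 1
        return self._f(x)


# Illustrative objective (an assumption): a simple convex quadratic.
oracle = ZerothOrderOracle(lambda x: sum(xi ** 2 for xi in x))
values = [oracle([1.0, 2.0]), oracle([0.0, 0.0])]
```

An MBDFO algorithm would be handed `oracle` rather than the underlying function, so any gradient or Hessian information would have to be inferred from the returned values.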

When we first explore classical (derivative-based) trust-region methods in Section 2.2, we will assume access to at least the zeroth- and first-order oracles for $f$ , with the second-order oracle often considered optional. When in Section 3 we begin to consider MBDFO algorithms, we will only assume access to a zeroth-order oracle.

Remark 2.2.

For the purposes of the theoretical analysis, we will assume $f$ is differentiable. However, all our algorithms are built assuming we cannot evaluate $\nabla f$ , for one (or more) of the reasons outlined above.

The limited problem information available in the oracle model means we need necessary or sufficient conditions for a point to be a local minimizer which can be checked using the oracles available to us.

Proposition 2.3 (Theorems 2.2–2.4, [111]).

Suppose $\bm{x}^{*}\in\mathbb{R}^{n}$ and $f:\mathbb{R}^{n}\to\mathbb{R}$ is continuously differentiable. Then,

(a) If $\bm{x}^{*}$ is a local minimizer of $f$, then $\nabla f(\bm{x}^{*})=\bm{0}$ (first-order necessary condition);
(b) If $f$ is twice continuously differentiable and $\bm{x}^{*}$ is a local minimizer of $f$, then $\nabla^{2}f(\bm{x}^{*})$ is positive semidefinite (second-order necessary condition); and
(c) If $f$ is twice continuously differentiable, $\nabla f(\bm{x}^{*})=\bm{0}$ and $\nabla^{2}f(\bm{x}^{*})$ is positive definite, then $\bm{x}^{*}$ is a local minimizer of $f$ (second-order sufficient conditions).

In light of Proposition 2.3 (a) and (b), we call points $\bm{x}$ such that $\nabla f(\bm{x})=\bm{0}$ stationary or first-order optimal, and if both $\nabla f(\bm{x})=\bm{0}$ and $\nabla^{2}f(\bm{x})$ is positive semidefinite then we say $\bm{x}$ is second-order optimal.

In practice, our algorithms will be able to find points that approximately satisfy the first- or second-order necessary conditions. In the first-order case, this means finding a point with $\|\nabla f(\bm{x})\|\leq\epsilon$ for some (small) tolerance $\epsilon>0$. Since a second-order optimal point is defined by two necessary conditions $\|\nabla f(\bm{x})\|=0$ and $\lambda_{\min}(\nabla^{2}f(\bm{x}))\geq 0$, our measure of approximate second-order optimality will be $\sigma(\bm{x})\leq\epsilon$, where

\displaystyle\sigma(\bm{x}):=\max(\|\nabla f(\bm{x})\|,\tau(\bm{x})),\qquad\text{with}\qquad\tau(\bm{x}):=\max(-\lambda_{\min}(\nabla^{2}f(\bm{x})),0).

(2.2)

That is, we hope to find a point with $\sigma(\bm{x})\leq\epsilon$ , which implies both $\|\nabla f(\bm{x})\|\leq\epsilon$ and $\tau(\bm{x})\leq\epsilon$ hold, the latter condition being equivalent to $\lambda_{\min}(\nabla^{2}f(\bm{x}))\geq-\epsilon$ . For an algorithm with current iterate $\bm{x}_{k}$ at iteration $k$ , we will denote $\sigma_{k}:=\sigma(\bm{x}_{k})$ and $\tau_{k}:=\tau(\bm{x}_{k})$ .
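As a small illustration of the measure (2.2), the following sketch computes $\sigma$ and $\tau$ from a given gradient and Hessian. The function name and the two example points are assumptions made here. An MBDFO algorithm cannot evaluate these quantities directly, since it has no access to derivatives; they are used in the analysis only.

```python
import numpy as np


def second_order_criticality(grad, hess):
    """Return (sigma, tau) from (2.2), given the gradient and Hessian at a point."""
    lambda_min = np.linalg.eigvalsh(hess)[0]  # eigenvalues come in ascending order
    tau = max(-lambda_min, 0.0)
    sigma = max(np.linalg.norm(grad), tau)
    return sigma, tau


# Saddle point of f(x) = x_1^2 - x_2^2 at the origin (illustrative example).
hess_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
sigma_saddle, tau_saddle = second_order_criticality(np.zeros(2), hess_saddle)

# Minimizer of f(x) = x_1^2 + x_2^2 at the origin.
sigma_min, tau_min = second_order_criticality(np.zeros(2), 2.0 * np.eye(2))
```

The saddle point is stationary but has $\sigma=2$, so it is correctly rejected as a second-order optimal point, while the true minimizer has $\sigma=0$.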

We will consider two different assumptions on our objective $f$ in (2.1), depending on whether we wish to prove convergence to first- or second-order optimal points. Our main discussion will focus on first-order convergence, in which case we will assume the following.

Assumption 2.4.

The function $f:\mathbb{R}^{n}\to\mathbb{R}$ in (2.1) satisfies:

(a) $f$ is continuously differentiable, and $\nabla f$ is Lipschitz continuous with constant $L_{1}$; and
(b) $f$ is bounded below; that is, there exists an $f_{\textnormal{low}}\in\mathbb{R}$ such that $f(\bm{x})\geq f_{\textnormal{low}}$ for all $\bm{x}\in\mathbb{R}^{n}$.

Remark 2.5.

The global Lipschitz continuity of $\nabla f$ in Assumption 2.4 (a) is very strong, excluding simple functions such as $f(x)=x^{4}$ . As in [36], we make this assumption for ease of exposition, but in practice it can be weakened to $\nabla f$ being $L_{1}$ -Lipschitz continuous on any set $\mathcal{L}$ containing all trust regions $\cup_{k}B(\bm{x}_{k},\Delta_{k})\subseteq\mathcal{L}$ . For example, if $f(\bm{x}_{k})\leq f(\bm{x}_{0})$ for all iterates (i.e. our algorithm is monotone) and we have an upper bound on all trust-region radii, $\Delta_{k}\leq\Delta_{\max}$ , then we can take $\mathcal{L}=\cup_{\bm{x}:f(\bm{x})\leq f(\bm{x}_{0})}B(\bm{x},\Delta_{\max})$ . For problems with simple constraints (Section 6.1), we can restrict $\mathcal{L}$ to the feasible region. Moreover, if $\nabla f$ is continuous and $\mathcal{L}$ is bounded, then the existence of a suitable $L_{1}$ is automatic. This issue is discussed more in, for example, [47, Chapter 6.2.1] and [51, Chapter 10.2] for deterministic problems, and [41, Remark 4.2] for stochastic problems.
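To see why $f(x)=x^{4}$ violates Assumption 2.4 (a) globally but satisfies the weakened version on bounded sets, the following short calculation may help. The radius $R>0$ is a symbol introduced here for illustration.

```latex
f'(x)=4x^{3},\qquad
\frac{|f'(x)-f'(0)|}{|x-0|}=4x^{2}\to\infty \quad\text{as } |x|\to\infty,
\qquad\text{but}\qquad
|f'(x)-f'(y)|\leq \Bigl(\max_{|z|\leq R}|f''(z)|\Bigr)|x-y| = 12R^{2}\,|x-y|
\quad\text{for all } x,y\in[-R,R].
```

So no single constant $L_{1}$ works on all of $\mathbb{R}$, while $L_{1}=12R^{2}$ suffices on the bounded set $\mathcal{L}=[-R,R]$.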

The most important consequence of Assumption 2.4 is a bound on the error in a first-order Taylor series for $f$ , specifically:

Lemma 2.6 (Theorem A.8.1, [36]).

Suppose Assumption 2.4 (a) holds. Then

\displaystyle\left|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|\leq\frac{L_{1}}{2}\|\bm{y}-\bm{x}\|^{2},

(2.3)

for all $\bm{x},\bm{y}\in\mathbb{R}^{n}$ .
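For readers who want to see where the constant $L_{1}/2$ comes from, the following is a sketch of the standard argument; the cited reference may organize the proof differently. It writes the Taylor error as an integral of gradient differences along the segment from $\bm{x}$ to $\bm{y}$, and then applies the Cauchy–Schwarz inequality and the Lipschitz continuity of $\nabla f$.

```latex
\begin{aligned}
f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})
&= \int_{0}^{1}\left[\nabla f(\bm{x}+t(\bm{y}-\bm{x}))-\nabla f(\bm{x})\right]^{T}(\bm{y}-\bm{x})\,dt, \\
\left|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|
&\leq \int_{0}^{1}L_{1}t\,\|\bm{y}-\bm{x}\|^{2}\,dt
= \frac{L_{1}}{2}\|\bm{y}-\bm{x}\|^{2}.
\end{aligned}
```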

In the case where we are interested in convergence to second-order optimal points, we will use the following alternative smoothness assumption.

Assumption 2.7.

The function $f:\mathbb{R}^{n}\to\mathbb{R}$ in (2.1) satisfies:

(a) $f$ is twice continuously differentiable, and $\nabla^{2}f$ is Lipschitz continuous with constant $L_{2}$; and
(b) $f$ is bounded below; that is, there exists an $f_{\textnormal{low}}\in\mathbb{R}$ such that $f(\bm{x})\geq f_{\textnormal{low}}$ for all $\bm{x}\in\mathbb{R}^{n}$.

Where Assumption 2.4 (a) allows us to bound the first-order Taylor series error using Lemma 2.6, Assumption 2.7 (a) allows us to bound the second-order Taylor series error via the bound [36, Corollary A.8.4]

\displaystyle\left|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})-\frac{1}{2}(\bm{y}-\bm{x})^{T}\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\right|\leq\frac{L_{2}}{6}\|\bm{y}-\bm{x}\|^{3},\quad\forall\bm{x},\bm{y}\in\mathbb{R}^{n}.

(2.4)

Another consequence of Assumption 2.7 (a) is that $\nabla f$ itself has a Lipschitz continuous derivative, and so a result similar to Lemma 2.6 applies, namely [111, Appendix A]

\displaystyle\|\nabla f(\bm{y})-\nabla f(\bm{x})-\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\|\leq\frac{L_{2}}{2}\|\bm{y}-\bm{x}\|^{2},\qquad\forall\bm{x},\bm{y}\in\mathbb{R}^{n}.

(2.5)

Although the main focus of this work is the solution of unconstrained problems (2.1), we will also consider problems with nonlinear constraints, of the form


$\displaystyle\min_{\bm{x}\in\mathbb{R}^{n}}$	$\displaystyle\>f(\bm{x}),$	(2.6a)
s.t.	$\displaystyle\>c_{i}(\bm{x})=0,\qquad\forall i\in\mathcal{E},$	(2.6b)
	$\displaystyle\>c_{i}(\bm{x})\leq 0,\qquad\forall i\in\mathcal{I},$	(2.6c)

The first-order optimality conditions for (2.6) are the Karush–Kuhn–Tucker (KKT) conditions: defining the Lagrangian as

\displaystyle L(\bm{x},\bm{\lambda}):=f(\bm{x})+\sum_{i\in\mathcal{E}\cup\mathcal{I}}\lambda_{i}c_{i}(\bm{x}),

(2.7)

where $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\lambda}\in\mathbb{R}^{|\mathcal{E}|+|\mathcal{I}|}$ , provided a suitable constraint qualification holds (see e.g. [111, Chapters 12.2 & 12.6]), a necessary condition for $\bm{x}^{*}$ to be a local minimizer of (2.6) is that there exists $\bm{\lambda}^{*}\in\mathbb{R}^{|\mathcal{E}|+|\mathcal{I}|}$ such that


$\displaystyle\nabla_{\bm{x}}L(\bm{x}^{*},\bm{\lambda}^{*})$	$\displaystyle=\bm{0},$	(2.8a)
$\displaystyle c_{i}(\bm{x}^{*})$	$\displaystyle=0,\qquad\forall i\in\mathcal{E},$	(2.8b)
$\displaystyle c_{i}(\bm{x}^{*})$	$\displaystyle\leq 0,\qquad\forall i\in\mathcal{I},$	(2.8c)
$\displaystyle\lambda^{*}_{i}$	$\displaystyle\geq 0,\qquad\forall i\in\mathcal{I},$	(2.8d)
$\displaystyle\lambda^{*}_{i}c_{i}(\bm{x}^{*})$	$\displaystyle=0,\qquad\forall i\in\mathcal{I}.$	(2.8e)

For an overview of introductory ideas from nonlinear optimization and more details on the above, see [111].
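The conditions (2.8) can be checked numerically at a candidate point. The following is a minimal sketch, assuming the gradient of $f$, the constraint values and the constraint gradients are supplied by the caller; the function names `kkt_residuals` and `toy_kkt` and the two-variable toy problem are hypothetical and chosen only for illustration.

```python
import numpy as np

def kkt_residuals(grad_f, c_vals, jac_c, lam, eq_idx, ineq_idx):
    """Worst violation of each condition in (2.8) at a candidate pair (x, lam).

    grad_f: gradient of f at x, shape (n,)
    c_vals: constraint values c_i(x), shape (m,)
    jac_c:  rows are the constraint gradients at x, shape (m, n)
    lam:    Lagrange multipliers, shape (m,)
    """
    grad_f, c_vals, jac_c, lam = (np.asarray(a, dtype=float)
                                  for a in (grad_f, c_vals, jac_c, lam))

    def worst(values):
        # Largest violation, or zero if there are no constraints of this type.
        return max(values, default=0.0)

    return {
        "stationarity": float(np.linalg.norm(grad_f + jac_c.T @ lam)),        # (2.8a)
        "eq_feasibility": worst(abs(c_vals[i]) for i in eq_idx),               # (2.8b)
        "ineq_feasibility": worst(max(c_vals[i], 0.0) for i in ineq_idx),      # (2.8c)
        "dual_feasibility": worst(max(-lam[i], 0.0) for i in ineq_idx),        # (2.8d)
        "complementarity": worst(abs(lam[i] * c_vals[i]) for i in ineq_idx),   # (2.8e)
    }

# Toy problem (an assumption for illustration):
#   min x1^2 + x2^2   s.t.   c(x) = 1 - x1 - x2 <= 0.
def toy_kkt(x, lam):
    x = np.asarray(x, dtype=float)
    return kkt_residuals(grad_f=2.0 * x, c_vals=[1.0 - x[0] - x[1]],
                         jac_c=[[-1.0, -1.0]], lam=lam, eq_idx=[], ineq_idx=[0])

res_opt = toy_kkt([0.5, 0.5], [1.0])   # the minimizer with its multiplier
res_bad = toy_kkt([1.0, 1.0], [1.0])   # feasible, but neither stationary nor complementary
```

At the minimizer every residual vanishes, whereas at the second point the stationarity and complementarity residuals are nonzero even though the point is feasible.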

2.2 Trust-Region Methods

Trust-region methods are a popular class of algorithms for nonconvex optimization which have proven very successful in practice, featuring in the state-of-the-art codes GALAHAD [70] and KNitro [31], as well as in many of the methods from MATLAB’s Optimization Toolbox and SciPy’s optimization library, for example. Other common classes of algorithm are linesearch and regularization methods, which primarily differ in how global convergence is guaranteed (that is, convergence to a stationary point regardless of how far the algorithm starts from a solution, in contrast with local convergence, which assumes a starting point sufficiently close to a solution). All of these classes are iterative, requiring the user to provide a starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and generating a sequence $\bm{x}_{1},\bm{x}_{2},\ldots\in\mathbb{R}^{n}$ which we hope will converge to a minimizer. They are also all based on iterative minimization of local approximations to the objective $f$ , usually Taylor series (or approximations thereof). In trust-region methods, the crucial ingredient that distinguishes them from the other classes is a positive scalar called the trust-region radius, updated at each iteration, that constrains the distance between consecutive iterates.

If we have zeroth- and first-order oracles for the objective $f$ , the key steps in one iteration of a trust-region method for solving (2.1) are:

1.

Build a (usually quadratic) model for $f$ that we expect to be accurate near the current iterate. This typically uses a second-order Taylor series for $f$ based at $\bm{x}_{k}$ , with the Hessian potentially replaced with a quasi-Newton approximation such as the symmetric rank-1 update (see [111, Chapter 6]);
2.

Minimize the model in a neighborhood of the current iterate (the trust region, i.e. the region in which we trust the model to be accurate);
3.

Evaluate $f$ at the minimizer found in the previous step and decide whether or not to make this the new iterate, and update the trust-region radius. We accept a new iterate if it sufficiently decreases the objective, and increase the trust-region radius if we have been making good progress (to enable larger steps, hence faster progress to the minimizer), or decrease if we are not making progress (since our model is more accurate in a smaller neighborhood of the iterate).

This is formalized in Algorithm 2.1. The two most important aspects of Algorithm 2.1 are the model construction and step calculation in (2.9) and (2.10) respectively. Typical values for the parameters might be $\gamma_{\textnormal{dec}}=0.5$ , $\gamma_{\textnormal{inc}}=2$ , $\eta_{U}=0.1$ and $\eta_{S}=0.7$ , but these should not be viewed as prescriptive.

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$. Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$ and acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$.

2: for $k=0,1,2,\ldots$

3:     Build a local quadratic Taylor-like model for the objective,

\displaystyle f(\bm{y})\approx m_{k}(\bm{y}):=f(\bm{x}_{k})+\bm{g}_{k}^{T}(\bm{y}-\bm{x}_{k})+\frac{1}{2}(\bm{y}-\bm{x}_{k})^{T}\bm{H}_{k}(\bm{y}-\bm{x}_{k}),

(2.9)

for some $\bm{g}_{k}\in\mathbb{R}^{n}$ and (symmetric) $\bm{H}_{k}\approx\nabla^{2}f(\bm{x}_{k})\in\mathbb{R}^{n\times n}$.

4:     Solve the trust-region subproblem: set $\bm{s}_{k}$ to be an approximate minimizer of

\displaystyle\min_{\bm{s}\in\mathbb{R}^{n}}m_{k}(\bm{x}_{k}+\bm{s}),\qquad\text{s.t.}\quad\|\bm{s}\|\leq\Delta_{k}.

(2.10)

5:     Evaluate $f(\bm{x}_{k}+\bm{s}_{k})$ and calculate the ratio

\displaystyle\rho_{k}=\frac{\text{actual reduction}}{\text{predicted reduction}}:=\frac{f(\bm{x}_{k})-f(\bm{x}_{k}+\bm{s}_{k})}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})}.

(2.11)

6:     if $\rho_{k}\geq\eta_{S}$ then

7:         (Very successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$.

8:     else if $\eta_{U}\leq\rho_{k}<\eta_{S}$ then

9:         (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\Delta_{k}$.

10:     else

11:         (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$.

12:     end if

13: end for

Algorithm 2.1 Derivative-based trust-region method for solving (2.1).
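Algorithm 2.1 translates almost line by line into code. The following is a minimal sketch, assuming exact gradients and Hessians are supplied as callables and that the subproblem (2.10) is solved only approximately by a single steepest-descent step restricted to the trust region (the Cauchy point introduced later in this section). The function names, the gradient-norm stopping tolerance `tol` and the test quadratic are illustrative assumptions rather than part of the algorithm statement.

```python
import numpy as np

def cauchy_step(g, H, delta):
    """Minimize the quadratic model along -g inside the ball of radius delta."""
    gnorm = np.linalg.norm(g)
    curv = g @ H @ g
    t = delta / gnorm                      # step to the trust-region boundary
    if curv > 0:
        t = min(gnorm**2 / curv, t)        # interior minimizer along -g, if closer
    return -t * g

def trust_region(f, grad, hess, x0, delta0=1.0, gamma_dec=0.5, gamma_inc=2.0,
                 eta_u=0.1, eta_s=0.7, tol=1e-6, max_iter=1000):
    """Sketch of Algorithm 2.1; returns the final iterate and iteration count."""
    x, delta = np.asarray(x0, dtype=float), delta0
    fx = f(x)
    for k in range(max_iter):
        g, H = grad(x), hess(x)            # model (2.9) with exact derivatives
        if np.linalg.norm(g) < tol:        # practical stopping rule (assumption)
            return x, k
        s = cauchy_step(g, H, delta)       # approximate solution of (2.10)
        predicted = -(g @ s + 0.5 * s @ H @ s)
        f_trial = f(x + s)
        rho = (fx - f_trial) / predicted   # ratio (2.11)
        if rho >= eta_s:                   # very successful
            x, fx, delta = x + s, f_trial, gamma_inc * delta
        elif rho >= eta_u:                 # successful
            x, fx = x + s, f_trial
        else:                              # unsuccessful
            delta = gamma_dec * delta
    return x, max_iter

# Illustrative test problem: f(x) = 0.5 x^T A x - b^T x, minimized at A^{-1} b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_min, n_iter = trust_region(lambda x: 0.5 * x @ A @ x - b @ x,
                             lambda x: A @ x - b,
                             lambda x: A,
                             x0=np.zeros(2), delta0=0.1)
```

Because only Cauchy steps are taken, this sketch behaves like steepest descent with a safeguarded step length; a better subproblem solver would be substituted in `cauchy_step` without changing the outer loop.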

The core calculation in Algorithm 2.1 is the (approximate) solution of the trust-region subproblem (2.10). Ignoring the specific choice of model and centering the model at the current iterate, this subproblem has the general form

\displaystyle\min_{\bm{s}\in\mathbb{R}^{n}}m(\bm{s}):=c+\bm{g}^{T}\bm{s}+\frac{1}{2}\bm{s}^{T}\bm{H}\bm{s},\qquad\text{s.t.}\quad\|\bm{s}\|\leq\Delta.

(2.12)

At first glance, solving (2.12) appears daunting: instead of our original problem of minimizing $f$ , now at each iteration we need to minimize a different function, but with constraints!

However, the special structure of (2.12), namely minimizing a (possibly nonconvex) quadratic objective subject to a single Euclidean ball constraint, allows the efficient calculation of exact (global) minimizers via a one-dimensional search over the Lagrange multiplier for the (squared) constraint $\|\bm{s}\|^{2}\leq\Delta^{2}$ .

Implementing an efficient algorithm to do this requires some effort; see [47, Chapter 7.3] or the newer [71] for details. To find a global minimizer, the dimension $n$ must not be too large, as the subproblem solver requires computing Cholesky factorizations of $\bm{H}+\lambda\bm{I}$ for several different values of $\lambda$ , with a cost of $\mathcal{O}(n^{3})$ operations each time.

Approximate Subproblem Solutions

However, we do not necessarily need the global solution to the trust-region subproblem for Algorithm 2.1 to converge. A very simple approximate solution can be found by performing one iteration of gradient descent with exact linesearch. That is, our approximate solution to (2.12) is of the form $\bm{s}(t)=-t\bm{g}$ for $t\geq 0$ . Restricting the objective to this ray, we get $m(\bm{s}(t))=c-t\|\bm{g}\|^{2}+\frac{\bm{g}^{T}\bm{H}\bm{g}}{2}t^{2}$ subject to $0\leq t\leq\Delta/\|\bm{g}\|$ . This is easy to globally minimize in $t$ (if $\bm{g}^{T}\bm{H}\bm{g}>0$ the unconstrained minimizer is $t=\|\bm{g}\|^{2}/(\bm{g}^{T}\bm{H}\bm{g})$ , otherwise the model decreases along the whole ray), yielding the so-called Cauchy point

\displaystyle\bm{s}_{C}:=-t_{C}\bm{g},\qquad\text{where}\qquad t_{C}:=\begin{cases}\min\left(\frac{\|\bm{g}\|^{2}}{\bm{g}^{T}\bm{H}\bm{g}},\frac{\Delta}{\|\bm{g}\|}\right),&\bm{g}^{T}\bm{H}\bm{g}>0,\\ \frac{\Delta}{\|\bm{g}\|},&\bm{g}^{T}\bm{H}\bm{g}\leq 0.\end{cases}

(2.13)

By direct computation, we can get a lower bound on how much the Cauchy point decreases the quadratic model.

Lemma 2.8 (Theorem 6.3.1, [47]).

The Cauchy point (2.13) for the subproblem (2.12) satisfies

\displaystyle m(\bm{0})-m(\bm{s}_{C})\geq\frac{1}{2}\|\bm{g}\|\min\left(\Delta,\frac{\|\bm{g}\|}{\|\bm{H}\|+1}\right).

(2.14)
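The decrease bound (2.14) is easy to test empirically. The following is a minimal sketch, assuming randomly generated symmetric (generally indefinite) matrices and random radii; the helper names and the sampling ranges are arbitrary choices for illustration, and a numerical check of this kind is not a substitute for the proof.

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Cauchy point (2.13) for the model c + g^T s + 0.5 s^T H s, ||s|| <= delta."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    curv = g @ H @ g
    t_max = delta / gnorm
    t_c = min(gnorm**2 / curv, t_max) if curv > 0 else t_max
    return -t_c * g

def model_decrease(g, H, s):
    """m(0) - m(s) for the quadratic model in (2.12)."""
    return -(g @ s + 0.5 * s @ H @ s)

def cauchy_bound(g, H, delta):
    """Right-hand side of (2.14), using the spectral norm of H."""
    gnorm = np.linalg.norm(g)
    return 0.5 * gnorm * min(delta, gnorm / (np.linalg.norm(H, 2) + 1.0))

rng = np.random.default_rng(0)
worst_gap = np.inf
for _ in range(200):
    n = 5
    B = rng.standard_normal((n, n))
    H = 0.5 * (B + B.T)                    # symmetric, usually indefinite
    g = rng.standard_normal(n)
    delta = 10.0 ** rng.uniform(-2, 1)     # radii between 0.01 and 10
    gap = model_decrease(g, H, cauchy_point(g, H, delta)) - cauchy_bound(g, H, delta)
    worst_gap = min(worst_gap, gap)
```

If (2.14) holds, `worst_gap` is nonnegative for every sample, including those with negative curvature along the gradient.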

It turns out that (2.14) is in fact sufficient to achieve first-order convergence of Algorithm 2.1. Hence, after re-introducing $\bm{x}_{k}$ into the model as per (2.9), we make the following assumption regarding (2.10).

Assumption 2.9.

The computed step $\bm{s}_{k}$ in (2.10) satisfies $\|\bm{s}_{k}\|\leq\Delta_{k}$ and

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\|\bm{g}_{k}\|\min\left(\Delta_{k},\frac{\|\bm{g}_{k}\|}{\|\bm{H}_{k}\|+1}\right),

(2.15)

for some $\kappa_{s}\in(0,\frac{1}{2})$ .

If $n$ is large and computation of the global minimizer of (2.12) is impractical, there are still alternative methods for improving on the Cauchy point. The most common is an adaptation of the conjugate gradient (CG) method for solving symmetric positive definite linear systems [111, Chapter 5.1]. In this method, sometimes called the Steihaug-Toint method, we use CG to solve $\bm{H}\bm{s}=-\bm{g}$ starting from initial iterate $\bm{s}_{0}=\bm{0}$ . If any iteration would take us outside the feasible region, we truncate the step so we remain on the boundary of the feasible region and terminate. Similarly, if, at any iteration, we compute a search direction $\bm{d}$ such that $\bm{d}^{T}\bm{H}\bm{d}\leq 0$ , then $\bm{H}$ is not positive definite, and we move from our current CG iterate in the direction $\bm{d}$ until we reach the boundary of the feasible region. This method has the advantage of only requiring $\bm{H}$ via Hessian-vector products, and so is suitable for large-scale problems, and the first iterate is always the Cauchy point, so Assumption 2.9 is guaranteed. More details of this and other approximate subproblem solvers can be found in [47, Chapter 7.5].
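The truncated CG procedure described above can be sketched as follows. This is a minimal illustration assuming a dense NumPy matrix for $\bm{H}$ (in a large-scale code only Hessian-vector products would be needed); the function names, the residual tolerance `tol` and the iteration cap are assumptions made for this example.

```python
import numpy as np

def _step_to_boundary(s, d, delta):
    """Largest tau >= 0 with ||s + tau * d|| = delta, assuming ||s|| <= delta."""
    a, b, c = d @ d, 2.0 * (s @ d), s @ s - delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def steihaug_cg(g, H, delta, tol=1e-10, max_iter=None):
    """Approximately minimize g^T s + 0.5 s^T H s subject to ||s|| <= delta."""
    n = g.size
    max_iter = 2 * n if max_iter is None else max_iter
    s = np.zeros(n)
    r = g.astype(float)                    # residual H s + g at s = 0
    if np.linalg.norm(r) <= tol:
        return s
    d = -r
    for _ in range(max_iter):
        Hd = H @ d
        curv = d @ Hd
        if curv <= 0:                      # nonpositive curvature: go to the boundary
            return s + _step_to_boundary(s, d, delta) * d
        alpha = (r @ r) / curv
        s_next = s + alpha * d
        if np.linalg.norm(s_next) >= delta:    # CG step leaves the trust region
            return s + _step_to_boundary(s, d, delta) * d
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) <= tol:
            return s_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        s, r = s_next, r_next
    return s

g = np.array([1.0, 2.0])
H_pd = np.array([[4.0, 1.0], [1.0, 3.0]])
s_newton = steihaug_cg(g, H_pd, delta=10.0)            # interior: solves H s = -g
s_trunc = steihaug_cg(g, H_pd, delta=0.1)              # truncated on the boundary
s_neg = steihaug_cg(np.array([1.0, 1.0]), np.diag([1.0, -2.0]), delta=1.0)
```

In the second and third calls the method stops at its first iteration, so the returned step coincides with the Cauchy point, consistent with the remark that Assumption 2.9 is inherited automatically.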

To achieve second-order convergence of Algorithm 2.1, we first note that the quadratic model $m_{k}$ (2.9) has its own second-order optimality measure, namely

\displaystyle\sigma^{m}_{k}:=\max(\|\bm{g}_{k}\|,\tau^{m}_{k}),\qquad\text{with}\qquad\tau^{m}_{k}:=\max(-\lambda_{\min}(\bm{H}_{k}),0),

(2.16)

although here we will assume $\bm{g}_{k}=\nabla f(\bm{x}_{k})$ and $\bm{H}_{k}=\nabla^{2}f(\bm{x}_{k})$ , and so $\sigma^{m}_{k}=\sigma_{k}$ and $\tau^{m}_{k}=\tau_{k}$ . Here, we need a stronger assumption on our trust-region subproblem solution, to handle the case where $\bm{H}_{k}$ is not positive semidefinite (for example, if $\bm{g}_{k}=\bm{0}$ and $\lambda_{\min}(\bm{H}_{k})<0$ then $\bm{s}_{k}=\bm{0}$ satisfies Assumption 2.9). Suppose that $\tau^{m}_{k}>0$ (i.e. $\lambda_{\min}(\bm{H}_{k})<0$ ), and let $\bm{u}_{k}$ be a normalized eigenvector corresponding to that eigenvalue, which is also a first-order descent direction; that is,

\displaystyle\bm{H}_{k}\bm{u}_{k}=\lambda_{\min}(\bm{H}_{k})\bm{u}_{k},\qquad\|\bm{u}_{k}\|=1,\qquad\text{and}\qquad\bm{u}_{k}^{T}\bm{g}_{k}\leq 0.

(2.17)

The last condition may be achieved by judicious sign choice in the eigenvector calculation. Similar to the definition of the Cauchy point (2.13), the eigenstep $\bm{s}_{k}^{E}$ is the point in the direction $\bm{u}_{k}$ that minimizes the model $m_{k}$ (subject to $\|\bm{s}_{k}^{E}\|\leq\Delta_{k}$ ). Since we have

\displaystyle m_{k}(\bm{x}_{k}+t\bm{u}_{k})=f(\bm{x}_{k})+t\bm{u}_{k}^{T}\bm{g}_{k}+\frac{1}{2}\bm{u}_{k}^{T}\bm{H}_{k}\bm{u}_{k}t^{2}=f(\bm{x}_{k})+t\bm{u}_{k}^{T}\bm{g}_{k}+\frac{1}{2}\lambda_{\min}(\bm{H}_{k})t^{2},

(2.18)

we conclude that $m_{k}(\bm{x}_{k}+t\bm{u}_{k})$ is decreasing as $t\geq 0$ increases, so we have $\bm{s}_{k}^{E}=\Delta_{k}\bm{u}_{k}$ since $\|\bm{u}_{k}\|=1$ . We then compute

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k}^{E})=-\Delta_{k}\bm{u}_{k}^{T}\bm{g}_{k}-\frac{1}{2}\lambda_{\min}(\bm{H}_{k})\Delta_{k}^{2}\geq-\frac{1}{2}\lambda_{\min}(\bm{H}_{k})\Delta_{k}^{2}\geq 0.

(2.19)

This motivates the following assumption on our step calculation for second-order convergence, where we now require our step to be at least as good as the Cauchy step and, if $\tau^{m}_{k}>0$ (i.e. $\lambda_{\min}(\bm{H}_{k})<0$ ), the eigenstep.

Assumption 2.10.

The computed step $\bm{s}_{k}$ in (2.10) satisfies $\|\bm{s}_{k}\|\leq\Delta_{k}$ and

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\max\left(\|\bm{g}_{k}\|\min\left(\Delta_{k},\frac{\|\bm{g}_{k}\|}{\|\bm{H}_{k}\|+1}\right),\tau^{m}_{k}\Delta_{k}^{2}\right),

(2.20)

for some $\kappa_{s}\in(0,\frac{1}{2})$ , where $\tau^{m}_{k}$ is defined in (2.16).
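The eigenstep can be computed from a symmetric eigendecomposition. The following is a minimal sketch, assuming a small dense $\bm{H}_{k}$ so that `numpy.linalg.eigh` is affordable; the function names and the returned `None` when there is no negative curvature are conventions adopted only for this illustration.

```python
import numpy as np

def eigenstep(g, H, delta):
    """Eigenstep delta * u, where u is a unit eigenvector for lambda_min(H)
    with u^T g <= 0 as in (2.17). Returns None if lambda_min(H) >= 0."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    lam_min, u = eigvals[0], eigvecs[:, 0]
    if lam_min >= 0:
        return None                        # tau^m_k = 0: no eigenstep required
    if u @ g > 0:
        u = -u                             # sign choice giving a descent direction
    return delta * u

def model_decrease(g, H, s):
    """m_k(x_k) - m_k(x_k + s) for the quadratic model (2.9)."""
    return -(g @ s + 0.5 * s @ H @ s)

H = np.diag([1.0, -2.0])
delta = 0.5
tau = max(-np.linalg.eigvalsh(H)[0], 0.0)              # tau^m_k from (2.16)

# With g = 0 the Cauchy step is zero, but the eigenstep still decreases the model.
g_zero = np.zeros(2)
dec_zero = model_decrease(g_zero, H, eigenstep(g_zero, H, delta))

# With a nonzero gradient the decrease is at least as large, as in (2.19).
g_one = np.array([1.0, 1.0])
s_one = eigenstep(g_one, H, delta)
dec_one = model_decrease(g_one, H, s_one)
```

In the first case the decrease equals $\frac{1}{2}\tau^{m}_{k}\Delta_{k}^{2}$ exactly, which is the situation that the second term inside the maximum in (2.20) is designed to capture.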

2.3 Convergence of Trust-Region Methods

We conclude this section by summarizing the convergence of Algorithm 2.1 to first- and second-order optimal points. Our results will also show the worst-case complexity of Algorithm 2.1, which essentially is a rate of convergence (i.e. how many iterations are needed to achieve $\|\nabla f(\bm{x}_{k})\|<\epsilon$ or $\sigma_{k}<\epsilon$ ?). We assume the following about the quadratic model (2.9).

Assumption 2.11.

At each iteration $k$ of Algorithm 2.1, the model $m_{k}$ (2.9) satisfies:

(a)

$\bm{g}_{k}=\nabla f(\bm{x}_{k})$ ;
(b)

$\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some $\kappa_{H}\geq 1$ (independent of $k$ ). This will be more notationally convenient than the equivalent but more natural $\|\bm{H}_{k}\|\leq\kappa_{H}$ , for some $\kappa_{H}\geq 0$ .

Our main first-order convergence result is the following.

Theorem 2.12 (Theorem 2.3.7, [36]).

Suppose Assumptions 2.4, 2.9 and 2.11 hold and we run Algorithm 2.1. If $k_{\epsilon}$ is the first iteration of Algorithm 2.1 such that $\|\nabla f(\bm{x}_{k_{\epsilon}})\|<\epsilon$ , then $k_{\epsilon}=\mathcal{O}((L_{1}+\kappa_{H})\epsilon^{-2})$ . Hence $\liminf_{k\to\infty}\|\nabla f(\bm{x}_{k})\|=0$ .

This tells us that there is a subsequence of iterations whose gradients converge to zero, and that we get $\|\nabla f(\bm{x}_{k})\|<\epsilon$ for the first time after at most $\mathcal{O}(\epsilon^{-2})$ iterations. In practice, we typically terminate Algorithm 2.1 when $\|\bm{g}_{k}\|=\|\nabla f(\bm{x}_{k})\|$ is sufficiently small or we exceed some computational budget (e.g. maximum number of iterations).

For Algorithm 2.1 to converge to second-order optimal solutions, we have to strengthen Assumption 2.11 to also require $\bm{H}_{k}=\nabla^{2}f(\bm{x}_{k})$ .

Theorem 2.13 (Theorem 3.2.6, [36]).

Suppose Assumptions 2.7, 2.10 and 2.11 hold, and $\bm{H}_{k}=\nabla^{2}f(\bm{x}_{k})$ for all $k$ , and we run Algorithm 2.1. If $k_{\epsilon}$ is the first iteration of Algorithm 2.1 such that $\sigma_{k_{\epsilon}}\leq\epsilon$ , then $k_{\epsilon}=\mathcal{O}((L_{2}+\kappa_{H})^{2}\epsilon^{-3})$ . Hence $\liminf_{k\to\infty}\sigma_{k}=0$ .

In summary, if we want to achieve optimality level $\epsilon$ , then trust-region methods require $\mathcal{O}(\epsilon^{-2})$ iterations to achieve first-order optimality (this follows from Theorem 2.12, but the same result holds if we assume exact second-order models, $\bm{H}_{k}=\nabla^{2}f(\bm{x}_{k})$ [36, Theorem 3.2.1]) and $\mathcal{O}(\epsilon^{-3})$ iterations to achieve second-order optimality, assuming sufficient problem smoothness, model accuracy and subproblem solution quality.

Notes and References

The material in this section is based on the widely used books [47, 111] and the more recent book on complexity theory [36]. A more recent survey of trust-region methods is [154].

Although Theorem 2.13 considers the case where we have the same desired accuracy level $\epsilon$ for both first- and second-order optimality, the proof of Theorem 2.13 gives a complexity of $\mathcal{O}(\max(\epsilon_{1}^{-2}\epsilon_{2}^{-1},\epsilon_{2}^{-3}))$ as $\epsilon_{1},\epsilon_{2}\to 0^{+}$ if we want $\|\nabla f(\bm{x}_{k})\|\leq\epsilon_{1}$ and $\tau_{k}\leq\epsilon_{2}$ . With minor algorithmic changes we can get the more natural complexity bound $\mathcal{O}(\max(\epsilon_{1}^{-2},\epsilon_{2}^{-3}))$ [74].

3 Derivative-Free Trust-Region Methods

We now consider how to adapt the generic trust-region method from Section 2.2 to the case where we only have a zeroth order oracle for the objective. That is, we are solving (2.1), where $f$ is still differentiable (e.g. Assumption 2.4), but where we do not have access to $\nabla f$ in our algorithm.

As before, at each iteration we will construct a local quadratic model around $\bm{x}_{k}$ , namely

\displaystyle f(\bm{y})\approx m_{k}(\bm{y}):=c_{k}+\bm{g}_{k}^{T}(\bm{y}-\bm{x}_{k})+\frac{1}{2}(\bm{y}-\bm{x}_{k})^{T}\bm{H}_{k}(\bm{y}-\bm{x}_{k}),

(3.1)

noting that, unlike (2.9), we do not require $m_{k}(\bm{x}_{k})=f(\bm{x}_{k})$ (i.e. $c_{k}\neq f(\bm{x}_{k})$ is allowed). For now, we will not define a specific way to construct the local quadratic model (3.1) using only function values. Instead, we will focus on simply defining some practical assumptions on the model accuracy (i.e. replacing Assumption 2.11). In Sections 4 and 5 we will outline concrete ways our model accuracy requirements can be satisfied.

Motivation: finite difference gradients

In the simplest case, imagine we do not have access to $\nabla f$ , and simply build $\bm{g}_{k}$ in (3.1) from (forward) finite differences:

\displaystyle[\bm{g}_{k}]_{i}:=\frac{f(\bm{x}_{k}+h\bm{e}_{i})-f(\bm{x}_{k})}{h},\qquad\forall i=1,\ldots,n,

(3.2)

for some small $h>0$ , where $\bm{e}_{i}\in\mathbb{R}^{n}$ is the $i$ -th coordinate vector. From standard analysis of finite differencing (e.g. [111, Chapter 8.1]), if $f$ is twice continuously differentiable with $\|\nabla^{2}f(\bm{x})\|\leq M$ for all $\bm{x}$ in $B(\bm{x}_{k},h)$ , then $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|_{\infty}\leq\frac{Mh}{2}$ . That is, the gradient error is of size $\mathcal{O}(h)$ as $h\to 0^{+}$ . Although derivative-based trust-region methods can handle inexact gradients [47, Chapter 8.4], they require a bound on the relative gradient error, whereas finite differencing gives an absolute error in the model gradient. We can only control the model error by choosing the value of $h$ , perhaps differently at each iteration. It is not obvious how we could pick $h$ if we instead wanted to control the relative gradient error, at least not without already having a good estimate of $\|\bm{g}_{k}\|$ and $M$ (i.e. unless we already have the derivative information we are trying to calculate!).

Now, suppose we build a quadratic model (3.1) using this $\bm{g}_{k}$ , with $c_{k}=f(\bm{x}_{k})$ and any bounded Hessian $\|\bm{H}_{k}\|\leq\kappa_{H}-1$ (cf. Assumption 2.11 (b)). Given $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|_{\infty}\leq\frac{Mh}{2}$ , and hence $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\frac{Mh\sqrt{n}}{2}$ , the error in the model (3.1) at a point $\bm{y}\in\mathbb{R}^{n}$ is

	$\displaystyle\left|f(\bm{y})-f(\bm{x}_{k})-\bm{g}_{k}^{T}(\bm{y}-\bm{x}_{k})-\frac{1}{2}(\bm{y}-\bm{x}_{k})^{T}\bm{H}_{k}(\bm{y}-\bm{x}_{k})\right|$
	$\displaystyle\qquad\leq\left|f(\bm{y})-f(\bm{x}_{k})-\nabla f(\bm{x}_{k})^{T}(\bm{y}-\bm{x}_{k})\right|+\left|(\nabla f(\bm{x}_{k})-\bm{g}_{k})^{T}(\bm{y}-\bm{x}_{k})\right|+\frac{\kappa_{H}}{2}\|\bm{y}-\bm{x}_{k}\|^{2},$		(3.3)
	$\displaystyle\qquad\leq\frac{L_{1}+\kappa_{H}}{2}\|\bm{y}-\bm{x}_{k}\|^{2}+\frac{Mh\sqrt{n}}{2}\|\bm{y}-\bm{x}_{k}\|.$		(3.4)

So, if we want to approximate the objective within the trust region, $\bm{y}\in B(\bm{x}_{k},\Delta_{k})$ , it makes sense to balance both error terms and set $h=\mathcal{O}(\Delta_{k})$ , leading to a model error of size $\mathcal{O}(\Delta_{k}^{2})$ .
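The forward-difference gradient (3.2) and its $\mathcal{O}(h)$ error can be illustrated with a few lines of code. This is a minimal sketch; the function name `fd_gradient` and the test function $f(\bm{x})=\sin(x_{1})+\cos(x_{2})$ are assumptions chosen because its Hessian norm is bounded by $M=1$ everywhere, so the bound $\frac{Mh}{2}$ can be checked directly.

```python
import numpy as np

def fd_gradient(f, x, h):
    """Forward finite-difference gradient (3.2): n + 1 evaluations of f."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    g = np.empty_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                       # i-th coordinate vector
        g[i] = (f(x + h * e_i) - fx) / h
    return g

# Test function (assumption): Hessian is diag(-sin(x1), -cos(x2)), so M = 1.
f = lambda x: np.sin(x[0]) + np.cos(x[1])
true_grad = lambda x: np.array([np.cos(x[0]), -np.sin(x[1])])

x = np.array([0.3, 1.2])
errors = {}
for h in (1e-1, 1e-2, 1e-3):
    errors[h] = np.max(np.abs(fd_gradient(f, x, h) - true_grad(x)))
```

Each tenfold reduction in `h` reduces the error by roughly a factor of ten, so tying `h` to the trust-region radius gives a gradient error proportional to $\Delta_{k}$.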

Fully linear models

In our general MBDFO approach, we will use the trust-region radius $\Delta_{k}$ to control the size of the model error (e.g. at iteration $k$ , use $h=\Delta_{k}$ in (3.2), as suggested above). This suggests to us the following notion of model accuracy.

Definition 3.1.

Suppose we have $\bm{x}\in\mathbb{R}^{n}$ and $\Delta>0$ . A local model $m:\mathbb{R}^{n}\to\mathbb{R}$ approximating $f:\mathbb{R}^{n}\to\mathbb{R}$ is fully linear in $B(\bm{x},\Delta)$ if there exist constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$ , independent of $m$ , $\bm{x}$ and $\Delta$ , such that


$\displaystyle|m(\bm{y})-f(\bm{y})|$	$\displaystyle\leq\kappa_{\textnormal{mf}}\Delta^{2},$	(3.5a)
$\displaystyle\|\nabla m(\bm{y})-\nabla f(\bm{y})\|$	$\displaystyle\leq\kappa_{\textnormal{mg}}\Delta,$	(3.5b)

for all $\bm{y}\in B(\bm{x},\Delta)$ . Sometimes we will use $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}})$ for notational convenience.

Remark 3.2.

‘Fully linear’ here refers not to the model being linear, but the model being as accurate an approximation (up to constants) as a linear Taylor series. Quadratic models as in (3.1) can be fully linear. However, in Sections 4 and 5 we will sometimes consider the case where $m_{k}$ (3.1) is indeed linear (i.e. $\bm{H}_{k}=\bm{0}$ ). For simplicity, we will generically refer to $m_{k}$ as a ‘quadratic model’ even if $\bm{H}_{k}=\bm{0}$ , unless we specifically wish to draw attention to $m_{k}$ being linear.

If we do indeed have access to a first-order oracle (i.e. not in the MBDFO case), then we can satisfy Definition 3.1 using the approaches from Section 2.2. For example, if Assumption 2.4 (a) holds (and so we have Lemma 2.6):

•

The first-order Taylor series $m(\bm{y})=f(\bm{x})+\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})$ is fully linear in $B(\bm{x},\Delta)$ with $\kappa_{\textnormal{mf}}=\frac{1}{2}L_{1}$ and $\kappa_{\textnormal{mg}}=L_{1}$ .
•

More generally, any quadratic model satisfying Assumption 2.11 is fully linear in $B(\bm{x},\Delta)$ with $\kappa_{\textnormal{mf}}=\frac{1}{2}(L_{1}+\kappa_{H})$ and $\kappa_{\textnormal{mg}}=L_{1}+\kappa_{H}$ .

In Sections 4 and 5 we will show how models satisfying Definition 3.1 may be constructed using only function values.

Assumption 3.3.

At each iteration $k$ of Algorithm 3.1, the model $m_{k}$ (3.1) satisfies:

(a)

$m_{k}$ is fully linear in $B(\bm{x}_{k},\Delta_{k})$ with constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$ independent of $k$ ;
(b)

$\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some fixed $\kappa_{H}\geq 1$ (independent of $k$ ).

It is clear from (3.5) that we have to contend with absolute model errors, controlled by the size of $\Delta$ (which will be set to the trust-region radius in our algorithm). This introduces several difficulties that are not present in the derivative-based case:

•

$\Delta_{k}$ now performs two roles: it controls the size of the tentative step ( $\|\bm{s}_{k}\|\leq\Delta_{k}$ in (2.10)) and the size of the model error (3.5). In some algorithms (e.g. [118]), two trust-region radii are used to partially decouple these roles.
•

If our model suggests we are close to a first-order solution (i.e. $\|\bm{g}_{k}\|\approx 0$ ), this does not mean that we are actually near a solution (i.e. $\|\nabla f(\bm{x}_{k})\|\approx 0$ ). By comparison, if we have relative errors in the model gradient, then $\|\bm{g}_{k}\|\approx 0$ if and only if $\|\nabla f(\bm{x}_{k})\|\approx 0$ [47, Lemma 8.4.1].
•

As a consequence of the above point, it is no longer clear when to terminate the algorithm in practice. If we had gradients (or relative gradient errors), then terminating when $\|\bm{g}_{k}\|$ is sufficiently small guarantees we are close to a (first-order) solution. We specifically discuss termination in Section 3.3.

Unfortunately, this additional complexity is a consequence of having a weaker notion of model accuracy, one which can be practically achieved using only function evaluations.

3.1 First-Order Convergence

We are now ready to state our MBDFO variant of Algorithm 2.1, where our Taylor-based models (i.e. (2.9) satisfying Assumption 2.11) are replaced with fully linear models (i.e. (3.1) satisfying Assumption 3.3). It is almost identical to the derivative-based version Algorithm 2.1, including allowing inexact subproblem solutions (satisfying the Cauchy decrease condition, Assumption 2.9).

Aside from our new model accuracy condition (Assumption 3.3 instead of Assumption 2.11), the only difference is that we need to explicitly check our criticality measure (that is, our measure of optimality, in this case $\|\bm{g}_{k}\|$ for first-order convergence), and require $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ for a step to be declared (very) successful. This is an important feature of MBDFO methods, and aims to address the decoupling of our measured distance to a solution $\|\bm{g}_{k}\|$ from our true distance to a solution $\|\nabla f(\bm{x}_{k})\|$ . This mechanism ensures that, for successful iterations, if $\|\nabla f(\bm{x}_{k})\|$ is large (i.e. we are far from first-order optimality) then so is $\|\bm{g}_{k}\|$ (i.e. the model knows we are far from optimality); see (3.24) below.

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$. Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$, acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$, and criticality threshold $\mu_{c}>0$.

2: for $k=0,1,2,\ldots$

3:     Build a local quadratic model $m_{k}$ (3.1) satisfying Assumption 3.3.

4:     Solve the trust-region subproblem (2.10) to get a step $\bm{s}_{k}$ satisfying Assumption 2.9.

5:     Evaluate $f(\bm{x}_{k}+\bm{s}_{k})$ and calculate the ratio $\rho_{k}$ (2.11).

6:     if $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ then

7:         (Very successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$.

8:     else if $\eta_{U}\leq\rho_{k}<\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ then

9:         (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\Delta_{k}$.

10:     else

11:         (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$.

12:     end if

13: end for

Algorithm 3.1 Simple MBDFO trust-region method for solving (2.1).

Remark 3.4.

If $\|\bm{g}_{k}\|<\mu_{c}\Delta_{k}$ at any iteration, then that iteration must be unsuccessful regardless of the value of $\rho_{k}$ . So, if this check is performed as soon as the model is constructed, we may save our efforts by not solving the trust-region subproblem or evaluating $f(\bm{x}_{k}+\bm{s}_{k})$ .
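Algorithm 3.1, including the early criticality check from Remark 3.4, can be sketched using only function values. The model below is one simple choice among many and is an assumption of this example: a linear model ( $\bm{H}_{k}=\bm{0}$ ) with $c_{k}=f(\bm{x}_{k})$ and $\bm{g}_{k}$ given by forward differences (3.2) with $h=\Delta_{k}$ . Stopping once $\Delta_{k}$ falls below a tolerance `delta_tol` is likewise an assumption made here to obtain a finite loop; the function names and the test objective are hypothetical.

```python
import numpy as np

def fd_gradient(f, x, fx, h):
    """Forward-difference gradient (3.2), reusing the known value fx = f(x)."""
    g = np.empty_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - fx) / h
    return g

def dfo_trust_region(f, x0, delta0=1.0, gamma_dec=0.5, gamma_inc=2.0,
                     eta_u=0.1, eta_s=0.7, mu_c=1.0, delta_tol=1e-6,
                     max_iter=1000):
    """Sketch of Algorithm 3.1 with a linear finite-difference model."""
    x = np.asarray(x0, dtype=float)
    fx, delta = f(x), delta0
    for _ in range(max_iter):
        if delta < delta_tol:              # stopping rule (assumption)
            break
        g = fd_gradient(f, x, fx, h=delta) # model gradient with h = Delta_k
        gnorm = np.linalg.norm(g)
        if gnorm < mu_c * delta:           # Remark 3.4: unsuccessful, skip the step
            delta *= gamma_dec
            continue
        s = -(delta / gnorm) * g           # exact minimizer of a linear model
        predicted = delta * gnorm          # m_k(x_k) - m_k(x_k + s_k)
        f_trial = f(x + s)
        rho = (fx - f_trial) / predicted   # ratio (2.11)
        if rho >= eta_s:                   # very successful
            x, fx, delta = x + s, f_trial, gamma_inc * delta
        elif rho >= eta_u:                 # successful
            x, fx = x + s, f_trial
        else:                              # unsuccessful
            delta *= gamma_dec
    return x, fx, delta

# Illustrative smooth objective with minimizer (1, -0.5); no derivatives are used.
objective = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2
x_best, f_best, delta_final = dfo_trust_region(objective, x0=[0.0, 0.0])
```

Every step of this sketch has length exactly $\Delta_{k}$ , so the radius must shrink for the iterates to settle near the minimizer, which illustrates the dual role of $\Delta_{k}$ discussed above.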

We now present our analysis of Algorithm 3.1.

Lemma 3.5.

Suppose Assumptions 2.4 (a), 2.9 and 3.3 hold. If, on iteration $k$ of Algorithm 3.1, $\bm{g}_{k}\neq\bm{0}$ and

\displaystyle\Delta_{k}\leq\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|,

(3.6)

then $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ (i.e. iteration $k$ is very successful).

Proof.

That $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ follows immediately by assumption on $\Delta_{k}$ . It remains to show $\rho_{k}\geq\eta_{S}$ . From (3.6) and Assumption 3.3 (b) we have $\Delta_{k}\leq\frac{\|\bm{g}_{k}\|}{\kappa_{H}}\leq\frac{\|\bm{g}_{k}\|}{\|\bm{H}_{k}\|+1}$ , and so Assumption 2.9 gives

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}.

(3.7)

Then using Assumption 3.3 we can compute

$\displaystyle|\rho_{k}-1|$	$\displaystyle\leq\frac{|f(\bm{x}_{k}+\bm{s}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|+|f(\bm{x}_{k})-m_{k}(\bm{x}_{k})|}{|m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|},$	(3.8)
	$\displaystyle\leq\frac{2\kappa_{\textnormal{mf}}\Delta_{k}^{2}}{\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}},$	(3.9)
	$\displaystyle=\frac{2\kappa_{\textnormal{mf}}\Delta_{k}}{\kappa_{s}\|\bm{g}_{k}\|},$	(3.10)

and so $|\rho_{k}-1|\leq 1-\eta_{S}$ by assumption on $\Delta_{k}$ , which implies $\rho_{k}\geq\eta_{S}$ as required. ∎

The crucial consequence of Lemma 3.5 is that $\Delta_{k}$ remains large provided we have not yet converged.

Lemma 3.6.

Suppose Assumptions 2.4 (a), 2.9 and 3.3 hold and we run Algorithm 3.1. If $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ for all $k=0,\ldots,K-1$ , then

\displaystyle\Delta_{k}\geq\Delta_{\min}(\epsilon):=\min\left(\Delta_{0},\frac{\gamma_{\textnormal{dec}}\epsilon}{\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)+\kappa_{\textnormal{mg}}}\right),

(3.11)

for all $k=0,\ldots,K$ .

Proof.

We proceed by induction. The result holds trivially for $k=0$ , so suppose $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for some $k\in\{0,\ldots,K-1\}$ . To find a contradiction assume that $\Delta_{k+1}<\Delta_{\min}(\epsilon)$ . Then $\Delta_{k+1}<\Delta_{k}$ , which by the mechanism for updating the trust-region radius means iteration $k$ was unsuccessful, and $\Delta_{k}=\gamma_{\textnormal{dec}}^{-1}\Delta_{k+1}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ . From Lemma 3.5, this means we must have

\displaystyle\Delta_{k}>\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|.

(3.12)

But since $m_{k}$ is fully linear (Assumption 3.3), we get

\displaystyle\epsilon\leq\|\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|<\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{k}+\kappa_{\textnormal{mg}}\Delta_{k},

(3.13)

which contradicts $\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ . ∎

Our main convergence result is the following. The arguments here can largely be used to prove the equivalent result for derivative-based methods, Theorem 2.12.

Theorem 3.7.

Suppose Assumptions 2.4, 2.9 and 3.3 hold and we run Algorithm 3.1. If $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ for all $k=0,\ldots,K-1$ , then

\displaystyle K\leq\frac{\log(\Delta_{0}/\Delta_{\min}(\epsilon))}{\log(\gamma_{\textnormal{dec}}^{-1})}+\left(1+\frac{\log(\gamma_{\textnormal{inc}})}{\log(\gamma_{\textnormal{dec}}^{-1})}\right)\frac{f(\bm{x}_{0})-f_{\textnormal{low}}}{\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{2}},

(3.14)

where $\Delta_{\min}(\epsilon)$ is defined in Lemma 3.6.

Proof.

We first partition the iterations $\{0,\ldots,K-1\}=\mathcal{S}\cup\mathcal{U}$ , where $\mathcal{S}$ is the set of successful or very successful iterations and $\mathcal{U}$ is the set of unsuccessful iterations. From Lemma 3.6 we have $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for all $k\in\{0,\ldots,K\}$ .

Since $\bm{x}_{k+1}\neq\bm{x}_{k}$ if and only if $k\in\mathcal{S}$ , we have

	$\displaystyle f(\bm{x}_{0})-f_{\textnormal{low}}\geq f(\bm{x}_{0})-f(\bm{x}_{K})\geq\sum_{k=0}^{K-1}f(\bm{x}_{k})-f(\bm{x}_{k+1})$	$\displaystyle=\sum_{k\in\mathcal{S}}f(\bm{x}_{k})-f(\bm{x}_{k+1}),$		(3.15)
		$\displaystyle\geq\eta_{U}\sum_{k\in\mathcal{S}}m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k}),$		(3.16)

where the last inequality uses $\rho_{k}\geq\eta_{U}$ for all $k\in\mathcal{S}$ . If $k\in\mathcal{S}$ , then Assumptions 2.9 and 3.3 (b), together with the criticality requirement $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ , imply

	$\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})$	$\displaystyle\geq\kappa_{s}\|\bm{g}_{k}\|\min\left(\Delta_{\min}(\epsilon),\frac{\|\bm{g}_{k}\|}{\|\bm{H}_{k}\|+1}\right),$		(3.17)
		$\displaystyle\geq\kappa_{s}\mu_{c}\Delta_{\min}(\epsilon)\min\left(\Delta_{\min}(\epsilon),\frac{\mu_{c}\Delta_{\min}(\epsilon)}{\kappa_{H}}\right).$		(3.18)

We conclude that

\displaystyle f(\bm{x}_{0})-f_{\textnormal{low}}\geq\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{2}\cdot|\mathcal{S}|,

(3.19)

\displaystyle|\mathcal{S}|\leq\frac{f(\bm{x}_{0})-f_{\textnormal{low}}}{\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{2}}.

(3.20)

Separately, the mechanism for updating $\Delta_{k}$ ensures that $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ if $k\in\mathcal{U}$ and $\Delta_{k+1}\leq\gamma_{\textnormal{inc}}\Delta_{k}$ if $k\in\mathcal{S}$ . Hence

\displaystyle\Delta_{\min}(\epsilon)\leq\Delta_{K}\leq\Delta_{0}\gamma_{\textnormal{dec}}^{|\mathcal{U}|}\gamma_{\textnormal{inc}}^{|\mathcal{S}|},

(3.21)

which gives

\displaystyle|\mathcal{U}|\leq\frac{\log(\Delta_{0}/\Delta_{\min}(\epsilon))}{\log(\gamma_{\textnormal{dec}}^{-1})}+\frac{\log(\gamma_{\textnormal{inc}})}{\log(\gamma_{\textnormal{dec}}^{-1})}|\mathcal{S}|.

(3.22)

The result then follows from $K=|\mathcal{S}|+|\mathcal{U}|$ . ∎
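To get a feel for the size of the bound (3.14), it can be evaluated numerically. The following is a minimal sketch: the function name and all parameter values are illustrative assumptions rather than values from the text, and $\Delta_{\min}(\epsilon)$ is passed in directly instead of being computed from Lemma 3.6.

```python
import math


def first_order_iteration_bound(delta0, delta_min, f0, f_low,
                                gamma_dec, gamma_inc, eta_u,
                                kappa_s, mu_c, kappa_h):
    """Right-hand side of (3.14): an upper bound on the iteration count K."""
    log_dec = math.log(1.0 / gamma_dec)
    # Bound (3.20) on the number of (very) successful iterations |S|.
    max_successful = (f0 - f_low) / (
        eta_u * kappa_s * mu_c * min(1.0, mu_c / kappa_h) * delta_min**2
    )
    # Bound (3.22) on |U|, combined with K = |S| + |U|.
    return (math.log(delta0 / delta_min) / log_dec
            + (1.0 + math.log(gamma_inc) / log_dec) * max_successful)


# Illustrative (hypothetical) parameter values.
bound = first_order_iteration_bound(
    delta0=1.0, delta_min=0.25, f0=10.0, f_low=0.0,
    gamma_dec=0.5, gamma_inc=2.0, eta_u=0.1,
    kappa_s=0.5, mu_c=1.0, kappa_h=2.0,
)
print(f"K <= {bound:.1f}")
```

Because the second term scales like $\Delta_{\min}(\epsilon)^{-2}$, it typically dominates the logarithmic first term.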

We again summarize our result in terms of the key quantities of interest. Here, we recall $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}})$ (see Definition 3.1) and note that $\Delta_{\min}(\epsilon)=\Theta((\kappa_{\textnormal{m}}+\kappa_{H})^{-1}\epsilon)$ as $\epsilon\to 0^{+}$ and $\kappa_{\textnormal{m}},\kappa_{H}\to\infty$ , to get the following first-order convergence result.

Corollary 3.8.

Suppose the assumptions of Theorem 3.7 hold. If $k_{\epsilon}$ is the first iteration of Algorithm 3.1 such that $\|\nabla f(\bm{x}_{k_{\epsilon}})\|<\epsilon$ , then $k_{\epsilon}=\mathcal{O}(\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}\epsilon^{-2})$ . Hence $\liminf_{k\to\infty}\|\nabla f(\bm{x}_{k})\|=0$ .

In the derivative-based case (Theorem 2.12), using $\kappa_{\textnormal{m}}=\mathcal{O}(L_{1})$ corresponding to a Taylor model, the worst-case complexity was $\mathcal{O}((\kappa_{\textnormal{m}}+\kappa_{H})\epsilon^{-2})$ iterations. The bound in Corollary 3.8 is worse by a factor of $\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})$ , which arises from using $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{\min}(\epsilon)$ to get (3.18) (see Remark 3.9).

We also have to assess the impact of using interpolation-based models on $\kappa_{\textnormal{m}}$ . In Section 4, we show that $\kappa_{\textnormal{m}}$ is usually of size $\mathcal{O}(L_{1})$ , the same as for Taylor models, but with a constant that depends explicitly on the problem dimension $n$ , e.g. $\kappa_{\textnormal{m}}=\mathcal{O}(\sqrt{n}\>L_{1})$ in Theorem 4.2. In some special cases, though, we can recover dimension-independent $\kappa_{\textnormal{m}}$ (Corollary 4.5). However, we will always get an explicit dimension dependency of at least $\mathcal{O}(n)$ if we count the number of (possibly expensive) objective evaluations. (Footnote 13: In the derivative-based setting, this does not arise because every evaluation of $\nabla f$ gives $n$ pieces of information.)

Remark 3.9.

There are multiple ways to establish a model decrease of at least $\mathcal{O}(\epsilon^{2})$ on successful iterations, as in (3.18) above. In derivative-based trust-region methods, when $\bm{g}_{k}=\nabla f(\bm{x}_{k})$ , we have

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\epsilon\min\left(\Delta_{\min}(\epsilon),\frac{\epsilon}{\kappa_{H}}\right)=\kappa_{s}\epsilon\Delta_{\min}(\epsilon),

(3.23)

immediately from the assumption $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ , and we never need to check $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ (i.e. we may set $\mu_{c}=0$ ). In the DFO setting, if we know that $m_{k}$ is fully linear then we may instead use

\displaystyle\epsilon\leq\|\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\|\nabla f(\bm{x}_{k})-\bm{g}_{k}\|\leq\|\bm{g}_{k}\|+\kappa_{\textnormal{mg}}\Delta_{k}\leq(1+\kappa_{\textnormal{mg}}\mu_{c}^{-1})\|\bm{g}_{k}\|,

(3.24)

to get $\|\bm{g}_{k}\|\geq\mathcal{O}(\epsilon)$ . The approach in (3.18) is the most general, as it only requires the criticality condition $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ and does not require anything regarding model accuracy. The same flexibility appears in the second-order convergence theory (below).

3.2 Second-Order Convergence

Similar to Theorem 2.13, we now consider how to extend the above results to converge to second-order critical points. Just as in the derivative-based case, we need to assume $f$ is twice continuously differentiable (specifically, Assumption 2.7), and our trust-region subproblem solver needs to achieve at least the eigenstep decrease (Assumption 2.10).

However, motivated by (2.4), we need a stricter requirement on our model accuracy than fully linear. Where fully linear models match (up to constants) the error from a first-order Taylor series, we now require models which match the error from a second-order Taylor series.

Definition 3.10.

Suppose we have $\bm{x}\in\mathbb{R}^{n}$ and $\Delta>0$ . A local model $m:\mathbb{R}^{n}\to\mathbb{R}$ approximating $f:\mathbb{R}^{n}\to\mathbb{R}$ is fully quadratic in $B(\bm{x},\Delta)$ if there exist constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}}>0$ , independent of $m$ , $\bm{x}$ and $\Delta$ , such that


$\displaystyle|m(\bm{y})-f(\bm{y})|$	$\displaystyle\leq\kappa_{\textnormal{mf}}\Delta^{3},$	(3.25a)
$\displaystyle\|\nabla m(\bm{y})-\nabla f(\bm{y})\|$	$\displaystyle\leq\kappa_{\textnormal{mg}}\Delta^{2},$	(3.25b)
$\displaystyle\|\nabla^{2}m(\bm{y})-\nabla^{2}f(\bm{y})\|$	$\displaystyle\leq\kappa_{\textnormal{mh}}\Delta,$	(3.25c)

for all $\bm{y}\in B(\bm{x},\Delta)$ . Sometimes we will use $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}})$ for notational convenience.

If Assumption 2.7 and hence (2.4) and (2.5) hold, it is not hard to verify that the second-order Taylor series $m(\bm{y})=f(\bm{x})+\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})$ is fully quadratic in $B(\bm{x},\Delta)$ with $\kappa_{\textnormal{mf}}=\frac{1}{6}L_{2}$ , $\kappa_{\textnormal{mg}}=\frac{1}{2}L_{2}$ and $\kappa_{\textnormal{mh}}=L_{2}$ .
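The Taylor-model constants above can be checked numerically on a simple example. The sketch below assumes the test function $f(\bm{y})=\frac{1}{6}\sum_{i}y_{i}^{3}$ (not from the text), whose Hessian $\operatorname{diag}(\bm{y})$ is Lipschitz continuous with $L_{2}=1$ in the Euclidean operator norm, and samples random points in $B(\bm{x},\Delta)$ to estimate the three ratios in (3.25).

```python
import numpy as np


def f(y):
    return np.sum(y**3) / 6.0


def grad_f(y):
    return 0.5 * y**2


def hess_f(y):
    return np.diag(y)


def taylor_model(x):
    """Second-order Taylor model of f at x: returns m, grad m and the (constant) Hessian."""
    fx, gx, Hx = f(x), grad_f(x), hess_f(x)

    def m(y):
        d = y - x
        return fx + gx @ d + 0.5 * d @ Hx @ d

    def grad_m(y):
        return gx + Hx @ (y - x)

    return m, grad_m, Hx


rng = np.random.default_rng(0)
n, delta, L2 = 4, 0.3, 1.0
x = rng.standard_normal(n)
m, grad_m, hess_m = taylor_model(x)

worst = np.zeros(3)  # largest observed ratios for (3.25a), (3.25b), (3.25c)
for _ in range(1000):
    d = rng.standard_normal(n)
    d *= delta * rng.uniform() / np.linalg.norm(d)  # random point in B(x, delta)
    y = x + d
    ratios = [abs(m(y) - f(y)) / delta**3,
              np.linalg.norm(grad_m(y) - grad_f(y)) / delta**2,
              np.linalg.norm(hess_m - hess_f(y), 2) / delta]
    worst = np.maximum(worst, ratios)

print("observed ratios:", worst)
print("Taylor constants:", [L2 / 6, L2 / 2, L2])
```

Sampling only gives a lower estimate of the true worst case over the ball, so this is a sanity check rather than a proof.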

Our model assumptions mimic Assumption 3.3 closely.

Assumption 3.11.

At each iteration $k$ of Algorithm 3.2, the model $m_{k}$ (3.1) satisfies:

(a)

$m_{k}$ is fully quadratic in $B(\bm{x}_{k},\Delta_{k})$ with constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}}>0$ independent of $k$ ;
(b)

$\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some fixed $\kappa_{H}\geq 1$ (independent of $k$ ).

Then, our second-order algorithm is given in Algorithm 3.2. Compared to the first-order algorithm Algorithm 3.1, we require fully quadratic models rather than fully linear models, assume the eigenstep decrease from the trust-region subproblem solver, and compare $\Delta_{k}$ with $\sigma^{m}_{k}$ (2.16) to declare a step successful, instead of comparing just with $\|\bm{g}_{k}\|$ . Additionally, we cap $\Delta_{k}$ at a maximum level $\Delta_{\max}$ , a technicality needed to bound the error between the true and estimated criticality measures (Lemma 3.12). Ultimately, this comes from the need to control the size of the gradient error (3.25b): reflecting the ‘overloaded’ role of the trust-region radius, we have to cap the radius (and hence the size of the step $\bm{s}_{k}$ ) solely because the radius also acts as the measure of model accuracy.

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$ . Algorithm parameters: maximum trust-region radius $\Delta_{\max}\geq\Delta_{0}$ , scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$ , acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$ , and criticality threshold $\mu_{c}>0$ .
2: for $k=0,1,2,\ldots$ do
3:     Build a local quadratic model $m_{k}$ (3.1) satisfying Assumption 3.11.
4:     Solve the trust-region subproblem (2.10) to get a step $\bm{s}_{k}$ satisfying Assumption 2.10.
5:     Evaluate $f(\bm{x}_{k}+\bm{s}_{k})$ and calculate the ratio $\rho_{k}$ (2.11).
6:     if $\rho_{k}\geq\eta_{S}$ and $\sigma^{m}_{k}\geq\mu_{c}\Delta_{k}$ then
7:         (Very successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\min(\gamma_{\textnormal{inc}}\Delta_{k},\Delta_{\max})$ .
8:     else if $\eta_{U}\leq\rho_{k}<\eta_{S}$ and $\sigma^{m}_{k}\geq\mu_{c}\Delta_{k}$ then
9:         (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\Delta_{k}$ .
10:     else
11:         (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ .
12:     end if
13: end for

Algorithm 3.2 Second-order MBDFO trust-region method for solving (2.1).
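The acceptance and radius-update logic in lines 6 to 12 of Algorithm 3.2 can be sketched in a few lines of Python. This is an illustration only: the function names and default parameter values are hypothetical, and the model criticality measure is assumed to take the form $\sigma^{m}_{k}=\max(\|\bm{g}_{k}\|,\tau^{m}_{k})$ with $\tau^{m}_{k}=\max(0,-\lambda_{\min}(\bm{H}_{k}))$, which is consistent with how (2.16) is used in this section but is not restated here. Model construction and the subproblem solver (lines 3 and 4) are not included.

```python
import numpy as np


def criticality_measure(g, H):
    """Assumed form of sigma^m: max(||g||, tau^m), tau^m = max(0, -lambda_min(H))."""
    tau_m = max(0.0, -np.linalg.eigvalsh(H)[0])  # eigvalsh returns ascending eigenvalues
    return max(np.linalg.norm(g), tau_m)


def trust_region_update(x, s, rho, sigma_m, delta, delta_max,
                        gamma_dec=0.5, gamma_inc=2.0,
                        eta_u=0.1, eta_s=0.7, mu_c=1.0):
    """Lines 6-12 of Algorithm 3.2: return (x_next, delta_next, label)."""
    critical_ok = sigma_m >= mu_c * delta
    if rho >= eta_s and critical_ok:
        return x + s, min(gamma_inc * delta, delta_max), "very successful"
    if rho >= eta_u and critical_ok:
        return x + s, delta, "successful"
    return x, gamma_dec * delta, "unsuccessful"


# At a model saddle point: zero gradient but negative curvature.
g = np.zeros(2)
H = np.diag([2.0, -1.0])
sigma_m = criticality_measure(g, H)
x1, d1, label1 = trust_region_update(
    np.zeros(2), np.array([0.0, 0.5]), rho=0.9,
    sigma_m=sigma_m, delta=0.5, delta_max=0.8)
print(label1, x1, d1)
```

Note that a step with a good ratio $\rho_{k}$ is still rejected if $\sigma^{m}_{k}<\mu_{c}\Delta_{k}$, which is what forces the radius down to a level where the model is accurate relative to the criticality measure.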

First, we show that fully quadratic models yield sufficiently good estimates of all the different criticality measures under consideration (see (2.2) and (2.16)).

Lemma 3.12.

Suppose Assumptions 2.7 (a) and 3.11 hold. On iteration $k$ of Algorithm 3.2, we have $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\kappa_{\sigma}\Delta_{k}$ , $|\tau^{m}_{k}-\tau_{k}|\leq\kappa_{\sigma}\Delta_{k}$ and $|\sigma_{k}-\sigma^{m}_{k}|\leq\kappa_{\sigma}\Delta_{k}$ all hold, where $\kappa_{\sigma}:=\max(\kappa_{\textnormal{mg}}\Delta_{\max},\kappa_{\textnormal{mh}})$ .

Proof.

Note that the trust-region updating mechanism in Algorithm 3.2 and $\Delta_{0}\leq\Delta_{\max}$ ensures $\Delta_{k}\leq\Delta_{\max}$ for all iterations $k$ .

First, we have $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\kappa_{\textnormal{mg}}\Delta_{k}^{2}\leq\kappa_{\textnormal{mg}}\Delta_{\max}\Delta_{k}\leq\kappa_{\sigma}\Delta_{k}$ . Secondly, if $\bm{v}$ is a normalized eigenvector corresponding to $\lambda_{\min}(\nabla^{2}f(\bm{x}_{k}))$ , then

\displaystyle\lambda_{\min}(\bm{H}_{k})-\lambda_{\min}(\nabla^{2}f(\bm{x}_{k}))

\displaystyle\leq\bm{v}^{T}[\bm{H}_{k}-\nabla^{2}f(\bm{x}_{k})]\bm{v}\leq\|\bm{H}_{k}-\nabla^{2}f(\bm{x}_{k})\|\leq\kappa_{\textnormal{mh}}\Delta_{k}\leq\kappa_{\sigma}\Delta_{k},

(3.26)

and instead taking $\bm{v}$ to be a normalized eigenvector for $\lambda_{\min}(\bm{H}_{k})$ we get the reverse result $\lambda_{\min}(\nabla^{2}f(\bm{x}_{k}))-\lambda_{\min}(\bm{H}_{k})\leq\kappa_{\sigma}\Delta_{k}$ , and so all together we get $|\lambda_{\min}(\bm{H}_{k})-\lambda_{\min}(\nabla^{2}f(\bm{x}_{k}))|\leq\kappa_{\sigma}\Delta_{k}$ . The result then follows from the identity $|\max_{i}a_{i}-\max_{i}b_{i}|\leq\max_{i}|a_{i}-b_{i}|$ . (Footnote 14: To show this, suppose $a_{i^{*}}=\max_{i}a_{i}$ . Then $\max_{i}a_{i}=a_{i^{*}}\leq b_{i^{*}}+|a_{i^{*}}-b_{i^{*}}|\leq\max_{i}b_{i}+\max_{i}|a_{i}-b_{i}|$ , and a similar argument gives the opposite direction.) ∎
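The two inequalities used in this proof are easy to sanity-check numerically. The sketch below draws random symmetric matrices (an assumption made purely for illustration) and verifies that the smallest eigenvalue moves by no more than the operator norm of the perturbation, as in (3.26), and that the identity $|\max_{i}a_{i}-\max_{i}b_{i}|\leq\max_{i}|a_{i}-b_{i}|$ holds on a random pair of vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Check |lambda_min(A) - lambda_min(B)| <= ||A - B|| for symmetric A, B.
max_violation = -np.inf
for _ in range(200):
    A = rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)
    E = 1e-2 * rng.standard_normal((n, n))
    E = 0.5 * (E + E.T)
    B = A + E
    gap = abs(np.linalg.eigvalsh(A)[0] - np.linalg.eigvalsh(B)[0])
    max_violation = max(max_violation, gap - np.linalg.norm(E, 2))

# Check |max_i a_i - max_i b_i| <= max_i |a_i - b_i|.
a = rng.standard_normal(10)
b = rng.standard_normal(10)
identity_gap = abs(a.max() - b.max()) - np.abs(a - b).max()

print("largest eigenvalue-bound violation:", max_violation)
print("identity gap (should be <= 0):", identity_gap)
```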

We now state the second-order equivalent of Lemma 3.5, which is split into two results for the two criticality measures $\|\bm{g}_{k}\|$ and $\tau^{m}_{k}$ .

Lemma 3.13.

Suppose Assumptions 2.7 (a), 2.10 and 3.11 hold. If, on iteration $k$ of Algorithm 3.2, $\bm{g}_{k}\neq\bm{0}$ and

\displaystyle\Delta_{k}\leq\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}\Delta_{\max}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|,

(3.27)

then $\rho_{k}\geq\eta_{S}$ and $\sigma^{m}_{k}\geq\mu_{c}\Delta_{k}$ (i.e. iteration $k$ is very successful).

Proof.

First, we have $\sigma^{m}_{k}\geq\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ by assumption on $\Delta_{k}$ , so it remains to show $\rho_{k}\geq\eta_{S}$ . From Assumptions 2.10 and 3.11, we have

\displaystyle|\rho_{k}-1|\leq\frac{|f(\bm{x}_{k}+\bm{s}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|+|f(\bm{x}_{k})-m_{k}(\bm{x}_{k})|}{|m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|}\leq\frac{2\kappa_{\textnormal{mf}}\Delta_{k}^{3}}{\kappa_{s}\|\bm{g}_{k}\|\min\left(\Delta_{k},\frac{\|\bm{g}_{k}\|}{\kappa_{H}}\right)}=\frac{2\kappa_{\textnormal{mf}}\Delta_{k}^{3}}{\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}},

(3.28)

since $\Delta_{k}\leq\frac{\|\bm{g}_{k}\|}{\kappa_{H}}$ by assumption. Therefore we have

\displaystyle|\rho_{k}-1|\leq\frac{2\kappa_{\textnormal{mf}}\Delta_{k}^{2}}{\kappa_{s}\|\bm{g}_{k}\|}\leq\frac{2\kappa_{\textnormal{mf}}\Delta_{\max}\Delta_{k}}{\kappa_{s}\|\bm{g}_{k}\|},

(3.29)

and so we conclude $|\rho_{k}-1|\leq 1-\eta_{S}$ by the assumption (3.27) on $\Delta_{k}$ , and hence $\rho_{k}\geq\eta_{S}$ . ∎

Lemma 3.14.

Suppose Assumptions 2.7 (a), 2.10 and 3.11 hold. If, on iteration $k$ of Algorithm 3.2, $\tau^{m}_{k}>0$ and

\displaystyle\Delta_{k}\leq\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\mu_{c}}\right)\tau^{m}_{k},

(3.30)

then $\rho_{k}\geq\eta_{S}$ and $\sigma^{m}_{k}\geq\mu_{c}\Delta_{k}$ (i.e. iteration $k$ is very successful).

Proof.

First, note that $\sigma^{m}_{k}\geq\tau^{m}_{k}\geq\mu_{c}\Delta_{k}$ . Separately, from Assumptions 2.10 and 3.11 we have

\displaystyle|\rho_{k}-1|\leq\frac{|f(\bm{x}_{k}+\bm{s}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|+|f(\bm{x}_{k})-m_{k}(\bm{x}_{k})|}{|m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|}\leq\frac{2\kappa_{\textnormal{mf}}\Delta_{k}^{3}}{\kappa_{s}\tau^{m}_{k}\Delta_{k}^{2}}=\frac{2\kappa_{\textnormal{mf}}\Delta_{k}}{\kappa_{s}\tau^{m}_{k}}\leq 1-\eta_{S},

(3.31)

and so $\rho_{k}\geq\eta_{S}$ . ∎

We now get our usual result, that $\Delta_{k}$ is bounded away from zero provided we are not close to optimality.

Lemma 3.15.

Suppose Assumptions 2.7 (a), 2.10 and 3.11 hold and we run Algorithm 3.2. If $\sigma_{k}\geq\epsilon$ for all $k=0,\ldots,K-1$ , then

\displaystyle\Delta_{k}\geq\Delta_{\min}(\epsilon):=\min\left(\Delta_{0},\frac{\gamma_{\textnormal{dec}}\epsilon}{\max\left(\frac{2\kappa_{\textnormal{mf}}\Delta_{\max}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)+\kappa_{\sigma}},\frac{\gamma_{\textnormal{dec}}\epsilon}{\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\mu_{c}\right)+\kappa_{\sigma}}\right),

(3.32)

for all $k=0,\ldots,K$ .

Proof.

We again proceed by induction. The result holds trivially for $k=0$ , so suppose $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for some $k\in\{0,\ldots,K-1\}$ . To find a contradiction assume that $\Delta_{k+1}<\Delta_{\min}(\epsilon)$ . Then $\Delta_{k+1}<\Delta_{k}$ , which by the mechanism for updating the trust-region radius means iteration $k$ was unsuccessful, and $\Delta_{k}=\gamma_{\textnormal{dec}}^{-1}\Delta_{k+1}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ .

At iteration $k$ , since $\sigma_{k}\geq\epsilon$ we either have $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ or $\tau_{k}\geq\epsilon$ . If $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ , since iteration $k$ is unsuccessful, from Lemma 3.13 we have

\displaystyle\Delta_{k}>\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}\Delta_{\max}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|,

(3.33)

and so from Lemma 3.12 we get

\displaystyle\epsilon\leq\|\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\|\nabla f(\bm{x}_{k})-\bm{g}_{k}\|<\max\left(\frac{2\kappa_{\textnormal{mf}}\Delta_{\max}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{k}+\kappa_{\sigma}\Delta_{k},

(3.34)

which contradicts $\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ . If instead $\tau_{k}\geq\epsilon$ , from Lemma 3.14 we get

\displaystyle\Delta_{k}>\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\mu_{c}}\right)\tau^{m}_{k},

(3.35)

which again from Lemma 3.12 implies

\displaystyle\epsilon\leq\tau_{k}\leq\tau^{m}_{k}+|\tau_{k}-\tau^{m}_{k}|<\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\mu_{c}\right)\Delta_{k}+\kappa_{\sigma}\Delta_{k},

(3.36)

and we again get a contradiction with $\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ . ∎

We now get our main second-order complexity result.

Theorem 3.16.

Suppose Assumptions 2.7, 2.10 and 3.11 hold and we run Algorithm 3.2. If $\sigma_{k}\geq\epsilon$ for all $k=0,\ldots,K-1$ , then

\displaystyle K\leq\frac{\log(\Delta_{0}/\Delta_{\min}(\epsilon))}{\log(\gamma_{\textnormal{dec}}^{-1})}+\left(1+\frac{\log(\gamma_{\textnormal{inc}})}{\log(\gamma_{\textnormal{dec}}^{-1})}\right)\frac{f(\bm{x}_{0})-f_{\textnormal{low}}}{\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{3}},

(3.37)

where $\Delta_{\min}(\epsilon)$ is defined in Lemma 3.15.

Proof.

We again partition the iterations $\{0,\ldots,K-1\}=\mathcal{S}\cup\mathcal{U}$ , where $\mathcal{S}$ is the set of successful or very successful iterations and $\mathcal{U}$ is the set of unsuccessful iterations. From Lemma 3.15, we have $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for all $k\in\{0,\ldots,K\}$ .

If $k\in\mathcal{S}$ , then the criticality requirement $\sigma^{m}_{k}\geq\mu_{c}\Delta_{k}$ implies that either $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ or $\tau^{m}_{k}\geq\mu_{c}\Delta_{k}$ . First, if $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ , then Assumptions 2.10 and 3.11 (b) give

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\mu_{c}\Delta_{\min}(\epsilon)\min\left(\Delta_{\min}(\epsilon),\frac{\mu_{c}\Delta_{\min}(\epsilon)}{\kappa_{H}}\right).

(3.38)

Instead, if $\tau^{m}_{k}\geq\mu_{c}\Delta_{k}$ , then Assumption 2.10 gives

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\mu_{c}\Delta_{\min}(\epsilon)^{3}.

(3.39)

In either case, we always have

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{3}.

(3.40)

By the same reasoning as (3.16) in the proof of first-order convergence (Theorem 3.7), we have

\displaystyle f(\bm{x}_{0})-f_{\textnormal{low}}\geq\eta_{U}\sum_{k\in\mathcal{S}}m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{3}|\mathcal{S}|,

(3.41)

\displaystyle|\mathcal{S}|\leq\frac{f(\bm{x}_{0})-f_{\textnormal{low}}}{\eta_{U}\kappa_{s}\mu_{c}\min(1,\mu_{c}/\kappa_{H})\Delta_{\min}(\epsilon)^{3}}.

(3.42)

The trust-region updating mechanism again gives us (3.22), and the result follows from $K=|\mathcal{S}|+|\mathcal{U}|$ . ∎
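To see how the pieces fit together, the following sketch evaluates $\Delta_{\min}(\epsilon)$ from (3.32) and the iteration bound (3.37). The function names and all numerical values are illustrative assumptions, not values from the text.

```python
import math


def delta_min_second_order(eps, delta0, delta_max, gamma_dec, eta_s,
                           kappa_s, mu_c, kappa_h,
                           kappa_mf, kappa_mg, kappa_mh):
    """Lower bound (3.32) on the trust-region radius."""
    kappa_sigma = max(kappa_mg * delta_max, kappa_mh)  # Lemma 3.12
    c = 2.0 * kappa_mf / (kappa_s * (1.0 - eta_s))
    grad_term = gamma_dec * eps / (max(c * delta_max, kappa_h, mu_c) + kappa_sigma)
    curv_term = gamma_dec * eps / (max(c, mu_c) + kappa_sigma)
    return min(delta0, grad_term, curv_term)


def second_order_iteration_bound(dmin, delta0, f0, f_low, gamma_dec,
                                 gamma_inc, eta_u, kappa_s, mu_c, kappa_h):
    """Right-hand side of (3.37)."""
    log_dec = math.log(1.0 / gamma_dec)
    max_successful = (f0 - f_low) / (
        eta_u * kappa_s * mu_c * min(1.0, mu_c / kappa_h) * dmin**3
    )
    return (math.log(delta0 / dmin) / log_dec
            + (1.0 + math.log(gamma_inc) / log_dec) * max_successful)


# Illustrative (hypothetical) constants.
consts = dict(delta0=1.0, delta_max=1.0, gamma_dec=0.5, eta_s=0.7,
              kappa_s=0.5, mu_c=1.0, kappa_h=2.0,
              kappa_mf=1.0, kappa_mg=1.0, kappa_mh=1.0)
d1 = delta_min_second_order(1e-3, **consts)
d2 = delta_min_second_order(5e-4, **consts)
b1 = second_order_iteration_bound(d1, 1.0, 10.0, 0.0, 0.5, 2.0, 0.1, 0.5, 1.0, 2.0)
b2 = second_order_iteration_bound(d2, 1.0, 10.0, 0.0, 0.5, 2.0, 0.1, 0.5, 1.0, 2.0)
print(f"Delta_min: {d1:.3e} -> {d2:.3e}, bound ratio: {b2 / b1:.3f}")
```

With these values, halving $\epsilon$ halves $\Delta_{\min}(\epsilon)$ and increases the bound by a factor close to eight, reflecting the $\epsilon^{-3}$ rate.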

Recalling here the notation $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}})$ (see Definition 3.10) and noting that $\kappa_{\sigma}=\Theta(\kappa_{\textnormal{m}})$ and $\Delta_{\min}(\epsilon)=\Theta((\kappa_{\textnormal{m}}+\kappa_{H})^{-1}\epsilon)$ as $\epsilon\to 0^{+}$ and $\kappa_{\textnormal{m}},\kappa_{H}\to\infty$ , we get the following second-order convergence result.

Corollary 3.17.

Suppose the assumptions of Theorem 3.16 hold. If $k_{\epsilon}$ is the first iteration of Algorithm 3.2 such that $\sigma_{k_{\epsilon}}<\epsilon$ , then $k_{\epsilon}=\mathcal{O}(\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{3}\epsilon^{-3})$ . Hence $\liminf_{k\to\infty}\sigma_{k}=0$ .

This bound again matches $\mathcal{O}(\epsilon^{-3})$ from the derivative-based case (Theorem 2.13), but is again larger by a factor of $\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}$ arising from the same issue as the first-order case (Remark 3.9). Again, although $\kappa_{\textnormal{m}}$ —and now also $\kappa_{H}$ —will usually depend on $n$ explicitly (Theorem 4.6), they can be made dimension-independent (Corollary 4.9), but we will always need at least $\mathcal{O}(n^{2})$ objective evaluations (to compensate for the loss of information from the first- and second-order oracles).

3.3 Termination

We end this section by briefly discussing how to terminate MBDFO algorithms in practice.

In all nonlinear optimization methods, in practice it is common to terminate after some fixed budget. For derivative-based methods, this may be a maximum number of iterations $k$ or (particularly for large-scale problems) a maximum runtime. In MBDFO, where the objective may be very expensive to evaluate, it is more common to have a maximum number of objective evaluations (since we may have multiple evaluations per iteration, depending on how exactly the model (3.1) is constructed).

Separately, it is important to have a termination condition based on optimality; i.e. the algorithm terminates because it has reached a sufficiently accurate solution. In the derivative-based case, we have direct access to optimality measures such as $\|\nabla f(\bm{x}_{k})\|$ , and so we can simply terminate if $\|\nabla f(\bm{x}_{k})\|\leq\epsilon$ for some user-defined tolerance $\epsilon$ . Of course, this is not possible in the MBDFO case, as we no longer have $\nabla f(\bm{x}_{k})$ .

Although we have access to $\|\bm{g}_{k}\|$ at each iteration, a small value of $\|\bm{g}_{k}\|$ does not always imply a small value of $\|\nabla f(\bm{x}_{k})\|$ (the condition we actually want to be true). If the model $m_{k}$ is fully linear and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ , then we do have $\|\nabla f(\bm{x}_{k})\|=\mathcal{O}(\|\bm{g}_{k}\|)$ from (3.24), and so $\|\bm{g}_{k}\|\leq\epsilon$ (again where $\epsilon$ is a user-defined tolerance) would imply $\|\nabla f(\bm{x}_{k})\|=\mathcal{O}(\epsilon)$ . For this to be an appropriate termination condition, we require a certification that $m_{k}$ is fully linear.

In practice, however, MBDFO methods usually terminate with ‘optimality’ if $\Delta_{k}$ is sufficiently small. This is theoretically justified, as we know that $\Delta_{k}$ never gets too small if we remain far from optimality (e.g. Lemma 3.6). We also have the following result, to provide further comfort that we can terminate on small $\Delta_{k}$ (noting that all our main results above are related to convergence over a subsequence of iterations, not the full sequence of iterations).

Lemma 3.18.

Suppose Assumptions 2.4, 2.9 and 3.3 hold. Then the iterates of Algorithm 3.1 satisfy $\lim_{k\to\infty}\Delta_{k}=0$ .

Proof.

We first consider the case where there are finitely many (very) successful iterations. This means that iteration $k$ is unsuccessful for all $k$ sufficiently large, say $k\geq K$ for fixed $K$ . That is, $\Delta_{k}=\gamma_{\textnormal{dec}}^{k-K}\Delta_{K}$ for some $\Delta_{K}\leq\gamma_{\textnormal{inc}}^{K}\Delta_{0}$ . So, $\Delta_{k}\to 0$ as $k\to\infty$ .

Instead, suppose there are infinitely many (very) successful iterations. The below reasoning applies equally to either algorithm. For any such iteration $k$ , from Assumptions 3.3 and 2.9, the same reasoning as used to get (3.19)—without replacing $\Delta_{k}$ with the lower bound $\Delta_{\min}(\epsilon)$ —gives

\displaystyle f(\bm{x}_{0})-f_{\textnormal{low}}\geq\eta_{U}\kappa_{s}\mu_{c}\min\left(1,\frac{\mu_{c}}{\kappa_{H}}\right)\sum_{k\in\mathcal{S}}\Delta_{k}^{2},

(3.43)

where here $\mathcal{S}$ is the (infinite) set of all successful or very successful iterations. That is, we have $\sum_{k\in\mathcal{S}}\Delta_{k}^{2}<\infty$ . Since $\mathcal{S}$ is infinite, it must be that $\lim_{k\in\mathcal{S}}\Delta_{k}=0$ . For any $k\notin\mathcal{S}$ sufficiently large (i.e. so that there was at least one (very) successful iteration before $k$ ), the trust-region updating mechanism guarantees $\Delta_{k}\leq\gamma_{\textnormal{inc}}\Delta_{s_{k}}\to 0$ , where $s_{k}\in\mathcal{S}$ is the last (very) successful iteration before $k$ , and so the result holds. ∎

The below result additionally tells us that terminating after the first iteration when $\Delta_{k}$ drops below some termination threshold $\Delta_{\min}$ guarantees that the current iterate $\bm{x}_{k}$ is a good solution. Specifically, we get that $\|\nabla f(\bm{x}_{k})\|=\mathcal{O}((\kappa_{\textnormal{m}}+\kappa_{H})\Delta_{\min})$ .

Lemma 3.19.

Suppose Assumptions 2.4 (a), 2.9 and 3.3 hold, and $\Delta_{\min}<\Delta_{0}$ . If $\Delta_{k+1}<\Delta_{\min}$ occurs for the first time at the end of iteration $k$ of Algorithm 3.1, then

\displaystyle\|\nabla f(\bm{x}_{k})\|\leq\gamma_{\textnormal{dec}}^{-1}\left[\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)+\kappa_{\textnormal{mg}}\right]\Delta_{\min}.

(3.44)

Proof.

Since $\Delta_{\min}<\Delta_{0}$ , we know that $k\geq 0$ . Moreover, since $\Delta_{k}$ is only decreased on unsuccessful iterations, we know that iteration $k$ is unsuccessful and $\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}$ . From Lemma 3.5, for iteration $k$ to be unsuccessful we must have

\displaystyle\Delta_{k}>\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|.

(3.45)

Combined with $\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}$ we get

\displaystyle\|\bm{g}_{k}\|<\gamma_{\textnormal{dec}}^{-1}\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{\min}.

(3.46)

Therefore, since $m_{k}$ is fully linear, we get

	$\displaystyle\|\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|$	$\displaystyle<\gamma_{\textnormal{dec}}^{-1}\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{\min}+\kappa_{\textnormal{mg}}\Delta_{k},$		(3.47)
		$\displaystyle<\gamma_{\textnormal{dec}}^{-1}\max\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{\min}+\kappa_{\textnormal{mg}}\gamma_{\textnormal{dec}}^{-1}\Delta_{\min},$		(3.48)

as required. ∎

Of course, Lemma 3.18 ensures that we will always terminate in finite time, for any choice of termination value $\Delta_{\min}$ .
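The two stopping rules discussed in this section, an evaluation budget and a small trust-region radius, can be sketched as follows, together with the right-hand side of (3.44). The function and argument names (for example `delta_term` for the termination threshold $\Delta_{\min}$) are hypothetical, and in practice the constants in (3.44) are rarely known, so the bound is mainly of theoretical interest.

```python
def should_terminate(n_evals, max_evals, delta, delta_term):
    """Return a reason string if a stopping rule fires, otherwise None."""
    if n_evals >= max_evals:
        return "budget"  # maximum number of objective evaluations reached
    if delta < delta_term:
        return "small trust-region radius"
    return None


def gradient_bound_at_termination(delta_term, gamma_dec, eta_s, kappa_s,
                                  mu_c, kappa_h, kappa_mf, kappa_mg):
    """Right-hand side of (3.44): bound on ||grad f(x_k)|| at termination."""
    c = max(2.0 * kappa_mf / (kappa_s * (1.0 - eta_s)), kappa_h, mu_c)
    return (c + kappa_mg) * delta_term / gamma_dec


# Illustrative (hypothetical) values.
reason = should_terminate(n_evals=120, max_evals=500, delta=5e-4, delta_term=1e-3)
grad_bound = gradient_bound_at_termination(
    delta_term=1e-3, gamma_dec=0.5, eta_s=0.5, kappa_s=0.5,
    mu_c=1.0, kappa_h=2.0, kappa_mf=1.0, kappa_mg=1.0)
print(reason, grad_bound)
```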

Remark 3.20.

In Section 5.3, we will consider an algorithm where $m_{k}$ is not fully linear at every iteration. However, Lemma 3.19 still holds in that case, because $\Delta_{k}$ is only ever decreased when $m_{k}$ is fully linear.

Notes and References

The first convergence theory for an algorithm similar to Algorithm 3.1 was given in [49], but the framework studied here using fully linear/fully quadratic models was introduced later in [50] and expanded in [51]. The extension of these results to include worst-case complexity bounds was first given in the PhD thesis [90], with the first-order complexity later published in [67]. To the best of the author’s knowledge, the first use of the simplified algorithmic framework assuming fully linear models at all iterations (and checking criticality within the trust-region updating) was in [17].

Currently, one of the most active research directions in MBDFO is adapting this framework to be better suited to large-scale problems (i.e. where the dimension $n$ is large, roughly $n\gg 100$ ). The most promising direction seems to be methods that construct models in low-dimensional subspaces at each iteration, where the subspaces may be selected randomly [38, 60, 39] or using deterministic methods [151, 159].

4 Interpolation Model Construction

In Section 3, we introduced a basic MBDFO trust-region method. To make this method concrete, we need a way to construct quadratic models (3.1) at each iteration that are sufficiently accurate to guarantee convergence (i.e. satisfying Assumptions 3.3 or 3.11). In particular we focus on the fully linear/fully quadratic requirements Assumptions 3.3 (a) and 3.11 (a). Of course, we must be able to form these models using only zeroth-order information about the objective $f$ , which we will achieve by constructing models that interpolate $f$ over carefully chosen collections of points.

We first consider generating good collections of points for linear/quadratic interpolation to guarantee fully linear/fully quadratic models. We derive generic error bounds in terms of the linear algebra associated with the interpolation system, and consider special cases where points are sampled in a regular way around the current iterate (e.g. perturbations along coordinate axes). These special cases are useful to understand the theoretical capabilities of MBDFO methods, but this process of choosing a structured interpolation set is also practically useful. As outlined in Section 4.4, in the IBCDFO software collection (see Section 8), an interpolation set is built by appending suitable points one-by-one from the full history of the solver; the error bounds we derive are used to ensure ‘suitability’ and to complete the set of points if not enough historical points are suitable.

To simplify the derivation of error bounds, we will use the following result, which says that fully linear/fully quadratic models can be achieved by ensuring sufficiently accurate approximations to a Taylor series.

Lemma 4.1.

Suppose we have a quadratic model $m:\mathbb{R}^{n}\to\mathbb{R}$ approximating $f:\mathbb{R}^{n}\to\mathbb{R}$ near a point $\bm{x}\in\mathbb{R}^{n}$ ,

\displaystyle f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x}).

(4.1)

Then, for any $\Delta>0$ , the following results hold.

(a)

If Assumption 2.4 (a) holds and there exists $\kappa>0$ such that

$\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\kappa\Delta^{2},$ (4.2)

for all $\bm{y}\in B(\bm{x},\Delta)$ , then $m$ is fully linear in $B(\bm{x},\Delta)$ with constants (cf. (3.5))

$\displaystyle\kappa_{\textnormal{mf}}=\kappa+\frac{L_{1}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{mg}}=2\kappa+L_{1}+2\kappa_{H},$ (4.3)

where $\kappa_{H}$ is any upper bound on $\|\bm{H}\|$ .

(b)

If Assumption 2.7 (a) holds and there exists $\kappa>0$ such that

\displaystyle\left|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})-\frac{1}{2}(\bm{y}-\bm{x})^{T}\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\right|\leq\kappa\Delta^{3},

(4.4)

for all $\bm{y}\in B(\bm{x},\Delta)$ , then $m$ is fully quadratic in $B(\bm{x},\Delta)$ with constants (cf. (3.25))

\displaystyle\kappa_{\textnormal{mf}}=\kappa+\frac{L_{2}}{6},\qquad\kappa_{\textnormal{mg}}=34\kappa+\frac{L_{2}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{mh}}=24\kappa+L_{2}.

(4.5)

Proof.

We will use the technical results in Appendix A, which say that if two linear/quadratic functions are close in $B(\bm{x},\Delta)$ , then their coefficients are also close.

(a) Fix any $\bm{y}\in B(\bm{x},\Delta)$ , and use Lemma 2.6 to get

	$\displaystyle|m(\bm{y})-f(\bm{y})|$	$\displaystyle\leq|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|+|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|,$		(4.6)
		$\displaystyle\leq\kappa\Delta^{2}+\frac{L_{1}}{2}\|\bm{y}-\bm{x}\|^{2},$		(4.7)

and we get the value for $\kappa_{\textnormal{mf}}$ after using $\|\bm{y}-\bm{x}\|\leq\Delta$ . Next, for any $\bm{y}\in B(\bm{x},\Delta)$ we have

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\kappa\Delta^{2}+\frac{1}{2}\|\bm{H}\|\>\|\bm{y}-\bm{x}\|^{2}\leq\left(\kappa+\frac{1}{2}\kappa_{H}\right)\Delta^{2},

(4.8)

and so Lemma A.1 gives $\|\bm{g}-\nabla f(\bm{x})\|\leq(2\kappa+\kappa_{H})\Delta$ . So, for an arbitrary $\bm{y}\in B(\bm{x},\Delta)$ , we use $\nabla m(\bm{y})=\bm{g}+\bm{H}(\bm{y}-\bm{x})$ to get

	$\displaystyle\|\nabla m(\bm{y})-\nabla f(\bm{y})\|$	$\displaystyle\leq\|\bm{H}(\bm{y}-\bm{x})\|+\|\bm{g}-\nabla f(\bm{x})\|+\|\nabla f(\bm{y})-\nabla f(\bm{x})\|,$		(4.9)
		$\displaystyle\leq\kappa_{H}\|\bm{y}-\bm{x}\|+(2\kappa+\kappa_{H})\Delta+L_{1}\|\bm{y}-\bm{x}\|,$		(4.10)

and we get the value of $\kappa_{\textnormal{mg}}$ .

(b) First, to get $\kappa_{\textnormal{mf}}$ , we follow the same reasoning as for (4.7) but replacing Lemma 2.6 with (2.4) to get

\displaystyle|m(\bm{y})-f(\bm{y})|\leq\kappa\Delta^{3}+\frac{L_{2}}{6}\|\bm{y}-\bm{x}\|^{3},

(4.11)

for any $\bm{y}\in B(\bm{x},\Delta)$ . Next, we use Lemma A.2 on (4.4) to get

\displaystyle\|\bm{g}-\nabla f(\bm{x})\|\leq 10\kappa\Delta^{2},\qquad\text{and}\qquad\|\bm{H}-\nabla^{2}f(\bm{x})\|\leq 24\kappa\Delta.

(4.12)

So, for any $\bm{y}\in B(\bm{x},\Delta)$ we have

$\displaystyle\|\nabla m(\bm{y})-\nabla f(\bm{y})\|$	$\displaystyle\leq\|\bm{g}-\nabla f(\bm{x})\|+\|\bm{H}(\bm{y}-\bm{x})-\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\|$
	$\displaystyle\qquad\qquad\qquad\qquad+\|\nabla f(\bm{y})-\nabla f(\bm{x})-\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\|,$	(4.13)
	$\displaystyle\leq 10\kappa\Delta^{2}+24\kappa\Delta\|\bm{y}-\bm{x}\|+\frac{L_{2}}{2}\|\bm{y}-\bm{x}\|^{2},$	(4.14)

where the last inequality uses (2.5), and we get the value for $\kappa_{\textnormal{mg}}$ . Lastly, we have

\displaystyle\|\nabla^{2}m(\bm{y})-\nabla^{2}f(\bm{y})\|\leq\|\bm{H}-\nabla^{2}f(\bm{x})\|+\|\nabla^{2}f(\bm{y})-\nabla^{2}f(\bm{x})\|\leq 24\kappa\Delta+L_{2}\|\bm{y}-\bm{x}\|,

(4.15)

and we get $\kappa_{\textnormal{mh}}$ . ∎
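The constants in (4.3) and (4.5) are simple enough to tabulate directly. The helper names below are hypothetical; the formulas are those stated in Lemma 4.1.

```python
def fully_linear_constants(kappa, L1, kappa_h):
    """(kappa_mf, kappa_mg) from (4.3)."""
    return kappa + 0.5 * L1, 2.0 * kappa + L1 + 2.0 * kappa_h


def fully_quadratic_constants(kappa, L2):
    """(kappa_mf, kappa_mg, kappa_mh) from (4.5)."""
    return kappa + L2 / 6.0, 34.0 * kappa + 0.5 * L2, 24.0 * kappa + L2


# Illustrative (hypothetical) values.
lin = fully_linear_constants(kappa=1.0, L1=2.0, kappa_h=3.0)
quad = fully_quadratic_constants(kappa=0.0, L2=6.0)
print(lin, quad)
```

Taking $\kappa=0$ in part (b) corresponds to a model that matches the second-order Taylor series exactly, and the formulas then reduce to the Taylor-model constants $\frac{1}{6}L_{2}$, $\frac{1}{2}L_{2}$ and $L_{2}$ noted after Definition 3.10.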

Our goal is now to show that linear/quadratic interpolation models for $f$ can provide sufficiently good approximations to a Taylor series for $f$ , from which Lemma 4.1 tells us that the model is fully linear/fully quadratic, as needed for our algorithms.

4.1 Linear Interpolation

In the first instance, we will try to construct fully linear models (Definition 3.1) using linear interpolation. Here, for our objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ , we will build a local linear model around a base point $\bm{x}\in\mathbb{R}^{n}$ (which will be $\bm{x}=\bm{x}_{k}$ at the $k$ -th iteration of our algorithm),

\displaystyle f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x}),

(4.16)

for some $c\in\mathbb{R}$ and $\bm{g}\in\mathbb{R}^{n}$; i.e. taking $\bm{H}_{k}=\bm{0}$ in (3.1), which satisfies Assumption 3.3 (b) with $\kappa_{H}=1$. We choose the $p:=n+1$ unknowns ($c$ and $\bm{g}$) to interpolate known values of $f$ at $p$ points, $\bm{y}_{1},\ldots,\bm{y}_{p}$. That is, we pick $c$ and $\bm{g}$ such that $f(\bm{y}_{i})=m(\bm{y}_{i})$ for all $i=1,\ldots,p$. This reduces to solving the following $p\times p$ linear system:

\displaystyle\underbrace{\begin{bmatrix}1&(\bm{y}_{1}-\bm{x})^{T}\\ \vdots&\vdots\\ 1&(\bm{y}_{p}-\bm{x})^{T}\end{bmatrix}}_{=:\bm{M}}\begin{bmatrix}c\\ \bm{g}\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}.

(4.17)

We shall see that, for the resulting $c$ and $\bm{g}$ to give a fully linear model (Definition 3.1) in a ball around the base point, $B(\bm{x},\Delta)$ —where at iteration $k$ we will usually have $\bm{x}=\bm{x}_{k}$ , our current iterate and $\Delta=\Delta_{k}$ , our current trust-region radius—the interpolation points $\bm{y}_{1},\ldots,\bm{y}_{p}$ will have to be close to the base point, $\|\bm{y}_{i}-\bm{x}\|\lesssim\Delta$ (i.e. $\|\bm{y}_{i}-\bm{x}\|\leq c\Delta$ for a constant $c$ not much larger than 1). However, as we saw in Lemma 3.18, the trust-region radius $\Delta_{k}\to 0^{+}$ as $k\to\infty$ , and so the matrix $\bm{M}$ in (4.17) becomes increasingly ill-conditioned as the algorithm progresses (as each column of $\bm{M}$ except the first approaches zero).

To avoid this ill-conditioning, we can rescale the second through last columns of the linear system (4.17) by $\Delta^{-1}$ (a form of equilibration [69, Chapter 3.5.2]), replacing $\bm{y}_{i}-\bm{x}$ with $\hat{\bm{s}}_{i}:=(\bm{y}_{i}-\bm{x})/\Delta$, to get

\displaystyle\underbrace{\begin{bmatrix}1&\hat{\bm{s}}_{1}^{T}\\ \vdots&\vdots\\ 1&\hat{\bm{s}}_{p}^{T}\end{bmatrix}}_{=:\hat{\bm{M}}}\begin{bmatrix}c\\ \hat{\bm{g}}\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix},\qquad\text{where $\hat{\bm{g}}:=\Delta\>\bm{g}$.}

(4.18)

Now, we expect that all $\|\hat{\bm{s}}_{i}\|\lesssim 1$ , and we avoid ill-conditioning resulting from $\Delta_{k}\to 0^{+}$ . This rescaling does not guarantee that the new linear system $\hat{\bm{M}}$ (4.18) is well-conditioned—we will ensure this by judicious choice of the $\hat{\bm{s}}_{i}$ —but it does ensure that the magnitude of $\Delta$ is not a source of ill-conditioning.
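To illustrate the effect of this rescaling numerically, the following is a minimal sketch in Python with NumPy (our choice of language for illustration). It assumes the simple point set consisting of $\bm{x}$ and $n$ coordinate perturbations of size $\Delta$, and compares the 2-norm condition numbers of $\bm{M}$ from (4.17) and $\hat{\bm{M}}$ from (4.18) as $\Delta$ shrinks. The function names are hypothetical.

```python
import numpy as np

def interpolation_matrices(x, Y, delta):
    """Return (M, M_hat) from (4.17) and (4.18) for points Y (one per row)."""
    S = Y - x                                   # rows are y_i - x
    ones = np.ones((Y.shape[0], 1))
    M = np.hstack([ones, S])                    # unscaled system matrix
    M_hat = np.hstack([ones, S / delta])        # columns 2..n+1 rescaled by 1/delta
    return M, M_hat

def condition_numbers(n, delta):
    """2-norm condition numbers of M and M_hat for coordinate perturbations."""
    x = np.zeros(n)
    Y = np.vstack([x, x + delta * np.eye(n)])   # x and n coordinate perturbations
    M, M_hat = interpolation_matrices(x, Y, delta)
    return np.linalg.cond(M), np.linalg.cond(M_hat)

for delta in [1.0, 1e-3, 1e-6]:
    cond_M, cond_M_hat = condition_numbers(n=3, delta=delta)
    print(f"delta={delta:.0e}: cond(M)={cond_M:.2e}, cond(M_hat)={cond_M_hat:.2e}")
```

In this sketch the condition number of $\bm{M}$ grows roughly like $\Delta^{-1}$, while that of $\hat{\bm{M}}$ does not depend on $\Delta$.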

In this setting, since all $\|\hat{\bm{s}}_{i}\|\lesssim 1$ , we know that $\|\hat{\bm{M}}\|$ is not too large, since $\|\hat{\bm{M}}\|\leq\|\hat{\bm{M}}\|_{F}\lesssim\sqrt{2p}$ . Hence if $\hat{\bm{M}}$ is invertible we have $\kappa(\hat{\bm{M}})\lesssim\sqrt{2p}\|\hat{\bm{M}}^{-1}\|$ , where $\kappa(\cdot)$ denotes the (2-norm) matrix condition number, so any ill-conditioning in $\hat{\bm{M}}$ is entirely driven by the size of $\|\hat{\bm{M}}^{-1}\|$ . This is reflected in the following interpolation error bound, where we assume $\|\hat{\bm{s}}_{i}\|\leq\beta$ for some $\beta>0$ , and both $\beta$ and $\|\hat{\bm{M}}^{-1}\|$ appear explicitly.

Theorem 4.2.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a linear model (4.16) for $f$ by solving (4.18), where we assume $\hat{\bm{M}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=n+1$ is the number of interpolation points), then the model is fully linear in $B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{-1}\|_{\infty}+\frac{L_{1}}{2},\quad\text{and}\quad\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}.

(4.19)

Proof.

For $i\in\{1,\ldots,p\}$ , the $i$ th row of (4.18) gives $m(\bm{y}_{i})=f(\bm{y}_{i})$ , and so

\displaystyle|m(\bm{y}_{i})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{i}-\bm{x})|\leq\frac{L_{1}}{2}\|\bm{y}_{i}-\bm{x}\|^{2}\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2},

(4.20)

using Lemma 2.6. Now, consider some arbitrary point $\bm{y}\in B(\bm{x},\Delta)$ and define $\hat{\bm{s}}:=(\bm{y}-\bm{x})/\Delta$ , giving $\|\hat{\bm{s}}\|\leq 1$ . Since $\hat{\bm{M}}$ is invertible, so too is $\hat{\bm{M}}^{T}$ , and there exists a unique solution $\bm{v}\in\mathbb{R}^{p}$ to $\hat{\bm{M}}^{T}\bm{v}=\begin{bmatrix}1\\ \hat{\bm{s}}\end{bmatrix}$ . Multiplying the last $n$ rows of this equation by $\Delta$ , we get

\displaystyle\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\sum_{i=1}^{p}v_{i}\begin{bmatrix}1\\ \bm{y}_{i}-\bm{x}\end{bmatrix},\qquad\text{where}\qquad\bm{v}=\hat{\bm{M}}^{-T}\begin{bmatrix}1\\ \hat{\bm{s}}\end{bmatrix}.

(4.21)

Considering the value of the model at this point $\bm{y}$ , we get

$\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|$	$\displaystyle=\left|\begin{bmatrix}c\\ \bm{g}\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}-\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}\right|,$	(4.22)
	$\displaystyle=\left|\sum_{i=1}^{p}v_{i}\left(\begin{bmatrix}c\\ \bm{g}\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}_{i}-\bm{x}\end{bmatrix}-\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}_{i}-\bm{x}\end{bmatrix}\right)\right|,$	(4.23)
	$\displaystyle=\left|\sum_{i=1}^{p}v_{i}\left(m(\bm{y}_{i})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{i}-\bm{x})\right)\right|,$	(4.24)
	$\displaystyle\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}\|\bm{v}\|_{1},$	(4.25)

where the last line comes from the triangle inequality and (4.20). From $\|\bm{v}\|_{1}\leq\|\hat{\bm{M}}^{-T}\|_{1}(1+\|\hat{\bm{s}}\|_{1})\leq\|\hat{\bm{M}}^{-1}\|_{\infty}(1+\sqrt{n})$ , a consequence of the identity $\|\bm{A}^{T}\|_{1}=\|\bm{A}\|_{\infty}$ and $\|\hat{\bm{s}}\|_{1}\leq\sqrt{n}\|\hat{\bm{s}}\|\leq\sqrt{n}$ , we get

\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{-1}\|_{\infty}\Delta^{2}.

(4.26)

The result then follows from Lemma 4.1(a) with $\kappa_{H}=0$ . ∎
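As a sanity check of Theorem 4.2, the following sketch (Python with NumPy) builds the linear model by solving (4.18) and compares sampled model errors in $B(\bm{x},\Delta)$ against $\kappa_{\textnormal{mf}}\Delta^{2}$ and $\kappa_{\textnormal{mg}}\Delta$ from (4.19). The test function $f(\bm{y})=\frac{1}{2}\|\bm{y}\|^{2}$ (for which $L_{1}=1$), the sample points and all function names are assumptions made for illustration only.

```python
import numpy as np

def linear_interp_model(f, x, Y, delta):
    """Solve the scaled system (4.18); return (c, g, M_hat) for the model (4.16)."""
    M_hat = np.hstack([np.ones((Y.shape[0], 1)), (Y - x) / delta])
    coeffs = np.linalg.solve(M_hat, np.array([f(y) for y in Y]))   # [c, g_hat]
    return coeffs[0], coeffs[1:] / delta, M_hat                    # g = g_hat / delta

def kappa_mf(L1, n, beta, M_hat):
    """The constant kappa_mf from (4.19)."""
    inv_norm = np.linalg.norm(np.linalg.inv(M_hat), ord=np.inf)
    return 0.5 * L1 * (1 + np.sqrt(n)) * beta**2 * inv_norm + 0.5 * L1

def worst_sampled_errors(f, grad_f, x, Y, delta, n_samples=500, seed=0):
    """Largest sampled errors in the model value and gradient over B(x, delta)."""
    rng = np.random.default_rng(seed)
    c, g, M_hat = linear_interp_model(f, x, Y, delta)
    err_f = err_g = 0.0
    for _ in range(n_samples):
        d = rng.standard_normal(x.size)
        d *= rng.uniform() / np.linalg.norm(d)       # random point in the unit ball
        y = x + delta * d
        err_f = max(err_f, abs(c + g @ (y - x) - f(y)))
        err_g = max(err_g, np.linalg.norm(g - grad_f(y)))
    return err_f, err_g, M_hat

# Hypothetical test problem: f(y) = ||y||^2 / 2 has gradient y, so L1 = 1.
f = lambda y: 0.5 * (y @ y)
grad_f = lambda y: y
n, delta = 4, 0.1
x = np.arange(1.0, n + 1)
Y = np.vstack([x, x + delta * np.eye(n)])            # coordinate perturbations, beta = 1
err_f, err_g, M_hat = worst_sampled_errors(f, grad_f, x, Y, delta)
k_mf = kappa_mf(L1=1.0, n=n, beta=1.0, M_hat=M_hat)
print(err_f <= k_mf * delta**2, err_g <= 2 * k_mf * delta)
```

Sampling only gives a lower estimate of the true worst-case error, so this check can refute a bound but cannot prove it.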

Remark 4.3.

Theorem 4.2 gives the error bounds in terms of $\|\hat{\bm{M}}^{-1}\|_{\infty}$ , as opposed to the 2-norm which occurs in the corresponding bounds in earlier works, e.g. [51]. This does not significantly change our results, but the proof approach where we control the model error relative to a Taylor series via (4.25) generalizes to interpolation theory in constrained regions (Section 6.1) and naturally yields bounds in terms of $\|\hat{\bm{M}}^{-1}\|_{\infty}$ .

All that remains is to find a set of interpolation points so that $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ and $\|\hat{\bm{M}}^{-1}\|_{\infty}$ is not too large. The simplest way to do this is to use $\bm{x}$ and $n$ perturbations around $\bm{x}$ of size $\Delta$, most naturally along the coordinate directions, i.e. taking our interpolation set (with $\bm{x}=\bm{x}_{k}$ and $\Delta=\Delta_{k}$) to be

\displaystyle\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{1},\ldots,\bm{x}_{k}+\Delta_{k}\bm{e}_{n}\}.

(4.27)

This gives us the following worst-case complexity bound for Algorithm 3.1, building on the general result Corollary 3.8.

Corollary 4.4.

Under the assumptions of Theorem 3.7, if the model $m_{k}$ (3.1) at each iteration of Algorithm 3.1 is generated by linear interpolation to the points in (4.27), then the iterates of Algorithm 3.1 achieve $\|\nabla f(\bm{x}_{k})\|<\epsilon$ for the first time after at most $\mathcal{O}(n\epsilon^{-2})$ iterations and $\mathcal{O}(n^{2}\epsilon^{-2})$ objective evaluations.

Proof.

Given the interpolation points (4.27), we have

\displaystyle\hat{\bm{M}}=\begin{bmatrix}1&\bm{0}^{T}\\ \bm{e}&\bm{I}\end{bmatrix},\qquad\text{and so}\qquad\hat{\bm{M}}^{-1}=\begin{bmatrix}1&\bm{0}^{T}\\ -\bm{e}&\bm{I}\end{bmatrix},

(4.28)

where $\bm{e}\in\mathbb{R}^{n}$ is the vector of all ones, so $\|\hat{\bm{M}}^{-1}\|_{\infty}=2$. Since the points (4.27) satisfy $\|\bm{y}_{i}-\bm{x}\|\leq\Delta$ (i.e. $\beta=1$), the fully linear constants from Theorem 4.2 satisfy $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(\sqrt{n}L_{1})$. We then apply Corollary 3.8 with $\kappa_{H}=1$ in Assumption 3.3 (b) to get the iteration bound. The objective evaluation bound comes from observing that our interpolation mechanism requires evaluating $f$ at no more than $n+1$ points per iteration. ∎

Using the interpolation set (4.27) is equivalent to building a linear Taylor model using finite differencing (3.2) with the dynamically adjusted stepsize $\Delta_{k}$ to estimate $\nabla f(\bm{x}_{k})$ . However, it is this dynamic adjustment that is at the core of MBDFO, reflecting that we need to adjust our interpolation set depending on the required accuracy in the model (controlled by $\Delta_{k}$ ).
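The equivalence with finite differencing can be checked directly. The sketch below (Python with NumPy) solves (4.18) on the set (4.27) and compares the resulting $\bm{g}$ with a forward-difference gradient estimate of stepsize $\Delta$, which is our reading of the finite differencing referred to above. The test function and the function names are hypothetical.

```python
import numpy as np

def interp_gradient(f, x, delta):
    """Gradient g of the linear model interpolating f on the set (4.27)."""
    n = x.size
    Y = np.vstack([x, x + delta * np.eye(n)])
    M_hat = np.hstack([np.ones((n + 1, 1)), (Y - x) / delta])
    coeffs = np.linalg.solve(M_hat, np.array([f(y) for y in Y]))
    return coeffs[1:] / delta                   # g = g_hat / delta

def forward_difference_gradient(f, x, h):
    """Forward finite-difference gradient estimate with stepsize h."""
    fx = f(x)
    return np.array([(f(x + h * e) - fx) / h for e in np.eye(x.size)])

# Hypothetical smooth test function of three variables.
f = lambda y: np.sin(y[0]) + y[0] * y[1] ** 2 + np.exp(0.5 * y[2])
x = np.array([0.3, -1.2, 0.7])
for delta in [1e-1, 1e-2, 1e-3]:
    gap = np.max(np.abs(interp_gradient(f, x, delta)
                        - forward_difference_gradient(f, x, delta)))
    print(f"delta={delta:.0e}: max difference = {gap:.1e}")
```

The two estimates agree up to rounding error for every $\Delta$, since both are determined by the same $n+1$ function values.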

The dimension dependency in Corollary 4.4 can be improved by considering coordinate perturbations of size $\Delta_{k}/\sqrt{n}$ .

Corollary 4.5.

Under the assumptions of Theorem 3.7, if the model $m_{k}$ (3.1) at each iteration of Algorithm 3.1 is generated by linear interpolation to the points $\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{1}/\sqrt{n},\ldots,\bm{x}_{k}+\Delta_{k}\bm{e}_{n}/\sqrt{n}\}$ , then the iterates of Algorithm 3.1 achieve $\|\nabla f(\bm{x}_{k})\|<\epsilon$ for the first time after at most $\mathcal{O}(\epsilon^{-2})$ iterations and $\mathcal{O}(n\epsilon^{-2})$ objective evaluations.

Proof.

For this interpolation set, we have $\|\hat{\bm{M}}^{-1}\|_{\infty}=\mathcal{O}(\sqrt{n})$ and $\beta=1/\sqrt{n}$ , so the model is fully linear with $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(L_{1})$ (i.e. no explicit dependence on $n$ ) by Theorem 4.2. The remainder of the proof is identical to Corollary 4.4. ∎
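The difference between Corollaries 4.4 and 4.5 can be seen by evaluating $\kappa_{\textnormal{mf}}$ from (4.19) for both perturbation sizes. The following sketch (Python with NumPy, with $L_{1}=1$ assumed and hypothetical function names) forms $\hat{\bm{M}}$ for coordinate perturbations of size $\Delta$ and $\Delta/\sqrt{n}$ and prints the two constants for increasing $n$.

```python
import numpy as np

def m_hat_coordinate(n, scale):
    """M_hat from (4.18) for the points x and x + scale * Delta * e_i."""
    M_hat = scale * np.eye(n + 1)
    M_hat[:, 0] = 1.0                 # first column is all ones
    return M_hat

def kappa_mf(L1, n, beta, M_hat):
    """The constant kappa_mf from (4.19)."""
    inv_norm = np.linalg.norm(np.linalg.inv(M_hat), ord=np.inf)
    return 0.5 * L1 * (1 + np.sqrt(n)) * beta**2 * inv_norm + 0.5 * L1

for n in [4, 64, 1024]:
    scale = 1.0 / np.sqrt(n)
    k_unit = kappa_mf(1.0, n, 1.0, m_hat_coordinate(n, 1.0))        # size Delta
    k_scaled = kappa_mf(1.0, n, scale, m_hat_coordinate(n, scale))  # size Delta/sqrt(n)
    print(f"n={n}: kappa_mf={k_unit:.3f} (unit), {k_scaled:.3f} (scaled)")
```

With $L_{1}=1$, substituting $\|\hat{\bm{M}}^{-1}\|_{\infty}=2$ and $\|\hat{\bm{M}}^{-1}\|_{\infty}=2\sqrt{n}$ into (4.19) gives $\sqrt{n}+\frac{3}{2}$ and $\frac{3}{2}+1/\sqrt{n}$ respectively, so only the first grows with $n$.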

So, at least in terms of the explicit dependence on $n$ , Corollary 4.5 gives an iteration bound that matches the derivative-based case, with a $\mathcal{O}(n)$ evaluation bound that comes from the fact that we need $n$ objective evaluations to match the same amount of problem information as one gradient evaluation (i.e. $n$ real numbers).

However, taking an even smaller perturbation size cannot reduce $\kappa_{\textnormal{mf}}$ to a smaller value (e.g. $\kappa_{\textnormal{mf}}=\mathcal{O}(1/n)$ ) to further improve the worst-case complexity, since $\kappa_{\textnormal{mf}}$ always has a component of size $\mathcal{O}(L_{1})$ from Assumption 2.4 that is independent of the model construction.

4.2 Quadratic Interpolation

Having established a method for constructing fully linear models, which can be used to achieve convergence to first-order stationary points, we now consider the construction of fully quadratic models (Definition 3.10) using quadratic interpolation, which will allow us to find second-order stationary points using Algorithm 3.2.

In this case, we have an objective $f:\mathbb{R}^{n}\to\mathbb{R}$ and we now build a local quadratic model around a base point $\bm{x}\in\mathbb{R}^{n}$ (which again will usually be $\bm{x}=\bm{x}_{k}$ in the $k$ -th iteration of our algorithm),

\displaystyle f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x}),

(4.29)

for some $c\in\mathbb{R}$, $\bm{g}\in\mathbb{R}^{n}$ and symmetric $\bm{H}\in\mathbb{R}^{n\times n}$. This model has $p:=1+n+n(n+1)/2=(n+1)(n+2)/2$ degrees of freedom, and so we again construct $m$ to interpolate $f$ at $p$ points $\bm{y}_{1},\ldots,\bm{y}_{p}$. (Note that we always take $p$ to be the number of interpolation points, so the value of $p$ here is different to the linear case in Section 4.1.)

To write this interpolation problem efficiently, we introduce the natural basis for quadratic functions $\mathbb{R}^{n}\to\mathbb{R}$ , which we denote $\bm{\phi}:\mathbb{R}^{n}\to\mathbb{R}^{p}$ , defined as

\displaystyle\bm{\phi}(\bm{x}):=\left[1,x_{1},\ldots,x_{n},\tfrac{1}{2}\,x_{1}^{2},x_{1}x_{2},\ldots,x_{1}x_{n},\tfrac{1}{2}\,x_{2}^{2},x_{2}x_{3},\ldots,x_{n-1}x_{n},\tfrac{1}{2}\,x_{n}^{2}\right]^{T}.

(4.30)

With this basis, and noting $\bm{H}=\bm{H}^{T}$ , the model (4.29) can be written as

\displaystyle m(\bm{y})=\bm{\phi}(\bm{y}-\bm{x})^{T}\begin{bmatrix}c\\ \bm{g}\\ \operatorname{upper}(\bm{H})\end{bmatrix},

(4.31)

where $\operatorname{upper}(\bm{H})\in\mathbb{R}^{n(n+1)/2}$ contains the upper triangular elements of $\bm{H}$ , arranged as

\displaystyle\operatorname{upper}(\bm{H})=\left[H_{1,1},H_{1,2},\ldots,H_{1,n},H_{2,2},H_{2,3},\ldots,H_{n-1,n},H_{n,n}\right]^{T}.

(4.32)

With these definitions, the interpolation problem is given by the $p\times p$ linear system

\displaystyle\underbrace{\begin{bmatrix}\bm{\phi}(\bm{y}_{1}-\bm{x})^{T}\\ \vdots\\ \bm{\phi}(\bm{y}_{p}-\bm{x})^{T}\end{bmatrix}}_{=:\bm{Q}}\begin{bmatrix}c\\ \bm{g}\\ \operatorname{upper}(\bm{H})\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}.

(4.33)

Just like the linear case, the matrix $\bm{Q}$ can become very ill-conditioned if $\Delta$ , a typical value for $\|\bm{y}_{i}-\bm{x}\|$ , gets very small. To address this, we define $\hat{\bm{s}}_{i}:=(\bm{y}_{i}-\bm{x})/\Delta$ and instead solve

\displaystyle\underbrace{\begin{bmatrix}\bm{\phi}(\hat{\bm{s}}_{1})^{T}\\ \vdots\\ \bm{\phi}(\hat{\bm{s}}_{p})^{T}\end{bmatrix}}_{=:\hat{\bm{Q}}}\begin{bmatrix}c\\ \hat{\bm{g}}\\ \operatorname{upper}(\hat{\bm{H}})\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix},

(4.34)

where $\hat{\bm{g}}:=\Delta\>\bm{g}$ and $\hat{\bm{H}}:=\Delta^{2}\>\bm{H}$ .
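The construction in (4.30) to (4.34) can be sketched in code as follows (Python with NumPy). The function names, the test quadratic and the sample points (the base point, positive and negative coordinate steps, and pairwise sums of coordinate steps) are assumptions for illustration. Since $f$ is itself quadratic here, the interpolation model should reproduce it exactly, giving $c=f(\bm{x})$, $\bm{g}=\nabla f(\bm{x})$ and $\bm{H}=\nabla^{2}f(\bm{x})$ up to rounding.

```python
import numpy as np

def phi(s):
    """Natural quadratic basis (4.30) evaluated at s."""
    n = s.size
    quad = []
    for i in range(n):
        quad.append(0.5 * s[i] ** 2)                        # pairs with H_{i,i}
        quad.extend(s[i] * s[j] for j in range(i + 1, n))   # pairs with H_{i,j}, i < j
    return np.concatenate([[1.0], s, quad])

def unpack_upper(h_upper, n):
    """Rebuild the symmetric matrix H from upper(H), ordered as in (4.32)."""
    H = np.zeros((n, n))
    rows, cols = np.triu_indices(n)     # row-major upper triangle
    H[rows, cols] = h_upper
    H[cols, rows] = h_upper
    return H

def quadratic_interp_model(f, x, Y, delta):
    """Solve the scaled system (4.34); return (c, g, H) for the model (4.29)."""
    n = x.size
    Q_hat = np.array([phi((y - x) / delta) for y in Y])
    coeffs = np.linalg.solve(Q_hat, np.array([f(y) for y in Y]))
    g = coeffs[1:n + 1] / delta                       # g = g_hat / delta
    H = unpack_upper(coeffs[n + 1:], n) / delta**2    # H = H_hat / delta^2
    return coeffs[0], g, H

# Hypothetical quadratic test function with Hessian A.
A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, -1.0], [0.0, -1.0, 1.0]])
b = np.array([1.0, -2.0, 0.5])
f = lambda y: 1.0 + b @ y + 0.5 * y @ A @ y
x, delta, n = np.array([0.2, -0.4, 1.0]), 0.5, 3
E = np.eye(n)
steps = [np.zeros(n)] + list(E) + list(-E)
steps += [E[i] + E[j] for i in range(n) for j in range(i + 1, n)]
Y = x + delta * np.array(steps)                       # p = 10 points
c, g, H = quadratic_interp_model(f, x, Y, delta)
print(np.isclose(c, f(x)), np.allclose(g, b + A @ x), np.allclose(H, A))
```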

As before, since $\|\hat{\bm{s}}_{i}\|\lesssim 1$ , we know that $\|\hat{\bm{Q}}\|$ is not too large, and so—provided $\hat{\bm{Q}}$ is invertible—the only potential source of ill-conditioning is if $\hat{\bm{Q}}^{-1}$ is large. As such, the size of $\hat{\bm{Q}}^{-1}$ determines the fully quadratic error bounds.

Theorem 4.6.

Suppose $f$ satisfies Assumption 2.7 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.34), where we assume $\hat{\bm{Q}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=(n+1)(n+2)/2$ is the number of interpolation points), then the model is fully quadratic in $B(\bm{x},\Delta)$ with constants


$\displaystyle\kappa_{\textnormal{mf}}$	$\displaystyle=\frac{1}{6}L_{2}\left(n+\sqrt{n}+\frac{3}{2}\right)\beta^{3}\|\hat{\bm{Q}}^{-1}\|_{\infty}+\frac{L_{2}}{6},$	(4.35a)
$\displaystyle\kappa_{\textnormal{mg}}$	$\displaystyle=\frac{17}{3}L_{2}\left(n+\sqrt{n}+\frac{3}{2}\right)\beta^{3}\|\hat{\bm{Q}}^{-1}\|_{\infty}+\frac{L_{2}}{2},\qquad\text{and}$	(4.35b)
$\displaystyle\kappa_{\textnormal{mh}}$	$\displaystyle=4L_{2}\left(n+\sqrt{n}+\frac{3}{2}\right)\beta^{3}\|\hat{\bm{Q}}^{-1}\|_{\infty}+L_{2}.$	(4.35c)

Proof.

For ease of notation, define $T_{f,\bm{x},2}(\bm{y}):=f(\bm{x})+\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})$ as the second-order Taylor series for $f$ at $\bm{x}$ . For $i\in\{1,\ldots,p\}$ , the $i$ th row of (4.34) gives $f(\bm{y}_{i})=m(\bm{y}_{i})$ and so

\displaystyle\left|m(\bm{y}_{i})-T_{f,\bm{x},2}(\bm{y}_{i})\right|\leq\frac{L_{2}}{6}\|\bm{y}_{i}-\bm{x}\|^{3}\leq\frac{L_{2}}{6}\beta^{3}\Delta^{3},

(4.36)

using (2.4). Now we choose $\bm{y}\in B(\bm{x},\Delta)$ and define $\hat{\bm{s}}:=(\bm{y}-\bm{x})/\Delta$ , giving $\|\hat{\bm{s}}\|\leq 1$ . Since $\hat{\bm{Q}}$ is invertible, so is $\hat{\bm{Q}}^{T}$ , and so we let $\bm{v}\in\mathbb{R}^{p}$ be the unique solution to $\hat{\bm{Q}}^{T}\bm{v}=\bm{\phi}(\hat{\bm{s}})$ . Scaling rows 2 through $n+1$ of this equation by $\Delta$ and rows $n+2$ through $p$ by $\Delta^{2}$ , we get

\displaystyle\bm{\phi}(\bm{y}-\bm{x})=\sum_{i=1}^{p}v_{i}\bm{\phi}(\bm{y}_{i}-\bm{x}),\qquad\text{where}\qquad\bm{v}=\hat{\bm{Q}}^{-T}\bm{\phi}(\hat{\bm{s}}).

(4.37)

We then get

$\displaystyle\left|m(\bm{y})-T_{f,\bm{x},2}(\bm{y})\right|$	$\displaystyle=\left|\bm{\phi}(\bm{y}-\bm{x})^{T}\begin{bmatrix}c\\ \bm{g}\\ \operatorname{upper}(\bm{H})\end{bmatrix}-\bm{\phi}(\bm{y}-\bm{x})^{T}\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\\ \operatorname{upper}(\nabla^{2}f(\bm{x}))\end{bmatrix}\right|,$	(4.38)
	$\displaystyle=\left|\sum_{i=1}^{p}v_{i}\left(\bm{\phi}(\bm{y}_{i}-\bm{x})^{T}\begin{bmatrix}c\\ \bm{g}\\ \operatorname{upper}(\bm{H})\end{bmatrix}-\bm{\phi}(\bm{y}_{i}-\bm{x})^{T}\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\\ \operatorname{upper}(\nabla^{2}f(\bm{x}))\end{bmatrix}\right)\right|,$	(4.39)
	$\displaystyle=\left|\sum_{i=1}^{p}v_{i}(m(\bm{y}_{i})-T_{f,\bm{x},2}(\bm{y}_{i}))\right|,$	(4.40)
	$\displaystyle\leq\frac{L_{2}}{6}\beta^{3}\Delta^{3}\|\bm{v}\|_{1},$	(4.41)

where the inequality comes from (4.36). We then use $\|\bm{v}\|_{1}\leq\|\hat{\bm{Q}}^{-T}\|_{1}\|\bm{\phi}(\hat{\bm{s}})\|_{1}=\|\hat{\bm{Q}}^{-1}\|_{\infty}\|\bm{\phi}(\hat{\bm{s}})\|_{1}$ , and it remains to bound $\|\bm{\phi}(\hat{\bm{s}})\|_{1}$ . To do this, we use the definition of $\bm{\phi}$ (4.30) to write

\displaystyle\|\bm{\phi}(\hat{\bm{s}})\|_{1}=1+\|\hat{\bm{s}}\|_{1}+\frac{1}{2}\hat{s}_{1}^{2}+\cdots+\frac{1}{2}\hat{s}_{n}^{2}+\sum_{i=1}^{n}\sum_{j=i+1}^{n}|\hat{s}_{i}\hat{s}_{j}|.

(4.42)

Using Young’s inequality, $|ab|\leq\frac{|a|^{2}+|b|^{2}}{2}=\frac{a^{2}+b^{2}}{2}$ for $a,b\in\mathbb{R}$ , we get

$\displaystyle\|\bm{\phi}(\hat{\bm{s}})\|_{1}$	$\displaystyle\leq 1+\|\hat{\bm{s}}\|_{1}+\frac{1}{2}\|\hat{\bm{s}}\|^{2}+\frac{1}{2}\sum_{i=1}^{n}\sum_{j=i+1}^{n}(\hat{s}_{i}^{2}+\hat{s}_{j}^{2}),$	(4.43)
	$\displaystyle\leq 1+\|\hat{\bm{s}}\|_{1}+\frac{1}{2}\|\hat{\bm{s}}\|^{2}+\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\hat{s}_{i}^{2}+\hat{s}_{j}^{2}),$	(4.44)
	$\displaystyle=1+\|\hat{\bm{s}}\|_{1}+\frac{1}{2}\|\hat{\bm{s}}\|^{2}+\frac{1}{2}\sum_{i=1}^{n}(n\hat{s}_{i}^{2}+\|\hat{\bm{s}}\|^{2}),$	(4.45)
	$\displaystyle=1+\|\hat{\bm{s}}\|_{1}+\frac{1}{2}\|\hat{\bm{s}}\|^{2}+n\|\hat{\bm{s}}\|^{2},$	(4.46)

and so combining with $\|\hat{\bm{s}}\|\leq 1$ and $\|\hat{\bm{s}}\|_{1}\leq\sqrt{n}$ , we get $\|\bm{\phi}(\hat{\bm{s}})\|_{1}\leq n+\sqrt{n}+\frac{3}{2}$ , which gives

\displaystyle|m(\bm{y})-T_{f,\bm{x},2}(\bm{y})|\leq\frac{L_{2}}{6}\left(n+\sqrt{n}+\frac{3}{2}\right)\beta^{3}\|\hat{\bm{Q}}^{-1}\|_{\infty}\Delta^{3}.

(4.47)

The result then follows from Lemma 4.1(b). ∎
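The bound $\|\bm{\phi}(\hat{\bm{s}})\|_{1}\leq n+\sqrt{n}+\frac{3}{2}$ used in this proof is easy to probe numerically. The sketch below (Python with NumPy, hypothetical function names) evaluates (4.42) at random points on the unit sphere and at the point $\hat{\bm{s}}=\bm{e}/\sqrt{n}$, where $\bm{e}$ is the vector of all ones, and prints the largest value found next to the bound.

```python
import numpy as np

def phi_one_norm(s):
    """||phi(s)||_1 written out as in (4.42)."""
    n = s.size
    cross = sum(abs(s[i] * s[j]) for i in range(n) for j in range(i + 1, n))
    return 1.0 + np.sum(np.abs(s)) + 0.5 * np.sum(s**2) + cross

def phi_bound(n):
    """The bound n + sqrt(n) + 3/2 obtained from (4.46) with ||s|| <= 1."""
    return n + np.sqrt(n) + 1.5

rng = np.random.default_rng(0)
for n in [1, 2, 5, 20]:
    worst = phi_one_norm(np.ones(n) / np.sqrt(n))    # all coordinates equal
    for _ in range(1000):
        s = rng.standard_normal(n)
        s /= np.linalg.norm(s)                       # random point with ||s|| = 1
        worst = max(worst, phi_one_norm(s))
    print(f"n={n}: largest sampled value {worst:.3f}, bound {phi_bound(n):.3f}")
```

For example, with $n=4$ and $\hat{\bm{s}}=\bm{e}/2$ we get $\|\bm{\phi}(\hat{\bm{s}})\|_{1}=1+2+\frac{1}{2}+\frac{6}{4}=5$, compared with the bound of $7.5$; the gap comes from extending the inner sum to all $j$ in (4.44).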

One way to choose the $p$ sample points to ensure that $\|\hat{\bm{Q}}^{-1}\|_{\infty}$ is not too large is to use $\bm{x}$ along with the positive and negative perturbations along the coordinate axes, $\bm{x}\pm\Delta\bm{e}_{i}$ for all $i\in\{1,\ldots,n\}$, and $\bm{x}+\Delta(\bm{e}_{i}+\bm{e}_{j})$ for all pairs $i<j$ in $\{1,\ldots,n\}$, giving $1+2n+n(n-1)/2=(n+1)(n+2)/2=p$ points in total. That is,

	$\displaystyle\{\bm{y}_{1},\ldots,\bm{y}_{p}\}$	$\displaystyle=\{\bm{x},\bm{x}+\Delta\bm{e}_{1},\ldots,\bm{x}+\Delta\bm{e}_{n},\bm{x}-\Delta\bm{e}_{1},\ldots,\bm{x}-\Delta\bm{e}_{n},$
		$\displaystyle\bm{x}+\Delta(\bm{e}_{1}+\bm{e}_{2}),\ldots,\bm{x}+\Delta(\bm{e}_{1}+\bm{e}_{n}),\bm{x}+\Delta(\bm{e}_{2}+\bm{e}_{3}),\ldots,\bm{x}+\Delta(\bm{e}_{n-1}+\bm{e}_{n})\}.$		(4.48)

Lemma 4.7.

For the interpolation set given by (4.48), we have $\|\hat{\bm{Q}}^{-1}\|_{\infty}\leq 8$ .

Proof.

Suppose we solve $\hat{\bm{Q}}\bm{u}=\bm{v}$ for some right-hand side $\bm{v}$ with $v_{i}=f(\bm{y}_{i})$ and $\|\bm{v}\|_{\infty}=1$ , and where $\bm{u}:=\begin{bmatrix}c\\ \hat{\bm{g}}\\ \operatorname{upper}(\hat{\bm{H}})\end{bmatrix}$ . Then $\max_{\bm{v}}\|\bm{u}\|_{\infty}=\|\hat{\bm{Q}}^{-1}\|_{\infty}$ and so it suffices to find an upper bound on $\|\bm{u}\|_{\infty}$ .

The choice of interpolation points (4.48) gives

\displaystyle\{\hat{\bm{s}}_{1},\ldots,\hat{\bm{s}}_{p}\}=\{\bm{0},\bm{e}_{1},\ldots,\bm{e}_{n},-\bm{e}_{1},\ldots,-\bm{e}_{n},\bm{e}_{1}+\bm{e}_{2},\ldots,\bm{e}_{1}+\bm{e}_{n},\bm{e}_{2}+\bm{e}_{3},\ldots,\bm{e}_{n-1}+\bm{e}_{n}\}

(4.49)

in the linear system (4.34). The first row of (4.34) immediately gives $c=v_{1}$ and so $|c|\leq 1$ since $\|\bm{v}\|_{\infty}\leq 1$ .

Next, for $i\in\{1,\ldots,n\}$ , the values $v_{i+1}$ and $v_{n+i+1}$ (corresponding to $\hat{\bm{s}}=\pm\bm{e}_{i}$ ) completely determine $\hat{g}_{i}$ and $\hat{H}_{i,i}$ via rows $i+1$ and $n+i+1$ of (4.34), namely

\displaystyle c+\hat{g}_{i}+\frac{1}{2}\hat{H}_{i,i}=v_{i+1},\qquad\text{and}\qquad c-\hat{g}_{i}+\frac{1}{2}\hat{H}_{i,i}=v_{n+i+1},

(4.50)

which gives $\hat{g}_{i}=\frac{1}{2}(v_{i+1}-v_{n+i+1})$ and $\hat{H}_{i,i}=v_{i+1}+v_{n+i+1}-2c$, and so $|\hat{g}_{i}|\leq 1$ and $|\hat{H}_{i,i}|\leq 4$ from $|c|,|v_{i+1}|,|v_{n+i+1}|\leq 1$.

Lastly, in $\hat{\bm{Q}}$ , the column corresponding to $\hat{H}_{i,j}$ for $i<j$ only has one nonzero entry, a 1 in the $t$ -th row corresponding to $\hat{\bm{s}}_{t}=\bm{e}_{i}+\bm{e}_{j}$ . Hence we get $\hat{H}_{i,j}=v_{t}-c-\hat{g}_{i}-\hat{g}_{j}-\frac{1}{2}\hat{H}_{i,i}-\frac{1}{2}\hat{H}_{j,j}$ and so $|\hat{H}_{i,j}|\leq 8$ . ∎
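Lemma 4.7 can also be checked by forming $\hat{\bm{Q}}$ explicitly for the scaled points (4.49) and computing $\|\hat{\bm{Q}}^{-1}\|_{\infty}$. The following is a minimal sketch (Python with NumPy); the function names are hypothetical and the basis ordering follows (4.30).

```python
import numpy as np

def phi(s):
    """Natural quadratic basis (4.30) evaluated at s."""
    n = s.size
    quad = []
    for i in range(n):
        quad.append(0.5 * s[i] ** 2)
        quad.extend(s[i] * s[j] for j in range(i + 1, n))
    return np.concatenate([[1.0], s, quad])

def scaled_points(n):
    """The scaled interpolation set (4.49), one point per row."""
    E = np.eye(n)
    pts = [np.zeros(n)] + list(E) + list(-E)
    pts += [E[i] + E[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(pts)

def q_hat_inverse_norm(n):
    """Infinity norm of the inverse of Q_hat in (4.34) for the set (4.49)."""
    Q_hat = np.array([phi(s) for s in scaled_points(n)])
    return np.linalg.norm(np.linalg.inv(Q_hat), ord=np.inf)

for n in [1, 2, 5, 10]:
    print(f"n={n}: ||Q_hat^-1||_inf = {q_hat_inverse_norm(n):.3f} (bound: 8)")
```

The computed norm is 4 for each $n$, which is consistent with the bound of 8. Substituting (4.50) into the expression for $\hat{H}_{i,j}$ gives $\hat{H}_{i,j}=v_{t}-v_{i+1}-v_{j+1}+c$, so each row of $\hat{\bm{Q}}^{-1}$ has absolute sum at most 4.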

All together, if we run Algorithm 3.2 with quadratic models given by solving (4.33) with points (4.48) at each iteration, we get the following second-order complexity result.

Corollary 4.8.

Suppose the assumptions of Theorem 3.16 hold and the sequence $\{\|\nabla^{2}f(\bm{x}_{k})\|\}_{k=0}^{\infty}$ is uniformly bounded (this is required for the derivative-based theory, via Assumption 2.11 9 and $\bm{H}_{k}=\nabla^{2}f(\bm{x}_{k})$ in Theorem 2.13). If the local quadratic model at each iteration of Algorithm 3.2 is generated by quadratic interpolation to the points (4.48) (with $\bm{x}=\bm{x}_{k}$ and $\Delta=\Delta_{k}$), then the iterates of Algorithm 3.2 achieve second-order optimality $\sigma_{k}<\epsilon$ for the first time after at most $\mathcal{O}(n^{4}\epsilon^{-3})$ iterations and $\mathcal{O}(n^{6}\epsilon^{-3})$ objective evaluations.

Proof.

We note that all the interpolation points (4.48) satisfy $\|\bm{y}_{i}-\bm{x}\|\leq\sqrt{2}\Delta$ (i.e. $\beta=\sqrt{2}$ in the assumptions of Theorem 4.6). From Theorem 4.6 and Lemma 4.7, the model at each iteration is fully quadratic with $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}}=\mathcal{O}(nL_{2})$ . We also get $\|\bm{H}_{k}\|\leq\|\nabla^{2}f(\bm{x}_{k})\|+\kappa_{\textnormal{mh}}\Delta_{\max}$ from (3.25c) and so $\kappa_{H}=\mathcal{O}(\kappa_{\textnormal{mh}})$ . We then apply Corollary 3.17 to get the iteration bound. The objective evaluation bound comes from observing that our interpolation mechanism requires evaluating $f$ at no more than $p=(n+1)(n+2)/2$ points per iteration. ∎

Just as in the linear interpolation case (Corollary 4.5), we can improve the explicit dependence on dimension in the worst-case complexity bound by making the coordinate perturbations depend on the dimension. However, this construction does not align with a single interpolation set; rather, it uses different objective evaluations to construct $\bm{g}_{k}$ and $\bm{H}_{k}$ in iteration $k$. Specifically, we use the central difference approximation to the gradient with stepsize $\Delta_{k}/n^{1/4}$, and the forward difference approximation to the Hessian with stepsize $\Delta_{k}/n$:


$\displaystyle[\bm{g}_{k}]_{i}$	$\displaystyle=\frac{f(\bm{x}_{k}+\Delta_{k}\bm{e}_{i}/n^{1/4})-f(\bm{x}_{k}-\Delta_{k}\bm{e}_{i}/n^{1/4})}{2(\Delta_{k}/n^{1/4})},$	(4.51a)
$\displaystyle[\bm{H}_{k}]_{i,j}$	$\displaystyle=\frac{f(\bm{x}_{k}+\Delta_{k}(\bm{e}_{i}+\bm{e}_{j})/n)-f(\bm{x}_{k}+\Delta_{k}\bm{e}_{i}/n)-f(\bm{x}_{k}+\Delta_{k}\bm{e}_{j}/n)+f(\bm{x}_{k})}{(\Delta_{k}/n)^{2}},$	(4.51b)

for all $i,j=1,\ldots,n$ , which requires $(n+1)(n+2)/2+2n$ objective evaluations because the calculation of $\bm{g}_{k}$ cannot re-use any objective evaluations used to construct $\bm{H}_{k}$ .
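A direct implementation of (4.51) makes the evaluation count explicit. The sketch below (Python with NumPy) wraps the objective in a hypothetical `CountingFunction` helper, evaluates each Hessian pair $i\leq j$ once and symmetrizes, and compares the number of evaluations with $(n+1)(n+2)/2+2n$. The quadratic test function is an assumption chosen because both difference formulas are exact for quadratics (up to rounding).

```python
import numpy as np

class CountingFunction:
    """Wrap f and count how many times it is evaluated."""
    def __init__(self, f):
        self.f, self.count = f, 0

    def __call__(self, y):
        self.count += 1
        return self.f(y)

def fd_model(f, x, delta):
    """Gradient and Hessian estimates from (4.51a) and (4.51b)."""
    n = x.size
    E = np.eye(n)
    hg = delta / n**0.25            # gradient stepsize Delta / n^(1/4)
    hh = delta / n                  # Hessian stepsize Delta / n
    g = np.array([(f(x + hg * e) - f(x - hg * e)) / (2 * hg) for e in E])
    f0 = f(x)
    f1 = np.array([f(x + hh * e) for e in E])
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):       # evaluate each pair once, then symmetrize
            fij = f(x + hh * (E[i] + E[j]))
            H[i, j] = H[j, i] = (fij - f1[i] - f1[j] + f0) / hh**2
    return g, H

# Hypothetical quadratic test function with gradient b + A y and Hessian A.
A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, -1.0], [0.0, -1.0, 1.0]])
b = np.array([1.0, -2.0, 0.5])
counted = CountingFunction(lambda y: b @ y + 0.5 * y @ A @ y)
x, delta = np.array([0.2, -0.4, 1.0]), 0.1
n = x.size
g, H = fd_model(counted, x, delta)
print(counted.count, (n + 1) * (n + 2) // 2 + 2 * n)
print(np.allclose(g, b + A @ x), np.allclose(H, A))
```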

Corollary 4.9.

Suppose the assumptions of Theorem 3.16 hold and the sequence $\{\|\nabla^{2}f(\bm{x}_{k})\|\}_{k=0}^{\infty}$ is uniformly bounded. If the local quadratic model at each iteration of Algorithm 3.2 is generated as in (4.51), then the iterates of Algorithm 3.2 achieve second-order optimality $\sigma_{k}<\epsilon$ for the first time after at most $\mathcal{O}(\epsilon^{-3})$ iterations and $\mathcal{O}(n^{2}\epsilon^{-3})$ objective evaluations.

Proof.

From [21, Theorem 2.2] or [59, Lemma 5] we have $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\frac{L_{2}}{6}\Delta_{k}^{2}$ and from [32, Proposition 2.7] or [59, Lemma 6] we have $\|\bm{H}_{k}-\nabla^{2}f(\bm{x}_{k})\|\leq\frac{(\sqrt{2}+1)L_{2}}{3}\Delta_{k}$ . By arguments similar to the proof of Lemma 4.1(b), the model is fully quadratic with $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}},\kappa_{\textnormal{mh}}=\mathcal{O}(L_{2})$ (i.e. no explicit dependence on $n$ ). The remainder of the proof follows that of Corollary 4.8, but with the slightly larger $(n+1)(n+2)/2+2n$ objective evaluations per iteration. ∎

In terms of the dependency on $n$ , this objective evaluation bound again matches what we may expect from a derivative-based method using (genuine small- $h$ ) finite differencing to approximate gradients and Hessians.

4.3 Underdetermined Quadratic Interpolation

So far, we have the option of constructing either fully linear models with linear interpolation (using $n+1$ points), or constructing fully quadratic models with quadratic interpolation (using $(n+1)(n+2)/2$ points). In most cases, quadratic models give much better practical performance because they can adapt to the local curvature of $f$ , and so should be preferred over linear models, but here they come with the downside of requiring significantly more interpolation points—and hence objective evaluations—at each iteration. In this section, we outline an approach for constructing quadratic models (4.29) with $p\in\{n+2,\ldots,(n+1)(n+2)/2-1\}$ interpolation points. In this case we have more degrees of freedom in the model than available evaluations, and so the interpolation problem is underdetermined (i.e. there are infinitely many quadratic functions satisfying the interpolation conditions). We describe one particular approach, minimum Frobenius norm underdetermined quadratic interpolation.

To that end, suppose we again have $p$ interpolation points $\bm{y}_{1},\ldots,\bm{y}_{p}$ , but now with $p\in\{n+2,\ldots,(n+1)(n+2)/2-1\}$ . In general, there are infinitely many quadratic models $m$ (4.29) satisfying $m(\bm{y}_{i})=f(\bm{y}_{i})$ for all $i=1,\ldots,p$ . Of these, we pick the model for which the Hessian $\bm{H}$ has minimal Frobenius norm:

\displaystyle\min_{c,\bm{g},\bm{H}}\frac{1}{4}\|\bm{H}\|_{F}^{2},\qquad\text{s.t.}\quad\text{$m(\bm{y}_{i})=f(\bm{y}_{i})$ for all $i=1,\ldots,p$},\quad\text{and}\quad\bm{H}=\bm{H}^{T}.

(4.52)

Since $m$ is linear in the coefficients $c$, $\bm{g}$ and $\bm{H}$, this is a convex quadratic program with linear equality constraints, and so can be solved via a linear system.

Lemma 4.10 (Section 2, [119]).

The solution to (4.52) can be obtained by solving the $(p+n+1)\times(p+n+1)$ linear system (where we use $\bm{F}$ to denote the matrix in (4.65), as we are doing Frobenius-norm interpolation)

\displaystyle\underbrace{\left[\begin{array}[]{c|c}\bm{P}&\bm{M}\\ \hline\cr\bm{M}^{T}&\bm{0}\end{array}\right]}_{=:\bm{F}}\left[\begin{array}[]{c}\lambda_{1}\\ \vdots\\ \lambda_{p}\\ \hline\cr c\\ \bm{g}\end{array}\right]=\left[\begin{array}[]{c}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\\ \hline\cr 0\\ \bm{0}\end{array}\right],

(4.65)

where $\bm{P}\in\mathbb{R}^{p\times p}$ has entries $P_{i,j}=\frac{1}{2}[(\bm{y}_{i}-\bm{x})^{T}(\bm{y}_{j}-\bm{x})]^{2}$ and

\displaystyle\bm{M}=\begin{bmatrix}1&(\bm{y}_{1}-\bm{x})^{T}\\ \vdots&\vdots\\ 1&(\bm{y}_{p}-\bm{x})^{T}\end{bmatrix}\in\mathbb{R}^{p\times(n+1)},

(4.66)

cf. (4.17). The model Hessian is given by $\bm{H}=\sum_{i=1}^{p}\lambda_{i}(\bm{y}_{i}-\bm{x})(\bm{y}_{i}-\bm{x})^{T}$.
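To see where this system comes from, here is a brief sketch of the optimality conditions for (4.52). This is our own outline under the definitions above, not a reproduction of the argument in [119]. Writing $\bm{s}_{i}:=\bm{y}_{i}-\bm{x}$ (notation introduced only for this sketch) and introducing a multiplier $\lambda_{i}$ for each interpolation condition, the Lagrangian is

\displaystyle\mathcal{L}(c,\bm{g},\bm{H},\bm{\lambda})=\frac{1}{4}\|\bm{H}\|_{F}^{2}-\sum_{i=1}^{p}\lambda_{i}\left(c+\bm{g}^{T}\bm{s}_{i}+\frac{1}{2}\bm{s}_{i}^{T}\bm{H}\bm{s}_{i}-f(\bm{y}_{i})\right).

Setting the derivatives with respect to $\bm{H}$, $c$ and $\bm{g}$ to zero gives

\displaystyle\bm{H}=\sum_{i=1}^{p}\lambda_{i}\bm{s}_{i}\bm{s}_{i}^{T},\qquad\sum_{i=1}^{p}\lambda_{i}=0,\qquad\text{and}\qquad\sum_{i=1}^{p}\lambda_{i}\bm{s}_{i}=\bm{0},

where the last two conditions are the block $\bm{M}^{T}\bm{\lambda}=\bm{0}$ of the system, and this $\bm{H}$ is automatically symmetric so the symmetry constraint needs no separate multiplier. Substituting this $\bm{H}$ into the $i$-th interpolation condition then gives

\displaystyle c+\bm{g}^{T}\bm{s}_{i}+\frac{1}{2}\sum_{j=1}^{p}\lambda_{j}(\bm{s}_{i}^{T}\bm{s}_{j})^{2}=f(\bm{y}_{i}),

which is the $i$-th row of the system, with $P_{i,j}=\frac{1}{2}(\bm{s}_{i}^{T}\bm{s}_{j})^{2}$.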

Remark 4.11.

If $p=(n+1)(n+2)/2$, then we can still get a model by solving (4.52) via (4.65). However, since we have sufficient degrees of freedom that the interpolation conditions uniquely define the model, there is only one valid $\bm{H}$ (corresponding to fully quadratic interpolation), and so we are better off solving the smaller system (4.33). Nevertheless, this does mean that all the results below also apply to fully quadratic models.

Once again, if $\Delta$ is a typical size of $\|\bm{y}_{i}-\bm{x}\|$ and we define $\hat{\bm{s}}_{i}:=(\bm{y}_{i}-\bm{x})/\Delta$ for all $i\in\{1,\ldots,p\}$ , we get the scaled interpolation system

\displaystyle\underbrace{\left[\begin{array}[]{c|c}\hat{\bm{P}}&\hat{\bm{M}}\\ \hline\cr\hat{\bm{M}}^{T}&\bm{0}\end{array}\right]}_{=:\hat{\bm{F}}}\left[\begin{array}[]{c}\hat{\lambda}_{1}\\ \vdots\\ \hat{\lambda}_{p}\\ \hline\cr c\\ \hat{\bm{g}}\end{array}\right]=\left[\begin{array}[]{c}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\\ \hline\cr 0\\ \bm{0}\end{array}\right],

(4.79)

where $\hat{P}_{i,j}=\frac{1}{2}[\hat{\bm{s}}_{i}^{T}\hat{\bm{s}}_{j}]^{2}$ and $\hat{\bm{M}}$ has $i$-th row $[1\>\hat{\bm{s}}_{i}^{T}]$ (cf. (4.18)). We get the coefficients $\hat{\bm{g}}=\Delta\>\bm{g}$ and $\hat{\lambda}_{i}=\Delta^{4}\lambda_{i}$, and so $\hat{\bm{H}}:=\sum_{i=1}^{p}\hat{\lambda}_{i}\hat{\bm{s}}_{i}\hat{\bm{s}}_{i}^{T}$ satisfies $\hat{\bm{H}}=\Delta^{2}\bm{H}$.
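The scaled system (4.79) is straightforward to assemble and solve. The following sketch (Python with NumPy) does so and maps the solution back to $c$, $\bm{g}$ and $\bm{H}$ using the scalings above. The function name, the quadratic test function and the choice of $p=2n+1$ points (the base point and positive and negative coordinate steps) are assumptions for illustration; for this point set $\hat{\bm{F}}$ is invertible.

```python
import numpy as np

def min_frobenius_model(f, x, Y, delta):
    """Solve the scaled system (4.79); return (c, g, H) for the model (4.29)."""
    p, n = Y.shape
    S_hat = (Y - x) / delta
    P_hat = 0.5 * (S_hat @ S_hat.T) ** 2               # P_hat[i, j] = (s_i^T s_j)^2 / 2
    M_hat = np.hstack([np.ones((p, 1)), S_hat])
    F_hat = np.block([[P_hat, M_hat],
                      [M_hat.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([[f(y) for y in Y], np.zeros(n + 1)])
    sol = np.linalg.solve(F_hat, rhs)
    lam_hat, c, g_hat = sol[:p], sol[p], sol[p + 1:]
    H_hat = (S_hat.T * lam_hat) @ S_hat                # sum_i lam_hat_i s_i s_i^T
    return c, g_hat / delta, H_hat / delta**2          # undo the scaling

# Hypothetical quadratic test function with Hessian A.
A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, -1.0], [0.0, -1.0, 1.0]])
b = np.array([1.0, -2.0, 0.5])
f = lambda y: b @ y + 0.5 * y @ A @ y
x, delta = np.array([0.2, -0.4, 1.0]), 0.1
E = np.eye(3)
Y = x + delta * np.vstack([np.zeros(3), E, -E])        # p = 7 points (fewer than 10)
c, g, H = min_frobenius_model(f, x, Y, delta)
model = lambda y: c + g @ (y - x) + 0.5 * (y - x) @ H @ (y - x)
print(max(abs(model(y) - f(y)) for y in Y))            # interpolation residual
print(np.round(H, 6))
```

In this example the sample points carry no information about the off-diagonal curvature of $f$, so the minimum Frobenius norm model sets those entries of $\bm{H}$ to zero and recovers only the diagonal of $\bm{A}$, while still satisfying all seven interpolation conditions.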

Remark 4.12.

We note that $\bm{F}$ and $\hat{\bm{F}}$ are symmetric, and both $\bm{P}$ and $\hat{\bm{P}}$ are positive semidefinite. This can be seen, e.g. for $\hat{\bm{P}}$ , by observing that for any $\bm{v}\in\mathbb{R}^{p}$ we have

\displaystyle\bm{v}^{T}\hat{\bm{P}}\bm{v}=\frac{1}{2}\sum_{i,j=1}^{p}v_{i}v_{j}\left(\sum_{k=1}^{n}\hat{s}_{i,k}\hat{s}_{j,k}\right)^{2}=\frac{1}{2}\sum_{i,j=1}^{p}\sum_{k,\ell=1}^{n}v_{i}v_{j}\hat{s}_{i,k}\hat{s}_{j,k}\hat{s}_{i,\ell}\hat{s}_{j,\ell}=\frac{1}{2}\sum_{k,\ell=1}^{n}\left(\sum_{i=1}^{p}v_{i}\hat{s}_{i,k}\hat{s}_{i,\ell}\right)^{2}\geq 0.

(4.80)

Hence $\bm{F}$ and $\hat{\bm{F}}$ yield saddle point linear systems, which have been widely studied [20].

Since we choose to ensure $\|\bm{H}\|_{F}$ is small, we can provide an upper bound on the size of the model Hessian. This will be necessary for us to apply Lemma 4.1(a) to show that the model is fully linear.

Lemma 4.13.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.79), where we assume $\hat{\bm{F}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ , then the model Hessian satisfies

\displaystyle\|\bm{H}\|\leq\kappa_{H}:=\frac{L_{1}}{2}p\beta^{4}\|\hat{\bm{F}}^{-1}\|_{\infty}.

(4.81)

Proof.

The model Hessian does not change if a linear function is added to the objective function: by changing $c$ and $\bm{g}$ suitably, any Hessian that interpolates $f$ also interpolates $f$ plus a linear function, so the minimizer $\bm{H}$ must be the same. Hence, we consider interpolating to the function $\tilde{f}(\bm{y}):=f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})$. By Lemma 2.6 we have $|\tilde{f}(\bm{y}_{i})|\leq\frac{L_{1}}{2}\|\bm{y}_{i}-\bm{x}\|^{2}$, and so all entries of the right-hand side of (4.79) are bounded in magnitude by $\frac{L_{1}}{2}\beta^{2}\Delta^{2}$. This means that for all $i=1,\ldots,p$ we have

\displaystyle|\lambda_{i}|=\Delta^{-4}|\hat{\lambda}_{i}|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{-2}\|\hat{\bm{F}}^{-1}\|_{\infty}.

(4.82)

Finally, we have

\displaystyle\|\bm{H}\|\leq\sum_{i=1}^{p}|\lambda_{i}|\>\|\bm{y}_{i}-\bm{x}\|^{2}\leq p\left(\frac{L_{1}}{2}\beta^{2}\Delta^{-2}\|\hat{\bm{F}}^{-1}\|_{\infty}\right)\beta^{2}\Delta^{2},

(4.83)

and the result follows. ∎

We can now establish that our minimum Frobenius norm quadratic interpolation model is indeed fully linear.

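Theorem 4.14.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.79), where we assume $\hat{\bm{F}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$, then the model is fully linear in $B(\bm{x},\Delta)$ with constants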

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}+\kappa_{H}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{\dagger}\|_{\infty}+\frac{L_{1}+\kappa_{H}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}+2\kappa_{H},

(4.84)

using the value of $\kappa_{H}$ defined in Lemma 4.13.

Proof.

Fix $\bm{y}\in B(\bm{x},\Delta)$ and define $\hat{\bm{s}}:=(\bm{y}-\bm{x})/\Delta$ . Since $\hat{\bm{F}}$ is invertible, $\hat{\bm{M}}$ has full column rank [20, Theorem 3.3], so the rows of $\hat{\bm{M}}$ span $\mathbb{R}^{n+1}$ . Hence, the system $\hat{\bm{M}}^{T}\bm{v}=\begin{bmatrix}1\\ \hat{\bm{s}}\end{bmatrix}$ is consistent. So, we take $\bm{v}=(\hat{\bm{M}}^{T})^{\dagger}\begin{bmatrix}1\\ \hat{\bm{s}}\end{bmatrix}$ to be the minimal norm solution, with $\|\bm{v}\|_{1}\leq\|\hat{\bm{M}}^{\dagger}\|_{\infty}(1+\sqrt{n})$ by similar reasoning as in the proof of Theorem 4.2. Multiplying the last $n$ rows of this (consistent) equation by $\Delta$ , we again get (4.21) for our new $\bm{v}$ . By following the same argument as for (4.24) and using the interpolation condition $m(\bm{y}_{i})=f(\bm{y}_{i})$ we get

$\displaystyle\|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\|$	$\displaystyle=\left\|\sum_{i=1}^{p}v_{i}\left(c+\bm{g}^{T}(\bm{y}_{i}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{i}-\bm{x})\right)\right\|,$	(4.85)
	$\displaystyle\leq\left\|\sum_{i=1}^{p}v_{i}(m(\bm{y}_{i})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{i}-\bm{x}))\right\|$
	$\displaystyle\qquad\qquad\qquad+\left\|\sum_{i=1}^{p}\frac{1}{2}v_{i}(\bm{y}_{i}-\bm{x})^{T}\bm{H}(\bm{y}_{i}-\bm{x})\right\|,$	(4.86)
	$\displaystyle\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}\|\bm{v}\|_{1}+\frac{1}{2}\|\bm{H}\|\beta^{2}\Delta^{2}\|\bm{v}\|_{1},$	(4.87)

and so

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}+\kappa_{H}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{\dagger}\|_{\infty}\Delta^{2},

(4.88)

which finally gives

	$\displaystyle\|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\|$	$\displaystyle\leq\frac{L_{1}+\kappa_{H}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{\dagger}\|_{\infty}\Delta^{2}+\frac{1}{2}\|(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x})\|,$		(4.89)
		$\displaystyle\leq\left(\frac{L_{1}+\kappa_{H}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{\dagger}\|_{\infty}+\frac{\kappa_{H}}{2}\right)\Delta^{2}.$		(4.90)

The result then follows from Lemma 4.1(a). ∎

A popular structured interpolation set used for minimum Frobenius norm interpolation is $\bm{y}_{1}=\bm{x}$ , $\bm{y}_{i+1}=\bm{x}+\Delta\bm{e}_{i}$ and $\bm{y}_{n+i+1}=\bm{x}-\Delta\bm{e}_{i}$ for $i=1,\ldots,n$ , with $p=2n+1$ . This gives the associated interpolation matrix

\displaystyle\hat{\bm{F}}=\left[\begin{array}[]{ccc|cc}0&\bm{0}^{T}&\bm{0}^{T}&1&\bm{0}^{T}\\ \bm{0}&\frac{1}{2}\bm{I}&\frac{1}{2}\bm{I}&\bm{e}&\bm{I}\\ \bm{0}&\frac{1}{2}\bm{I}&\frac{1}{2}\bm{I}&\bm{e}&-\bm{I}\\ \hline\cr 1&\bm{e}^{T}&\bm{e}^{T}&0&\bm{0}^{T}\\ \bm{0}&\bm{I}&-\bm{I}&\bm{0}&\bm{0}\end{array}\right],\qquad\text{with}\qquad\hat{\bm{F}}^{-1}=\left[\begin{array}[]{ccc|cc}2n&-\bm{e}^{T}&-\bm{e}^{T}&1&\bm{0}^{T}\\ -\bm{e}&\frac{1}{2}\bm{I}&\frac{1}{2}\bm{I}&\bm{0}&\frac{1}{2}\bm{I}\\ -\bm{e}&\frac{1}{2}\bm{I}&\frac{1}{2}\bm{I}&\bm{0}&-\frac{1}{2}\bm{I}\\ \hline\cr 1&\bm{0}^{T}&\bm{0}^{T}&0&\bm{0}^{T}\\ \bm{0}&\frac{1}{2}\bm{I}&-\frac{1}{2}\bm{I}&\bm{0}&\bm{0}\end{array}\right],

(4.101)

and so $\|\hat{\bm{F}}^{-1}\|_{\infty}=4n+1$ . Applying Lemma 4.13, we get the bound $\kappa_{H}=\mathcal{O}(pn)=\mathcal{O}(n^{2})$ . Combining with $\|\hat{\bm{M}}^{\dagger}\|_{\infty}=1$ (which can be verified by direct computation), Theorem 4.14 tells us that the model is fully linear with $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(\sqrt{n}\>\kappa_{H})=\mathcal{O}(n^{5/2})$ . Applying these results to Corollary 3.8, we get a worst-case complexity for Algorithm 3.1 of $\mathcal{O}(n^{7}\epsilon^{-2})$ iterations and $\mathcal{O}(n^{8}\epsilon^{-2})$ objective evaluations. This is significantly worse than linear and even (fully) quadratic interpolation!

However, luckily for us it turns out that the bound $\kappa_{H}=\mathcal{O}(n^{2})$ from Lemma 4.13 is overly conservative in this case, because only the first row of $\hat{\bm{F}}^{-1}$ has an absolute row sum of size $\mathcal{O}(n)$ , and this row is only used to calculate the model coefficient $\lambda_{1}$ . Moreover, $\lambda_{1}$ only contributes to the model Hessian via $\lambda_{1}(\bm{y}_{1}-\bm{x})(\bm{y}_{1}-\bm{x})^{T}$ , so since $\bm{y}_{1}=\bm{x}$ , it does not affect the Hessian. Indeed, using the explicit form of $\hat{\bm{F}}^{-1}$ (4.101), we compute that $\lambda_{i+1}=\lambda_{n+i+1}=\Delta^{-4}[-f(\bm{x})+\frac{1}{2}f(\bm{x}+\Delta\bm{e}_{i})+\frac{1}{2}f(\bm{x}-\Delta\bm{e}_{i})]$ for $i=1,\ldots,n$ , and so the model Hessian is

\displaystyle\bm{H}=\sum_{i=1}^{n}(\lambda_{i+1}+\lambda_{n+i+1})\Delta^{2}\bm{e}_{i}\bm{e}_{i}^{T}=\operatorname{diag}\left(\left\{\frac{f(\bm{x}+\Delta\bm{e}_{i})-2f(\bm{x})+f(\bm{x}-\Delta\bm{e}_{i})}{\Delta^{2}}:i=1,\ldots,n\right\}\right).

(4.102)

As per the proof of Lemma 4.13, we may without loss of generality assume $|f(\bm{y}_{i})|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}$ , and so we get the improved bound $\|\bm{H}\|=\max_{i=1,\ldots,n}|H_{i,i}|\leq 2L_{1}\beta^{2}$ . Hence we may take $\kappa_{H}=\mathcal{O}(1)$ , a significant improvement over Lemma 4.13.
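The structure of $\hat{\bm{F}}$ and the resulting diagonal Hessian (4.102) can be checked numerically. The following is a minimal NumPy sketch, not part of the original analysis. It assumes the block form $\hat{\bm{F}}=\begin{bmatrix}\hat{\bm{P}}&\hat{\bm{M}}\\ \hat{\bm{M}}^{T}&\bm{0}\end{bmatrix}$ with $\hat{P}_{i,j}=\frac{1}{2}(\hat{\bm{s}}_{i}^{T}\hat{\bm{s}}_{j})^{2}$ and rows of $\hat{\bm{M}}$ equal to $[1,\hat{\bm{s}}_{i}^{T}]$ , as suggested by (4.80) and (4.101), and unknowns ordered as $(\Delta^{4}\bm{\lambda},c,\Delta\bm{g})$ . The function names and the test function `f` are hypothetical choices for illustration.

```python
import numpy as np

def build_F_hat(S_hat):
    """Scaled min-Frobenius-norm system matrix; S_hat has rows (y_i - x)/Delta."""
    p, n = S_hat.shape
    P_hat = 0.5 * (S_hat @ S_hat.T) ** 2          # assumed entries 0.5*(s_i^T s_j)^2
    M_hat = np.hstack([np.ones((p, 1)), S_hat])   # assumed rows [1, s_i^T]
    return np.block([[P_hat, M_hat], [M_hat.T, np.zeros((n + 1, n + 1))]])

def min_frobenius_model(f, x, Delta, S_hat):
    """Return (c, g, H) of the min-Frobenius-norm quadratic interpolation model."""
    p, n = S_hat.shape
    D = Delta * S_hat                             # rows are y_i - x
    fvals = np.array([f(x + d) for d in D])
    rhs = np.concatenate([fvals, np.zeros(n + 1)])
    sol = np.linalg.solve(build_F_hat(S_hat), rhs)
    lam = sol[:p] / Delta**4                      # undo the Delta^4 scaling
    c, g = sol[p], sol[p + 1:] / Delta            # undo the Delta scaling of g
    H = D.T @ (lam[:, None] * D)                  # sum_i lam_i (y_i-x)(y_i-x)^T
    return c, g, H

def coordinate_set(n):
    """Scaled displacements for the set {x, x + Delta e_i, x - Delta e_i}."""
    return np.vstack([np.zeros((1, n)), np.eye(n), -np.eye(n)])

n, Delta = 3, 0.1
x = np.array([0.3, -0.2, 0.5])
f = lambda y: np.sin(y[0]) + y[1] ** 2 + y[0] * y[2]   # arbitrary smooth example

S_hat = coordinate_set(n)
F_inv_norm = np.linalg.norm(np.linalg.inv(build_F_hat(S_hat)), np.inf)
c, g, H = min_frobenius_model(f, x, Delta, S_hat)

# Central second differences along each coordinate, as in (4.102)
E = Delta * np.eye(n)
H_fd = np.diag([(f(x + e) - 2 * f(x) + f(x - e)) / Delta**2 for e in E])
```

Under these assumptions, `F_inv_norm` should agree with $4n+1$ and `H` with `H_fd`, up to rounding error.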

Taking $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(\sqrt{n}\>\kappa_{H})$ from Theorem 4.14 with our tighter bound $\kappa_{H}=\mathcal{O}(1)$ in Corollary 3.8, we arrive at the same worst-case complexity as linear interpolation.

Corollary 4.15.

Under the assumptions of Theorem 3.7, if the local quadratic model at each iteration of Algorithm 3.1 is generated by minimum Frobenius norm quadratic interpolation to the points $\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{1},\ldots,\bm{x}_{k}+\Delta_{k}\bm{e}_{n},\bm{x}_{k}-\Delta_{k}\bm{e}_{1},\ldots,\bm{x}_{k}-\Delta_{k}\bm{e}_{n}\}$ , then the iterates of Algorithm 3.1 achieve $\|\nabla f(\bm{x}_{k})\|<\epsilon$ for the first time after at most $\mathcal{O}(n\epsilon^{-2})$ iterations and $\mathcal{O}(n^{2}\epsilon^{-2})$ objective evaluations.

Again, by scaling the perturbations in terms of the problem dimension, we can get improved dependence on dimension in the worst-case complexity bounds.

Corollary 4.16.

Under the assumptions of Theorem 3.7, if $\Delta_{k}\leq\Delta_{\max}$ for all $k$ and the local quadratic model at each iteration of Algorithm 3.1 is generated by minimum Frobenius norm quadratic interpolation to the points $\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{1}/n^{1/4},\ldots,\bm{x}_{k}+\Delta_{k}\bm{e}_{n}/n^{1/4},\bm{x}_{k}-\Delta_{k}\bm{e}_{1}/n^{1/4},\ldots,\bm{x}_{k}-\Delta_{k}\bm{e}_{n}/n^{1/4}\}$ , then the iterates of Algorithm 3.1 achieve $\|\nabla f(\bm{x}_{k})\|<\epsilon$ for the first time after at most $\mathcal{O}(\epsilon^{-2})$ iterations and $\mathcal{O}(n\epsilon^{-2})$ objective evaluations.

Proof.

This construction gives the same $\bm{g}_{k}$ as in (4.51), and so $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\frac{L_{1}}{2}\Delta_{k}^{2}\leq\frac{L_{1}\Delta_{\max}}{2}\Delta_{k}$ . By the same reasoning as above, we have $\kappa_{H}=\mathcal{O}(1)$ , and so $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(1)$ . The remainder of the proof is identical to the linear interpolation case (Corollary 4.5). ∎

Remark 4.17.

A simple modification of this approach, which can be beneficial in practice, is to look at the minimum change in the Hessian between successive iterations of our main algorithm. That is, given some old Hessian approximation $\bm{H}_{\textnormal{prev}}$ , we choose $\bm{H}=\bm{H}_{\textnormal{prev}}+\Delta\bm{H}$ where $\Delta\bm{H}$ solves

\displaystyle\min_{c,\bm{g},\Delta\bm{H}}\frac{1}{4}\|\Delta\bm{H}\|_{F}^{2},\qquad\text{s.t.}\quad\text{$m(\bm{y}_{i})=f(\bm{y}_{i})$ for all $i=1,\ldots,p$}\quad\text{and}\quad\Delta\bm{H}=\Delta\bm{H}^{T}.

(4.103)

The corresponding linear system is the same as (4.65) but where the entries in the right-hand side are replaced with $f(\bm{y}_{i})-\frac{1}{2}(\bm{y}_{i}-\bm{x})^{T}\bm{H}_{\textnormal{prev}}(\bm{y}_{i}-\bm{x})$ , and where $\bm{H}=\bm{H}_{\textnormal{prev}}+\sum_{i=1}^{p}\lambda_{i}(\bm{y}_{i}-\bm{x})(\bm{y}_{i}-\bm{x})^{T}$ . Following the proof of Lemma 4.13, we get $\|\Delta\bm{H}\|\leq\frac{L_{1}+\|\bm{H}_{\textnormal{prev}}\|}{2}p\beta^{4}\|\hat{\bm{F}}^{-1}\|_{\infty}$ and so our model Hessian can potentially grow rapidly,

\displaystyle\|\bm{H}\|\leq\|\bm{H}_{\textnormal{prev}}\|+\|\Delta\bm{H}\|\leq\left(1+\frac{1}{2}p\beta^{4}\|\hat{\bm{F}}^{-1}\|_{\infty}\right)\|\bm{H}_{\textnormal{prev}}\|+\frac{L_{1}}{2}p\beta^{4}\|\hat{\bm{F}}^{-1}\|_{\infty}.

(4.104)

Unfortunately, this rate of growth in the model Hessian is fast enough that global convergence results do not apply. (Trust-region methods can converge in the case of unbounded model Hessians, but the growth cannot exceed $\|\bm{H}_{k}\|=\mathcal{O}(k)$ ; see [142, 58] for more details.)
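To make the modification in Remark 4.17 concrete, here is a minimal NumPy sketch of the minimum-change Hessian update. It is an illustration under stated assumptions: it uses the scaled system matrix $\hat{\bm{F}}$ (assumed to have blocks $\hat{P}_{i,j}=\frac{1}{2}(\hat{\bm{s}}_{i}^{T}\hat{\bm{s}}_{j})^{2}$ and $\hat{\bm{M}}$ with rows $[1,\hat{\bm{s}}_{i}^{T}]$ ) rather than the unscaled system (4.65), and the function name and example data are hypothetical.

```python
import numpy as np

def min_change_hessian_update(fvals, S_hat, Delta, H_prev):
    """Min-change model: H = H_prev + dH with ||dH||_F minimal subject to interpolation.

    fvals[i] = f(y_i), S_hat has rows (y_i - x)/Delta, H_prev is symmetric (n x n).
    """
    p, n = S_hat.shape
    D = Delta * S_hat                                   # rows are y_i - x
    # Right-hand side entries: f(y_i) - 0.5 (y_i-x)^T H_prev (y_i-x)
    resid = fvals - 0.5 * np.einsum("ij,jk,ik->i", D, H_prev, D)
    P_hat = 0.5 * (S_hat @ S_hat.T) ** 2
    M_hat = np.hstack([np.ones((p, 1)), S_hat])
    F_hat = np.block([[P_hat, M_hat], [M_hat.T, np.zeros((n + 1, n + 1))]])
    sol = np.linalg.solve(F_hat, np.concatenate([resid, np.zeros(n + 1)]))
    lam = sol[:p] / Delta**4
    c, g = sol[p], sol[p + 1:] / Delta
    H = H_prev + D.T @ (lam[:, None] * D)               # H_prev + sum_i lam_i d_i d_i^T
    return c, g, H

# Example: a quadratic objective sampled on {x, x + Delta e_i, x - Delta e_i}
n, Delta = 2, 0.5
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
quad = lambda d: 3.0 + b @ d + 0.5 * d @ A @ d          # written in terms of d = y - x
S_hat = np.vstack([np.zeros((1, n)), np.eye(n), -np.eye(n)])
fvals = np.array([quad(Delta * s) for s in S_hat])

# If H_prev already matches the true Hessian, no change is needed
c1, g1, H1 = min_change_hessian_update(fvals, S_hat, Delta, A)
# Starting from H_prev = 0 recovers the plain minimum Frobenius norm model
c0, g0, H0 = min_change_hessian_update(fvals, S_hat, Delta, np.zeros((n, n)))
```

In this example the first call leaves the Hessian unchanged, while the second returns only the diagonal of `A`, since this interpolation set carries no information about off-diagonal curvature.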

4.4 Practical Model Construction

We have seen that choosing interpolation sets based on structured perturbations around $\bm{x}_{k}$ can yield fully linear/fully quadratic models. Although these can yield very good worst-case complexity bounds (e.g. Corollary 4.5), they do not allow us to reuse any existing objective evaluations during model construction. Given we often turn to MBDFO methods when objective evaluations are expensive, a pragmatic approach would allow us to construct models based on existing objective evaluations, while controlling the size of the relevant matrix norms.

To do this, given $\bm{x}_{k}$ , $\Delta_{k}$ and a collection of existing evaluations $\mathcal{Y}\subset\mathbb{R}^{n}$ that are sufficiently close to $\bm{x}_{k}$ , we first try to find a subset of $\mathcal{Y}$ that (together with $\bm{x}_{k}$ ) is suitable for linear interpolation. If we cannot find $n$ points in $\mathcal{Y}$ that yield a sufficiently good linear interpolation set, we augment this set based on the null space of selected perturbations around $\bm{x}_{k}$ (i.e. $\bm{y}-\bm{x}_{k}$ for selected points $\bm{y}\in\mathcal{Y}$ ). By maintaining a QR factorization of the selected perturbations, we can control $\|\hat{\bm{M}}^{-1}\|$ (4.18) directly based on the diagonal entries of the $R$ matrix [147, Lemma 2.4]. This procedure is formalized in [146, Figure 4.2], and extended to bound-constrained problems in [53].
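The idea of reusing existing evaluations and completing the set with directions orthogonal to the selected perturbations can be sketched as follows. This is a simplified greedy illustration, not the procedure of [146, Figure 4.2]: the function name, the acceptance threshold `theta` and the radius factor `beta` are hypothetical. A candidate is accepted when the part of its scaled perturbation orthogonal to the previously accepted ones has norm at least `theta`; these norms are the magnitudes of the diagonal entries of the $R$ factor in a QR factorization of the accepted perturbations.

```python
import numpy as np

def select_linear_set(x, Delta, Y, theta=0.1, beta=1.0):
    """Greedily pick up to n scaled perturbations (y - x)/Delta from the rows of Y.

    Returns (S, n_reused): S is (n, n) with rows the scaled perturbations, of which
    the first n_reused come from Y and the rest are new orthogonal directions.
    """
    n = x.size
    Q = np.zeros((n, 0))            # orthonormal basis for the accepted perturbations
    S = []
    for y in Y:
        if len(S) == n:
            break
        s = (y - x) / Delta
        if np.linalg.norm(s) > beta:
            continue                # point is too far from x to be reused
        r = s - Q @ (Q.T @ s)       # component orthogonal to accepted perturbations
        r_norm = np.linalg.norm(r)
        if r_norm >= theta:         # sufficiently independent: accept
            S.append(s)
            Q = np.hstack([Q, (r / r_norm)[:, None]])
    n_reused = len(S)
    if n_reused < n:
        # Complete with an orthonormal basis of the orthogonal complement of span(Q)
        Q_full, _ = np.linalg.qr(np.hstack([Q, np.eye(n)]))
        S.extend(Q_full[:, n_reused:].T)
    return np.array(S), n_reused

x = np.zeros(3)
Y = np.array([[0.5, 0.0, 0.0],      # accepted
              [0.51, 0.001, 0.0],   # nearly parallel to the first point: rejected
              [0.0, 0.4, 0.0],      # accepted
              [5.0, 5.0, 5.0]])     # too far away: skipped
S, n_reused = select_linear_set(x, 1.0, Y)
```

Only the directions appended in the completion step require new objective evaluations, at the points $\bm{x}+\Delta\bm{s}$ for the corresponding rows $\bm{s}$ of `S`.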

We can then optionally choose to extend the selected interpolation set by adding more points (that are sufficiently close to $\bm{x}_{k}$ ) while ensuring that $\|\hat{\bm{F}}^{-1}\|$ (4.79) is not too large. Two approaches for doing this are:

• Ensuring the new point added does not significantly increase $\|\hat{\bm{P}}^{-1}\|$ by using a QR factorization of $\hat{\bm{M}}^{T}$ and a Cholesky-like factorization of $\hat{\bm{P}}$ , as described in [148, Algorithm 4.2];

• Ensuring the new point added does not significantly increase $\|\hat{\bm{F}}^{-1}\|$ by controlling the Schur complement of the resulting $\hat{\bm{F}}$ after the new point is added [130, Algorithm 5.2].

The first of these approaches is used in the IBCDFO software collection (see Section 8).

In both these approaches, points are appended to the interpolation set from the set of existing evaluations, with the model quality controlled via the norm of the (inverse) interpolation matrix as per the error bounds derived in this section. In Section 5, we will instead consider geometric approaches for incrementally improving an interpolation set, where we take an existing interpolation set and determine how to change a small number of points to improve its quality.

4.5 Composite Models

Another important construction is for models where the objective function $f$ has some known structure. For example, this may be where $f$ results from composing a known function with a function for which derivatives are unavailable, namely

\displaystyle f(\bm{x})=H(\bm{r}(\bm{x})),

(4.105)

where $\bm{r}:\mathbb{R}^{n}\to\mathbb{R}^{q}$ (for some $q$ ) is smooth but has only a zeroth order oracle, and the smooth function $H:\mathbb{R}^{q}\to\mathbb{R}$ and its derivatives are completely known. Perhaps the most common example of (4.105) is nonlinear least-squares minimization, where $H(\bm{r})=\frac{1}{2}\|\bm{r}\|^{2}$ .

In such settings, because we receive more information with every zeroth order oracle query (the full vector $\bm{r}(\bm{x})$ instead of the scalar $f(\bm{x})$ ), it can often be sufficient to build linear interpolation models for $\bm{r}(\bm{x})$ , and extend these to a quadratic model for $f$ using (4.105). For example, in the nonlinear least-squares case, if we sample $\bm{r}(\bm{y}_{i})$ for $i=1,\ldots,n+1$ and perform linear interpolation on each component of $\bm{r}(\bm{x})$ , we get a model

\displaystyle\bm{r}(\bm{y})\approx\bm{m}(\bm{y}):=\bm{c}+\bm{J}(\bm{y}-\bm{x}),

(4.106)

for some $\bm{c}\in\mathbb{R}^{q}$ and $\bm{J}\in\mathbb{R}^{q\times n}$ . We then naturally get a quadratic model for $f$ via

\displaystyle f(\bm{y})\approx m(\bm{y}):=\frac{1}{2}\|\bm{m}(\bm{y})\|^{2}=\frac{1}{2}\|\bm{c}\|^{2}+(\bm{J}^{T}\bm{c})^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}(\bm{J}^{T}\bm{J})(\bm{y}-\bm{x}).

(4.107)

In effect, by using the structure of the problem (4.105) we get an approximate quadratic model for the cost of a linear interpolation model. If each component of $\bm{m}(\bm{y})$ is a fully linear model for the corresponding component of $\bm{r}(\bm{y})$ , then it can be shown that the quadratic model (4.107) is a fully linear model for $f$ [37, Lemma 3.3]. Of course, if a quadratic model for $\bm{r}$ is available, then a more complex quadratic (or potentially up to quartic) model for $f$ can be derived in a similar way [155, 149].
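As a small illustration of (4.106) and (4.107), the following NumPy sketch builds the linear interpolation model for $\bm{r}$ and assembles the corresponding quadratic model for the least-squares objective. The function names and the example residual `r` are hypothetical, and the interpolation matrix is assumed to have rows $[1,(\bm{y}_{i}-\bm{x})^{T}]$ .

```python
import numpy as np

def linear_interp_vector(r, x, Y):
    """Linear interpolation of r: R^n -> R^q on the n+1 rows of Y.

    Returns c (q,) and J (q, n) such that c + J (y_i - x) = r(y_i) for all i.
    """
    M = np.hstack([np.ones((Y.shape[0], 1)), Y - x])   # rows [1, (y_i - x)^T]
    R = np.array([r(y) for y in Y])                    # (n+1) x q residual values
    coef = np.linalg.solve(M, R)                       # one column per component of r
    return coef[0], coef[1:].T

def least_squares_model(c, J):
    """Constant, gradient and Hessian of m(y) = 0.5 * ||c + J (y - x)||^2, see (4.107)."""
    return 0.5 * c @ c, J.T @ c, J.T @ J

# Example with a linear residual, for which the model should be exact
A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
b = np.array([0.5, -1.0, 2.0])
r = lambda y: A @ y + b
x, Delta = np.array([0.2, -0.3]), 0.1
Y = np.vstack([x, x + Delta * np.eye(2)])

c, J = linear_interp_vector(r, x, Y)
m0, g, H = least_squares_model(c, J)

y = np.array([1.0, 2.0])
s = y - x
m_y = m0 + g @ s + 0.5 * s @ H @ s                     # model value at y
f_y = 0.5 * np.linalg.norm(r(y)) ** 2                  # true objective at y
```

Note that only $n+1$ evaluations of $\bm{r}$ are used, yet the model Hessian $\bm{J}^{T}\bm{J}$ is generally nonzero.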

This composite approach, where we receive more oracle information from $\bm{r}(\bm{x})$ and directly exploit some known structure, can significantly improve the practical performance of MBDFO methods. These ideas also extend to the case where $H$ is nonsmooth (e.g. $H(\bm{r})=\|\bm{r}\|_{1}$ or $H(\bm{r})=\max(\bm{r})$ ), although the resulting trust-region step calculation and convergence theory are more complex [97, 98].

Notes and References

The first fully linear/fully quadratic error bounds for linear and quadratic interpolation were given in [46], and for linear regression and underdetermined quadratic interpolation in [45], although minimum Frobenius norm interpolation had been proposed earlier in [119]. These results are collected in the reference book [51] and summarized in [132]. The proofs presented here generalize to constrained interpolation (Section 6.1), and are based on the theory for linear interpolation and minimum Frobenius norm interpolation given in [86, 130].

The interpolation constructions here can also be thought of as a way to approximate the gradient or Hessian of a function just with function evaluations. Such approximations, known as simplex gradients/Hessians, have been studied in [127, 85, 84]. Using perturbations of size $\mathcal{O}(\Delta_{k}/\sqrt{n})$ to improve the dimension dependence of the worst-case complexity bound for linear models (as in Corollary 4.5) was originally proposed for quadratic regularization methods in [73], and was adapted to MBDFO in [40, 57]. The extension to fully quadratic models is based on the constructions in [32, 59]. In particular, [59] achieves a better bound of $\mathcal{O}(n^{3/2})$ objective evaluations by only re-sampling the Hessian every $n$ iterations. Although [59] uses a cubic regularization method, it is likely a similar improvement would be possible for a trust-region method.

How best to construct models in the underdetermined quadratic case is still actively being studied. Early alternatives to (4.52) or (4.103) were proposed in [52, 158], and several more have appeared in recent years [152, 153, 150]. The method in [150], for example, attempts to optimize the parameters of a specific model construction problem over the course of the algorithm.

Composite models for nonlinear least-squares problems were first introduced in [155] with the resulting worst-case complexity studied in [37]. Similar ideas can also be used when $H$ is nonsmooth [72, 97, 98, 101], although the resulting trust-region subproblem becomes significantly more difficult to solve. However, MBDFO for general nonsmooth problems is a relatively under-studied topic (see e.g. [90, 9]), in contrast to the relatively established approaches for direct search DFO methods [5, 11].

We conclude by noting that other forms of interpolation model have been used in MBDFO. The most notable example of non-polynomial models is the use of radial basis function (RBF) models, which take the form

\displaystyle f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x})+\sum_{i=1}^{p}\lambda_{i}\psi(\|\bm{y}-\bm{y}_{i}\|),

(4.108)

where $\psi:[0,\infty)\to\mathbb{R}$ is some predetermined scalar function, such as the Gaussian function $\psi(r)=e^{-cr^{2}}$ for some $c>0$ [30]. Because of the inclusion of a linear term in the model, such models can also be made fully linear through proper selection of the interpolation points [146, 147]. RBF models are primarily designed to globally approximate a function (rather than locally as here), and so are more commonly used in global optimization algorithms [23, 77]. Such models have many similarities with statistically motivated global function approximations such as Gaussian Processes [125], which are used in Bayesian (global) optimization algorithms [133] but again have also been used in MBDFO [16].

5 Incremental Geometry Improvement

In Section 4, we described how to generate a set of interpolation points in order to be fully linear/fully quadratic in $B(\bm{x},\Delta)$ . There, we had full control over where to place the interpolation points (e.g. using coordinate perturbations around $\bm{x}$ ) to ensure the model was sufficiently accurate. This approach is of theoretical interest, and does motivate procedures to find a collection of suitable interpolation points from a database (and determine any extra evaluations required to get a good model). However, in many practically successful MBDFO algorithms, an incremental approach is used, where interpolation points for the next iteration $k+1$ are taken to be the points from the current iteration $k$ , with minimal changes to the set, such as adding the new iterate $\bm{x}_{k+1}$ . This ensures we only ask for a very small number of new (possibly expensive) objective evaluations at each iteration. Here, we describe how to assess the quality of an interpolation set using geometric conditions, which determine how to minimally alter a given interpolation set to ensure its quality.

The primary tools we will use to assess the quality of an interpolation set, and to determine which points to move (and where to move them to), are Lagrange polynomials. For a set of points $\bm{y}_{1},\ldots,\bm{y}_{p}\in\mathbb{R}^{n}$ , the associated Lagrange polynomials are a collection of $p$ polynomials $\ell_{1},\ldots,\ell_{p}:\mathbb{R}^{n}\to\mathbb{R}$ satisfying the conditions $\ell_{i}(\bm{y}_{j})=\delta_{i,j}$ , where $\delta_{i,j}$ is the Kronecker delta (i.e. $\delta_{i,j}=1$ if $i=j$ and 0 otherwise). The degree of the Lagrange polynomials will match the degree of the interpolation model we are trying to assess.

Lagrange polynomials are an important concept from polynomial approximation theory. For example, suppose we have a continuous function $f:[-1,1]\to\mathbb{R}$ which we will approximate by a degree- $d$ polynomial $p_{d}$ via interpolation to points $x_{1},\ldots,x_{d+1}\in[-1,1]$ . The associated Lagrange polynomials determine the Lebesgue constant for the interpolation set, $\Lambda_{1}:=\max_{x\in[-1,1]}\sum_{i=1}^{d+1}|\ell_{i}(x)|$ . We have the following result, which says that the interpolant $p_{d}$ has error within $\mathcal{O}(\Lambda_{1})$ of the best possible error.

Proposition 5.1 (Theorem 15.1, [144]).

The error between the true (continuous) function $f:[-1,1]\to\mathbb{R}$ and degree- $d$ polynomial interpolant $p_{d}$ satisfies

\displaystyle\|f-p_{d}\|_{\infty}\leq(\Lambda_{1}+1)\|f-p_{d}^{*}\|_{\infty},

(5.1)

where $\|f\|_{\infty}:=\max_{x\in[-1,1]}|f(x)|$ is the supremum norm and $p_{d}^{*}\in\operatorname*{arg\,min}_{p_{d}}\|f-p_{d}\|_{\infty}$ is a best degree- $d$ polynomial approximation to $f$ in the supremum norm.

That is, interpolation sets whose associated Lagrange polynomials are small in magnitude yield better polynomial approximations. We will see that a similar property holds in the multi-dimensional case.
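To illustrate how strongly the Lebesgue constant depends on the choice of interpolation points, the following NumPy sketch estimates $\Lambda_{1}$ for two standard choices of $d+1$ points in $[-1,1]$ : equally spaced points and Chebyshev extreme points. The function names are hypothetical, and the maximum over $[-1,1]$ is only approximated on a fine grid, so the computed values are lower bounds on the true constants.

```python
import numpy as np

def lagrange_values(nodes, xs):
    """Return L with L[i, k] = ell_i(xs[k]) for the 1D Lagrange polynomials of nodes."""
    L = np.ones((nodes.size, xs.size))
    for i, xi in enumerate(nodes):
        for j, xj in enumerate(nodes):
            if i != j:
                L[i] *= (xs - xj) / (xi - xj)
    return L

def lebesgue_constant(nodes, n_grid=20001):
    """Grid approximation of max over [-1, 1] of sum_i |ell_i(x)|."""
    xs = np.linspace(-1.0, 1.0, n_grid)
    return np.abs(lagrange_values(nodes, xs)).sum(axis=0).max()

d = 10
equispaced = np.linspace(-1.0, 1.0, d + 1)
chebyshev = np.cos(np.pi * np.arange(d + 1) / d)   # Chebyshev extreme points

Lam_equi = lebesgue_constant(equispaced)
Lam_cheb = lebesgue_constant(chebyshev)
```

For this degree, `Lam_cheb` comes out substantially smaller than `Lam_equi`, which is consistent with the well-known advantage of Chebyshev points for polynomial interpolation.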

We begin by re-deriving the fully linear/fully quadratic constants from Section 4, replacing all matrix norms with quantities related to the magnitude of the relevant Lagrange polynomials. We will then use this to build an MBDFO algorithm with self-correcting interpolation sets.

5.1 Interpolation Error Bounds

To begin, consider the case of linear interpolation to build a model (4.16) by solving the system (4.17) or (4.18). In this case, given the same base point $\bm{x}\in\mathbb{R}^{n}$ as the model (4.16), the Lagrange polynomials are the linear functions

\displaystyle\ell_{i}(\bm{y}):=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x}),\qquad\forall i=1,\ldots,p,

(5.2)

where $p:=n+1$ , defined by the interpolation conditions $\ell_{i}(\bm{y}_{j})=\delta_{i,j}$ for $i,j=1,\ldots,p$ . That is, from (4.17) or (4.18) we can construct $\ell_{i}$ by solving

\displaystyle\bm{M}\begin{bmatrix}c_{i}\\ \bm{g}_{i}\end{bmatrix}=\bm{e}_{i},\qquad\text{or}\qquad\hat{\bm{M}}\begin{bmatrix}c_{i}\\ \Delta\>\bm{g}_{i}\end{bmatrix}=\bm{e}_{i},

(5.3)

provided that $\bm{M}$ and $\hat{\bm{M}}$ are invertible. Comparing to (4.17) or (4.18), one important consequence of this is that the interpolation model can be written as a linear combination of Lagrange polynomials,

\displaystyle\begin{bmatrix}c\\ \bm{g}\end{bmatrix}=\sum_{i=1}^{p}f(\bm{y}_{i})\begin{bmatrix}c_{i}\\ \bm{g}_{i}\end{bmatrix},\qquad\text{giving}\qquad m(\bm{y})=\sum_{i=1}^{p}f(\bm{y}_{i})\ell_{i}(\bm{y}).

(5.4)

If $f$ is constant, $f(\bm{y})=1$ for all $\bm{y}\in\mathbb{R}^{n}$ , then since the linear model $m$ is unique, the model must perfectly match the original function $m(\bm{y})=f(\bm{y})$ . Applying (5.4), we conclude that Lagrange polynomials form a partition of unity,

\displaystyle\sum_{i=1}^{p}\ell_{i}(\bm{y})=1,\qquad\forall\bm{y}\in\mathbb{R}^{n}.

(5.5)

Separately, from (5.2) and (5.3) we observe that

\displaystyle\ell_{i}(\bm{y})=\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}^{T}\begin{bmatrix}c_{i}\\ \bm{g}_{i}\end{bmatrix}=\bm{e}_{i}^{T}\bm{M}^{-T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\bm{e}_{i}^{T}\hat{\bm{M}}^{-T}\begin{bmatrix}1\\ (\bm{y}-\bm{x})/\Delta\end{bmatrix}.

(5.6)

This means that we can evaluate all Lagrange polynomials at a single point by solving a single linear system, without calculating any $c_{i}$ or $\bm{g}_{i}$ :

\displaystyle\bm{\lambda}(\bm{y}):=\begin{bmatrix}\ell_{1}(\bm{y})\\ \vdots\\ \ell_{p}(\bm{y})\end{bmatrix}=\bm{M}^{-T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\hat{\bm{M}}^{-T}\begin{bmatrix}1\\ (\bm{y}-\bm{x})/\Delta\end{bmatrix}.

(5.7)
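A minimal NumPy sketch of (5.7) is given below, assuming that $\bm{M}$ has rows $[1,(\bm{y}_{i}-\bm{x})^{T}]$ as in the linear interpolation system. The function name and the example points are hypothetical.

```python
import numpy as np

def lagrange_linear(Y, x, y):
    """Evaluate all linear Lagrange polynomials at y via one solve with M^T, see (5.7)."""
    M = np.hstack([np.ones((Y.shape[0], 1)), Y - x])    # assumed rows [1, (y_i - x)^T]
    return np.linalg.solve(M.T, np.concatenate([[1.0], y - x]))

# Example: n = 2 with p = n + 1 = 3 affinely independent points
x = np.array([0.1, 0.1])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 1.5]])
y = np.array([0.7, -0.4])

lam = lagrange_linear(Y, x, y)

# Interpolation model written as a combination of Lagrange polynomials, see (5.4)
f = lambda z: 2.0 + 3.0 * z[0] - z[1]                   # a linear example function
m_y = sum(f(yi) * li for yi, li in zip(Y, lam))
```

Since `f` is linear here, `m_y` reproduces `f(y)`, and the entries of `lam` sum to one as in (5.5).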

We have seen a similar relationship before, in (4.21) in the proof of Theorem 4.2. Specifically, the vector $\bm{v}$ in (4.21) is equal to $\bm{\lambda}(\bm{y})$ , and so (4.25) can be written as

\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}\|\bm{\lambda}(\bm{y})\|_{1},

(5.8)

for all $\bm{y}\in B(\bm{x},\Delta)$ .

The bound (5.8) tells us that, just as for higher-order polynomial interpolation in the scalar case (Proposition 5.1), the Lebesgue constant $\max_{\bm{y}\in B(\bm{x},\Delta)}\|\bm{\lambda}(\bm{y})\|_{1}$ is a useful way to measure the quality of an interpolation model. However, we will not be able to use the Lebesgue constant to determine how to incrementally improve an interpolation set. Instead, we consider $\|\bm{\lambda}(\bm{y})\|_{\infty}$ in the following definition, which will apply equally to quadratic interpolation as well as linear.

Definition 5.2.

Suppose we have an interpolation set $\mathcal{Y}:=\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset\mathbb{R}^{n}$ such that its Lagrange polynomials exist. Given $\bm{x}\in\mathbb{R}^{n}$ and $\Delta>0$ , we say that $\mathcal{Y}$ is $\Lambda$ -poised in $B(\bm{x},\Delta)$ for some $\Lambda>0$ if

\displaystyle\max_{\bm{y}\in B(\bm{x},\Delta)}\|\bm{\lambda}(\bm{y})\|_{\infty}\leq\Lambda.

(5.9)

There are two noteworthy observations that can be made at this point:

• We can construct the Lagrange polynomials (i.e. $\Lambda$ exists) if and only if the relevant interpolation linear system is invertible;

• Since the Lagrange polynomials always form a partition of unity (5.5) we have $\|\bm{\lambda}(\bm{y})\|_{1}\geq\sum_{i=1}^{p}\ell_{i}(\bm{y})=1$ , and so $\Lambda\geq 1/p$ . Moreover, if any $\bm{y}_{i}\in B(\bm{x},\Delta)$ , which is usually the case, the condition $\ell_{i}(\bm{y}_{i})=1$ implies $\Lambda\geq 1$ .

For example, the set $\mathcal{Y}=\{\bm{x},\bm{x}+\Delta\bm{e}_{1},\ldots,\bm{x}+\Delta\bm{e}_{n}\}$ from Corollary 4.4 has Lagrange polynomials

\displaystyle\ell_{1}(\bm{y})=1-\bm{e}^{T}(\bm{y}-\bm{x})/\Delta,\qquad\text{and}\qquad\ell_{i+1}(\bm{y})=\bm{e}_{i}^{T}(\bm{y}-\bm{x})/\Delta,\quad\forall i=1,\ldots,n.

(5.10)

The maximizers of $|\ell_{i}(\bm{y})|$ over $\bm{y}\in B(\bm{x},\Delta)$ are $\bm{y}=\bm{x}-\Delta\frac{\bm{e}}{\|\bm{e}\|}$ for $i=1$ and $\bm{y}=\bm{x}\pm\Delta\bm{e}_{i-1}$ for $i=2,\ldots,n+1$ , yielding $\Lambda=\sqrt{n}+1$ .
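The value $\Lambda=\sqrt{n}+1$ can be checked numerically from the explicit formulas (5.10). The sketch below (with hypothetical names) works with the scaled variable $\hat{\bm{s}}=(\bm{y}-\bm{x})/\Delta$ , evaluates $\|\bm{\lambda}(\bm{y})\|_{\infty}$ at the claimed maximizer, and compares against random samples from the unit ball. Random sampling only gives a lower estimate of the maximum, so this is a sanity check rather than a proof.

```python
import numpy as np

def lam_coordinate_set(s_hat):
    """Lagrange polynomial values (5.10) for {x, x + Delta e_i} at y = x + Delta s_hat."""
    return np.concatenate([[1.0 - s_hat.sum()], s_hat])

n = 4
s_star = -np.ones(n) / np.sqrt(n)                     # scaled maximizer of |ell_1|
Lambda_star = np.abs(lam_coordinate_set(s_star)).max()

# Uniform random samples in the unit ball (scaled version of B(x, Delta))
rng = np.random.default_rng(0)
samples = rng.standard_normal((2000, n))
radii = rng.random((2000, 1)) ** (1.0 / n)
samples *= radii / np.linalg.norm(samples, axis=1, keepdims=True)
Lambda_sampled = max(np.abs(lam_coordinate_set(s)).max() for s in samples)
```

Here `Lambda_star` equals $1+\sqrt{n}$ by (5.10), and no sampled point should exceed it.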

As a direct consequence of (5.8), we get the following alternative to Theorem 4.2.

Theorem 5.3.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a linear model (4.16) for $f$ by solving (4.18), where we assume $\hat{\bm{M}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=n+1$ is the number of interpolation points) and the interpolation set is $\Lambda$ -poised in $B(\bm{x},\Delta)$ , then the model is fully linear in $B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}}{2}\beta^{2}p\Lambda+\frac{L_{1}}{2},\quad\text{and}\quad\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}.

(5.11)

Proof.

Follows immediately from (5.8), $\|\bm{\lambda}(\bm{y})\|_{1}\leq p\Lambda$ and Lemma 4.1(a) with $\kappa_{H}=0$ . ∎

We get similar results in the case of (fully) quadratic interpolation (4.29). If we have interpolation points $\bm{y}_{1},\ldots,\bm{y}_{p}$ with $p=(n+1)(n+2)/2$ , then the associated Lagrange polynomials are

\displaystyle\ell_{i}(\bm{y}):=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\bm{H}_{i}(\bm{y}-\bm{x}),\qquad\forall i=1,\ldots,p,

(5.12)

which we can construct by solving the same systems as (4.33) and (4.34), namely

\displaystyle\bm{Q}\begin{bmatrix}c_{i}\\ \bm{g}_{i}\\ \operatorname{upper}(\bm{H}_{i})\end{bmatrix}=\bm{e}_{i},\qquad\text{or}\qquad\hat{\bm{Q}}\begin{bmatrix}c_{i}\\ \Delta\>\bm{g}_{i}\\ \Delta^{2}\operatorname{upper}(\bm{H}_{i})\end{bmatrix}=\bm{e}_{i},\qquad\forall i=1,\ldots,p.

(5.13)

This again gives us the relationship (5.4) and similar reasoning also gives (5.5). Recalling the natural quadratic basis (4.30) and (4.31), we also have

\displaystyle\ell_{i}(\bm{y})=\bm{\phi}(\bm{y}-\bm{x})^{T}\begin{bmatrix}c_{i}\\ \bm{g}_{i}\\ \operatorname{upper}(\bm{H}_{i})\end{bmatrix}=\bm{\phi}(\hat{\bm{s}})^{T}\begin{bmatrix}c_{i}\\ \Delta\>\bm{g}_{i}\\ \Delta^{2}\operatorname{upper}(\bm{H}_{i})\end{bmatrix},

(5.14)

where $\hat{\bm{s}}:=(\bm{y}-\bm{x})/\Delta$ , and combining with (5.13) we get the analog of (5.7), namely

\displaystyle\bm{\lambda}(\bm{y})=\bm{Q}^{-T}\bm{\phi}(\bm{y}-\bm{x})=\hat{\bm{Q}}^{-T}\bm{\phi}(\hat{\bm{s}}).

(5.15)

Comparing with the proof of Theorem 4.6, we see that $\bm{v}$ in (4.37) is equal to $\bm{\lambda}(\bm{y})$ , and so (4.41) can be replaced with

\displaystyle\left|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})-\frac{1}{2}(\bm{y}-\bm{x})^{T}\nabla^{2}f(\bm{x})(\bm{y}-\bm{x})\right|\leq\frac{L_{2}}{6}\beta^{3}\|\bm{\lambda}(\bm{y})\|_{1}\Delta^{3},

(5.16)

for all $\bm{y}\in B(\bm{x},\Delta)$ .

Hence we get the following version of Theorem 4.6.

Theorem 5.4.

Suppose $f$ satisfies Assumption 2.7 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.34), where we assume $\hat{\bm{Q}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=(n+1)(n+2)/2$ is the number of interpolation points) and the interpolation set is $\Lambda$ -poised in $B(\bm{x},\Delta)$ , then the model is fully quadratic in $B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{1}{6}L_{2}\beta^{3}p\Lambda+\frac{L_{2}}{6},\quad\kappa_{\textnormal{mg}}=\frac{17}{3}L_{2}\beta^{3}p\Lambda+\frac{L_{2}}{2},\quad\text{and}\quad\kappa_{\textnormal{mh}}=4L_{2}\beta^{3}p\Lambda+L_{2}.

(5.17)

Proof.

Combine (5.16) with $\|\bm{\lambda}(\bm{y})\|_{1}\leq p\Lambda$ and Lemma 4.1(b). ∎

Considering again the structured interpolation points (4.48) for fully quadratic interpolation, in Appendix B we show that, for this set, $\Lambda=\mathcal{O}(n)$ .

Lastly, consider the case of minimum Frobenius norm quadratic interpolation, for an interpolation set $\bm{y}_{1},\ldots,\bm{y}_{p}$ with $p\in\{n+2,\ldots,(n+1)(n+2)/2-1\}$ . Here, the associated Lagrange polynomials are again quadratic, (5.12), but are constructed by solving (c.f. (4.52))

\displaystyle\min_{c_{i},\bm{g}_{i},\bm{H}_{i}}\frac{1}{4}\|\bm{H}_{i}\|_{F}^{2},\qquad\text{s.t.}\quad\text{$\ell_{i}(\bm{y}_{j})=\delta_{i,j}$ for all $j=1,\ldots,p$}\quad\text{and}\quad\bm{H}_{i}=\bm{H}_{i}^{T},

(5.18)

which gives us (c.f. (4.65) and (4.79))

\displaystyle\bm{F}\left[\begin{array}[]{c}\lambda_{i,1}\\ \vdots\\ \lambda_{i,p}\\ \hline\cr c_{i}\\ \bm{g}_{i}\end{array}\right]=\bm{e}_{i},\qquad\text{or}\qquad\hat{\bm{F}}\left[\begin{array}[]{c}\Delta^{4}\lambda_{i,1}\\ \vdots\\ \Delta^{4}\lambda_{i,p}\\ \hline\cr c_{i}\\ \Delta\>\bm{g}_{i}\end{array}\right]=\bm{e}_{i},\qquad\forall i=1,\ldots,p,

(5.29)

with associated Hessians $\bm{H}_{i}=\sum_{j=1}^{p}\lambda_{i,j}(\bm{y}_{j}-\bm{x})(\bm{y}_{j}-\bm{x})^{T}$ . Once again, we get (5.4) and (5.5).

We now define $\bm{\varphi}(\bm{y})\in\mathbb{R}^{p+n+1}$ to be the polynomial vector

\displaystyle\bm{\varphi}(\bm{y})=\begin{bmatrix}\frac{1}{2}[(\bm{y}_{1}-\bm{x})^{T}(\bm{y}-\bm{x})]^{2}&\cdots&\frac{1}{2}[(\bm{y}_{p}-\bm{x})^{T}(\bm{y}-\bm{x})]^{2}&1&(\bm{y}-\bm{x})^{T}\end{bmatrix}^{T},

(5.30)

defined so that the $i$ -th row/column of $\bm{F}$ is $\bm{\varphi}(\bm{y}_{i})$ , i.e. $\bm{\varphi}(\bm{y}_{i})=\bm{F}\bm{e}_{i}$ for $i=1,\ldots,p$ . We also define $\hat{\bm{\varphi}}(\hat{\bm{s}})$ analogously, replacing $\bm{y}_{i}-\bm{x}$ with $\hat{\bm{s}}_{i}=(\bm{y}_{i}-\bm{x})/\Delta$ and $\bm{y}-\bm{x}$ with $\hat{\bm{s}}=(\bm{y}-\bm{x})/\Delta$ , so $\hat{\bm{\varphi}}(\hat{\bm{s}}_{i})=\hat{\bm{F}}\bm{e}_{i}$ . We can then observe that

\displaystyle\ell_{i}(\bm{y})=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x})+\frac{1}{2}\sum_{j=1}^{p}\lambda_{i,j}[(\bm{y}_{j}-\bm{x})^{T}(\bm{y}-\bm{x})]^{2}=\bm{\varphi}(\bm{y})^{T}\begin{bmatrix}\begin{array}[]{c}\lambda_{i,1}\\ \vdots\\ \lambda_{i,p}\\ \hline\cr c_{i}\\ \bm{g}_{i}\end{array}\end{bmatrix}=\bm{\varphi}(\bm{y})^{T}\bm{F}^{-1}\bm{e}_{i},

(5.31)

or $\ell_{i}(\bm{y})=\hat{\bm{\varphi}}(\hat{\bm{s}})^{T}\hat{\bm{F}}^{-1}\bm{e}_{i}$ , and, recalling that $\bm{F}$ and $\hat{\bm{F}}$ are symmetric, once again we can evaluate all Lagrange polynomials at a single point without constructing any $\ell_{i}$ explicitly, via

\displaystyle\bm{\lambda}(\bm{y})=[\bm{F}^{-1}\bm{\varphi}(\bm{y})]_{1,\ldots,p}=[\hat{\bm{F}}^{-1}\hat{\bm{\varphi}}(\hat{\bm{s}})]_{1,\ldots,p},

(5.32)

where $[\cdot]_{1,\ldots,p}$ refers to the first $p$ entries of the vector.

We now get our new version of Lemma 4.13.

Lemma 5.5.

\displaystyle\|\bm{H}\|\leq\kappa_{H}:=12L_{1}p\beta^{2}\Lambda.

(5.33)

Proof.

Just as in the proof of Lemma 4.13, we note that we get the same Hessian if we interpolate to $\tilde{f}(\bm{y}):=f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})$ , with $|\tilde{f}(\bm{y}_{i})|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}$ . Hence from (5.4) we have

\displaystyle\|\bm{H}\|\leq\sum_{i=1}^{p}|\tilde{f}(\bm{y}_{i})|\cdot\|\bm{H}_{i}\|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}\sum_{i=1}^{p}\|\bm{H}_{i}\|,

(5.34)

where $\bm{H}_{i}$ is the Hessian of $\ell_{i}(\bm{y})$ . Since $|\ell_{i}(\bm{y})|\leq\Lambda$ for all $\bm{y}\in B(\bm{x},\Delta)$ , we can apply Lemma A.2 to compare $\ell_{i}$ with the zero function and conclude $\|\bm{H}_{i}\|\leq\frac{24\Lambda}{\Delta^{2}}$ , from which the result follows. ∎
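As a worked check of the constant in (5.33), the last step of this proof can be written out by substituting $\|\bm{H}_{i}\|\leq 24\Lambda/\Delta^{2}$ for each of the $p$ terms in (5.34):

\displaystyle\|\bm{H}\|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}\cdot p\cdot\frac{24\Lambda}{\Delta^{2}}=12L_{1}p\beta^{2}\Lambda=\kappa_{H}.

The factors of $\Delta^{2}$ cancel, which is why $\kappa_{H}$ does not depend on the trust-region radius.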

Our new version of Theorem 4.14 is the following.

Theorem 5.6.

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}+\kappa_{H}}{2}\beta^{2}p\Lambda+\frac{L_{1}+\kappa_{H}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}+2\kappa_{H},

(5.35)

using the value of $\kappa_{H}$ defined in Lemma 5.5.

Proof.

By following the same argument as used to derive (4.87) in the proof of Theorem 4.14, we get

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}+\kappa_{H}}{2}\beta^{2}\|\bm{v}\|_{1}\Delta^{2},

(5.36)

for any $\bm{y}\in B(\bm{x},\Delta)$ , where $\bm{v}\in\mathbb{R}^{p}$ is any vector satisfying

\displaystyle\bm{M}^{T}\bm{v}=\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}.

(5.37)

We now show that $\bm{\lambda}(\bm{y})$ satisfies (5.37). If $f$ is linear, then $\bm{H}=\bm{0}$ is a global minimizer of (4.52), and so we have exact interpolation, $m(\bm{y})=f(\bm{y})$ for all $\bm{y}$. Applying this to $f(\bm{y})=(\bm{y}-\bm{x})^{T}\bm{e}_{j}$ for $j=1,\ldots,n$ and writing the resulting models in the form (5.4), we see

\displaystyle(\bm{y}-\bm{x})^{T}\bm{e}_{j}=\sum_{i=1}^{p}(\bm{y}_{i}-\bm{x})^{T}\bm{e}_{j}\>\ell_{i}(\bm{y}),\qquad j=1,\ldots,n,

(5.38)

which, together with (5.5), is equivalent to $\bm{\lambda}(\bm{y})$ satisfying (5.37). Hence, (5.36) becomes

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}+\kappa_{H}}{2}\beta^{2}\|\bm{\lambda}(\bm{y})\|_{1}\Delta^{2},

(5.39)

and so

	$\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|$	$\displaystyle\leq\frac{L_{1}+\kappa_{H}}{2}\beta^{2}\|\bm{\lambda}(\bm{y})\|_{1}\Delta^{2}+\frac{1}{2}|(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x})|$		(5.40)
		$\displaystyle\leq\left(\frac{L_{1}+\kappa_{H}}{2}\beta^{2}\|\bm{\lambda}(\bm{y})\|_{1}+\frac{\kappa_{H}}{2}\right)\Delta^{2}.$		(5.41)

The result then follows from Lemma 4.1(a) and $\|\bm{\lambda}(\bm{y})\|_{1}\leq p\Lambda$ . ∎

Now consider the set $\mathcal{Y}=\{\bm{x},\bm{x}+\Delta\bm{e}_{1},\ldots,\bm{x}+\Delta\bm{e}_{n},\bm{x}-\Delta\bm{e}_{1},\ldots,\bm{x}-\Delta\bm{e}_{n}\}$ , as in Corollary 4.15. The associated Lagrange polynomials are:


$\displaystyle\ell_{1}(\bm{y})$	$\displaystyle=1-\frac{1}{\Delta^{2}}\|\bm{y}-\bm{x}\|^{2},$	(5.42a)
$\displaystyle\ell_{i+1}(\bm{y})$	$\displaystyle=\frac{(y_{i}-x_{i})^{2}}{2\Delta^{2}}+\frac{y_{i}-x_{i}}{2\Delta},$	(5.42b)
$\displaystyle\ell_{n+i+1}(\bm{y})$	$\displaystyle=\frac{(y_{i}-x_{i})^{2}}{2\Delta^{2}}-\frac{y_{i}-x_{i}}{2\Delta},$	(5.42c)

for $i=1,\ldots,n$ . Each of these takes values in $[-\frac{1}{8},1]$ for all $\bm{y}\in B(\bm{x},\Delta)$ and so $\Lambda=1$ .
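The formula (5.32) and the closed forms (5.42) can be cross-checked numerically. The sketch below is illustrative only: it assumes the block form $\bm{F}=\begin{bmatrix}\bm{P}&\bm{M}\\ \bm{M}^{T}&\bm{0}\end{bmatrix}$ with $P_{i,j}=\frac{1}{2}[(\bm{y}_{i}-\bm{x})^{T}(\bm{y}_{j}-\bm{x})]^{2}$, inferred here from (5.30) and the symmetry of $\bm{F}$ since (4.65) is not restated in this section, and all function names are hypothetical.

```python
# Sketch: evaluate minimum Frobenius norm Lagrange polynomials through (5.32)
# and compare with the closed forms (5.42).
# Assumption: F = [[P, M], [M^T, 0]], P_ij = 0.5 * ((y_i-x)^T (y_j-x))^2,
# rows of M = [1, (y_i-x)^T]; inferred from (5.30), not quoted from (4.65).


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def phi(shifts, s):
    """Polynomial vector (5.30), with shifts[j] = y_j - x and s = y - x."""
    quad = [0.5 * sum(a * b for a, b in zip(sj, s)) ** 2 for sj in shifts]
    return quad + [1.0] + list(s)


def lagrange_values(shifts, s):
    """lambda(y): the first p entries of F^{-1} phi(y), as in (5.32)."""
    p, n = len(shifts), len(s)
    size = p + n + 1
    cols = [phi(shifts, sj) for sj in shifts]  # column i of F is phi(y_i)
    F = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(p):
            F[i][j] = cols[j][i]
            F[j][i] = cols[j][i]  # F is symmetric; bottom-right block stays zero
    return solve(F, phi(shifts, s))[:p]


def closed_form(s, delta):
    """Closed forms (5.42) for the set {x, x + delta*e_i, x - delta*e_i}."""
    t = [si / delta for si in s]
    first = [1.0 - sum(ti * ti for ti in t)]
    plus = [0.5 * ti * ti + 0.5 * ti for ti in t]
    minus = [0.5 * ti * ti - 0.5 * ti for ti in t]
    return first + plus + minus


delta = 0.5
shifts = [[0.0, 0.0], [delta, 0.0], [0.0, delta], [-delta, 0.0], [0.0, -delta]]
s = [0.2, -0.1]  # y - x for a point inside B(x, delta)
from_F = lagrange_values(shifts, s)
from_formula = closed_form(s, delta)
max_gap = max(abs(a - b) for a, b in zip(from_F, from_formula))
```

Under these assumptions the two evaluations should agree up to rounding error, and the values should sum to one, consistent with the partition of unity property.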

Remark 5.7.

For the example interpolation sets considered here, applying the value of $\Lambda$ in the fully linear/fully quadratic constants gives a worse bound (in terms of $n$ ) than Section 4. These can be improved by calculating the constants in terms of the Lebesgue constant, but this does not allow for the procedures in Section 5.2 to be used.

5.2 Model Improvement

As mentioned previously, the main reason for using Lagrange polynomials in our model error bounds is that it allows us both to check whether a model is fully linear and to decide how to make small changes to the interpolation set to make it fully linear. Essentially, we wish to find an interpolation set for which $\Lambda$ is sufficiently small, below some given value.

Firstly, we can determine $\Lambda$ by maximizing each Lagrange polynomial and its negative within the trust-region. Since the Lagrange polynomials are themselves linear or quadratic, this reduces to solving $2p$ trust-region subproblems (2.12). In theory, this requires finding the global trust-region solution for each, but in practice solving these trust-region subproblems approximately can give a sufficiently good estimate [118].

Now, suppose we have an interpolation set which is $\Lambda$-poised, but we want to change some points to reduce its value of $\Lambda$ below some threshold $\Lambda^{*}>1$ (recalling that $\Lambda\geq 1$ if any $\bm{y}_{i}\in B(\bm{x},\Delta)$). The procedure for changing the interpolation set to improve its poisedness constant is quite simple, and is given in Algorithm 5.1. At each iteration of Algorithm 5.1 we find $\bm{y}\in B(\bm{x},\Delta)$ and $i$ such that $|\ell_{i}(\bm{y})|>\Lambda^{*}$ (for example, a pair attaining $|\ell_{i}(\bm{y})|=\Lambda$, which qualifies whenever $\Lambda>\Lambda^{*}$) and replace $\bm{y}_{i}$ with $\bm{y}$. We do not specify a particular interpolation model type; Algorithm 5.1 applies equally to linear, quadratic or minimum Frobenius norm quadratic interpolation without modification.

1: Inputs: Interpolation set $\mathcal{Y}$ which is $\Lambda$-poised in $B(\bm{x},\Delta)$ for some $\Lambda>0$ and all $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta\geq 1$, desired $\Lambda$-poisedness constant $\Lambda^{*}>1$
2: while $\Lambda>\Lambda^{*}$ do
3:     Find $\bm{y}\in B(\bm{x},\Delta)$ and $i\in\{1,\ldots,p\}$ such that $|\ell_{i}(\bm{y})|>\Lambda^{*}$
4:     Update $\mathcal{Y}$ by replacing $\bm{y}_{i}$ with $\bm{y}$
5:     Recompute the poisedness constant $\Lambda$ of the new $\mathcal{Y}$
6: end while
7: return $\mathcal{Y}$

Algorithm 5.1 Interpolation set improvement using Lagrange polynomials.
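To make the loop concrete, here is a minimal sketch of Algorithm 5.1 for linear interpolation only, where each Lagrange polynomial $\ell_{i}(\bm{y})=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x})$ has the exact maximum $|c_{i}|+\Delta\|\bm{g}_{i}\|$ over $B(\bm{x},\Delta)$, so no trust-region solver is needed. The function names, the starting set and the choice $\Lambda^{*}=2$ are assumptions made for illustration; picking the global maximizer in line 3 is one admissible choice, not the only one.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def lagrange_coeffs(points, x):
    """Coefficients [c_i, g_i] of each linear l_i(y) = c_i + g_i^T (y - x)."""
    M = [[1.0] + [yj - xj for yj, xj in zip(y, x)] for y in points]
    p = len(points)
    return [solve(M, [float(j == i) for j in range(p)]) for i in range(p)]


def max_abs_on_ball(coef, delta):
    """Exact maximum of |c + g^T s| over ||s|| <= delta, with a maximizing s."""
    c, g = coef[0], coef[1:]
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    if gnorm == 0.0:
        return abs(c), [0.0] * len(g)
    sign = 1.0 if c >= 0.0 else -1.0
    return abs(c) + delta * gnorm, [sign * delta * gi / gnorm for gi in g]


def improve_geometry(points, x, delta, lam_star, max_iter=100):
    """Sketch of Algorithm 5.1 for linear interpolation (p = n + 1 points)."""
    points = [list(y) for y in points]
    for iteration in range(max_iter):
        best = [max_abs_on_ball(cf, delta) for cf in lagrange_coeffs(points, x)]
        i = max(range(len(points)), key=lambda j: best[j][0])
        lam = best[i][0]  # poisedness constant of the current set
        if lam <= lam_star:
            return points, lam, iteration
        # Line 4: replace y_i with a maximizer of |l_i| in B(x, delta)
        points[i] = [xj + sj for xj, sj in zip(x, best[i][1])]
    raise RuntimeError("iteration cap reached before Lambda <= Lambda*")


x = [0.0, 0.0]
start = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1]]  # two nearly coincident directions
final_points, final_lam, n_iter = improve_geometry(start, x, delta=1.0, lam_star=2.0)
```

The iteration cap is a safety net for the sketch only; Theorem 5.10 is what guarantees finite termination.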

To analyze Algorithm 5.1, we consider the impact on the determinant of the relevant interpolation matrix (i.e. $\bm{M}$ (4.17), $\bm{Q}$ (4.33) or $\bm{F}$ (4.65) depending on the type of interpolation model) from changing a single point in the interpolation set.

The case for linear or quadratic interpolation is quite straightforward.

Lemma 5.8.

Suppose the set $\mathcal{Y}$ is used for either linear or quadratic interpolation, with interpolation matrix $\bm{A}\in\mathbb{R}^{p\times p}$ (i.e. $\bm{M}$ or $\bm{Q}$ ), where $\mathcal{Y}=\{\bm{y}_{1},\ldots,\bm{y}_{p}\}$ . If we replace $\bm{y}_{i}$ in $\mathcal{Y}$ with another point $\bm{y}$ , then the resulting interpolation matrix $\bm{A}_{\textnormal{new}}$ satisfies

\displaystyle|\det(\bm{A}_{\textnormal{new}})|\geq|\ell_{i}(\bm{y})|\>|\det(\bm{A})|,

(5.43)

where the Lagrange polynomial $\ell_{i}$ is given by the original interpolation set $\mathcal{Y}$ (before the replacement is made).

Proof.

If $\bm{A}$ is singular, the result holds trivially, so assume that $\bm{A}$ is invertible. Replacing $\bm{y}_{i}$ with $\bm{y}$ changes the $i$ th row of $\bm{A}$ from $\bm{\phi}(\bm{y}_{i})^{T}$ to $\bm{\phi}(\bm{y})^{T}$ in $\bm{A}_{\textnormal{new}}$ , where $\bm{\phi}(\bm{y})=\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}$ in the linear interpolation case or the natural quadratic basis (4.30) for quadratic interpolation. Equivalently, $\bm{A}_{\textnormal{new}}^{T}$ is the same as $\bm{A}^{T}$ , but with its $i$ th column changed from $\bm{\phi}(\bm{y}_{i})$ to $\bm{\phi}(\bm{y})$ . However, from either (5.7) or (5.15), we have that $\ell_{i}(\bm{y})$ is the $i$ th entry of $\bm{A}^{-T}\bm{\phi}(\bm{y})$ . By Cramer’s rule for linear systems, this means $\ell_{i}(\bm{y})=\det(\bm{A}_{\textnormal{new}}^{T})/\det(\bm{A}^{T})=\det(\bm{A}_{\textnormal{new}})/\det(\bm{A})$ , giving the result. ∎
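The determinant relation used in this proof is easy to verify on a small case. The following sketch assumes linear interpolation in two dimensions with the hypothetical set $\{\bm{0},\bm{e}_{1},\bm{e}_{2}\}$ and $\bm{x}=\bm{0}$, for which the Lagrange polynomials are $\ell_{1}(\bm{y})=1-y_{1}-y_{2}$, $\ell_{2}(\bm{y})=y_{1}$ and $\ell_{3}(\bm{y})=y_{2}$.

```python
def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def interp_matrix(points):
    """Linear interpolation matrix with rows [1, y^T] (base point x = 0)."""
    return [[1.0, y[0], y[1]] for y in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
lagrange = [
    lambda y: 1.0 - y[0] - y[1],  # l_1
    lambda y: y[0],               # l_2
    lambda y: y[1],               # l_3
]

y_new = (-0.7, -0.7)  # candidate replacement point
det_old = det3(interp_matrix(points))
gaps = []
for i in range(3):
    swapped = list(points)
    swapped[i] = y_new  # replace y_i with y_new, as in Lemma 5.8
    det_new = det3(interp_matrix(swapped))
    gaps.append(abs(det_new - lagrange[i](y_new) * det_old))
```

Each entry of `gaps` should vanish up to rounding, reflecting that the Cramer's rule argument gives the determinant ratio exactly, which in particular implies the inequality (5.43).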

For minimum Frobenius norm interpolation, a similar result holds, but it is harder to prove because of the more involved structure of the interpolation matrix $\bm{F}$ (4.65). In particular, replacing a single interpolation point changes both a row and a column of $\bm{F}$, so Cramer’s rule cannot be used as in the proof of Lemma 5.8.

Lemma 5.9.

Suppose the set $\mathcal{Y}$ is used for minimum Frobenius norm interpolation, with interpolation matrix $\bm{F}\in\mathbb{R}^{(p+n+1)\times(p+n+1)}$ , where $\mathcal{Y}=\{\bm{y}_{1},\ldots,\bm{y}_{p}\}$ . If we replace $\bm{y}_{i}$ in $\mathcal{Y}$ with another point $\bm{y}$ , then the resulting interpolation matrix $\bm{F}_{\textnormal{new}}$ satisfies

\displaystyle|\det(\bm{F}_{\textnormal{new}})|\geq\ell_{i}(\bm{y})^{2}\>|\det(\bm{F})|,

(5.44)

where the Lagrange polynomial $\ell_{i}$ is given by the original interpolation set $\mathcal{Y}$ (before the replacement is made).

Proof.

See [130, Theorem 5.2]. ∎

We are now ready to show that Algorithm 5.1 achieves the desired outcome.

Theorem 5.10.

Algorithm 5.1 terminates in finite time, and the resulting interpolation set $\mathcal{Y}$ is $\Lambda^{*}$ -poised in $B(\bm{x},\Delta)$ , and we have $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for all $\bm{y}_{i}\in\mathcal{Y}$ .

Proof.

The termination condition for Algorithm 5.1 implies that $\mathcal{Y}$ is $\Lambda^{*}$ -poised in $B(\bm{x},\Delta)$ if it terminates. All points in the resulting $\mathcal{Y}$ are either in $B(\bm{x},\beta\Delta)$ (if they were in $\mathcal{Y}$ initially) or are in $B(\bm{x},\Delta)$ (if they come from an update in line 4 of Algorithm 5.1). Hence all points satisfy $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ since we assume $\beta\geq 1$ . It remains to show that Algorithm 5.1 terminates in finite time.

Since the initial set $\mathcal{Y}$ is $\Lambda$-poised, the corresponding interpolation matrix, say $\bm{A}$ (either $\bm{M}$, $\bm{Q}$ or $\bm{F}$ depending on the type of interpolation), is invertible, so $|\det(\bm{A})|>0$ initially. Call this value $d_{0}$. Moreover, since the points of the initial $\mathcal{Y}$ and any new points generated all lie in the compact set $B(\bm{x},\beta\Delta)$, there is a finite maximum value, say $d_{\max}$, of $|\det(\bm{A})|$ over all interpolation sets contained in $B(\bm{x},\beta\Delta)$. From either Lemma 5.8 or Lemma 5.9, whenever we make the update in line 4 of Algorithm 5.1, we increase $|\det(\bm{A})|$ by a factor of at least $\min(|\ell_{i}(\bm{y})|,\ell_{i}(\bm{y})^{2})>\Lambda^{*}$ (since $|\ell_{i}(\bm{y})|>\Lambda^{*}>1$). Hence Algorithm 5.1 must terminate after at most $\lceil\log_{\Lambda^{*}}(d_{\max}/d_{0})\rceil$ iterations. ∎
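For a sense of scale, the iteration bound from this proof can be evaluated on a small case. The sketch below assumes linear interpolation in two dimensions with $\Delta=\beta=1$, where $|\det(\bm{M})|$ is twice the area of the triangle spanned by the three points, so $d_{\max}=3\sqrt{3}/2$ (attained by an equilateral triangle inscribed in the unit circle). The starting set and $\Lambda^{*}=2$ are hypothetical choices.

```python
import math


def iteration_bound(d0, d_max, lam_star):
    """Upper bound ceil(log_{Lambda*}(d_max / d0)) from the proof of Theorem 5.10."""
    return math.ceil(math.log(d_max / d0) / math.log(lam_star))


# Linear interpolation in the unit disc (n = 2, Delta = beta = 1):
# |det M| is twice the triangle area, so d_max = 3*sqrt(3)/2.
d_max = 1.5 * math.sqrt(3.0)
d0 = 0.1  # |det M| for the set {(0,0), (1,0), (0.9,0.1)} with x = (0,0)
bound = iteration_bound(d0, d_max, lam_star=2.0)
```

The bound is a worst case: it only uses the guaranteed growth factor $\Lambda^{*}$ per iteration, whereas an update at a maximizer of $|\ell_{i}|$ typically increases the determinant by much more.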

Two iterations of Algorithm 5.1 for linear interpolation in two dimensions are illustrated in Figure 5.1. Initially, two interpolation points $\bm{x}_{2}$ and $\bm{x}_{3}$ are very close, yielding a large value of $\Lambda$. (The value of $\Lambda$ is still not unreasonably large here, simply so that the locations of $\bm{x}_{2}$ and $\bm{x}_{3}$ can be illustrated clearly.) After two iterations the value of $\Lambda$ has been decreased by more than an order of magnitude, and is very close to 1. Note that the center of the region $\bm{x}_{1}$ is no longer chosen as an interpolation point, so $m(\bm{x})\neq f(\bm{x})$ is possible for this final interpolation set. However, after one iteration of Algorithm 5.1 we already have a small value of $\Lambda$ while retaining $m(\bm{x})=f(\bm{x})$, which may be desirable in practice.

5.3 Incremental Geometry Improvement Algorithm

We conclude this section by describing a variant of Algorithm 3.1 which avoids the requirement that every model $m_{k}$ is fully linear, and instead handles non-fully linear models by incrementally improving the geometry of the current interpolation set. This core principle—only worry about the geometry of the interpolation set if the algorithm is not progressing, and only make minimal changes to the set if so—is used in many state-of-the-art MBDFO codes, such as Powell’s methods (see Section 8).

At each iteration $k$ of this algorithm, we have an interpolation set $\mathcal{Y}_{k}=\{\bm{y}_{k,1},\ldots,\bm{y}_{k,p}\}$ with $\bm{x}_{k}\in\mathcal{Y}_{k}$, which can be used for linear, quadratic or minimum Frobenius norm quadratic interpolation. The algorithm ensures that the interpolation linear system is invertible at every iteration. After the model is constructed and a tentative step determined, a slightly different trust-region mechanism from that of Algorithm 3.1 is used: if a step is not accepted and the model is not guaranteed to be fully linear, we perform a single iteration of the geometry-improving Algorithm 5.1 (and do not reduce $\Delta_{k}$) instead of declaring an unsuccessful step (where $\Delta_{k}$ is reduced).

Formally, the ‘replace distant point’ and ‘replace bad point’ steps in Algorithm 5.2 state only the minimal requirements that the new geometry-improving point $\bm{y}_{\textnormal{new}}$ must satisfy. In practice, a good choice (motivated by Lemma 5.8 or 5.9) is to find this point by maximizing the magnitude of the relevant Lagrange polynomial inside the current trust region $B(\bm{x}_{k},\Delta_{k})$. More details are given in Remark 5.16 below.

1: Inputs: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$. Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$, acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$, criticality threshold $\mu_{c}>0$, number of interpolation points $p$, distance threshold $\beta>1$ and poisedness threshold $\Lambda>1$
2: Select an initial interpolation set $\mathcal{Y}_{0}\subset\mathbb{R}^{n}$ of size $p$ with $\bm{x}_{0}\in\mathcal{Y}_{0}$ such that the resulting interpolation linear system is invertible.
3: for $k=0,1,2,\ldots$ do
4:     Solve the relevant interpolation problem to obtain the model $m_{k}$ (2.9) and Lagrange polynomials $\ell_{k,1},\ldots,\ell_{k,p}$
5:     Solve the trust-region subproblem (2.10) to get a step $\bm{s}_{k}$ satisfying Assumption 2.9.
6:     Evaluate $f(\bm{x}_{k}+\bm{s}_{k})$ and calculate the ratio $\rho_{k}$ (2.11).
7:     if $\rho_{k}\geq\eta_{U}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ then
8:         ((Very) successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and either $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$ (if $\rho_{k}\geq\eta_{S}$) or $\Delta_{k+1}=\Delta_{k}$ (if $\eta_{U}\leq\rho_{k}<\eta_{S}$)
9:         Set $\mathcal{Y}_{k+1}$ to be any interpolation set of size $p$ with $\bm{x}_{k+1}\in\mathcal{Y}_{k+1}$ so that the resulting interpolation linear system is invertible.
10:     else if $\max_{i=1,\ldots,p}\|\bm{y}_{k,i}-\bm{x}_{k}\|>\beta\Delta_{k}$ then
11:         (Replace distant point) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\Delta_{k}$
12:         Set $\mathcal{Y}_{k+1}=\mathcal{Y}_{k}\setminus\{\bm{y}_{k,i_{k}}\}\cup\{\bm{y}_{\textnormal{new}}\}$, where $i_{k}\in\{1,\ldots,p\}$ satisfies $\|\bm{y}_{k,i_{k}}-\bm{x}_{k}\|>\beta\Delta_{k}$ and $\bm{y}_{\textnormal{new}}\in B(\bm{x}_{k},\Delta_{k})$ satisfies $\ell_{k,i_{k}}(\bm{y}_{\textnormal{new}})\neq 0$
13:     else if $\max_{\bm{y}\in B(\bm{x}_{k},\Delta_{k})}\max_{i=1,\ldots,p\>\text{s.t.}\>\bm{y}_{k,i}\neq\bm{x}_{k}}|\ell_{k,i}(\bm{y})|>\Lambda$ then
14:         (Replace bad point) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\Delta_{k}$
15:         Set $\mathcal{Y}_{k+1}=\mathcal{Y}_{k}\setminus\{\bm{y}_{k,i_{k}}\}\cup\{\bm{y}_{\textnormal{new}}\}$, where $i_{k}\in\{1,\ldots,p\}$ and $\bm{y}_{\textnormal{new}}\in B(\bm{x}_{k},\Delta_{k})$ satisfy $|\ell_{k,i_{k}}(\bm{y}_{\textnormal{new}})|>\Lambda$ and $\bm{y}_{k,i_{k}}\neq\bm{x}_{k}$
16:     else
17:         (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$, $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ and $\mathcal{Y}_{k+1}=\mathcal{Y}_{k}$
18:     end if
19: end for

Algorithm 5.2 Incremental geometry improving MBDFO algorithm for (2.1).

The full algorithm is given in Algorithm 5.2. It is written in a generic way, without fixing a particular interpolation type, which can be linear, fully quadratic or minimum Frobenius quadratic. (For the fully quadratic case, note that a fully quadratic model with $\Delta\leq\Delta_{\max}$ is fully linear with constants $\kappa_{\textnormal{mf}}\Delta_{\max}$ and $\kappa_{\textnormal{mg}}\Delta_{\max}$; alternatively, use minimum Frobenius interpolation with $p=(n+1)(n+2)/2$; see Remark 4.11.)
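The branching in lines 7 to 17 of Algorithm 5.2 can be summarized as a small decision function. This is a sketch of the control flow only: it assumes the quantities $\rho_{k}$, $\|\bm{g}_{k}\|$, the largest distance $\max_{i}\|\bm{y}_{k,i}-\bm{x}_{k}\|$ and the largest Lagrange polynomial magnitude (over points other than $\bm{x}_{k}$) have already been computed, and the parameter values and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Algorithm parameters; the default values are illustrative only."""
    gamma_dec: float = 0.5
    gamma_inc: float = 2.0
    eta_u: float = 0.1
    eta_s: float = 0.7
    mu_c: float = 1.0
    beta: float = 2.0
    lam: float = 2.0


def classify_iteration(rho, g_norm, delta, max_dist, max_lagrange, prm):
    """Return (iteration type, next trust-region radius) following lines 7-17."""
    if rho >= prm.eta_u and g_norm >= prm.mu_c * delta:
        # (Very) successful: accept the step, possibly increase the radius
        new_delta = prm.gamma_inc * delta if rho >= prm.eta_s else delta
        return "successful", new_delta
    if max_dist > prm.beta * delta:
        # Geometry step: some interpolation point is too far from x_k
        return "replace distant point", delta
    if max_lagrange > prm.lam:
        # Geometry step: some Lagrange polynomial is too large in the trust region
        return "replace bad point", delta
    # Geometry is acceptable, so the model is trusted and the radius shrinks
    return "unsuccessful", prm.gamma_dec * delta


prm = Params()
very_successful = classify_iteration(0.8, 1.0, 0.5, 0.4, 1.5, prm)
distant = classify_iteration(-1.0, 1.0, 0.5, 2.0, 1.5, prm)
bad = classify_iteration(-1.0, 1.0, 0.5, 0.4, 5.0, prm)
unsuccessful = classify_iteration(-1.0, 1.0, 0.5, 0.4, 1.5, prm)
```

The ordering of the tests matters: the radius is reduced only after both geometry checks pass, which is the property later used to bound $\Delta_{k}$ from below.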

We begin with some basic results.

Lemma 5.11.

Regarding Algorithm 5.2, we have:

(a)

The interpolation linear system is invertible at every iteration (and hence Lagrange polynomials exist at every iteration);
(b)

If $\max_{i=1,\ldots,p}\|\bm{y}_{k,i}-\bm{x}_{k}\|\leq\beta\Delta_{k}$ and $\max_{\bm{y}\in B(\bm{x}_{k},\Delta_{k})}|\ell_{k,i}(\bm{y})|\leq\Lambda$ for all $i=1,\ldots,p$ except if $\bm{y}_{k,i}=\bm{x}_{k}$, then the model $m_{k}$ is fully linear in $B(\bm{x}_{k},\Delta_{k})$ with constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$ possibly depending on $p$, $\beta$ and $\Lambda$; and
(c)

We can only have finitely many consecutive iterations of types ‘replace distant point’ or ‘replace bad point’ before a (very) successful or unsuccessful iteration must occur.

Proof.

First, we show (a) by induction. The case $k=0$ holds by construction of $\mathcal{Y}_{0}$ , so suppose it is invertible for some iteration $k$ . If iteration $k$ is (very) successful, then the next linear system is invertible by definition of $\mathcal{Y}_{k+1}$ . If iteration $k$ replaces a distant or bad interpolation point, then $\mathcal{Y}_{k+1}=\mathcal{Y}_{k}\setminus\{\bm{y}_{k,i_{k}}\}\cup\{\bm{y}_{\textnormal{new}}\}$ for $\ell_{k,i_{k}}(\bm{y}_{\textnormal{new}})\neq 0$ , so the system is invertible by either Lemma 5.8 or 5.9. Lastly, if iteration $k$ is unsuccessful, then the linear system is unchanged.

To show (b), let $i^{*}$ denote the index such that $\bm{y}_{k,i^{*}}=\bm{x}_{k}$. Since the Lagrange polynomials always form a partition of unity, we have, for any $\bm{y}\in B(\bm{x}_{k},\Delta_{k})$,

\displaystyle|\ell_{k,i^{*}}(\bm{y})|\leq 1+\sum_{\begin{subarray}{c}i=1\\ i\neq i^{*}\end{subarray}}^{p}|\ell_{k,i}(\bm{y})|\leq 1+(p-1)\Lambda.

(5.45)

Hence $\max_{\bm{y}\in B(\bm{x}_{k},\Delta_{k})}\|\bm{\lambda}(\bm{y})\|_{\infty}\leq 1+(p-1)\Lambda$, and the model $m_{k}$ is fully linear in $B(\bm{x}_{k},\Delta_{k})$ with suitable constants by Theorem 5.3 or 5.6 (which also covers fully quadratic interpolation, per Remark 4.11). Finally, (c) is an immediate consequence of Theorem 5.10. ∎

The properties in Lemma 5.11 are sufficient to prove global convergence for Algorithm 5.2 (i.e. $\liminf_{k\to\infty}\|\nabla f(\bm{x}_{k})\|=0$ ), but to get a worst-case complexity bound, we need to slightly strengthen Lemma 5.11 (c) to a uniform bound.

Assumption 5.12.

The number of consecutive iterations of types ‘replace distant point’ or ‘replace bad point’ cannot exceed some $G_{\max}\in\mathbb{N}$ , independent of the starting interpolation set.

It was recently shown that Assumption 5.12 holds with $G_{\max}=\mathcal{O}(p\log n)$ for linear and fully quadratic interpolation [40, Theorem 5.1]. No such bound is known for minimum Frobenius quadratic interpolation, but it is likely that one exists.

Lemma 5.13.

Suppose Assumptions 2.4 (a), 2.9, and 3.3 (b) hold. If, on iteration $k$ of Algorithm 5.2, $\max_{i=1,\ldots,p}\|\bm{y}_{k,i}-\bm{x}_{k}\|\leq\beta\Delta_{k}$ and $\max_{\bm{y}\in B(\bm{x}_{k},\Delta_{k})}|\ell_{k,i}(\bm{y})|\leq\Lambda$ for all $i=1,\ldots,p$ except if $\bm{y}_{k,i}=\bm{x}_{k}$, and $\bm{g}_{k}\neq\bm{0}$ and

\displaystyle\Delta_{k}\leq\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2\kappa_{\textnormal{mf}}},\frac{1}{\kappa_{H}},\frac{1}{\mu_{c}}\right)\|\bm{g}_{k}\|,

(5.46)

where $\kappa_{\textnormal{mf}}$ is from Lemma 5.11 (b), then $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ (i.e. iteration $k$ is very successful).

Proof.

By Lemma 5.11 (b), the model $m_{k}$ is fully linear in $B(\bm{x}_{k},\Delta_{k})$ and so the proof of Lemma 3.5 holds. ∎

Lemma 5.14.

Suppose Assumptions 2.4 (a), 2.9, and 3.3 (b) hold and we run Algorithm 5.2. If $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ for all $k=0,\ldots,K-1$ , then $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for all $k=0,\ldots,K$ , where $\Delta_{\min}(\epsilon)$ is defined in Lemma 3.6.

Proof.

The proof is identical to that of Lemma 3.6, except noting that if $\Delta_{k+1}<\Delta_{k}$ then we must have an unsuccessful iteration, and so $\max_{i=1,\ldots,p}\|\bm{y}_{k,i}-\bm{x}_{k}\|\leq\beta\Delta_{k}$ and $\max_{\bm{y}\in B(\bm{x}_{k},\Delta_{k})}|\ell_{k,i}(\bm{y})|\leq\Lambda$ for all $i=1,\ldots,p$ except if $\bm{y}_{k,i}=\bm{x}_{k}$. This allows us to use Lemma 5.13 to achieve the contradiction. ∎

Our final worst-case complexity result is the following.

Theorem 5.15.

Suppose Assumptions 2.4, 2.9, 3.3 (b) and 5.12 hold. If $k_{\epsilon}$ is the first iteration of Algorithm 5.2 such that $\|\nabla f(\bm{x}_{k_{\epsilon}})\|<\epsilon$ , then $k_{\epsilon}=\mathcal{O}(G_{\max}\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}\epsilon^{-2})$ , also requiring a total of $\mathcal{O}(G_{\max}\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}\epsilon^{-2})$ objective evaluations. Hence $\liminf_{k\to\infty}\|\nabla f(\bm{x}_{k})\|=0$ .

Proof.

The bounds on the number of (very) successful (3.20) and unsuccessful (3.22) iterations from the proof of Theorem 3.7 hold here, where we only need to note that $\Delta_{k}$ is unchanged for ‘replace distant point’ and ‘replace bad point’ iterations to recover (3.22). Lastly, the total number of ‘replace distant point’ and ‘replace bad point’ iterations is at most $G_{\max}(|\mathcal{S}|+|\mathcal{U}|)$ by Assumption 5.12, so the total number of iterations of all types is at most $(G_{\max}+1)(|\mathcal{S}|+|\mathcal{U}|)$ , i.e. a factor of $\mathcal{O}(G_{\max})$ worse than Theorem 3.7. The objective evaluation bound comes from noting that Algorithm 5.2 requires at most 2 evaluations per iteration (once the first interpolation set is chosen). ∎

If we could construct linear interpolation models to give $\kappa_{\textnormal{m}}$ independent of $n$ as in Corollary 4.5, Theorem 5.15 with $G_{\max}=\mathcal{O}(n\log n)$ gives an evaluation complexity bound that is slightly worse than Corollary 4.5 by a factor of $\mathcal{O}(\log n)$ . We can both make $\kappa_{\textnormal{m}}$ independent of $n$ and remove the $\mathcal{O}(\log n)$ factor from the complexity bound by excluding the Lagrange polynomial associated with the current iterate $\bm{x}_{k}$ in the $\Lambda$ -poisedness requirement (i.e. by modifying Definition 5.2) [40]. In general, if the interpolation set is updated well on (very) successful iterations—see Remark 5.16—then $\kappa_{\textnormal{m}}$ and $G_{\max}$ are usually small and the approach of Algorithm 5.2 is efficient in practice.

Remark 5.16.

On (very) successful iterations, we have a flexible choice in how to update the interpolation set (line 9 of Algorithm 5.2). A simple way to do this would be $\mathcal{Y}_{k+1}=\mathcal{Y}_{k}\setminus\{\bm{y}_{k,i_{k}}\}\cup\{\bm{x}_{k+1}\}$ , where $i_{k}$ is any index such that $\ell_{k,i_{k}}(\bm{x}_{k+1})\neq 0$ , which must exist by (5.5). Usually, all Lagrange polynomials will be nonzero at $\bm{x}_{k+1}$ , and so, motivated by the geometry improving procedure, a good choice might be to replace a point that is far from $\bm{x}_{k+1}$ and/or improves the poisedness of the interpolation set (i.e. $|\ell_{k,i_{k}}(\bm{x}_{k+1})|$ is large, as in Algorithm 5.1). For example, in [118], $i_{k}$ is selected as

\displaystyle i_{k}\in\operatorname*{arg\,max}_{i=1,\ldots,p}\left\{\max\left(\frac{\|\bm{y}_{k,i}-\bm{x}_{k+1}\|^{3}}{\Delta_{k}^{3}},1\right)\>|\ell_{k,i}(\bm{x}_{k+1})|\right\}.

(5.47)

For ‘replace distant point’ and ‘replace bad point’ iterations, typically the point removed from the interpolation set is the maximizer of the relevant quantity (distance from $\bm{x}_{k}$ or poisedness), and is replaced with a point that maximizes the magnitude of the Lagrange polynomial of the point being removed, as in Algorithm 5.1.
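The selection rule (5.47) is straightforward to implement once the Lagrange polynomial values at $\bm{x}_{k+1}$ are available. The sketch below assumes those values are supplied by the caller; the function name and the sample data are hypothetical.

```python
import math


def choose_point_to_replace(points, x_new, lagrange_at_x_new, delta):
    """Index i_k maximizing max(||y_i - x_new||^3 / delta^3, 1) * |l_i(x_new)|, as in (5.47)."""

    def score(i):
        dist = math.dist(points[i], x_new)
        return max(dist ** 3 / delta ** 3, 1.0) * abs(lagrange_at_x_new[i])

    return max(range(len(points)), key=score)


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
x_new = (0.5, 0.0)
lagrange_at_x_new = [0.5, 0.6, -0.1]  # illustrative values of l_i(x_new)
i_k = choose_point_to_replace(points, x_new, lagrange_at_x_new, delta=1.0)
```

In this example the distant point is selected even though its Lagrange polynomial value is the smallest in magnitude, because the cubic distance weight dominates.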

Powell’s Methods

The framework of Algorithm 5.2 broadly aligns with the approach in Powell’s highly regarded MBDFO software (see Section 8). The most notable distinction is that two trust-region radii are used in a complex way that partially decouples the step size constraint $\|\bm{s}_{k}\|\leq\Delta_{k}$ from the radius used to assess $\Lambda$ -poisedness. It also uses a simplified interpolation set management procedure:

•

On (very) successful steps, update the interpolation set as per Remark 5.16;
•

Otherwise, check that all interpolation points are sufficiently close to $\bm{x}_{k}$ . If yes, decrease both trust-region radii. If no, replace the furthest interpolation point with the approximate maximizer of (the magnitude of) the associated Lagrange polynomial, and possibly decrease one trust-region radius.

Additionally, the implementation pays careful attention to efficient linear algebra and subproblem solvers. This approach has proven extremely successful in practice, despite the simplified interpolation set management having limited convergence guarantees (e.g. [81]). (The use of two trust-region radii on its own still allows for convergence theory, e.g. [155].) An accessible overview of these algorithms can be found in [123].

Notes and References

More information on Lebesgue constants in approximation theory may be found in resources such as [144]. The theory of Lagrange polynomials to construct fully linear/fully quadratic interpolation models was originally developed in [46] for linear/quadratic interpolation and [45] for minimum Frobenius norm interpolation, with these results collected in [51]. We again note that minimum Frobenius norm quadratic models were originally studied in [119].

More concise alternatives to (5.8) and (5.16) are the bounds [117, 33]

\displaystyle|m(\bm{y})-f(\bm{y})|\leq\frac{L_{1}}{2}\sum_{i=1}^{p}|\ell_{i}(\bm{y})|\>\|\bm{y}-\bm{y}_{i}\|^{2},\qquad\text{and}\qquad|m(\bm{y})-f(\bm{y})|\leq\frac{L_{2}}{6}\sum_{i=1}^{p}|\ell_{i}(\bm{y})|\>\|\bm{y}-\bm{y}_{i}\|^{3},

(5.48)

for all $\bm{y}\in\mathbb{R}^{n}$ , for linear and (fully) quadratic interpolation under Assumptions 2.4 (a) and 2.7 (a) respectively.

The incremental geometry improving algorithm (Algorithm 5.2) is based on the approach from [131], which also includes an example where simply updating $\mathcal{Y}_{k}$ by removing the furthest point from $\bm{x}_{k}$ and replacing it with $\bm{x}_{k}+\bm{s}_{k}$ does not converge, demonstrating that consideration of interpolation set geometry is necessary for a convergent MBDFO method. The presentation in Section 5.3 draws on the complexity bounds from the more recent analysis [40], which also proves that Assumption 5.12 can be satisfied for polynomial interpolation over an arbitrary basis.

Recent work on Lagrange polynomials in MBDFO has included refining the error bounds from Lagrange polynomials [33], extending to the constrained optimization case [86, 130] (see Section 6.1), and drawing connections with methods for outlier detection [156].

6 Constrained Optimization

In this section we consider how the above algorithmic ideas and approximation theory can be adapted to constrained optimization problems. We first consider simple convex constraints (not involving any derivative-free functions), such as bounds, and then consider the case of general, potentially derivative-free, constraints.

6.1 Simple Convex Constraints

We first consider the case of MBDFO in the presence of simple constraints on the decision variables. We will focus on the case where the feasible set is an easy-to-describe, convex set. Specifically, we will extend the above ideas to solve problems of the form

\displaystyle\min_{\bm{x}\in\mathcal{C}}f(\bm{x}),

(6.1)

where $f:\mathbb{R}^{n}\to\mathbb{R}$ is, as usual, smooth and nonconvex but with only a zeroth order oracle available, and the feasible set $\mathcal{C}$ satisfies the following.

Assumption 6.1.

The feasible set $\mathcal{C}\subseteq\mathbb{R}^{n}$ in (6.1) is closed, convex and has nonempty interior.

This covers some of the most important constraint types, such as lower/upper bounds and linear inequality constraints. To allow us to apply our method to many choices of feasible sets, the only information about $\mathcal{C}$ available to our algorithm will be its Euclidean projection operator,

\displaystyle\operatorname{proj}_{\mathcal{C}}(\bm{x}):=\operatorname*{arg\,min}_{\bm{y}\in\mathcal{C}}\|\bm{y}-\bm{x}\|,

(6.2)

which returns the closest point in $\mathcal{C}$ to $\bm{x}$ . This function is well-defined (i.e. there is a unique minimizer) whenever $\mathcal{C}$ satisfies Assumption 6.1 [18, Theorem 6.25]. Since $\operatorname{proj}_{\mathcal{C}}(\bm{x})=\bm{x}$ if and only if $\bm{x}\in\mathcal{C}$ , the projection operator also gives us a test for feasibility. There are many simple sets $\mathcal{C}$ for which $\operatorname{proj}_{\mathcal{C}}$ is easy to compute (see [18, Table 6.1] for others), for example:

•

For bound constraints $\mathcal{C}=\{\bm{x}:\bm{a}\leq\bm{x}\leq\bm{b}\}$ (including unbounded constraints, $a_{i}=-\infty$ and/or $b_{i}=\infty$ ), we have $[\operatorname{proj}_{\mathcal{C}}(\bm{x})]_{i}=\min(\max(a_{i},x_{i}),b_{i})$ for $i=1,\ldots,n$ ; and
•

If $\mathcal{C}=\{\bm{x}:\bm{a}^{T}\bm{x}\leq b\}$ for $\bm{a}\neq\bm{0}$ , then $\operatorname{proj}_{\mathcal{C}}(\bm{x})=\bm{x}-\frac{\max(\bm{a}^{T}\bm{x}-b,0)}{\|\bm{a}\|^{2}}\bm{a}$ .

If we have many sets, each of which has its own projection operator, the projection onto the intersection of these sets can be computed using Dykstra’s algorithm [28].
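The two projections listed above, and their combination, can be sketched as follows. The Dykstra iteration shown is the standard two-set form with correction terms, run for a fixed number of sweeps; the function names, the iteration count and the test problem are assumptions made for illustration.

```python
def proj_box(x, lower, upper):
    """Projection onto {x : lower <= x <= upper}, componentwise clipping."""
    return [min(max(a, xi), b) for xi, a, b in zip(x, lower, upper)]


def proj_halfspace(x, a, b):
    """Projection onto {x : a^T x <= b} for a nonzero vector a."""
    violation = max(sum(ai * xi for ai, xi in zip(a, x)) - b, 0.0)
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]


def dykstra(x, proj_a, proj_b, iters=200):
    """Approximate projection onto the intersection of two closed convex sets."""
    x = list(x)
    p = [0.0] * len(x)  # correction term associated with the first set
    q = [0.0] * len(x)  # correction term associated with the second set
    for _ in range(iters):
        y = proj_a([xi + pi for xi, pi in zip(x, p)])
        p = [xi + pi - yi for xi, pi, yi in zip(x, p, y)]
        x = proj_b([yi + qi for yi, qi in zip(y, q)])
        q = [yi + qi - xi for yi, qi, xi in zip(y, q, x)]
    return x


# Intersection of the unit box [0,1]^2 with the half-space x_1 + x_2 <= 1
box = lambda v: proj_box(v, [0.0, 0.0], [1.0, 1.0])
half = lambda v: proj_halfspace(v, [1.0, 1.0], 1.0)
projected = dykstra([2.0, 2.0], box, half)
```

Simply alternating the two projections without the correction terms would return a point in the intersection, but in general not the closest one, which is why the corrections `p` and `q` are carried along.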

A good criticality measure for (6.1), to assess how close we are to a solution, is the following:


$\displaystyle\pi(\bm{x}):=-\min_{\bm{d}\in\mathbb{R}^{n}}$	$\displaystyle\>\nabla f(\bm{x})^{T}\bm{d},\qquad\forall\bm{x}\in\mathcal{C},$	(6.3a)
s.t.	$\displaystyle\>\bm{x}+\bm{d}\in\mathcal{C},$	(6.3b)
	$\displaystyle\>\|\bm{d}\|\leq 1.$	(6.3c)

(Other criticality measures can also be used, with another common choice being $\pi(\bm{x})=\|\operatorname{proj}_{\mathcal{C}}(\bm{x}-\nabla f(\bm{x}))-\bm{x}\|$.)

In the unconstrained case, $\mathcal{C}=\mathbb{R}^{n}$ , this reduces to $\pi(\bm{x})=\|\nabla f(\bm{x})\|$ , as we might hope. The suitability of $\pi(\bm{x})$ as a criticality measure comes from the following.

Proposition 6.2 (Theorem 12.1.6, [47]).

If $f$ is twice continuously differentiable, then $\pi$ is continuous in $\bm{x}$, $\pi(\bm{x})\geq 0$ for all $\bm{x}\in\mathcal{C}$, and $\pi(\bm{x})=0$ if and only if $\bm{x}$ is a first-order critical point of (6.1). (A point $\bm{x}$ is first-order critical for (6.1) if $-\nabla f(\bm{x})\in\mathcal{N}(\bm{x})$, where $\mathcal{N}(\bm{x})$ is the normal cone for $\mathcal{C}$ at $\bm{x}$. This is a first-order necessary condition for a constrained local minimizer [111, Theorem 12.8].)
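As a brief check of the unconstrained reduction $\pi(\bm{x})=\|\nabla f(\bm{x})\|$ noted before Proposition 6.2: when $\mathcal{C}=\mathbb{R}^{n}$ the constraint (6.3b) is vacuous, and the Cauchy-Schwarz inequality gives $\nabla f(\bm{x})^{T}\bm{d}\geq-\|\nabla f(\bm{x})\|\,\|\bm{d}\|\geq-\|\nabla f(\bm{x})\|$ for all $\|\bm{d}\|\leq 1$, with equality at $\bm{d}=-\nabla f(\bm{x})/\|\nabla f(\bm{x})\|$ when $\nabla f(\bm{x})\neq\bm{0}$. Hence

\displaystyle\pi(\bm{x})=-\min_{\|\bm{d}\|\leq 1}\nabla f(\bm{x})^{T}\bm{d}=-\left(-\|\nabla f(\bm{x})\|\right)=\|\nabla f(\bm{x})\|,

and the case $\nabla f(\bm{x})=\bm{0}$ gives $\pi(\bm{x})=0$ directly.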

Our goal in this section is to develop a strictly feasible trust-region method for solving (6.1); that is, where we can only evaluate $f$ at feasible points. This may be necessary, for example, if we have a $\sqrt{x}$ term in $f$ with bounds $x\geq 0$ .

If at iteration $k$ we have a quadratic model $m_{k}$ (3.1) for the objective, we naturally get an estimate of the criticality measure $\pi(\bm{x}_{k})$ (6.3), namely


$\displaystyle\pi^{m}_{k}:=-\min_{\bm{d}\in\mathbb{R}^{n}}$	$\displaystyle\>\bm{g}_{k}^{T}\bm{d},$	(6.4a)
s.t.	$\displaystyle\>\bm{x}_{k}+\bm{d}\in\mathcal{C},$	(6.4b)
	$\displaystyle\>\|\bm{d}\|\leq 1.$	(6.4c)

Again, if $\mathcal{C}=\mathbb{R}^{n}$ then $\pi^{m}_{k}=\|\bm{g}_{k}\|$ . Assuming strict feasibility, our trust-region subproblem now looks like: set $\bm{s}_{k}$ to be an approximate minimizer of

\displaystyle\min_{\bm{s}\in\mathbb{R}^{n}}m_{k}(\bm{x}_{k}+\bm{s})\hskip 18.49988pt\text{s.t.}\hskip 18.49988pt\bm{x}_{k}+\bm{s}\in\mathcal{C},\qquad\text{and}\qquad\|\bm{s}\|\leq\Delta_{k}.

(6.5)

The presence of the feasibility condition $\bm{x}_{k}+\bm{s}\in\mathcal{C}$ means that the theory and algorithms in Section 2.2 for solving the trust-region subproblem no longer apply. In this setting, the analog of the Cauchy point is to consider the projected gradient path, $\bm{s}_{k}(t):=\operatorname{proj}_{\mathcal{C}}(\bm{x}_{k}-t\bm{g}_{k})-\bm{x}_{k}$ . A suitable linesearch applied to $\bm{s}_{k}(t)$ can yield [47, Theorem 12.2.2] a step $\bm{s}_{k}=\bm{s}_{k}(t^{*})$ satisfying the following assumption, the constrained version of Assumption 2.9. Importantly, instead of the sufficient decrease condition depending on $\|\bm{g}_{k}\|$ , it now depends on $\pi^{m}_{k}$ .

Assumption 6.3.

For all $k$ , the computed step $\bm{s}_{k}$ (6.5) satisfies $\bm{x}_{k}+\bm{s}_{k}\in\mathcal{C}$ , $\|\bm{s}_{k}\|\leq\Delta_{k}$ and

\displaystyle m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\pi^{m}_{k}\min\left(\Delta_{k},\frac{\pi^{m}_{k}}{\|\bm{H}_{k}\|+1},1\right),

(6.6)

for some $\kappa_{s}\in(0,\frac{1}{2})$ .

Assumption 6.3 is almost identical to Assumption 2.9, with the unconstrained criticality measure $\|\bm{g}_{k}\|$ replaced with $\pi^{m}_{k}$ . Hence, as we shall see, this new assumption will not have a significant impact on the convergence of our algorithm.
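To illustrate the projected gradient path $\bm{s}_{k}(t)$ , here is a minimal Python sketch for the case where $\mathcal{C}$ is a box, so that the projection is a componentwise clip. The backtracking rule, the Armijo-type constant `c`, and the function name `projected_gradient_step` are assumptions made for this example; this is a simplified stand-in and not the specific linesearch analyzed in [47, Theorem 12.2.2].

```python
import numpy as np

def projected_gradient_step(x, g, H, lower, upper, Delta,
                            t0=1.0, shrink=0.5, c=1e-4, max_backtracks=60):
    """Sketch: backtrack along s(t) = proj_C(x - t*g) - x for a box C.

    Returns a step that keeps x + s in the box, has ||s|| <= Delta, and
    gives an Armijo-type decrease in the quadratic model.
    """
    def model_decrease(s):
        # m(x) - m(x + s) for m(x + s) = c + g^T s + 0.5 s^T H s
        return -(g @ s + 0.5 * s @ H @ s)

    t = t0
    for _ in range(max_backtracks):
        s = np.clip(x - t * g, lower, upper) - x   # point on the projected path
        inside = np.linalg.norm(s) <= Delta
        if inside and model_decrease(s) >= -c * (g @ s):
            return s
        t *= shrink                                # shorten the path parameter
    return np.zeros_like(x)                        # fall back to the zero step
```

For example, with $\bm{x}=\bm{0}$ , $\bm{g}=(1,-1)^{T}$ , $\bm{H}=\bm{I}$ , the box $[0,1]^{2}$ and $\Delta=0.5$ , the first trial $t=1$ gives a step of length 1, which is too long, and the second trial $t=0.5$ gives $\bm{s}=(0,0.5)^{T}$ , which is accepted.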

The biggest change comes from the model accuracy requirements. Our fully linear definition (Definition 3.1) is based on comparing the model with the objective at all points inside the trust region. However, our strict feasibility requirement means that our interpolation model can only be constructed using points in $\mathcal{C}$ . This limits our ability to have accurate models outside $\mathcal{C}$ , which is required for Definition 3.1 to be satisfied.

Example 6.4.

Suppose we have problem dimension $n=2$ and our feasible region is defined by the simple bound constraints $\mathcal{C}=\{\bm{x}\in\mathbb{R}^{2}:|x_{2}|\leq\delta\}$ for some small $\delta\in(0,1)$ . Now suppose we have the base point $\bm{x}=\bm{0}$ and we wish to perform linear interpolation to build a fully linear model in $B(\bm{0},1)$ (i.e. $\Delta=1$ ). Following the approach in Section 4.1, a natural choice of feasible interpolation points is $\{\bm{0},\bm{e}_{1},\delta\,\bm{e}_{2}\}$ . The resulting interpolation model can be shown to be fully linear with $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(1/\delta)$ , and so our fully linear constants can be made arbitrarily large just by changing the feasible region. So, if we continued to use Definition 3.1 as our measure of model accuracy, the worst-case complexity of our algorithm could grow like $\mathcal{O}(\delta^{-2}\epsilon^{-2})$ for accuracy level $\epsilon$ .

However, this definition is stronger than we need. Requiring strictly feasible interpolation points only limits our ability to approximate $f$ outside the feasible region, but that is exactly the region which is not relevant to our optimization. So, we now introduce a generalized definition of fully linear interpolation models, capturing the notion that we only care about model accuracy within the feasible region.

Definition 6.5.

Suppose we have $\bm{x}\in\mathcal{C}$ and $\Delta>0$ . A local model $m:\mathbb{R}^{n}\to\mathbb{R}$ approximating $f:\mathbb{R}^{n}\to\mathbb{R}$ is $\mathcal{C}$ -feasible fully linear in $B(\bm{x},\Delta)$ if there exist constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$ , independent of $m$ , $\bm{x}$ and $\Delta$ , such that


$$
\begin{aligned}
\max_{\begin{subarray}{c}\bm{y}\in\mathcal{C}\\ \bm{y}\in B(\bm{x},\Delta)\end{subarray}}|m(\bm{y})-f(\bm{y})| &\leq\kappa_{\textnormal{mf}}\Delta^{2}, && \text{(6.7a)}\\
\max_{\begin{subarray}{c}\bm{x}+\bm{d}\in\mathcal{C}\\ \|\bm{d}\|\leq 1\end{subarray}}|(\nabla m(\bm{x})-\nabla f(\bm{x}))^{T}\bm{d}| &\leq\kappa_{\textnormal{mg}}\Delta. && \text{(6.7b)}
\end{aligned}
$$

Again, we will sometimes use $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}})$ for notational convenience.

In particular, we note the distinction $\|\bm{y}-\bm{x}\|\leq\Delta$ in (6.7a) but $\|\bm{d}\|\leq 1$ in (6.7b) (as in (6.4c)). Also, compared to Definition 3.1, the gradient accuracy condition (6.7b) only considers the accuracy of $\nabla m(\bm{x})\approx\nabla f(\bm{x})$ , not at any other points in the trust region. (Indeed, our convergence analysis in Section 3 only requires (3.5b) to hold at $\bm{y}=\bm{x}$ .)

We are now ready to state our main MBDFO algorithm for solving (6.1), given in Algorithm 6.1. It is essentially the same as Algorithm 3.1, but replacing $\|\bm{g}_{k}\|$ with $\pi^{m}_{k}$ , and using our new sufficient decrease and fully linear conditions. Specifically, our new model assumptions are given below.

Assumption 6.6.

At each iteration $k$ of Algorithm 6.1, the model $m_{k}$ (3.1) satisfies:

(a)

$m_{k}$ is $\mathcal{C}$ -feasible fully linear in $B(\bm{x}_{k},\Delta_{k})$ with constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$ independent of $k$ ;
(b)

$\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some $\kappa_{H}\geq 1$ (independent of $k$ ).

1: Starting point $\bm{x}_{0}\in\mathcal{C}$ and trust-region radius $\Delta_{0}>0$ . Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$ , acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$ , and criticality threshold $\mu_{c}>0$ .

2: for $k=0,1,2,\ldots$

3: Build a local quadratic model $m_{k}$ (3.1) satisfying Assumption 6.6.

4: Solve the trust-region subproblem (6.5) to get a step $\bm{s}_{k}$ satisfying Assumption 6.3.

5: Evaluate $f(\bm{x}_{k}+\bm{s}_{k})$ and calculate the ratio $\rho_{k}$ (2.11).

6: if $\rho_{k}\geq\eta_{S}$ and $\pi^{m}_{k}\geq\mu_{c}\Delta_{k}$ then

7: (Very successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$ .

8: else if $\eta_{U}\leq\rho_{k}<\eta_{S}$ and $\pi^{m}_{k}\geq\mu_{c}\Delta_{k}$ then

9: (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\Delta_{k}$ .

10: else

11: (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ .

12: end if

13: end for

Algorithm 6.1 Constrained MBDFO trust-region method for solving (6.1).

We begin by stating the obvious result that all iterates of Algorithm 6.1 are feasible.

Lemma 6.7.

Suppose Assumption 6.3 holds and we run Algorithm 6.1. Then $\bm{x}_{k}\in\mathcal{C}$ for all $k$ .

Proof.

This holds by induction on $k$ because $\bm{x}_{0}\in\mathcal{C}$ by definition, and $\bm{x}_{k+1}\in\{\bm{x}_{k},\bm{x}_{k}+\bm{s}_{k}\}$ , with $\bm{x}_{k}+\bm{s}_{k}\in\mathcal{C}$ whenever $\bm{x}_{k}\in\mathcal{C}$ from Assumption 6.3. ∎
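The acceptance logic in steps 6 to 12 of Algorithm 6.1 can be sketched in a few lines of Python. The function name `trust_region_update` and the default parameter values are hypothetical choices for illustration only; the model construction (step 3) and subproblem solve (step 4) are deliberately left out.

```python
def trust_region_update(x, s, Delta, rho, pi_m,
                        eta_U=0.1, eta_S=0.7,
                        gamma_dec=0.5, gamma_inc=2.0, mu_c=1.0):
    """Sketch of steps 6-12 of Algorithm 6.1 (parameter defaults are assumed).

    x, s  : current iterate and trial step (floats or NumPy arrays)
    rho   : ratio of actual to predicted decrease
    pi_m  : model criticality measure pi^m_k from (6.4)
    """
    criticality_ok = pi_m >= mu_c * Delta
    if rho >= eta_S and criticality_ok:
        return x + s, gamma_inc * Delta, "very successful"
    if rho >= eta_U and criticality_ok:
        return x + s, Delta, "successful"
    # Either the decrease ratio is poor or Delta is large relative to pi^m_k
    return x, gamma_dec * Delta, "unsuccessful"
```

Note that an iteration with a large ratio $\rho_{k}$ is still treated as unsuccessful if $\pi^{m}_{k}<\mu_{c}\Delta_{k}$ , and that the iterate only ever moves to $\bm{x}_{k}+\bm{s}_{k}$ , which is feasible under Assumption 6.3.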

The $\mathcal{C}$ -feasible fully linear condition (6.7b) is needed principally to ensure the following, which essentially replaces (3.5b) in measuring the error in the criticality measure.

Lemma 6.8 (Lemma 2.4, [86]).

Suppose Assumptions 2.4 (a) and 6.6 hold. Then $|\pi(\bm{x}_{k})-\pi^{m}_{k}|\leq\kappa_{\textnormal{mg}}\Delta_{k}$ for all iterations $k$ of Algorithm 6.1.

Now, the worst-case complexity of Algorithm 6.1 can be proven using essentially identical arguments to the complexity of Algorithm 3.1, replacing $\|\nabla f(\bm{x}_{k})\|$ and $\|\bm{g}_{k}\|$ with $\pi(\bm{x}_{k})$ and $\pi^{m}_{k}$ respectively.

Theorem 6.9.

Suppose Assumptions 2.4, 6.3 and 6.6 hold and we run Algorithm 6.1. If $k_{\epsilon}$ is the first iteration of Algorithm 6.1 such that $\pi(\bm{x}_{k})<\epsilon$ , then $k_{\epsilon}=\mathcal{O}(\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}\epsilon^{-2})$ . Hence $\liminf_{k\to\infty}\pi(\bm{x}_{k})=0$ .

Proof.

See [86, Theorem 3.14] for details. ∎

We now have a worst-case complexity bound for Algorithm 6.1 which matches the unconstrained version, Corollary 3.8 for Algorithm 3.1. What remains is to specify how to construct $\mathcal{C}$ -feasible fully linear models using only feasible points.

6.1.1 Constructing Feasible Interpolation Models

We wish to construct interpolation models that are $\mathcal{C}$ -feasible fully linear (Definition 6.5) using only feasible interpolation points. To do this, we will use the notions of Lebesgue measure and $\Lambda$ -poisedness (Definition 5.2), since these more naturally extend to the constrained case than the matrix norms in Section 4.

Since we care about the accuracy of our model only at feasible points, the natural generalization of Definition 5.2 to the constrained case is to measure the size of Lagrange polynomials only within the feasible region.

Definition 6.10.

Suppose we have an interpolation set $\mathcal{Y}:=\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset\mathcal{C}$ such that its Lagrange polynomials exist. Given $\bm{x}\in\mathcal{C}$ and $\Delta>0$ , we say that $\mathcal{Y}$ is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ for some $\Lambda>0$ if

\displaystyle\max_{\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}}\|\bm{\lambda}(\bm{y})\|_{\infty}\leq\Lambda.

(6.8)

Aside from only considering $\bm{y}\in\mathcal{C}$ , the main difference in Definition 6.10 compared to Definition 5.2 is that we also only consider $\bm{y}\in B(\bm{x},\min(\Delta,1))$ rather than $B(\bm{x},\Delta)$ . This is required to handle the differing constraints $\|\bm{y}-\bm{x}\|\leq\Delta$ and $\|\bm{d}\|\leq 1$ in (6.7).

Linear Interpolation Models

First, suppose we construct linear interpolation models using the interpolation set $\mathcal{Y}=\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset\mathcal{C}$ , for $p=n+1$ , by solving (4.17) or (4.18), as usual. We first note that we have an alternative version of (5.8), namely

\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}}{2}\beta^{2}\|\bm{\lambda}(\bm{y})\|_{1}\min(\Delta,1)^{2},

(6.9)

for all $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ , where the proof is identical to the unconstrained case, just replacing $B(\bm{x},\Delta)$ with $B(\bm{x},\min(\Delta,1))$ . If $\mathcal{Y}$ is $\Lambda$ -poised, then we have $\|\bm{\lambda}(\bm{y})\|_{1}\leq p\Lambda$ . By treating the cases $\Delta\leq 1$ and $\Delta>1$ separately, we get a constrained version of Theorem 5.3.

Theorem 6.11.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a linear model (4.16) for $f$ by solving (4.18), where we assume $\hat{\bm{M}}$ is invertible. If $\bm{x}\in\mathcal{C}$ , with $\bm{y}_{i}\in\mathcal{C}$ and $\|\bm{y}_{i}-\bm{x}\|\leq\beta\min(\Delta,1)$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=n+1$ is the number of interpolation points), and the interpolation set is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ , then the model is $\mathcal{C}$ -feasible fully linear in $B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=L_{1}\beta^{2}p\Lambda+\frac{L_{1}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{mg}}=L_{1}\beta^{2}p\Lambda.

(6.10)

Proof.

See [86, Theorem 4.4]. ∎

The fully linear constants in Theorem 6.11 are of size $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(L_{1}p\Lambda)$ , which matches the values in the unconstrained case $\mathcal{C}=\mathbb{R}^{n}$ from Theorem 5.3.

Example 6.12 (Example 6.4 revisited).

Consider again the case $n=2$ with $\mathcal{C}=\{\bm{x}\in\mathbb{R}^{2}:|x_{2}|\leq\delta\}$ , with linear interpolation in $B(\bm{0},1)$ using $\{\bm{0},\bm{e}_{1},\delta\>\bm{e}_{2}\}$ . By considering the enlarged trust region $[-1,1]^{2}$ instead of $B(\bm{0},1)$ , we see that the interpolation set is $\Lambda$ -poised in $B(\bm{0},1)\cap\mathcal{C}$ with $\Lambda\leq 3$ and hence $\kappa_{\textnormal{mf}}$ and $\kappa_{\textnormal{mg}}$ are independent of $\delta$ from Theorem 6.11, a significant improvement over $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}=\mathcal{O}(1/\delta)$ by applying the unconstrained approach from Definition 3.1 and Theorem 4.2.
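The contrast in Example 6.12 can be checked numerically. The Python sketch below computes the linear Lagrange polynomials of $\{\bm{0},\bm{e}_{1},\delta\,\bm{e}_{2}\}$ and estimates the poisedness constant by sampling a grid, first over all of $B(\bm{0},1)$ and then over $B(\bm{0},1)\cap\mathcal{C}$ . The helper name `lagrange_values`, the value $\delta=0.01$ and the grid resolution are assumptions for illustration, and a grid maximum is only a sampled estimate (a lower bound) of the true maximum in (6.8).

```python
import numpy as np

def lagrange_values(Y, x, pts):
    """Values of the linear Lagrange polynomials of the set Y at each row of pts."""
    M = np.hstack([np.ones((len(Y), 1)), Y - x])        # rows [1, (y_i - x)^T]
    Phi = np.hstack([np.ones((len(pts), 1)), pts - x])  # rows [1, (y - x)^T]
    return np.linalg.solve(M.T, Phi.T).T                # row r holds lambda(pts[r])

delta = 0.01
x = np.zeros(2)
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, delta]])    # {0, e1, delta*e2}

# Sample B(0,1) on a grid, then mark the feasible part |y_2| <= delta
grid = np.linspace(-1.0, 1.0, 201)
pts = np.array([[a, b] for a in grid for b in grid])
in_ball = np.linalg.norm(pts, axis=1) <= 1.0 + 1e-12
feasible = np.abs(pts[:, 1]) <= delta + 1e-12

lam = np.abs(lagrange_values(Y, x, pts))
Lambda_ball = lam[in_ball].max()             # sampled estimate over B(0,1)
Lambda_feas = lam[in_ball & feasible].max()  # sampled estimate over B(0,1) and C
print(Lambda_ball, Lambda_feas)
```

Here the Lagrange polynomials are $\ell_{1}(\bm{y})=1-y_{1}-y_{2}/\delta$ , $\ell_{2}(\bm{y})=y_{1}$ and $\ell_{3}(\bm{y})=y_{2}/\delta$ , so the estimate over the full ball grows like $1/\delta$ while the estimate over the feasible part stays below 3.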

Minimum Frobenius Norm Models

A similar approach works if we wish to use minimum Frobenius norm quadratic interpolation (4.52), with the same definition of Lebesgue measure and $\Lambda$ -poisedness (Definition 6.10).

To begin, we state a bound on the size of the model Hessian, similar to (5.33) in the unconstrained case. Unfortunately we cannot bound $\|\bm{H}\|$ entirely; we instead bound certain Rayleigh quotient-type expressions.

Lemma 6.13.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.79), where we assume $\hat{\bm{F}}$ is invertible. If $\bm{x}\in\mathcal{C}$ , with $\bm{y}_{i}\in\mathcal{C}$ and $\|\bm{y}_{i}-\bm{x}\|\leq\beta\min(\Delta,1)$ for some $\beta>0$ and all $i=1,\ldots,p$ , and $\{\bm{y}_{1},\ldots,\bm{y}_{p}\}$ is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ then the model Hessian $\bm{H}$ satisfies

\displaystyle\max_{i,j=1,\ldots,p}\frac{|(\bm{y}_{i}-\bm{x})^{T}\bm{H}(\bm{y}_{j}-\bm{x})|}{\beta^{2}\min(\Delta,1)^{2}}\leq\kappa^{\mathcal{C}}_{H}:=L_{1}p(8\Lambda\beta^{2}+36\Lambda\beta+58\Lambda+6),

(6.11)

and

\displaystyle|(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x})|\leq\kappa^{\mathcal{C}}_{H}\beta^{2}p\Lambda^{2}\min(\Delta,1)^{2},

(6.12)

for all $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ .

Proof.

The proof of (6.11) uses similar ideas to the proof of (5.33) but requires a more lengthy calculation which we omit. See [130, Lemma 4.4] for details. To show (6.12), fix $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ . We recall from (5.38) that we may write $\bm{y}-\bm{x}=\sum_{i=1}^{p}\ell_{i}(\bm{y})(\bm{y}_{i}-\bm{x})$ . So, from (6.11) we get

$$
\begin{aligned}
|(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x})| &\leq\sum_{i,j=1}^{p}|\ell_{i}(\bm{y})|\>|\ell_{j}(\bm{y})|\>|(\bm{y}_{i}-\bm{x})^{T}\bm{H}(\bm{y}_{j}-\bm{x})|, && \text{(6.13)}\\
&\leq\kappa^{\mathcal{C}}_{H}\beta^{2}\min(\Delta,1)^{2}\|\bm{\lambda}(\bm{y})\|_{1}\max_{i=1,\ldots,p}|\ell_{i}(\bm{y})|, && \text{(6.14)}
\end{aligned}
$$

and we are done, after noting $\|\bm{\lambda}(\bm{y})\|_{1}\leq p\Lambda$ and $\max_{i=1,\ldots,p}|\ell_{i}(\bm{y})|\leq\Lambda$ . ∎

Remark 6.14.

If we have $\|\bm{H}\|\leq\kappa_{H}-1$ (as in Assumption 6.6 (b)), then we can replace (6.12) with $|(\bm{y}-\bm{x})^{T}\bm{H}(\bm{y}-\bm{x})|\leq\kappa_{H}\min(\Delta,1)^{2}$ .

The constrained version of Theorem 5.6 is the following.

Theorem 6.15.

Suppose $f$ satisfies Assumption 2.4 (a) and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.79), where we assume $\hat{\bm{F}}$ is invertible. If $\bm{x}\in\mathcal{C}$ , with $\bm{y}_{i}\in\mathcal{C}$ and $\|\bm{y}_{i}-\bm{x}\|\leq\beta\min(\Delta,1)$ for some $\beta>0$ , and the interpolation set is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ , then the model is $\mathcal{C}$ -feasible fully linear in $B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{3(L_{1}+\kappa^{\mathcal{C}}_{H})}{2}\beta^{2}p\Lambda+\frac{1}{2}\kappa^{\mathcal{C}}_{H}\beta^{2}p\Lambda^{2}+\frac{L_{1}}{2},\hskip 18.49988pt\text{and}\hskip 18.49988pt\kappa_{\textnormal{mg}}=(L_{1}+\kappa^{\mathcal{C}}_{H})\beta^{2}p\Lambda,

(6.15)

with $\kappa^{\mathcal{C}}_{H}$ defined in Lemma 6.13.

Proof.

First, we follow the proof of Theorem 5.6, except using $|(\bm{y}_{i}-\bm{x})^{T}\bm{H}(\bm{y}_{j}-\bm{x})|\leq\kappa^{\mathcal{C}}_{H}\beta^{2}\min(\Delta,1)^{2}$ from (6.11) in the derivation of (5.39) to get

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}+\kappa^{\mathcal{C}}_{H}}{2}\beta^{2}\|\bm{\lambda}(\bm{y})\|_{1}\min(\Delta,1)^{2},

(6.16)

for all $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ . Otherwise, the proof is largely similar to that of Theorem 6.11; see [130, Theorem 4.7] for more details. ∎

Model Improvement

As in Section 5.2, we can use our new notion of $\Lambda$ -poisedness (Definition 6.10) to build algorithms to improve the geometry of an interpolation set. Just like the unconstrained case, this works equally well in the linear and minimum Frobenius norm quadratic interpolation cases. Indeed, our definitions ensure that Algorithm 5.1 extends trivially to the constrained case, as shown in Algorithm 6.2. The only changes are that the initial interpolation set must all be feasible, and our new point $\bm{y}$ must also be feasible.

1: Interpolation set $\mathcal{Y}$ which is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ for $\bm{x}\in\mathcal{C}$ , with $\bm{y}_{i}\in\mathcal{C}$ and $\|\bm{y}_{i}-\bm{x}\|\leq\beta\min(\Delta,1)$ for some $\beta\geq 1$ , desired poisedness constant $\Lambda^{*}>1$ .

2: while $\Lambda>\Lambda^{*}$

3: Find $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ and $i\in\{1,\ldots,p\}$ such that $|\ell_{i}(\bm{y})|>\Lambda^{*}$ .

4: Update $\mathcal{Y}$ by replacing $\bm{y}_{i}$ with $\bm{y}$ .

5: Recompute the poisedness constant $\Lambda$ of the new $\mathcal{Y}$ .

6: end while

7: return $\mathcal{Y}$

Algorithm 6.2 Interpolation set improvement (convex constrained case).

Theorem 6.16.

Algorithm 6.2 terminates in finite time, and the resulting interpolation set $\mathcal{Y}$ is $\Lambda^{*}$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ , and we have $\bm{y}_{i}\in\mathcal{C}$ and $\|\bm{y}_{i}-\bm{x}\|\leq\beta\min(\Delta,1)$ for all $\bm{y}_{i}\in\mathcal{Y}$ .

Proof.

The proof is identical to that of Theorem 5.10 with $\Delta$ replaced by $\min(\Delta,1)$ , since Lemmas 5.8 and 5.9 still hold. ∎

However, the assumption in Algorithm 6.2 that the initial interpolation set is $\Lambda$ -poised in $B(\bm{x},\Delta)\cap\mathcal{C}$ for some $\Lambda$ (with only feasible points) is not so easy to ensure. In the unconstrained case, we saw example sets in Section 4 that ensure the relevant interpolation linear system is invertible, but these do not necessarily work here as we cannot guarantee they are feasible. Instead, we can use Algorithm 6.3 to construct a strictly feasible set with invertible interpolation linear system.

1: Interpolation region $B(\bm{x},\min(\Delta,1))$ .

2: Construct an initial interpolation set $\mathcal{Y}\subset B(\bm{x},\min(\Delta,1))$ which has an invertible interpolation linear system (e.g. as in Section 4).

3: while there exists $\bm{y}_{i}\in\mathcal{Y}$ with $\bm{y}_{i}\notin\mathcal{C}$

4: Find $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ such that $\ell_{i}(\bm{y})\neq 0$ .

5: Update $\mathcal{Y}$ by replacing $\bm{y}_{i}$ with $\bm{y}$ .

6: end while

7: return $\mathcal{Y}$

Algorithm 6.3 Interpolation set initialization (convex constrained case).

Theorem 6.17.

Algorithm 6.3 terminates in finite time, and the resulting interpolation set $\mathcal{Y}$ is contained in $B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ and has an invertible interpolation linear system.

Proof.

All initial points are in $B(\bm{x},\min(\Delta,1))$ , and any initial point $\bm{y}_{i}\notin\mathcal{C}$ is replaced by a new point $\bm{y}\in B(\bm{x},\min(\Delta,1))\cap\mathcal{C}$ , and so Algorithm 6.3 terminates after at most $p$ iterations. The interpolation linear system remains invertible after each iteration because it starts invertible, and we apply Lemma 5.8 or 5.9 together with $\ell_{i}(\bm{y})\neq 0$ . ∎
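A minimal Python sketch of Algorithm 6.3 is given below for linear interpolation with a box-shaped feasible set. The box shape, the random sampling of candidate points, and the names `lagrange_values` and `make_feasible` are assumptions for illustration. The algorithm only requires $\ell_{i}(\bm{y})\neq 0$ ; choosing the sampled candidate with the largest $|\ell_{i}(\bm{y})|$ is an extra choice made here for numerical robustness.

```python
import numpy as np

def lagrange_values(Y, x, pts):
    """Values of the linear Lagrange polynomials of the set Y at each row of pts."""
    M = np.hstack([np.ones((len(Y), 1)), Y - x])
    Phi = np.hstack([np.ones((len(pts), 1)), pts - x])
    return np.linalg.solve(M.T, Phi.T).T

def make_feasible(Y, x, radius, lower, upper, n_samples=2000, seed=0):
    """Sketch of Algorithm 6.3 for linear interpolation and a box C."""
    rng = np.random.default_rng(seed)
    Y = np.array(Y, dtype=float)
    # Candidate replacement points in B(x, radius) intersected with C
    cand = x + rng.uniform(-radius, radius, size=(n_samples, len(x)))
    cand = np.clip(cand, lower, upper)
    cand = cand[np.linalg.norm(cand - x, axis=1) <= radius]
    for i in range(len(Y)):
        if np.all(Y[i] >= lower) and np.all(Y[i] <= upper):
            continue                               # y_i is already feasible
        ell_i = lagrange_values(Y, x, cand)[:, i]  # ell_i at each candidate
        j = np.argmax(np.abs(ell_i))
        if abs(ell_i[j]) < 1e-8:
            raise RuntimeError("no sampled candidate has ell_i(y) != 0")
        Y[i] = cand[j]                             # replace y_i with y
    return Y
```

For instance, starting from $\{\bm{0},\bm{e}_{1},\bm{e}_{2}\}$ with $\mathcal{C}=\{\bm{x}:|x_{2}|\leq 0.1\}$ , only $\bm{e}_{2}$ is infeasible, and it is replaced by a sampled point with $|y_{2}|=0.1$ . Each pass of the loop fixes one infeasible point, consistent with the bound of at most $p$ iterations in the proof above.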

6.2 General Constraints

Since MBDFO differs most from derivative-based trust-region methods in its model construction, our focus in this section has been on building accurate models using strictly feasible points, and hence we have restricted ourselves to problems with simple constraints (6.1), rather than problems with general nonlinear constraints (for which we also may only have zeroth order oracles). To conclude this section, we briefly outline one approach for solving general constrained problems of the form (2.6). This form is often augmented with explicit bound constraints, although we do not do this here for simplicity. We assume that the objective $f$ and all constraint functions $c_{i}$ for $i\in\mathcal{E}\cup\mathcal{I}$ only have zeroth order oracles, and so their derivatives must be approximated.

The algorithm we outline will be a sequential quadratic programming (SQP) MBDFO method, based on the COBYQA (Constrained Optimization BY Quadratic Approximation) algorithm available in recent versions of Python’s widely used SciPy optimization library. At each iteration of an SQP method, we construct a local quadratic model for the objective $f$ ,

\displaystyle f(\bm{y})\approx m_{k}(\bm{y}):=c_{k}+\bm{g}_{k}^{T}(\bm{y}-\bm{x}_{k})+\frac{1}{2}(\bm{y}-\bm{x}_{k})^{T}\bm{H}_{k}(\bm{y}-\bm{x}_{k}),

(6.17)

and local linear models for all constraints $c_{i}$ ,

\displaystyle c_{i}(\bm{y})\approx m_{k,i}(\bm{y}):=c_{k,i}+\bm{g}_{k,i}^{T}(\bm{y}-\bm{x}_{k}),\hskip 18.49988pt\forall i\in\mathcal{E}\cup\mathcal{I}.

(6.18)

The main difference in these approximations is that we want to choose the model Hessian in $m_{k}$ to achieve $\bm{H}_{k}\approx\nabla^{2}_{\bm{x}}L(\bm{x}_{k},\bm{\lambda}_{k})$ (recalling the Lagrangian (2.7)) rather than $\bm{H}_{k}\approx\nabla^{2}f(\bm{x}_{k})$ , and so we actually need to approximate Hessians for each $c_{i}$ and have suitable Lagrange multiplier estimates. With these approximations, the trust-region subproblem becomes


$$
\begin{aligned}
\min_{\bm{s}\in\mathbb{R}^{n}}\quad & m_{k}(\bm{x}_{k}+\bm{s}), && \text{(6.19a)}\\
\text{s.t.}\quad & m_{k,i}(\bm{x}_{k}+\bm{s})=0, \qquad \forall i\in\mathcal{E}, && \text{(6.19b)}\\
& m_{k,i}(\bm{x}_{k}+\bm{s})\leq 0, \qquad \forall i\in\mathcal{I}, && \text{(6.19c)}\\
& \|\bm{s}\|\leq\Delta_{k}. && \text{(6.19d)}
\end{aligned}
$$

In the unconstrained case, we decided if a step was good by measuring the decrease in the objective (2.11). Here, we measure the quality of a step $\bm{s}_{k}$ by using a merit function, which combines the objective value and size of any constraint violations into a single scalar. For example, the $\ell_{2}$ merit function used in COBYQA is defined as

\displaystyle\phi(\bm{x},\gamma):=f(\bm{x})+\gamma\Phi(\bm{x}),\hskip 18.49988pt\text{where}\qquad\Phi(\bm{x}):=\sqrt{\sum_{i\in\mathcal{E}}c_{i}(\bm{x})^{2}+\sum_{i\in\mathcal{I}}\max(c_{i}(\bm{x}),0)^{2}},

(6.20)

where the penalty parameter $\gamma>0$ controls the relative weight applied to reductions in $f$ compared to improvements in feasibility (an alternative but common choice of merit function uses the $\ell_{1}$ norm of the constraint violation). Note that $\phi(\bm{x},\gamma)=f(\bm{x})$ if $\bm{x}$ is feasible. Given our models (6.17) and (6.18), we can derive our approximate merit function

\displaystyle\phi_{k}(\bm{x},\gamma):=m_{k}(\bm{x})+\gamma\Phi_{k}(\bm{x}),\hskip 18.49988pt\text{where}\qquad\Phi_{k}(\bm{x}):=\sqrt{\sum_{i\in\mathcal{E}}m_{k,i}(\bm{x})^{2}+\sum_{i\in\mathcal{I}}\max(m_{k,i}(\bm{x}),0)^{2}}.

(6.21)

The choice of accepting/rejecting a step and updating the trust-region radius is similar to the unconstrained case.
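The merit function (6.20) is simple to evaluate once the objective and constraint values are known. A small Python sketch follows, where the function name `l2_merit` and its argument layout are hypothetical; it assumes the convention of (6.19), namely $c_{i}(\bm{x})=0$ for $i\in\mathcal{E}$ and $c_{i}(\bm{x})\leq 0$ for $i\in\mathcal{I}$ . Passing model values $m_{k}$ and $m_{k,i}$ instead gives the approximate merit function (6.21).

```python
import numpy as np

def l2_merit(f_val, c_eq, c_ineq, gamma):
    """Sketch of the l2 merit function (6.20).

    f_val  : objective value f(x)
    c_eq   : values c_i(x) for i in E (want c_i(x) = 0)
    c_ineq : values c_i(x) for i in I (want c_i(x) <= 0)
    gamma  : penalty parameter
    """
    c_eq = np.asarray(c_eq, dtype=float)
    viol_ineq = np.maximum(np.asarray(c_ineq, dtype=float), 0.0)
    Phi = np.sqrt(np.sum(c_eq**2) + np.sum(viol_ineq**2))  # constraint violation
    return f_val + gamma * Phi
```

At a feasible point the violation term vanishes and the merit function equals the objective value, whatever the value of $\gamma$ .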

A prototypical algorithm is given in Algorithm 6.4. We note that increasing the merit penalty parameter $\gamma_{k}$ as $k\to\infty$ encourages $\bm{x}_{k}$ to gradually become feasible. The condition $\gamma_{k}\geq\|\bm{\lambda}_{k}\|$ , where $\bm{\lambda}_{k}$ is a Lagrange multiplier estimate, is to ensure that the merit function is exact; that is, the true solution $\bm{x}^{*}$ is also a minimizer of $\phi(\cdot,\gamma)$ for all $\gamma$ sufficiently large. In the case of the $\ell_{2}$ merit function, ‘sufficiently large’ is $\gamma\geq\|\bm{\lambda}^{*}\|$ [124, Theorem 4.2.1]. Motivated by (2.8), the Lagrange multiplier estimates $\bm{\lambda}_{k}$ are computed by solving the (bound-constrained) linear least-squares problem


$$
\begin{aligned}
\bm{\lambda}_{k}\in\operatorname*{arg\,min}_{\bm{\lambda}\in\mathbb{R}^{|\mathcal{E}|+|\mathcal{I}|}}\quad & \left\|\bm{g}_{k}+\sum_{i\in\mathcal{E}\cup\mathcal{I}}\lambda_{i}\bm{g}_{k,i}\right\|^{2}, && \text{(6.22a)}\\
\text{s.t.}\quad & \lambda_{i}=0, \qquad \forall i\in\{j\in\mathcal{I}:c_{k,j}<0\}, && \text{(6.22b)}\\
& \lambda_{i}\geq 0, \qquad \forall i\in\{j\in\mathcal{I}:c_{k,j}\geq 0\}. && \text{(6.22c)}
\end{aligned}
$$

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ , initial Lagrange multiplier estimates $\bm{\lambda}_{0}\in\mathbb{R}^{|\mathcal{E}|+|\mathcal{I}|}$ and trust-region radius $\Delta_{0}>0$ . Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$ , and acceptance thresholds $0<\eta_{U}\leq\eta_{S}<1$ .

2: for $k=0,1,2,\ldots$

3: Build a quadratic model $m_{k}$ (6.17) for $f$ and linear models $m_{k,i}$ (6.18) for each $c_{i}$ .

4: Approximately solve the subproblems (6.24) and (6.25) to get a step $\bm{s}_{k}=\bm{n}_{k}+\bm{t}_{k}$ .

5: Evaluate $f$ and all constraints $c_{i}$ at $\bm{x}_{k}+\bm{s}_{k}$ and calculate the ratio

$$
\rho_{k}=\frac{\text{actual reduction in merit function}}{\text{predicted reduction in merit function}}=\frac{\phi(\bm{x}_{k},\gamma_{k})-\phi(\bm{x}_{k}+\bm{s}_{k},\gamma_{k})}{\phi_{k}(\bm{x}_{k},\gamma_{k})-\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma_{k})}, \tag{6.23}
$$

for some value $\gamma_{k}\geq\max(\gamma_{k-1},\|\bm{\lambda}_{k}\|)$ , chosen such that $\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma_{k})<\phi_{k}(\bm{x}_{k},\gamma_{k})$ .

6: if $\rho_{k}\geq\eta_{S}$ then

7: (Very successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$ .

8: else if $\eta_{U}\leq\rho_{k}<\eta_{S}$ then

9: (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\Delta_{k}$ .

10: else

11: (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ .

12: end if

13: Estimate the Lagrange multipliers $\bm{\lambda}_{k+1}$ associated with $\bm{x}_{k+1}$ by solving (6.22).

14: end for

Algorithm 6.4 Example MBDFO trust-region SQP method for solving (2.6).

Aside from the management of an interpolation set, which can use the techniques described in previous sections (but the interpolation problem is solved for $f$ and all constraints $c_{i}$ ), the main practical consideration is the calculation of the step (6.19). The largest difficulty is that the linearized constraints may be infeasible. A common solution to this issue is to decompose the computed step into a normal and tangential component, $\bm{s}_{k}=\bm{n}_{k}+\bm{t}_{k}$ . The normal step $\bm{n}_{k}$ aims to reduce constraint violation and the tangential step aims to decrease the objective without worsening the constraint violation. There are several ways to do this, but COBYQA uses the following approach. First, calculate the normal step $\bm{n}_{k}$ by (approximately) solving


$$
\begin{aligned}
\min_{\bm{n}\in\mathbb{R}^{n}}\quad & \Phi_{k}(\bm{x}_{k}+\bm{n})^{2}, && \text{(6.24a)}\\
\text{s.t.}\quad & \|\bm{n}\|\leq\theta\Delta_{k}, && \text{(6.24b)}
\end{aligned}
$$

for some scalar $\theta\in(0,1)$ (e.g. $\theta=0.8$ ), and then calculate the tangential step $\bm{t}_{k}$ by approximately solving


$$
\begin{aligned}
\min_{\bm{t}\in\mathbb{R}^{n}}\quad & m_{k}(\bm{x}_{k}+\bm{n}_{k}+\bm{t}), && \text{(6.25a)}\\
\text{s.t.}\quad & \bm{g}_{k,i}^{T}\bm{t}=0, \qquad \forall i\in\mathcal{E}, && \text{(6.25b)}\\
& \bm{g}_{k,i}^{T}\bm{t}\leq\max(-m_{k,i}(\bm{x}_{k}+\bm{n}_{k}),0), \qquad \forall i\in\mathcal{I}, && \text{(6.25c)}\\
& \|\bm{n}_{k}+\bm{t}\|\leq\Delta_{k}. && \text{(6.25d)}
\end{aligned}
$$

Both of these subproblems (6.24) and (6.25) can be reformulated to be trust-region-like subproblems with linear constraints which can be solved using algorithms similar to the conjugate gradient-based method developed in [122]; see [124, Chapter 6] for details.

We also note that the requirement that $\gamma_{k}$ is chosen to ensure $\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma_{k})<\phi_{k}(\bm{x}_{k},\gamma_{k})$ can be satisfied relatively easily, under mild assumptions on the step components $\bm{n}_{k}$ and $\bm{t}_{k}$ .

Lemma 6.18.

If, on iteration $k$ of Algorithm 6.4, $\bm{s}_{k}=\bm{n}_{k}+\bm{t}_{k}$ where $\bm{n}_{k}$ (6.24) either satisfies $\Phi_{k}(\bm{x}_{k}+\bm{n}_{k})<\Phi_{k}(\bm{x}_{k})$ or $\bm{n}_{k}=\bm{0}$ , and $\bm{t}_{k}$ (6.25) satisfies $m_{k}(\bm{x}_{k}+\bm{t}_{k})<m_{k}(\bm{x}_{k})$ , then $\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma)<\phi_{k}(\bm{x}_{k},\gamma)$ for all $\gamma>0$ sufficiently large.

Proof.

The constraints (6.25b) and (6.25c) ensure that $\Phi_{k}(\bm{x}_{k}+\bm{s}_{k})=\Phi_{k}(\bm{x}_{k}+\bm{n}_{k}+\bm{t}_{k})\leq\Phi_{k}(\bm{x}_{k}+\bm{n}_{k})$ .

First, if $\Phi_{k}(\bm{x}_{k}+\bm{n}_{k})<\Phi_{k}(\bm{x}_{k})$ , then we have $\Phi_{k}(\bm{x}_{k}+\bm{s}_{k})<\Phi_{k}(\bm{x}_{k})$ , and the result follows by noting that $\phi_{k}(\bm{x}_{k},\gamma)-\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma)=m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})+\gamma(\Phi_{k}(\bm{x}_{k})-\Phi_{k}(\bm{x}_{k}+\bm{s}_{k}))$ is linear and increasing in $\gamma>0$ . Instead, if $\bm{n}_{k}=\bm{0}$ then $\bm{s}_{k}=\bm{t}_{k}$ so $\Phi_{k}(\bm{x}_{k}+\bm{s}_{k})\leq\Phi_{k}(\bm{x}_{k})$ and $m_{k}(\bm{x}_{k}+\bm{s}_{k})=m_{k}(\bm{x}_{k}+\bm{t}_{k})<m_{k}(\bm{x}_{k})$ , so $\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma)<\phi_{k}(\bm{x}_{k},\gamma)$ for all $\gamma>0$ . ∎
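In the first case of the proof, the phrase ‘sufficiently large’ can be made explicit with a short calculation. The symbols $a_{k}$ and $b_{k}$ below are shorthand introduced only for this illustration and do not appear elsewhere in the text.

```latex
\begin{aligned}
\phi_{k}(\bm{x}_{k},\gamma)-\phi_{k}(\bm{x}_{k}+\bm{s}_{k},\gamma)
  &= \underbrace{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})}_{=:\,a_{k}}
   + \gamma\,\underbrace{\bigl(\Phi_{k}(\bm{x}_{k})-\Phi_{k}(\bm{x}_{k}+\bm{s}_{k})\bigr)}_{=:\,b_{k}\,>\,0}, \\
a_{k}+\gamma b_{k}>0
  \quad&\Longleftrightarrow\quad
  \gamma>-\frac{a_{k}}{b_{k}}.
\end{aligned}
```

So any $\gamma>\max(0,-a_{k}/b_{k})$ gives a decrease in the approximate merit function. For example, if the step increases the model objective by 2 (so $a_{k}=-2$ ) while reducing the linearized constraint violation by $b_{k}=0.5$ , then any $\gamma>4$ suffices.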

Algorithm 6.4 does not have any convergence guarantees as written, and was developed purely as a practical method. Other MBDFO SQP methods such as [145, 82] also do not have convergence guarantees. However, stochastic SQP methods (extending the ideas from Section 7.2 to the constrained setting) with convergence guarantees [66, 65, 64] can be used in the deterministic setting, and there does not appear to be any fundamental issue preventing the convergence theory for derivative-based SQP trust-region methods [47, Chapter 15] being adapted to the MBDFO setting.

6.2.1 Evaluation Failures and Hidden Constraints

In some situations, we may also have to concern ourselves with objective evaluations failing (i.e. our procedure for evaluating $f(\bm{x})$ fails or crashes unexpectedly). For example, the objective evaluation for a helicopter rotor blade design problem in [25] failed to compute over 60% of the time. More generally, large-scale computations (such as many objective evaluations in DFO contexts) can be exposed to hardware failure; for one high-performance computing system, hardware failures occurred on average once every 7.5 days [137].

This can be thought of as an example of hidden constraints [42, 100], where our problem takes the form

\displaystyle\min_{\bm{x}\in\mathcal{D}}\>f(\bm{x}),

(6.26)

for some feasible set $\mathcal{D}\subseteq\mathbb{R}^{n}$ which is unknown to the algorithm (and where $\bm{x}\in\mathcal{D}$ represents ‘evaluating $f(\bm{x})$ succeeds’). This can be rigorously handled with unconstrained algorithms if we let $f:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ and define $f(\bm{x}):=+\infty$ whenever $\bm{x}\notin\mathcal{D}$ . Here, we must prove convergence to Clarke stationary points, although to the author’s knowledge this has not been applied to MBDFO, only direct search [42]. In practice, this can instead be handled by setting $f(\bm{x})$ to some large (finite) value if $\bm{x}\notin\mathcal{D}$ [123].
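The practical workaround of returning a large finite value on failure can be implemented as a thin wrapper around the objective. In the Python sketch below, the name `with_failure_penalty`, the default penalty value, and the decision to treat NaN or infinite outputs as failures are all assumptions for illustration, not details taken from [123].

```python
import math

def with_failure_penalty(f, penalty=1e20):
    """Wrap an objective so that failed evaluations return a large finite value."""
    def wrapped(x):
        try:
            value = float(f(x))
        except Exception:
            return penalty            # evaluation crashed: treat x as outside D
        # Also treat NaN or infinite results as failed evaluations (an assumption)
        return value if math.isfinite(value) else penalty
    return wrapped
```

For instance, wrapping `math.sqrt` gives a function that returns the usual value for nonnegative inputs and the penalty for negative inputs, where the unwrapped call would raise an error.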

Notes and References

Section 6.1 is based primarily on [86, 130]. These works draw on the derivative-based trust-region theory outlined in [47, Chapter 12], the complexity analysis for convex-constrained cubic regularization [34] and convergence theory for convex-constrained MBDFO using the original notion (Definition 3.1) of fully linear models [44]. A good resource for the theory of convex sets and convex optimization is [18].

The description of COBYQA (Algorithm 6.4) for general constraints is taken from [124]. For more detail about derivative-based SQP methods, including convergence theory, alternative approaches for calculating $\bm{s}_{k}$ and alternative globalization mechanisms such as filters (instead of a merit function), see [47, Chapter 15]. The specific step calculation given by (6.24) and (6.25) is a variant of the Byrd–Omojokun method [47, Chapters 15.4.2 & 15.4.4], originally from [113] (a thesis by Omojokun, supervised by Byrd). Many other works describing MBDFO methods for handling general constraints (2.6) are discussed in the survey [96, Section 7]. However, we note in particular [145, 82], which propose SQP methods very similar to COBYQA. See [111] for an overview of theory and algorithms for general constrained optimization, including quadratic programming.

MBDFO for nonlinearly constrained problems is not as well-developed as for unconstrained problems. There are few implementations outside of COBYQA, and to the author’s knowledge none with both good practical performance and theoretical guarantees. By contrast, constrained optimization within the Mesh Adaptive Direct Search framework is well-studied in the sense of proving convergence to Clarke stationary points (e.g. [4, 5, 6, 3, 12]).

7 Optimization for Noisy Problems

A key use case of MBDFO methods is optimization for problems with some degree of noise present. In general, this refers to situations where an exact zeroth order oracle is not available (i.e. we can only evaluate $f$ , but not exactly). This comprises situations, for example, where the calculation of the objective $f$ requires performing a Monte Carlo simulation, or involves a physical/real-world experiment subject to some inherent noise or uncertainty.

In this section we discuss the cases of both deterministic and stochastic noise. By ‘deterministic noise’, we mean the following: if we evaluate the noisy function repeatedly at the same point, we get the same answer (e.g. roundoff errors, finite termination of an iterative process). Stochastic noise refers to the case where repeated evaluations at the same point can produce different values, and so the noisy value of $f(\bm{x})$ can be treated as a random variable (e.g. Monte Carlo simulation, experimental errors).

In both cases, we outline realistic model accuracy assumptions (in place of ‘fully linear at every iteration’), and analyze the resulting variants of Algorithm 3.1. In practice, a simple heuristic (with theoretical justification for deterministic noise, see Remark C.3) for handling noisy problems is to use a generic algorithm such as Algorithm 3.1, but set $\gamma_{\textnormal{dec}}$ very close to 1, for example $\gamma_{\textnormal{dec}}=0.98$ , rather than more typical values such as $\gamma_{\textnormal{dec}}=0.5$ [35].

7.1 Deterministic Noise

In the case of deterministic noise, we essentially have a zeroth order oracle that reliably returns an incorrect value. We will assume that the noise is uniformly bounded.

Assumption 7.1.

Given objective $f:\mathbb{R}^{n}\to\mathbb{R}$ (2.1), we only have access to the noisy zeroth order oracle $\tilde{f}:\mathbb{R}^{n}\to\mathbb{R}$ , satisfying $|\tilde{f}(\bm{x})-f(\bm{x})|\leq\epsilon_{f}$ for all $\bm{x}\in\mathbb{R}^{n}$ , for some $\epsilon_{f}>0$ .

We illustrate the difficulty that the presence of noise introduces to optimization algorithms by first considering the case of finite differencing. The below result says that, without due care, finite differencing a noisy zeroth order oracle can give very inaccurate derivative estimates.

Lemma 7.2.

Suppose $f:\mathbb{R}^{n}\to\mathbb{R}$ is twice continuously differentiable and $\|\nabla^{2}f(\bm{y})\|\leq M$ for all $\bm{y}\in B(\bm{x}_{k},h)$ , for some $h>0$ . If we have a noisy oracle $\tilde{f}$ satisfying Assumption 7.1 and we compute (cf. (3.2))

\displaystyle[\bm{g}_{k}]_{i}:=\frac{\tilde{f}(\bm{x}_{k}+h\bm{e}_{i})-\tilde{f}(\bm{x}_{k})}{h},\hskip 18.49988pt\forall i=1,\ldots,n,

(7.1)

then $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|_{\infty}\leq\frac{Mh}{2}+\frac{2\epsilon_{f}}{h}$ .

Proof.

Let $\hat{\bm{g}}_{k}$ be the noise-free finite difference estimate from (3.2). From standard finite differencing theory (e.g. [111, Chapter 8.1]) we have $\|\hat{\bm{g}}_{k}-\nabla f(\bm{x}_{k})\|_{\infty}\leq\frac{Mh}{2}$ . We also have

\displaystyle|[\bm{g}_{k}-\hat{\bm{g}}_{k}]_{i}|\leq\frac{|\tilde{f}(\bm{x}_{k}+h\bm{e}_{i})-f(\bm{x}_{k}+h\bm{e}_{i})|+|\tilde{f}(\bm{x}_{k})-f(\bm{x}_{k})|}{h}\leq\frac{2\epsilon_{f}}{h},\hskip 18.49988pt\forall i=1,\ldots,n,

(7.2)

and the result follows from the triangle inequality. ∎

This demonstrates that, in the noisy case, the error in finite differencing can increase significantly as $h\to 0^{+}$ . In fact, to minimize the error bound in Lemma 7.2 we should take $h=2\sqrt{\epsilon_{f}/M}$ to get gradient error $\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|_{\infty}\leq 2\sqrt{M\epsilon_{f}}$ . (Footnote 29: For this reason, when estimating the gradient of an objective using (7.1), both Python’s scipy.optimize.minimize and MATLAB’s fminunc use the default value $h=\sqrt{\epsilon_{\text{machine}}}$ , where $\epsilon_{\text{machine}}$ is the machine epsilon.) The advantage of MBDFO is that interpolation models are constructed from well-spaced points, typically of distance $\mathcal{O}(\Delta_{k})$ from each other, and so we avoid the $h\to 0^{+}$ case until $\Delta_{k}$ is very small.
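The trade-off in Lemma 7.2 can be observed numerically. The sketch below is an illustration under stated assumptions: the test function $f(\bm{x})=\frac{1}{2}\|\bm{x}\|^{2}$ (so $M=1$), the noise level `eps_f = 1e-6`, and the particular deterministic noise term are all made up for the example.

```python
import numpy as np

def noisy_forward_difference(f_noisy, x, h):
    """Forward-difference gradient estimate (7.1) from a noisy oracle."""
    x = np.asarray(x, dtype=float)
    f0 = f_noisy(x)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f_noisy(x + h * e) - f0) / h
    return g

eps_f = 1e-6  # assumed noise bound from Assumption 7.1
M = 1.0       # Hessian norm bound for f(x) = 0.5 * ||x||^2

def f_true(x):
    return 0.5 * float(np.dot(x, x))

def f_noisy(x):
    # Deterministic noise: repeatable at each x, with |noise| <= eps_f.
    return f_true(x) + eps_f * np.sin(1e6 * np.sum(x))

x = np.array([1.0, -2.0, 0.5])
h_opt = 2.0 * np.sqrt(eps_f / M)  # minimizer of the bound in Lemma 7.2
errors = {}
for h in [1e-8, h_opt, 1e-1]:
    g = noisy_forward_difference(f_noisy, x, h)
    errors[h] = np.max(np.abs(g - x))  # the gradient of f_true is x
    bound = M * h / 2 + 2 * eps_f / h
    print(f"h={h:.1e}  error={errors[h]:.2e}  bound={bound:.2e}")
```

The printed bound is very large for the smallest step, dominated by the $2\epsilon_{f}/h$ term, and is dominated by the $Mh/2$ term for the largest step.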

For example, suppose we construct our model (3.1) using linear interpolation to points $\bm{y}_{1},\ldots,\bm{y}_{p}$ for $p=n+1$ (4.17). With a noisy zeroth order oracle, we now solve

\displaystyle\bm{M}\begin{bmatrix}c\\ \bm{g}\end{bmatrix}=\begin{bmatrix}\tilde{f}(\bm{y}_{1})\\ \vdots\\ \tilde{f}(\bm{y}_{p})\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}+\begin{bmatrix}\tilde{f}(\bm{y}_{1})-f(\bm{y}_{1})\\ \vdots\\ \tilde{f}(\bm{y}_{p})-f(\bm{y}_{p})\end{bmatrix}.

(7.3)

This gives us the following version of Theorem 4.2.

Theorem 7.3.

Suppose $f$ satisfies Assumption 2.4 (a) and $\tilde{f}$ satisfies Assumption 7.1 and we construct a linear model (4.16) for $f$ by solving (7.3), where we assume $\hat{\bm{M}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=n+1$ is the number of interpolation points), then the model satisfies


\displaystyle|m(\bm{y})-f(\bm{y})|\leq\kappa_{\textnormal{mf}}\Delta^{2}+\tilde{\kappa}_{\textnormal{mf}}\>\epsilon_{f},

(7.4a)

\displaystyle\|\nabla m(\bm{y})-\nabla f(\bm{y})\|\leq\kappa_{\textnormal{mg}}\Delta+\tilde{\kappa}_{\textnormal{mg}}\frac{\epsilon_{f}}{\Delta},

(7.4b)

for all $\bm{y}\in B(\bm{x},\Delta)$ , where $\kappa_{\textnormal{mf}}$ and $\kappa_{\textnormal{mg}}$ are the same as in Theorem 4.2 (i.e. (4.19)), and

\displaystyle\tilde{\kappa}_{\textnormal{mf}}=(1+\sqrt{n})\|\hat{\bm{M}}^{-1}\|_{\infty},\hskip 18.49988pt\text{and}\hskip 18.49988pt\tilde{\kappa}_{\textnormal{mg}}=2\tilde{\kappa}_{\textnormal{mf}}.

(7.5)

Proof.

Noting $m(\bm{y}_{i})=\tilde{f}(\bm{y}_{i})$ , the noisy version of (4.20) is

\displaystyle|m(\bm{y}_{i})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{i}-\bm{x})|\leq\frac{L_{1}}{2}\beta^{2}\Delta^{2}+\epsilon_{f},

(7.6)

and so the analog of (4.26) is

\displaystyle|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{-1}\|_{\infty}\Delta^{2}+(1+\sqrt{n})\|\hat{\bm{M}}^{-1}\|_{\infty}\epsilon_{f},

(7.7)

for all $\bm{y}\in B(\bm{x},\Delta)$ . The result then follows from Lemma 4.1(a) with $\kappa_{H}=0$ . ∎
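The $\epsilon_{f}/\Delta$ term in (7.4b) can be checked on a small example. The sketch below makes several assumptions for illustration: the interpolation matrix has rows $[1,(\bm{y}_{i}-\bm{x})^{T}]$ (standing in for (4.17)), the interpolation set is $\{\bm{x},\bm{x}+\Delta\bm{e}_{i}\}$, the test function is $f(\bm{y})=\sum_{i}\cos(y_{i})$ (so $L_{1}=1$), and the noise is drawn once per point and bounded by `eps_f`.

```python
import numpy as np

def linear_interpolation_model(points, values, x):
    """Solve the (n+1) x (n+1) system for m(y) = c + g^T (y - x)."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(x, dtype=float)
    # Row i is [1, (y_i - x)^T]; this layout is an assumption.
    M = np.hstack([np.ones((points.shape[0], 1)), points - x])
    coeffs = np.linalg.solve(M, np.asarray(values, dtype=float))
    return coeffs[0], coeffs[1:]

def f(y):
    return float(np.sum(np.cos(y)))

rng = np.random.default_rng(0)
n, eps_f = 3, 1e-6
x = np.array([0.3, -0.2, 0.5])
grad_error = {}
for delta in [1e-1, 1e-3, 1e-5]:
    # Interpolation set: x and x + delta * e_i for i = 1, ..., n.
    points = np.vstack([x, x + delta * np.eye(n)])
    noise = eps_f * rng.uniform(-1.0, 1.0, size=n + 1)
    values = np.array([f(y) for y in points]) + noise
    c, g = linear_interpolation_model(points, values, x)
    grad_error[delta] = np.linalg.norm(g + np.sin(x))  # true gradient is -sin(x)
    print(f"Delta={delta:.0e}  gradient error={grad_error[delta]:.2e}")
```

For this particular interpolation set the model gradient coincides with the forward difference (7.1) with $h=\Delta$, so each component error is at most $\Delta/2+2\epsilon_{f}/\Delta$.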

A similar result holds for minimum Frobenius norm quadratic models.

Theorem 7.4.

Suppose $f$ satisfies Assumption 2.4 (a) and $\tilde{f}$ satisfies Assumption 7.1, and we construct a quadratic model $m$ (4.29) for $f$ by solving (4.79) with function estimates $\tilde{f}(\bm{y}_{i})$ , where we assume $\hat{\bm{F}}$ is invertible. If $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ , then the model satisfies (7.4) for all $\bm{y}\in B(\bm{x},\Delta)$ with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}+\kappa_{H}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{\dagger}\|_{\infty}+\frac{L_{1}+\kappa_{H}}{2},\hskip 18.49988pt\text{and}\hskip 18.49988pt\tilde{\kappa}_{\textnormal{mf}}=(1+\sqrt{n})\|\hat{\bm{M}}^{\dagger}\|_{\infty},

(7.8)

with $\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}+2\kappa_{H}$ and $\tilde{\kappa}_{\textnormal{mg}}=2\tilde{\kappa}_{\textnormal{mf}}$ , and where

\displaystyle\|\bm{H}\|\leq\kappa_{H}:=\frac{L_{1}}{2}p\beta^{4}\|\hat{\bm{F}}^{-1}\|_{\infty}+\frac{\epsilon_{f}}{\Delta^{2}}p\beta^{2}\|\hat{\bm{F}}^{-1}\|_{\infty}.

(7.9)

Proof.

Noting $m(\bm{y}_{i})=\tilde{f}(\bm{y}_{i})$ , the noisy equivalent of (4.88) in the proof of Theorem 4.14 is

\displaystyle|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{L_{1}+\|\bm{H}\|}{2}\beta^{2}(1+\sqrt{n})\|\hat{\bm{M}}^{\dagger}\|_{\infty}\Delta^{2}+\epsilon_{f}(1+\sqrt{n})\|\hat{\bm{M}}^{\dagger}\|_{\infty}.

(7.10)

The argument from Theorem 4.14 then gives the values of the constants in (7.4). To get the bound on $\|\bm{H}\|$ , following the proof of Lemma 4.13 we get

\displaystyle|\hat{\lambda}_{i}|\leq\left(\frac{L_{1}}{2}\beta^{2}\Delta^{2}+\epsilon_{f}\right)\|\hat{\bm{F}}^{-1}\|_{\infty},

(7.11)

from which the bound follows. ∎

Theorems 7.3 and 7.4 motivate the following assumption on our interpolation model, replacing Assumption 3.3.

Assumption 7.5.

At each iteration $k$ of Algorithm 7.1, the model $m_{k}$ (3.1) satisfies:

(a) There exist constants $\kappa_{\textnormal{mf}},\tilde{\kappa}_{\textnormal{mf}},\kappa_{\textnormal{mg}},\tilde{\kappa}_{\textnormal{mg}}>0$ (independent of $k$ ) such that $m_{k}$ satisfies (7.4) for all $\bm{y}\in B(\bm{x}_{k},\Delta_{k})$ ;

(b) $\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some $\kappa_{H}\geq 1$ (independent of $k$ ).

We again denote $\kappa_{\textnormal{m}}:=\max(\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}})$ and $\tilde{\kappa}_{\textnormal{m}}:=\max(\tilde{\kappa}_{\textnormal{mf}},\tilde{\kappa}_{\textnormal{mg}})$ for notational convenience.

We now consider the adaptation of Algorithm 3.1 to the deterministic noise setting. Apart from replacing the model accuracy requirement with Assumption 7.5, the only change to the stated algorithm is that the numerator in the ratio test (7.12) is now the estimated actual decrease, based on querying $\widetilde{f}$ (satisfying Assumption 7.1) instead of $f$ . Our new algorithm is given in Algorithm 7.1.

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$. Algorithm parameters: scaling factors $0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}$, acceptance threshold $0<\eta_{S}<1$, and criticality threshold $\mu_{c}>0$.

2: for $k=0,1,2,\ldots$

3: Build a local quadratic model $m_{k}$ (3.1) satisfying Assumption 7.5.

4: Solve the trust-region subproblem (2.10) to get a step $\bm{s}_{k}$ satisfying Assumption 2.9.

5: Calculate the ratio

\displaystyle\rho_{k}=\frac{\text{estimated actual decrease}}{\text{predicted decrease}}:=\frac{\widetilde{f}(\bm{x}_{k})-\widetilde{f}(\bm{x}_{k}+\bm{s}_{k})}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})}.

(7.12)

6: if $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ then

7: (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}$.

8: else

9: (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$.

10: end if

11: end for

Algorithm 7.1 MBDFO trust-region method for problems with deterministic noise.
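The structure of Algorithm 7.1 can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: it assumes a linear interpolation model on the points $\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{i}\}$ (so $\bm{H}_{k}=0$), for which the trust-region subproblem has the closed-form solution $\bm{s}_{k}=-\Delta_{k}\bm{g}_{k}/\|\bm{g}_{k}\|$, and it runs for a fixed number of iterations. The function name, default parameter values and toy noisy objective are all hypothetical.

```python
import numpy as np

def noisy_trust_region(f_noisy, x0, delta0=1.0, gamma_dec=0.5, gamma_inc=2.0,
                       eta_s=0.1, mu_c=0.1, max_iter=200):
    """Sketch of Algorithm 7.1 using linear interpolation models."""
    x = np.asarray(x0, dtype=float)
    delta = float(delta0)
    n = x.size
    for _ in range(max_iter):
        # Line 3: linear interpolation on {x, x + delta * e_i}, so H_k = 0.
        fx = f_noisy(x)
        g = np.array([(f_noisy(x + delta * e) - fx) / delta for e in np.eye(n)])
        g_norm = np.linalg.norm(g)
        success = False
        if g_norm > 0.0:
            # Line 4: exact minimizer of the linear model over the ball.
            s = -(delta / g_norm) * g
            predicted = delta * g_norm  # m_k(x_k) - m_k(x_k + s_k)
            # Line 5: the ratio (7.12) only uses noisy function values.
            rho = (fx - f_noisy(x + s)) / predicted
            success = (rho >= eta_s) and (g_norm >= mu_c * delta)
        if success:
            x, delta = x + s, gamma_inc * delta  # line 7
        else:
            delta = gamma_dec * delta            # line 9
    return x, delta

eps_f = 1e-6  # assumed deterministic noise level

def f_true(x):
    return 0.5 * float(np.dot(x, x))

def f_noisy(x):
    return f_true(x) + eps_f * float(np.sin(1e6 * np.sum(x)))

x_final, delta_final = noisy_trust_region(f_noisy, x0=[2.0, -1.0])
print(x_final, delta_final, f_true(x_final))
```

Because a step is accepted only when the noisy values decrease, each accepted step can increase the true objective by at most $2\epsilon_{f}$; this is the mechanism behind the accuracy floor in the complexity analysis.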

The worst-case complexity analysis of Algorithm 7.1 is broadly similar to the noise-free case, but more complicated. The iteration complexity bound matches the noise-free case (Corollary 3.8), but only when the desired first-order accuracy $\epsilon$ is not too small: it must be at least of order $\sqrt{\epsilon_{f}}$ (see (7.13)).

Theorem 7.6.

Suppose Assumptions 2.4, 2.9, 7.1 and 7.5 hold. For any

\displaystyle\epsilon>\epsilon_{\min}=\mathcal{O}\left(\sqrt{(\kappa_{\textnormal{m}}+\kappa_{H})\tilde{\kappa}_{\textnormal{m}}\epsilon_{f}}\right),

(7.13)

provided $\Delta_{0}$ is sufficiently large, Algorithm 7.1 achieves $\|\nabla f(\bm{x}_{k})\|\leq\epsilon$ for the first time after at most $\mathcal{O}(\kappa_{H}(\kappa_{\textnormal{m}}+\kappa_{H})^{2}\epsilon^{-2})$ iterations.

Proof.

See Appendix C. ∎

7.2 Stochastic Noise

If the noise in our objective is stochastic, then, unlike the deterministic noise case, we can decrease the noise level by averaging many noisy values of $f(\bm{x})$ to achieve, in principle, arbitrarily good estimates of the true objective. This means that, with care, we can achieve any desired level of optimality, $\|\nabla f(\bm{x}_{k})\|<\epsilon$ , rather than being limited to optimality levels $\epsilon=\mathcal{O}(\sqrt{\epsilon_{f}})$ as in the deterministic case.

Formally, we will assume that our stochastic noise is unbiased, and so optimizing a function which has stochastic noise in its evaluations can be viewed as the stochastic optimization problem

\displaystyle\min_{\bm{x}\in\mathbb{R}^{n}}\>f(\bm{x}):=\mathbb{E}_{\omega}[f(\bm{x},\omega)],

(7.14)

where $\omega\in\Omega$ is a source of random noise. Here, we assume that we only have zeroth order stochastic oracles $\bm{x}\mapsto f(\bm{x},\omega)$ , for realizations of the noise $\omega$ . To make the problem tractable, we will assume that the randomness has bounded variance.

Assumption 7.7.

There exists $\sigma>0$ such that $\operatorname{Var}_{\omega}(f(\bm{x},\omega))\leq\sigma^{2}$ for all $\bm{x}\in\mathbb{R}^{n}$ .

The most important consequence of Assumption 7.7 is that we can easily construct estimates of $f(\bm{x})$ in (7.14) which are arbitrarily good, with high probability.

Proposition 7.8 (Chebyshev’s inequality, e.g. Chapter 2.1 of [27]).

Suppose Assumption 7.7 holds. For any $\bm{x}\in\mathbb{R}^{n}$ , suppose we generate i.i.d. realizations $\omega_{1},\ldots,\omega_{N}$ from $\Omega$ and calculate the sample average, $\overline{f}_{N}(\bm{x},\omega)=\frac{1}{N}\sum_{i=1}^{N}f(\bm{x},\omega_{i})$ . Then $\mathbb{P}\left[|\overline{f}_{N}(\bm{x},\omega)-f(\bm{x})|>t\right]\leq\frac{\sigma^{2}}{Nt^{2}}$ for any $t>0$ .

Roughly speaking, the Central Limit Theorem says that the typical error between the sample mean from $N$ samples and the true mean is of size $\mathcal{O}(\frac{\sigma}{\sqrt{N}})$ as $N\to\infty$ . Chebyshev’s inequality gives a ‘finite $N$ ’ version of this idea. Specifically, motivated by the fully linear assumption, the below result confirms that to get a sample error of the desirable size $\mathcal{O}(\Delta^{2})$ requires averaging $N=\mathcal{O}(\Delta^{-4})$ samples.

Corollary 7.9.

Suppose Assumption 7.7 holds. For any $\bm{x}\in\mathbb{R}^{n}$ , and $\epsilon_{f},\Delta>0$ and $\alpha_{f}\in(0,1)$ , if $N\geq\frac{\sigma^{2}}{\epsilon_{f}^{2}(1-\alpha_{f})\Delta^{4}}$ then $\mathbb{P}\left[|\overline{f}_{N}(\bm{x},\omega)-f(\bm{x})|\leq\epsilon_{f}\Delta^{2}\right]\geq\alpha_{f}$ .
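The sample size in Corollary 7.9 is easy to compute and to check empirically. In the sketch below, the helper name `samples_for_accuracy`, the parameter values, and the choice of Gaussian noise around $f(\bm{x})=0$ are assumptions made for illustration.

```python
import math
import numpy as np

def samples_for_accuracy(sigma, eps_f, delta, alpha_f):
    """Smallest integer N with N >= sigma^2 / (eps_f^2 (1 - alpha_f) delta^4)."""
    return math.ceil(sigma**2 / (eps_f**2 * (1.0 - alpha_f) * delta**4))

sigma, eps_f, alpha_f = 1.0, 1.0, 0.75
for delta in [1.0, 0.5, 0.25]:
    # Halving delta multiplies the required N by 16.
    print(delta, samples_for_accuracy(sigma, eps_f, delta, alpha_f))

# Empirical check with Gaussian noise (an assumption) around f(x) = 0.
rng = np.random.default_rng(0)
delta = 0.5
N = samples_for_accuracy(sigma, eps_f, delta, alpha_f)
averages = rng.normal(0.0, sigma, size=(2000, N)).mean(axis=1)
empirical = np.mean(np.abs(averages) <= eps_f * delta**2)
print(f"N={N}, empirical success rate={empirical:.3f}, target={alpha_f}")
```

Chebyshev’s inequality makes no distributional assumption beyond bounded variance, so for well-behaved noise the empirical success rate is usually well above the guaranteed level $\alpha_{f}$.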

Another consequence of Chebyshev’s inequality is that, again by choosing the number of samples sufficiently large, we can guarantee that an interpolation model is fully linear with high probability.

Theorem 7.10.

Suppose $f$ satisfies Assumption 2.4 (a), Assumption 7.7 holds, and we construct a linear model (4.16) for $f$ by solving (7.3) with noisy estimates $\tilde{f}(\bm{y}_{i})=\overline{f}_{N}(\bm{y}_{i},\omega)$ (i.i.d. for each $i$ ), where we assume $\hat{\bm{M}}$ is invertible. Assume also that $\|\bm{y}_{i}-\bm{x}\|\leq\beta\Delta$ for some $\beta>0$ and all $i=1,\ldots,p$ (where $p=n+1$ is the number of interpolation points). If $N\geq\frac{\sigma^{2}}{\epsilon_{f}^{2}(1-\alpha_{m}^{1/p})\Delta^{4}}$ for some $\epsilon_{f}>0$ and $\alpha_{m}\in(0,1)$ , then with probability at least $\alpha_{m}$ the model is fully linear with constants

\displaystyle\kappa_{\textnormal{mf}}=\frac{L_{1}}{2}(1+\sqrt{n})\beta^{2}\|\hat{\bm{M}}^{-1}\|_{\infty}+\frac{L_{1}}{2}+(1+\sqrt{n})\|\hat{\bm{M}}^{-1}\|_{\infty}\epsilon_{f},\qquad\text{and}\qquad\kappa_{\textnormal{mg}}=2\kappa_{\textnormal{mf}}.

(7.15)

Proof.

By assumption on $N$ , Corollary 7.9 gives

\displaystyle\mathbb{P}\left[|\overline{f}_{N}(\bm{y}_{i},\omega)-f(\bm{y}_{i})|\leq\epsilon_{f}\Delta^{2}\right]\geq\alpha_{m}^{1/p},

(7.16)

for each $i=1,\ldots,p$ , and since these events are all independent we have

\displaystyle\mathbb{P}\left[\max_{i=1,\ldots,p}|\overline{f}_{N}(\bm{y}_{i},\omega)-f(\bm{y}_{i})|\leq\epsilon_{f}\Delta^{2}\right]\geq\alpha_{m}.

(7.17)

That is, with probability at least $\alpha_{m}$ we have $|\overline{f}_{N}(\bm{y}_{i},\omega)-f(\bm{y}_{i})|\leq\epsilon_{f}\Delta^{2}$ for all $i=1,\ldots,p$ . The result then follows from Theorem 7.3. ∎
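To see how the sample size in Theorem 7.10 depends on the number of interpolation points, the following short calculation (an addition for illustration, not part of the original argument) uses Bernoulli’s inequality $(1-t)^{r}\leq 1-rt$ for $t\in[0,1]$ and $r\in[0,1]$, applied with $t=1-\alpha_{m}$ and $r=1/p$:

```latex
\alpha_{m}^{1/p}
  = \bigl(1-(1-\alpha_{m})\bigr)^{1/p}
  \leq 1-\frac{1-\alpha_{m}}{p}
\quad\Longrightarrow\quad
\frac{\sigma^{2}}{\epsilon_{f}^{2}\,(1-\alpha_{m}^{1/p})\,\Delta^{4}}
  \leq \frac{p\,\sigma^{2}}{\epsilon_{f}^{2}\,(1-\alpha_{m})\,\Delta^{4}}.
```

So taking $N\geq\frac{p\,\sigma^{2}}{\epsilon_{f}^{2}(1-\alpha_{m})\Delta^{4}}$ is sufficient, and the number of samples needed per interpolation point grows at most linearly in $p=n+1$.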

An analogous result can be derived for minimum Frobenius norm models, by the same probabilistic argument as above and the use of Theorem 7.4, where we again need $N\geq\frac{\sigma^{2}}{\epsilon_{f}^{2}(1-\alpha_{m}^{1/p})\Delta^{4}}$ samples for each objective estimate.

The above results motivate Algorithm 7.2, a stochastic version of Algorithm 3.1. Algorithm 7.2 has very similar algorithmic structure to Algorithm 3.1, but with some simplifications to aid the analysis. Specifically,

• We always increase $\Delta_{k}$ on successful iterations, i.e. setting $\eta_{U}=\eta_{S}$ in Algorithm 3.1;

• We maintain the same ratio for increasing and decreasing $\Delta_{k}$ , i.e. setting $\gamma_{\textnormal{dec}}=\gamma_{\textnormal{inc}}^{-1}$ in Algorithm 3.1;

• We impose a maximum trust-region radius, $\Delta_{k}\leq\Delta_{\max}$ for all $k$ , similar to Algorithm 3.2. We assume $\Delta_{\max}=\gamma_{\textnormal{inc}}^{j_{\max}}\Delta_{0}$ for some $j_{\max}\in\mathbb{N}$ again for simplicity.

More importantly, in Algorithm 7.2 we now assume that the model $m_{k}$ is fully linear with high probability, and the approximate evaluations of $f(\bm{x}_{k})$ and $f(\bm{x}_{k}+\bm{s}_{k})$ in the calculation of $\rho_{k}$ are accurate with high probability. In particular, we care about the likelihood of the events


\displaystyle I_{k}:=\mathbbm{1}\{\text{$m_{k}$ is fully linear in $B(\bm{x}_{k},\Delta_{k})$ with constants $\kappa_{\textnormal{mf}},\kappa_{\textnormal{mg}}>0$}\},\hskip 18.49988pt\text{and}

(7.18a)

\displaystyle J_{k}:=\mathbbm{1}\{\text{$|f_{k}^{0}-f(\bm{x}_{k})|\leq\epsilon_{f}\Delta_{k}^{2}$ and $|f_{k}^{s}-f(\bm{x}_{k}+\bm{s}_{k})|\leq\epsilon_{f}\Delta_{k}^{2}$ for some $\epsilon_{f}>0$}\},

(7.18b)

where $\mathbbm{1}$ is the indicator function of an event (i.e. taking the value 1 if the event occurs and 0 otherwise). Formally, all the randomness in Algorithm 7.2 comes from the models $m_{k}$ and the estimates $f_{k}^{0},f_{k}^{s}$ . So, we define the filtration $\mathcal{F}_{k-1}$ to be the $\sigma$ -algebra generated by $\{m_{0},f_{0}^{0},f_{0}^{s},\ldots,m_{k-1},f_{k-1}^{0},f_{k-1}^{s}\}$ , representing all randomness up to the start of iteration $k$ . Similarly, we define $\mathcal{F}_{k-1/2}$ to be generated by $\{m_{0},f_{0}^{0},f_{0}^{s},\ldots,m_{k-1},f_{k-1}^{0},f_{k-1}^{s},m_{k}\}$ , representing all randomness up to the calculation of $\bm{s}_{k}$ (i.e. from $\mathcal{F}_{k-1}$ and $m_{k}$ ).

Given this filtration structure, we make the following assumptions about the model.

Assumption 7.11.

At each iteration $k$ of Algorithm 7.2, the model $m_{k}$ (3.1) satisfies:

(a) $\mathbb{P}[I_{k}=1|\mathcal{F}_{k-1}]\geq\alpha_{m}$ for some $\alpha_{m}\in(\frac{1}{2},1]$ ;

(b) $\|\bm{H}_{k}\|\leq\kappa_{H}-1$ for some $\kappa_{H}\geq 1$ (independent of $k$ ).

We also require the following assumption about the function value estimates used to calculate $\rho_{k}$ .

Assumption 7.12.

At each iteration $k$ of Algorithm 7.2, the estimates $f_{k}^{0}$ and $f_{k}^{s}$ satisfy $\mathbb{P}[J_{k}=1|\mathcal{F}_{k-1/2}]\geq\alpha_{f}$ for some $\alpha_{f}\in(\frac{1}{2},1]$ and some $\epsilon_{f}>0$ .

Under Assumption 7.7, Corollary 7.9 and Theorem 7.10 show that Assumptions 7.11 and 7.12 can be satisfied by using sample averages $\overline{f}_{N}(\bm{x},\omega)$ for $N\geq\frac{\sigma^{2}}{\epsilon_{f}^{2}(1-\alpha_{f}^{1/2})\Delta^{4}}$ samples. (Footnote 30: We need $\alpha_{f}^{1/2}$ rather than $\alpha_{f}$ as in Corollary 7.9 because the condition $J_{k}=1$ requires we have accurate estimates for both $f(\bm{x}_{k})$ and $f(\bm{x}_{k}+\bm{s}_{k})$ .) However, Assumptions 7.11 and 7.12 are more general and may be satisfied under other assumptions on the randomness in the objective.

1: Starting point $\bm{x}_{0}\in\mathbb{R}^{n}$ and trust-region radius $\Delta_{0}>0$. Algorithm parameters: scaling factor $\gamma_{\textnormal{inc}}>1$, acceptance threshold $0<\eta_{S}<1$, criticality threshold $\mu_{c}>0$, and maximum trust-region radius $\Delta_{\max}=\gamma_{\textnormal{inc}}^{j_{\max}}\Delta_{0}$ for some $j_{\max}\in\mathbb{N}$.

2: for $k=0,1,2,\ldots$

3: Build a local quadratic model $m_{k}$ (3.1) satisfying Assumption 7.11.

4: Solve the trust-region subproblem (2.10) to get a step $\bm{s}_{k}$ satisfying Assumption 2.9.

5: Calculate estimates $f_{k}^{0}\approx f(\bm{x}_{k})$ and $f_{k}^{s}\approx f(\bm{x}_{k}+\bm{s}_{k})$ satisfying Assumption 7.12, and calculate the ratio

\displaystyle\rho_{k}=\frac{\text{estimated actual decrease}}{\text{predicted decrease}}:=\frac{f_{k}^{0}-f_{k}^{s}}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})}.

(7.19)

6: if $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ then

7: (Successful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k}$ and $\Delta_{k+1}=\min(\gamma_{\textnormal{inc}}\Delta_{k},\Delta_{\max})$.

8: else

9: (Unsuccessful iteration) Set $\bm{x}_{k+1}=\bm{x}_{k}$ and $\Delta_{k+1}=\gamma_{\textnormal{inc}}^{-1}\Delta_{k}$.

10: end if

11: end for

Algorithm 7.2 Stochastic MBDFO trust-region method for solving (7.14).
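A minimal sketch of Algorithm 7.2 with sample averaging is given below, under several simplifying assumptions: linear interpolation models on $\{\bm{x}_{k},\bm{x}_{k}+\Delta_{k}\bm{e}_{i}\}$ (so $\bm{H}_{k}=0$), sample sizes taken from Corollary 7.9 and Theorem 7.10, and a hypothetical cap `max_samples` that keeps the example cheap but voids the probabilistic guarantees once it is active. The interface `sample_f(y, size)`, returning `size` independent noisy evaluations at `y`, is also an assumption.

```python
import math
import numpy as np

def stochastic_trust_region(sample_f, x0, sigma, delta0=1.0, gamma_inc=2.0,
                            j_max=3, eta_s=0.1, mu_c=0.1, eps_f=0.5,
                            alpha_m=0.9, alpha_f=0.9, max_samples=10_000,
                            max_iter=40):
    """Sketch of Algorithm 7.2 with linear models built from sample averages."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    p = n + 1  # number of interpolation points
    delta = float(delta0)
    delta_max = gamma_inc**j_max * delta0

    def estimate(y, prob):
        # Sample average with N from Corollary 7.9, capped at max_samples.
        # The cap is a practical shortcut that breaks the guarantee.
        N = sigma**2 / (eps_f**2 * (1.0 - prob) * delta**4)
        N = int(math.ceil(min(float(max_samples), N)))
        return float(np.mean(sample_f(y, N)))

    for _ in range(max_iter):
        # Line 3: linear model from averaged values at x and x + delta * e_i.
        prob_m = alpha_m ** (1.0 / p)
        m0 = estimate(x, prob_m)
        g = np.array([(estimate(x + delta * e, prob_m) - m0) / delta
                      for e in np.eye(n)])
        g_norm = np.linalg.norm(g)
        success = False
        if g_norm > 0.0:
            s = -(delta / g_norm) * g             # line 4
            f0 = estimate(x, math.sqrt(alpha_f))  # line 5: fresh estimates
            fs = estimate(x + s, math.sqrt(alpha_f))
            rho = (f0 - fs) / (delta * g_norm)    # ratio (7.19)
            success = (rho >= eta_s) and (g_norm >= mu_c * delta)
        if success:
            x, delta = x + s, min(gamma_inc * delta, delta_max)  # line 7
        else:
            delta = delta / gamma_inc                            # line 9
    return x, delta

rng = np.random.default_rng(0)
sigma = 0.1

def sample_f(y, size):
    # `size` noisy evaluations of f(y) = 0.5 * ||y||^2 with Gaussian noise.
    return 0.5 * float(np.dot(y, y)) + sigma * rng.normal(size=size)

x_final, delta_final = stochastic_trust_region(sample_f, [2.0, -1.0], sigma)
print(x_final, delta_final)
```

Note how the number of samples per estimate grows like $\Delta_{k}^{-4}$ as the trust region shrinks, which is why the cap becomes active quickly in this toy run.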

The worst-case complexity of Algorithm 7.2 is derived from the following general probabilistic result. This considers a non-negative random process $\phi_{k}$ which (before stopping) typically decreases in proportion to another non-negative random process $\Delta_{k}$ . It says that, provided $\Delta_{k}$ is unlikely to get small before stopping (specifically, increases in $\Delta_{k}$ are triggered by an event $W_{k}=1$ that occurs with probability greater than $1/2$ ), we can bound how long it takes for stopping to occur.

Proposition 7.13 (Theorem 2, [24]).

Suppose we have random processes $\{(\phi_{k},\Delta_{k},W_{k})\}_{k=0}^{\infty}$ with $\phi_{k},\Delta_{k}\geq 0$ , and

\displaystyle\mathbb{P}[W_{k}=1|\mathcal{F}_{k-1}]=q,\hskip 18.49988pt\text{and}\hskip 18.49988pt\mathbb{P}[W_{k}=-1|\mathcal{F}_{k-1}]=1-q,

(7.20)

for some $q>\frac{1}{2}$ , where $\mathcal{F}_{k-1}$ is the $\sigma$ -algebra generated by $\{(\phi_{j},\Delta_{j})\}_{j=0}^{k}$ and $\{W_{j}\}_{j=0}^{k-1}$ . For any $\epsilon>0$ , let $K_{\epsilon}$ be a stopping time with respect to the filtration $\{\mathcal{F}_{k}\}_{k=0}^{\infty}$ . If, for all $k$ ,

(a) There exists $\lambda>0$ and $j_{\max}\in\mathbb{Z}$ such that $\Delta_{k}\leq\Delta_{\max}:=\Delta_{0}e^{\lambda j_{\max}}$ ;

(b) There exists $j_{\epsilon}\in\mathbb{Z}$ with $j_{\epsilon}\leq 0$ such that $K_{\epsilon}>k$ implies $\Delta_{k+1}\geq\min(\Delta_{k}e^{\lambda W_{k}},\Delta_{\epsilon})$ , where $\Delta_{\epsilon}:=\Delta_{0}e^{\lambda j_{\epsilon}}$ and $\lambda>0$ is from (a); and

(c) There exists $\theta>0$ and non-decreasing function $h:[0,\infty)\to[0,\infty)$ such that $K_{\epsilon}>k$ implies $\phi_{k}-\mathbb{E}[\phi_{k+1}|\mathcal{F}_{k-1}]\geq\theta h(\Delta_{k})$ ;

then

\displaystyle\mathbb{E}[K_{\epsilon}]\leq\frac{q\phi_{0}}{2(q-\frac{1}{2})\theta h(\Delta_{\epsilon})}+1.

(7.21)

Proposition 7.13 can be thought of as a stochastic version of Theorem 3.7, where condition (b) is a requirement similar to the conclusion of Lemma 3.6. (Footnote 31: We will use $h(\Delta)=\Delta^{2}$ here, but $h(\Delta)=\Delta^{3}$ can be used to get second-order complexity results.)

To apply Proposition 7.13 to analyze Algorithm 7.2, we will take $\phi_{k}$ to be a measure of algorithm progress based on $f(\bm{x}_{k})-f_{\textnormal{low}}$ (see Lemma 7.15 below) and the stopping time to be

\displaystyle K_{\epsilon}:=\inf\{k\geq 0:\|\nabla f(\bm{x}_{k})\|<\epsilon\}

(7.22)

and define the event $W_{k}$ to be when all probabilistic estimates are accurate

\displaystyle W_{k}:=2I_{k}J_{k}-1=\begin{cases}1,&\text{if $I_{k}=J_{k}=1$},\\ -1,&\text{otherwise},\end{cases}

(7.23)

and so from Assumptions 7.11 (a) and 7.12 we get (7.20) with $q=\alpha_{m}\alpha_{f}$ (hence we will need to assume that $\alpha_{m}\alpha_{f}>\frac{1}{2}$ ). Condition (a) in Proposition 7.13 is also automatically satisfied for Algorithm 7.2 by setting $\lambda=\log(\gamma_{\textnormal{inc}})$ . We now need to establish that conditions (b) and (c) hold, for suitable choices of $j_{\epsilon}$ , $\phi_{k}$ and $\theta$ .
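The role of the condition $\alpha_{m}\alpha_{f}>\frac{1}{2}$ can be illustrated with a small simulation of the events in (7.23). The sketch below makes simplifying assumptions that are stronger than Assumptions 7.11 and 7.12: $I_{k}$ and $J_{k}$ are drawn as independent Bernoulli variables with success probabilities exactly $\alpha_{m}$ and $\alpha_{f}$. The function name and parameter values are hypothetical.

```python
import numpy as np

def simulate_accuracy_events(alpha_m, alpha_f, num_iters, seed=0):
    """Draw I_k, J_k as independent Bernoulli variables and return W_k (7.23)."""
    rng = np.random.default_rng(seed)
    I = rng.random(num_iters) < alpha_m  # model is fully linear
    J = rng.random(num_iters) < alpha_f  # both function estimates are accurate
    return 2 * (I & J).astype(int) - 1   # W_k = 2 I_k J_k - 1

alpha_m, alpha_f = 0.9, 0.9
W = simulate_accuracy_events(alpha_m, alpha_f, num_iters=100_000)
q = alpha_m * alpha_f
empirical_q = np.mean(W == 1)
drift = np.mean(W)  # close to 2q - 1, positive exactly when q > 1/2
print(f"q={q:.2f}, empirical q={empirical_q:.4f}, mean of W_k={drift:.4f}")
```

When $q>\frac{1}{2}$ the mean of $W_{k}$ is positive, so a process that multiplies $\Delta_{k}$ by $e^{\lambda W_{k}}$ tends to drift upwards; this is the intuition behind requiring a biased event in Proposition 7.13.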

Lemma 7.14.

Suppose Assumptions 2.4 (a), 2.9, 7.11 and 7.12 hold and we run Algorithm 7.2. For any $\epsilon>0$ , if $j_{\epsilon}$ is the largest non-positive integer (i.e. smallest in magnitude) such that

\displaystyle\Delta_{\epsilon}:=\Delta_{0}e^{\lambda j_{\epsilon}}\leq\frac{\epsilon}{c_{0}},\hskip 18.49988pt\text{where}\hskip 18.49988ptc_{0}:=\max\left(\frac{4\max(\kappa_{\textnormal{mf}},\epsilon_{f})}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)+\kappa_{\textnormal{mg}},

(7.24)

and where $\lambda:=\log(\gamma_{\textnormal{inc}})$ , then provided $K_{\epsilon}>k$ we have $\Delta_{k+1}\geq\min(\Delta_{k}e^{\lambda W_{k}},\Delta_{\epsilon})$ .

Proof.

Fix $k$ such that $K_{\epsilon}>k$ , and so $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ . If $W_{k}=-1$ (i.e. $I_{k}=0$ or $J_{k}=0$ ), then $\Delta_{k+1}=\gamma_{\textnormal{inc}}^{-1}\Delta_{k}=\Delta_{k}e^{-\lambda}=\Delta_{k}e^{\lambda W_{k}}$ and the result holds. So now suppose that $W_{k}=1$ (i.e. $I_{k}=J_{k}=1$ ).

Since $\Delta_{k}=\gamma_{\textnormal{inc}}^{i_{k}}\Delta_{0}$ for some $i_{k}\in\mathbb{Z}$ with $i_{k+1}\in\{i_{k}-1,i_{k}+1\}$ (by the trust-region updating mechanism), if $\Delta_{k}>\Delta_{\epsilon}$ then $i_{k}>j_{\epsilon}$ and so $i_{k+1}\geq j_{\epsilon}$ , which means $\Delta_{k+1}\geq\Delta_{\epsilon}$ and the result holds.

The only remaining case to consider is where $I_{k}=J_{k}=1$ and $\Delta_{k}\leq\Delta_{\epsilon}$ . Since $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ and $\Delta_{k}\leq\Delta_{\epsilon}\leq c_{0}^{-1}\epsilon$ , we have $\|\nabla f(\bm{x}_{k})\|\geq c_{0}\Delta_{k}$ , and so since $m_{k}$ is fully linear ( $I_{k}=1$ ),

\displaystyle\|\bm{g}_{k}\|\geq\|\nabla f(\bm{x}_{k})\|-\kappa_{\textnormal{mg}}\Delta_{k}\geq\max\left(\frac{4\max(\kappa_{\textnormal{mf}},\epsilon_{f})}{\kappa_{s}(1-\eta_{S})},\kappa_{H},\mu_{c}\right)\Delta_{k},

(7.25)

and so $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ and

\displaystyle\Delta_{k}\leq\min\left(\frac{\kappa_{s}(1-\eta_{S})}{2(\kappa_{\textnormal{mf}}+\epsilon_{f})},\frac{1}{\kappa_{H}}\right)\|\bm{g}_{k}\|.

(7.26)

From Assumption 2.9 and $\Delta_{k}\leq\|\bm{g}_{k}\|/\kappa_{H}$ we have $m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}$ . Separately, since $I_{k}=1$ the model $m_{k}$ is fully linear, and since $J_{k}=1$ we have $\max(|f_{k}^{0}-f(\bm{x}_{k})|,|f_{k}^{s}-f(\bm{x}_{k}+\bm{s}_{k})|)\leq\epsilon_{f}\Delta_{k}^{2}$ . Thus

\displaystyle|\rho_{k}-1|=\frac{|(f_{k}^{0}-m_{k}(\bm{x}_{k}))-(f_{k}^{s}-m_{k}(\bm{x}_{k}+\bm{s}_{k}))|}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})},

(7.27)

\displaystyle\leq\frac{|f_{k}^{0}-f(\bm{x}_{k})|+|f(\bm{x}_{k})-m_{k}(\bm{x}_{k})|+|f_{k}^{s}-f(\bm{x}_{k}+\bm{s}_{k})|+|f(\bm{x}_{k}+\bm{s}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})},

(7.28)

\displaystyle\leq\frac{2(\kappa_{\textnormal{mf}}+\epsilon_{f})\Delta_{k}^{2}}{\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}},

(7.29)

and so, by (7.26), $|\rho_{k}-1|\leq 1-\eta_{S}$ , hence $\rho_{k}\geq\eta_{S}$ . Together with $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ , this means iteration $k$ is successful. Hence $\Delta_{k+1}=\gamma_{\textnormal{inc}}\Delta_{k}=\Delta_{k}e^{\lambda W_{k}}$ from $W_{k}=1$ , using $\Delta_{k}\leq\Delta_{\epsilon}\leq\gamma_{\textnormal{inc}}^{-1}\Delta_{\max}$ since $j_{\epsilon}\leq 0<j_{\max}$ . ∎

We now establish condition (c) in Proposition 7.13, for a suitable choice of $\phi_{k}$ and $h(\Delta)$ .

Lemma 7.15.

Suppose Assumptions 2.4, 2.9, 7.11 and 7.12 hold, we run Algorithm 7.2 and

\displaystyle\epsilon_{f}<\frac{1}{2}\eta_{S}\kappa_{s}\mu_{c}\min\left(1,\frac{\mu_{c}}{\kappa_{H}}\right),

(7.30)

holds. Also assume that $\alpha_{m}$ and $\alpha_{f}$ in Assumptions 7.11 and 7.12 are sufficiently close to 1. Then there exists $\theta>0$ and $\nu\in(0,1)$ such that, for any $\epsilon>0$ , provided $K_{\epsilon}>k$ we have

\displaystyle\phi_{k}-\mathbb{E}[\phi_{k+1}|\mathcal{F}_{k-1}]\geq\theta\Delta_{k}^{2},\hskip 18.49988pt\text{where}\hskip 18.49988pt\phi_{k}:=\nu[f(\bm{x}_{k})-f_{\textnormal{low}}]+(1-\nu)\Delta_{k}^{2}\geq 0.

(7.31)

Proof.

The proof is omitted for brevity; see [24, Theorem 3] with further details given in [41, Theorem 4.11]. ∎

We now get our main complexity bound as an immediate consequence of Proposition 7.13.

Theorem 7.16.

Suppose Assumptions 2.4, 2.9, 7.11 and 7.12 hold, we run Algorithm 7.2 and (7.30) holds. Also assume that $\alpha_{m}$ and $\alpha_{f}$ in Assumptions 7.11 and 7.12 are sufficiently close to 1 (the same as required by Lemma 7.15). Then there exists a constant $\tilde{\theta}>0$ such that

\displaystyle\mathbb{E}[K_{\epsilon}]\leq\frac{\alpha_{m}\alpha_{f}[f(\bm{x}_{0})-f_{\textnormal{low}}+\Delta_{0}^{2}]}{2(\alpha_{m}\alpha_{f}-\frac{1}{2})\tilde{\theta}\epsilon^{2}}+1.

(7.32)

Proof.

Apply Lemmas 7.14 and 7.15 to show that Proposition 7.13 applies with $h(\Delta)=\Delta^{2}$ . Then use $\phi_{0}=\nu[f(\bm{x}_{0})-f_{\textnormal{low}}]+(1-\nu)\Delta_{0}^{2}\leq f(\bm{x}_{0})-f_{\textnormal{low}}+\Delta_{0}^{2}$ from $\nu\in(0,1)$ . Since $j_{\epsilon}$ in the definition of $\Delta_{\epsilon}$ in Lemma 7.14 is chosen to be smallest in magnitude, we have $\Delta_{\epsilon}\geq\frac{\epsilon}{\gamma_{\textnormal{inc}}c_{0}}$ , and so $\theta\Delta_{\epsilon}^{2}\geq\tilde{\theta}\epsilon^{2}$ for $\tilde{\theta}=\theta/(\gamma_{\textnormal{inc}}^{2}c_{0}^{2})$ . ∎

In general, we know that at each iteration $k$ we need to sample the stochastic objective $\mathcal{O}(\Delta_{k}^{-4})$ times to satisfy our probabilistic Assumptions 7.11 and 7.12. From Lemma 7.14 we know that $\Delta_{k}$ can typically get as small as $\Delta_{\epsilon}=\mathcal{O}(\epsilon)$ . So, broadly speaking, we may require up to $\mathcal{O}(\epsilon^{-4})$ stochastic objective evaluations per iteration, or $\mathcal{O}(\epsilon^{-6}/(\alpha_{m}\alpha_{f}-\frac{1}{2}))$ evaluations in total. A rigorous $\mathcal{O}(\epsilon^{-6})$ evaluation complexity bound for a variant of Algorithm 7.2 is proven in [88, Theorem 5].
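The heuristic count above can be written out as a product of the iteration bound and the per-iteration sampling cost. The following is a rough sketch rather than a rigorous bound: it treats every iteration as if $\Delta_{k}=\Delta_{\epsilon}$, uses $\Delta_{\epsilon}\geq\epsilon/(\gamma_{\textnormal{inc}}c_{0})$ from the proof of Theorem 7.16, assumes $p+2$ sample averages per iteration ( $p$ interpolation points plus $f_{k}^{0}$ and $f_{k}^{s}$ ), and hides the dependence on $1-\alpha_{m}$ and $1-\alpha_{f}$ in the constants.

```latex
\underbrace{\mathcal{O}\!\left(\frac{\epsilon^{-2}}{\alpha_{m}\alpha_{f}-\frac{1}{2}}\right)}_{\text{iterations, from (7.32)}}
\times
\underbrace{(p+2)\,\mathcal{O}\!\left(\frac{\sigma^{2}}{\epsilon_{f}^{2}\,\Delta_{\epsilon}^{4}}\right)}_{\text{samples per iteration}}
=
\mathcal{O}\!\left(\frac{(p+2)\,\sigma^{2}\,\gamma_{\textnormal{inc}}^{4}\,c_{0}^{4}}{\epsilon_{f}^{2}\,(\alpha_{m}\alpha_{f}-\frac{1}{2})}\;\epsilon^{-6}\right).
```

In particular, halving the target accuracy $\epsilon$ multiplies this estimate by $2^{6}=64$.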

Remark 7.17.

Using Corollary 7.9 and Theorem 7.10 to satisfy Assumptions 7.11 and 7.12 relies on repeatedly sampling the objective at every iterate and interpolation point, potentially a large number of times. As such, Algorithm 7.2 is the only method in this work that is not well-suited to the regime where objective evaluations are expensive. However, this is the price of seeking convergence to a stationary point (which is not generically possible in the case of deterministic noise, as the true objective value can never be known for any point). Convergence to a neighborhood, as in the case of deterministic noise, can be established for Algorithm 7.1 [32], which is more realistic for the expensive evaluation regime.

Notes and References

Lemma 7.2 is based on [111, Lemma 9.1]. Algorithm 7.1 and the associated worst-case complexity analysis (Theorem 7.6) are new, but broadly follow the approach from [32, 40]. In particular, [32] considers very flexible model and evaluation accuracy assumptions, including allowing for probabilistically accurate models and sufficient decrease estimates, similar to Section 7.2. It also includes second-order complexity theory. This requires modifying the ratio test (7.12) based on knowledge of $\epsilon_{f}$ . If the noise were stochastic, we could estimate $\epsilon_{f}$ based on the standard deviation of $\tilde{f}(\bm{x})$ for different values of $\bm{x}$ . For deterministic noise, $\epsilon_{f}$ must be estimated in a different way (e.g. [106]).

A procedure to estimate the size of (especially deterministic) noise in a function was developed in [106]. Accurate finite differencing schemes (i.e. picking the perturbation size $h$ based on the estimated noise level) are developed in [107, 136]. Several recent works have shown how to adapt existing (derivative-based) algorithms to appropriately handle deterministic noise in objective and gradient evaluations, such as [135, 140, 114]. They typically also rely on a modified ratio test, similar to the use of (7.12) in Algorithm 7.1. A modified ratio test is also used in the (derivative-based) trust-region solver TRU in the GALAHAD package [70] to handle roundoff errors. If the level of deterministic noise can be controlled (i.e. $\widetilde{f}(\bm{x})$ can be computed for any $\epsilon_{f}>0$ , but not $\epsilon_{f}=0$ ), then Algorithm 3.1 with minor adjustments can converge to any accuracy level [63].

A more widely studied setting for stochastic optimization is where stochastic gradient estimates $\bm{x}\mapsto\nabla_{\bm{x}}f(\bm{x},\omega)$ are available, which is the core of many algorithms for training machine learning models [26]. We also note that stricter assumptions about the distribution of $f(\bm{x},\omega)$ , e.g. bounds on higher-order moments, can yield tighter bounds on $|\overline{f}_{N}(\bm{x},\omega)-f(\bm{x})|$ than those given by Chebyshev’s inequality, for example via Hoeffding’s or Bernstein’s inequality; see [27].
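To give a feel for how much a stronger distributional assumption can help, the following Python sketch compares the number of samples $N$ needed to guarantee $|\overline{f}_{N}(\bm{x},\omega)-f(\bm{x})|\leq t$ with probability at least $1-\delta$ under two standard bounds. The Chebyshev count only uses a variance bound, while the Hoeffding count assumes (as an illustrative, stronger assumption not made in the text) that the i.i.d. samples lie in an interval of known width. The function names and numerical values are hypothetical.

```python
import math


def chebyshev_sample_size(variance, tol, fail_prob):
    """Smallest N with variance / (N * tol^2) <= fail_prob."""
    return math.ceil(variance / (fail_prob * tol ** 2))


def hoeffding_sample_size(value_range, tol, fail_prob):
    """Smallest N with 2 * exp(-2 * N * tol^2 / value_range^2) <= fail_prob."""
    return math.ceil(value_range ** 2 * math.log(2.0 / fail_prob) / (2.0 * tol ** 2))


# Hypothetical setting: samples lie in an interval of width 1, so their
# variance is at most 1/4; we want error <= 0.1 with probability >= 0.99.
n_chebyshev = chebyshev_sample_size(variance=0.25, tol=0.1, fail_prob=0.01)
n_hoeffding = hoeffding_sample_size(value_range=1.0, tol=0.1, fail_prob=0.01)
print(n_chebyshev, n_hoeffding)
```

In this setting the Hoeffding-based count is roughly an order of magnitude smaller, because its dependence on the failure probability is logarithmic rather than proportional to $1/\delta$. For large failure probabilities the comparison can go the other way.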

Algorithm 7.2 was introduced in [41], where almost-sure convergence of $\|\nabla f(\bm{x}_{k})\|$ to zero was shown, and the worst-case complexity analysis shown here was originally given in [24]. The work [24] also includes an extension of Algorithm 7.2 that converges to second-order critical points, with suitable worst-case complexity bounds. This approach has been extended to constrained problems via stochastic SQP methods (primarily aimed at the stochastic gradient setting but also applicable to MBDFO) in [66, 65, 64]. Adaptive sample averaging can improve performance, where the number of samples $N$ is chosen dynamically at every evaluation point, based on the sample standard deviation and $\Delta_{k}$ , to ensure that the current estimate $\overline{f}_{N}(\bm{x},\omega)$ is sufficiently accurate [134, 80]. The sample complexity (i.e. total number of stochastic oracle calls) of stochastic MBDFO algorithms can be reduced under stronger assumptions on the stochastic oracle, such as bounds on tail probabilities [128] or the availability of common random numbers [78].
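The adaptive sample averaging idea can be sketched in a few lines of Python. This is a minimal illustration rather than the rule used in [134, 80]: it assumes the accuracy target is that the estimated standard error falls below `kappa * delta_k ** 2`, where `kappa`, `n_min` and `n_max` are hypothetical tuning parameters and `sample_f` is a hypothetical stochastic oracle returning one realization of $f(\bm{x},\omega)$.

```python
import math
import random


def adaptive_sample_average(sample_f, x, delta_k, kappa=1.0, n_min=5, n_max=100_000):
    """Average samples of sample_f(x) until the standard error is small.

    Stops once (sample std) / sqrt(N) <= kappa * delta_k^2, or at n_max samples.
    """
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    while n < n_max:
        value = sample_f(x)
        n += 1
        step = value - mean
        mean += step / n
        m2 += step * (value - mean)
        if n >= max(n_min, 2):
            std_err = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            if std_err <= kappa * delta_k ** 2:
                break
    return mean, n


# Hypothetical oracle: f(x) = ||x||^2 plus standard normal noise.
rng = random.Random(0)


def sample_f(x):
    return sum(xi ** 2 for xi in x) + rng.gauss(0.0, 1.0)


x = [1.0, 2.0]
mean_coarse, n_coarse = adaptive_sample_average(sample_f, x, delta_k=0.5)
mean_fine, n_fine = adaptive_sample_average(sample_f, x, delta_k=0.1)
print(n_coarse, mean_coarse, n_fine, mean_fine)
```

With unit-variance noise the required sample count scales like $\Delta_{k}^{-4}$ under this rule, so a smaller trust-region radius triggers far more oracle calls. This is the cost that the sample complexity results cited above aim to quantify and reduce.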

Instead of constructing interpolation models using averaged estimates $\overline{f}_{N}(\bm{x},\omega)$ for large $N$ , an alternative is to use regression models (i.e. $N=1$ but with very large $p$ ), based on the interpolation theory from [45, 51, 22] and discussed briefly in Section 4.3. In this case, the almost-sure convergence of $\|\nabla f(\bm{x}_{k})\|$ to zero is shown in [95]. Regression models are useful for both deterministic and stochastic noise (unlike sample averaging), but sample averaging allows for more accurate estimates $f_{k}^{0},f_{k}^{s}$ for deciding whether to accept a step or not. Radial basis function models can also cope with large numbers of interpolation points $p$ , and hence construct regression-type models suitable for noisy problems [16]. Error bounds for linear regression models as the number of points $p\to\infty$ , relevant for problems with stochastic noise, are derived in [83].
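As a concrete picture of the regression alternative, the sketch below fits a linear model $m(\bm{x}_{k}+\bm{s})=c+\bm{g}^{T}\bm{s}$ by least squares to $p$ single noisy evaluations ($N=1$) near the current iterate. It is a minimal sketch under stated assumptions, not the construction analyzed in [45, 51, 22]: the sample points are drawn uniformly from a box of half-width `delta_k` around `x_k`, and the function name, test objective and noise level are all hypothetical.

```python
import numpy as np


def linear_regression_model(f_noisy, x_k, delta_k, p, rng):
    """Least-squares linear model c + g^T s from p noisy evaluations near x_k."""
    n = x_k.size
    # Sample p steps uniformly from the box [-delta_k, delta_k]^n (an assumption).
    steps = rng.uniform(-delta_k, delta_k, size=(p, n))
    values = np.array([f_noisy(x_k + s) for s in steps])
    # Design matrix with a constant column followed by the step coordinates.
    design = np.hstack([np.ones((p, 1)), steps])
    coeffs, *_ = np.linalg.lstsq(design, values, rcond=None)
    return coeffs[0], coeffs[1:]


rng = np.random.default_rng(0)


def f_noisy(y):
    # Hypothetical objective 1 + 2 y_0 - 3 y_1 with Gaussian noise of std 0.01.
    return 1.0 + 2.0 * y[0] - 3.0 * y[1] + rng.normal(0.0, 0.01)


c, g = linear_regression_model(f_noisy, np.zeros(2), delta_k=0.1, p=200, rng=rng)
print(c, g)
```

Here `g` should be close to the true gradient $(2,-3)$ even though each evaluation is noisy, because the least-squares fit averages the noise over many points. With only $p=n+1$ points the model would interpolate the noise instead.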

A growing body of recent work extends the results from [41, 24] to more general (probabilistic) conditions on function value and model accuracy [32, 88, 66, 65], and to direct search methods [7, 62]. This is improving our understanding of how much progress can be made in MBDFO and similar methods for stochastic problems with given levels of accuracy. A greater understanding is useful here, since MBDFO algorithms for deterministic problems with minimal/no sample averaging can already perform quite well in some cases [35, Section 5].

8 Summary and Software

The creation and analysis of MBDFO algorithms requires an interesting combination of optimization and approximation theory, which has been extensively developed over the last 30 years in particular. We have introduced the most important tools used in these algorithms, in particular trust-region methods and the construction of fully linear/fully quadratic interpolation models, and have demonstrated how to extend these ideas to important settings such as constrained problems and noisy objective evaluations. As mentioned in the introduction, the intention was never to provide a comprehensive overview of developments in the field; for this, we direct the reader to [96], and also note the more detailed discussions of algorithm and interpolation theory in [51].

8.1 Software

We finish by providing a list of high-quality open-source software implementations of MBDFO methods, suitable for use in practical applications as well as a starting point for further research. (Footnote 32: Disclosure: The author is the primary developer of two packages listed here, namely Py-BOBYQA and DFO-LS.)

Software of M. J. D. Powell: A collection of several Fortran packages for MBDFO. This includes COBYLA [115] (general constrained problems, linear interpolation), UOBYQA [118] & NEWUOA [120] (unconstrained problems, fully and minimum Frobenius norm quadratic models respectively), BOBYQA [121] (bound-constrained problems, minimum Frobenius norm quadratic models) and LINCOA [122] (linearly constrained problems, minimum Frobenius norm quadratic models). They are most easily accessed through the PRIMA package [157], which also includes C, MATLAB, Python and Julia interfaces. Py-BOBYQA [35] is a separate, pure Python re-implementation of BOBYQA with additional heuristics to improve performance for noisy problems. An accessible overview of these algorithms can be found in [123]. (Footnote 33: Mike Powell (1936–2015) was a pioneer of both (derivative-based) trust-region and MBDFO methods, among many other significant contributions to optimization and approximation theory [29].)
COBYQA [124]: A general-purpose Python package which can handle bound and nonlinear constraints. Nonlinear constraints are also assumed to be derivative-free. It is based on minimum Frobenius norm quadratic models. It is most readily available through SciPy’s optimization module. (Footnote 34: See https://scipy.org/.)
DFO-LS [35]: A Python package for nonlinear least-squares problems. It can handle simple bound and convex constraints, and nonsmooth regularizers, and it includes heuristics to improve performance for noisy problems. It is based on linear interpolation models for each term in the least-squares sum (see Section 4.5).
IBCDFO: A collection of MATLAB and Python packages for MBDFO with composite models (see Section 4.5). It includes POUNDerS [149] (nonlinear least-squares with bound constraints using minimum Frobenius norm quadratic models), and manifold sampling [97] & GOOMBAH [98] (both for unconstrained problems with general composite objectives (4.105) with $h$ possibly nonsmooth, using minimum Frobenius norm quadratic models).
ASTRO-DF [79]: A Python package aimed at stochastic MBDFO problems. It can handle constraints and is well-suited to simulation-based optimization, where common random numbers can be used to evaluate the objective at different points in a consistent way. It is based on underdetermined quadratic interpolation models (using diagonal Hessians as per Corollary 4.15).
NOMAD [11]: A widely used direct search code, written in C++ with MATLAB and Python interfaces. It constructs quadratic models with minimum Frobenius norm interpolation (or an alternative surrogate) to accelerate the direct search process via a model-based search step, and/or by ordering the points to be checked. It can handle nonsmooth problems, constraints, discrete variables, multi-objective problems and more.

Table 8.1 provides a list of URLs for the above packages, accurate at the time of writing. This list is not intended to be exhaustive, but hopefully can provide a starting point for the interested reader.

Package	Location
PRIMA (Powell’s software)	https://github.com/libprima/prima
Py-BOBYQA	https://github.com/numericalalgorithmsgroup/pybobyqa
COBYQA	https://github.com/cobyqa/cobyqa
DFO-LS	https://github.com/numericalalgorithmsgroup/dfols
IBCDFO	https://github.com/POptUS/IBCDFO
ASTRO-DF	https://github.com/simopt-admin/simopt
NOMAD	https://github.com/bbopt/nomad

Table 8.1: URLs for selected MBDFO software packages.
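As a quick way to try one of these methods without installing a dedicated package, the sketch below calls SciPy's own interface to Powell's COBYLA through `scipy.optimize.minimize`. The test problem is hypothetical, and the option values (`rhobeg`, `maxiter`, `tol`) are illustrative choices rather than recommendations. To the best of our knowledge, recent SciPy releases (1.14 onwards) also accept `method="COBYQA"` with the same calling pattern, but this should be checked against the installed version.

```python
import numpy as np
from scipy.optimize import minimize


def objective(x):
    # Hypothetical smooth objective, treated as a black box by the solver.
    return (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2


# Inequality constraint in SciPy's convention fun(x) >= 0, i.e. x_0 + x_1 <= 3.
constraints = [{"type": "ineq", "fun": lambda x: 3.0 - x[0] - x[1]}]

result = minimize(
    objective,
    x0=np.array([0.0, 0.0]),
    method="COBYLA",
    constraints=constraints,
    tol=1e-6,
    options={"rhobeg": 0.5, "maxiter": 1000},
)
print(result.x, result.fun)
```

No gradients are supplied for either the objective or the constraint, so the solver relies only on function values. For this problem the constraint is active at the solution, which is $(0.75, 2.25)$, and the solver should return a point close to it.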

8.2 Outlook

MBDFO theory and algorithms have matured greatly over the last 30 years, but much work remains to be done. Compared to many other areas of optimization, the gap between theoretically studied algorithms and efficient, practical software is relatively large, and it remains to be seen how closely aligned these can be made. As local MBDFO methods mature, there is more scope to incorporate them into global optimization techniques such as multistart methods [99, 87], since MBDFO methods tend to outperform global optimization methods when good estimates of minimizers are available [129]. Another consequence of the increasing maturity is that MBDFO can now start to be applied to optimization problems with more complex structure, such as multi-objective optimization [54, 102], optimization on manifolds [108] and mixed integer problems [143, 92].

Acknowledgments

This work was supported by the Australian Research Council Discovery Early Career Award DE240100006. Thanks to Warren Hare for the encouragement to pursue this project. Thanks to Nicole Felice, Warren Hare, Jeffrey Larson, Fangyu Liu, Matt Menickelly, Clément Royer and Stefan Wild for spotting errors and providing helpful feedback. Thanks to the editor and anonymous referees for their suggestions on the manuscript.

References

[1] A. Abbas, A. Ambainis, B. Augustino, A. Bärtschi, H. Buhrman, C. Coffrin, G. Cortiana, V. Dunjko, D. J. Egger, B. G. Elmegreen, N. Franco, F. Fratini, B. Fuller, J. Gacon, C. Gonciulea, S. Gribling, S. Gupta, S. Hadfield, R. Heese, G. Kircher, T. Kleinert, T. Koch, G. Korpas, S. Lenk, J. Marecek, V. Markov, G. Mazzola, S. Mensa, N. Mohseni, G. Nannicini, C. O’Meara, E. P. Tapia, S. Pokutta, M. Proissl, P. Rebentrost, E. Sahin, B. C. B. Symons, S. Tornow, V. Valls, S. Woerner, M. L. Wolf-Bauwens, J. Yard, S. Yarkoni, D. Zechiel, S. Zhuk, and C. Zoufa (2024) Challenges and opportunities in quantum optimization. Nature Reviews Physics 6, pp. 718–735. External Links: Document Cited by: Example 1.2.
[2] S. Alarie, C. Audet, A. E. Gheribi, M. Kokkolaras, and S. Le Digabel (2021) Two decades of blackbox optimization applications. EURO Journal on Computational Optimization 9, pp. 100011. External Links: Document Cited by: §1.1.
[3] C. Audet, J. E. Dennis, and S. Le Digabel (2010) Globalization strategies for mesh adaptive direct search. Computational Optimization and Applications 46 (2), pp. 193–215. External Links: Document Cited by: §6.2.
[4] C. Audet and J. E. Dennis (2004) A pattern search filter method for nonlinear programming without derivatives. SIAM Journal on Optimization 14 (4), pp. 980–1010. External Links: Document Cited by: §6.2.
[5] C. Audet and J. E. Dennis (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization 17 (1), pp. 188–217. External Links: Document Cited by: item Direct search methods, §4.5, §6.2.
[6] C. Audet and J. E. Dennis (2009) A progressive barrier for derivative-free nonlinear programming. SIAM Journal on Optimization 20 (1), pp. 445–472. External Links: Document Cited by: §6.2.
[7] C. Audet, K. J. Dzahini, M. Kokkolaras, and S. Le Digabel (2021) Stochastic mesh adaptive direct search for blackbox optimization using probabilistic estimates. Computational Optimization and Applications 79 (1), pp. 1–34. External Links: Document Cited by: §7.2.
[8] C. Audet and W. Hare (2017) Derivative-free and blackbox optimization. Springer Series in Operations Research and Financial Engineering, Springer, Cham, Switzerland. Cited by: §1.1, §1.1, §1.1, §1.3, §1.3.
[9] C. Audet and W. Hare (2020) Model-based methods in derivative-free nonsmooth optimization. In Numerical Nonsmooth Optimization: State of the Art Algorithms, A. M. Bagirov, M. Gaudioso, N. Karmitsa, M. M. Mäkelä, and S. Taheri (Eds.), External Links: Document Cited by: §4.5.
[10] C. Audet, M. Kokkolaras, S. Le Digabel, and B. Talgorn (2018) Order-based error for managing ensembles of surrogates in mesh adaptive direct search. Journal of Global Optimization 70 (3), pp. 645–675. External Links: Document Cited by: §1.1.
[11] C. Audet, S. Le Digabel, V. R. Montplaisir, and C. Tribes (2022) Algorithm 1027: NOMAD version 4: nonlinear optimization with the MADS algorithm. ACM Transactions on Mathematical Software 48 (3), pp. 1–22. External Links: Document Cited by: §4.5, item NOMAD [11], item NOMAD [11].
[12] C. Audet, S. Le Digabel, and M. Peyrega (2015) Linear equalities in blackbox optimization. Computational Optimization and Applications 61 (1), pp. 1–23. External Links: Document Cited by: §6.2.
[13] C. Audet, S. Le Digabel, and R. Saltet (2022) Quantifying uncertainty with ensembles of surrogates for blackbox optimization. Computational Optimization and Applications 83 (1), pp. 29–66. External Links: Document Cited by: §1.1.
[14] C. Audet and D. Orban (2006) Finding optimal algorithmic parameters using derivative-free optimization. SIAM Journal on Optimization 17 (3), pp. 642–664. External Links: Document Cited by: §1.1.
[15] C. Audet (2014) A survey on direct search methods for blackbox optimization and their applications. In Mathematics Without Boundaries, P. M. Pardalos and T. M. Rassias (Eds.), pp. 31–56. External Links: Document Cited by: §1.1.
[16] F. Augustin and Y. M. Marzouk (2017) A trust-region method for derivative-free nonlinear constrained stochastic optimization. Note: arXiv preprint arXiv:1703.04156 External Links: 1703.04156 Cited by: §4.5, §7.2.
[17] A. S. Bandeira, K. Scheinberg, and L. N. Vicente (2014) Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24 (3), pp. 1238–1264. External Links: Document Cited by: §3.3.
[18] A. Beck (2017) First-order methods in optimization. SIAM. Cited by: §6.1, §6.2.
[19] V. Beiranvand, W. Hare, and Y. Lucet (2017) Best practices for comparing optimization algorithms. Optimization and Engineering 18 (4), pp. 815–848. External Links: Document Cited by: §1.1.
[20] M. Benzi, G. H. Golub, and J. Liesen (2005) Numerical solution of saddle point problems. Acta Numerica 14, pp. 1–137. External Links: Document Cited by: §4.3, Remark 4.12.
[21] A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg (2022) A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Foundations of Computational Mathematics 22, pp. 507–560. External Links: 1910.04055, Document Cited by: §4.2.
[22] S. C. Billups, J. Larson, and P. Graf (2013) Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM Journal on Optimization 23 (1), pp. 27–53. External Links: Document Cited by: §7.2.
[23] M. Björkmann and K. Holmström (2000) Global optimization of costly nonconvex functions using radial basis functions. Optimization and Engineering 1, pp. 373–397. External Links: Document Cited by: §4.5.
[24] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg (2019) Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization 1 (2), pp. 92–119. External Links: Document Cited by: §7.2, §7.2, §7.2, Proposition 7.13.
[25] A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. W. Trosset (1999) A rigorous framework for optimization of expensive functions by surrogates. Structural Optimization 13, pp. 1–13. External Links: Document Cited by: §1.1, §1.3, §6.2.1.
[26] L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. External Links: Document Cited by: §7.2.
[27] S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford. Cited by: §7.2, Proposition 7.8.
[28] J. P. Boyle and R. L. Dykstra (1986) A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, R. Dykstra, T. Robertson, and F. T. Wright (Eds.), New York, NY, pp. 28–47. External Links: ISBN 978-1-4613-9940-7 Cited by: §6.1.
[29] M. D. Buhmann, R. Fletcher, A. Iserles, and P. Toint (2017) Michael James David Powell. Note: https://www.damtp.cam.ac.uk/user/na/NA_papers/NA2017_04.pdf Cited by: footnote 33.
[30] M. D. Buhmann (2000) Radial basis functions. Acta Numerica 9, pp. 1–38. External Links: Document Cited by: §4.5.
[31] R. H. Byrd, J. Nocedal, and R. A. Waltz (2006) Knitro: an integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, P. Pardalos, G. Di Pillo, and M. Roma (Eds.), Vol. 83, pp. 35–59. External Links: Document Cited by: §2.2.
[32] L. Cao, A. S. Berahas, and K. Scheinberg (2024) First- and second-order high probability complexity bounds for trust-region methods with noisy oracles. Mathematical Programming 207, pp. 55–106. External Links: Document Cited by: §4.2, §4.5, §7.2, §7.2, Remark 7.17.
[33] L. Cao, Z. Wen, and Y. Yuan (2023) The error in multivariate linear extrapolation with applications to derivative-free optimization. Note: arXiv preprint arXiv:2307.00358 Cited by: §5.3, §5.3.
[34] C. Cartis, N. I. M. Gould, and P. L. Toint (2012) An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis 32 (4), pp. 1662–1695. External Links: ISSN 0272-4979, 1464-3642, Document Cited by: §6.2.
[35] C. Cartis, J. Fiala, B. Marteau, and L. Roberts (2019) Improving the flexibility and robustness of model-based derivative-free optimization solvers. ACM Transactions on Mathematical Software 45 (3), pp. 32:1–32:41. External Links: Document Cited by: §7.2, §7, item Software of M. J. D. Powell, item DFO-LS [35], item DFO-LS [35].
[36] C. Cartis, N. I. M. Gould, and P. L. Toint (2022) Evaluation complexity of algorithms for nonconvex optimization: theory, computation and perspectives. MOS-SIAM Series on Optimization, MOS/SIAM, Philadelphia. Cited by: §2.1, §2.1, §2.3, Theorem 2.12, Theorem 2.13, Remark 2.5, Lemma 2.6, footnote 10.
[37] C. Cartis and L. Roberts (2019) A derivative-free Gauss–Newton method. Mathematical Programming Computation 11 (4), pp. 631–674. External Links: Document Cited by: §4.5, §4.5.
[38] C. Cartis and L. Roberts (2023) Scalable subspace methods for derivative-free nonlinear least-squares optimization. Mathematical Programming 199 (1-2), pp. 461–524. External Links: Document Cited by: §3.3.
[39] C. Cartis and L. Roberts (2024) Randomized subspace derivative-free optimization with quadratic models and second-order convergence. arXiv. Note: arXiv preprint arXiv:2412.14431 External Links: 2412.14431, Document Cited by: §3.3.
[40] A. Chaudhry and K. Scheinberg (2025) On complexity of model-based derivative-free methods. arXiv. Note: arXiv preprint arXiv:2510.14935 External Links: 2510.14935 Cited by: §4.5, §5.3, §5.3, §5.3, §7.2.
[41] R. Chen, M. Menickelly, and K. Scheinberg (2018) Stochastic optimization using a trust-region method and random models. Mathematical Programming 169 (2), pp. 447–487. External Links: Document Cited by: Remark 2.5, §7.2, §7.2, §7.2.
[42] X. Chen and C. T. Kelley (2016) Optimization with hidden constraints and embedded Monte Carlo computations. Optimization and Engineering 17 (1), pp. 157–175. External Links: Document Cited by: §6.2.1, §6.2.1.
[43] T. D. Choi, O. J. Eslinger, C. T. Kelley, J. W. David, and M. Etheridge (2000) Optimization of automotive valve train components with implicit filtering. Optimization and Engineering 1, pp. 9–27. External Links: Document Cited by: §1.1.
[44] P.D. Conejo, E.W. Karas, L.G. Pedroso, A.A. Ribeiro, and M. Sachine (2013) Global convergence of trust-region algorithms for convex constrained minimization without derivatives. Applied Mathematics and Computation 220, pp. 324–330. External Links: Document Cited by: §6.2.
[45] A. R. Conn, K. Scheinberg, and L. N. Vicente (2008) Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA Journal of Numerical Analysis 28 (4), pp. 721–748. External Links: Document Cited by: §4.5, §5.3, §7.2.
[46] A. R. Conn, K. Scheinberg, and L. N. Vicente (2007) Geometry of interpolation sets in derivative free optimization. Mathematical Programming 111 (1-2), pp. 141–172. External Links: Document Cited by: §4.5, §5.3.
[47] A. R. Conn, N. I. M. Gould, and P. L. Toint (2000) Trust-region methods. MPS-SIAM Series on Optimization, Vol. 1, MPS/SIAM, Philadelphia. Cited by: §1.2, §2.2, §2.2, §2.3, Remark 2.5, Lemma 2.8, 2nd item, §3, §6.1, §6.2, §6.2, §6.2, Proposition 6.2.
[48] A. R. Conn and S. Le Digabel (2013) Use of quadratic models with mesh-adaptive direct search for constrained black box optimization. Optimization Methods and Software 28 (1), pp. 139–158. External Links: Document Cited by: §1.1.
[49] A. R. Conn, K. Scheinberg, and P. L. Toint (1997) On the convergence of derivative-free methods for unconstrained optimization. In Approximation Theory and Optimization: Tributes to M. J. D. Powell, M. D. Buhmann and A. Iserles (Eds.), Cited by: §3.3.
[50] A. R. Conn, K. Scheinberg, and L. N. Vicente (2009) Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM Journal on Optimization 20 (1), pp. 387–415. External Links: Document Cited by: §3.3.
[51] A. R. Conn, K. Scheinberg, and L. N. Vicente (2009) Introduction to derivative-free optimization. MPS-SIAM Series on Optimization, Vol. 8, MPS/SIAM, Philadelphia. Cited by: §1.1, §1.3, Remark 2.5, §3.3, §4.5, Remark 4.3, §5.3, §7.2, §8.
[52] A. R. Conn and P. L. Toint (1996) An algorithm using quadratic interpolation for unconstrained derivative free optimization. In Nonlinear Optimization and Applications, G. Di Pillo and F. Giannessi (Eds.), pp. 27–47. External Links: Document Cited by: §4.5.
[53] F. E. Curtis, S. Dezfulian, and A. Wächter (2024) Derivative-free bound-constrained optimization for solving structured problems with surrogate models. Optimization Methods and Software 39 (4), pp. 845–873. External Links: Document Cited by: §4.4.
[54] A. L. Custódio, J. F. A. Madeira, A. I. F. Vaz, and L. N. Vicente (2011) Direct multisearch for multiobjective optimization. SIAM Journal on Optimization 21 (3), pp. 1109–1140. External Links: Document Cited by: §8.2.
[55] A. L. Custódio, H. Rocha, and L. N. Vicente (2010) Incorporating minimum Frobenius norm models in direct search. Computational Optimization and Applications 46 (2), pp. 265–278. External Links: Document Cited by: §1.1.
[56] A.L. Custódio, M. Emmerich, and J.F.A. Madeira (2012) Recent developments in derivative-free multiobjective optimisation. Computational Technology Reviews 5, pp. 1–30. External Links: ISSN 2044-8430, Document Cited by: §1.1.
[57] D. Davar and G. N. Grapiglia (2025) TRFD: a derivative-free trust-region method based on finite differences for composite nonsmooth optimization. SIAM Journal on Optimization 35 (3), pp. 1792–1821. External Links: Document Cited by: §4.5.
[58] Y. Diouane, M. L. Habiboullah, and D. Orban (2024) Complexity of trust-region methods in the presence of unbounded Hessian approximations. arXiv. Note: arXiv preprint arXiv:2408.06243 External Links: 2408.06243 Cited by: footnote 21.
[59] N. Doikov and G. N. Grapiglia (2025) First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians. Journal of Scientific Computing 103 (1), pp. 32. External Links: Document Cited by: §4.2, §4.5.
[60] K. J. Dzahini and S. M. Wild (2024) Stochastic trust-region algorithm in random subspaces with convergence and expected complexity analyses. SIAM Journal on Optimization 34 (3), pp. 2671–2699. External Links: Document Cited by: §3.3.
[61] K.J. Dzahini, F. Rinaldi, C.W. Royer, and D. Zeffiro (2025) Direct-search methods in the year 2025: theoretical guarantees and algorithmic paradigms. EURO Journal on Computational Optimization 13, pp. 100110. External Links: Document Cited by: §1.1.
[62] K. J. Dzahini, M. Kokkolaras, and S. Le Digabel (2023) Constrained stochastic blackbox optimization using a progressive barrier and probabilistic estimates. Mathematical Programming 198 (1), pp. 675–732. External Links: Document Cited by: §7.2.
[63] M. J. Ehrhardt and L. Roberts (2021) Inexact derivative-free optimization for bilevel learning. Journal of Mathematical Imaging and Vision 63 (5), pp. 580–600. External Links: Document Cited by: §7.2.
[64] Y. Fang, J. Kim, S. Na, J. Demmel, and J. Lavaei (2026) A trust-region interior-point stochastic sequential quadratic programming method. arXiv preprint arXiv:2603.10230. Cited by: §6.2, §7.2.
[65] Y. Fang, J. Lavaei, and S. Na (2026) High probability complexity bounds of trust-region stochastic sequential quadratic programming with heavy-tailed noise. Mathematical Programming. Cited by: §6.2, §7.2, §7.2.
[66] Y. Fang, S. Na, M. W. Mahoney, and M. Kolar (2024) Trust-region sequential quadratic programming for stochastic optimization with random models. arXiv. Note: arXiv preprint arXiv:2409.15734 External Links: 2409.15734 Cited by: §6.2, §7.2, §7.2.
[67] R. Garmanjani, D. Júdice, and L. N. Vicente (2016) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM Journal on Optimization 26 (4), pp. 1987–2011. External Links: Document Cited by: §3.3.
[68] S. Ghadimi and G. Lan (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. External Links: Document Cited by: item Finite differencing/implicit filtering.
[69] G. H. Golub and C. F. Van Loan (1996) Matrix computations. 3rd edition, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore. Cited by: footnote 16.
[70] N. I. M. Gould, D. Orban, and P. L. Toint (2003) GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization. ACM Transactions on Mathematical Software 29 (4), pp. 353–372. External Links: Document Cited by: §2.2, §7.2.
[71] N. I. M. Gould, D. P. Robinson, and H. S. Thorne (2010) On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation 2 (1), pp. 21–57. External Links: Document Cited by: §2.2.
[72] G. N. Grapiglia, J. Yuan, and Y. Yuan (2016) A derivative-free trust-region algorithm for composite nonsmooth optimization. Computational and Applied Mathematics 35 (2), pp. 475–499. External Links: Document Cited by: §4.5.
[73] G. N. Grapiglia (2023) Quadratic regularization methods with finite-difference gradient approximations. Computational Optimization and Applications 85 (3), pp. 683–703. External Links: Document Cited by: §4.5.
[74] S. Gratton, C. W. Royer, and L. N. Vicente (2020) A decoupled first/second-order steps technique for nonconvex nonlinear unconstrained optimization with improved complexity bounds. Mathematical Programming 179 (1-2), pp. 195–222. External Links: Document Cited by: §2.3.
[75] R. M. Gray (2005) Toeplitz and circulant matrices: a review. Foundations and Trends® in Communications and Information Theory 2 (3), pp. 155–239. External Links: Document Cited by: 1st item, footnote 35.
[76] A. Griewank and A. Walther (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation. 2nd edition, SIAM, Philadelphia. External Links: Document Cited by: item Automatic (aka algorithmic) differentiation.
[77] H. M. Gutmann (2001) A radial basis function method for global optimization. Journal of Global Optimization 19, pp. 201–227. External Links: Document Cited by: §4.5.
[78] Y. Ha, S. Shashaani, and R. Pasupathy (2025) Complexity of zeroth- and first-order stochastic trust-region algorithms. SIAM Journal on Optimization 35 (3), pp. 2098–2127. External Links: Document Cited by: §7.2.
[79] Y. Ha, S. Shashaani, and Q. Tran-Dinh (2021) Improved complexity of trust-region optimization for zeroth-order stochastic oracles with adaptive sampling. In 2021 Winter Simulation Conference (WSC), Phoenix, AZ, USA, pp. 1–12. External Links: Document Cited by: item ASTRO-DF [79], item ASTRO-DF [79].
[80] Y. Ha and S. Shashaani (2024) Iteration complexity and finite-time efficiency of adaptive sampling trust-region methods for stochastic derivative-free optimization. IISE Transactions, pp. 1–15. External Links: Document Cited by: §7.2.
[81] L. Han and G. Liu (2004) On the convergence of the UOBYQA method. Journal of Applied Mathematics and Computing 16 (1-2), pp. 125–142. External Links: Document Cited by: §5.3.
[82] M. I. Hannanu, E. Camponogara, T. L. Silva, and M. Hovd (2024) A modified derivative-free SQP-filter trust-region method for uncertainty handling: application in gas-lift optimization. Optimization and Engineering. External Links: Document Cited by: §6.2, §6.2.
[83] W. Hare, G. Jarry-Bolduc, and C. Planiden (2023) Limiting behaviour of the generalized simplex gradient as the number of points tends to infinity on a fixed shape in $\mathbb{R}^{n}$ . Set-Valued and Variational Analysis 31, pp. 1:1–1:31. External Links: Document Cited by: §7.2.
[84] W. Hare, G. Jarry-Bolduc, and C. Planiden (2024) A matrix algebra approach to approximate Hessians. IMA Journal of Numerical Analysis 44 (4), pp. 2220–2250. External Links: Document Cited by: §4.5.
[85] W. Hare and G. Jarry-Bolduc (2020) Calculus identities for generalized simplex gradients: rules and applications. SIAM Journal on Optimization 30 (1), pp. 853–884. External Links: Document Cited by: §4.5.
[86] M. Hough and L. Roberts (2022) Model-based derivative-free methods for convex-constrained optimization. SIAM Journal on Optimization 32 (4), pp. 2552–2579. External Links: Document Cited by: §4.5, §5.3, §6.1, §6.1.1, §6.2, Lemma 6.8.
[87] P. Jaiswal and J. Larson (2024) Multistart algorithm for identifying all optima of nonconvex stochastic functions. Optimization Letters 18 (6), pp. 1335–1360. External Links: ISSN 1862-4472, 1862-4480, Document Cited by: §8.2.
[88] B. Jin, K. Scheinberg, and M. Xie (2025) Sample complexity analysis for adaptive optimization algorithms with stochastic oracles. Mathematical Programming 209 (1-2), pp. 651–679. External Links: Document Cited by: §7.2, §7.2.
[89] D. R. Jones, C. D. Perttunen, and B. E. Stuckman (1993) Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications 79 (1), pp. 157–181. External Links: Document Cited by: §1.1.
[90] D. Júdice (2015) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. Ph.D. Thesis, Universidade de Coimbra. Cited by: §3.3, §4.5.
[91] C. T. Kelley (2011) Implicit filtering. SIAM, Philadelphia. Cited by: §1.1.
[92] M. Kimiaei and A. Neumaier (2025) MATRS: heuristic methods for noisy derivative-free bound-constrained mixed-integer optimization. Mathematical Programming Computation. External Links: Document Cited by: §8.2.
[93] T. G. Kolda, R. M. Lewis, and V. Torczon (2003) Optimization by direct search: new perspectives on some classical and modern methods. SIAM Review 45 (3), pp. 385–482. External Links: Document Cited by: item Direct search methods, §1.1.
[94] D. Lakhmiri and S. Le Digabel (2022) Use of static surrogates in hyperparameter optimization. Operations Research Forum 3 (1), pp. 11. External Links: Document Cited by: §1.1.
[95] J. Larson and S. C. Billups (2016) Stochastic derivative-free optimization using a trust region framework. Computational Optimization and Applications 64 (3), pp. 619–645. External Links: Document Cited by: §7.2.
[96] J. Larson, M. Menickelly, and S. M. Wild (2019) Derivative-free optimization methods. Acta Numerica 28, pp. 287–404. Cited by: §1.1, §1.3, §6.2, §8.
[97] J. Larson, M. Menickelly, and B. Zhou (2021) Manifold sampling for optimizing nonsmooth nonconvex compositions. SIAM Journal on Optimization 31 (4), pp. 2638–2664. External Links: Document Cited by: §4.5, §4.5, item IBCDFO.
[98] J. Larson and M. Menickelly (2024) Structure-aware methods for expensive derivative-free nonsmooth composite optimization. Mathematical Programming Computation 16 (1), pp. 1–36. External Links: Document Cited by: §4.5, §4.5, item IBCDFO.
[99] J. Larson and S. M. Wild (2018) Asynchronously parallel optimization solver for finding multiple minima. Mathematical Programming Computation 10 (3), pp. 303–332. External Links: Document Cited by: §8.2.
[100] S. Le Digabel and S. M. Wild (2024) A taxonomy of constraints in black-box simulation-based optimization. Optimization and Engineering 25 (2), pp. 1125–1143. External Links: Document Cited by: §6.2.1.
[101] Y. Liu, K. H. Lam, and L. Roberts (2024) Black-box optimization algorithms for regularized least-squares problems. arXiv. Note: arXiv preprint 2407.14915 External Links: 2407.14915 Cited by: §4.5.
[102] G. Liuzzi and S. Lucidi (2025) Worst-case complexity analysis of derivative-free methods for multi-objective optimization. arXiv. External Links: 2505.17594, Document Cited by: §8.2.
[103] M. Locatelli and F. Schoen (2013) Global optimization: theory, algorithms, and applications. MOS-SIAM Series on Optimization, SIAM, Philadelphia. Cited by: §1.1.
[104] A. L. Marsden, M. Wang, J. E. Dennis, and P. Moin (2007) Trailing-edge noise reduction using derivative-free optimization and large-eddy simulation. Journal of Fluid Mechanics 572, pp. 13–36. External Links: Document Cited by: §1.1.
[105] J. J. Moré and S. M. Wild (2009) Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization 20 (1), pp. 172–191. External Links: Document Cited by: §1.1.
[106] J. J. Moré and S. M. Wild (2011) Estimating computational noise. SIAM Journal on Scientific Computing 33 (3), pp. 1292–1314. External Links: Document Cited by: §7.2, §7.2.
[107] J. J. Moré and S. M. Wild (2012) Estimating derivatives of noisy simulations. ACM Transactions on Mathematical Software 38 (3), pp. 1–21. External Links: Document Cited by: §7.2.
[108] S. Najafi and M. Hajarian (2026) Derivative-free optimization on Riemannian manifolds using simplex gradient approximations. Journal of Optimization Theory and Applications 208 (1), pp. 3. External Links: Document Cited by: §8.2.
[109] Y. Nesterov and V. Spokoiny (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566. External Links: Document Cited by: item Finite differencing/implicit filtering.
[110] Y. Nesterov (2004) Introductory lectures on convex optimization. Springer US. External Links: ISBN 978-1-4020-7553-7 Cited by: footnote 5.
[111] J. Nocedal and S. J. Wright (2006) Numerical optimization. 2nd edition, Springer Series in Operations Research and Financial Engineering, Springer, New York. Cited by: §1.1, §1.3, item 1, §2.1, §2.1, §2.1, §2.2, §2.3, Proposition 2.3, §3, §6.2, §7.1, §7.2, footnote 26.
[112] Numerical Algorithms Group (2019) Derivative-free optimization solver for calibration problems. Note: https://nag.com/derivative-free-optimization-dfo/ Cited by: §1.1.
[113] E. O. Omojokun (1989) Trust region algorithms for optimization with nonlinear equality and inequality constraints. Ph.D. Thesis, University of Colorado at Boulder. Cited by: §6.2.
[114] F. Oztoprak, R. Byrd, and J. Nocedal (2023) Constrained optimization in the presence of noise. SIAM Journal on Optimization 33 (3), pp. 2118–2136. External Links: Document Cited by: §7.2.
[115] M. J. D. Powell (1994) A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis, S. Gomez and J. Hennart (Eds.), pp. 51–67. External Links: Document Cited by: item Software of M. J. D. Powell.
[116] M. J. D. Powell (1998) Direct search algorithms for optimization calculations. Acta Numerica 7, pp. 287–336. External Links: Document Cited by: §1.1.
[117] M. J. D. Powell (2001) On the Lagrange functions of quadratic models that are defined by interpolation. Optimization Methods and Software 16 (1-4), pp. 289–309. External Links: Document Cited by: §5.3.
[118] M. J. D. Powell (2002) UOBYQA: unconstrained optimization by quadratic approximation. Mathematical Programming 92 (3), pp. 555–582. External Links: Document Cited by: §5.2, Remark 5.16, item Software of M. J. D. Powell, footnote 11.
[119] M. J. D. Powell (2004) Least Frobenius norm updating of quadratic models that satisfy interpolation conditions. Mathematical Programming 100 (1), pp. 183–215. External Links: Document Cited by: §4.5, Lemma 4.10, §5.3.
[120] M. J. D. Powell (2006) The NEWUOA software for unconstrained optimization without derivatives. In Large-Scale Nonlinear Optimization, P. Pardalos, G. Di Pillo, and M. Roma (Eds.), Vol. 83, pp. 255–297. External Links: Document Cited by: item Software of M. J. D. Powell.
[121] M. J. D. Powell (2009) The BOBYQA algorithm for bound constrained optimization without derivatives. Technical Report DAMTP 2009/NA06, University of Cambridge. Cited by: item Software of M. J. D. Powell.
[122] M. J. D. Powell (2015) On fast trust region methods for quadratic models with linear constraints. Mathematical Programming Computation 7 (3), pp. 237–267. External Links: Document Cited by: §6.2, item Software of M. J. D. Powell.
[123] T. M. Ragonneau and Z. Zhang (2024) PDFO: a cross-platform package for Powell’s derivative-free optimization solvers. Mathematical Programming Computation 16 (4), pp. 535–559. External Links: Document Cited by: §5.3, §6.2.1, item Software of M. J. D. Powell.
[124] T. M. Ragonneau (2022) Model-based derivative-free optimization methods and software. Ph.D. Thesis, Hong Kong Polytechnic University. Cited by: §6.2, §6.2, §6.2, item COBYQA [124], item COBYQA [124].
[125] C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, Massachusetts. External Links: ISBN 978-0-262-18253-9 Cited by: §4.5.
[126] R. G. Regis and C. A. Shoemaker (2007) A stochastic radial basis function method for the global optimization of expensive functions. INFORMS Journal on Computing 19 (4), pp. 497–509. External Links: Document Cited by: §1.3.
[127] R. G. Regis (2015) The calculus of simplex gradients. Optimization Letters 9 (5), pp. 845–865. External Links: Document Cited by: §4.5.
[128] F. Rinaldi, L. N. Vicente, and D. Zeffiro (2024) Stochastic trust-region and direct-search methods: a weak tail bound condition and reduced sample sizing. SIAM Journal on Optimization 34 (2), pp. 2067–2092. External Links: Document Cited by: §7.2.
[129] L. M. Rios and N. V. Sahinidis (2013) Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56 (3), pp. 1247–1293. External Links: Document Cited by: §1.1, §8.2.
[130] L. Roberts (2025) Model construction for convex-constrained derivative-free optimization. SIAM Journal on Optimization 35 (2), pp. 622–650. External Links: Document Cited by: 2nd item, §4.5, §5.2, §5.3, §6.1.1, §6.1.1, §6.2.
[131] K. Scheinberg and Ph. L. Toint (2010) Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization. SIAM Journal on Optimization 20 (6), pp. 3512–3532. External Links: Document Cited by: §5.3.
[132] A. E. Schwertner and F. N. C. Sobral (2024) On complexity constants of linear and quadratic models for derivative-free trust-region algorithms. Optimization Letters. External Links: Document Cited by: §4.5.
[133] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. External Links: Document Cited by: §1.3, §4.5.
[134] S. Shashaani, F. S. Hashemi, and R. Pasupathy (2018) ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM Journal on Optimization 28 (4), pp. 3145–3176. External Links: Document Cited by: §7.2.
[135] H. M. Shi, Y. Xie, R. Byrd, and J. Nocedal (2022) A noise-tolerant quasi-Newton algorithm for unconstrained optimization. SIAM Journal on Optimization 32 (1), pp. 29–55. External Links: Document Cited by: §7.2.
[136] H. M. Shi, Y. Xie, M. Q. Xuan, and J. Nocedal (2022) Adaptive finite-difference interval estimation for noisy derivative-free optimization. SIAM Journal on Scientific Computing 44 (4), pp. A2302–A2321. External Links: Document Cited by: §7.2.
[137] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. DeBardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen (2014) Addressing failures in exascale computing. The International Journal of High Performance Computing Applications 28 (2), pp. 129–173. External Links: Document Cited by: §6.2.1.
[138] W. Squire and G. Trapp (1998) Using complex variables to estimate derivatives of real functions. SIAM Review 40 (1), pp. 110–112. Cited by: footnote 3.
[139] C. P. Stephens and W. Baritompa (1998) Global optimization requires global information. Journal of Optimization Theory and Applications 96 (3), pp. 575–588. External Links: Document Cited by: §1.1.
[140] S. Sun and J. Nocedal (2023) A trust region method for noisy unconstrained optimization. Mathematical Programming 202 (1-2), pp. 445–472. External Links: Document Cited by: §7.2.
[141] S. F. B. Tett, J. M. Gregory, N. Freychet, C. Cartis, M. J. Mineter, and L. Roberts (2022) Does model calibration reduce uncertainty in climate projections?. Journal of Climate 35 (8), pp. 2585–2602. External Links: Document Cited by: Example 1.1.
[142] P. L. Toint (1988) Global convergence of a class of trust-region methods for nonconvex minimization in Hilbert space. IMA Journal of Numerical Analysis 8 (2), pp. 231–252. External Links: Document Cited by: footnote 21.
[143] J. J. Torres, G. Nannicini, E. Traversi, and R. Wolfler Calvo (2024) A trust-region framework for derivative-free mixed-integer optimization. Mathematical Programming Computation. External Links: Document Cited by: §8.2.
[144] L. N. Trefethen (2020) Approximation theory and approximation practice. SIAM, Philadelphia. Cited by: §5.3, Proposition 5.1.
[145] A. Tröltzsch (2016) A sequential quadratic programming algorithm for equality-constrained optimization without derivatives. Optimization Letters 10 (2), pp. 383–399. External Links: Document Cited by: §6.2, §6.2.
[146] S. M. Wild, R. G. Regis, and C. A. Shoemaker (2008) ORBIT: optimization by radial basis function interpolation in trust-regions. SIAM Journal on Scientific Computing 30 (6), pp. 3197–3219. External Links: Document Cited by: §4.4, §4.5.
[147] S. M. Wild and C. Shoemaker (2013) Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Review 55 (2), pp. 349–371. External Links: Document Cited by: §4.4, §4.5.
[148] S. M. Wild (2008) MNH: a derivative-free optimization algorithm using minimal norm Hessians. In Tenth Copper Mountain Conference on Iterative Methods, Cited by: 1st item.
[149] S. M. Wild (2017) POUNDERS in TAO: solving derivative-free nonlinear least-squares problems with POUNDERS. In Advances and Trends in Optimization with Engineering Applications, T. Terlaky, M. F. Anjos, and S. Ahmed (Eds.), MOS-SIAM Book Series on Optimization, Vol. 24, pp. 529–539. Cited by: §4.5, item IBCDFO.
[150] P. Xie and S. M. Wild (2025) ReMU: regional minimal updating for model-based derivative-free optimization. arXiv. Note: arXiv preprint arXiv:2504.03606 External Links: 2504.03606, Document Cited by: §4.5.
[151] P. Xie and Y. Yuan (2024) A new two-dimensional model-based subspace method for large-scale unconstrained derivative-free optimization: 2D-MoSub. arXiv. Note: arXiv preprint arXiv:2309.14855 External Links: 2309.14855 Cited by: §3.3.
[152] P. Xie and Y. Yuan (2025) A derivative-free method using a new underdetermined quadratic interpolation model. SIAM Journal on Optimization 35 (2), pp. 1110–1133. External Links: Document Cited by: §4.5.
[153] P. Xie and Y. Yuan (2025) Least $H^{2}$ norm updating of quadratic interpolation models for derivative-free trust-region algorithms. IMA Journal of Numerical Analysis, pp. drae106. External Links: Document Cited by: §4.5.
[154] Y. Yuan (2015) Recent advances in trust region algorithms. Mathematical Programming 151 (1), pp. 249–281. External Links: Document Cited by: §2.3.
[155] H. Zhang, A. R. Conn, and K. Scheinberg (2010) A derivative-free algorithm for least-squares minimization. SIAM Journal on Optimization 20 (6), pp. 3555–3576. External Links: Document Cited by: §4.5, §4.5, footnote 24.
[156] Q. Zhang and P. Xie (2024) On the relationship between $\Lambda$ -poisedness in derivative-free optimization and outliers in local outlier factor. arXiv. External Links: 2407.17529 Cited by: §5.3.
[157] Z. Zhang (2023) PRIMA: reference implementation for Powell’s methods with modernization and amelioration. Note: available at http://www.libprima.net, DOI: 10.5281/zenodo.8052654 Cited by: item Software of M. J. D. Powell.
[158] Z. Zhang (2014) Sobolev seminorm of quadratic functions with applications to derivative-free optimization. Mathematical Programming 146 (1-2), pp. 77–96. External Links: Document Cited by: §4.5.
[159] Z. Zhang (2025) Scalable derivative-free optimization algorithms with low-dimensional subspace techniques. arXiv. Note: arXiv preprint arXiv:2501.04536 External Links: 2501.04536, Document Cited by: §3.3.

Appendix A Technical Results

Here, we collect some technical results used in the main text.

Lemma A.1.

Suppose we have two linear functions $m_{i}:\mathbb{R}^{n}\to\mathbb{R}$ for $i\in\{1,2\}$ , defined as $m_{i}(\bm{y}):=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x})$ , such that $|m_{1}(\bm{y})-m_{2}(\bm{y})|\leq\epsilon$ for all $\bm{y}\in B(\bm{x},\Delta)$ , for some $\bm{x}\in\mathbb{R}^{n}$ and $\Delta>0$ . Then

\displaystyle|c_{1}-c_{2}|\leq\epsilon,\hskip 18.49988pt\text{and}\hskip 18.49988pt\|\bm{g}_{1}-\bm{g}_{2}\|\leq\frac{2\epsilon}{\Delta}.

(A.1)

Proof.

First, we have $|c_{1}-c_{2}|=|m_{1}(\bm{x})-m_{2}(\bm{x})|\leq\epsilon$ . Next, if $\bm{g}_{1}\neq\bm{g}_{2}$ , choose $\bm{y}=\bm{x}+\Delta\frac{\bm{g}_{1}-\bm{g}_{2}}{\|\bm{g}_{1}-\bm{g}_{2}\|}$ , to ensure that $\|\bm{y}-\bm{x}\|=\Delta$ and $\bm{y}-\bm{x}$ is parallel to $\bm{g}_{1}-\bm{g}_{2}$ . This gives

\displaystyle\Delta\|\bm{g}_{1}-\bm{g}_{2}\|=|(\bm{g}_{1}-\bm{g}_{2})^{T}(\bm{y}-\bm{x})|\leq|m_{1}(\bm{y})-m_{2}(\bm{y})|+|c_{1}-c_{2}|\leq 2\epsilon.

(A.2)

If instead $\bm{g}_{1}=\bm{g}_{2}$ then the bound on $\|\bm{g}_{1}-\bm{g}_{2}\|$ is trivial. ∎
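As a quick numerical sanity check of Lemma A.1, the following minimal sketch draws random pairs of linear models and tests both bounds in (A.1). It relies on the observation (not stated in the text) that for a linear difference the maximum of $|m_{1}-m_{2}|$ over $B(\bm{x},\Delta)$ equals $|c_{1}-c_{2}|+\Delta\|\bm{g}_{1}-\bm{g}_{2}\|$; the function name `linear_gap_bounds` and all numerical values are hypothetical.

```python
import math
import random

def linear_gap_bounds(c1, g1, c2, g2, delta):
    """Return (eps, |c1 - c2|, ||g1 - g2||) for two linear models on B(x, delta).

    eps is the exact maximum of |m1(y) - m2(y)| over the ball, which for a
    linear difference equals |c1 - c2| + delta * ||g1 - g2||.
    """
    dc = abs(c1 - c2)
    dg = math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))
    eps = dc + delta * dg
    return eps, dc, dg

random.seed(0)
n, delta = 5, 0.3  # illustrative dimension and radius
for _ in range(1000):
    c1, c2 = random.uniform(-1, 1), random.uniform(-1, 1)
    g1 = [random.uniform(-1, 1) for _ in range(n)]
    g2 = [random.uniform(-1, 1) for _ in range(n)]
    eps, dc, dg = linear_gap_bounds(c1, g1, c2, g2, delta)
    assert dc <= eps                 # first bound in (A.1)
    assert dg <= 2 * eps / delta     # second bound in (A.1)
```

In this linear setting the gradient bound holds with room to spare; the factor 2 in (A.1) comes from bounding $|c_{1}-c_{2}|$ and $|m_{1}(\bm{y})-m_{2}(\bm{y})|$ separately in (A.2).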

Lemma A.2.

Suppose we have two quadratic functions $m_{i}:\mathbb{R}^{n}\to\mathbb{R}$ for $i\in\{1,2\}$ , defined as $m_{i}(\bm{y}):=c_{i}+\bm{g}_{i}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}\bm{H}_{i}(\bm{y}-\bm{x})$ , such that $|m_{1}(\bm{y})-m_{2}(\bm{y})|\leq\epsilon$ for all $\bm{y}\in B(\bm{x},\Delta)$ , for some $\bm{x}\in\mathbb{R}^{n}$ and $\Delta>0$ . Then

\displaystyle|c_{1}-c_{2}|\leq\epsilon,\hskip 18.49988pt\|\bm{g}_{1}-\bm{g}_{2}\|\leq\frac{10\epsilon}{\Delta},\hskip 18.49988pt\text{and}\hskip 18.49988pt\|\bm{H}_{1}-\bm{H}_{2}\|\leq\frac{24\epsilon}{\Delta^{2}}.

(A.3)

Proof.

First, we have $|c_{1}-c_{2}|=|m_{1}(\bm{x})-m_{2}(\bm{x})|\leq\epsilon$ . Hence for any $\bm{y}\in B(\bm{x},\Delta)$ we have

\displaystyle\left|(\bm{y}-\bm{x})^{T}\left[\bm{g}_{1}+\frac{1}{2}\bm{H}_{1}(\bm{y}-\bm{x})-\bm{g}_{2}-\frac{1}{2}\bm{H}_{2}(\bm{y}-\bm{x})\right]\right|\leq|m_{1}(\bm{y})-m_{2}(\bm{y})|+|c_{1}-c_{2}|\leq 2\epsilon.

(A.4)

Now, define $\hat{\bm{u}}:=\frac{\bm{g}_{1}-\bm{g}_{2}}{\|\bm{g}_{1}-\bm{g}_{2}\|}$ if $\bm{g}_{1}\neq\bm{g}_{2}$ , or any unit vector otherwise. Similarly, define $\hat{\bm{v}}$ to be a unit eigenvector corresponding to the largest eigenvalue in magnitude of $\bm{H}_{1}-\bm{H}_{2}$ . Hence we have $\hat{\bm{u}}^{T}(\bm{g}_{1}-\bm{g}_{2})=\|\bm{g}_{1}-\bm{g}_{2}\|$ and $|\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}|=\|\bm{H}_{1}-\bm{H}_{2}\|$ .

Now applying (A.4) to $\bm{y}=\bm{x}+\frac{\Delta}{2}\hat{\bm{u}}$ and $\bm{y}=\bm{x}+\Delta\hat{\bm{v}}$ , we get

	$\displaystyle\left|\frac{\Delta}{2}\underbrace{\hat{\bm{u}}^{T}(\bm{g}_{1}-\bm{g}_{2})}_{=\|\bm{g}_{1}-\bm{g}_{2}\|}+\frac{\Delta^{2}}{8}\hat{\bm{u}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{u}}\right|$	$\displaystyle\leq 2\epsilon,\qquad\text{and}$		(A.5)
	$\displaystyle\left|\Delta\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})+\frac{\Delta^{2}}{2}\underbrace{\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}}_{=\pm\|\bm{H}_{1}-\bm{H}_{2}\|}\right|$	$\displaystyle\leq 2\epsilon,$		(A.6)

respectively. The first inequality (A.5) gives us

	$\displaystyle\frac{\Delta}{2}\|\bm{g}_{1}-\bm{g}_{2}\|-\frac{\Delta^{2}}{8}\left|\hat{\bm{u}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{u}}\right|$	$\displaystyle\leq\left|\frac{\Delta}{2}\|\bm{g}_{1}-\bm{g}_{2}\|-\frac{\Delta^{2}}{8}\left|-\hat{\bm{u}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{u}}\right|\right|,$		(A.7)
		$\displaystyle\leq\left|\frac{\Delta}{2}\|\bm{g}_{1}-\bm{g}_{2}\|+\frac{\Delta^{2}}{8}\hat{\bm{u}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{u}}\right|\leq 2\epsilon,$		(A.8)

where the second inequality follows from the reverse triangle inequality. For the second condition (A.6), we first suppose $\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}\geq 0$ , so $\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}=\|\bm{H}_{1}-\bm{H}_{2}\|$ . In that case,

\displaystyle\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|

\displaystyle=\left(\Delta\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})+\frac{\Delta^{2}}{2}\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}\right)-\Delta\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})\leq 2\epsilon+\Delta|\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})|,

(A.9)

and in the other case $\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}<0$ , for which $\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}=-\|\bm{H}_{1}-\bm{H}_{2}\|$ , we reach the same conclusion via

\displaystyle\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|

\displaystyle=-\left(\Delta\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})+\frac{\Delta^{2}}{2}\hat{\bm{v}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{v}}\right)+\Delta\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})\leq 2\epsilon+\Delta|\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})|.

(A.10)

Since $|\hat{\bm{v}}^{T}(\bm{g}_{1}-\bm{g}_{2})|\leq\|\bm{g}_{1}-\bm{g}_{2}\|$ by Cauchy-Schwarz and $|\hat{\bm{u}}^{T}(\bm{H}_{1}-\bm{H}_{2})\hat{\bm{u}}|\leq\|\bm{H}_{1}-\bm{H}_{2}\|$ from Rayleigh quotients, we ultimately conclude

\displaystyle\frac{\Delta}{2}\|\bm{g}_{1}-\bm{g}_{2}\|\leq 2\epsilon+\frac{\Delta^{2}}{8}\|\bm{H}_{1}-\bm{H}_{2}\|,\hskip 18.49988pt\text{and}\hskip 18.49988pt\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|\leq 2\epsilon+\Delta\|\bm{g}_{1}-\bm{g}_{2}\|.

(A.11)

The first of these conditions implies $\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|\geq 2\Delta\|\bm{g}_{1}-\bm{g}_{2}\|-8\epsilon$ , and so $2\Delta\|\bm{g}_{1}-\bm{g}_{2}\|-8\epsilon\leq\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|\leq 2\epsilon+\Delta\|\bm{g}_{1}-\bm{g}_{2}\|$ . This gives $\Delta\|\bm{g}_{1}-\bm{g}_{2}\|\leq 10\epsilon$ . Lastly, we apply $\frac{\Delta^{2}}{2}\|\bm{H}_{1}-\bm{H}_{2}\|\leq 2\epsilon+\Delta\|\bm{g}_{1}-\bm{g}_{2}\|\leq 12\epsilon$ , and we get the desired result. ∎
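The proof only evaluates $m_{1}-m_{2}$ at the three points $\bm{x}$, $\bm{x}+\frac{\Delta}{2}\hat{\bm{u}}$ and $\bm{x}+\Delta\hat{\bm{v}}$, so the bounds (A.3) can be tested numerically with $\epsilon$ replaced by the largest model difference over those three points. The sketch below does this under a simplifying assumption that is ours and not the text's: $\bm{H}_{1}-\bm{H}_{2}$ is taken to be diagonal, so that $\hat{\bm{v}}$ is a coordinate vector and no eigenvalue solver is needed. The function name `check_lemma_a2` and all numerical values are hypothetical.

```python
import math
import random

def check_lemma_a2(dc, dg, d_diag, delta):
    """Quantities in (A.3) for a difference of quadratics m1 - m2.

    dc = c1 - c2, dg = g1 - g2, d_diag = diagonal of H1 - H2 (assumed diagonal).
    Returns (eps, grad_norm, hess_norm), where eps is the largest |m1 - m2|
    over the three points used in the proof.
    """
    n = len(dg)

    def diff(s):  # value of (m1 - m2) at x + s
        lin = sum(gi * si for gi, si in zip(dg, s))
        quad = 0.5 * sum(di * si * si for di, si in zip(d_diag, s))
        return dc + lin + quad

    grad_norm = math.sqrt(sum(gi * gi for gi in dg))
    hess_norm = max(abs(di) for di in d_diag)  # spectral norm of a diagonal matrix
    # u_hat: normalized gradient difference (any unit vector if it vanishes)
    if grad_norm > 0:
        u_hat = [gi / grad_norm for gi in dg]
    else:
        u_hat = [1.0] + [0.0] * (n - 1)
    # v_hat: coordinate vector for the diagonal entry of largest magnitude
    j = max(range(n), key=lambda i: abs(d_diag[i]))
    v_hat = [1.0 if i == j else 0.0 for i in range(n)]
    points = [
        [0.0] * n,
        [0.5 * delta * ui for ui in u_hat],
        [delta * vi for vi in v_hat],
    ]
    eps = max(abs(diff(s)) for s in points)
    return eps, grad_norm, hess_norm

random.seed(1)
delta = 0.7  # illustrative radius
for _ in range(1000):
    n = random.randint(1, 6)
    dc = random.uniform(-1, 1)
    dg = [random.uniform(-1, 1) for _ in range(n)]
    d_diag = [random.uniform(-1, 1) for _ in range(n)]
    eps, gn, hn = check_lemma_a2(dc, dg, d_diag, delta)
    assert gn <= 10 * eps / delta + 1e-12
    assert hn <= 24 * eps / delta**2 + 1e-12
```

Since the three-point maximum is no larger than the maximum over the whole ball, passing these checks is a stronger statement than (A.3) itself for the sampled instances.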

Appendix B Poisedness of Structured Fully Quadratic Models

Here, we explicitly estimate the poisedness constant $\Lambda$ for the structured fully quadratic interpolation set (4.48), namely

	$\displaystyle\mathcal{Y}$	$\displaystyle=\{\bm{x},\bm{x}+\Delta\bm{e}_{1},\ldots,\bm{x}+\Delta\bm{e}_{n},\bm{x}-\Delta\bm{e}_{1},\ldots,\bm{x}-\Delta\bm{e}_{n},$
		$\displaystyle\bm{x}+\Delta(\bm{e}_{1}+\bm{e}_{2}),\ldots,\bm{x}+\Delta(\bm{e}_{1}+\bm{e}_{n}),\bm{x}+\Delta(\bm{e}_{2}+\bm{e}_{3}),\ldots,\bm{x}+\Delta(\bm{e}_{n-1}+\bm{e}_{n})\}.$		(B.1)

We may without loss of generality assume $\bm{x}=\bm{0}$ and $\Delta=1$ , since the values of the Lagrange polynomials are invariant to shifts and scalings. That is, we take our interpolation set to be

\displaystyle\mathcal{Y}

\displaystyle=\{\bm{0}\}\cup\{\pm\bm{e}_{i}:i=1,\ldots,n\}\cup\{\bm{e}_{i}+\bm{e}_{j}:i,j=1,\ldots,n,\>j>i\}.

(B.2)

We recall the proof of Lemma 4.7, which explicitly computes the coefficients for the associated interpolation linear system: if we wish to interpolate values $m(\bm{y})=f(\bm{y})$ for all $\bm{y}\in\mathcal{Y}$ , then (with our normalization $\bm{x}=\bm{0}$ and $\Delta=1$ ) we get $c=f(\bm{0})$ , $g_{i}=\frac{1}{2}(f(\bm{e}_{i})-f(-\bm{e}_{i}))$ and $H_{i,i}=f(\bm{e}_{i})+f(-\bm{e}_{i})-2f(\bm{0})$ for $i=1,\ldots,n$ , and $H_{i,j}=f(\bm{e}_{i}+\bm{e}_{j})-c-g_{i}-g_{j}-\frac{1}{2}H_{i,i}-\frac{1}{2}H_{j,j}$ for $i,j=1,\ldots,n$ with $i\neq j$ . From these explicit formulae, the Lagrange polynomials for our interpolation set are:

•

For the interpolation point $\bm{0}$ ,

\displaystyle\ell_{\bm{0}}(\bm{x})=1-\sum_{i=1}^{n}x_{i}^{2}+\frac{1}{2}\sum_{\begin{subarray}{c}i,j=1\\ j\neq i\end{subarray}}^{n}x_{i}x_{j},

(B.3)

where we note that the Hessian $\nabla^{2}\ell_{\bm{0}}$ has entries $[\nabla^{2}\ell_{\bm{0}}]_{i,j}=1$ if $i\neq j$ and $-2$ if $i=j$ , and is a circulant matrix [75];

•

For the interpolation points $\bm{e}_{i}$ , $i=1,\ldots,n$ ,

\displaystyle\ell_{\bm{e}_{i}}(\bm{x})=\frac{1}{2}x_{i}+\frac{1}{2}x_{i}^{2}-\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}x_{i}x_{j};

(B.4)

•

For the interpolation points $-\bm{e}_{i}$ , $i=1,\ldots,n$ ,

$\displaystyle\ell_{-\bm{e}_{i}}(\bm{x})=-\frac{1}{2}x_{i}+\frac{1}{2}x_{i}^{2};$ (B.5)
•

For the interpolation points $\bm{e}_{i}+\bm{e}_{j}$ , $i,j=1,\ldots,n$ with $j>i$ ,

$\displaystyle\ell_{\bm{e}_{i}+\bm{e}_{j}}(\bm{x})=x_{i}x_{j}.$ (B.6)
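The explicit coefficient formulae recalled above can be checked directly. The following sketch, with hypothetical helper names (`interpolation_set`, `fit_quadratic`, `lagrange_models`) and an illustrative choice of $n=4$, builds the set (B.2), fits a quadratic to indicator data for each interpolation point, and confirms the cardinal property $\ell_{\bm{y}}(\bm{z})=1$ if $\bm{z}=\bm{y}$ and $0$ otherwise, together with the closed form (B.3) at a random point.

```python
import itertools
import random

def unit(n, i, scale=1.0):
    """scale * e_i as a tuple."""
    return tuple(scale if k == i else 0.0 for k in range(n))

def interpolation_set(n):
    """Points of (B.2): 0, +e_i, -e_i, and e_i + e_j for j > i."""
    pts = [tuple([0.0] * n)]
    pts += [unit(n, i) for i in range(n)]
    pts += [unit(n, i, -1.0) for i in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        pts.append(tuple(1.0 if k in (i, j) else 0.0 for k in range(n)))
    return pts

def fit_quadratic(values, n):
    """Coefficients (c, g, H) from values on the set, via the explicit formulae."""
    c = values[tuple([0.0] * n)]
    g = [0.5 * (values[unit(n, i)] - values[unit(n, i, -1.0)]) for i in range(n)]
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        H[i][i] = values[unit(n, i)] + values[unit(n, i, -1.0)] - 2.0 * c
    for i, j in itertools.combinations(range(n), 2):
        p = tuple(1.0 if k in (i, j) else 0.0 for k in range(n))
        H[i][j] = H[j][i] = (values[p] - c - g[i] - g[j]
                             - 0.5 * H[i][i] - 0.5 * H[j][j])
    return c, g, H

def evaluate(model, x):
    """Evaluate c + g^T x + 0.5 x^T H x."""
    c, g, H = model
    n = len(x)
    lin = sum(g[i] * x[i] for i in range(n))
    quad = 0.5 * sum(H[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return c + lin + quad

def lagrange_models(n):
    """One quadratic per interpolation point, fitted to indicator data."""
    pts = interpolation_set(n)
    return {y: fit_quadratic({z: 1.0 if z == y else 0.0 for z in pts}, n)
            for y in pts}

n = 4
pts = interpolation_set(n)
models = lagrange_models(n)
# Cardinal property of the Lagrange polynomials on the interpolation set.
for y in pts:
    for z in pts:
        assert abs(evaluate(models[y], z) - (1.0 if z == y else 0.0)) < 1e-12
# Agreement with the closed form (B.3) at a random point.
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(n)]
ell0 = 1 - sum(t * t for t in x) + 0.5 * sum(
    x[i] * x[j] for i in range(n) for j in range(n) if i != j)
assert abs(evaluate(models[pts[0]], x) - ell0) < 1e-12
```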

We now maximize the magnitude of each Lagrange polynomial $\ell_{\bm{y}}(\bm{x})$ over $\bm{x}\in B(\bm{0},1)$ individually for all $\bm{y}\in\mathcal{Y}$ .

For $\ell_{\bm{0}}$ , we observe that $\nabla\ell_{\bm{0}}(\bm{0})=\bm{0}$ and so

\displaystyle|\ell_{\bm{0}}(\bm{x})|\leq 1+\frac{1}{2}\|\bm{x}\|^{2}\|\nabla^{2}\ell_{\bm{0}}\|\leq 1+\frac{1}{2}\|\nabla^{2}\ell_{\bm{0}}\|.

(B.7)

The Gershgorin circle theorem allows us to estimate³⁵ $\|\nabla^{2}\ell_{\bm{0}}\|\leq n+1$ , and so $\max_{\bm{x}\in B(\bm{0},1)}|\ell_{\bm{0}}(\bm{x})|\leq 1+\frac{n+1}{2}$ . For $\ell_{\bm{e}_{i}}$ , we note that $\sum_{j\neq i}x_{j}^{2}\leq 1-x_{i}^{2}$ for all $\bm{x}\in B(\bm{0},1)$ , and under this constraint the term $\sum_{j\neq i}x_{i}x_{j}=x_{i}\sum_{j\neq i}x_{j}$ is maximized/minimized if all $x_{j}$ ( $j\neq i$ ) are equal, $x_{j}=\pm\sqrt{\frac{1-x_{i}^{2}}{n-1}}$ , in which case $\sum_{j\neq i}x_{j}=\pm\sqrt{n-1}\sqrt{1-x_{i}^{2}}$ . So, for a given value of $x_{i}$ ,

\displaystyle|\ell_{\bm{e}_{i}}(\bm{x})|\leq\frac{1}{2}|x_{i}|+\frac{1}{2}x_{i}^{2}+\sqrt{n-1}\,|x_{i}|\sqrt{1-x_{i}^{2}},

(B.8)

(where the last term is zero if $n=1$ ). Since $x_{i}\in[-1,1]$ and $|x_{i}|\sqrt{1-x_{i}^{2}}\leq\frac{1}{2}$ we may estimate

\displaystyle\max_{\bm{x}\in B(\bm{0},1)}|\ell_{\bm{e}_{i}}(\bm{x})|\leq\frac{1}{2}+\frac{1}{2}+\frac{\sqrt{n-1}}{2}=1+\frac{\sqrt{n-1}}{2}.

(B.9)

³⁵ The exact value $\|\nabla^{2}\ell_{\bm{0}}\|=n-3$ (for $n$ sufficiently large) may be calculated using the explicit formula for eigenvalues of circulant matrices [75], but this is not needed for our estimate here.

For $\ell_{-\bm{e}_{i}}$ , we note that we get the same Lagrange polynomial as the structured minimum Frobenius norm quadratic interpolation set in (5.42), which gives $\max_{\bm{x}\in B(\bm{0},1)}|\ell_{-\bm{e}_{i}}(\bm{x})|\leq 1$ . Lastly, for $\ell_{\bm{e}_{i}+\bm{e}_{j}}$ , we observe that $|x_{i}|,|x_{j}|\leq 1$ for $\bm{x}\in B(\bm{0},1)$ , and so $\max_{\bm{x}\in B(\bm{0},1)}|\ell_{\bm{e}_{i}+\bm{e}_{j}}(\bm{x})|\leq 1$ .

All together, we have determined that this interpolation set has poisedness constant $\Lambda\leq\max(1+\frac{\sqrt{n-1}}{2},1+\frac{n+1}{2})=1+\frac{n+1}{2}=\mathcal{O}(n)$ . This $\mathcal{O}(n)$ bound is tight: for example, with $\bm{e}$ the vector of all ones, (B.3) gives $\ell_{\bm{0}}(\bm{e}/\sqrt{n})=1-1+\frac{1}{2}(n-1)=(n-1)/2$ , which implies $\Lambda\geq(n-1)/2$ .
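The growth of $\ell_{\bm{0}}$ with the dimension can also be observed numerically. The sketch below evaluates the closed form (B.3) along the direction $\bm{e}/\sqrt{n}$ (taking $\bm{e}$ to be the vector of all ones, which is our reading of the notation) and at random points of the unit ball, and compares both with the upper bound $1+\frac{n+1}{2}$ obtained from (B.7). The sampler and the chosen dimensions are illustrative assumptions; random sampling only gives a lower estimate of the true maximum.

```python
import math
import random

def ell0(x):
    """Lagrange polynomial (B.3) for the interpolation point 0."""
    s1 = sum(x)
    s2 = sum(t * t for t in x)
    # sum_{i != j} x_i x_j = (sum_i x_i)^2 - sum_i x_i^2
    return 1.0 - s2 + 0.5 * (s1 * s1 - s2)

def random_point_in_unit_ball(n, rng):
    """Uniform direction with radius drawn from [0, 1] (not uniform in volume)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(t * t for t in v))
        if norm > 0:
            break
    r = rng.random()
    return [r * t / norm for t in v]

rng = random.Random(0)
for n in (2, 5, 10, 50):
    upper = 1.0 + (n + 1) / 2.0              # bound from (B.7) and Gershgorin
    diag = ell0([1.0 / math.sqrt(n)] * n)    # value along e / sqrt(n)
    sampled = max(abs(ell0(random_point_in_unit_ball(n, rng)))
                  for _ in range(2000))
    assert abs(diag) <= upper + 1e-12
    assert sampled <= upper + 1e-12
    print(n, round(diag, 6), round(sampled, 6), upper)
```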

Appendix C Proof of Theorem 7.6

Lemma C.1.

Suppose Assumptions 2.4, 2.9, 7.1 and 7.5 hold. If $\Delta_{k}\leq\min(\frac{1}{\mu_{c}},\frac{1}{\kappa_{H}})\|\bm{g}_{k}\|$ and

\displaystyle\|\bm{g}_{k}\|\geq\frac{2\kappa_{\textnormal{mf}}\Delta_{k}}{\kappa_{s}(1-\eta_{S})}+\frac{2(\tilde{\kappa}_{\textnormal{mf}}+1)\epsilon_{f}}{\kappa_{s}(1-\eta_{S})\Delta_{k}},

(C.1)

then iteration $k$ is successful (i.e. $\rho_{k}\geq\eta_{S}$ and $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ ).

Proof.

That $\|\bm{g}_{k}\|\geq\mu_{c}\Delta_{k}$ holds follows by assumption on $\Delta_{k}$ , so it remains to show $\rho_{k}\geq\eta_{S}$ .

Since $\Delta_{k}\leq\|\bm{g}_{k}\|/\kappa_{H}$ we have $m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}$ from Assumption 2.9. We then compute

$\displaystyle|\rho_{k}-1|$	$\displaystyle\leq\frac{|\tilde{f}(\bm{x}_{k})-f(\bm{x}_{k})|+|f(\bm{x}_{k})-m_{k}(\bm{x}_{k})|+|\tilde{f}(\bm{x}_{k}+\bm{s}_{k})-f(\bm{x}_{k}+\bm{s}_{k})|+|f(\bm{x}_{k}+\bm{s}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})|}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})},$	(C.2)
	$\displaystyle\leq\frac{2\epsilon_{f}+2\kappa_{\textnormal{mf}}\Delta_{k}^{2}+2\tilde{\kappa}_{\textnormal{mf}}\epsilon_{f}}{\kappa_{s}\|\bm{g}_{k}\|\Delta_{k}},$	(C.3)
	$\displaystyle\leq 1-\eta_{S},$	(C.4)

where the last inequality follows from (C.1), and so $\rho_{k}\geq\eta_{S}$ . ∎
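The step from (C.3) to (C.4) can be illustrated numerically: when $\|\bm{g}_{k}\|$ equals the right-hand side of (C.1), the bound (C.3) is exactly $1-\eta_{S}$, and it decreases as $\|\bm{g}_{k}\|$ grows. The following sketch checks this; the function names and all constants are hypothetical values chosen only for illustration.

```python
def success_threshold(delta, eps_f, k_mf, k_mf_tilde, k_s, eta_s):
    """Right-hand side of (C.1)."""
    scale = k_s * (1.0 - eta_s)
    return (2.0 * k_mf * delta / scale
            + 2.0 * (k_mf_tilde + 1.0) * eps_f / (scale * delta))

def rho_gap_bound(g_norm, delta, eps_f, k_mf, k_mf_tilde, k_s):
    """Upper bound (C.3) on |rho_k - 1|."""
    numerator = 2.0 * eps_f + 2.0 * k_mf * delta**2 + 2.0 * k_mf_tilde * eps_f
    return numerator / (k_s * g_norm * delta)

# Hypothetical constants, for illustration only.
params = dict(eps_f=1e-3, k_mf=2.0, k_mf_tilde=1.0, k_s=0.5)
eta_s, delta = 0.1, 0.1
g_star = success_threshold(delta, eta_s=eta_s, **params)
# At the threshold, the bound (C.3) equals 1 - eta_S.
assert abs(rho_gap_bound(g_star, delta, **params) - (1.0 - eta_s)) < 1e-12
# For larger model gradients the bound is strictly smaller.
assert rho_gap_bound(2.0 * g_star, delta, **params) < 1.0 - eta_s
```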

Lemma C.2.

Suppose Assumptions 2.4, 2.9, 7.1 and 7.5 hold. If $\|\nabla f(\bm{x}_{k})\|\geq\epsilon$ for all $k=0,\ldots,K-1$ with

\displaystyle\epsilon>\epsilon_{\min}:=\frac{2\sqrt{C_{0}C_{1}}}{\sqrt{1-\left(\frac{1-\gamma_{\textnormal{dec}}}{1+\gamma_{\textnormal{dec}}}\right)^{2}}},

(C.5)

where

\displaystyle C_{0}:=\left(\max\left(\mu_{c},\kappa_{H},\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})}\right)+\kappa_{\textnormal{mg}}\right),\qquad\text{and}\qquad C_{1}:=\left(\frac{2(\tilde{\kappa}_{\textnormal{mf}}+1)\epsilon_{f}}{\kappa_{s}(1-\eta_{S})}+\tilde{\kappa}_{\textnormal{mg}}\epsilon_{f}\right),

(C.6)

and $\Delta_{0}$ is sufficiently large (specifically, $\Delta_{0}\geq\Delta_{\min}(\epsilon)$ ), then

\displaystyle\Delta_{k}\geq\Delta_{\min}(\epsilon):=\frac{1}{2}(1+\gamma_{\textnormal{dec}})\frac{\epsilon}{2C_{0}}-\frac{1}{2}(1-\gamma_{\textnormal{dec}})\frac{\sqrt{\epsilon^{2}-4C_{0}C_{1}}}{2C_{0}}>0,

(C.7)

for all $k=0,\ldots,K$ .

Proof.

By assumption on $\Delta_{0}$ , the result holds for $k=0$ . By induction, suppose that $\Delta_{k}\geq\Delta_{\min}(\epsilon)$ for some $k\in\{0,\ldots,K-1\}$ and to find a contradiction suppose that $\Delta_{k+1}<\Delta_{\min}(\epsilon)$ .

Since $\Delta_{k}\geq\Delta_{\min}(\epsilon)>\Delta_{k+1}$ , iteration $k$ must have been unsuccessful, and so $\Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}$ . Thus, $\Delta_{\min}(\epsilon)\leq\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ .

From Assumption 7.5 we have

\displaystyle\epsilon\leq\|\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\|\bm{g}_{k}-\nabla f(\bm{x}_{k})\|\leq\|\bm{g}_{k}\|+\kappa_{\textnormal{mg}}\Delta_{k}+\frac{\tilde{\kappa}_{\textnormal{mg}}\epsilon_{f}}{\Delta_{k}}.

(C.8)

Since iteration $k$ was unsuccessful, from Lemma C.1 we have $\|\bm{g}_{k}\|<\max(\mu_{c},\kappa_{H})\Delta_{k}$ or

\displaystyle\|\bm{g}_{k}\|<\frac{2\kappa_{\textnormal{mf}}\Delta_{k}}{\kappa_{s}(1-\eta_{S})}+\frac{2(\tilde{\kappa}_{\textnormal{mf}}+1)\epsilon_{f}}{\kappa_{s}(1-\eta_{S})\Delta_{k}}.

(C.9)

In the first case, we get

\displaystyle\epsilon<(\max(\mu_{c},\kappa_{H})+\kappa_{\textnormal{mg}})\Delta_{k}+\frac{\tilde{\kappa}_{\textnormal{mg}}\epsilon_{f}}{\Delta_{k}},

(C.10)

and in the second case we get

\displaystyle\epsilon<\left(\frac{2\kappa_{\textnormal{mf}}}{\kappa_{s}(1-\eta_{S})}+\kappa_{\textnormal{mg}}\right)\Delta_{k}+\left(\frac{2(\tilde{\kappa}_{\textnormal{mf}}+1)\epsilon_{f}}{\kappa_{s}(1-\eta_{S})}+\tilde{\kappa}_{\textnormal{mg}}\epsilon_{f}\right)\frac{1}{\Delta_{k}}.

(C.11)

So, regardless of which case we are in, it must hold that

\displaystyle\epsilon<C_{0}\Delta_{k}+\frac{C_{1}}{\Delta_{k}},\qquad\text{or equivalently}\qquad C_{0}\Delta_{k}^{2}-\epsilon\Delta_{k}+C_{1}>0.

(C.12)

We claim that this contradicts $\Delta_{\min}(\epsilon)\leq\Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ .

Since $C_{0},C_{1},\epsilon>0$ and $\epsilon>\epsilon_{\min}>\sqrt{4C_{0}C_{1}}$ , the quadratic $C_{0}\Delta^{2}-\epsilon\Delta+C_{1}$ appearing in (C.12) is convex with two positive roots, say $\Delta^{-}<\Delta^{+}$ . Our assumption $\epsilon>\epsilon_{\min}$ ensures that $\Delta^{-}<\gamma_{\textnormal{dec}}\Delta^{+}$ , and our choice of $\Delta_{\min}(\epsilon)$ gives $\Delta_{\min}(\epsilon)=\frac{1}{2}(\Delta^{-}+\gamma_{\textnormal{dec}}\Delta^{+})\in(\Delta^{-},\gamma_{\textnormal{dec}}\Delta^{+})$ . So, (C.12) gives either $\Delta_{k}<\Delta^{-}<\Delta_{\min}(\epsilon)$ or $\Delta_{k}>\Delta^{+}>\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon)$ , which gives the desired contradiction. ∎

The proof of Theorem 7.6 is then identical to that of Theorem 3.7, using Lemma C.2 in place of Lemma 3.6.

Remark C.3.

Taking $\gamma_{\textnormal{dec}}\to 1^{-}$ decreases $\epsilon_{\min}$ and increases $\Delta_{\min}(\epsilon)$ (and hence decreases the worst-case complexity bound, $K=\mathcal{O}(\Delta_{\min}(\epsilon)^{-2})$ ). That is, using $\gamma_{\textnormal{dec}}\approx 1$ allows for higher-accuracy solutions and decreases the iteration complexity bound.
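The quantities in Lemma C.2 and the monotonicity claimed in Remark C.3 can be explored with a short computation. The sketch below implements (C.5) and (C.7), checks that $\Delta_{\min}(\epsilon)$ lies in $(\Delta^{-},\gamma_{\textnormal{dec}}\Delta^{+})$, checks that no radius in $[\Delta_{\min}(\epsilon),\gamma_{\textnormal{dec}}^{-1}\Delta_{\min}(\epsilon))$ satisfies (C.12), and then varies $\gamma_{\textnormal{dec}}$. The values of $C_{0}$, $C_{1}$ and $\gamma_{\textnormal{dec}}$ are hypothetical and chosen only for illustration.

```python
import math

def roots(eps, c0, c1):
    """Roots Delta^- < Delta^+ of C0 D^2 - eps D + C1 = 0 (needs eps^2 > 4 C0 C1)."""
    disc = math.sqrt(eps * eps - 4.0 * c0 * c1)
    return (eps - disc) / (2.0 * c0), (eps + disc) / (2.0 * c0)

def eps_min(c0, c1, gamma_dec):
    """Accuracy threshold (C.5)."""
    ratio = (1.0 - gamma_dec) / (1.0 + gamma_dec)
    return 2.0 * math.sqrt(c0 * c1) / math.sqrt(1.0 - ratio * ratio)

def delta_min(eps, c0, c1, gamma_dec):
    """Lower bound (C.7) on the trust-region radius."""
    disc = math.sqrt(eps * eps - 4.0 * c0 * c1)
    return (0.5 * (1.0 + gamma_dec) * eps
            - 0.5 * (1.0 - gamma_dec) * disc) / (2.0 * c0)

# Hypothetical constants, for illustration only.
c0, c1, gamma_dec = 3.0, 1e-4, 0.5
eps = 1.5 * eps_min(c0, c1, gamma_dec)
lo, hi = roots(eps, c0, c1)
dmin = delta_min(eps, c0, c1, gamma_dec)
assert 0 < lo < dmin < gamma_dec * hi
assert abs(dmin - 0.5 * (lo + gamma_dec * hi)) < 1e-12
# No radius in [dmin, dmin / gamma_dec) satisfies the strict inequality (C.12).
for t in range(100):
    d = dmin + (dmin / gamma_dec - dmin) * t / 100.0
    assert c0 * d * d - eps * d + c1 <= 0.0

# Remark C.3: eps_min decreases and Delta_min(eps) increases as gamma_dec -> 1.
gammas = [0.1, 0.5, 0.9, 0.99]
eps_fixed = 2.0 * eps_min(c0, c1, gammas[0])  # admissible for every gamma in the list
e_vals = [eps_min(c0, c1, g) for g in gammas]
d_vals = [delta_min(eps_fixed, c0, c1, g) for g in gammas]
assert all(a > b for a, b in zip(e_vals, e_vals[1:]))
assert all(a < b for a, b in zip(d_vals, d_vals[1:]))
```

In the limit $\gamma_{\textnormal{dec}}\to 1^{-}$ these formulae give $\epsilon_{\min}\to 2\sqrt{C_{0}C_{1}}$ and $\Delta_{\min}(\epsilon)\to\epsilon/(2C_{0})$.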
