A brief survey of methods for solving nonlinear least-squares problems

In this paper, we present a brief survey of methods for solving nonlinear least-squares problems, paying particular attention to methods that exploit the special structure of these problems. Most of the methods discussed belong to the quasi-Newton family, namely the structured quasi-Newton (SQN) methods. Our survey covers both traditional methods and more recent developments for nonlinear least-squares problems. At the end, we suggest a few topics for further research.


1. Introduction. This paper provides a survey of methods for finding a minimizer of the nonlinear least-squares problem (NLS). Specifically, we consider the problem of determining a minimizer x* ∈ R^n of the problem

   min_{x ∈ R^n} f(x) = (1/2) ||F(x)||^2 = (1/2) Σ_{i=1}^m F_i(x)^2.   (1)

Here F : R^n → R^m (m ≥ n), F(x) = (F_1(x), F_2(x), ..., F_m(x))^T, F_i : R^n → R, and F is a twice continuously differentiable mapping. This type of problem arises in many applications such as data fitting, optimal control, parameter estimation, experimental design, data assimilation, and imaging problems (see, e.g. [12,17,22,25,28,33,34,56] and references therein). These problems are solved either by general unconstrained minimization algorithms, or by specialized algorithms that take into account the particular structure of the objective function f [55,68]. The gradient and Hessian of the objective function (1) have a special form and are given by

   ∇f(x) = J(x)^T F(x),   (2)

   ∇^2 f(x) = J(x)^T J(x) + B(x),   where   B(x) = Σ_{i=1}^m F_i(x) ∇^2 F_i(x),   (3)

where F_i is the i-th component of F and ∇^2 F_i(x) is its Hessian, J(x) = ∇F(x)^T is the Jacobian of the so-called residual function F at x, and B(x) is a square matrix representing the second-order term of (3). Basically, the iterative scheme for solving (1) has the general form: given an initial approximation x_0, a sequence of iterates {x_k} is obtained via

   x_{k+1} = x_k + s_k,   (4)

where s_k = α_k d_k, α_k is the step length obtained by a suitable line search procedure, and d_k is the search direction. Throughout the text, we adopt the notation F_k = F(x_k), J_k = J(x_k) and g_k = ∇f(x_k) = J(x_k)^T F(x_k).
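To make these quantities concrete, the following sketch evaluates the objective f and its gradient from user-supplied residual and Jacobian callables. It is only an illustration under our own naming (`objective_and_gradient` is not from any particular package), assuming NumPy.

```python
import numpy as np

def objective_and_gradient(F, J, x):
    """Evaluate f(x) = (1/2)||F(x)||^2 and g(x) = J(x)^T F(x).

    F and J are callables returning the residual vector and the
    Jacobian matrix, respectively (hypothetical interface)."""
    Fx = F(x)
    Jx = J(x)
    f = 0.5 * Fx @ Fx        # sum of squared residuals, halved
    g = Jx.T @ Fx            # gradient uses only first derivatives
    return f, g
```

Note that the gradient requires only the Jacobian; the second-order term B(x) never appears, which is precisely the structure that the methods surveyed below exploit.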
It is vital to notice that the Hessian matrix (3) consists of two parts: the first-order term contains only the first derivatives of the residuals, and the second-order term contains their second derivatives.
There are many methods for solving (1), either first- or second-order based. Most classical and modified algorithms require the computation of the gradient, and also demand at least the first-order term of the Hessian matrix of the objective function (1), which may be costly as the dimension increases.
However, if the residual, i.e. the objective function value, is large at the minimizer, then the computation of the exact second-order derivatives of F, although expensive, might become relevant. Nonzero residuals are present in many practical applications (see [14,55]). Unfortunately, Newton-type methods that require the exact Hessian matrix are not suitable for large-scale problems. As a consequence, alternative approaches, which make use of first derivative information or rest only upon function evaluations, have been developed (see [42] for details).
The Gauss-Newton method (GN) [24] is a well-known method for solving (1); it requires the computation of the Jacobian J_k of the function F at x_k, and the solution of a linear system at each iteration. Provided J_k^T J_k is nonsingular, the Gauss-Newton direction can be written as

   d_k = -(J_k^T J_k)^{-1} g_k,   (5)

where g_k = J_k^T F_k. Another method for solving (1) is the Levenberg-Marquardt method (LM) [31,36,40]. This method adds a positive scalar correction to the term J_k^T J_k of the GN method. The correction is a regularization parameter added specifically to guard against the possibility of the Jacobian matrix becoming rank-deficient. The LM direction is given by

   d_k = -(J_k^T J_k + µ_k I)^{-1} g_k,   (6)

where µ_k is a positive regularization or damping parameter. Both the GN and LM methods neglect the second-order part of the Hessian matrix of the objective function (1). As a result, they are expected to perform well when solving small residual problems. Nevertheless, when solving large residual problems, these methods may perform poorly [14,55]. The need to store the Jacobian matrix in the GN and LM methods also contributes to their inefficiency when solving large-scale problems. Thus, derivative-free approaches are needed for such cases (see, for example, [7, 29, 47, 49-51, 59-61, 69, 70]).
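The Gauss-Newton and Levenberg-Marquardt directions just described can be sketched in a few lines; the code below is a minimal illustration (our own function names, NumPy assumed), not a robust implementation.

```python
import numpy as np

def gauss_newton_direction(J, F):
    # Solve (J^T J) d = -J^T F; assumes J^T J is nonsingular.
    return np.linalg.solve(J.T @ J, -(J.T @ F))

def levenberg_marquardt_direction(J, F, mu):
    # Solve (J^T J + mu I) d = -J^T F; mu > 0 guards against a
    # rank-deficient Jacobian.
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(n), -(J.T @ F))
```

Note that the damping term shrinks the step: for the same J and F, the LM direction is shorter than the GN direction whenever mu > 0.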
Our survey focuses mainly on two categories of methods for solving nonlinear least-squares problems. The first category involves the use of structured quasi-Newton methods (SQN), whereas the second one concerns derivative-free methods for minimizing nonlinear least squares based on the Gauss-Newton method and hybrid methods. These categories are detailed in Sections 2 and 3, respectively. In Section 4 we describe additional works that do not fit into these two categories, mainly concerning miscellaneous methods as well as convergence analysis results.
Final remarks and open research questions are raised in Section 5.
2. Structured Quasi-Newton Methods for Nonlinear Least Squares. Structured quasi-Newton (SQN) methods were devised to overcome the difficulties that arise when solving large residual problems using the Gauss-Newton method [44,55]. These methods combine Gauss-Newton and quasi-Newton ideas in order to make good use of the structure of the Hessian matrix of the objective function. The search direction is obtained from

   (A_k + B_k) d_k = -g_k,   (8)

where A_k = J_k^T J_k and B_k is an approximation to the second-order part of the Hessian matrix. The matrix B_k is updated using a suitable quasi-Newton update so that a structured secant equation is satisfied, such as

   B_{k+1} s_k = y_k - A_{k+1} s_k   (9)

or

   B_{k+1} s_k = (J_{k+1} - J_k)^T F_{k+1},   (10)

where s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k. In [8], Brown and Dennis developed an algorithm for nonlinear least-squares curve fitting. In [13], Dennis presented the structured quasi-Newton method as a technique for solving large residual NLS problems. Although SQN methods are shown to be very efficient (especially for large residual problems), there are some aspects associated with such approaches that require special attention. Firstly, a suitable quasi-Newton update has to be found, together with secant equations such as (9) or (10) listed above. Secondly, there is a demand for algorithms that automatically switch to the Gauss-Newton method when solving small residual problems, or that adopt a quasi-Newton update when solving large residual problems.
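A minimal sketch of the SQN step: the Gauss-Newton matrix A_k is augmented with an approximation B_k of the second-order term before solving for the direction (illustrative code under our own naming, NumPy assumed).

```python
import numpy as np

def sqn_direction(J, F, B):
    # Solve (A + B) d = -g with A = J^T J and B a quasi-Newton
    # approximation of the second-order part of the Hessian.
    A = J.T @ J
    g = J.T @ F
    return np.linalg.solve(A + B, -g)
```

With B set to the zero matrix this reduces to the Gauss-Newton direction, which is the behavior the sizing techniques of the next subsection try to enforce automatically on zero residual problems.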

2.1. Sizing Techniques on SQN. Sizing techniques were introduced as a remedy for the difficulties involved when solving small (in particular, zero) residual problems using SQN methods. This is because, in general, quasi-Newton updates do not approach the zero matrix; therefore, SQN methods do not perform as well as the GN and LM methods on zero residual problems, i.e. when F(x*) = 0 at the solution.
In [3], Bartholomew-Biggs considered some SQN methods based on the observation of McKeown [37] about the inability of GN-like algorithms to minimize a certain class of functions that are sums of squares. Bartholomew-Biggs was aware of the shortcomings of the SQN methods when solving small residual problems, so he proposed a sizing parameter on the matrix B_k, updating it with the symmetric rank one (SR1) formula

   B_{k+1} = B_k + ((y_k^* - B_k s_k)(y_k^* - B_k s_k)^T) / ((y_k^* - B_k s_k)^T s_k),

where y_k^* is defined in (13) below. Dennis et al. [15] proposed another successful SQN algorithm called NL2SOL. NL2SOL maintains a secant approximation of the second-order part of the Hessian and automatically switches to this approximation when solving large residual problems. With

   y_k^* = (J_{k+1} - J_k)^T F_{k+1},   (13)

the sizing parameter used by the NL2SOL algorithm is given by

   ζ_k = min{ |s_k^T y_k^*| / |s_k^T B_k s_k|, 1 },   (14)

where the scalar s_k^T y_k^* / (s_k^T B_k s_k) corresponds to the one used by Oren and Luenberger [46] and Oren [45] in their popular self-sizing quasi-Newton algorithms. The approximated second-order part of the Hessian matrix in the NL2SOL algorithm is then updated by a secant formula applied to the sized matrix ζ_k B_k (see [15] for the precise update). The SQN method with sizing parameter (14) yields an adaptive algorithm because, if the residual becomes F_{k+1} = 0, then y_k^* = 0 and ζ_k = 0. As a result, the updated matrix B_{k+1} will be the zero matrix.
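The sizing computation can be sketched as follows; we assume the min-with-one form of the sizing parameter (14) and guard against a vanishing denominator (illustrative code, our own naming, NumPy assumed).

```python
import numpy as np

def sizing_factor(s, y_star, B):
    # zeta = min{ |s^T y*| / |s^T B s|, 1 }; returns 0 when the
    # denominator vanishes or the residual-based vector y* is zero.
    denom = abs(s @ B @ s)
    if denom == 0.0:
        return 0.0
    return min(abs(s @ y_star) / denom, 1.0)
```

In particular, y* = 0 (a zero residual at x_{k+1}) yields zeta = 0, so the sized matrix zeta*B collapses to the zero matrix and the method falls back to Gauss-Newton behavior.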
In [26], Huschens argued that there were no theoretical results backing the commonly used sizing techniques discussed above. He proposed a self-sizing strategy for the SQN approach (named the totally structured secant method) that converges locally and superlinearly for nonzero residual problems and locally and quadratically for zero residual problems. Specifically, he dealt with the convex class of Broyden's family.
Instead of obtaining the SQN direction using equation (8), Huschens proposed an alternative way to compute the direction, namely from

   (A_k + ||F_k|| B_k) d_k = -g_k,   (15)

where B_k is the k-th approximation of the (scaled) second-order part of the Hessian. The matrix B_k is updated so that the secant equation

   (A_{k+1} + ||F_{k+1}|| B_{k+1}) s_k = ŷ_k   (16)

is satisfied, where ŷ_k = A_{k+1} s_k + (||F_{k+1}|| / ||F_k||) y_k^* and y_k^* is given in Equation (13). Another version of the convergence proof by Huschens was given by Yabe and Ogasawara in [63], where they extended the result from a restricted convex class to a wider class of the structured Broyden family.
In [73], Zhang et al. took advantage of the good convergence properties of the totally structured secant method of Huschens and proposed a modified quasi-Newton equation for nonlinear least-squares problems. Based on the numerical experiments presented in the paper, the performance of the modified secant equation of Zhang et al. was shown to be better than that of the well-known structured secant equations (9) and (10).
In summary, the introduction of SQN methods reduced some of the drawbacks concerning the minimization of large residual problems. These approaches contributed to choosing the best secant equation from (9) and (10) for the quasi-Newton update (see [15]). Sizing techniques on SQN provide an environment for developing switching algorithms that make use of the good performance of the Gauss-Newton method when dealing with zero residual problems.

2.2. Hybridization Methods for Solving Nonlinear Least Squares. Hybrid methods for minimizing nonlinear least-squares problems take advantage of two approaches in a single algorithm. In [37], McKeown presented several examples of quasi-Newton methods that do not take into consideration the special form of the Hessian matrix (3) but, even so, perform much better than the GN and LM methods.
Betts [5] developed a hybrid algorithm for solving nonlinear least-squares problems. Betts' algorithm estimates the Hessian matrix as the sum of the first- and second-order terms, where the first-order term is the usual Gauss-Newton term and the second-order term is obtained using a rank-one quasi-Newton updating formula.
Based on the arguments of McKeown, Nazareth [42] reviewed some of the methods that implicitly or explicitly considered approximating the second-order term of the Hessian matrix. In [41], Nazareth presented his contribution for minimizing a sum of squares of nonlinear functions. He considered a convex combination of the first-order term and a matrix obtained by adopting a quasi-Newton approximation of the second-order term of the Hessian. The Hessian approximation proposed by Nazareth is given by

   H_k = α_k J_k^T J_k + (1 - α_k) B_k + λ_k I,   (17)

where α_k ∈ [0, 1], λ_k > 0 is a regularization parameter, and I is the identity matrix in R^{n×n}. The search direction is obtained by solving

   H_k d_k = -g_k.   (18)

It can be observed that if α_k = 1, the resulting direction corresponds to the Levenberg-Marquardt direction, while if α_k = 0, we have the regularized quasi-Newton direction. Thus, the direction obtained by solving (18) contains the classical Levenberg-Marquardt and the regularized quasi-Newton directions as special cases.
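Nazareth's convex combination can be sketched directly; α interpolates between the Levenberg-Marquardt and regularized quasi-Newton extremes (illustrative code under our own naming, NumPy assumed).

```python
import numpy as np

def nazareth_direction(J, F, B, alpha, lam):
    # Solve (alpha J^T J + (1 - alpha) B + lam I) d = -g.
    n = J.shape[1]
    H = alpha * (J.T @ J) + (1.0 - alpha) * B + lam * np.eye(n)
    return np.linalg.solve(H, -(J.T @ F))
```

Setting alpha = 1 recovers the Levenberg-Marquardt direction, and alpha = 0 the regularized quasi-Newton direction.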
Al-Baali and Fletcher [2] considered various possibilities for hybridizing the Gauss-Newton method with quasi-Newton updates for nonlinear least squares. They used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update to approximate the second-order term of the Hessian matrix. The authors also used a switching criterion based on a Hessian error measure ∆(·), which quantifies the difference between the current Hessian and a given positive definite matrix. Depending on the size of this measure, their algorithms switch to either the GN or the BFGS direction, the matrix update being selected accordingly.
In [16], Fletcher and Xu pointed out that the rate of convergence of the algorithm proposed in [2] is not superlinear as claimed. They argued that the measure ∆(·) does not, in general, distinguish between zero and nonzero residual problems, and they constructed a counter-example to support their observations (details can be found in [16]). In view of this, Fletcher and Xu proposed a more general and more efficient algorithm with a switching strategy that uses the condition

   (f_k - f_{k+1}) / f_k ≤ ε,   (19)

where 0 < ε < 1. If (19) is satisfied, then the algorithm switches to the SQN direction; otherwise, the GN direction is used. Therefore, the algorithm automatically adopts the GN direction for zero residual problems, and the SQN direction for nonzero residual problems. Many studies have contributed to the development of hybrid SQN methods (see, for example, [1,35,38,57]). Spedicato and Vespucci [54] presented an extensive set of numerical experiments comparing 24 hybrid algorithms on the test problems introduced by McKeown [37]. The results show that most of the hybrid SQN methods performed well, especially for large residual problems, with better results than those of the specialized GN method.
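The Fletcher-Xu switching test reduces to a one-line comparison of the relative decrease in f; the sketch below (our own naming) returns whichever precomputed direction the test selects.

```python
def fletcher_xu_switch(f_prev, f_curr, gn_dir, sqn_dir, eps=0.2):
    # A small relative decrease signals large-residual behavior:
    # take the SQN direction; otherwise keep Gauss-Newton.
    # eps lies in (0, 1); the default 0.2 is an arbitrary example.
    if f_prev - f_curr <= eps * f_prev:
        return sqn_dir
    return gn_dir
```

On zero residual problems the GN method decreases f rapidly, so the relative decrease stays large and the test keeps selecting the GN direction.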

2.3. Factorized Versions of Structured Quasi-Newton Methods.
One of the issues associated with SQN methods is the need for the approximate Hessian matrix to maintain its positive definiteness so that a line search approach can be used to achieve global convergence. In an attempt to overcome this difficulty, Yabe and Takahashi [64,65,66], and Sheng and Zou [53], independently proposed factorized versions of SQN methods which maintain the positive definiteness of the Hessian matrix in the system H_k d_k = -g_k. In the factorized version of SQN, the search direction is given by

   d_k = -[(J_k + L_k)^T (J_k + L_k)]^{-1} g_k,   (20)

where the matrix L_k (called the L-update) has the same size as the Jacobian matrix J_k, acts as its correction, and is expected to generate a descent direction.
In [65], Yabe and Takahashi studied a modified version of the factorized SQN method of Sheng and Zou (called the SZ-update), creating a generalized update which encompasses the SZ-update as a special case. By applying sizing techniques, Yabe [62] further proposed a factorized version of the structured Broyden-like family. In [67], Yabe and Yamaki established and proved the local and superlinear convergence of the proposed methods.
In [72], Zhang et al. considered sized factorized secant methods. Interestingly, their methods ensure quadratic convergence for zero residual problems, and superlinear convergence for nonzero residual problems. Instead of computing the search direction using (20), they consider

   d_k = -[(J_k + ||F_k|| L_k)^T (J_k + ||F_k|| L_k)]^{-1} g_k,   (21)

where the factor ||F_k|| acts as a sizing factor for the matrix L_k, so that ||F_k|| L_k approaches the zero matrix for zero residual problems even if the update L_k itself does not. Zhou and Chen [75] proposed a global hybrid Gauss-Newton structured BFGS method that automatically reduces to the Gauss-Newton step for zero residual problems, and to the structured BFGS step for nonzero residual problems. Unlike Fletcher and Xu, Zhou and Chen proposed a switching strategy, stated as condition (22), in which 0 < ε ≤ (1/2) λ_min(H(x*)) is a constant, λ_min(H(x*)) is the smallest eigenvalue of the Hessian H at the minimizer x*, and the test is expressed in terms of an auxiliary vector z_k. Condition (22) plays a role similar to (19); in addition, it can be used to ensure that the update matrix in the BFGS step is safely positive definite. Wang et al. [58] exploited the idea of the modified BFGS method for nonconvex minimization proposed by Li and Fukushima [32], and proposed a hybrid Gauss-Newton structured modified BFGS method for solving nonlinear least-squares problems.
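The factorized direction, with and without the sizing factor ||F_k||, can be sketched as follows (illustrative code under our own naming, NumPy assumed).

```python
import numpy as np

def factorized_direction(J, L, F, sized=False):
    # Correct the Jacobian by an L-update of the same shape; the
    # sized variant scales L by ||F|| so the correction vanishes
    # on zero residual problems.
    scale = np.linalg.norm(F) if sized else 1.0
    Jc = J + scale * L                 # corrected (possibly sized) Jacobian
    g = J.T @ F                        # true gradient J^T F
    return np.linalg.solve(Jc.T @ Jc, -g)
```

Since the system matrix is a Gram matrix of the corrected Jacobian, it is positive semidefinite by construction, which is the point of the factorized approach.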
The hybrid Gauss-Newton structured quasi-Newton algorithms show compelling and improved numerical performance compared to the classical algorithms for solving nonlinear least-squares problems. However, their performance on large-scale problems is very poor because they require a considerably large amount of storage. Owing to this drawback, matrix-free approaches are preferable for solving large-scale nonlinear least-squares problems (cf. [39]).

3. Derivative-free Methods for Minimizing Sums of Squares of Functions.
Besides the Gauss-Newton, Levenberg-Marquardt, and SQN methods, other strategies based on the derivative-free approach for solving nonlinear least-squares problems exist. These methods are intentionally developed to avoid computing the first derivative of the function when solving (1). To the best of our knowledge, the first derivative-free algorithm for nonlinear least-squares problems was devised in 1965 by Powell [49]. Powell's algorithm is similar to the Gauss-Newton step, except that the former approximates the first derivative along a set of n linearly independent directions rather than along the coordinate axes.
Peckham [47] also proposed a method for minimizing a function given by a sum of squares without computing the gradient. Peckham considered a different approach to approximating the first- and second-order terms of (3). In [7], Brown and Dennis proposed two derivative-free algorithms for minimizing sums of squares of nonlinear functions. These methods are simply the analogues of the Gauss-Newton and Levenberg-Marquardt methods in which the Jacobian matrix is approximated using finite-difference techniques. The direction of these methods is given by

   d_k = -(D_k^T D_k + µ_k I)^{-1} D_k^T F_k,

where the matrix D_k approximates the Jacobian and its j-th column is computed by finite differences as

   (D_k)_j = (F(x_k + h_k^j u_j) - F(x_k)) / h_k^j,   j = 1, ..., n,

with u_j the j-th unit column vector; the damping parameter µ_k and the difference steps h_k^j are chosen adaptively, as described in [7]. Notice that h_k and the auxiliary vector δ_k used in [7] are n-dimensional vectors, and the index j denotes the j-th component of these vectors. Ralston and Jennrich [50] developed another derivative-free algorithm for minimizing sums of squares of nonlinear functions, called DUD, for "don't use derivatives". The algorithm is designed specifically for data-fitting applications in which the computation of the derivatives is too expensive. Ralston and Jennrich showed that their approach is competitive with the best derivative-free algorithms existing at that time, such as Powell's [49] and Peckham's [47].
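A forward-difference Jacobian approximation of the kind used by these analogues can be sketched as below; the fixed step h is a simplification of ours, whereas Brown and Dennis adapt the steps per coordinate (illustrative code, our own naming, NumPy assumed).

```python
import numpy as np

def fd_jacobian(F, x, h=1e-6):
    # Column j approximates dF/dx_j via (F(x + h e_j) - F(x)) / h.
    Fx = F(x)
    D = np.empty((Fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        D[:, j] = (F(x + e) - Fx) / h
    return D
```

The columns of D can then replace J_k in the Gauss-Newton or Levenberg-Marquardt direction, at the cost of n extra residual evaluations per iteration.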
An attempt to develop a derivative-free hybrid SQN method for nonlinear least-squares problems was made by Xu in [59]. Xu proposed a modification of the hybrid Gauss-Newton BFGS method in which Broyden's rank-one update is implemented when the algorithm performs the Gauss-Newton step, whereas, when the algorithm takes the BFGS step, the Jacobian matrix associated with the gradient is estimated using a finite-difference operator. Good numerical behavior was observed for the algorithm, possibly because it combines the best features of the GN-Broyden and the finite-difference BFGS methods.
Working from a different perspective, but also attempting to develop a derivative-free algorithm for solving nonlinear least-squares problems, Shakhno and Gnatyshyn [52] approximated the Jacobian matrix in the Gauss-Newton step using a divided-difference operator. Following the ideas of Shakhno and Gnatyshyn, Ren et al. [51] improved the local convergence result, determining its radius of convergence.
Zhang et al. [70] and Zhang and Conn [69] developed a class of derivative-free algorithms for least-squares minimization. By taking advantage of the structure of the problem, they built polynomial interpolation models for each of the residual functions. The global convergence result of the proposed technique was obtained using the trust-region framework. It is worth mentioning that this new framework of derivative-free algorithms for solving nonlinear least-squares problems shows an excellent numerical performance compared with the algorithms that approximate the gradients using finite differences (cf. [69,70]).
Recently, taking advantage of the development of Automatic Differentiation (AD) technology [23], Xu et al. [60] proposed an efficient quasi-Newton approach for solving nonlinear least-squares problems. The method applies AD to compute the gradient of the objective function directly and efficiently, without forming the explicit Jacobian matrix, which is sometimes expensive. The Jacobian matrix appearing in the first-order term of the Hessian is approximated by a Broyden matrix B_k [9], updated at each iteration by the Broyden rank-one formula

   B_{k+1} = B_k + ((F_{k+1} - F_k - B_k s_k) s_k^T) / (s_k^T s_k).

A fully Jacobian-free method for minimizing nonlinear least squares was recently proposed by Xu et al. [61]. This new approach combines preconditioning and AD techniques in a novel way. In comparison to the previously discussed approaches, it does not depend on the structure of the gradient and the Hessian matrix of the objective function, making it suitable for general large-scale unconstrained optimization problems.

4. Some Miscellaneous Methods for Nonlinear Least Squares. Although our survey is mainly on SQN methods, there exist some interesting methods in the literature that are worth mentioning. We list some of them here.
Gould et al. [20] introduced an algorithm for nonlinear least-squares problems using a combination of filter and trust-region techniques. Bellavia et al. [4] presented a regularized Euclidean residual algorithm for nonlinear least squares and nonlinear equations. They proved the global and quadratic convergence of the algorithm without the nonsingularity assumption on the Jacobian.
Although structured nonlinear conjugate gradient methods are hardly studied, Kobayashi et al. [30] exploited the special structure of the Hessian of the nonlinear least-squares objective function and introduced some conjugate gradient methods with a structured secant condition for nonlinear least-squares problems. For box-constrained nonlinear least-squares problems, an inexact Gauss-Newton trust-region method was proposed by Porcelli in [48].
In another attempt, Gonçalves and Santos [19] proposed a simple spectral correction for the GN method, approximating the residual Hessian in the second-order part of (3) by a sign-free spectral parameter. They showed that their approach is locally convergent when applied to the class of quadratic residual problems. In addition, convergence results were given for problems for which convergence of the GN method is not ensured. Subsequently, Gonçalves and Santos [18] extended the idea, providing global convergence results for the case of non-quadratic residuals by adopting the nonmonotone line search strategy of Zhang and Hager [71] as the globalization tool. Specifically, the direction of the approach of Gonçalves and Santos is obtained by solving

   (J_k^T J_k + µ_k I) d_k = -g_k,

where the parameter µ_k is the spectral coefficient given by

   µ_k = (s_{k-1}^T y_{k-1}^*) / (s_{k-1}^T s_{k-1}),

with y_{k-1}^* as in (13). In [27], Karas et al. devised explicit algebraic rules for computing the regularization parameter of the LM method, with global and local convergence analysis. Resting upon the regularizing aspect of the LM method, further analyses of the performance of methods with second- and higher-order regularization terms have been developed (see, e.g. [6,10,11,21,43,74]).
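A spectrally corrected GN step along these lines can be sketched as follows; we assume a secant-type choice µ = s^T y* / s^T s for the spectral parameter (the precise coefficient used in [18,19] should be checked against those papers), with our own function naming and NumPy assumed.

```python
import numpy as np

def spectral_gn_direction(J, F, s_prev, ystar_prev):
    # Approximate the second-order term B(x) by mu * I, with mu a
    # (possibly negative, hence "sign-free") secant-based coefficient.
    mu = (s_prev @ ystar_prev) / (s_prev @ s_prev)
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(n), -(J.T @ F))
```

The correction costs only two inner products per iteration, which is what makes the spectral approach attractive compared to maintaining a full quasi-Newton matrix.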

5. Conclusions. We have briefly reviewed methods for solving nonlinear least-squares problems, paying special attention to the structured quasi-Newton methods. The interested reader may refer to the review paper by Ya-Xiang Yuan [68] for some recent numerical methods based on Levenberg-Marquardt, quasi-Newton, and trust-region algorithms.
Due to the nature of these problems, little attention has been given to developing derivative-free methods for finding their solution. Based on this observation, we recommend further research on derivative-free methods for large-scale nonlinear least-squares problems.
Finally, we hope this small review will help students and researchers familiarize themselves with the required literature needed to understand classical and modern methods for solving nonlinear least-squares problems.
We conclude this brief survey by stating some research topics that, based on this survey and to the best of our knowledge, remain open:
1. It seems that hybrid GN-SQN algorithms perform better in practice than the individual GN or SQN algorithms. Is it possible to develop a single algorithm that solves nonlinear least-squares problems more efficiently than the hybrid algorithms?
2. The hybrid algorithm proposed by Nazareth [41] adaptively solves nonlinear least-squares problems. We recommend further research on choosing the best convex combination parameter and, at the same time, on developing the convergence analysis of the algorithm.
3. Considering the good convergence properties of the class of SQN algorithms, more research is needed to develop simplified versions of factorized SQN algorithms.
4. Due to the advancement of automatic differentiation (AD), the efficiency of the derivative-free hybrid method presented by Xu in [59] may likely improve by incorporating AD techniques.
5. Huschens-like methods for nonlinear least squares exhibit good local convergence properties. Do these properties hold globally?
6. For large-scale problems, assembling the complete Hessian matrix may exhaust the available memory. With this in mind, is it possible to develop an algorithm for minimizing sums of squares of functions that exploits the structure of the Hessian without requiring storage for the full matrix?