Numerical Optimization Algorithm of Wavefront Phase Retrieval from Multiple Measurements

Wavefront phase retrieval from a set of intensity measurements can be formulated as an optimization problem. Two nonconvex objective models (MLP and its variants LS) based on maximum likelihood estimation are investigated. We develop numerical optimization algorithms for real-valued function of complex variables and apply them to solve the wavefront phase retrieval problem efficiently. Numerical simulation is given with application to three wavefront phase retrieval problems. LS model shows better numerical performances than MLP model. An explanation for this is that the distribution of the eigenvalues of Hessian matrix of LS model is more clustered than MLP model. LBFGS shows more robust performance and takes fewer calculations than other line search methods.


Introduction
Phase retrieval has been a long-standing inverse problem in imaging sciences such as X-ray crystallography [27], electron microscopy [28], X-ray diffraction imaging [35], optics [19] and astronomy [11], just name a few.Those imaging techniques are based on the Fourier transform relation between the optical field and its diffraction pattern.Only the intensity of the optical field can be recorded by CCD (Charge Coupled Device), while the more important phase information is lost in the course of measurement.Mathematically speaking, phase retrieval in continuous setting is to find a function u(x) when given its Fourier transform amplitude |F (u(x))|.If we have retrieved the phase of Fourier transform, then the function can be easily obtained by inverse Fourier transform, hence the name phase retrieval.
The phase retrieval problem is the so-called inverse problem.Violation of continuous dependence on data is the usual difficulty in numerical solutions for most inverse problems and so is the phase problem.However, for phase retrieval problem, nonuniqueness is the more important source of ill-conditionedness.Clearly, if u(x) is a solution to the phase retrieval problem, then cu(x) for any scalar c ∈ C obeying |c| = 1, u(x − x 0 ) for any given x 0 and u(x − x 0 ) are also solutions.Those "trivial associates" are acceptable ambiguities from the physical viewpoint.The results are also applicable in discrete setting.References for theoretical results in terms of uniqueness of the phase retrieval problem can be found in [1][2][3] for continuous setting and in [4,16,33,34] for discrete model.Those references point out that the solution is almost relatively unique in multidimensional cases, except for "trivial associates" solutions.In literature [23], it is pointed out that, these uniqueness results are of fundamental importance, but it does not apply to numerical algorithms, in particular in the presence of noise.Nonuniqueness in the absolute sense, i.e., all solutions are same only up to a global phase shift, may be the main reason for the failure and common stagnation problem of practical algorithm [9].
Two approaches to mitigate the nonuniqueness of phase retrieval problem are using a priori information and using multiple measurements.The former approach relies on all kinds of a priori information about the signal, such as real-valudedness [11], positivity and support constraints.Practical reconstruction methods, such as Error Reduction (ER) and its variants (HIO [11], HPR and RAAR [21,22]), are developed.While those algorithms are simple to implement and amenable to additional constraints, they all take the disadvantages of slow convergence and easy stagnation [13], usually hundreds of iterations are needed for a reasonable solution.Besides, the oversampling method [26] is used to get a satisfying solution in above algorithms.
Instead of assuming constraints on the signal, the concept of using multiple measurements is introduced to phase retrieval, such as multiple diffraction patterns [8] and ptychography [32].Taking multiple measurements usually yields uniqueness solution to phase retrieval.It has shown that O(N ) Gaussian measurements yield the unique solution with high probability for a signal u ∈ C N [7].PhaseLift [7] which solves a semidefinite programming is proposed by lifting the vector u to a rank one matrix uu * .So it is prohibitive to twodimensional phase retrieval problem.
Optical wavefront phase retrieval problem we study here arises in optics and astronomy as a special case of phase retrieval problem, where two moduli including intensity of the wavefront and its Fourier transform intensity are given.Its algorithms have been widely used in adaptive optics to correct the aberrations of large telescopes, allowing them to obtain the diffraction-limited image.Light wave field distorts when passing through the perturbation atmospheric, leading the image blurring.Usually, the wavefront is parametrized with Zernike polynomials coefficients.But the Zernike-based parametrization has the disadvantage that we do not know how many terms are enough to fit the wavefront.Moreover, there is no analytic expression of Zernike polynomials for irregular domain, which is usually used in modern telescopes, such as Hubble Space Telescope and James Webb Space Telescope (JWST).
In this paper, we consider the wavefront phase retrieval problem from multiple measurements.In optics, multiple measurements (called diversity images) are easy to obtain at different positions (called image planes) through optical axis.This configuration is called phase diversity, which was proposed by Gonsalves [14] in 1976 is to overcome the ill-posedness of the wavefront phase retrieval problem.Misell method [28] as the natural extension of ER algorithm was used when there are more image planes, with the same disadvantages as ER algorithm.We formulate a direct error metric nonconvex optimization problem from a set of diversity images.The discrete scheme takes the advantage of allowing for the most accurate representation of the domain [23].
The main contributions in this paper consist three points: • Formulation of Objective Functions: Two objective functions (called MLP and LS models) are investigated.MLP model is deduced from the negative logarithm maximum likelihood estimation objective according to the Poisson photon distribution.The least squares (LS) model used in iterative projection method then can be interpreted as an approximation of the MLP model.
• Extension of Optimization Method of Function of complex variables: We develop the line search method for real-valued function of complex variables, which directly operates the complex gradient.It is more convenient than operating the real and image parts respectively.Since LBFGS algorithm scales well for large scale problem and shows more robust performance, we deduce the LBFGS algorithm, which directly operates on complex gradient.
• Eigenvalue Analysis of Hessian Matrix: We analyze the eigenvalues of the complex Hessian matrix for the two objective functions.From the analysis, the distribution of eigenvalues of Hessian matrix of LS model is generally more clustered than MLP model, which gives an intuitive explanation for the better numerical performance.
The plan of this paper is as follows.First in Section 2 we state the formulation of the wavefront phase retrieval problem in phase diversity setting.In Section 3, we present two nonconvex objectives to minimize for the solution of the wavefront phase retrieval problem.The gradient and Hessian operators are deduced in Section 3.3.Then we derive some line search methods for the unconstrained real-valued function of complex variables in Section 4 based on the C-R calculus (see subsection 4.1) and the global convergence to a stationary point are proven.In Section 5, we give an intuitive explanation on why LS model converges faster by analyzing the eigenvalues of Hessian matrices of the two objectives.In Section 6, we apply several line search algorithms to three wavefront phase retrieval problems, i.e., Zernike type, von Karman atmospheric type for disc pupil and JWST type for segmented pupil.LBFGS algorithm shows the most robust performance than other line search methods.Conclusion comes last in Section 7.
To be clear, the notations and conventions used in the paper are briefly summarized here.We use script letter F for Fourier transform and F for discrete Fourier transform.Symbols (x, y) and (ξ, η) denotes the coordination in signal plane and Fourier plane respectively.For every complex vector z ∈ C n , |z| ∈ R n denotes its element-wise magnitude vector.a • b and a/b denote their element-wise Hadamard product and quotient of matrices or vectors a and b. z denotes the element-wise complex conjugate of z.

Model of Optical Image Formation
We begin with an isoplanatic Fourier optics model for a simple telescope with incoherent light and denote the optical wave field (wavefront) as u(x), x = (x, y).The relation between intensity I recorded in the focus plane and object ψ is given by where we cancel out the phase shift before the integral with the reason that it does not change the intensity.The kernel in Equation ( 1) is with coordination scaling.The kernel characterizes the optical system, and |U (ξ, η)| 2 is known as the point spread function (PSF) of the optical system.Equation ( 1) is the basis of Fourier optics, the strict deduction from the Helmholtz equation and Huygens-Fresnel principle can be referred in book [15] or the review paper [23].If the intensity is recorded in the defocus plane, there is some phase shift in the kernel, see (3).
If the optical system is in a homogeneous medium, the wavefront u(x, y) is the indicator function of lens region A, i.e., u(x) = χ A (x) = 1, x ∈ A, 0, otherwise.

Wavefront Phase Retrieval Problem
In most applications, however, the assumption of homogeneity is not correct for the optical field.The resulting deviations are called phase aberrations.Deviations may occur at any point along the path of propagation and can be caused by atmosphere turbulence in an intervening medium or geometric flaws of lens.Aberration resulting from atmosphere turbulence can be simulated with the von Karman model based on a realization of a strictly stationary Gaussian stochastic process.In such case, the wavefront u(x) is not the indicator function but a complex function within the lens region.It can be represented in polar form where a(x, y) denotes the transmitting function and φ(x, y) is the aberration phase.
Equation (1) becomes I = |F (u)| 2 when object ψ(x) is a δ function.In astronomy and optical design, the object is assumed a point optical source.Then the wavefront phase retrieval problem arises when recovering generalized pupil function u(x) from the intensity I and intensity |u| 2 .Wavefront phase retrieval is the first step in adaptive optics technique to measure and reshape the wavefront phase.Then the reconstructed wavefront is used to compensate the wavefront aberrations induced by atmospheric turbulence.And it is also used to optical test to characterize the flaws of lens.To overcome the ill-posedness of the wavefront phase retrieval problem, the phase diversity method is used by providing multiple measurements to further restrict the wavefront u(x).
In phase diversity setting, diversity image intensities are observed at defocus d m , m = 1, 2, . . ., L − 1 (in wavelength) are given as (3) In summary, the mathematical model of wavefront phase retrieval with phase diversity measurements is to find the wavefront function u when given the intensity |u| 2 = I 0 and phase diversity image intensities I m , m = 1, 2, . . ., L − 1.
Numerical algorithms to wavefront phase retrieval problem is based on discretization.In the following, we focus on the discrete version of the problem.3 Choice of the Data Misfit Functional

Maximum Likelihood Estimation
The physics of CCD for recording the intensity of incoming wave field is a photon counting process, the noise resulted from the process and other noise, such as external/internal background radiation noise and thermoelectric noise, all yield Poisson process due to photon counting.Readout noise which is modeled by Gaussian distribution can be lowered to zero [36], so we only consider the Poisson noise model in CCD.
For brevity, we use the one-dimensional notation to describe the essence. 1 The observed photon counting data in diversity images I obs m , m = 0, 1, . . ., L−1 are described by random vectors I m ∈ R N , m = 0, 1, . . ., L− 1.The components of every vector are random variables yielding independent Poisson distribution with mean I m ∈ R N , where I m should be equal to the predicted data I m = |F (u; d m )| 2 from the image formulation model when given the wavefront u.The observed data Without loss of generality, we omit the subscript m for deducing the misfit functional.Assuming that the number of photons n i is proportional to I obs i , i = 1, 2, . . ., N .The probability of recording intensity I obs i (or receiving n i photons) is By independent property, the joint probability density is given by The negative log-likelihood function associated with the Poisson distribution can be cast as where we omit the log I obs i !constant offset in the summand. 2 So the data misfit functional in discrete model can be given by where Equation ( 5) can be written as by substituting M i = √ I i into ( 5), where M i , M obs i is pointwise amplitude.
Expand (7) around Equation ( 8) can be regarded as an approach to approximating the Poisson likelihood function L, and L corresponds to a Gaussian distribution about M i with a variance of 1/4.So another data misfit function can be proposed where It is worth to note that data misfit function ( 9) is fortunately identical to the cost function found throughout the phase retrieval literature.Fienup [11] considered ER method as a fixed-step gradient descent method applied to this cost function, which can be clearly seen in Section 3.3.
Remark.If we expand (5) around . This data misfit functional is recommended in [17].However, in our numerical tests this functional is not so good to use, so we do not consider it in the paper.

Two Objective Models
We have obtained the data misfit functional for every diversity image, then we can just add up every misfit function with equal or different weights, here the equal weight scheme is used for brevity.By looking at (6) and (9), if there exists an index such that |F (u; d)| i closes to zero, then its gradient is singular at this point, so we need some perturbation applied to objectives to overcome possible overflow in numerical algorithms.
Two data misfit models are presented to measure the error that quantifies inconsistency between the given data and the predicted data from the image formulation model (3).Hereafter the superscript (•) obs are omitted without ambiguity.Since equation ( 9) can be rewritten as We consider the following two models for the error metric E m (u) in the paper: • Maximum Likelihood Poisson (MLP) model: • Least Squares (LS) model: where operators . ., L − 1 are the forward formulation of predicted data, and K 0 (u) = |u| 2 is the measurement of pointwise amplitude of the wavefront.
Having obtained the misfit objective functional, in the framework of optimization we should derive the gradient and Hessian to utilize the line search method.In the next subsection, we use the Fréchet derivative to reduce the corresponding gradient and Hessian for MLP and LS models respectively.

Deduction of Gradient and Hessian Operator by Fréchet Derivative
In this section, the gradient and Hessian operator (Hessian-vector multiplication) of the two objectives are derived.We deduce those expressions according to Fréchet derivative.Because of the similarity of every image plane misfit functional, we take single image data to derive the expressions for brevity with notation omitted subscript m.All the above objectives ( 11) and ( 12) can be regarded as functional defined on Hilbert space C N to R. The gradient and Hessian operator of two objectives models to minimize are presented in the following.For details of deduction of MLP and LS models see Appendix A.2 and A.3.
We use symbol F m and F * m to denote the Fourier transform F (u; d m ) with defocus d m and adjoint operator respectively.For MLP model (11), its gradient and Hessian-vector multiplication are given by For LS model (12), its gradient and Hessian-vector multiplication are given by We see that the gradient ∇ J m (u) is identical to the projection onto the Fourier transform constraints in ER algorithm for the mth diversity image with a perturbed ǫ 2 .
It is noted that gradient and Hessian operators are similar to the complex gradient and Hessian-vector multiplication in the framework of C-R theory (see Section 4.1).It is more easy to deduce the above gradient and Hessian expression by following the Fréchet derivative instead of the definition ( 16) and (17).The theory deals with real-valued functions of complex variables.The objectives here are the cases.In next section, we develop several optimization methods which are similar to standard optimization method of real variables.It is directly implemented by complex gradient, not by representing the complex gradient in its real and imaginary parts.

Unconstrained Optimization Method
In this section, we consider to minimize the real-valued function f (z), where z ∈ C N .The optimization method is usually carried out with respect to the real and imaginary parts of these complex variables.This approach is used in literature [23] for wavefront phase retrieval problem.The reason for this approach is to avoid difficulties with the definition and interpretation of the gradient and Hessian with respect to complex variables [38].In next subsection we review the C-R theory of complex gradient and Hessian of real-valued function of complex variables.Those concepts are first described by researchers in signal processing, in which a number of problems with real-valued function of complex variables arise.2N with its real parts and image parts.Formally the conjugate coordinate derivatives are defined as

Complex Gradient and Hessian
Those definitions follow standard notation from multivariate calculus in which derivatives are row vectors and gradients are column vectors.
Definition 1. Assume that all partial derivatives of a real-valued function f of complex variables z with respect to x and y exist.The complex gradient of f is defined as Similarly, we define then the complex Hessian matrix is given by Actually, f is not analytic, for the reason that it is not satisfied the Cauchy-Riemann condition.Without ambiguity, the subscript f and superscript C of H may be omitted.If f is real-valued, then Because of those equalities, we only need store ∂f ∂z , H zz and H zz [18].Optimization method is deduced from the Taylor's approximation expansion, which takes the form From ( 18), we can define the Hessian operator or Hessian-vector multiplication at point z of function f : The above definition matches the Fréchet derivative as indicted in (32).

Line Search Methods
Considering the minimization of a real-valued function f (z) of complex variables, it is more convenient to operate the complex gradient directly.The straightforward extension for optimization of function of complex variables details in this section.Line search methods construct a sequence The basic method is first to choose a descent direction d k ∈ C N , then to minimize a one-dimensional optimization problem with some line search scheme to find the step length α k ∈ R at kth iteration.
where g k = ∇f (z k ).With holding global convergence to local minimizer, step length α k is not arbitrary.
There is usually a requirement of α k , which are known as Wolfe conditions: where condition 0 < c 1 < c 2 < 1 is satisfied.Generally we take c 1 = 10 −4 , c 2 = 0.9 as commended in book [29].Equations (22a) and (22b) are known as the sufficient decrease and curvature condition respectively.The former ensures sufficient decrease of the objective function, and the latter ensures the gradient converges to zero.
According to the choices of descent direction d k , we introduce some common line search methods.The simple case is the steepest descent method, which takes d k = −g k .The steepest descent (SD) method choose the fastest descent direction at every iteration, but its asymptotic rate of convergence is inferior to other methods.For poorly conditioned problems, it increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.To accelerate the rate of convergence, nonlinear conjugate gradient (NCG) method is widely used.The conjugate gradient direction d k is generated by the recurrence relation where d 0 = 0.There are a variety of options to choose parameter β k for nonlinear problem [37].In this paper, we take the Hestenes-Stiefel form Since we have the exact Hessian-vector multiplication (13c) and (14b), so we can implement the truncated Newton (TN) method, where the Newton direction is obtained by solving the equation using conjugate gradient method.Unfortunately, the corresponding Hessian matrix is not always positive definite.So in this case, we should adopt the negative curvature direction, see [29,Chapter 6].Newton method is considered as a two-order method provided initial point is in the neighborhood of the minimizer.It is shown in our numerical test that TN method is not robust and performs worse compared with other methods.
We deduce the LBFGS method for optimization of function of complex variables.LBFGS has advantages of better rate of convergence and less memory need than BFGS method.It is as well as easily implemented by two-loop recursion with vector-vector inner product.We begin with the BFGS formula of updating the approximation Hessian matrix.
If the number of variables N is large, the cost of storing and manipulating the matrix B C k become prohibitive.The famous LBFGS method is appropriate for large-scale problem, and the descent direction d k can be obtained by the easy two-loop recursion, which is described in Algorithm 1.

Algorithm 1 L-BFGS two-loop recursion
Its performance depends on the number of storing pair {s i , y i } m, and usually m takes 10, 20.In wavefront phase retrieval problem, the numerical performance is not sensitive to m, in which m takes 2. It is shown that its performance scales well and takes the priority over the NCG method in our numerical tests.
Another way to achieve global convergence is to use the trust region technique to determine a search direction and step length simultaneously.The approach can be easily extended to function of complex variables.We omit the extension for the reason that its performance is similar to truncated Newton method.
It is noted that the major differences between the optimization of function of real variables and optimization of function of complex variables are that we use the complex gradient instead of its real and imaginary parts and we take the real parts of some complex quantities.The generalization of the line search method to complex gradient can be proven in the framework of C-R calculus.It is easy to verify the correction, we omit the details with consideration of paper limitation.

Global Convergence of Wavefront Phase Retrieval
The below theorem states the global convergence to a stationary point of function of complex variables.Its proof is the same as the version for function of real variables.
Theorem 1. Suppose that f : C N → R is bounded below in C N and that f is continuously differentiable.And assume that the gradient ∇f is Lipschitz continuous, that is, there exists a constant L > 0 such that The iteration z k+1 = z k + α k d k , where d k is a descent direction, i.e., Re (d * k ∇f ) < 0, and α k satisfies the Wolfe conditions, then z k converges to a stationary point z † of f , i.e., ∇f (z † ) = 0.
Proof.We denote the gradient ∇f (z k ) as g k .From (22b) we have that and the Lipschitz condition (25) implies that . Combining these two relations, we obtain .
From the Wolfe descent condition, . By summing this expression over j = 0, 1, . . ., k, we obtain , where c = c 1 (1 − c 2 )/L.Since f is bounded below, we have that which concludes the proof.
Next theorem states that the gradient of the MLP model and LS model are Lipschitz.
Theorem 2. The gradient of function J m , J m are global Lipschitz for all m = 0, 1, . . ., L − 1, and the constants are L m = 1+ I m ∞ /ǫ 2 and L m = 1+ M m ∞ /ǫ for MLP and LS model respectively.Considering the physical setting of wavefront phase retrieval problem in phase diversity setting, given the data set I i , i = 0, . . ., L − 1 or M i , i = 0, . . ., L − 1, the overall optimization functional (4) with both MLP and LS models are Lipschitz.
Corollary 3. Since MLP and LS model have the property that its gradient is Lipschitz, so the steepest descent method is global convergence to a stationary point.

Least Squares Model is Better
In this section, we do analyze the eigenvalues of the complex Hessian matrix of the two models.It is motivated by the fact that the rough rate of convergence of optimization methods depends on the distribution of eigenvalues of the Hessian.
We can form the Hessian matrices of MLP model and LS model.One-dimensional discrete Fourier transform F (u) can be expressed in matrix-vector multiplication as F (u) = W u, where W is the corresponding DFT matrix.Without loss of generality, we assume the problem is one-dimensional. 3The diversity image formulation operator F m (u) can be written in matrix form F m (u) = W D m u = U m u, where diagonal matrix D (|d ii | = 1) signs the phase shift from the phase diversity. 4We first analyze the eigenvalues of every diversity image term E m .
The complex Hessian matrices are directly obtained from their Hessian operators respectively: • For MLP model, its Hessian matrix is given by • For LS model, its Hessian matrix 3 Define operator vec(•) is to stack a two-dimensional array to a column vector.For two-dimensional discrete Fourier transform, its compact matrix form is given by vec(F (u)) = (W T ⊗ W ) vec(u), where ⊗ is the Kronecker product. 4The unitary matrix U is not the optical field.
For convenience, we use symbol diag( * ) to denote the diagonal matrix without concern for the specific content.
For the particular form of the Hessian matrix H m , the following theorem is straightforward.
Theorem 4. Given vectors r ∈ R N , and c ∈ C N , then the eigenvalues of Hermitian matrix are real and can be worked out as r i ± |c i |, i = 1, 2, . . ., N .Then Hermitian matrix with unitary matrix U ∈ C N ×N share the same eigenvalues of H 1 .
Proof.Without loss of generality, we assume that |c i | = 0.Because similarity matrices share the same eigenvalues, we utilize the elementary row transform P i = I 2N + ci |ci| e i,i+N and elementary column transform it is obvious that H 2 and H 1 share the same eigenvalues.
The following corollary characterizes the eigenvalues of the Hessian matrix for the two models respectively.
Corollary 5.For the above two models, the eigenvalues of their Hessian matrices are And if in every iteration the point u satisfies the inequality F m (u) i ≥ I m,i for all i = 1, 2, . . ., N and all m = 0, 1, . . ., L − 1, then the corresponding objective functions are convex.
First we fix the m and consider just the matrix H m under the assumption that the iteration point u k is in the neighborhood of the solution to wavefront phase retrieval problem.For the two models, the distribution of eigenvalues of of LS model is more clustered than MLP model.For comparison, we multiply the eigenvalues by two, thus both the maximum eigenvalues approximate to two, when u is the minimizer such that F m (u) = I m .After multiplying by two, the maximum eigenvalue of Hessian of LS model is smaller than two, while the maximum eigenvalue of Hessian of MLP model is generally greater than two, if there exists an index i such that |F m (u)| 2 i ≤ I m,i (It is usually true for the nonconvex objective).In another direction, the minimum eigenvalue of Hessian of LS model is smaller than MLP model by the inequality: So LS model should have better performance in numerical simulation, since the distribution of its eigenvalues is generally more clustered than MLP model.This point can be demonstrated in Section 6 by numerical test.
From the eigenvalue analysis, we can prove the gradient is Lipschitz, and we state the following lemma.
Lemma 6.For a real-valued function f of complex variables, if its complex Hessian matrix H C satisfies H C 2 ≤ L, then the gradient ∇f is Lipschitz with constant L.
Now it is easy to show the proof of Theorem 2.
Proof of Theorem 2. From (28a) and (28b), for every diversity image m, the norm of Hessian is controlled by By the fact that given Herimitan matrices A, B and C = A + B, the inequalities of eigenvalues hold, i.e., So we have the equalities of the norm of overall Hessian: So the gradients of MLP and LS models are Lipschitz.

Numerical Test
In this section, The facts that the LS model is the best objective and the performance of LBFGS algorithm is superior to other line search methods are first demonstrated by numerical simulations for noiseless data for two different types of wavefront phase retrieval problems.Then we apply the LBFGS algorithm for minimizing the LS model objective to three wavefront phase retrieval problems (except the aforementioned two types, the more physical James Webb Space Telescope (JWST) model is included).We present the results of three types of problems for noiseless data and noisy data.

Setup and Error Measurement
The three types of wavefront phase retrieval problems we consider here are shown in Figure 1.Three types have different pupils and wavefront.For easy reference, we called the types are (i) Zernike type, (ii) von Karman type and (iii) JWST type respectively.The pupils are of the regular shape annulus (i) and disc (ii) and segmented (iii), which the JWST type is the reduced configuration for the true physical setting.The simulation phase aberration for Zernike type is presented by Zernike polynomials [24] in annulus pupil.Just for validating the performance, we construct the phase using 13th annulus basis polynomial with coefficients 0.1.Phase aberration for von Karman type is used to simulate atmosphere perturbation, which can be described by Fourier transform of a random process.And wavefront for JWST type is from the NIRCam data. 5PVs (peak-to-valley) of the wavefront aberration are 0.58, 1.02 and 1.47 respectively, and RMSs of aberration are 0.10, 0.19 and 0.21 in unit of wavelength respectively.In those wavefront phase retrieval problems we assume pointwise amplitude of the wavefront u equal to unity.To recover the wavefront (phase, more precise), just two diversity images are used, both two are out-of-focus with defocus −3 and 3 respectively.The perturbation ǫ can be taken as small as possible, here we take ǫ = 10 −14 .
We denote the solution to the wavefront phase retrieval problem as u, and û is the returned solution by the line search method, the relative root mean squared error (RMS) is defined as where c is to get rid of the effect of the constant phase shift of phase problem.From the constant c is given by In line search methods, for stop criteria, we take the three options: the number of iterations is 150, the tolerances for relative change of objective function T olF un and relative change of variable T olX during one iteration both are 10 −12 .The number of stored vector pairs for LBFGS method m = 2. Owing to the nonconvexity of the objectives, the behavior of the algorithms varies considerably depending on the initialization, hence we take 10 runs for every algorithm from random initial guesses.Those guesses all have unitary amplitude in the pupil with random phase uniformly distributed on (−π, π].

Comparison of Line Search Methods
We first demonstrate our results with the small size problems, i.e., Zernike and von Karman types, for noiseless data as a proof-of-concept.The performance of the four line search methods (SD, NCG, TN and LBFGS) applied to Zernike and von Karman types based on the LS model is shown in Figure 2 for an exemplary run.RMSs of all algorithms are below 10 −5 .The steepest descent method is comparatively slow in all algorithms, while truncated Newton method have the fastest convergence rate near the solution.Note that the number of iterations of TN method is greater than NCG and LBFGS methods for von Karman type, the behavior is due to the indefiniteness of the Hessian matrix.At those points we should adopt the negative curvature direction, the frequency approximates by 87% in an exemplary run.
In terms of the iterations, NCG method takes fewer iterations in four algorithms.It is worth noting that the limiting calculation for the wavefront phase retrieval problem is the Fourier transform, which is accomplished with the fast Fourier transform (FFT) algorithm.The average number of FFT calls for the four methods in 10 dependent runs is shown in Table 1.Though NCG method takes the advantage of fewer iterations, the number of FFT calls is greater than LBFGS method. 6The large number of FFT calls of TN method results from solving the Newton direction by CG method, which is the most exhausted computation.We also test the easy-implementation Misell method [28] to solve the wavefront phase retrieval problem, it stagnates earlier and the relative RMS is still above 0.99 after 500 iterations.
Its gradient and Hessian operator are given in the following: .
MLP, LS and LSI models are applied to Zernike and von Karman types is shown in Figure 3.The decrease of LS model is the fastest and followed by MLP model, LSI model is the slowest.The theory in Section 5 is validated by the numerical tests.LSI model is not appropriate for data errors with a non-Gaussian distribution from the view of statistics.A squared norm data misfit functional can be interpreted by maximum likelihood estimation for data with Gaussian noise, see [30].MLP model is more appropriate for Poisson noise.Note that the LSI model is used to solve the ptychographic phase retrieval problem [32].It performs well in the thousands of multiple measurements.If there are not many measurements, the LSI model can not use to solve the phase retrieval problem.

Noisy Measurements
For noisy case, we use Poisson process to simulate this noise model.The instability of the wavefront phase retrieval problem with different defocus is shown in Figure 4.When given two diversity images and the signalto-noise (SNR) is 10, the behavior of instability of inverse problem can be observed.While the error misfit function of LS model decreases versus iterations.RMS decreases in first tens of iterations, after arriving a minimum, RMS then increases in the sequential iterations.So we follow the Mozorov's discrepancy principles to terminate the iteration, i.e., we should stop the iteration when the Morozov's postpripori discrepancy principle has satisfied, like the Landweber iteration for linear inverse problem.The minima of RMS depend on defocus setting.The dependent is not sensitive.With the facts that the best choice for error misfit functional is LS model and the best algorithm for minimizing LS model is LBFGS algorithm.We explore the performance of the LBFGS algorithm applied to three types wavefront phase retrieval problems for noisy data.We repeat the LBFGS ten times with different random initials.The two diversity image data is added random Poisson noise for three different SNR levels, i.e., 30, 20 and 10.Since the data is noisy, we terminate the iteration by Morozov's principle.The recovery images for three SNR levels case and noiseless case are illustrated in Figure 5.To be clear, the wavefront errors for SNR levels (20 and 10) are also illustrated in Figure 6.The minimal values of RMS for the solution are approximately 0.22797, 0.10515 and 0.046935 for noise levels SN R = 10, 20 and 30, respectively.From those RMSs, we can roughly compute the order of convergence for the wavefront phase retrieval problem, which is used to describe the rate of the solution in noise case that approximates the true solution in noiseless case when noise level SN R decays to zero.The numerical rate of convergence is 0.34, 0.33 and 0.20 for Zernike type, von Karman type and JWST type, respectively.We also explore the numerical performance of LBFGS algorithm applied to three wavefront phase retrieval problems, when given only one diversity image instead.The diversity image data is measured at defocus 3.This physical setting has been used by Fienup to characterize the spherical aberration of Hubble Space Telescope [12], in which Zernike parametrization of the wavefront is used instead of pixel parametrization.The solution can also be obtained with more iterations than in two diversity images case.The algorithm may fail with some kinds of initials.Numerical solution to wavefront phase retrieval problem is more easily obtained in the defocus setting.The uniqueness of the solution of the problem in defocus setting accounts for the commonly known behavior, see the recent paper [25].

Conclusion
The wavefront phase retrieval problem given diversity images, a special example of phase retrieval, is considered in this paper.We formulated the problem as a nonlinear nonconvex optimization problem with appropriate data misfit functional.
We proposed three data misfit functionals (MLP and LS models), that metric the error between noisy data and predicted data from the formulation model to minimize.MLP model is based on maximum likelihood estimation on account of the Poisson statistics of data and LS model is then derived by Taylor expansion.According to Frechét derivative, we deduced the gradient and Hessian operator for two objective models respectively.These analytic expressions can be incorporated in optimization algorithm.
We discussed how standard iterative line search methods can be applied to solve the optimization problem.Since the variables of functional are complex, it is more convenient to optimize with the complex gradient than with respect to real and imaginary parts, which is used in the review paper [23].
After introducing the complex gradient, we derived several line search methods: the steepest descent and nonlinear conjugate gradient and truncated Newton and LBFGS algorithm, which were applied to the optimization problem.We analyzed the eigenvalues of complex Hessian of three objective models and inferred that the LS model is the best in all three models in view of its smallest condition number.We validated our results by numerical simulations and suggested that LBFGS algorithm give the best performance on a size of 1024 × 1024 JWST type problem.We found that the LSI model can not be used to solve the wavefront phase retrieval problem, while it efficiently solved the ptychographic phase retrieval problem [32].
The method can be easily extended to the problem that simultaneously recover the amplitude and phase in phase diversity setting [5,31].As demonstrated in the literature [20], PhaseLift often works much better if a priori knowledge of sparsity of the signals is incorporated.Despite the already very good performance of the LBFGS methods for our test problems, it is worth investigating whether their performance can be improved by exploring the sparsity structure.The success of the optimization method depends on the exact known defocus.We found that if the defocus is inaccuracy with the error above 15%, LBFGS algorithm failed.Other approaches for phase retrieval with multiple measurements, such as introducing random masks [9,10], take the same disadvantage.So adjoint optimization with respect the wavefront and defocus is worth investigating in the future.
The LBFGS method can be directly applied to the generalized phase retrieval problem.It should show better performance than Wirtinger Flow [6], in which the line search strategy is not applied.We are carrying on the work, which will be published elsewhere.Hessian operator can be found in a similar way, It can be validated that D 2 J [u](•, •) is a bilinear functional defined on C N × C N .

A.3 Gradient and Hessian Operator for LS Model
Misfit function ( 12) can be written as So derivative (35) and follow by chain rule, it yields So gradient and Hessian-vector multiplication of LS model are given by (14a) and (14b).
Function u is replaced by a vector u.Given the data set I obs 0 and I obs m , m = 1, 2, . . ., L − 1, we formulate the following optimization problem u := arg min u∈C N where S(•, •) is the data misfit functional, I m are the predicted data from image model I m = |F (u; d m )| 2 and I obs m are the observed data.

2
If we normalize the misfit functional by subtracting the minimal value I obs i − I obs i log I obs i , we obtain the Kullback-Leibler divergence KL I obs , I = N i=1 I i − I obs i − I obs i log I i I obs i .
with the scaling suggested by Shanno and Phua γ = Re(y

Figure 2 :
Figure 2: Comparison of algorithms applied to (a) Zernike and (b) von Karman data, an exemplary run is shown.Top row plots RMS versus iterations, bottom row shows the error misfit of LS model versus iterations.

Figure 3 :
Figure 3: Comparison of different objectives for two simulation data: (a) Zernike, (b) von Karman.RMSs of the solution versus iterations are plotted.

Figure 4 :
Figure 4: Instability of wavefront phase retrieval problem given two diversity images: (a) Zernike, (b) von Karman.RMSs versus iterations are plotted in the top row, error misfit of LS model versus iteration is plotted in the bottom row.

Figure 5 :
Figure 5: Recovery wavefront for three data: (a) Zernike, (b) von Karman, (c) JWST in different SNR level.Top row is without noise, then the SNR level decreasing from 30 to 10. RMSs are described below for every figure.
2, . . ., N to reduce the matrix H 1 .Iteratively left-multiplying and rightmultiplying H 1 by P i and P −1 i yields

Table 1 :
Total average number of FFT calls for different methods in 10 dependent runsHow to choose the objective to minimize?It is believed that choosing appropriate misfit functionals in designing efficient numerical algorithms for inverse problem.In this subsection, we give the performance comparison of the two models (MLP and LS) and the common used LSI model.The data misfit functional is just the least squares of the intensities which are directly recorded.So its objective E m (u) (see equation (4)) of LSI model is J m