Estimation and uncertainty quantification for the output from quantum simulators

The problem of estimating certain distributions over $\{0,1\}^d$ is considered here. The distribution represents a quantum system of $d$ qubits, where there are non-trivial dependencies between the qubits. A maximum entropy approach is adopted to reconstruct the distribution from exact moments or observed empirical moments. The Robbins Monro algorithm is used to solve the intractable maximum entropy problem, by constructing an unbiased estimator of the un-normalized target with a sequential Monte Carlo sampler at each iteration. In the case of empirical moments, this coincides with a maximum likelihood estimator. A Bayesian formulation is also considered in order to quantify posterior uncertainty. Several approaches are proposed in order to tackle this challenging problem, based on recently developed methodologies. In particular, unbiased estimators of the gradient of the log posterior are constructed and used within a provably convergent Langevin-based Markov chain Monte Carlo method. The methods are illustrated on classically simulated output from quantum simulators.


Introduction and motivation
Quantum computing holds a great promise to enable future simulations of quantum systems that would otherwise suffer from the curse of dimensionality when simulated classically.However, we are currently entering the Noisy Intermediate Scale Quantum (NISQ) device era [40].The hallmarks of NISQ era are relatively small (10-50 qubits) quantum computing devices and noisy qubits.Therefore, to succeed with quantum simulation on NISQ quantum computers, it is imperative to understand and fully characterize the effects of noise on the state of a quantum device.
In quantum mechanics the state of a d-qubit quantum device is generally described by a density matrix ρ -a 2 d -by-2 d complex-valued positive semidefinite matrix with Trρ = 1.
From the definition, it takes 2 2d − 1 real parameters to fully specify an arbitrary quantum state.If O denotes a quantum observable, described by a 2 d -by-2 d Hermitian matrix (O † = O), then measuring O yields one of its eigenvalues sampled from a distribution determined by ρ.In particular, the probability of the ith eigenvalue is p i = ψ i |ρ|ψ i , where |ψ i is the ith eigenvector in the bra-ket notation.Since there are at most 2 d eigenvalues, repeated measurements of any observable yields information about at most 2 d −1 parameters of ρ.To estimate every parameter of the state one must measure at least 2 d +1 independent observables.This approach is known as quantum state tomography [37,24].Its applicability is limited to small quantum devices due to exponential scaling in the number of measurements and related issues of solving exponentially large systems of equations.Bayesian methods also have been suggested [29,5,25,20] to facilitate uncertainty quantification (UQ) in the context of quantum state tomography.However, the existing Bayesian quantum tomography solutions face similar scalability challenge as the number of parameters to infer still grows exponentially with the size of a quantum device.
Here we consider a fundamentally new approach, which provides a scalable solution to UQ for the output of quantum simulations.We will restrict our attention to a simplified scenario in which the density matrix and observable commute, hence are simultaneously diagonalizable, and the eigenvectors are known.In this way we are able to treat the quantum state, which is represented by the density matrix, as a classical probability distribution, given by the diagonal of the density matrix.Note that the standard quantum state tomography methods would still scale exponentially even for this simplified scenario.
To avoid this problem we use the maximum entropy (MaxEnt) principle [27,28] to model a complete probability distribution based on a selected subset of observed or exact moments.We then use well-established stochastic simulation algorithms [43,30,13,14] to find the MaxEnt distribution with a cost which is polynomial in d.Adopting a MaxEnt ansatz for the likelihood provides a Bayesian posterior which (i) converges to the MaxEnt distribution as the number of observations tends to infinity, and (ii) can be simulated with a cost which is polynomial in d using recent stochastic simulation methods [32,42,49,45].It is the topic of ongoing investigation to generalize the results to density matrices, which will be reported in a future publication.It is hoped that a polynomial cost in d can be preserved in this case as well, for example combining the methodology here with that of [22,8].
Assume that we know the basis in which the output state from a d-qubit quantum computer is diagonal.The density matrix in that basis reads, where {|ψ j } 2 d j=1 comprise a basis in the Hilbert space of the quantum computer C 2 d .We also assume the observable O that we measure is diagonal in the same basis i.e.
(2) O = where {o j } 2 d j=1 are the eigenvalues of O and p j = ψ j |ρ|ψ j is the probability of observing state a measurement outcome corresponding to the eigenvalue o j .
Without loss of generality we can interpret the probability distribution p as a classical probability density p(x) over the binary random variable x ∈ {0, 1} d .The mathematical task at hand is then to find p(x) consistent with the observed outcomes {o j }.A canonical example of such density in our context is the Ising model which is parametrized by some matrix A ∈ R d×d .Given first and second moments of a distribution Π on {0, 1} d , the MaxEnt distribution is the Ising model (3) which satisfies the moment constraints [11] .With finitely many observations, one can replace the moments with observed empirical moments, and the corresponding MaxEnt distribution is the same one which maximizes the likelihood under a MaxEnt ansatz for its parameterized form [34]. Through a Bayesian framework [21], this likelihood can be combined with a prior on the parameters, resulting in a posterior distribution, which quantifies the uncertainty in the distribution given the observations.We will develop and implement novel algorithms for the solution of these problems with a cost which is polynomial in d, based on recently developed technologies mentioned above.Now we proceed with a more rigorous mathematical setup, and a summary of the results of the paper.

Mathematical Setup and Summary of Results
Let E = {0, 1}, and denote the state given a particular value of λ, and the objective of this work is to identify λ.We will denote the un-normalized measure by Let φ j : E d → R, for j = 1, . . ., J, and define φ = [φ 1 , . . ., φ J ] T : E d → R J .It is assumed that, for a particular unknown value of λ, we are given either (P1) (P2) M independent and identically distributed (i.i.d.) noisy observations of this process y (i) ∼ Π(•|λ), for i = 1, . . ., M .As mentioned above, the aim in either case is to reconstruct Π by identifying λ.
First, consider problem (P1).Assume we seek the probability distribution p which maximizes the entropy ( 5) subject to the constraints (4), with p replacing π(•|λ).This is in fact a convex optimization problem.Therefore, if there exists a probability distribution p which satisfies the constraints (4), then the necessary first order optimality conditions are also sufficient for global optimality [7].For this particular problem, the first order conditions can be simplified to p(x) = π(x|λ) = 1 Z(λ) q(x|λ), where ( 6) and the vector of Lagrange multipliers λ = [λ 1 , . . ., λ J ] T is the solution to the system of equations ( 7) This is referred to as the method of maximum entropy (MaxEnt).See Chapter 11.1 of [11] for the derivation.A density which is parametrized in the form ( 6) is more generally referred to as a MaxEnt ansatz [34].
In general, the system of equations (7) does not have a closed form solution, and it is a challenging problem to solve, particularly for large d, as one evaluation of the system would consist of O(2 d ) operations.The issue is known as the curse of dimensionality, and precludes the use of standard iterative algorithms for solving the problem.We propose to utilize a Robbins Monro type [43,30] stochastic iterative algorithm for its solution, together with a sequential Monte Carlo (SMC) sampler [15] unbiased estimator of a system of equations equivalent to (7), Q(m − φ|λ) = 0, within each iteration of the Robbins Monro.It is well-known that SMC samplers are stable with O(d 2 ) operations for such targets which can be evaluated pointwise (in x) [3], thus bypassing the curse of dimensionality.That is the first major contribution of the paper.Now, consider problem (P2), concerned with recovering π(•|λ) from the i.i.d.observations Y := {y (i) } M i=1 .Following the MaxEnt ethos, we choose an appropriate set of functions φ as above (4), based on our assumptions of the underlying quantum system, and adopt the ansatz (6) for the likelihood of a single observation, given λ: P(y (i) |λ) = π(y (i) |λ).For example, one could use the set of first and second moments φ(x) = vec({x i x j } 1≤i≤j≤d ).The likelihood of Y := {y (i) } M i=1 given λ is given by ( 8) The maximizer of ( 8) is called the maximum likelihood estimator (MLE).It is equivalent to the maximizer of log P(Y |λ).We have where The necessary first order optimality condition is vanishing gradient, i.e.
(11) Π( m − φ|λ) = 0 , which is clearly equivalent to to the MaxEnt solution (7), except with empirical moments replacing exact moments.This problem can therefore be solved using the same method described above.This solution is still somewhat unsettling.Even if we are content with our choice of φ, we have arrived at a point estimate for inference given a finite set of observations.The scrupulous Bayesian will not allow the story to end here.Indeed for the sake of uncertainty quantification (UQ), we require a posterior distribution over π(•|λ), or equivalently, a posterior distribution over λ.In order to accomplish this, we complete the picture with a prior on λ, giving (12) P(λ|Y where the constant of proportionality depends only upon the observations Y .The second major contribution of the paper concerns quantifying a posteriori uncertainty using this approach.Due to the fact that the likelihood P(Y |λ) itself has an intractable normalizing constant Q(1|λ) M , this notoriously difficult problem is referred to as doubly-intractable [33,35].
If we could at least evaluate a non-negative and unbiased estimator of the likelihood P(Y |λ), or equivalently (12), then there are algorithms which can be used, such as pseudomarginal Markov chain Monte Carlo (MCMC) method [2,1].The recent work [31] proposes to construct unbiased but signed estimates of the target (12), and then they use such estimate within pseudo-marginal MCMC by considering the absolute value of the estimate in the acceptance ratio and then storing the sign of the accepted samples in order to correct the ultimate estimator.The work [48] proposes to use alternative unbiased estimators in this same framework.Note that earlier work also appeared which allows one to sample from such a target [33,35], however these works require being able to sample from π(•|λ) for a given value of λ.This is possible in some cases, for example the Ising model (or Boltzmann machine) [41,9], but not in general.The works [31,48] are generally applicable.
We will take a new approach to doubly intractable simulation in the present work.A number of recently developed methods require only unbiased estimators of (9).These include the stochastic gradient Langevin dynamics (SGLD) [49], the zig-zag sampler [4], or the bouncy particle sampler [6,39].We propose to construct an unbiased estimator of (9) using the debiasing technique from the recent works [32,42], and then use this unbiased estimator within such MCMC methods.This is the second major contribution of this work.In particular, we constrain our attention here to the SGLD.
The rest of the paper is organized as follows.In Section 3 the SMC sampler is introduced, which will be used as a crucial component within all subsequent algorithms.In particular, it will be shown how to construct consistent estimators of Π(ϕ|λ) and unbiased estimators of Q(ϕ|λ).In Section 4 we introduce the Robbins Monro algorithm, and present the first main result regarding its implementation using the SMC sampler.Section 5 concerns Bayesian posterior UQ in the context of finitely many observations.In Section 6 we simulate data from a toy model, and illustrate the algorithm from Section 4 for computing the MaxEnt solution using both exact moments and observed moments (where it coincides with the MLE), as well as the algorithm from Section 5 for UQ in the case of observations.

SMC samplers for estimation of integrals
Here it will be described how to estimate expectations Π(ϕ|λ) := x∈E d ϕ(x)π(x|λ), for bounded functions ϕ : E d → R, and construct non-negative unbiased estimators of Q(ϕ|λ) = x∈E d ϕ(x)q(x|λ), for a given value of λ, using an auxillary variable technique.
Let η 0 = Γ 0 be a known probability distribution over a general state space U , and let η j be some probability over U for j = 0, . . ., J, whose density can be evaluated up to a normalizing constant, but cannot be sampled from directly.Define the unnormalized measure associated to η j by Γ j , with density γ j , so that (13) η j (dx) = Γ j (dx)/Γ j (1) , The objective is to approximate expectations for bounded functions ϕ : We cannot sample from this distribution, but we can obtain a convergent estimator using SMC samplers, as described below [15] (see also [26,36,10]).
Remark 3.1.In the context of this paper the target η J = Π(•|λ) will be considered in each inner iteration of the stochastic iterative algorithm.So γ J (x) = q(x|λ).To make the presentation more concrete, let γ 0 ∝ 1 and However, note that the choice of intermediate targets γ j to interpolate between an initial density γ 0 , which can be sampled from exactly, and γ J are essentially arbitrary.For example, one could incorporate several constraints per iteration or anneal in between individual constraints.
3.1.Estimating the target.Define Let M j denote a Markov kernel such that Iterate Algorithm 1 and define Algorithm 1 SMC sampler Let x i 0 ∼ η 0 .For j = 1, . . ., J, repeat the following steps for i = 1, . . ., N : The constant C(j) can be made to be independent of j if an additional strong mixing condition is assumed on M j , uniformly in j [13].
Observe that using Algorithm 1 we can construct an estimator by Recall that for any ϕ : U → R we have Γ J (ϕ) = Γ J (1)η J (ϕ) by definition (13).Define ( 16) The following proposition is proven in [13].The proof is reproduced in Appendix A for completeness, due to its importance to the results of this paper.

Stochastic approximation
Here it will be assumed that we are given values m = Π(φ|λ) ∈ R J , i.e.Problem (P1).As mentioned in Section 2, the MaxEnt approach entails solving the convex optimization problem (5) (which has domain {x; p(x) > 0}) subject to the constraints x∈E d p(x) = 1, and (4).It is necessary and sufficient to solve the equation (7), reproduced below As mentioned above, a naive computation of the left-hand side of the above equation for a single value of λ would incur a cost of O(2 d ), which makes the problem intractable for very small values of d > 15 or so.It is however possible to obtain a noisy estimate using a stochastic simulation algorithm.Algorithms such as MCMC [23,18] are designed to sample the state space in O(d) steps [44], and provide a consistent estimate of (17) for a given value of λ.
Returning to (17), observe that solving this equation is equivalent to solving Q(m − φ|λ) = 0. Assume we can simulate unbiased estimates F (λ), i.e. estimates such that E F (λ) = Q(m − φ|λ).The Robbins Monro algorithm [43] for stochastic fixed point iteration for finding a root of the equation Q(m − φ|λ) = 0 takes the form ( 18) Under technical conditions, and for appropriate choice of δ k such that k δ k = ∞ and k δ 2 k < ∞, it can be shown that this algorithm converges to a fixed point of (17) [43,30] without ever solving the equation exactly, even once.
We make the following assumption here, which is sufficient to ensure Assumption 3.1 Assumption 4.1.There exists a c > 0 such that for all λ ∈ Ξ and , where λ k is given by (18) and λ * is the solution to Q(m − φ|λ * ) = 0, hence (17).
Consider now running the SMC sampler Algorithm 1 with the choice γ 0 ∝ 1 and With this unbiased estimator of Q(m − φ|λ), we can use Robbins Monro to solve the fixed point problem for λ (17), hence the MaxEnt target π(x|λ).The following theorem makes this statement precise.Theorem 4.1 (Stochastic approximation with SMC).Assume 4.1 holds, and let λ k be generated by the iteration (18), where F (λ) := Γ N J (m − φ|λ) is independently generated from an SMC sampler algorithm at each iteration, as defined in (16).
The conclusion is an immediate consequence of Proposition 4.1.
Assume the evaluation of φ j is O(1).Then the evaluation of γ j is O(j), and since there will be J steps, the total cost of the algorithm will be O(J 2 ).Since λ ∈ R J , one may require up to O(J) steps of Robbins Monro.Therefore, a conservative estimate on the cost is O(J 3 ).Recall that J is the number of constraints.For p th moments, there are J = d+p−1 p d p constraints, yielding a conservative estimate on the final cost of O(d 3p ).The constant should be rather small however, and this is of course a vast improvement on O(2 d ) for even the best linear algorithm for solution with exact evaluation of (17).

Uncertainty Quantification
In the previous section we arrive at a single optimal value λ * , and a single MaxEnt target likelihood, i.e. uncertainty is not quantified.Here we venture to quantify uncertainty by putting a prior on λ, and computing (sampling from) the posterior P(λ|Y ) (12).This leads to a so-called doubly-intractable target, as described above.In this section a new method is introduced to sample from the target.In particular, in subsection 5.1 we introduce the debiasing method mentioned above, and describe how to construct unbiased (signed) estimators of Π(ϕ|λ) using estimator(s) from the SMC sampler.In subsection 5.2 we introduce the SGLD algorithm and present the second main result of the paper, regarding its implementation using the debiased estimators.The section is concluded with a description in subsection 5.3 of how to separate an explicit error distribution, which is more expensive and is left to future work.5.1.Debiasing.Some MCMC methods, such as SGLD [49], as well as some recently developed piecewise-deterministic Markov processes (PMDP) [12,4,6,39], require only an unbiased estimator of the gradient of the log-likelihood for implementation.Others, such as pseudo-marginal MCMC, require an unbiased and non-negative estimator of the likelihood itself.It is generally much easier to construct unbiased estimators which are signed [32,42].The work [31] shows how one can utilize the pseudo-marginal MCMC even with a signed estimator.However, one may easily argue that it is more natural to use a signed and unbiased estimator within SGLD or one of the PDMP algorithms.We consider the latter strategy here.
Suppose that we have a collection of random variables {∆ l } ∞ l=0 , where and we are interested in estimating EZ, where Note that (19)  is again an unbiased estimator: The exchange of the infinite sum and expectation is allowed by (19) and Fubini theorem.
Assumption 5.1.Let (19) hold and assume the probability p is chosen such that Proposition 5.1.For either estimator r ∈ {s, t}, let Z i r be i.i.d.draws of the estimator (21) or (22), for i = 1, . . ., N .Under assumption 5.1, E Z i r = EZ and we have, The expected cost is given by where C(∆ l ) is the cost to obtain the realization ∆ l .
See [32,42,31,46] and references therein for proof and further discussion of such approaches.In particular, this approach will allow us to transform consistent estimators to unbiased estimators for use in the algorithms mentioned above.The recent literature on PDMP is quite extensive, and so we defer consideration of these algorithms to future work and focus on SGLD, which will be described below in generality.First we illustrate how to construct such de-biased estimator of (9).5.1.1.Debiased SMC sampler estimator.We now describe one way to construct an unbiased estimator of (9) directly, using this de-biasing trick, without multiplying through by Q(1|λ).In particular, the single term estimator (21) will be considered.
Let N l = N 0 2 2l .Replace step (ii) from the SMC samplers algorithm with xi j = x i j−1 , i.e. omit the resampling step.This algorithm was named annealed importance sampling (AIS) in [36].The estimator ) is an unbiased estimator of Q(ϕ|λ), and the individual terms are independent: x i j ⊥ x i ′ j for i = i ′ and all j = 1, . . ., J.This means in particular that if we construct (24) ηN/4 , from two independent replications, one with N total particles and one with N/4 total particles, then two estimators have the same distribution; in particular, the same expectation.So we have the following proposition.
Proposition 5.2.Define the coupled estimator ∆ l := ηN l J (ϕ) − ηN l /4 J (ϕ), with particles coming from the same realization of the algorithm described above with N l particles.Then E Z s = η J (ϕ), with Z s given by (21).The factor 4 here relates to the specific choice of N l and can be replaced with N l+1 /N l in general.
Proof.Following from the explanation above, the equation ( 20) takes the form η J (ϕ) = ∞ l=1 E∆ l , so the result follows from Proposition 5.1.
Observing that in this case E∆ 2 l ∝ N −1 l and C(∆ l ) ∝ N l , we can observe that this is the so-called sub-canonical case [42].That is, if the rate of convergence of E∆ 2 l were faster (or rate of growth of expected cost slower), then we could choose p l in such a way that both asymptotic variance and cost were finite.As this is not the case, we instead choose p l ∝ N −1 l l log(l), as proposed in [42] (see also [16]), which results in finite variance and expected cost which is finite with high probability.5.2.SGLD.Given a symmetric positive definite matrix P (λ) ∈ R J×J the following SDE keeps p(λ) invariant [19,21] (25) dλ = g(λ)dt + P (λ) 1/2 dW where for j = 1, . . ., J For constant P (λ), it was proposed in [49] to replace ∇ log p(λ) by an unbiased estimator ∇ log p(λ) (i.e.E ∇ log p(λ) = ∇ log p(λ)) and approximate the SDE using Euler Maruyama method with stepsize δ n ∝ 1/n, similar to Robbins Monro, as follows For the moment, assume P (λ) = 1, and ∇ log p(λ) = ∇ log p(λ), and define the measure corresponding to p(λ) by µ.Furthermore, define the empirical measure corresponding to K steps of the constant δ-discretized version of ( 27) by ) and define the limiting diffusion (25) sampled at the same K points iδ, i = 1, . . ., K, by µ K .Then we have ( 28) The discretization error δ 2 comes from the standard Euler-Maruyama strong convergence rate for SDE with constant diffusion.Define the auto-covariance ∞ is a sample from (25) at time δn and λ∞ = λµ(dλ).One expects ρ n ≈ (1−δ) n .Therefore the integrated auto-covariance IACT= 2 ∞ n=0 ρ n ≈ δ −1 , and the sample error is asymptotically (after burn-in) proportional to IACT/K ≈ (δK) −1 .Balancing the error terms, one finds that δ ∝ K −1/3 is optimal, leading to MSE of K −2/3 .It is slightly worse than the MSE of K −1 for K steps of standard MCMC methods.This simple argument suggests that the optimal choice for δ n is δ n ∝ n −1/3 , which is not the same as for SGD.In fact, this result is made rigorous in the work [45,47] (see also references therein), leading to the following proposition.Proposition 5.3.Let P (λ) = 1, define the measure corresponding to p(λ) by µ, choose δ n ∝ n −1/3 , and let λ n be generated by (27).Then for suitably regular ϕ, one has Furthermore, this choice of δ n is asymptotically optimal in terms of cost, and for K steps, one has MSE ∝ K −2/3 .
Then for suitably regular ϕ, one has Furthermore, this choice of δ n is asymptotically optimal in terms of cost, and for K steps, one has MSE ∝ K −2/3 .
Proof.This follows directly from Propositions 5.2 and 5.3.

5.2.1.
Reimann manifold method.It is assumed in [38] that Proposition 5.3 holds also for deterministic evaluable P (λ), yielding a Reimann manifold SGLD (RMSGLD), but this is not proven.In any case, based on the above heuristic one would expect a different scaling of δ n to be optimal, since the strong rate of convergence is reduced to δ, so balancing terms gives δ ∝ K −1/2 and an MSE of K −1/2 .Nevertheless, here we further postulate that (32) holds indeed for the case in which we have an unbiased estimator g(λ) ≈ g(λ) and a non-negative unbiased estimator P (λ) ≈ P (λ).
Let P (λ) = Q(1|λ) = Γ J (1|λ), and let the target be p(λ) = P(λ|Y ) ∝ P(Y |λ)P(λ), so that where 1 ∈ R J denotes the vector of all 1s, and ).Then using the notations of Section 3, (26) takes the form This can be unbiasedly approximated by (33) as can P (λ) by simply P (λ) = Γ N J (1|λ).This approach would allow direct simulation without de-biasing, since only unbiased estimates of the unnormalized measure Q are required, which can be constructed directly from SMC samplers, as described in Proposition 3.2.Nonetheless, this is not pursued further here due to the expected higher cost of an optimal implementation.We note that the cost of these algorithms can be improved using the multilevel Monte Carlo method [17], but this is deferred to future work.

5.3.
Separating an explicit error distribution.Before concluding the section, we mention an extension of the model considered in the rest of the paper which provides deeper UQ.Suppose that we do not have i.i.d.observations y (i) ∼ π(•|λ), but in fact we have observations of the form d, and B(ǫ) denotes a Bernoulli with probability ǫ.In other words, each b ǫ i is independently 1 with probability ǫ and 0 otherwise.This model explicitly accounts for error in the observations, i.e. with probability ǫ any qubit may have been observed wrong.
We have P(d|y, λ) = ǫ |d−y| (1 − ǫ) 1−|d−y| .The likelihood of a single observation d is given by and so we can see that again we have an integral with respect to π(•|λ).Let D = [d (1) , . . ., d (M) ].Due to the independence between observations and errors, we have where we defined ϕ One can define An unbiased estimator of this function can be constructed from M independent SMC samplers, one for each factor.Despite being massively parallel, this is a significant computational burden, particularly for large M .Investigation of efficient algorithms for solving this problem is the topic of ongoing research.
6. Simulations 6.1.Maximum likelihood: exact marginals.Recall the form of F (λ), for a fixed λ from (4.1).Using the machinery from subsection 3.2 we could construct such an unbiased estimator, for any sequence of intermediate targets γ j (x|λ), and any J, provided that γ J (x|λ) = q(x|λ).The natural choice is of course to let as above, and this choice will be used here.The MCMC kernels M j are chosen as Metropolis-Hastings kernels with the reversible random walk proposal q(λ, λ ′ ) defined by λ ′ i = mod(λ i + 1, 2) with probability β and λ ′ i = λ i otherwise, for i = 1, . . ., J.In other words, q(λ, λ ′ ) = |λ − λ ′ | is a vector of independent Bernoullis with probability β, and is hence symmetric, with tunable parameter β determining the step-size.We note that a Gibbs sampler can also be used here.Now we let δ n = 1/n, and iterate (18) until convergence.
In order to test the accuracy of our methodology, we simulate the data as follows.Choose a ij ∈ R, for i = 1, . . ., d and j = i, . . ., d, and let A ij = a ij for j ≥ i and A ij = a ji otherwise.Now consider the true target (36) π( Similarly, let Λ be defined as Λ ij = λ ij for j ≥ i and Λ ij = λ ji otherwise, for coefficients {λ ij } {i=1,...,d, j=i,...,d} .We will identify a, λ with vectors in R J , where Notice this is of MaxEnt form with φ(x) = vec[{x i x j } {i=1,...,d, j=i,...,d} ], i.e. first and second moments are observed.We choose d = 10, let N = 2d, take d MCMC steps with β = 0.6, and define ) −1 ǫn 0 /(n 0 + n), n 0 = 5d, and ǫ = 1.For n ≤ 2d the direction of descent is taken as m−η N J (φ), while for n > 2d the full estimator ( 16) is used F (λ n ) = Γ N J (m−φ|λ n ), so that this is an unbiased estimator of Q(m − φ|λ).This construction prevents large fluctuations in Γ N J (1|λ n ) during early outer iterations from precluding stabilization of the algorithm.The truth, reconstruction, and convergence plot are given in Fig. 1. 6.2.Maximum likelihood: observations.In this section we explore maximum likelihood estimation with finitely many observations and empirical moments.We look at the cases of M = 1000 and M = 50 observations.The results are shown in Figure 2. The top two panels correspond to M = 1000 (left) and M = 50 (right).The good news is that we obtain the MLE, up to the bias arising due to the observational error, quite rapidly: within several hundred iterations for M = 1000 and several tens for M = 50.This is observed in the middle two panels.Notice from the plot for M = 50(right) that the misfit error eventually starts to increase once the algorithm begins to fit noise.This is typical and can be expected.When the algorithm is stopped the error is quite significant.In the top two panels, we observe that the reconstruction for M = 50 does not look similar to the true A (shown in the left panel of Figure 1), however for M = 1000 it looks quite acceptable.Just for a sanity check, observe the bottom two panels which show the observed moments for M = 50 on the left and predicted moments using the reconstruction (top right) on the right.The agreement is quite decent, and reasonable given the noise level of the observations.The point is that the error in the observations translates to a much larger error in the parameters λ.
As an initial experiment we ran the SGD algorithm using (37) with L ∼ p as described in subsection 5.1, the numerator constructed as in (24), and p chosen as described in subsection 5.1.1,with the same prototype problem except with d = 4, and with exact moments.The results are shown in Figure 3 left.The middle and right panels show the results of running the algorithm with the original estimator, and the simple biased but consistent estimator η N J (m−φ|λ), respectively.Note the algorithm with the biased but consistent drift appears to be convergent also.As mentioned earlier, this is done in practice, but we are not aware of theoretical results verifying this.Therefore we will proceed with the SGLD using this unbiased estimator, i.e. we iterate (27) with P = 1 and δ n = n −1/3 , and drift given by (37), with L ∼ p.
The full SGLD MCMC is run with M = 1000 observations on the d = 4 qubit system.It takes roughly two hours on a laptop to obtain K = 10 6 samples.The pairwise marginals on the diagonal diag(Λ) are illustrated in Figure 4.The histograms are constructed by discarding the first half of the samples for burn-in, and resampling the remaining samples according to their weights {δ n }.The true value is also indicated in red in the plots.While the truth is indeed in the region of high probability, and the mean is reasonably close to the truth, the spread of the posterior is still quite significant, and more than one would hope for with M = 1000 observations.This is consistent with the results of Figure 2, where it is observed that the error in the coefficients is amplified quite a bit in comparison to the error in the moments.It is however not feasible to compute the pushforward distribution from the coefficients to the moments here, since such estimation for a single moment would require a separate large sample simulation (e.g. by MCMC or SMC) for each of the K = 10 6 samples.We look at the same simulation with M = 10 6 observations, and the results are plotted in Figure 5.The spread is much tighter, and also there is a strong correlation between the diagonal elements in this case.

Summary and Future Directions
A first attempt has been made to estimate the MaxEnt distribution associated to outputs of a quantum computer, and quantify the posterior uncertainty.This task has lead to a novel use of the SMC sampler for construction of an unbiased estimator of the intractable drift for use in a Robbins Monro stochastic approximation algorithm, which can be used to efficiently compute the MaxEnt distribution for exact moments or the MLE distribution for observed moments.The Bayesian formulation yields to a doubly-intractable target.For its solution we use de-biased SMC sampler estimators of the log-likelihood within a SGLD algorithm to construct consistent posterior expectations.All the algorithms are provably convergent, and numerical simulations are provided using (classically) simulated data from a toy model.This exercise reveals that the posterior uncertainty in the distribution can be significantly amplified with respect to the uncertainty in the moments arising from having finitely many observations.Topics for further research include: (0) the extension of the current framework to density matrices, (i) the incorporation of an explicit error distribution for the observations within the model, (ii) exploration of more complex targets, i.e. more qubits, (iii) MLMC acceleration of SGLD, (iv) investigation of PDMP MCMC methods as alternatives to SGLD, (v) other debiasing strategies and opportunites for acceleration, and of course (vi) investigation of outer problems, such as model selection.

Figure 1 .
Figure 1.Truth (left) and reconstruction after K = 10 4 steps (middle) for known truth.The error as a function of iteration is given in the right plot.

Figure 2 .
Figure 2. The reconstruction with 1000 (left) and 50 (right) observations are given in the top row, along with the corresponding convergence plots in the second row.The bottom row shows the actual second moments m (left), with M = 50 observations, which are used to train the model, and the moments under the reconstruction (right).

Figure 3 .
Figure 3. Convergence of SGD with the debiased estimator described in Section 6.3 (left), the original unbiased estimator F (middle), and the simple consistent but biased estimator (right).

Figure 4 .
Figure 4. Illustration of pairwise marginal UQ for the posterior on the diagonal of Λ with M = 1000 observations and d = 4 qubits.The true value of the parameters is indicated in red.

Figure 5 .
Figure 5. Illustration of pairwise marginal UQ for the posterior on the diagonal of Λ with M = 10 6 observations and d = 4 qubits.The true value of the parameters is indicated in red.
is an unbiased estimator of EZ.Furthermore, if we instead draw L ∼ p, where p is defined such that p ℓ = l≥ℓ pl , then one can see that guarantees that EZ < ∞.If we draw L ∼ p, where p = [p 0 , . . .], then