Spectral methods to study the robustness of residual neural networks with infinite layers

Recently, neural networks (NN) with an infinite number of layers have been introduced. Especially for these very large NN the training procedure is very expensive. Hence, there is interest to study their robustness with respect to input data to avoid unnecessarily retraining the network. Typically, model-based statistical inference methods, e.g. Bayesian neural networks, are used to quantify uncertainties. Here, we consider a special class of residual neural networks and we study the case, when the number of layers can be arbitrarily large. Then, kinetic theory allows to interpret the network as a dynamical system, described by a partial differential equation. We study the robustness of the mean-field neural network with respect to perturbations in initial data by applying UQ approaches on the loss functions.

1. Introduction. The use of machine learning algorithms has gained a lot of interest in the past decades [42,43,74]. Besides the data science problems like data clustering, regression and classification, image recognition and pattern formation, there are novel applications in the field of engineering as e.g. for production processes [57,68,78]. The origins of artificial neural networks date back to the 1940s [33,53]. Neural networks (NN) have been applied to a huge variety of applications like computer vision, speech recognition or robotics. More recently, also applications to mathematical problems in numerical analysis [62,63,83] and optimal control [65] have been studied. The popularity of NN can be partially explained by its data driven ansatz and its simplicity. Having collected sufficient experimental data the NN can be trained and subsequently used as a forward model for nearly all related applications. Most commonly NN are trained by the backpropagation algorithm [30,46,84]. Although several studies deal with reducing the computational costs of NN [24,82], the training procedure is usually computationally expensive especially for large networks. To decide, when a network has to be retrained, quantifying the informative value of the prediction is necessary.
In this work we study the impact of uncertainty in initial data on the NN output. More precisely, we assume that we have a trained NN. We are interested in quantifying the perturbation of the loss function, when the input signal is perturbed, to study the robustness of the network. This subject has received a lot of attention in the past decade and we refer to [73] for a detailed review. On the one hand, NN have been used to improve UQ techniques [44,81], but also uncertainty quantification methods have been applied to NN intensively.
Bayesian neural networks provide an extension to posterior inference, which may be seen as maximum likelihood estimation of the weights [79,3,29]. Also priors on the weights can be imposed [8]. Approximate Bayesian computation (ABC) forms a class of computational methods, which is linked to Bayesian statistics that can be used to estimate the posterior distributions of model parameters [7,49,52,6,39].
In all of these model-based statistical inference methods, the likelihood function determines the informative value. Namely, it expresses the probability of the observed data under a particular statistical model.
Typically, the likelihood function causes numerical challenges. ABC methods circumvent this issue by widening the realm of the models for which statistical inference can be considered. However, they are based on assumptions and approximations, which have to be carefully addressed.
In general, these issues become not less challenging if the number of layers increases. For several residual neural networks, however, studying the limit for an infinite number of layers may result in a simplification. Indeed, we obtain for particular residual neural networks in the limit a partial differential equation. Then, uncertainties can be studied by Monte-Carlo, stochastic collocation and stochastic Galerkin methods.
The proposed stochastic Galerkin method describes a priori a functional dependence on the stochastic input. This can be viewed as a prior, where samples may be drawn easily. These are propagated deterministically through the hidden layers of the network. Hence, our approach is closely related to moment matching networks (MMN) [67,50]. Having trained a MMN, one can also draw easily independent random samples from the output [18,60]. With our approach, a partial differential equation propagates the uncertainties through the kinetic residual network with infinite layers.
The outline of the paper is as follows. In Section 2 we review the definition of neural networks and especially recall a recently introduced kinetic approach to neural networks. Then, we describe the propagation of uncertainties through the kinetic model. Finally, the UQ methods are presented in Section 3. Especially, we compare the three presented methodologies as possible techniques to quantify the robustness of the residual neural networks. In Section 4, we conduct numerical examples. We present classical machine learning tasks, such as clustering and regression problems to study the robustness of the network with respect to random initial data. In particular, uncertainty quantification is performed on the loss functions. We finish the paper in Section 5 with a brief conclusion and an outlook on future research perspectives.
2. Neural networks and formulation in UQ framework. The crucial building blocks of an artificial neural network are neurons and layers. Each neural network consists at least of one input layer, one hidden layer and one output layer. The input layer = 0 is simply characterized by the measurement of d features or input signals x 0 ∈ R d . A feature is one type of measured data, e.g. temperature of a tool, length or width of a vehicle, color intensity of an image. The output layer = L contains, after overall application of the network, the output result. Furthermore, Hidden layers Input layer Output layer Illustration of a neural network with L + 1 = 5 total layers. a neural network may consist of arbitrary many hidden layers = 1, . . . , L − 1.
The total number of layers, i.e. L + 1 including input and output layer, defines the deepness of the network's structure. The number of neurons in each layer is denoted by N . We assume for the input and output layers N 0 = N L = 1. But the number of neurons in each hidden layer = 1, . . . , L − 1 may vary. We can imagine each neuron being characterized by a value x k, ∈ R d , k = 1, . . . , N , cf. Figure 1, which is the result of compositions of function blocks defined by an activation function σ : R → R. Following [51], the update of a multilayer perceptron neural network can be written in a general fashion as where the weight matrices w kj ∈ R d×d and biases b ∈ R d are a priori unknown and need to be obtained by training. Here, the activation function is evaluated componentwise. Famous examples of activation functions are the sigmoid σ S (x) = 1 1+exp(−x) , ReLu σ R (x) = max{0, x} or hyperbolic tangent σ T (x) = tanh(x) functions. Weights, biases, as well as the size of a NN, implicitly define the structure of the network. For a comprehensive introduction to NN we refer to [2,5,31] and the references therein. A popular subclass of artificial neural networks are residual neural networks (ResNet), where we are interested in. In comparison to (1) the activation energy of the previous layer is added to the activation function. We assume N = N for each layer = 1, . . . , L − 1. Thus, the update formula of a ResNet [32,51] is So far we have neglected the dependence of the input measurement x 0 . In case of M different measurements we denote the forward propagation of the i-th measurement by x k, 1 , . . . , x k, M . 2.1. Training of a network. A crucial part in applying a neural network is the training of the network. Once the structure of the network is established, training

Hidden layers Input layer
Output layer aims to find weights and biases that minimize the distance of the output layer to given targets h i ∈ R d with respect to a suitable loss L, i.e.
Here, the collection of weights and biases are denoted by W and B. A typical example, see e g. [38], is the quadratic loss function L is often used for regression problems.
The computational cost of the training procedure depends on the size of the network and may be expensive. Typically, stochastic gradient descent is used to perform the optimization of the parameters [30,46,84].

2.2.
Kinetic formulation of the ResNet. Recently, studying the limit L → ∞, i.e. studying deep neural networks with infinite layers, has gained interest in the mathematical community [34,66,12]. This allows to represent a discrete network as a continuous function. Similarly to [12] we reformulate the discrete time neural network as a system of continuous ordinary differential equations. This is the intermediate step to formulate the discrete ResNet as a kinetic partial differential equation (PDE) which represents the mathematical cornerstone we consider in this work.
We shortly summarize the derivation of the mean-field equation for a ResNet. The kinetic formulations of a ResNet are derived under the crucial assumption that each hidden layer corresponds to one neuron, i.e. N = 1 for all = 1, . . . , L − 1. Then, the number of neurons is fixed and prescribed by the dimension of the input signal only. Therefore, each layer consists of d neurons and each one contains a scalar activation energy, as illustrated in Figure 2.
We introduce the microscopic model SimResNet from [34]. The time evolution of the activation energy of a neuron is for each fixed i = 1, . . . , M defined as where the discrete time t identifies the layer. In fact we observe that, compared to (2), the network is reformulated by introducing a parameter ∆t > 0 and a time discrete concept which corresponds to the layer discretization. More precisely, the time step is defined as ∆t := 1 L+1 . In this way, we can interpret (4) as an explicit Euler discretization of an underlying time continuous model, i.e.
for each fixed i = 1, . . . , M with w(t) ∈ R d×d , b(t) ∈ R d being continuous extensions of the discrete optimal weights and biases. The simplifications introduced in the formulation of the SimResNet (4) allow to reduce the complexity of a neural network drastically, and especially the cost of the training stage. A further simplification, considering diagonal matrices w ∈ R d×d , is possible and the resulting network has been shown to provide a suitable approximation of any function f ∈ L 1 (R d ), see [51].
The special form of the SimResNet is beneficial to compute kinetic formulations. We perform the kinetic limit in the number of measurements M , which means that we consider the case of infinitely many measurements. Since the dimension of a measurement x i is directly related to the dimension of the variable of the kinetic distribution function, moderate dimenions d should be considered in applications. According to [34], the mean field equation, corresponding to the dynamics (5), reads where g : The initial data g 0 (x) describe the distribution of the measured data and the mean field equation (6) preserves mass, i.e. R d g(t, x) dx = 1 for all t ≥ 0. The derivation is classical and we refer to [27,37] for general details on mean field theory.

2.3.
Propagation of perturbations and robustness of the network in the kinetic limit. We assume a trained NN with a numerically expensive training procedure. Thus, the user of a NN aims to avoid unnecessary training runs. Initial values in real world applications are usually measurements, which are commonly noisy and hence, are perturbations of the true states. We refer to the perturbations as uncertainties and we considered only perturbations of inputs. The NN model propagates the uncertainties forward.
To quantify the forward propagation of the uncertainties through the NN model, and, consequently, the robustness with respect to input signals, we assume a perfectly trained NN, meaning that there have been no measurement errors of the training set nor numerical errors in the training procedure. The M input signals x 0 i ∈ R d are perturbed by a d-dimensional random variable η, which is defined on the probability space Ω, F(Ω), P . The perturbed input signals with different noise levels ε i > 0 read as . . , M and ω ∈ Ω. Since we consider a trained NN, the structure of the network does not depend on the uncertainties. Therefore, the dependence of the output signal on the uncertainty can be expressed by On the macroscopic level, we look for random solutions were B denotes the Borel set, that satisfy almost surely (P-a.s.) the parameterized mean field equation Moreover, we consider solutions that belong for each fixed g(t, x, ·) to the space 3. Uncertainty quantification methods. The research field of uncertainty quantification concerns with the estimation and characterization of uncertainties in the model. Typically, quantifying the impact of uncertain input parameters of a model is of interest. These inputs might be the initial conditions of a PDE or any model parameter.
One may distinguish between intrusive and non-intrusive methods. Non-intrusive methods compute for fixed realisations of the stochastic input the corresponding solutions. Then, the statistics of interest are computed. Since deterministic numerical solvers need not be changed, non-intrusive methods are often preferred in practice. For instance, in the case of partial differential equations finite volume based methods have been proven successful in previous works [69,76,17,80,54,77,64,1,55]. Desirable numerical schemes are in smooth regions high-order accurate, but can also resolve singularities in an essentially nonoscillatory (ENO) fashion. CWENO schemes consist of a weighted combination of local reconstructions on different stencils and allow unstructured grids [15,14,16,35,45,70].
Numerical cost can be substantially decreased if it is possible to state a priori a functional dependence of the solution on the stochastic input. The dependence can be expressed by orthogonal functions, e.g. orthogonal polynomials, also known as polynomial chaos (gPC) expansions [9,85,87].
One large class of UQ methods are described by collocation approaches [72,86]. They have in common to be non-intrusive and to solve the model exactly for discrete values in the random space. Collocation approaches are very popular and have been applied to a wide range of problems. The simplest method of this class is the wellknown and widely used Monte-Carlo method. Another collocation approach is the pseudo spectral or stochastic collocation approach, which can make use of the given functional dependence on the stochastic input.
Stochastic Galerkin methods are intrusive. They compute directly the gPC coefficients that determine the functional dependence. Expansions of the stochastic input are substituted into the governing equations and they are projected to obtain deterministic evolution equations for the gPC coefficients. Applications of this procedure are challenging for general PDEs, since desired properties like hyperbolicity may be lost [59,19,21,22,47,58]. The resulting systems for linear hyperbolic [28,61] and kinetic equations [10,11,36,41,71,88,89] remain well-posed. Furthermore, convergence results of truncated gPC expansions to the true solution can be derived by analyzing the regularity of the solution on the uncertain input [23,36,89].
For our applications it is natural to choose collocation approaches in the case of a large artificial NN. The advantage is that any black box solver for the NN can be used. Besides the non-intrusive methods, also the stochastic Galerkin method will be applied to kinetic differential equations.
So far, we have specified the quantity of interest (QoI) by a loss function (3). With the introduction of uncertainties, we obtain the random variables We are interested in statistics such as mean and variance, i.e.
E QoI := QoI(ω) dP(ω) and V QoI : In the following we compute these statistics with Monte-Carlo, stochastic collocation and stochastic Galerkin methods. First, we introduce non-intrusive methods, namely Monte-Carlo and stochastic collocation.
3.1. Monte-Carlo methods. For M independent and identically distributed random variables QoI 1 , . . . , QoI M , having the same distribution as the quantity of interest QoI, we consider the unbiased Monte-Carlo estimators for mean and variance Since we are mostly interested in the first and second moments, we consider the root of the mean squared error (RMSE), which satisfies We observe the convergence rate is of order O M − 1 /2 . This slow convergence is the drawback of the MC method, however, the rate is independent of the dimension of stochastic input. The RMSE can also be decreased by reducing the variance. Typical variance reduction techniques are control variates, importance sampling, antithetic variables [26] and Multilevel-Monte-Carlo methods [25].

Collocation methods.
Under the assumption that the probability measure dP(ω) is absolutely continuous with respect to the Lebesgue measure dω, we can express the moments as the Lebesgue integral which is almost surely unique defined [56]. Instead of interpreting the quantity of interest QoI as a random variable, i.e. a measurable function with an associated probability measure, the Radon-Nikodym derivative allows an interpretation as a parameterized function, i.e.

QoI
: with associated probability density ρ(ω). Note that this assumption is rather restrictive, since one cannot define a probability density with respect to an arbitrary measure. For instance, discrete random variables cannot be parameterized in this way. For many practically relevant probability measures, e.g. normal and uniform distributions, we can compute statistics of interest more efficiently by exploiding the functional dependence on the stochastic input. More precisely, we will consider projections onto the orthogonal subspaceŝ An orthogonal basis of S K , denoted as (φ k ) K k=0 , is called generalized polynomial chaos (gPC) basis. Typical choices are Then, the dependency of a quantitiy of interest on the stochastic input can be expressed as With a multi-index k := (k 1 , . . . , k d ) ∈ K and an index set K ⊆ N d 0 definition (gPC) is extended to the multidimensional case by Common choices for multidimensional bases, see e.g. [48,86], are the tensor basis In the following, we use the one-dimensional notation (gPC) with K + 1 = |K|. According to [9,13,20] the truncated expansion converges to the exact solution with respect to the L 2 (ρ)-norm under the mild assumption QoI ∈ L 2 (ρ). The rate of convergence, however, depence on the regularity with respect to the stochastic input. In particular for Legendre polynomials, [86,Th. 3.6] gives the estimate Here, H q ρ denotes the weighted Sobolev space We will exploid this polynomial approximation in several ways.
Gaussian quadrature. A typical approach to compute the integrals in one dimension is Gaussian quadrature with nodes ξ 1 , . . . , ξ M and weights w 1 , . . . , w M . According to [75], we have the error estimates where Pol p denotes the set of polynomials with degree equal or smaller than p ∈ N 0 . Therefore, this approach can be optimal in one dimension if there is a smooth dependency on the stochastic input. In multidimensions, however, when the multidimensional functions are represented by full tensors and full grids, the computational complexity grows exponentially with the number of dimensions. Unless the problem structure allows for sparse grids, deterministic quadrature is typically not feasible in higher dimensions.
3.3. Stochastic Galerkin. The stochastic Galerkin (SG) method is built on the idea of the original Galerkin method, i.e. expanding the solution with the help of orthogonal polynomials. In contrast, the underlying equations are projected first. Then, the solution is reconstructed. We consider the general dynamical system ∂ t QoI(t, ω) = D QoI(t, ω) . The stochastic Galerkin formulation reads as The stochastic Galerkin method consists now in solving these systems of equations. However, there are several challenges. Even if the underlying dynamical system is for each ω ∈ Ω ρ well-posed, there exist in general no unique solution to the stochastic Galerkin formulation. And even if the stochastic Galerkin formulation is for each fixed K ∈ N 0 well-posed, the solution does not necessarily converge to the desired solution for K → ∞. We can resolve these issues for solutions to general, but linear kinetic initial value problems where the uncertainty arises only from initial values. This means for the mean field equation T = σ 1 d w(t)x+b(t) , Q = 1 d w(t)σ 1 d w(t)x+b(t) . Then, the stochastic Galerkin formulation consists of K + 1 decoupled systems and is hence well-posed. Moreover, the convergence for K → ∞ should be relatively fast. We consider L 2 -solutions [4, Def. A.3] to the linear, kinetic equations (8) and (9). These are formally given by If the solution operator M(t) is a continuous and hence bounded operator, satisfying M(t) < ∞ for all t ≥ 0, we have the a priori estimate Hence, for linear kinetic equations, this approach can provide spectral convergence if initial values depend sufficiently smooth on the random input. 4. Numerical examples. In this section we use the previous methods to study the behavior of uncertainty in the initial input of a neural network model to quantify its robustness by measuring the deviation from the correct target as function of the size of the uncertainty.
In particular, we apply Monte-Carlo and stochastic collocation to the discrete deep artificial neural network, and we apply stochastic Galerkin to the PDE formulation of the deep residual neural network.
We focus on regression and classification problems, which are tasks belonging to the field of supervised machine learning. Both are based on using known datasets, usually named as training datasets, to make predictions. The main difference between the two types of problems is that the output variable in regression is numerical or continuous, while the output for classification is categorical or discrete. Once we have trained the network, we store the optimal parameters to perform the forward propagation of the network with given noisy input signals. The uncertain perturbation is added to the input data of the training dataset, as described in Section 2.3. We consider a uniformly distributed random variable on [−1, 1], i.e. η ∼ U(−1, 1). We recall that the strength of the perturbation is modeled by the multiplicative factor ε > 0.
We apply Monte-Carlo and stochastic collocation to compute the evolution of the network under the uncertain inputs. In particular, with Monte-Carlo we consider M = 1000 random samples from the given uniform distribution of the uncertain parameter. Instead, for stochastic collocation we perturb the true input data along the 4 collocation points of the Legendre polynomials. In both cases, we consider 10 equally spaced values of the noise level ε, from 0.1 to 1. We compute the expected value and the variance of the quantity of interest, which gives information on the L 2 -distance between the true target and the output of the network under uncertain input data.
In the top left panel of Figure 3 we show the expected value and the standard deviation from the expected value of the pointwise error, namely computed on each data input. In Figure 3 we provide the expected value and the variance as functions of the noise level ε, in the top right panel and in the bottom left panel respectively. As we expect, the error decreases with the noise level. Moreover, the bottom right panel shows the convergence rate of the Monte-Carlo method as a function of the

Classification problem.
On the other hand, classification problems attempt to estimate a function h : H ⊆ R d → C from the input variables x ∈ H of the dataset to discrete or categorical target variables y = h(x) ∈ C. Here C is a discrete set of the categorical variables. In the following we consider d = 1 and From an application point of view, this experiment can be seen as the identification problem of the type of a vehicle, such as car or truck, depending on the   measurement of its length. Here, we consider cars having label 0 and trucks having label 1. The interesting question in applications is how low should be the measurement noise level in order to have, for instance, an error below to 5% of wrong classifications.

Monte Carlo
We make use of the discrete model (1), with L = 3, which means 4 total layers and 2 hidden layers, each one having N 1 = N 2 = 4 neurons. Moreover, the activation function is chosen as σ(x) = σ T (x) = tanh(x).
The training dataset is artificially built using M = 20 uniformly distributed points in H = [3,8] representing the input signals x 0 i with targets y i = h(x 0 i ) for i = 1, . . . , M . We again train the network using the gradient descent method with learning rate 0.1. The result of the training step is observed in the left panel of Figure 4, where the blue circles represent the true targets y i , i = 1, . . . , M , while the red stars are the estimated values by the trained network.
The uncertain perturbation is added by considering a uniformly distributed random variable on η ∼ U(−1, 1). For the strength of the perturbation, modeled by the multiplicative factor ε, we choose 7 equally space values between 0.4 and 1.
We perform the evolution of the network under the uncertain inputs using Monte-Carlo and stochastic collocation. In particular, with Monte-Carlo we consider M = 1000 random samples from the given uniform distribution of the uncertain parameter. Instead, for stochastic collocation we perturb the true input data along the 4 collocation points of the Legendre polynomials. We compute the expected value and the variance of the quantity of interest that gives information on the averaged total number of wrong classifications.
In the left panel of Figure 4 we show the expected value and the standard deviation from the expected value of the pointwise error, namely computed on each data input. In Figure 4 we provide also the plot of the expected value and of the variance as function of the noise level ε, in the center panel and in the right panel respectively. As we expect, the error decreases with the noise level. We observe that, for instance, to have less than 5% of wrong classifications, the noise level needs to be less than 0.6.

Kinetic
ResNet. Regression problems based on the random mean field equation (6) consist of the following steps.
1) Determine gPC modes for initial data by an orthogonal projection. 2) Solve the stochastic Galerkin formulation (9). 3) Reconstruct the statistics of interest.
We will perform these steps for the following test cases, where a uniform distributed random variable η ∼ U(−1, 1) and normalized Legendre polynomials are used. We consider where zero flux boundary conditions are used, and Orthogonal projection of initial values. In Case 2, there is a smooth dependence on the stochastic input, i.e. g 0 (x, ·) ∈ C ∞ , whereas we have only g 0 (x, ·) ∈ L 2 in Case 1. Therefore, the gPC modesĝ k (0, x) := g 0 (x, ·) φ k (·) ρ are computed using a standard Monte-Carlo method with 10 5 samples in Case 1 and with Gaussian quadrature with 100 nodes in Case 2.
The upper panel of Figure 5 shows exact realisations of initial data for Case 1 in black being in the confidence region (gray), which contains all realisations. The mean (red) and the variance (blue) are calculated exactly by V g 0 (x, η) = E g 2 0 (x, η) − E g 0 (x, η) 2 and We observe that the moments are smooth although realisations are discontinuous. This allows a relatively accurate approximation by superpositions of smooth gPC modes (green), whose evolution is described by equation (9). Indeed, the exact variance is well approximated from below by ĝ 1 (0, x), . . . ,ĝ K (0, x) 2 V g 0 (x, η) for K → ∞, as illustrated in blue with respect to the right y-axis. The second and third panel show the root of the mean squared error (RMSE). Due to the nonsmooth dependence on the stochastic input we observe the slow decay rate 1 /2, which is similar to a Monte-Carlo method. Figure 6 is devoted to Case 2. Here the RMSE decays faster. We even observe exponential decay as guaranteed by the estimate (7). As reference we have used the truncation K = 15. The left y-axis shows the gPC modesĝ k (0, x) := g 0 (x, ·) φ k (·) ρ of initial values. The mean E g 0 (x, η) =ĝ 0 (0, x) is plotted in red, the others in green. The confidence region, which contains P-a.s. all realisations g 0 x, η(ω) is gray shaded. The right y-axis shows the variance corresponding to different gPC truncations. The upper left panel shows for each fixed signal x the root mean squared error (RMSE) of the gPC truncations. The right panel shows the total RMSE for various gPC truncations.
Solving the stochastic Galerkin formulation. To approximate the PDEs (9) a discretization ∆x > 0 is used to divide an interval H = [x min , x max ] into N cells C j := x min +(j −1)∆x, x min +j∆x , j = 1, . . . , N such that H = C 1 ∪· · ·∪C N . The evolution of cell averages is simulated by a third-order CWENO-reconstruction from [15]. It uses a local Lax-Friedrichs flux and a strong stability preserving (SSP) Runge-Kutta method with three stages [40]. The upper panels of Figure 7 show that the solution to Case 1 converges to a delta distribution in x ∈ {2, 8}. The lower panels investigate the probability distribution for a fixed signal. The distribution is determined by a Matlab buildin kernel density estimator with 10 5 samples from the truncated gPC expansion. Figure 8 shows the time evolution of the gPC modes for Case 2, likewise. Here the solution converges to a delta distribution in x = 1.
Reconstructing the statistics of interest. The target of Case 1 is characterized by two delta distributions located at the binary values x ∈ {2, 8}, the target of Case 2 Figure 7. Case 1: Solutions to the stochastic Galerkin formulation with truncation K = 30 and discretization ∆x = 0.01. The upper panels show in red the first gPC modeĝ 0 (t, x), which describes the temporal evolution of the mean. The other modes are plotted in green, which determine the spread of the stochastic perturbations. The lower panels show the probability distributions of the random quantity g(t, x, η) for a fixed signal x ∈ {2, 8} at time t = 0 (left), t = 0.5 (middle) and t = 1 (right). is a delta distribution at x = 1. If N is even, the discrete analogues read as We are interested in the random loss By using the multinomial theorem, we can express moments exactly as with precomputed constants C k := E φ k0 0 (η) · . . . · φ k K K (η) .
In particular, we have E QoI(η) = ĝ 0 − h,ĝ 1 , . . . ,ĝ K 2 . The gPC expansion G K [g](t, x, εη) gives also samples for general loss functions and noise levels.  Figure 9 and 10 show the mean, variance and the probability distribution for a fixed signal x ∈ H, where the target is a delta distribution. In Case 1, a higher noise level lets realisations be closer to the target. Hence, the mean of quadratic deviations increases for lower noise levels. In contrast, the mean decreases in Case 2 with the noise level, since realisations are closer to the target for small noise levels. variance variance Figure 10. Case 2: Mean, variance and distribution for random loss. The left panels show for each fixed signal x the mean and the variance of the random loss. The middle panels show for various noise levels ε the mean and variance of the total random loss. The right panels show the probability distributions of the random loss in x = 1, respectively at time t = 0 and t = 5.

5.
Conclusion. In this work, we have proposed a technique to use UQ approaches to quantify the robustness of the model output of a neural network due to the influence of uncertain inputs. The robustness is measured by the expected distance and variance of the output to the true target. In particular, the cornerstone of this work is the use of the mean-field formulation of a discrete time residual neural network. Several numerical tests have been applied by employing Monte-Carlo, Stochastic Collocation and Stochastic Galerkin methods. Here, uncertainties arise only from input data. Uncertainties in the model structure and predictive uncertainty are topics of interest as well, and possibly subjects of future research.