NUMERICAL INVESTIGATION OF ENSEMBLE METHODS WITH BLOCK ITERATIVE SOLVERS FOR EVOLUTION PROBLEMS

Abstract. The ensemble method has been developed for accelerating a sequence of numerical simulations of evolution problems. Its main idea is, by manipulating the time stepping and grouping discrete problems, to make all members of the same group share a common coefficient matrix. Thus, at each time step, instead of solving a sequence of linear systems, each of which contains only one right-hand-side vector, the ensemble method simultaneously solves a single linear system with multiple right-hand-side vectors for each group. Such a system can be solved efficiently by direct linear solvers when the problems are of small scale, since a single LU factorization serves all members of the group. For large-scale problems, however, iterative linear solvers often have to be used, and this appealing advantage is no longer obvious. In this paper we systematically investigate the numerical performance of the ensemble method with block iterative solvers for two typical evolution problems: the heat equation and the incompressible Navier-Stokes equations. In particular, the block conjugate gradient (CG) solver is considered for the former and the block generalized minimal residual (GMRES) solver for the latter. Our numerical results demonstrate the effectiveness and efficiency of the ensemble method when working together with these block iterative solvers.


1.
Introduction. It is common to run a sequence of numerical simulations in scientific and engineering applications, such as numerical weather prediction using ensemble-based data assimilation, proper orthogonal decomposition reduced-order modeling that requires offline data generation, and uncertainty quantification by random sampling. For these numerical simulations, one first discretizes the problems in both space and time, and then solves a sequence of linear systems

(1)    A_j x_j = b_j,    j = 1, 2, . . . , J,

where J is the total number of simulations, called the ensemble size, and A_j, x_j and b_j are respectively the coefficient matrix, state-variable vector and right-hand-side (RHS) vector of the j-th discrete problem. The choice of linear solver is usually determined by the size and structure of A_j. It is well known that dense direct methods are limited by the size of the problem and work well for sizes up to a few thousand; sparse direct methods, whose performance depends on both the size and the sparsity pattern, require a good ordering. When the fill-in generated by sparse direct solvers becomes too costly, iterative methods have to be used. Their computational complexity depends on the size and sparsity of the coefficient matrix, and preconditioning is usually applied to accelerate convergence [37]. When all simulations in the sequence share a common coefficient matrix A (that is, A_j = A for all j), direct methods can easily share information among all RHS vectors: one factorization of A can be used to solve all the problems. If the problem size is too big for direct solvers, block iterative methods have to be used; they are also able to share system information among all RHS vectors through a common block Krylov subspace, and solve all the linear systems simultaneously.
For sequences sharing a common coefficient matrix, block iterative algorithms have been developed to solve such systems with many RHS vectors, including block CG and its variants for symmetric positive definite (SPD) systems [28,29,25,19], block GMRES and its variants for nonsymmetric systems [35,13,27,18,7,4,3,1], and hybrid block algorithms [36]. Block algorithms have even been used to accelerate the solution of linear systems with only one RHS vector [29,6]. The advantage of a block solver over individual solvers lies in the following facts: (i) the product of a matrix with J vectors is more efficient than J separate matrix-vector products [8,12]; (ii) the search space generated from J RHS vectors is usually larger than that from a single vector, so the Krylov subspace method can potentially converge in fewer iterations [31]; (iii) when some of the RHS vectors are (nearly) dependent, the search space can be compressed and the problems solved more efficiently; and (iv) the coefficient matrix is accessed only once at a time for all J problems; when accessing A is the main computational bottleneck of a linear solver, or A needs to be regenerated at each time step, this yields a significant computational advantage [2]. Several block iterative algorithms were studied using parallel computing in [32], which shows that they have a high level of parallelism and thus can be applied to solving large-scale multisource problems.
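Fact (i) above can be seen in a minimal sketch (a hypothetical numpy/scipy setup, not the solvers used later in this paper): applying a sparse matrix to a block of J vectors touches the stored entries of A once, whereas a loop of J matrix-vector products reads them J times.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, J = 2000, 8
# A hypothetical sparse coefficient matrix and a block of J vectors
A = sp.random(n, n, density=1e-3, format="csr", random_state=0) + sp.eye(n, format="csr")
V = rng.standard_normal((n, J))

# One pass over the entries of A applied to the whole block ...
W_block = A @ V
# ... matches J separate matrix-vector products, which read A's entries J times
W_loop = np.column_stack([A @ V[:, j] for j in range(J)])
```

The two results agree; the block product simply amortizes the memory traffic over all J columns.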
The appearance of a common coefficient matrix for a group of problems is appealing; in general, however, the coefficient matrix varies from one problem to another, and it is then no longer apparent how to share information among the RHS vectors. Hence, neither direct nor block iterative solvers can be applied straightforwardly. Approaches such as the seed/recycled Krylov subspace methods have been developed, which solve each RHS independently while storing some information from each solve and using it in subsequent solves [5]. The accumulated information enlarges the search space and thus potentially reduces the number of iterations. For slowly-changing linear systems, subspace recycling techniques such as GCRO with deflated restarting (GCRO-DR) have been introduced to accelerate the solution process in [30]. The block version of GCRO-DR was recently introduced in [31], and its high-performance implementation is available in the Belos package of the Trilinos project developed at Sandia National Laboratories. Note that the underlying assumption of such approaches is that all the systems are closely related, which certainly holds in some applications but lacks a rigorous mathematical foundation.
As seen from the preceding discussion, research on accelerating a sequence of numerical simulations mainly starts from the numerical linear algebra point of view, with the goal of solving (1) more efficiently when A_j varies from one case to another. Recently, the ensemble method has been introduced to tackle this issue for evolution problems at the numerical algorithm level [21,22,20,16,26,14,15,11]. The idea is to make all the linear systems in a group share a common coefficient matrix by manipulating the numerical discretization schemes. Then either direct or block iterative solvers can be applied naturally. Such an approach leads to the system

(2)    A X = B,

where A is the common coefficient matrix, and X = [x_1, . . . , x_J] and B = [b_1, . . . , b_J] consist of the state-variable vectors and RHS vectors of the J discrete problems, respectively. To the best of our knowledge, research on ensemble methods has so far focused primarily on numerical analysis, including stability analyses and error estimates; the resulting linear systems have been solved with direct solvers, but not with block iterative ones. For instance, ensemble-based Monte Carlo and multilevel Monte Carlo methods have been developed in [23,24]. With the LU factorization, it has been shown that the use of ensemble methods leads to significant computational savings over individual simulations. But large-scale applications certainly need to be considered in practice. Hence, in this paper, we take a couple of widely used PDE models in heat transfer and incompressible fluid flows and, for the first time, perform a comprehensive study of the numerical behavior of ensemble methods together with block iterative solvers. The rest of this paper is organized as follows. In Section 2, we briefly review the ensemble-based time-stepping algorithms for the heat equation and the Navier-Stokes equations. Block iterative solvers are discussed in Section 3.
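The shared-factorization advantage for a common coefficient matrix can be sketched as follows (an illustrative dense scipy example, not the solvers used in this paper): A is factored once, and the LU factors are reused for every RHS column of the group.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n, J = 200, 6
# A common (here dense, well-conditioned) coefficient matrix shared by the group
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)
B = rng.standard_normal((n, J))      # one RHS column per ensemble member

lu, piv = lu_factor(A)               # factor A once ...
X = lu_solve((lu, piv), B)           # ... and reuse the factors for all J RHS columns
```

Each additional ensemble member only costs a pair of triangular solves, not a new factorization.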
Several numerical experiments are then performed and presented in Section 4 to demonstrate the effectiveness and efficiency of the ensemble method working with the block iterative solvers. Finally some concluding remarks are drawn in Section 5.
2. Ensemble method for evolution problems. We consider two popular mathematical models governing heat transfer and incompressible fluid flows respectively: the heat equation and the Navier-Stokes equations. Assume that, for either case, one needs to complete J numerical simulations under different computational settings including distinct body source functions, boundary and initial conditions, and physical parameters. We first present the ensemble time-stepping schemes for the aforementioned mathematical models and discuss the associated stability conditions and error estimates.
2.1. Heat equation. For simplicity of presentation, we assume the diffusion parameter ν_j in the j-th problem to be a constant. The finite element (FE) method is used for the spatial discretization in this paper, but other types of numerical methods could be used as well. A uniform mesh T_h of size h and a uniform time partition with step size ∆t are taken throughout our discussion. Denote by Ω the computational domain, by Γ_D the part of the boundary of Ω on which the Dirichlet condition is imposed, and by Γ_N the part on which the Neumann boundary condition is imposed. We use (·,·) for the L^2 inner product on Ω, ⟨·,·⟩_{Γ_i} for the inner product on Γ_i, and ∥·∥ for the L^2 norm. Define V^h_{g_D} to be the space of piecewise-continuous functions on Ω that reduce to polynomials of degree ≤ m on each element of T_h and match the Dirichlet data on Γ_D, and V^h_0 its counterpart with homogeneous Dirichlet data. Additionally, we assume the total number of time steps to be N and write t^n = n∆t for n = 0, . . . , N. In the ensemble-based time stepping, we introduce the ensemble average ν̄ of the diffusion coefficients and discretize the equation in an implicit-explicit manner: the term with ν̄ is treated implicitly and the deviation (ν_j − ν̄) explicitly. The resulting semi-discrete system, (6), is associated with the same boundary and initial conditions, provided ∂u^n_j/∂n := g^{n+1}_{N,j}. Using the standard conforming FE method, we take V^h_{g_D,j} and V^h_0 as the trial and test spaces, respectively, and obtain the fully discrete system: find u^{n+1}_{j,h} ∈ V^h_{g_D,j} such that (8) holds. The choices of D_t and the extrapolation of u^n_{j,h}, as specified in (4) or (5), lead to ensemble-based time-stepping algorithms of first- or second-order accuracy, respectively. The scheme needs an initial condition u^0_{j,h} to start with, for which the projection of u^0_j onto the FE space can be taken. The second-order scheme is a two-step method that requires one more starting value u^1_{j,h}, which can be obtained from the first-order scheme.
Lemma 2.2. Let u^n_j and u^n_{j,h} be the solutions of equations (6) and (8) at time t^n, respectively. Assume f_j ∈ L^2(H^{−1}(Ω); 0, T) and that the stability condition, (9) or (10), holds. Then there exists a generic constant C > 0, independent of J, h and ∆t, such that the corresponding error estimates hold for the first-order and the second-order schemes, respectively.

2.2.
Navier-Stokes equations. We next consider a group of incompressible flow problems. The governing Navier-Stokes equations (NSE), which describe the motion of incompressible Newtonian fluid flows, read: find the vector-valued velocity u_j(x, t) and the scalar pressure p_j(x, t), for j = 1, . . . , J, satisfying (13). After introducing ensemble averages of the diffusion coefficient and the velocity field and using the ensemble-based time stepping, we obtain a semi-discrete system associated with the same boundary and initial conditions as (13). Here Q^h is the space of piecewise-continuous functions on Ω that reduce to polynomials of degree ≤ s on each element of T_h. To guarantee the stability of the FE approximations, we assume that the pair of spaces (V^h_0, Q^h) satisfies the discrete inf-sup (or LBB^h) condition. One example for which the LBB^h condition holds is the family of Taylor-Hood P_{s+1}-P_s element pairs (i.e., m = s + 1 in the definition of V^h_0) for s ≥ 1 [17]. The fully discrete system reads: find u^{n+1}_{j,h} and p^{n+1}_{j,h} such that (15) holds. The choices of D_t and the extrapolation of u^n_{j,h}, as specified in (4) or (5), respectively lead to ensemble-based time stepping of first- or second-order accuracy in time. The initial condition u^0_{j,h} can be obtained by projecting u^0_j onto the FE space, while the second starting value required by the second-order scheme can be obtained from the first-order scheme.
It is seen that the ensemble method is semi-implicit: it approximates the advection using the a priori known ensemble-mean velocity ū^n_h, approximates the diffusion using the viscosity average ν̄, and treats the remainders explicitly, which leads to a single coefficient matrix for all members of the group. Under regularity assumptions on the NSE, the following stability conditions and error estimates can be shown [16,15].

Lemma 2.3. The ensemble scheme (15) is stable for all j = 1, . . . , J, if there exists some µ, 0 ≤ µ < 1, such that the corresponding time-step condition and parameter-deviation condition are satisfied, for the first-order scheme or for the second-order scheme respectively.
Lemma 2.4. Suppose that the P2-P1 Taylor-Hood FE pair is used for the spatial discretization. For the first-order ensemble time stepping, assume that the initial approximation u^0_{j,h} is sufficiently accurate; for the second-order ensemble time stepping, assume additionally that the second starting value is sufficiently accurate. Then the approximation error of the ensemble scheme at time t^N satisfies the corresponding estimate, where C is a generic positive constant that does not depend on J, ∆t and h.
3. Block-based iterative solvers. Since the ensemble method leads to a single system (2) for the entire group, direct methods can be used to solve it efficiently, as only one LU factorization is needed for advancing all the simulations at each time step. For large-scale problems, however, iterative linear solvers have to be used because of memory restrictions and computational cost. Because the discrete heat equation and the discrete Navier-Stokes equations have different algebraic structures, we investigate and discuss block-based iterative algorithms for them separately in this section.
3.1. Solution to the discrete heat equation. Assume that there are n_u degrees of freedom for each of the J problems and denote the basis functions by {ψ_1(x), . . . , ψ_{n_u}(x)}. The mass and stiffness matrices are defined by M = [m_ik] and S = [s_ik] with entries m_ik = ∫_Ω ψ_k ψ_i dx and s_ik = ∫_Ω ∇ψ_k · ∇ψ_i dx. At time t^n, the vector related to the source and boundary terms has entries (b^n_j)_i = ∫_Ω f^n_j ψ_i dx + ⟨ν g^n_{N,j}, ψ_i⟩_{Γ_N}, and the approximate solution vector is denoted by u^n_j. The full discretization of the group of heat equations based on the ensemble method, (8), yields the linear system

A U^{n+1} = B^{n+1},

where A is a sparse SPD n_u × n_u matrix and B^{n+1} is an n_u × J matrix (the so-called block vector in the literature). Furthermore, the j-th column of B^{n+1} is b^{n+1}_j + (1/∆t) M u^n_j − (ν_j − ν̄) S u^n_j for the first-order scheme, and involves (2/∆t) M u^n_j − (1/(2∆t)) M u^{n−1}_j together with the corresponding explicitly treated diffusion terms for the second-order scheme. If one has access to a semi-implicit numerical solver (the coefficient matrix A^{(j)} and vector b^{(j)}) for any individual simulation, then it is straightforward to obtain the linear system for the aforementioned ensemble method. Take the first-order semi-implicit scheme for example: the coefficient matrix A^{(j)} = (1/∆t) M + ν_j S and RHS vector b^{(j)} = b^{n+1}_j + (1/∆t) M u^n_j can be extracted from the individual simulation codes for j = 1, . . . , J; the matrices in the ensemble simulations are then A = (1/J) Σ_j A^{(j)} = (1/∆t) M + ν̄ S and B^{n+1} with columns b^{(j)} − (ν_j − ν̄) S u^n_j. Among the iterative solvers in the general family of Krylov subspace methods, the conjugate gradient (CG) method developed by Hestenes and Stiefel is the best known for solving a real SPD system [37]. For an SPD system with multiple RHS vectors, the block conjugate gradient method (BCG) was developed as a generalization of CG in this context in [28] and further in [29,25]. Compared with CG, BCG has the advantage of potentially faster convergence because the search space is augmented when multiple problems are considered together instead of a single one. It becomes even more attractive when accessing A^{(j)} is the main computational bottleneck of a linear solver (e.g., when A^{(j)} is stored outside of system memory, or its elements have to be regenerated at each use), as BCG can explore multiple search directions in a single pass over A. In practice, however, BCG can fail due to rank deficiency, when the block search direction vectors become linearly dependent. A simple remedy was developed in [19]: at each iteration a set of basis vectors is extracted from the search space and used as the new search directions, which avoids inverting a potentially rank-deficient matrix. The algorithm is named breakdown-free BCG (BFBCG) and is presented in Algorithm 1.
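As a toy illustration of the assembly just described, the following sketch performs one first-order ensemble step with 1D finite-difference stand-ins for M and S (the matrices, mesh and data are hypothetical, chosen only to exercise A = (1/∆t)M + ν̄S with RHS columns (1/∆t)Mu^n_j − (ν_j − ν̄)Su^n_j for zero forcing):

```python
import numpy as np

# Hypothetical 1D stand-ins for the FE matrices M and S (homogeneous
# Dirichlet data, zero forcing), illustrating one first-order ensemble step:
#   (M/dt + nu_bar*S) u_j^{n+1} = (1/dt)*M u_j^n - (nu_j - nu_bar)*S u_j^n
n, J, dt = 50, 4, 1e-3
h = 1.0 / (n + 1)
M = h * np.eye(n)                                            # lumped mass matrix
S = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h   # stiffness matrix
nu = 0.01 * (1 + np.linspace(-0.2, 0.2, J))                  # per-member diffusion
nu_bar = nu.mean()                                           # ensemble average

A = M / dt + nu_bar * S                                      # common coefficient matrix
x = np.linspace(h, 1 - h, n)
U = np.column_stack([np.sin(np.pi * x)] * J)                 # current states u_j^n

B = M @ U / dt - (S @ U) * (nu - nu_bar)                     # block RHS, one column per member
U_new = np.linalg.solve(A, B)                                # one solve advances the whole group
```

A single solve with the shared matrix A advances all J members; for the heat equation the computed states decay in norm, as expected of a diffusion step.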

Algorithm 1: The breakdown-free block CG [19]
Input: matrix A ∈ R^{n×n}, block RHS B ∈ R^{n×J}, initial guess X_0 ∈ R^{n×J}, preconditioner K ∈ R^{n×n}, tolerance tol and maximum number of iterations maxit.
Output: approximate solution X ∈ R^{n×J}.
Different from the original BCG, the breakdown-free BCG introduces an orthogonalization process, orth, which extracts an orthonormal basis P_i from the search space. This overcomes situations in which two or more columns of the residual matrix R_i are dependent: a lack of full rank in R_i would cause rank deficiency in Z_i and P_i, and would make BCG fail. If the rank of the search space is r_i, the resulting orthonormal matrix is P_i ∈ R^{n_u × r_i}. In Algorithm 1, the choice of Θ_i and Λ_i guarantees that the column space of R_{i+1} is orthogonal to the search space spanned by P_i, and that P_{i+1} is conjugate to all previous search spaces.
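A compact Python sketch of the breakdown-free iteration described above (our reading of Algorithm 1; the helper names orth and K_solve and the SVD-based rank truncation are our own choices, not necessarily those of [19]):

```python
import numpy as np

def orth(V, tol=1e-12):
    """Orthonormal basis of range(V); drop directions with tiny singular values."""
    if V.shape[1] == 0:
        return V
    U, s, _ = np.linalg.svd(V, full_matrices=False)
    return U[:, s > tol * s[0]]

def bfbcg(A, B, X0, K_solve, tol=1e-8, maxit=200):
    """Breakdown-free block CG sketch; K_solve applies the preconditioner inverse."""
    X = X0.copy()
    R = B - A @ X
    P = orth(K_solve(R))                    # orthonormal block search directions
    for _ in range(maxit):
        if P.shape[1] == 0:                 # no search directions left: converged
            break
        Q = A @ P
        T = P.T @ Q                         # small r_i x r_i system
        Theta = np.linalg.solve(T, P.T @ R)
        X += P @ Theta
        R -= Q @ Theta
        if np.linalg.norm(R) <= tol * np.linalg.norm(B):
            break
        Z = K_solve(R)
        Lam = np.linalg.solve(T, -Q.T @ Z)
        P = orth(Z + P @ Lam)               # re-orthonormalize: avoids breakdown
    return X
```

The choice Θ_i = T_i^{-1} P_i^T R_i makes R_{i+1} orthogonal to P_i, and Λ_i = −T_i^{-1} Q_i^T Z_{i+1} makes the new directions A-conjugate to the previous block, as stated above.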
At each iteration, the algorithm involves three matrix-matrix products, three matrix updates consisting of products of an n_u × r_i matrix with an r_i × J matrix (P_i Θ_i, Q_i Θ_i and P_i Λ_i), two solutions of small linear systems with the r_i × r_i coefficient matrix T_i, one application of the n_u × n_u preconditioner, and an orthogonalization procedure. Suppose n_u is much larger than r_i and the preconditioner is easy to apply; then the above matrix-matrix products and system solves require O(n_u J max(ℓ, r_i)) flops, where ℓ is the number of nonzero entries per row of A. Special attention needs to be paid to the orthogonalization process orth, which could be implemented by the reduced (economy) SVD. To reduce the computational complexity, however, we choose the method of snapshots and drop the singular values smaller than 10^{−12}. Determining the left singular vectors associated with the retained singular values then requires only O(n_u J^2) flops, which is comparable to the other operations in the algorithm.
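The method-of-snapshots variant of orth mentioned above can be sketched as follows (a hypothetical illustration: the small J × J Gram matrix is eigendecomposed instead of taking an SVD of the tall matrix, and modes below a drop tolerance are discarded):

```python
import numpy as np

def orth_snapshots(P, tol=1e-12):
    """Method-of-snapshots orthonormalization: eigendecompose the small J x J
    Gram matrix P^T P and lift back, dropping modes whose eigenvalues fall
    below a relative tolerance -- O(n_u * J^2) flops overall."""
    G = P.T @ P
    lam, V = np.linalg.eigh(G)              # eigenvalues in ascending order
    keep = lam > tol * lam[-1]              # lam plays the role of sigma^2
    return (P @ V[:, keep]) / np.sqrt(lam[keep])

rng = np.random.default_rng(3)
n, J = 500, 6
P = rng.standard_normal((n, J))
P[:, -1] = P[:, 0] + P[:, 1]               # make the block rank deficient
Q = orth_snapshots(P)                      # rank revealed: only 5 columns survive
```

Only the kept eigenvectors are lifted back through P, so the dominant cost is a single n_u × J by J × r product.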
The performance of the algorithm in the ensemble simulations will be investigated in Section 4, where we choose the incomplete Cholesky factorization for preconditioning; other types of preconditioners, such as multigrid and domain decomposition, could certainly be used as well.
3.2. Solution to the discrete Navier-Stokes equations. Assume that there are n_u degrees of freedom for the velocity, with basis functions {φ_1(x), φ_2(x), . . . , φ_{n_u}(x)}, and n_p degrees of freedom for the pressure, with basis functions {ψ_1(x), ψ_2(x), . . . , ψ_{n_p}(x)}. The velocity mass and stiffness matrices M and S are defined analogously to the heat-equation case using the vector-valued basis {φ_i}. We also define the matrix D = [d_ik] with entries d_ik = −∫_Ω ψ_k ∇·φ_i dx, and the matrices from the discretization of the convection term, N^n = [n_ik] and R^n_j = [r_ik], with entries n_ik = ∫_Ω (ū^n_h · ∇φ_k) · φ_i dx and r_ik = ∫_Ω (u^n_{j,h} · ∇φ_k) · φ_i dx, respectively; the approximate velocity solution vector is denoted by u^n_j. The fully discrete linear system for the group of NSE problems based on the ensemble method, (15), has a common saddle-point coefficient matrix A built from M, S, N^n and D; the j-th RHS column contains b^{n+1}_j + (1/∆t) M u^n_j together with the explicitly treated terms (ν̄ − ν_j) S and (N^n − R^n_j) applied to u^n_j for the first-order scheme, and additionally involves M u^{n−1}_j for the second-order scheme. Similar to the heat-equation case, if one has access to a semi-implicit numerical solver (the coefficient matrix A^{(j)} and vector b^{(j)} for the j-th problem), then the linear system for the ensemble simulations can be assembled in a straightforward manner; taking the first-order scheme for example, A can again be formed by averaging the A^{(j)}. Note that A keeps the same sparsity structure as A^{(j)}.
Since A is nonsymmetric, when its dimension is large the generalized minimal residual (GMRES) method can be used to solve the linear system with a single RHS vector. To speed up the iteration, we follow [10] and use the least-squares commutator preconditioner. The preconditioner K is built from M̂ = diag(M), the diagonal of the velocity mass matrix. This preconditioning is applicable whenever the mixed approximation is uniformly stable with respect to the inf-sup condition; it is also fully automated and does not require the construction of any auxiliary operators. GMRES, developed by Saad and Schultz [33], finds an approximate solution in the Krylov subspace that minimizes the residual. The algorithm has been extended to block versions (see [34,18] for an introduction and [35] for analysis), which use the block Arnoldi process to generate the Krylov subspace vectors. The block GMRES algorithm, denoted BGMRES, is presented in Algorithm 2, which makes use of Algorithm 3 for the block Arnoldi process.

Algorithm 2: The block GMRES
Input: matrix A ∈ R^{n×n}, block RHS B ∈ R^{n×J}, initial guess X_0 ∈ R^{n×J}, preconditioner K ∈ R^{n×n}, tolerance tol and maximum number of iterations maxit. Output: approximate solution X_s ∈ R^{n×J}.

Since the search space keeps growing during the BGMRES process, the iteration can be restarted after a certain number of steps. Meanwhile, within a restart cycle, the basis matrices can become linearly dependent, so deflation can be performed to remove redundant information and improve computational efficiency. Such algorithms have been designed in [4,3], performing deflation either at the beginning of the iteration or throughout the entire iteration process. For the time-dependent problems considered here, the solution at the previous time step serves as a good initial guess for the current iteration; together with a well-designed preconditioner, the number of iterations in BGMRES should be small. Therefore, in our numerical experiments we employ the block GMRES with deflation performed at the beginning of the iteration (BGMRES-D), presented in Algorithm 4.
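A minimal unrestarted, unpreconditioned sketch of the block Arnoldi/BGMRES idea (our own compact arrangement, not a transcription of Algorithms 2-3; deflation and restarting are omitted, and a zero initial guess is assumed):

```python
import numpy as np

def bgmres(A, B, tol=1e-8, maxit=30):
    """Unrestarted block GMRES sketch: block Arnoldi builds an orthonormal basis
    of the block Krylov space; the projected least-squares problem is solved
    for all RHS columns at once. Zero initial guess, no preconditioning."""
    n, J = B.shape
    V0, S0 = np.linalg.qr(B)                # QR of the initial block residual
    blocks, H = [V0], {}
    for k in range(maxit):
        W = A @ blocks[-1]
        for i, Vi in enumerate(blocks):     # block modified Gram-Schmidt
            H[i, k] = Vi.T @ W
            W = W - Vi @ H[i, k]
        Vnew, H[k + 1, k] = np.linalg.qr(W)
        blocks.append(Vnew)
        # Assemble the block Hessenberg matrix and solve the small LS problem
        m = len(blocks)
        Hm = np.zeros((m * J, (m - 1) * J))
        for (i, j), Hij in H.items():
            Hm[i * J:(i + 1) * J, j * J:(j + 1) * J] = Hij
        E1 = np.zeros((m * J, J))
        E1[:J] = S0
        Y, *_ = np.linalg.lstsq(Hm, E1, rcond=None)
        if np.linalg.norm(Hm @ Y - E1) <= tol * np.linalg.norm(B):
            break
    return np.hstack(blocks[:-1]) @ Y       # lift the small solution back
```

Because the basis blocks are orthonormal, the norm of the small least-squares residual equals the norm of the true block residual, which is what the stopping test exploits.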

Algorithm 4: The block GMRES with deflations [4]
Input: matrix A ∈ R^{n×n}, block RHS B ∈ R^{n×J}, initial guess X_0 ∈ R^{n×J}, preconditioner K ∈ R^{n×n}, deflation tolerance ε_d, tolerance tol and maximum number of iterations maxit.

Compared with the BGMRES in Algorithm 2, the BGMRES-D algorithm has an extra execution of the SVD of a matrix T ∈ R^{J×J} at the beginning of the iterative solution, but this takes only O(J^3) flops. Due to the truncation of singular values, the dimension of V_1 is never greater than J and is usually smaller, which reduces the computational effort of the block Arnoldi iterations.

4. Numerical experiments. The goal of this section is to test the performance of the ensemble method with the aforementioned preconditioned block iterative solvers. The performance of ensemble simulations is compared with that of the corresponding individual simulations. To this end, we first validate our simulation codes by checking the convergence rates of the ensemble approximations, and then use CPU time as the criterion for measuring computational efficiency. In particular, for the approximation errors, we define E^N_j := ∥u_j(t^N) − u^N_{j,h}∥, the L^2 error of the j-th approximation at the final time t^N in the ensemble simulations, and let Ẽ^N_j be the L^2 error of the j-th individual simulation at the final time. All simulations are implemented in Matlab and performed on a PC equipped with an Intel Core i7 processor running at 2.9 GHz. Our codes are developed within the framework of the IFISS (incompressible flow and iterative solver) software [9] and are executed sequentially in all the numerical tests. We also note that parallel computing is often needed in large-scale engineering problems; we leave this to future work.
We first check the rate of convergence of E^N_j in two test cases: (i) the first uses the first-order ensemble method together with bilinear elements; (ii) the second uses the second-order ensemble method together with biquadratic elements. Uniform square meshes of size h and a uniform time discretization with step size ∆t are used to partition the spatial domain and the temporal interval, respectively. Denote by N_x, N_y and K the numbers of partitions in the horizontal, vertical and temporal directions. Based on Lemma 2.2, when a uniform mesh refinement is taken with h ∼ O(∆t), the ensemble simulation is expected to converge linearly in the first case and quadratically in the second.
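The convergence-rate check described above amounts to estimating the observed order p from errors on successively refined discretizations; a small sketch with a hypothetical error sequence:

```python
import numpy as np

def observed_order(errors):
    """Observed order p from errors on successively refined discretizations:
    if E(h) ~ C * h^p, then p ~ log(E_h / E_{h/2}) / log 2."""
    e = np.asarray(errors, dtype=float)
    return np.log(e[:-1] / e[1:]) / np.log(2.0)

# Hypothetical error sequence, halving h each time, for a second-order method
errs = [4.0e-3, 1.0e-3, 2.5e-4, 6.25e-5]
p = observed_order(errs)
```

Consecutive error ratios near 2 indicate first-order convergence and ratios near 4 indicate second-order convergence, matching the expectations from Lemma 2.2.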
When the discrete systems are not very large, one could still use a direct solver such as LU or sparse LU. However, we would like to check the performance of the ensemble method working with a block iterative algorithm; thus we use the breakdown-free preconditioned BCG as the linear solver for all the ensemble simulations, as discussed in subsection 3.1. An incomplete Cholesky factorization is applied for preconditioning. In the iterative algorithm, the maximum number of iterations in a restart cycle is maxit = 20, and the convergence criterion is a relative residual below tol = 1 × 10^{−8}. Although the ensemble contains J = 100 members, due to space limitations we only list the results of the three problems with j = 1, 50 and 100 in Tables 1-2 for these two test cases, respectively, where ν_1 = 1.1901 × 10^{−2}, ν_50 = 8.4951 × 10^{−3}, ν_100 = 1.0154 × 10^{−2}; w_1 = 9.6995 × 10^{−2}, w_50 = −9.4653 × 10^{−2}, w_100 = −3.3367 × 10^{−2}.
It is observed that, in either case, the expected rate of convergence is achieved. Next, we compare the performance of the ensemble simulations with individual simulations on the same group of problems. For this purpose, we take the first-order ensemble method in time and bilinear finite elements in space with N_x = 128, N_y = 256 and K = 400; the total number of degrees of freedom is n_u = 33,153. The individual simulations are based on semi-implicit time stepping together with the standard preconditioned CG algorithm, each problem in the group being solved separately. We choose the convergence criteria of CG to be the same as for BFBCG: maxit = 20 in each restart cycle and tol = 1 × 10^{−8}. The approximation errors of the three individual simulations with j = 1, 50 and 100 are very close to the corresponding ensemble simulation errors listed in the last row of Table 1. However, as listed in Table 3, the CPU time for time integration in the ensemble with the BFBCG solver is 5.658 × 10^2 seconds, while that of the individual simulations with the preconditioned CG solver is 1.4640 × 10^3 seconds, a speedup factor of nearly 2.60. In this case, neither BFBCG nor preconditioned CG needs to restart during the simulations, and the number of iterations per time step in BFBCG is the same as in an individual preconditioned CG solve. The average execution time per time step in BFBCG (for 100 problems) is 5.6362 × 10^{−1} seconds, and that of each individual PCG solve is 1.1482 × 10^{−2} seconds; multiplying the latter by 100 for a fair comparison, the individual solves cost about twice as much as one BFBCG solve. The main computational saving in BFBCG comes from the reduction of search directions; we therefore plot the rank of P_i versus the iterations, every 10 time steps, in Figure 1.
It is seen that the search dimension increases with the iterations, but the largest rank of P_i during the simulation is 9, which is much less than J.

In the j-th problem, the exact solution is u_j(x, y, t) = sin(πx) cos(πy) e^{−2ν_j π^2 t}, v_j(x, y, t) = −cos(πx) sin(πy) e^{−2ν_j π^2 t}, p_j(x, y, t) = (1/4)(cos 2πx + cos 2πy) e^{−4ν_j π^2 t}, and the diffusion coefficient is ν_j = 0.01(1 + ε_j) with ε_j a random number uniformly distributed in [−0.2, 0.2]. We first check the rate of convergence of the velocity approximations for the first-order and second-order ensemble methods, respectively. The Q2/Q1 Taylor-Hood elements are used in both cases for the spatial discretization. Uniform rectangular meshes of size h and a uniform time discretization with step size ∆t are used to partition the spatial domain and the temporal interval, and the same notation h, ∆t, N_x, N_y and K as in Problem 1 is used. In both tests, we fix N_x and N_y large enough that the temporal error dominates the approximation error, and vary the time step size to check the rate of convergence. Based on Lemma 2.4, under uniform time refinement the ensemble simulation is expected to converge linearly in the first case and quadratically in the second.
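As a quick sanity check on this manufactured solution, one can verify that the velocity field is divergence free; a short sketch evaluating ∂u/∂x + ∂v/∂y on a sample grid (the unit square is assumed here for illustration):

```python
import numpy as np

# Evaluate div u = du/dx + dv/dy for the manufactured velocity:
#   u =  sin(pi x) cos(pi y) e^{-2 nu pi^2 t}
#   v = -cos(pi x) sin(pi y) e^{-2 nu pi^2 t}
nu, t = 0.01, 0.3
x, y = np.meshgrid(np.linspace(0.0, 1.0, 33), np.linspace(0.0, 1.0, 33))
decay = np.exp(-2.0 * nu * np.pi**2 * t)

du_dx = np.pi * np.cos(np.pi * x) * np.cos(np.pi * y) * decay
dv_dy = -np.pi * np.cos(np.pi * x) * np.cos(np.pi * y) * decay
div = du_dx + dv_dy        # the two terms cancel: the velocity is divergence free
```

The two partial derivatives cancel identically, consistent with the incompressibility constraint of the NSE.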
As discussed in subsection 3.2, the BGMRES-D algorithm together with the least-squares commutator preconditioner is used for solving the discrete systems resulting from the ensemble-based time stepping. The maximum number of iterations in each restart cycle is maxit = 50, and the stopping criterion is fulfilled when the relative residual is no greater than tol = 1 × 10^{−8}. The ensemble size is J = 40, but due to space limitations we only list the results of the three problems with j = 1, 20 and 40 in Tables 4-5, where ν_1 = 8.0619 × 10^{−3}, ν_20 = 1.1681 × 10^{−2}, ν_40 = 1.0804 × 10^{−2}, and E^N_j denotes the velocity approximation error in the L^2 norm at the final time. It is seen, in both cases, that the expected rates of convergence are obtained.

To illustrate the efficiency of the ensemble method, we take the second test case with N_x = N_y = 256 and K = 40, and compare the execution time of the ensemble simulation with that of 40 individual simulations using the same mesh and time step sizes. In this case, the numbers of degrees of freedom are n_u = 132,098 and n_p = 16,641. In the individual simulations, the preconditioned GMRES algorithm is applied to solve the discrete linear systems, each problem in the group being solved separately, with the same stopping criteria as the preconditioned BGMRES-D: maxit = 50 in each restart cycle and tol = 1 × 10^{−8}. The approximation errors of the three individual simulations with j = 1, 20 and 40 are close to the corresponding ensemble errors in Table 5. The execution times of both approaches are listed in Table 6: the ensemble simulation is over 10 times faster than the individual simulations. It is also observed that neither the BGMRES-D nor the single GMRES solver restarts during the simulations.
The iteration number per time step in BGMRES-D is the same as that of GMRES in each individual simulation. But as deflation is performed at each time step, the dimension of the RHS matrix in BGMRES-D is always no greater than J. The time evolution of its rank is plotted in Figure 2, which shows a maximum rank of 8. This indicates that fewer matrix-vector products are evaluated in the ensemble simulations than in the individual ones. On the other hand, the least-squares commutator preconditioning involves two discrete Poisson solves and several matrix-vector products. For the group of problems, the ensemble simulation applies the preconditioner only once per iteration, while the individual simulations apply it once per iteration for each of the J problems, which leads to significant computational savings.

The Dirichlet condition u = 1 − y^2, v = 0 is imposed at the inflow boundary (x = 0 and −1 ≤ y ≤ 1), a zero Dirichlet condition on the top and bottom of the channel (0 ≤ x ≤ 8 and y = ±1), and a do-nothing condition on the outflow boundary (x = 8 and −1 < y < 1). The flow is at rest at the initial time. The viscosity coefficients vary among the problems; in particular, ν_j = (1 + ε_j)/300 with ε_j a random perturbation uniformly distributed in [−0.2, 0.2] in the j-th problem.
In the simulations, we use the second-order scheme with a uniform time step size ∆t = 0.01 for the time integration, and keep the same parameters in the block and individual iterative solvers as in Problem 2. As there are no analytic solutions, we only show the ensemble simulation results together with the individual ones and compare their CPU times. Due to space limitations, we only show the velocity magnitude field at the final time for the three problems associated with Reynolds numbers Re_1 = 126.7001, Re_20 = 176.9705 and Re_40 = 150.6167 in Figure 3, and the evolution of the speed at a point behind the cylinder, (6, 2), in Figure 4. The simulation results are so close that no obvious difference can be observed.
The time evolution of the dimension of the RHS block in the ensemble simulation is shown in Figure 5; the size of the RHS matrix does shrink after deflation. The corresponding simulation times of the ensemble simulation with preconditioned BGMRES-D and the individual simulations with preconditioned GMRES are listed in Table 7: compared with the individual simulations, the ensemble simulation saves about 85% of the CPU time.

Remark 1. We note that the speedup factor of the ensemble simulation over the individual ones in Problem 3 is not as large as in Problem 2, because the computational saving of the block iterative algorithms is problem dependent. The solutions in Problem 3 change rapidly in time, as small differences in the viscosity can result in relatively large changes in the state; thus the deflation of the RHS matrix is not as effective as in Problem 2, making the speedup less pronounced.

5.
Conclusions. The ensemble method has recently been developed for efficiently solving a group of evolution problems. Using an ensemble-based time stepping, it leads to a single linear system with multiple right-hand-side vectors for the entire group; thus all the problems can be solved simultaneously at each time step, naturally sharing information among the right-hand sides. In this paper, we demonstrate, for the first time, the efficacy of the ensemble method when it works together with block iterative solvers. Our future work will extend the ensemble method to multi-physics systems and investigate the efficiency and scalability of block iterative solvers in parallel and high-performance computing.