A PARALLEL DOMAIN DECOMPOSITION ALGORITHM FOR LARGE SCALE IMAGE DENOISING

Inverse Problems and Imaging, Volume 13, No. 6 (2019), 1259-1282


(Communicated by Xuecheng Tai)
Abstract. Total variation denoising (TVD) is an effective technique for image denoising, in particular for recovering blocky, discontinuous images from a noisy background. The problem is formulated as an optimization problem in the space of bounded variation functions, and the solution is obtained by solving the associated Euler-Lagrange equation defined on the domain occupied by the entire image. The method offers high quality results, but is computationally expensive for large images, especially for three-dimensional problems. In this paper, we introduce a highly parallel version of the algorithm which formulates the problem as multiple overlapping, but independent, optimization problems, each defined on a portion of the image domain. This approach is similar to the overlapping Schwarz type domain decomposition method for solving partial differential equations, but it is non-iterative, and it is highly scalable on parallel computers with a large number of processors without using any coarse grids. We show by theory and by two- and three-dimensional numerical experiments that the new approach has similar numerical accuracy as the classical TVD approach, but is much more efficient on parallel computers.

1. Introduction. Denoising is usually the first step in processing a digital image, since noise often impacts the quality of the later phases of the process. Consider a noisy image defined on x ∈ Ω ⊂ R^d (d = 2, 3):

z(x) = u(x) + ε(x),

where u(x) is the ideal image to be restored, z(x) is the observed image, and ε(x) represents the noise to be removed. One of the most successful and popular methods is the Rudin-Osher-Fatemi (ROF) total variation based image denoising method [33]. The ROF model can be described as: find u ∈ BV(Ω) such that

(1)  min_{u∈BV(Ω)}  α ∫_Ω |∇u| + (1/2) ∫_Ω (u − z)² dx,

where |∇u| is the Radon measure, α is a positive parameter, and BV(Ω) is the space of functions of bounded variation on Ω. The solution of the minimization problem (1) is usually obtained by solving the associated Euler-Lagrange equation, obtained by the Fréchet differentiation of (1):

(2)  −α ∇·(∇u/|∇u|) + u − z = 0 in Ω,

with the homogeneous Neumann boundary condition ∂u/∂n = 0. To avoid the singularity at |∇u| = 0, a small positive parameter β is usually introduced into the TV functional, and the Euler-Lagrange problem is modified accordingly [33]. There are many techniques available for image denoising, such as the Gaussian smoothing method [4], empirical Wiener filters [10], wavelet thresholding methods [19], nonlocal means methods [6], variation based methods [2], and so on. Because of their capability of preserving sharp edges and boundaries with a high quality recovery, the variation based methods have been widely used with great success [2,33]. In addition to denoising, the TV method has been successfully applied to other image processing problems, including image deblurring, inpainting, segmentation, and cartoon-texture decompositions [5,9,11,30,38]. In this paper, we focus on the image denoising problem.
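To make the regularized model concrete, the following minimal numpy sketch applies explicit gradient descent to the β-regularized functional on a pixel grid with unit spacing. It is an illustration only, not the solver used in this paper, and the parameter values (α, β, step size τ, iteration count) are hypothetical choices for which the explicit scheme is stable.

```python
import numpy as np

def grad(u):
    # forward differences with homogeneous Neumann boundary (last row/col zero)
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    # discrete divergence, the negative adjoint of grad
    d = np.zeros_like(px)
    d[0, :] += px[0, :]
    d[1:-1, :] += px[1:-1, :] - px[:-2, :]
    d[-1, :] += -px[-2, :]
    d[:, 0] += py[:, 0]
    d[:, 1:-1] += py[:, 1:-1] - py[:, :-2]
    d[:, -1] += -py[:, -2]
    return d

def rof_denoise(z, alpha=0.15, beta=1e-2, tau=0.1, iters=300):
    # explicit gradient descent on the regularized ROF energy:
    # u <- u - tau * [ (u - z) - alpha * div( grad u / sqrt(|grad u|^2 + beta) ) ]
    u = z.copy()
    for _ in range(iters):
        gx, gy = grad(u)
        mag = np.sqrt(gx ** 2 + gy ** 2 + beta)
        u = u - tau * ((u - z) - alpha * div(gx / mag, gy / mag))
    return u
```

On a piecewise-constant test image, the descent reduces the noise while keeping block edges reasonably sharp; in practice one would replace plain descent by a Newton-type or primal-dual solver.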
Most existing algorithms for solving (1), such as those in [11,26,31,33,39], target sequential computers. With the recent development of photographic technology, image resolutions have become significantly higher. For the denoising of high resolution and large scale images, such as color images and three-dimensional images, the memory requirement and the computing time increase drastically, and, as a result, parallel computing becomes necessary. There are some recent works on parallel image processing using GPU, multi-core, and many-core architectures; see, for example, [1,20,32,35]. Single and multilevel domain decomposition (DD) methods have been applied to TV minimization problems, and both overlapping and nonoverlapping DD methods have been considered. In [23], some nonoverlapping DD methods were introduced for the TV model, and in [22], some overlapping methods were considered for the same class of problems. A two-level overlapping DD method for the primal total variation minimization was studied in [40], and the method was later extended to the nonlocal TV model in [13]. In [12], a DD method for the dual counterpart was considered and a convergence analysis was provided. In [25], some nonoverlapping DD methods coupled with semi-smooth Newton were studied for the pre-dual TV problem, and a convergence analysis was also provided. More recently, a dual decomposition based primal nonoverlapping DD method was introduced in [29] for the TV model. These DD based approaches are more suitable for CPU-based parallelization, but only limited parallel experiments were reported. In the PhD thesis [27], the author investigated some optimal control and convex programming based nonoverlapping DD methods for smoothed TV; the methods are scalable with up to 144 processor cores. All the above mentioned methods are iterative and require several linear and nonlinear iterations to converge.
In this paper, we propose a non-iterative parallel DD based algorithm applied directly to the optimization problem to solve the large scale image denoising problem on a supercomputer.
To motivate our new algorithm, we make an obvious, but often ignored, observation about a major difference between image denoising problems and problems arising from, for example, elliptic partial differential equations (PDE). The difference concerns data dependency, which is very important for parallelization on large scale parallel computers. Elliptic problems, and many other PDE problems, are globally dependent, in the sense that one cannot find the solution on part of the computational domain without knowing the solution on the rest of the domain. On the other hand, image denoising is a local problem; i.e., the noise in one part of the domain has nothing to do with the noise in other parts of the image. This noise locality is not taken advantage of by the classical ROF approach: once the optimization problem (1) is formed, the noise becomes a globally connected field, and additional effort (decomposition by domain and several iterations) is needed to break up the global field in order to introduce parallelism.
For the purpose of better parallelization, we take advantage of the fact that the noise in different parts of the image domain is independent, and form local optimization problems on subdomains, one for each processor of the parallel computer. We note that the algorithm does need a small amount of information from nearby subdomains; therefore, we allow the subdomains to overlap. An error analysis is provided in the paper to show that the solution obtained by the new method is essentially the same as the solution of the classical iterative approach based on the global optimization problem, and, more importantly, the new method is much faster on parallel computers because there is no communication between processor cores. The algorithm proposed in the paper is aimed at minimization problems in the BV space, and the numerical simulations presented in the paper show that the algorithm performs quite well when the observation is in the BV space. However, we need the assumption that the observation z ∈ W^{1,1}(Ω) for the theory to work.
The rest of the paper is organized as follows. In Section 2, the new method is introduced, and the existence, uniqueness, and error analysis of the method are given in Section 3. Some numerical experiments are shown in Section 4 to demonstrate the efficiency and robustness of the algorithm, and to compare the method with the classical ROF approach for two- and three-dimensional problems. Some conclusions and discussions are given in Section 5.

2. A non-iterative overlapping Schwarz method. In this section, we propose a non-iterative overlapping Schwarz method (NiOS) for solving the image denoising problem. The method introduced here is different from the approach based on (1), which solves an optimization problem defined on the whole domain, but we will show experimentally and theoretically that the difference is small in terms of the quality of the restored image. The advantage of the new method is that it minimizes the communication cost and is scalable on parallel computers without requiring any coarse grids.
The NiOS method begins with a partition of the image domain into N nonoverlapping subdomains Ω_0^k (k = 1, 2, · · · , N) that satisfy Ω = ∪_{k=1}^N Ω_0^k and Ω_0^i ∩ Ω_0^j = ∅ for i ≠ j. In other words, each pixel belongs to one and only one subdomain. Then we extend each subdomain into a larger subdomain Ω_δ^k (k = 1, 2, · · · , N) by including δ layers of pixels (δ is a positive integer) from the neighboring subdomains; see, for example, Fig. 1. On each subdomain Ω_δ^k, we define a ROF problem as follows:

(3)  min_{u_k∈BV(Ω_δ^k)}  α ∫_{Ω_δ^k} |∇u_k| + (1/2) ∫_{Ω_δ^k} (u_k − z_k)² dx,

where z_k = z|_{Ω_δ^k} is the noisy sub-image on the subdomain Ω_δ^k. The Euler-Lagrange equation corresponding to (3) takes the form

(4)  −α ∇·( ∇u_k / √(|∇u_k|² + β) ) + u_k − z_k = 0 in Ω_δ^k, with ∂u_k/∂n = 0 on ∂Ω_δ^k,

where β is a small positive parameter added to the denominator to avoid the possible singularity when |∇u_k| is 0. There are many discretization techniques available for solving (4), such as the finite difference method or the finite element method. After solving the problem on each overlapping subdomain, we throw away part of the solution and keep only the components on the nonoverlapping subdomain Ω_0^k; the denoised image u_h (the discretized solution) is obtained by combining those pieces as follows:

(5)  u_h = Σ_{k=1}^N E_0^k u_h^k,

where R_0^k is a restriction operator from the overlapping subdomain Ω_δ^k to the nonoverlapping subdomain Ω_0^k; that is, for a vector u_h^k defined on Ω_δ^k, R_0^k u_h^k keeps only the components belonging to Ω_0^k. E_0^k is an extension operator from the nonoverlapping subdomain Ω_0^k to the whole domain Ω, defined as follows. Let m be the total number of pixels in Ω and m_k the total number of pixels in Ω_δ^k. Then E_0^k is an m × m_k matrix whose element (E_0^k)_{l₁,l₂}, 1 ≤ l₁ ≤ m, 1 ≤ l₂ ≤ m_k, is 1 if l₁ and l₂ correspond to the same pixel in Ω_0^k, and 0 otherwise. In other words, we extend the local vector u_h^k to a global vector by padding with zeros, E_0^k u_h^k = (0, · · · , 0, R_0^k u_h^k, 0, · · · , 0)ᵀ. Fig. 2 gives an example for the N = 4 case. The framework of the NiOS algorithm is given in Algorithm 1.

Algorithm 1 : NiOS
Step 1. Partition the original image into N nonoverlapping sub-images Ω_0^k (k = 1, 2, · · · , N) and then extend each sub-image by δ layers of surrounding pixels to form the overlapping subdomains Ω_δ^k.
Step 2. Distribute the overlapping subdomain problems (4) to the processors and solve them independently.
Step 3. Combine the solutions from the nonoverlapping subdomains by (5) to obtain the denoised image.
We remark that Algorithm 1 is similar to the overlapping domain decomposition method, which decomposes the computational domain into overlapping subdomains and then restricts the original partial differential equation to each of the subdomains, except that here we apply the overlapping decomposition idea directly to the optimization problem. The other difference is that DD usually requires several iterations to obtain the solution, whereas the numerical experiments in this paper indicate that the NiOS algorithm can obtain a sufficiently accurate solution without any iteration. To understand this, an error analysis is provided in the next section to investigate the difference between the solutions obtained by NiOS and by the classical ROF method based on solving the global optimization problem. Note also that traditional DD requires coarse grids in order to be scalable, but NiOS is scalable without any coarse grids. It is important to note that increasing the overlap size helps reduce the difference between NiOS and the classical ROF, but it does not much improve the quality of the image as measured by the peak signal-to-noise ratio. To balance the accuracy and efficiency of the NiOS algorithm, a relatively small overlap size should be chosen according to our experimental studies.
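The partition-extend-solve-stitch structure of Algorithm 1 can be sketched as follows. This is a serial numpy illustration with hypothetical helper names; in the actual method each extended tile is assigned to its own processor core, and solve_local would be the subdomain ROF solver.

```python
import numpy as np

def partition_1d(n, parts, delta):
    # nonoverlapping index ranges and their delta-extended counterparts
    edges = np.linspace(0, n, parts + 1).astype(int)
    own = [(edges[i], edges[i + 1]) for i in range(parts)]
    ext = [(max(0, s - delta), min(n, e + delta)) for (s, e) in own]
    return own, ext

def nios(z, solve_local, parts=(2, 2), delta=4):
    # solve independent subproblems on the delta-overlapped tiles,
    # keep only each tile's nonoverlapping core (the restriction R_0^k),
    # and paste the cores back into the global image (the extension E_0^k)
    u = np.zeros_like(z)
    own_r, ext_r = partition_1d(z.shape[0], parts[0], delta)
    own_c, ext_c = partition_1d(z.shape[1], parts[1], delta)
    for (r0, r1), (er0, er1) in zip(own_r, ext_r):
        for (c0, c1), (ec0, ec1) in zip(own_c, ext_c):
            uk = solve_local(z[er0:er1, ec0:ec1])   # independent local solve
            u[r0:r1, c0:c1] = uk[r0 - er0:r1 - er0, c0 - ec0:c1 - ec0]
    return u
```

Passing the identity map as solve_local reproduces the input exactly, which checks that the restriction and extension operators tile the domain without gaps or double counting.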
3. Existence, uniqueness, and error analysis. In this section, we first prove the existence and uniqueness of the solution for the Euler-Lagrange equation. Then an auxiliary one-dimensional problem is studied. Finally we provide an error analysis which shows that the solution obtained by NiOS is close to the solution obtained by the method based on the global optimization approach.
We first consider the existence and uniqueness of the solution of the Euler-Lagrange equation with a homogeneous Neumann boundary condition ∂u/∂n = 0.

Figure 2. Step 2: distribute the overlapping subdomain problems to 4 processors and solve these four overlapping sub-problems (B); Step 3: combine the images from the non-overlapping subdomains (C). Here the points with different colors are the image pixels distributed to different processor cores.
We are mainly interested in the case where Ω is a bounded domain with Lipschitz boundary. Denoting by Φ the vector function Φ(p) = p/√(|p|² + β), we can rewrite (6) as a capillarity type equation:

(7)  ∇·Φ(∇u) = ϕ(x, u) in Ω, with ϕ(x, v) = (v − z(x))/α.

When the boundary condition takes the form Φ(∇u) · n = cos θ, the problem was carefully studied in [24,36,37]. In these papers, ϕ(x, v) is a given function assumed to have Hölder continuous partial derivatives with respect to (x, v) ∈ Ω̄ × R and to satisfy inf_{x∈Ω̄, v∈R} ∂ϕ(x, v)/∂v > 0, where Ω̄ represents the closure of Ω. Ural'tseva [37] shows that (7) has a C²(Ω) solution if θ = constant ∈ (0, π) and Ω is a convex domain of class C^{2,γ}. Using a different method, Spruck [36] obtains a similar result for Ω ⊂ R² with C⁴ smooth boundary, without the restriction of convexity; moreover, the angle θ is allowed to be non-constant, and the cases θ = 0, π are also discussed. Similar results are reported in [24].
In this paper, we aim to estimate the error of NiOS on a bounded domain with Lipschitz boundary. Note that the noisy image z is usually considered as a function in L²(Ω), and in such a situation ϕ(x, v) may not have Hölder continuous partial derivatives with respect to x ∈ Ω̄. The strong regularity results of [24,36,37] then do not hold for the Euler-Lagrange equation, and we therefore switch to a weak form of the problem. We first prove that the Euler-Lagrange equation has a unique weak solution in W^{1,1}(Ω) ∩ L²(Ω). The weak form of (7) is: find u ∈ W^{1,1}(Ω) ∩ L²(Ω) such that

(8)  ∫_Ω (u − z)φ dx + α ∫_Ω Φ(∇u)·∇φ dx = 0

holds for any φ ∈ W^{1,1}(Ω).
Lemma 3.1. For all x, y ∈ R^d with d = 1, 2 or 3, (Φ(x) − Φ(y)) · (x − y) ≥ 0; the inequality follows from the monotonicity of the function |Φ|.
Theorem 3.1. Suppose z ∈ W^{1,1}(Ω) ∩ L²(Ω). Then the weak problem (8) has a unique solution u ∈ W^{1,1}(Ω) ∩ L²(Ω), which satisfies ‖u‖_{L²(Ω)} ≤ ‖z‖_{L²(Ω)} and ∫_Ω |∇u| ≤ C, where C is a constant independent of α and β.

Proof. According to variational theory, the solution of the Euler-Lagrange equation is also the solution of the corresponding minimization problem. Thus, similarly to [21] (Theorem 2, page 448), we obtain that there exists at least one solution u ∈ V(Ω) minimizing L. Multiplying the Euler-Lagrange equation by u and integrating, and noting that the second term on the left-hand side is nonnegative, we obtain ‖u‖_{L²(Ω)} ≤ ‖z‖_{L²(Ω)}. Because z ∈ W^{1,1}(Ω), there is a positive constant C such that ‖∇z‖_{L¹(Ω)} ≤ C. Let Ω̃ be the subdomain of Ω defined as Ω̃ = {x ∈ Ω : |∇u| ≥ 1}. Estimating the total variation of u on Ω̃ and on Ω \ Ω̃ separately, we conclude that ∫_Ω |∇u| ≤ C, where C is a constant independent of α and β. The proof is complete.
In the rest of this section, we assume β < 1 and α is a given positive constant.
We introduce an auxiliary one-dimensional problem (9) as follows, and later we will consider two- and three-dimensional problems as tensor products of one-dimensional problems. Here D represents the derivative and the scalar function Ψ : R → R is the one-dimensional analogue of Φ, Ψ(p) = p/√(p² + β).

Lemma 3.2. The auxiliary problem (9) has a unique solution v that satisfies the estimates (10). Here C is a positive constant independent of α and β, and osc refers to the oscillation.

Proof. The uniqueness of the solution is easily obtained from Theorem 3.1. Let σ = αβ^{1/2}, ȳ = σy, and v̄ = v/(σβ^{1/2}). Then v̄ is a solution of a capillarity type equation:

v̄ − D_ȳ( D_ȳv̄ / √((D_ȳv̄)² + 1) ) = 0 in (σa, σb).
If ν₁ = ν₂ = 0, then the solution of (9) is v = 0, which satisfies (10); this completes the proof in this situation. Next, we consider the case with ν₁ = 0 and ν₂ > 0; the proof of the other case is similar. Let us assume that the minimum value which implies that the estimates (10) hold.
Next, we assume that ν₂ > β^{1/6}; the other cases are similar. Since Dv(y) is monotonically increasing on [a, b], we can choose a sequence ξ_n such that Dv(ξ_n) = β^{γ_n} for n = 1, 2, · · · , where γ₁ = 1/6 and γ_n = 1/6 + γ_{n−1}/3. It is clear that γ_n ≤ 1/4, lim_{n→∞} γ_n = 1/4, and lim_{n→∞} ξ_n = ξ₀ with Dv(ξ₀) = β^{1/4}. Since ξ_i > ζ₁ and v(ξ_i)/α ≥ C, we have (11). Together with (11) and (12), and using the fact that −y log β ≥ 1 − β^y for any small positive y, we obtain b − ξ_n ≤ −Cβ^{1/2} log β. Letting n → ∞, we have b − ξ₀ ≤ −Cβ^{1/2} log β, which shows the fourth inequality of (10).

Now let us consider the error estimates for the two- and three-dimensional problems based on the above lemmas and theorem. Assume u is the solution of (6) on the domain Ω and u_k is the solution of the sub-problem on Ω_δ^k. From Theorem 3.1, we get the uniqueness of the two solutions. For a fixed k, it is clear that u satisfies a perturbed sub-problem with boundary data ν(x), where |ν(x)| ≤ 1 and ν(x) = 0 for x on ∂Ω. The following theorem provides an estimate of the distance between u and u_k in L²(Ω_δ^k).

Theorem 3.2. Let Ω be a bounded domain with Lipschitz boundary. Assume u is the solution of (6) on Ω and u_k is the solution of (4) on Ω_δ^k. Then we have

‖u − u_k‖²_{L²(Ω_δ^k)} ≤ C(α + β^{1/4}),

where C is a constant which depends on ‖u‖_{W^{1,1}(Ω)} but is independent of α and β.
Proof. We obtain the uniqueness of u and u_k from Theorem 3.1. For i = 1, 2, · · · , d, let w_i^s and w_i^u be the solutions of equations (13) and (14). We first consider the case Ω_δ^k ⊂ Ω, in which Ω_δ^k is a rectangular or hexahedral subdomain of Ω. With the definitions of u, u_k, w_i^s, and w_i^u, we obtain an equation with boundary data ν̄_i^s = −(1 + β)Φ(∇u(x^s)) · n and ν̄_i^u = (1 + β)Φ(∇u(x^u)) · n, where x^s = (x₁, · · · , x_{i−1}, x_i^s, x_{i+1}, · · · , x_d) and x^u = (x₁, · · · , x_{i−1}, x_i^u, x_{i+1}, · · · , x_d) are two points on ∂Ω_δ^k. Multiplying this equation by u − u_k and integrating, we get (15). The first two terms on the right-hand side of (15) satisfy

Thanks to Theorem 3.1 and Lemma 3.2, we obtain
The last two terms on the right-hand side of (15) satisfy similar bounds. Since the second term on the left-hand side of (15) is nonnegative, we have ‖u − u_k‖²_{L²(Ω_δ^k)} ≤ C(α + β^{1/4}), which completes the proof in this situation.
where ν̄_i^s = −(1 + β)Φ(∇u(x̄^s)) · n and ν̄_i^u = (1 + β)Φ(∇u(x̄^u)) · n, respectively, and w_i^s and w_i^u are given in (13) and (14). It is important to note that ν̄_i^u and ν̄_i^s are both equal to zero when x̄^u and x̄^s are on ∂Ω. For x̄^u or x̄^s on ∂Ω_δ^k \ ∂Ω, we have that ν̄_i^u or ν̄_i^s is a function independent of x_i. As in case A, multiplying the above equation by u − u_k and integrating gives (16). The first summation on the right-hand side of (16) is bounded thanks to Theorem 3.1 and Lemma 3.2, and the second part on the right-hand side of (16) satisfies a similar bound. Since the second term on the left-hand side of (16) is nonnegative, we obtain the claimed estimate.

Theorem 3.3. Let Ω be a bounded domain with Lipschitz boundary. Suppose z ∈ W^{1,1}(Ω) ∩ L²(Ω). Assume u is the solution of (6) on Ω and ũ is the solution obtained by (4) and (5). Then we have

(17)  ‖u − ũ‖²_{L²(Ω)} ≤ C(α + β^{1/4}),

where C is a constant which depends on ‖u‖_{W^{1,1}(Ω)} but is independent of α and β.
The estimate follows by combining (5) with the subdomain estimates of Theorem 3.2, which completes the proof.

4. Numerical experiments.
In this section, some numerical experiments using the proposed algorithm are presented, and we also compare with results obtained by the classical ROF approach. We focus on the efficiency, scalability, and robustness of the algorithm. In the experiments, the noisy images are obtained by adding Gaussian white noise generated by the Matlab function imnoise(I, 'gaussian', M, σ²). A standard second-order finite difference method (5-point stencil for 2D and 7-point stencil for 3D) is used to discretize (4) or (6), with a mesh size equal to the distance between two neighboring pixels. To measure the quality of the restored images, the peak signal-to-noise ratio (PSNR) is used:

PSNR = 10 log₁₀( (max_j u_j)² · N₁N₂· · ·N_d / Σ_j (u_{j,r} − u_j)² ),

where (u_{j,r} − u_j) represents the difference of the pixel values between the restored and original images, and N_i is the number of pixels in the ith spatial coordinate direction. Higher values of PSNR mean better restoration quality. Typical values of PSNR in lossy image and video compression are between 30 dB and 50 dB, and acceptable values for wireless transmission quality loss are usually considered to be about 20 dB to 25 dB. Our algorithm is implemented on top of the open source package PETSc [3] developed at Argonne National Laboratory. All computations are carried out on the Tianhe-2 supercomputer at the China National Supercomputer Center in Guangzhou. Each compute node has a dual six-core Intel Xeon X5650@2.76GHz processor and 24GB of memory.
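For reference, a standard PSNR implementation for images with dynamic range [0, 1] looks as follows (our own helper, not the code used in the experiments):

```python
import numpy as np

def psnr(restored, original, peak=1.0):
    # peak signal-to-noise ratio in dB; peak is the image dynamic range
    mse = np.mean((np.asarray(restored) - np.asarray(original)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] image gives exactly 20 dB, which is a convenient sanity check.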
To compare with the classical ROF model, we implement a parallel solver for (6) using the Newton-Krylov-Schwarz method (NKS) [8], described in Algorithm 2 below, where an inexact Newton method is used as the nonlinear solver and a Krylov subspace method (GMRES) [34] is used to solve the Jacobian system, preconditioned by an overlapping domain decomposition method [7]. In Algorithm 2, F(u) = 0 represents the nonlinear system arising from the finite difference discretization of (6) on a rectangular or hexahedral mesh, J_k is the full Jacobian of F(u) at the point u_k, and M_k⁻¹ is an additive Schwarz preconditioner. The inexactness means that the accuracy of the Jacobian solve is determined by a parameter η ∈ [0, 1) in the sense that ‖F(u_k) + J_k s_k‖ ≤ η ‖F(u_k)‖.

Algorithm 2 : Newton-Krylov-Schwarz (NKS)
Use the observed image as the initial guess u₀.
For k = 0, 1, . . . until convergence, do
1. Construct the Jacobian matrix J_k.
2. Solve the following right-preconditioned Jacobian system inexactly by a Krylov subspace method: J_k M_k⁻¹ (M_k s_k) = −F(u_k).
3. Do a cubic line search to find a step length τ_k.
4. Set u_{k+1} = u_k + τ_k s_k.
End

The number of subdomains equals the number of processors, and the subdomain solver for the Schwarz method is an incomplete LU factorization (ILU). Relative residual stopping conditions are used for the linear and nonlinear solvers, with tolerances 10⁻⁴ and 10⁻⁶, respectively. The overlap in the Schwarz preconditioner is set to 1. The NKS method is a very powerful parallel solver for nonlinear problems, and it has been well studied for many problems, for example, problems in fluid dynamics [15,28] and optimization problems [14,18].
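The skeleton of Algorithm 2 can be sketched as below. For brevity, this toy version replaces preconditioned GMRES by plain conjugate gradients (valid here only because the demonstration Jacobian is symmetric positive definite) and takes full Newton steps instead of a cubic line search, so it illustrates only the inexact Newton-Krylov structure, not the production NKS solver.

```python
import numpy as np

def cg(matvec, b, rtol=1e-4, max_it=200):
    # a few steps of conjugate gradients, standing in for GMRES; stops once
    # the linear residual has dropped by the factor rtol (the "eta" of NKS)
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b0 = np.sqrt(rs)
    for _ in range(max_it):
        if np.sqrt(rs) <= rtol * b0:
            break
        Ap = matvec(p)
        a = rs / (p @ Ap)
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def inexact_newton(F, jac_matvec, u0, eta=1e-4, tol=1e-6, max_newton=30):
    # solve F(u) = 0: at each step solve J_k s = -F(u_k) inexactly, then update
    u = u0.copy()
    r0 = np.linalg.norm(F(u0))
    for _ in range(max_newton):
        r = F(u)
        if np.linalg.norm(r) <= tol * max(r0, 1.0):
            break
        s = cg(lambda v: jac_matvec(u, v), -r, rtol=eta)
        u = u + s  # full step; NKS would do a cubic line search here
    return u
```

The parameter eta plays the role of the inexactness parameter η of Algorithm 2: the inner Krylov solve is stopped early once the relative linear residual falls below it.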
Note that the subdomain problems (4) in NiOS are also solved with the sequential version of the NKS method on a single processor, where ILU is employed as the subdomain solver. In the rest of the paper, unless otherwise stated, "Newton" denotes the total number of Newton iterations for NKS (for NiOS, the average number of local Newton iterations over all processors), "GMRES" the average number of GMRES iterations per Newton iteration, "Time" the total compute time in seconds, and n_p the number of processors.

4.1. Two-dimensional image denoising. For two-dimensional problems, we test three benchmark gray scale images: boat (1024 × 1024), cameraman (2048 × 2048), and cameraman (4096 × 4096). The original images have the dynamic range [0, 1], and Gaussian white noise with variance σ² = 0.04 is added to them. For the purpose of comparison, we set the parameters to β = 10⁻⁴ and α = 0.18. The computed results are shown in Figs. 3-4. A comparison of the results obtained by the NiOS method and the NKS method is shown in Fig. 4; clearly the restored images of the two methods are very similar. Furthermore, the PSNRs of the two images are almost the same (PSNR_NiOS = 28.536962 and PSNR_NKS = 28.536417). The compute time comparison is in Table 1, which shows that NiOS is much faster than NKS, especially when the number of processors is large; for example, when n_p = 1024 for the cameraman 4096 × 4096 case, NiOS is 4.5 times faster than NKS.
To further compare the two methods, we plot the difference (which we refer to as the error of the NiOS method) of the two solutions in Fig. 5; the larger errors appear mainly near the inner boundaries of the subdomains. Table 2 shows how the error depends on the parameters α and β. We observe that the error decreases as α decreases and increases as β decreases. A possible reason for the latter is that Theorem 3.1 gives an upper bound of ‖u‖_{W^{1,1}(Ω)}, but the relationship between ‖u‖_{W^{1,1}(Ω)} and β is still unknown; thus the constant C in the error estimate (17) may increase as β decreases, which leads to the increase of the error. According to the estimate (17), the error is proportional to α + β^{1/4}, which matches the numerical results in Table 2. To check the convergence rate, we reduce the parameters proportionally; that is, if α is reduced by 1/2, then β is reduced by (1/2)⁴. Table 3 shows the convergence rate of the algorithm, which matches the estimate from the theory. As mentioned before, β is an artificial parameter used to prevent the singularity at |∇u| = 0, so its value should be chosen as small as possible. As reported in Table 2, the number of nonlinear iterations increases noticeably as β decreases, while the image quality improves only slowly. So in this paper we use α = 0.18 and β = 1.0 × 10⁻⁴ unless mentioned otherwise. With these values of α and β, the relative difference of the solutions obtained by NiOS and NKS is about 1.0 × 10⁻³, which is small enough to indicate that the two images are essentially the same.
The parallel performance of NiOS for the 2D cameraman image is given in Table 4 and Fig. 6. Table 4 shows that the numbers of Newton and GMRES iterations decrease as the number of processors (which equals the number of subdomains) increases. A possible reason is that the subdomain problems become smaller as the number of subdomains increases, which makes them easier to solve. Fig. 6 shows that the parallel performance of the NiOS method is almost ideal (ideal speedup meaning that the total compute time is halved as the number of processors is doubled). This is because the algorithm is communication-free, and the only factor that influences the parallel efficiency is the imbalance among the subdomain problems: different subdomain problems have different smoothness properties, so the numbers of Newton and GMRES iterations and the compute times differ, as shown in Table 5. The compute time for the fastest subdomain is almost half of that for the slowest subdomain. In all test cases, we take the time spent by the slowest subdomain as the total compute time. In some of our tests, the total compute time is not exactly halved when the number of processors doubles because the computations on different processors are not always balanced.
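Speedups and parallel efficiencies of the kind reported in Fig. 6 can be tabulated from measured wall times with a small helper like the following (a hypothetical utility of ours; any timings fed to it are illustrative):

```python
def speedup_table(times):
    # times: mapping {processor_count: wall_time_in_seconds}.
    # Returns {processor_count: (speedup, parallel_efficiency)}, where the
    # speedup is measured relative to the smallest processor count; ideal
    # speedup doubles whenever the processor count doubles (efficiency 1.0).
    base_np = min(times)
    base_time = times[base_np]
    table = {}
    for n, t in sorted(times.items()):
        s = base_time / t
        table[n] = (s, s / (n / base_np))
    return table
```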

4.2. Three-dimensional magnetic resonance image denoising. In medical diagnosis, three-dimensional magnetic resonance (MR) imaging plays an important role; however, random noise is difficult to avoid in the imaging process, and the noise may lead to numerous systematic errors in subsequent applications. For three-dimensional image processing, the computation is more expensive because of the large problem size. In this section, the performance of the NiOS method is investigated on three-dimensional images from the BrainWeb database [16]. We focus on the restoration of two images, a T1-weighted (T1-w) image and a T2-weighted (T2-w) image, both of which contain 181 × 217 × 181 voxels. 9% Rician noise is added to the images using white Gaussian noise as in [17]. Suppose that the pixel values of the images lie in the interval [0, 1]; 9% Rician noise then means that the Gaussian noise used is equivalent to (9/100)·ν, where ν is the intensity of the brightest tissue in the image (150/255 for T1-w and 250/255 for T2-w). In the experiments, the PSNR values are used as the criterion for the quality of the image restoration. For the sake of clarity, the PSNR is estimated only in the brain region, obtained by removing the background. Instead of removing the noise slice by slice as in most of the previous works, we use the full three-dimensional model; that is, we solve the ROF model (1) in the three-dimensional domain. An example of the noisy image and its partition for parallel computing is shown in Fig. 7. Slices of the restored images are shown in Fig. 8, and three-dimensional reconstructions of the images are shown in Fig. 9.

Table 3. The error of the NiOS method with the parameters α and β changed proportionally; for example, when α is reduced by 1/2, β is reduced by (1/2)⁴. "ERROR" has the same definition as in Table 2.
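The Rician corruption described above can be generated as follows. This is a sketch of the usual construction, in which two independent Gaussian fields of standard deviation (9/100)·ν perturb the real and imaginary channels before taking the magnitude; the function and its default arguments are our assumptions, not the exact script of [17].

```python
import numpy as np

def add_rician_noise(img, level=0.09, nu=150.0 / 255.0, seed=0):
    # Rician noise at the given level: the magnitude of the image plus
    # complex Gaussian noise with standard deviation sigma = level * nu,
    # where nu is the intensity of the brightest tissue in the image
    rng = np.random.default_rng(seed)
    sigma = level * nu
    g_re = rng.normal(0.0, sigma, img.shape)
    g_im = rng.normal(0.0, sigma, img.shape)
    return np.sqrt((img + g_re) ** 2 + g_im ** 2)
```

Note that, unlike additive Gaussian noise, the resulting intensities are always nonnegative and the noise is signal-dependent, which is why MR denoising experiments treat it separately.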
The PSNR values for the noisy T1-w and T2-w images are 26.3706 and 21.9557, and after denoising they are improved to 30.2275 and 25.7565, respectively. Table 6 shows a comparison of NiOS and NKS; the results indicate that NiOS saves a lot of compute time compared with NKS, especially when the number of processors is large. Table 7 shows the effect of the overlap size in the NiOS method. We see that a small overlap is good enough in terms of the compute time and image quality. The last column of Table 7 shows that the difference between the solutions of NiOS and NKS decreases as the overlap size increases.
To further understand the newly developed method, we combine the NiOS method and the NKS method by using the solution of the NiOS method as the initial guess of the NKS method. In Table 8, the "First Stage" refers to the NiOS method and the "Second Stage" refers to the NKS method with the initial guess obtained by NiOS. We see that the solution of the NiOS method provides a very good initial guess for the NKS method: compared with the previous experiment, the NKS method converges much faster than when starting from the noisy image.

Figure 8. Slice 100 of the 3D MR images. From left to right, first row: the original T1-w image, the T1-w image with 9% Rician noise, and the restored image; second row: details of the first-row images; third row: the original T2-w image, the T2-w image with 9% Rician noise, and the restored image; last row: details of the third-row images.

5. Conclusion. We developed a new parallel non-iterative domain decomposition method for large scale image denoising. Numerical results show that the newly developed method works well for 2D and 3D image denoising problems, and a linear speedup is obtained with up to 1024 processors. Traditional domain decomposition methods, whether overlapping or nonoverlapping, require several iterations to find the solution and need at least one coarse grid to achieve linear scalability on parallel computers. The new method requires neither iterations nor coarse grids and is scalable on parallel computers. A theoretical analysis is provided to show that the error of the method, compared with the classical ROF model, is proportional to α + β^{1/4}, and the estimate is confirmed by numerical experiments with two- and three-dimensional images. We have not tested, but would like to mention, that the method can easily be combined with other methods, such as the augmented Lagrangian method, the primal-dual method, the split Bregman iteration, etc.