Convergence Theorems for the Non-Local Means Filter

In this paper, we establish convergence theorems for the Non-Local Means Filter in removing additive Gaussian noise. We employ the techniques of "Oracle" estimation to determine the order of the widths of the similarity patches and search windows in the aforementioned filter. We propose a practical choice of these parameters which improves the restoration quality of the filter compared with the usual choice of parameters.

We consider the noise model

Y(x) = f(x) + ε(x), x ∈ I, (1)

where I is the uniform N × N grid of pixels on the unit square, Y = (Y(x))_{x∈I} is the observed image brightness, f : [0, 1]² → R_+ is the original image (an unknown target regression function), and ε = (ε(x))_{x∈I} are independent and identically distributed (i.i.d.) Gaussian random variables with mean 0 and standard deviation σ > 0.
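As a quick illustration, the observation model (1) can be simulated as follows (a minimal sketch; the constant test image, grid size, and random seed are arbitrary choices, not taken from the paper):

```python
import numpy as np

def make_noisy_image(f, sigma, seed=0):
    """Simulate the model (1): Y(x) = f(x) + eps(x), where the eps(x)
    are i.i.d. Gaussian with mean 0 and standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return f + rng.normal(0.0, sigma, size=f.shape)

# Example: an 8 x 8 constant image degraded by noise with sigma = 20.
f = np.full((8, 8), 128.0)
Y = make_noisy_image(f, sigma=20.0)
```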
Important denoising techniques for the model (1) have been developed in recent years. A very significant step in these developments was the introduction of the Non-Local Means Filter by Buades et al. [1]. For closely related works, see for example [2, 3, 4, 5, 6].
The basic idea of filters by weighted means is to estimate the unknown image value f(x_0) by a weighted average of observations Y(x) of the form

f̂_h(x_0) = Σ_{x∈U_{x_0,h}} w(x) Y(x),

where, for each x_0 and h > 0, U_{x_0,h} denotes a square window with center x_0 and width 2h, and the w(x) are non-negative weights satisfying Σ_{x∈U_{x_0,h}} w(x) = 1. The choice of the weights w(x) is usually based on two criteria: a spatial criterion, so that w(x) is a decreasing function of the distance between x and x_0, and a similarity criterion, so that w(x) is also a decreasing function of the brightness difference |Y(x) − Y(x_0)| (see e.g. [7, 8]), which measures the similarity between the pixels x and x_0. In the Non-Local Means Filter, h > 0 can be chosen relatively large, and the weights w(x) are calculated according to the similarity between the data patches Y_{x,η} = (Y(y) : y ∈ U_{x,η}) (identified with a vector whose components are ordered lexicographically) and Y_{x_0,η} = (Y(y) : y ∈ U_{x_0,η}), instead of the similarity between just the pixels x and x_0. Here η > 0 is the size parameter of the data patches.
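In code, a weighted-means estimate of f(x_0) is simply a convex combination of the observations in the search window. The sketch below assumes the weights have already been computed (e.g. from spatial distance or patch similarity):

```python
import numpy as np

def weighted_mean_estimate(Y_window, w):
    """Estimate f(x0) as sum_{x in U_{x0,h}} w(x) Y(x), where the
    weights w are non-negative and sum to 1 over the search window."""
    Y_window = np.asarray(Y_window, dtype=float)
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return float(np.sum(w * Y_window))

# Uniform weights over a 3 x 3 window reduce to the plain average.
window = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])
estimate = weighted_mean_estimate(window, np.full((3, 3), 1.0 / 9.0))
```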
Unfortunately, the ideal implementation of Non-Local Means is computationally expensive. Therefore, to speed up the denoising, only a neighborhood of the estimated point is considered. In practice, similarity patches of size 7 × 7 or 9 × 9 and search windows of size 19 × 19 or 21 × 21 are often chosen. However, these choices are empirical and the problem of optimal choice remains open. As a consequence, the results of the numerical simulations are not always satisfactory.
In this paper, we use statistical estimation and optimization techniques to give a justification of the Non-Local Means filter and to suggest the order of the sizes of the search window and similarity patch. Our main idea is to minimize a tight upper bound of the L² risk by varying the width of the search window. We first obtain an explicit formula for the optimal weights w*_h in terms of the unknown function f. The corresponding weighted mean f*_h is called the "Oracle"; the "Oracle" f*_h is shown to have an optimal rate of convergence and high performance in numerical simulations. To mimic the "Oracle" f*_h, we estimate w*_h by adaptive weights w_h based on the observed image Y. We thus obtain the Non-Local Means Filter with the proper width of window. Numerical results show that the Non-Local Means Filter with the proper width of window outperforms the Non-Local Means Filter with the standard choice.
The paper is organized as follows. In Section 2, we introduce the "Oracle" estimator of Non-Local Means Filter and reconstruct Non-Local Means Filter with the idea of "Oracle" theory. Our main theoretical results are presented in Section 3 where we give the rate of convergence of the Non-Local Means Filter. In Section 4, we present our simulation results with a brief analysis. Section 5 gives the conclusion of our paper. Proofs of the main results are deferred to Section 6.

Notations
Let us set some notation to be used throughout the paper. The Euclidean norm of a vector x = (x_1, ..., x_d) is ‖x‖ = (Σ_{i=1}^d x_i²)^{1/2}, and ‖x‖_∞ = max_i |x_i| denotes the supremum norm. The cardinality of a set A is denoted by card A. For a positive integer N, the uniform N × N grid of pixels on the unit square is defined by

I = {(i/N, j/N) : i, j ∈ {1, 2, ..., N}}.

Each element x of the grid I will be called a pixel. The number of pixels is n = N². For any pixel x_0 ∈ I and a given h > 0, the square window of pixels

U_{x_0,h} = {x ∈ I : ‖x − x_0‖_∞ ≤ h} (4)

will be called the search window at x_0. We naturally take h as a multiple of 1/N (h = k/N for some k ∈ {1, 2, ..., N}). The size of the square search window U_{x_0,h} is the positive integer

M = card U_{x_0,h} = (2Nh + 1)².

For any pixel x ∈ U_{x_0,h} and a given η > 0, a second square window of pixels U_{x,η} will be called the patch at x. Like h, the parameter η is also taken as a multiple of 1/N. The size of the patch U_{x,η} is the positive integer

m = card U_{x,η} = (2Nη + 1)². (6)

The vector Y_{x,η} = (Y(y))_{y∈U_{x,η}} formed by the values of the observed noisy image Y at the pixels of the patch U_{x,η} will be called simply the data patch at x ∈ U_{x_0,h}.
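Under the conventions above (grid spacing 1/N, half-width h = k/N), the sizes M and m are simple to compute; a small sanity check:

```python
def window_size(N, k):
    """Pixel count of a square window of half-width h = k/N on the
    uniform N x N grid: each side holds 2k + 1 grid points, so the
    size is (2*N*h + 1)**2 = (2k + 1)**2."""
    return (2 * k + 1) ** 2

# A 21 x 21 search window corresponds to k = 10, a 9 x 9 patch to k = 4.
M = window_size(512, 10)
m = window_size(512, 4)
```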

The "Oracle" of Non-Local Means
In order to study the statistical estimation theory of the Non-Local Means algorithm, we introduce an "Oracle" estimator of Non-Local Means (for details on this concept see Donoho and Johnstone (1994) [19]). Denote by f*_h(x_0) the weighted mean (7) with the oracle weights w*_h, where H > 0 is a constant. Note that the function ρ_{f,x_0}(x) ≥ 0 characterizes the similarity of the image brightness at the pixel x with respect to the pixel x_0; we therefore call ρ_{f,x_0} the similarity function. The usual bias–variance decomposition of the Mean Squared Error (MSE) gives the bound (11); combined with (9), it implies the upper bound (12), with the function g defined in (13). We define a family of estimates by minimizing the function g(w*_h) in w*_h and plugging the optimal weights into (7). We shall consider the local Hölder condition

|f(x) − f(y)| ≤ L ‖x − y‖_∞^β, (14)

where β > 0 and L > 0 are constants, h > 0, η > 0 and x_0 ∈ I. The following theorem gives the rate of convergence of the "Oracle" estimator and the proper width h of the search window.
Theorem 1. Suppose that the function f satisfies the local Hölder condition (14) and that f*_h(x_0) is given by (7). Then the bound (15) holds. For the proof of this theorem see Section 6.1. Simulations confirm that the difference between the "Oracle" f*_h(x_0) and the true value f(x_0) is extremely small (see Table 1; the definition of PSNR can be found in Section 4). Thus, at least from the practical point of view, the theorem justifies that it is reasonable to optimize the upper bound g. Theorem 1 shows that the choice of a small search window, in place of the whole observed image, suffices to ensure denoising without loss of visual quality; this explains why we take a small search window in the simulations of the Non-Local Means algorithm.

Reconstruction of the Non-Local Means Filter
With the theory of the "Oracle" estimator, we reconstruct the Non-Local Means filter [1]. Let h > 0 and η > 0 be fixed numbers. For any x_0 ∈ I and any x ∈ U_{x_0,h}, the distance between the data patches Y_{x,η} = (Y(y))_{y∈U_{x,η}} and Y_{x_0,η} = (Y(y))_{y∈U_{x_0,η}} is defined by

d²(Y_{x,η}, Y_{x_0,η}) = (1/m) Σ_{y∈U_{x_0,η}} (Y(T_x y) − Y(y))²,

where T_x is the translation mapping T_x y = x + (y − x_0) and m is given by (6); this distance measures the similarity between the data patches Y_{x,η} and Y_{x_0,η}. Define an estimated similarity function ρ̂_{x_0} from these distances, and an adaptive estimator f̂_h as the corresponding weighted mean over the search window U_{x_0,h} given by (4).
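The patch distance just described can be sketched directly (integer pixel coordinates are an implementation convention; patches are assumed to lie inside the image, so border handling is omitted):

```python
import numpy as np

def patch_distance2(Y, x0, x, eta_pix):
    """Squared patch distance: the average of (Y(T_x y) - Y(y))^2 over
    the patch at x0, where T_x y = x + (y - x0) translates the
    reference patch at x0 onto the patch at x.  eta_pix is the patch
    half-width in pixels, so each patch has m = (2*eta_pix+1)**2 pixels."""
    (i0, j0), (i, j), p = x0, x, eta_pix
    P0 = Y[i0 - p:i0 + p + 1, j0 - p:j0 + p + 1]  # patch at x0
    Px = Y[i - p:i + p + 1, j - p:j + p + 1]      # translated patch at x
    return float(np.mean((Px - P0) ** 2))

# On a constant image every patch matches exactly, so the distance is 0.
Y = np.full((9, 9), 7.0)
d2 = patch_distance2(Y, (4, 4), (2, 3), eta_pix=1)
```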
Theorem 2. Suppose that the function f satisfies the local Hölder condition (14) and that ρ̂_{x_0} is given by (25). Then there is a constant c_1 such that the probability P( max_{x∈U_{x_0,h}} |ρ̂_{x_0}(x) − ρ_{f,x_0}(x)| ≥ c_1 n^{α−1/2} ) tends to 0. For the proof of this theorem see Section 6.2.
In Theorem 2, σ is treated as a constant, and the Hölder condition (14) controls the bias term; therefore, if n is large enough, the condition of the theorem is satisfied. That is to say, the larger the standard deviation of the noise is, the more useful our theorem will be. We take the test image "Lena" as an example, degraded by Gaussian noise with σ = 10, σ = 20 and σ = 30, respectively. We fix the size of the search window M = 13 × 13, set H = 0.4 × σ + 2, and choose the size of the similarity patch m ∈ {2k + 1 : k = 1, 2, ..., 20}. In the cases σ = 20 and σ = 30, Figures 1(b) and (c) illustrate that the PSNR value increases as the size of the similarity patch increases; these evolutions are in accordance with Theorem 2. However, in the case σ = 10, Figure 1(a) shows that the PSNR value increases with the patch size on the interval [3, 15], reaches its peak, and then decreases on the interval [15, 41]. This means that the value σ = 10 is not large enough to satisfy the condition of the theorem. In order to improve the results, we sometimes use the smoothed version d²_κ(Y_{x,η}, Y_{x_0,η}) of the estimated brightness variation instead of the non-smoothed one d²(Y_{x,η}, Y_{x_0,η}); similar convergence results can be established for the smoothed versions. The smoothed distance d²_κ is defined by (21), where the κ(y) are non-negative weights defined on U_{x_0,η}. With the rectangular kernel κ_r(y) = 1 for y ∈ U_{x_0,η} and 0 otherwise, we recover exactly the distance d²(Y_{x,η}, Y_{x_0,η}). Other smoothing kernels κ(y) used in the simulations are the Gaussian kernel, with bandwidth parameter h_g, and a kernel κ_0 that is constant on the rings {y ∈ U_{x_0,η} : ‖y − x_0‖_∞ = j/N}, j ∈ {0, 1, ..., Nη}. The kernel κ(y) = κ_0(y) is used in our paper and in Buades et al. [1].
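A kernel-smoothed distance of the kind just described can be sketched as a κ-weighted average of squared patch differences; with the rectangular kernel it recovers the plain distance d², as noted above. The coordinate and normalization conventions here are our assumptions:

```python
import numpy as np

def smoothed_patch_distance2(Y, x0, x, eta_pix, kappa):
    """Kernel-smoothed distance d^2_kappa: squared patch differences
    averaged with non-negative weights kappa defined on the patch.
    kappa is a (2*eta_pix+1) x (2*eta_pix+1) array; it is normalized
    so that the rectangular kernel (all ones) gives the plain mean of
    squared differences, i.e. the unsmoothed d^2."""
    (i0, j0), (i, j), p = x0, x, eta_pix
    P0 = Y[i0 - p:i0 + p + 1, j0 - p:j0 + p + 1]
    Px = Y[i - p:i + p + 1, j - p:j + p + 1]
    kappa = np.asarray(kappa, dtype=float)
    kappa = kappa / kappa.sum()
    return float(np.sum(kappa * (Px - P0) ** 2))

# Rectangular kernel on a 3 x 3 patch: neighboring patches of this ramp
# image differ by exactly 1 at every pixel, so the distance is 1.
Y = np.arange(81, dtype=float).reshape(9, 9)
d2_rect = smoothed_patch_distance2(Y, (4, 4), (4, 5), 1, np.ones((3, 3)))
```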
To avoid undesirable border effects, we mirror the image outside the image limits. In more detail, we extend the image outside its limits symmetrically with respect to the border; at the corners, the image is extended symmetrically with respect to the corner pixels.
The following is the algorithm for denoising used in Buades et al [1].
Let {H, M, m} be the parameters. Repeat for each x_0 ∈ I:
- compute the distances d²_κ(Y_{x,η}, Y_{x_0,η}) (given by (21)) for all x in the search window U_{x_0,h};
- compute the corresponding weights and output the weighted mean estimate at x_0.

A detailed theoretical analysis and the convergence of the Non-Local Means Filter will be given in Section 3. In Section 4, numerical simulations show that we can optimize the parameters to make the Non-Local Means Filter perform better.
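The loop above can be sketched end to end as follows. This is a minimal, unoptimized sketch with the rectangular kernel and exponential weights exp(−d²/H²); the exponential form is the standard Non-Local Means choice, assumed here rather than taken from the paper's formulas:

```python
import numpy as np

def nlm_denoise(Y, H, half_search, half_patch):
    """Minimal Non-Local Means sketch: for each pixel x0, compare the
    patch at x0 with every patch in the search window, form weights
    w = exp(-d^2 / H^2) (an assumed standard choice) and return the
    normalized weighted mean of Y over the window.  Borders are
    mirror-extended as described in the text."""
    N1, N2 = Y.shape
    pad = half_search + half_patch
    Yp = np.pad(Y, pad, mode="symmetric")
    out = np.empty_like(Y, dtype=float)
    p = half_patch
    for i in range(N1):
        for j in range(N2):
            ci, cj = i + pad, j + pad                      # x0, padded coords
            P0 = Yp[ci - p:ci + p + 1, cj - p:cj + p + 1]  # reference patch
            num = den = 0.0
            for di in range(-half_search, half_search + 1):
                for dj in range(-half_search, half_search + 1):
                    xi, xj = ci + di, cj + dj
                    Px = Yp[xi - p:xi + p + 1, xj - p:xj + p + 1]
                    d2 = float(np.mean((Px - P0) ** 2))
                    w = np.exp(-d2 / H ** 2)
                    num += w * Yp[xi, xj]
                    den += w
            out[i, j] = num / den
    return out

# On a constant image all patch distances are 0, so the weights are
# uniform and the filter returns the image unchanged.
const = np.full((4, 4), 5.0)
restored = nlm_denoise(const, H=10.0, half_search=1, half_patch=1)
```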

Convergence theorem of Non-Local Means
Now we turn to the study of the convergence of the filter introduced above. Due to the difficulty of dealing with the dependence of the weights, we shall consider a slightly modified version of the proposed algorithm: we divide the set of pixels into two independent parts, so that the weights are constructed from one part while the estimate of the target function is a weighted mean over the other part. More precisely, assume that x_0 ∈ I, h > 0 and η > 0. To prove the convergence we split the set of pixels into two parts I = I′_{x_0} ∪ I″_{x_0}, where I″_{x_0} = I \ I′_{x_0}. Define an estimated similarity function ρ′_{x_0} by (25), where U″_{x_0,η} = U_{x_0,h} ∩ I″_{x_0} with U_{x_0,h} given by (4), and an adaptive estimator f′_h by (26), where U′_{x_0,h} = U_{x_0,h} ∩ I′_{x_0}. In the next theorem we prove that the Mean Squared Error of the estimator f′_h(x_0) converges at the rate n^{−2β/(2β+2)}, which is the usual optimal rate of convergence for a given Hölder smoothness β > 0 (see e.g. Fan and Gijbels (1996) [20]).
Theorem 3. Assume that H > 2c_1 n^{α−1/2} and H > √2 Lh. Suppose that the function f satisfies the Hölder condition (14) and that f′_h is given by (26). Then E(f′_h(x_0) − f(x_0))² = O(n^{−2β/(2β+2)}). For the proof of this theorem see Section 6.2.

Simulation
In this section, we compare the performance of the Non-Local Means Filter computed with the parameters proposed in this paper against that obtained with the parameters proposed in Buades et al. [1]. The results were measured by the usual Peak Signal-to-Noise Ratio (PSNR) in decibels (dB), defined as

PSNR = 10 log_10 (255² / MSE), MSE = (1/n) Σ_{x∈I} (f(x) − f̂_h(x))²,

where f is the original image and f̂_h is the estimated one.
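The PSNR criterion can be computed directly:

```python
import numpy as np

def psnr(f, f_hat):
    """Peak Signal-to-Noise Ratio in dB for 8-bit images:
    PSNR = 10 * log10(255^2 / MSE), MSE the mean squared error."""
    mse = np.mean((np.asarray(f, float) - np.asarray(f_hat, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# A uniform error of 1 gray level gives MSE = 1, i.e. about 48.13 dB.
value = psnr(np.zeros((8, 8)), np.ones((8, 8)))
```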
We have done simulations on a commonly used set of images available at http://decsai.ugr.es/javier/denoise/test images/. The potential of the estimation method is illustrated with the 512 × 512 "Lena" image (Figure 2(a)) corrupted by additive white Gaussian noise (Figure 2(a) right, PSNR = 22.10 dB, σ = 20). We have seen experimentally that the filtering parameter H can take values between 0.4 × σ + 2 and 0.5 × σ + 2, yielding a solution of high visual quality. Theorem 3 implies that the search window is of size c_0 σ^{2/(2β+2)}. Assuming that β = 1, we get a search window of size c_0 √σ. Experiments show that when the side of the search window takes the value 1.5 × √σ + 4.5, we obtain the best quality for the Non-Local Means Filter. Our simulations also show that it is convenient to take the similarity patch size as m = 17 × 17 for σ = 10, and m = 21 × 21 for σ = 20 and σ = 30. In Figure 2(b) left, we can see that the noise is reduced in a natural manner, and significant geometric features, fine textures, and original contrasts are visually well recovered with no undesirable artifacts (PSNR = 32.39 dB). To better appreciate the accuracy of the restoration process, the squared difference between the original image and the recovered image is shown in Figure 2(b) right, where dark values correspond to high-confidence estimates. As expected, pixels with a low level of confidence are located in the neighborhood of image discontinuities. For comparison, we give the image denoised by the Non-Local Means Filter with 21 × 21 search windows and 9 × 9 similarity patches (PSNR = 31.51 dB) and its squared error in Figure 2(c). Both the overall visual impression and the numerical results are improved using our theory.
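The empirical parameter choices reported in this section can be collected in a small helper. The rounding of the window side to the nearest odd integer is our convention (window sides must be odd to have a center pixel), not the paper's:

```python
import math

def suggested_parameters(sigma):
    """Parameter choices reported in the text: filtering parameter
    H = 0.4*sigma + 2, search-window side about 1.5*sqrt(sigma) + 4.5
    (rounded here to the nearest odd integer), similarity-patch side
    17 for sigma = 10 and 21 for sigma = 20 or 30."""
    H = 0.4 * sigma + 2.0
    side = int(round(1.5 * math.sqrt(sigma) + 4.5))
    if side % 2 == 0:
        side += 1
    patch_side = 17 if sigma <= 10 else 21
    return H, side, patch_side
```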
In Table 2, we show a comparison of the PSNR values of the Non-Local Means Filter computed with the parameters proposed in Buades et al. [1] and with those proposed in our paper. It is easy to see that the visual quality rises noticeably as the standard deviation σ increases. Nothing improves in the visual quality for σ = 10, but it improves by 0.50 dB on average for σ = 20 and by 0.98 dB on average for σ = 30. A comparison with several filters is given in Table 3. The PSNR values show that the Non-Local Means Filter with proper parameters is as good as more sophisticated methods, like [3, 26, 28, 29], and is better than the filters proposed in [22, 23, 24, 25, 26]. The proposed approach gives a denoising quality competitive with that of the recent method BM3D [16].

Conclusion
We have proposed new convergence theorems for the Non-Local Means Filter, based on optimization of the parameters in the weighted means approach. Our analysis shows that a small search window is preferable to the whole image, and that a large similarity patch (m = 21 × 21) is preferable to a small similarity patch (m = 7 × 7). The proposed theorems improve on the usual parameters of the Non-Local Means Filter both numerically and visually in denoising performance. We hope that the convergence theorems for the Non-Local Means Filter that we have deduced can bring similar improvements to recently developed algorithms in which the basic idea of the Non-Local Means filter is used.
6 Proofs of the main results

Proof of Theorem 1
Denoting for brevity the quantities in (28), we have the bound stated there. Noting that t e^{−t²/H²}, t ∈ [0, H/√2), is increasing, it is easy to see that (31) holds. Since e^{−t²/H²}, t ∈ [0, H/√2), is decreasing, (32) follows, and a one-term Taylor expansion gives (33). The three inequalities (28), (33) and (32) imply (34). Taking into account (33), it is easily seen that (35) holds. Combining (31), (34) and (35), we obtain (36). Letting h minimize the last term of this inequality gives (37), and substituting (37) into (36) leads to (38). Therefore (12) implies (15).
Then (38) becomes the required bound. If t = c′z/√m ≤ t_0/3, we obtain the corresponding estimate, and choosing c′ sufficiently small we arrive at a bound with some constant c_3 > 0. In the same way we obtain the symmetric bound. This proves the lemma.

Proof of Theorem 3
Taking into account (25), (26), and the independence of the ε(x), we have the decomposition of the Mean Squared Error into the terms defined below. From the proof of Theorem 1, we infer the corresponding bound. By Theorem 2 and its proof, for ρ′_{x_0} defined by (25), there is a constant c_1 such that the deviation bound holds. Let B = max_{x∈U′_{x_0,h}} ρ