A fast patch-dictionary method for whole image recovery

Various algorithms have been proposed for dictionary learning. Among those for image processing, many use image patches to form dictionaries. This paper focuses on whole-image recovery from corrupted linear measurements. We address the open issue of representing an image by overlapping patches: the overlap leads to an excessive number of dictionary coefficients to determine. With very few exceptions, this issue has limited the application of image-patch methods to local tasks such as denoising, inpainting, cartoon-texture decomposition, super-resolution, and image deblurring, for which one can process a few patches at a time. Our focus is on global imaging tasks such as compressive sensing and medical image recovery, where the whole image is encoded together, making it either impossible or very ineffective to update a few patches at a time. Our strategy is to divide the sparse recovery into multiple subproblems, each handling a subset of non-overlapping patches, and then to average the results of the subproblems to yield the final recovery. This simple strategy is surprisingly effective in terms of both quality and speed. In addition, we accelerate computation of the learned dictionary by applying a recent block proximal-gradient method, which not only has a lower per-iteration complexity but also takes fewer iterations to converge than the current state of the art. We also establish that our algorithm globally converges to a stationary point. Numerical results on synthetic data demonstrate that our algorithm can recover a more faithful dictionary than two state-of-the-art methods. Combining our whole-image recovery and dictionary-learning methods, we numerically simulate image inpainting, compressive sensing recovery, and deblurring. Our recoveries are more faithful than those of a total variation method and of a method based on overlapping patches.

1. Introduction. Our general problem is to restore an image M from its corrupted linear measurements of the form b = A(M) + ξ, where A is a linear operator and ξ is some noise. Examples of such recovery include image denoising in [7,2,18]. We say a signal x ∈ R^n is sparse (or approximately sparse) under a dictionary D ∈ R^{n×K} if x = Dy (or x ≈ Dy) and y ∈ R^K has only a few nonzeros. Many types of signals can be sparsely represented by some dictionary. For example, natural images are approximately sparse under dictionaries based on various wavelet, curvelet, shearlet, and other transforms. Suppose x has a sparse representation under a dictionary D. Then, given D and linear measurements b = A(x) + ξ, one can recover x by sparsely coding x via solving

(1)  min_y ||y||_0, subject to ||A(Dy) − b||_2^2 ≤ ε,

where ||·||_0 counts the number of nonzeros of its argument and is often approximated by ||·||_1 for tractable computation, and ε ≥ 0 is a parameter corresponding to ξ. Once a solution y of (1) is obtained, the original signal x can be estimated by Dy. The dictionary D can be either predetermined or learned from a set of training data. Predetermined dictionaries, such as orthogonal or overcomplete wavelets, curvelets, and discrete cosine transforms (DCT), have the advantage of faster implementation than a learned one. Assuming easy availability of training datasets, however, it has been demonstrated (e.g., in [10,7,17]) that a learned dictionary can better adapt to natural signals and improve the recovery quality.
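For intuition, the ℓ1-relaxed, unconstrained form of (1) can be solved with a few lines of iterative soft-thresholding (ISTA). The sketch below uses a random dictionary, hypothetical dimensions, and A = I; it is illustrative only, not the solver used in this paper:

```python
import numpy as np

def ista_sparse_code(D, b, lam, n_iter=500):
    """Solve min_y 0.5*||D @ y - b||^2 + lam*||y||_1 by ISTA.

    D: dictionary (n x K), b: measurements, lam: sparsity weight.
    The linear operator A is taken to be the identity for simplicity."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    y = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ y - b)              # gradient of the smooth term
        z = y - g / L                      # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return y

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
y_true = np.zeros(50); y_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
b = D @ y_true                             # noiseless measurements of a 3-sparse y
y_hat = ista_sparse_code(D, b, lam=0.01)
print(np.linalg.norm(D @ y_hat - b))       # small residual
```

With a small λ, the recovered code reproduces the measurements closely while staying sparse, which is the behavior (1) formalizes.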
For natural images, existing methods such as MOD [8] and KSVD [1] learn a dictionary D to sparsely represent the patches of an image, rather than the whole image itself. In other words, the dictionary atoms have the same size as the image patches, for example, 6 × 6 or 8 × 8. To denoise an image M with a patch-size dictionary, the pioneering work [7] denoises each of the overlapping patches of M via sparse coding and then estimates M by averaging all the denoised patches together with the observed noisy image. This patch-based method was then extended to compressed sensing MRI (a whole-image recovery problem) in [18], which starts from a rough estimate of M, then simultaneously updates the dictionary D and the sparse coefficients of all overlapping patches, and finally averages all the recovered patches to estimate M. Dong et al. [6] use local dictionaries to sparsely represent local patches and incorporate additional local auto-regression (AR) and non-local similarity (NLS) terms to reduce overfitting and improve recovery results. Their model was demonstrated to be effective on image deblurring and super-resolution. These and their follow-up works (e.g., [9,25]) use overlapping patches, since tiling non-overlapping patches can cause visible grid artifacts along the patch boundaries.

1.2. Learn a dictionary. Due to a lack of analytic structure, it can be computationally demanding to learn a dictionary. One of the most popular algorithms for dictionary learning is KSVD [1], which addresses the model

(2)  min_{D,Y} ||X − DY||_F^2, subject to ||y_j||_0 ≤ s for all j, and ||d_i||_2 = 1 for all i,

where X ∈ R^{n×p} is the training dataset, y_j is the jth column of Y, ||·||_2 denotes the Euclidean norm, s is a parameter controlling sparsity, and d_i is the ith column of D.
KSVD attempts to solve (2) by alternately updating Y and D in a certain way. The objective is monotonically non-increasing, and the denoising and inpainting performance is very good, but convergence to a stationary point is not guaranteed. Furthermore, the method is slow, since each iteration performs an SVD to update D and an exact minimization to update every y_j.
Another popular method is online dictionary learning (OLM) [14], which, via an online update approach, attempts to solve

(3)  min_{D∈D, Y} (1/2)||X − DY||_F^2 + λ||Y||_1,

where ||Y||_1 = Σ_{i,j} |y_ij| is a convex relaxation of ||·||_0, D is the constraint set of dictionaries with atoms in the unit ball, and λ is a tuning parameter balancing data fitting and sparsity. OLM alternately updates Y and D as follows. When D is fixed, it randomly picks a batch of columns of X and applies sparse coding to each selected column. Letting S be the index set of all previously selected samples and Y_S contain their sparse coefficients, the method then updates D by minimizing (1/2)||X_S − DY_S||_F^2 over D ∈ D, where X_S denotes the submatrix consisting of the columns of X indexed by S. The two steps are repeated until convergence. The algorithm often runs faster than KSVD, and its efficiency relies on the assumption that all training samples follow the same distribution. Assuming that the training data admits a bounded probability density with compact support and that Y_S Y_S^⊤ is uniformly positive definite, it is shown that the iterate sequence asymptotically satisfies the first-order optimality conditions of (3). The global convergence of the iterate sequence remains open.
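The dictionary update of OLM admits a compact block-coordinate sketch. The code below follows the update of Mairal et al. as we understand it, driven by the accumulated statistics A = Y_S Y_S^⊤ and B = X_S Y_S^⊤; all dimensions and data are hypothetical:

```python
import numpy as np

def odl_dict_update(D, A, B, n_passes=1):
    """One block-coordinate pass of the online dictionary-learning update:
    minimize f(D) = 0.5*trace(D.T @ D @ A) - trace(D.T @ B) over columns
    in the unit ball, where A = Y_S @ Y_S.T and B = X_S @ Y_S.T accumulate
    statistics of previously seen samples."""
    for _ in range(n_passes):
        for j in range(D.shape[1]):
            if A[j, j] < 1e-12:
                continue                      # atom never used; leave unchanged
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(1.0, np.linalg.norm(u))  # project to unit ball
    return D

rng = np.random.default_rng(0)
n, K, p = 16, 32, 200
Y = rng.standard_normal((K, p)) * (rng.random((K, p)) < 0.2)   # sparse codes
X = rng.standard_normal((n, K)) @ Y * 0.1 + 0.01 * rng.standard_normal((n, p))
A_stat, B_stat = Y @ Y.T, X @ Y.T
D0 = rng.standard_normal((n, K)); D0 /= np.linalg.norm(D0, axis=0)
f = lambda D: 0.5 * np.trace(D.T @ D @ A_stat) - np.trace(D.T @ B_stat)
D1 = odl_dict_update(D0.copy(), A_stat, B_stat)
assert f(D1) <= f(D0) + 1e-9   # each exact column update cannot increase f
```

Each column update is an exact minimization of a quadratic over the unit ball, so the surrogate objective is non-increasing, which is the property OLM's analysis builds on.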
We refer the interested reader to the review paper [20] for other dictionary learning methods. In addition, more complicated models have been proposed to learn dictionaries for specific tasks; see, e.g., [15,13]. We do not consider those models and keep our focus on (3) in this paper.

1.3. Contributions. This paper makes the following contributions:
• We propose a simple, novel method that recovers a whole image by applying sparse coding to its patches. In addition to the traditional denoising, inpainting, and deblurring tasks, the method can recover an image from whole-image linear measurements, which arise in compressive sensing and medical imaging. The method is simple, can include additional energy terms and constraints, and can be embedded in more complicated imaging applications. We emphasize that our method recovers the whole image at once and thus differs from local recovery methods such as those in [7,14], which process image patches one by one.
• Along with the method, we introduce a numerical algorithm for dictionary learning that is fast and provably converges to a stationary point. The algorithm is based on our recent work on block proximal gradient updates in [22]. Compared to existing algorithms, it has a low per-iteration cost and converges quickly.
• We provide Matlab codes for three different imaging tasks: (i) inpainting: fill in missing pixels of an image; (ii) compressive sensing recovery: recover an image from undersampled linear measurements; (iii) image deblurring: restore a clean image from its blurry observation. On these tasks, our codes compare favorably to total variation (TV) methods, as well as to the methods from [14,6] that use overlapping patches and learned dictionaries.
1.4. Organization. The rest of the paper is organized as follows. In section 2, we give a new model for recovering an image from its linear measurements and discuss how to improve the recovery results. Section 3 applies a block proximal gradient method to (3) to obtain a new dictionary learning algorithm. Numerical results are reported in section 4, and section 5 concludes the paper.
2. Problem formulation. Given a patch-size dictionary D, we aim to recover an image M from its corrupted linear measurements b = A(M) + ξ, where A is a linear operator and ξ is some noise. The case A = I has been considered in the pioneering work [7], which alternately performs sparse coding to denoise every patch and then averages all overlapping denoised patches together with the observed noisy image. Throughout the rest of the paper, we assume that a generic image has size N_1 × N_2 and that patches have size n_1 × n_2. The dictionary D has K atoms, all of which are vectors of dimension n_1 n_2. Keep in mind that an m × n matrix is equivalent to an mn-vector under Matlab's reshape operation; hence, we use a matrix and its reshaped vector interchangeably. For example, a dictionary atom can be regarded as either a vector of length n_1 n_2 or an n_1 × n_2 patch.
2.1. Our model. Motivated by [7], we exactly represent an image by

(4)  M = T_P^{-1} ( Σ_{(i,j)∈P} R_ij^*(R_ij(M)) ),  with  T_P = Σ_{(i,j)∈P} R_ij^* R_ij,

where R_ij is the operator taking the (i, j)-th patch, R_ij^* is the adjoint of R_ij, and P contains a subset of patches covering all the pixels of M, ensuring that T_P is invertible. Note that T_P is diagonal (each diagonal entry counts how many patches in P cover the corresponding pixel), and thus its inverse can be applied pixel by pixel. If every patch R_ij(M) in P has a sparse representation under D, i.e., R_ij(M) = Dy_ij for a sparse vector y_ij, then the above representation becomes M = T_P^{-1}(Σ_{(i,j)∈P} R_ij^*(Dy_ij)). Using this representation, we pose the following weighted ℓ1 model:

(5)  min_{{y_ij}} Σ_{(i,j)∈P} ||w_ij ⊙ y_ij||_1, subject to ||A(T_P^{-1} Σ_{(i,j)∈P} R_ij^*(Dy_ij)) − b||_2 ≤ σ,

where w_ij ≥ 0 is a weight vector for (i, j) ∈ P, σ is the noise level determined by ξ, and "⊙" denotes the component-wise product. Equivalently, one can consider the unconstrained model:

(6)  min_{{y_ij}} Σ_{(i,j)∈P} ||w_ij ⊙ y_ij||_1 + (1/(2ν)) ||A(T_P^{-1} Σ_{(i,j)∈P} R_ij^*(Dy_ij)) − b||_2^2,

where ν is a parameter corresponding to σ. Upon solving (5) or (6), one can estimate the image by M = T_P^{-1}(Σ_{(i,j)∈P} R_ij^*(Dy_ij)). Our models are similar to that in [6], which can be written in the form

(7)  min_{M,{y_ij}} (1/(2ν)) ||A(M) − b||_2^2 + Σ_{(i,j)∈S} ||y_ij||_1 + AR(M) + NLS(M), subject to R_ij(M) = D_{k_ij} y_ij for (i, j) ∈ S,

where S denotes the set of all overlapping patches, ν is a parameter balancing sparsity and data fitting, D_{k_ij} is a given local dictionary used to represent the (i, j)-th patch, and AR(·) and NLS(·) are two regularization terms corresponding to local auto-regression and non-local similarity. The local dictionaries are often undercomplete (i.e., with fewer columns than rows). Similar to using non-overlapping patches (see the next paragraph), the undercompleteness of the local dictionaries and the AR and NLS terms reduce the degrees of freedom and increase the recoverability of (7). However, the use of more dictionaries and complicated regularization terms makes (7) more difficult to solve than our models.
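To make the operators concrete, the following NumPy sketch (sizes hypothetical) implements R_ij and its adjoint and verifies that T_P = Σ R_ij^* R_ij is the identity when P is a non-overlapping covering partition:

```python
import numpy as np

def R(M, i, j, n1, n2):
    """R_ij: extract the patch whose upper-left pixel is (i, j)."""
    return M[i:i+n1, j:j+n2]

def R_adj(x, i, j, N1, N2):
    """R_ij^*: embed a patch into an N1 x N2 zero image at position (i, j)."""
    Z = np.zeros((N1, N2))
    n1, n2 = x.shape
    Z[i:i+n1, j:j+n2] = x
    return Z

# For a non-overlapping covering partition P, T_P = sum R_ij^* R_ij = I:
N1 = N2 = 16; n1 = n2 = 8
P = [(i, j) for i in range(0, N1, n1) for j in range(0, N2, n2)]
M = np.random.rand(N1, N2)
T_P_M = sum(R_adj(R(M, i, j, n1, n2), i, j, N1, N2) for (i, j) in P)
assert np.allclose(T_P_M, M)   # T_P is the identity here
```

With overlapping patches, the same sum would instead count how many patches cover each pixel, which is exactly the diagonal of T_P.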
Choice of P. One question is how to choose P, the subset of covering patches, such that (5) or (6) works well for recovering M. We let P be a subset of non-overlapping, covering patches and focus on the unconstrained model (6). Figure 1 compares the two approaches. In this test, we set A = I and b = M + 0.05ξ with ξ ∼ N(0, I), and we compared (6) with two different P's. In Figure 1, the left image uses all overlapping patches, and the right image uses one subset of non-overlapping, covering patches. We see that (6) with all patches produces a much worse result than with the non-overlapping P. We emphasize that our results do not counter the intuition that using more patches should give better recovery: the results in Table 2 of section 4 demonstrate that using more distinct subsets of non-overlapping, covering patches consistently improves the recovered image quality. The phenomenon in Figure 1 can be explained as follows. Using all the overlapping patches in (5) or (6) introduces too many unknowns. ℓ1 minimization typically needs O(s log(n/s)) or more measurements to recover an s-sparse signal of length n. Suppose that the y_ij corresponding to each patch has at least r nonzeros and that all (N_1 − n_1 + 1)(N_2 − n_2 + 1) overlapping patches are used. Then the vector concatenating all the y_ij's has at least r(N_1 − n_1 + 1)(N_2 − n_2 + 1) ≈ rN_1 N_2 nonzeros. On the other hand, we have at most N_1 N_2 measurements, far fewer than the required O(s log(n/s)) = O(rN_1 N_2 log(K/r)). Therefore, unless more constraints or regularization terms on y are introduced to help (see (7) for instance), we cannot use all the patches.
Since the image may not be evenly divisible and the selected patches need to cover all the pixels of the image, we allow the patches to have different sizes. Slightly abusing notation, we still use P to denote the set of selected patches, but P may also contain some smaller patches near the boundary. Although we could partition the image arbitrarily into blocks no larger than n_1 × n_2, for simplicity we make the following assumption.

Assumption 1.
• Interior patches (e.g., patch "A" in Figure 2) have size n_1 × n_2;
• Left and right boundary patches have n_1 rows, and lower and upper boundary patches have n_2 columns; patch "B" in Figure 2 is an example;
• Corner patches (e.g., patch "C" in Figure 2) can have fewer than n_1 rows and fewer than n_2 columns;
• All patches are vertically and horizontally aligned.
Remark 2. Under Assumption 1, the way an image is partitioned into patches is uniquely determined by the size of the upper-left corner patch.
Figure 2 illustrates how we partition a 100 × 100 image into non-overlapping patches in three different ways. Every patch is no greater than 8 × 8, and all interior patches are 8 × 8. However, since the image cannot be evenly partitioned, the patches near the boundary of the image may be smaller than 8 × 8. For example, in partition 1, all the right boundary patches are 8 × 4, and the upper-right corner patch is 4 × 4; in partition 3, all the left and right boundary patches are 8 × 2, and the lower-left and lower-right corner patches are 4 × 2.
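A minimal sketch of such a partition, assuming the block sizes along each dimension are determined by the first (corner) block as in Remark 2:

```python
import numpy as np

def partition_sizes(N, n, first):
    """Block sizes along one dimension: a first block of size `first`,
    interior blocks of size n, and a possibly smaller last block."""
    assert 1 <= first <= n
    sizes = [first]
    covered = first
    while covered < N:
        sizes.append(min(n, N - covered))
        covered += sizes[-1]
    return sizes

# A 100 x 100 image, patches at most 8 x 8, with a 4 x 4 upper-left
# corner patch (one of the partitions allowed under Assumption 1):
rows = partition_sizes(100, 8, first=4)
cols = partition_sizes(100, 8, first=4)
print(rows)   # first block 4, then twelve blocks of 8 (sums to 100)
assert sum(rows) == 100 and sum(cols) == 100
```

The Cartesian product of the row and column sizes yields the full patch grid, with corner and boundary patches automatically smaller than 8 × 8 where the image does not divide evenly.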

Definition of operators.
As P consists of non-overlapping, covering patches, every pixel is contained in exactly one patch, and it is easy to verify that T_P = I. If (i, j) ∈ P is an interior patch, then R_ij(M) takes the (i, j)-th patch of M, and R_ij^*(x) first generates an N_1 × N_2 zero matrix and then adds x to its (i, j)-th patch. However, when (i, j) ∈ P is a boundary or corner patch whose size is smaller than n_1 × n_2, the corresponding operators act accordingly on the smaller region; patch "C" in Figure 2 is an example, whose operators act only on its 4 × 4 corner region.

Averaging scheme. As shown in [7], tiling non-overlapping patches to perform image denoising yields visible artifacts on block boundaries, and we also observed this when solving (6) once with non-overlapping patches. Although using all patches in (6) at once does not give good recovery, we still want to use them in some way. Note that we have the freedom to choose P in (6), so we can solve it for different P's. For example, if M ∈ R^{100×100} and the dictionary atoms are 8 × 8, we can partition the image into non-overlapping patches in the three different ways shown in Figure 2 and solve (6) for each partition. It turns out that averaging the recovered images from different P's removes the artifacts occurring on block boundaries and improves the PSNR value; see the numerical results in section 4. Algorithm 1 summarizes our method. Note that (6) can be solved for the different P's in parallel.

Algorithm 1:
Data: dictionary D, patch size (n_1, n_2), image size (N_1, N_2), measurements b, linear operator A, and parameter ν.
1. Choose t different ways to partition the image into non-overlapping patches; denote them by P_1, ..., P_t.
2. For k = 1, ..., t, solve (6) for P_k and let the recovered image be M_k.
3. Average the recovered images by M = (1/t) Σ_{k=1}^t M_k and output M.
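The benefit of step 3 can already be seen in a toy model, where each per-partition recovery is replaced by the ground truth plus independent residual error (a stand-in for actually solving (6); all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((32, 32))   # ground-truth "image"

# Stand-in for "solve (6) for partition P_k": each per-partition recovery
# is modeled as the truth plus an independent residual error field.
recoveries = [M + 0.1 * rng.standard_normal(M.shape) for _ in range(5)]

def mse(X):
    return np.mean((X - M) ** 2)

M_avg = np.mean(recoveries, axis=0)
# Averaging beats even the best single recovery when the residuals
# are (roughly) independent across partitions:
assert mse(M_avg) < min(mse(Mk) for Mk in recoveries)
```

In the real algorithm the per-partition errors are concentrated on different block boundaries, so they are far from identical, and averaging suppresses them in the same way.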

2.2. Adaptive dictionary update. After obtaining an estimated image M̂ by Algorithm 1, we can update the dictionary D using patches extracted from M̂. Since M̂ is close to the original image M, the dictionary updated from M̂ should better represent the patches of M. Hence, it is possible to further improve the result using the adaptively updated dictionary, and this process can be repeated several times. Algorithm 2 summarizes our adaptive method.
We observe that only the first adaptive update gives significant improvement, and subsequent ones make only minor changes to the dictionary and thus little improvement to the recovered image.For this reason, in the numerical experiments, we will update the dictionary only once.

Algorithm 2:
Data: dictionary D, patch size (n_1, n_2), image size (N_1, N_2), measurements b, linear operator A, and parameter ν.
repeat
  Run Algorithm 1 and let the recovered image be M̂.
  Update the dictionary D from patches extracted from M̂.
until convergence

3. Block proximal gradient method for dictionary learning. Both Algorithms 1 and 2 require an initial dictionary D, which can be an analytic dictionary, such as orthogonal or overcomplete wavelets, curvelets, or DCT, or a learned one. For our purpose, a learned dictionary is preferable since it can better adapt to natural images [10,7,17]. To learn a dictionary, one can apply any available solver, such as MOD, KSVD, or OLM. We choose a new dictionary learning method, which applies the block proximal gradient (BPG) method proposed in [22] to (3). Compared to some state-of-the-art methods, the new algorithm is often faster and produces more faithful dictionaries. Although (3) is non-convex jointly in D and Y, it is convex in each of them while the other is fixed. With this bi-convexity property, the BPG method is shown to generate a sequence globally converging to a stationary point of (3).
3.1. Block proximal gradient method. Recently, [22] characterized a class of multi-convex problems and proposed a BPG method for solving them. For simplicity and our purpose, we review the method only for bi-convex problems like (3). Consider

(8)  min_{x,y} F(x, y) = f(x, y) + r_x(x) + r_y(y),

where f is differentiable and convex with respect to either x or y when the other is fixed, and r_x, r_y are extended-valued convex functions. At the k-th iteration of BPG, x and y are updated alternately by

(9)  x^k = argmin_x ⟨∇_x f(x̂^k, y^{k−1}), x − x̂^k⟩ + (L_x^k/2)||x − x̂^k||^2 + r_x(x),
     y^k = argmin_y ⟨∇_y f(x^k, ŷ^k), y − ŷ^k⟩ + (L_y^k/2)||y − ŷ^k||^2 + r_y(y),

where x̂^k = x^{k−1} + ω_x^k (x^{k−1} − x^{k−2}) denotes an extrapolated point with weight ω_x^k ≥ 0, and L_y^k and ŷ^k have the same meanings for y.
BPG is a variant of the block coordinate minimization (BCM) method (see [21] and the references therein), which updates x and y cyclically by minimizing the objective with respect to one block of variables at a time while the other is fixed at its most recent value. Although BCM decreases the objective faster per iteration, its subproblems are usually much more difficult than those in (9). For simple r_x and r_y, the updates in (9) have closed-form solutions.
Under some boundedness assumptions, [22] establishes subsequence convergence of the BPG method. Further assuming the so-called Kurdyka-Łojasiewicz (KL) property (see [12,5] for example), it shows that the sequence {(x^k, y^k)} generated by (9) globally converges to a stationary point of (8).
3.2. Dictionary learning. We learn a dictionary from a training dataset X by solving (3). Let ℓ(D, Y) = (1/2)||X − DY||_F^2 be the fidelity term in (3). Applying (9) to (3), we alternately update D and Y by

(10a)  D^k = argmin_{D∈D} ⟨∇_D ℓ(D̂^k, Y^{k−1}), D − D̂^k⟩ + (L_d^k/2)||D − D̂^k||_F^2,
(10b)  Y^k = argmin_Y ⟨∇_Y ℓ(D^k, Ŷ^k), Y − Ŷ^k⟩ + (L_y^k/2)||Y − Ŷ^k||_F^2 + λ||Y||_1,

where D̂^k, Ŷ^k are extrapolated points and L_d^k, L_y^k are Lipschitz constants, as in (9). The updates in (10) can be written explicitly as

(11a)  D^k = P_D( D̂^k − ∇_D ℓ(D̂^k, Y^{k−1}) / L_d^k ),
(11b)  Y^k = S_{λ/L_y^k}( Ŷ^k − ∇_Y ℓ(D^k, Ŷ^k) / L_y^k ),

where in (11a), P_D(·) denotes the Euclidean projection onto D, which projects every column of its argument onto the unit ball, i.e., d_i ← d_i / max(1, ||d_i||_2), and in (11b), S_τ(·) denotes the soft-thresholding operator defined component-wise for any Y by S_τ(y_ij) = sign(y_ij) max(|y_ij| − τ, 0). Since ∇_D ℓ(D, Y) = (DY − X)Y^⊤, we have, for any D_1 and D_2,

||∇_D ℓ(D_1, Y) − ∇_D ℓ(D_2, Y)||_F ≤ ||YY^⊤|| ||D_1 − D_2||_F,

where ||A|| denotes the matrix operator norm of A. Hence, ||YY^⊤|| is a Lipschitz constant of ∇_D ℓ(D, Y) with respect to D. Throughout our numerical tests, we take

(12)  L_d^k = ||Y^{k−1}(Y^{k−1})^⊤||,  L_y^k = ||(D^k)^⊤ D^k||.

The extrapolation weights are taken as

(13)  ω_d^k = min( ω_k, δ_ω √(L_d^{k−1}/L_d^k) ),  ω_y^k = min( ω_k, δ_ω √(L_y^{k−1}/L_y^k) ),

where δ_ω < 1 and ω_k = (t_{k−1} − 1)/t_k with t_0 = 1 and t_k = (1 + √(1 + 4t_{k−1}^2))/2. The weight ω_k has been used in FISTA [3], where this kind of extrapolation significantly accelerates the proximal gradient method for convex composite problems. We observe that the extrapolation with the weights in (13) can also greatly speed up the BPG method for solving (3).
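The updates (11a)-(11b) with extrapolation can be sketched as follows. This is an illustrative reimplementation under our reading of the scheme, omitting Algorithm 3's re-do step and using hypothetical data; it is not the paper's released code:

```python
import numpy as np

def bpg_dict_learn(X, K, lam, n_iter=200, seed=0):
    """Sketch of BPG for min 0.5*||X - D Y||_F^2 + lam*||Y||_1 with
    unit-ball atoms: alternating projected / soft-thresholded gradient
    steps from extrapolated points, FISTA-type weights capped as in (13)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    D = rng.standard_normal((n, K)); D /= np.linalg.norm(D, axis=0)
    Y = np.zeros((K, p))
    D_prev, Y_prev, t_prev = D.copy(), Y.copy(), 1.0
    Ld_prev = Ly_prev = None
    for _ in range(n_iter):
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev**2)) / 2.0
        w = (t_prev - 1.0) / t                         # FISTA weight
        # D-step: projected gradient from the extrapolated point
        Ld = max(np.linalg.norm(Y @ Y.T, 2), 1e-8)     # ||Y Y^T||, cf. (12)
        wd = w if Ld_prev is None else min(w, 0.9999 * np.sqrt(Ld_prev / Ld))
        Dh = D + wd * (D - D_prev)
        Dn = Dh - ((Dh @ Y - X) @ Y.T) / Ld
        Dn /= np.maximum(np.linalg.norm(Dn, axis=0), 1.0)  # project, cf. P_D
        # Y-step: soft-thresholded gradient from the extrapolated point
        Ly = max(np.linalg.norm(Dn.T @ Dn, 2), 1e-8)   # ||D^T D||, cf. (12)
        wy = w if Ly_prev is None else min(w, 0.9999 * np.sqrt(Ly_prev / Ly))
        Yh = Y + wy * (Y - Y_prev)
        Z = Yh - (Dn.T @ (Dn @ Yh - X)) / Ly
        Yn = np.sign(Z) * np.maximum(np.abs(Z) - lam / Ly, 0.0)  # S_{lam/Ly}
        D_prev, Y_prev, t_prev = D, Y, t
        D, Y = Dn, Yn
        Ld_prev, Ly_prev = Ld, Ly
    return D, Y

rng = np.random.default_rng(1)
n, K, p = 16, 24, 300
D_true = rng.standard_normal((n, K)); D_true /= np.linalg.norm(D_true, axis=0)
Y_true = rng.standard_normal((K, p)) * (rng.random((K, p)) < 0.15)
X = D_true @ Y_true
D, Y = bpg_dict_learn(X, K, lam=0.1)
obj = 0.5 * np.linalg.norm(X - D @ Y)**2 + 0.1 * np.abs(Y).sum()
assert obj < 0.45 * np.linalg.norm(X)**2   # far below the Y = 0 objective
```

Both per-block updates are closed-form, which is the source of the low per-iteration cost discussed in Remark 3.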
To make the whole objective non-increasing, we redo the k-th iteration without extrapolation (i.e., with D̂^k = D^{k−1} and Ŷ^k = Y^{k−1}) whenever the objective F(D, Y) of (3) increases. As shown in [22], the re-done iteration is guaranteed not to increase the objective. The non-increasing property is not only required for global convergence but also important for the algorithm to perform stably and converge rapidly. The pseudocode of our method is shown in Algorithm 3.

4. Numerical results.

4.1. Dictionary recovery. In (2) we set s = r, i.e., the true sparsity level was assumed, and in (3) we set λ = 0.5/√n. Algorithm 3 was terminated once its stopping criterion was met in three consecutive iterations or after 1000 iterations. KSVD was run for 200 iterations, and OLM was run for the same amount of time as Algorithm 3. All other parameters of KSVD and OLM were set to their default values. We fixed n = 36 and tested three different pairs of (K, p). For each pair of (K, p), the sparsity level r varied over {4, 6, 8, 10, 12}. The recovery of an atom d of the original dictionary D was regarded as successful if d was sufficiently close to some column d̂_i of the estimated dictionary D̂. The average running time and recovery rates over 50 independent runs are shown in Table 1. From the table, we see that our method used much less time than KSVD with comparable recovery rates. When the sparsity level r is large (e.g., r = 12) or the training samples are not so many (e.g., p = 20n), our method achieved much higher recovery rates than KSVD. For the first two pairs of (K, p), OLM tends to give lower rates than our method, possibly because our method converges fast while OLM does not. However, in the case (K, p) = (4n, 100n), we note that OLM can give results similar to ours if it is allowed to run for a very long time.
4.2. Whole image recovery. This section tests the performance of Algorithms 1 and 2 on image recovery. Two different dictionaries were compared for Algorithm 1.
One was an overcomplete DCT, generated in the same way as in [1]. The other was learned from 20,000 8 × 8 grayscale patches, obtained by randomly extracting 100 patches from each of the 200 images in the training set of the Berkeley segmentation dataset [16]. For the learned dictionary, we first subtracted from each training patch its mean and then trained a dictionary D from the zero-mean patches by solving (3) with K = 256 via Algorithm 3, where we chose λ = 0.8/√n to make the average number of nonzeros per column of Y about 8. Finally, we let D̄ = [e, D] ∈ R^{64×257} and used D̄ in our tests, where e is the vector of all ones. Such an atom with constant components is called a DC in [1], which shows that the augmented dictionary D̄ performs better than D for real-world image processing tasks. Here, we mention that for an image patch x, if x − mean(x) has a sparse representation under D, i.e., x − mean(x) = Dy with sparse y, then x = mean(x)e + Dy, which means x is sparse under D̄. Therefore, the above processing is reasonable. The overcomplete DCT we used is also 64 × 257, and its first column is a DC. For Algorithm 2, we used the above D̄ as the initial dictionary and updated the dictionary only once by learning a new one via Algorithm 3 from patches of the first-step estimated image, which is exactly the output of Algorithm 1 using D̄. We then used the updated dictionary to perform image recovery once more to obtain the final result.
Implementation. In (6), we took b = A(M) + σ̄ξ, where ξ ∼ N(0, I) is Gaussian noise and σ̄ = σ||A(M)||_2/||ξ||_2, with σ varying over {0.01, 0.05, 0.10}. We took ν = σ̄ for the first two kinds of A and ν = 0.1σ̄ for the third kind; the definitions of the different A's are given in the next paragraph. In addition, we set all elements of w_ij to one except its first component, which was set to zero. Under this setting, using any DC as the first atom of D makes no difference to the solution of (6). Then, (6) was solved by YALL1 (version 1.4) [24], with a Gaussian random starting point and a stopping tolerance of 10^{-4}. All other parameters of YALL1 were set to their default values. We chose YALL1 for its high efficiency in solving (6) and because it can be called easily by providing the operations of A and its adjoint A^*.
Three different kinds of A were tested. The first performed image inpainting and used the sampling operator P_Ω, which keeps all pixels of its argument in Ω and zeros out all others; the adjoint of P_Ω fills the locations in Ω with its argument and all other locations with zero. The second performed compressed image recovery and took A as the composition of P_Ω and a two-dimensional complex-valued circulant operator C_2, i.e., A = P_Ω ∘ C_2. We applied the same normalization to A as in [23], which uses such A for testing learned circulant sensing operators. Applying C_2 to a matrix M can be realized by one fast Fourier transform (FFT), one inverse FFT, and some component-wise multiplications, and the adjoint of C_2 by one FFT, one inverse FFT, and component-wise multiplications with the conjugated coefficients. The third kind of A was a blurring operator with a 9 × 9 kernel. We used two different kernels, generated by Matlab's commands fspecial('average',[9,9]) and fspecial('motion',10,45), respectively. A blurring operator can also be realized by one FFT, one inverse FFT, and some component-wise products. Hence, all three kinds of A are easy to realize in algorithms and in hardware.
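The first two operators, together with the adjoint tests one would run before handing A and A^* to a solver such as YALL1, can be sketched as follows (small hypothetical sizes; the blur/circulant kernel placement is simplified rather than centered as fspecial would produce):

```python
import numpy as np

rng = np.random.default_rng(2)
N1 = N2 = 16
mask = rng.random((N1, N2)) < 0.3            # sampling set Omega, SR = 30%
kernel = np.zeros((N1, N2)); kernel[:3, :3] = 1 / 9.0   # toy kernel

def A_inpaint(M):                            # P_Omega
    return mask * M

def At_inpaint(B):                           # adjoint of P_Omega
    return mask * B

def C2(M):      # 2-D circular convolution via FFT (circulant operator)
    return np.fft.ifft2(np.fft.fft2(kernel) * np.fft.fft2(M))

def C2t(B):     # its adjoint: conjugate the transfer function
    return np.fft.ifft2(np.conj(np.fft.fft2(kernel)) * np.fft.fft2(B))

# Adjoint test: <A(M), B> == <M, A*(B)> for random M, B.
M = rng.standard_normal((N1, N2)); B = rng.standard_normal((N1, N2))
assert np.isclose(np.vdot(A_inpaint(M), B), np.vdot(M, At_inpaint(B)))
assert np.isclose(np.vdot(C2(M), B), np.vdot(M, C2t(B)))
```

Each application of C_2 or its adjoint costs one FFT and one inverse FFT plus a component-wise product, matching the operation counts stated above.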
Our method processes a subset of non-overlapping, covering patches together, and thus it recovers the whole image at once. For A = P_Ω, one could denoise all overlapping patches independently, since every measurement involves only a single pixel; however, that approach would require noise information for each patch, whereas our method requires it only for the whole image. For the other two A's, every measurement mixes several or even all image pixels, so one cannot process different patches independently.
Results. First, let us see how the averaging scheme in Algorithm 1 improves the recovery performance. We tested it on the grayscale versions of the Castle and Lena images shown in Figure 3; both images are unrelated to the training samples. We chose five different partitions, whose upper-left corner patches were 8 × 8, 8 × 4, 4 × 8, 8 × 2, and 2 × 8, respectively. (Recall that each partition is uniquely determined by its upper-left corner patch under Assumption 1.) For each partition, we solved (6) to obtain a recovered image; denote the recovered images by M_1, ..., M_5. We compared the PSNR values of the running average M_av^j = (1/j) Σ_{i=1}^j M_i and of the single M_i with the greatest PSNR among the five, denoted M_best. Table 2 lists the average results of five independent runs for four different A's and noise level σ = 1%. For the first two A's, we took 30% uniformly random pixels, i.e., SR := |Ω|/(N_1 N_2) = 30%. From the results, we see that the averaging scheme consistently improves the recovery performance. Note that there are at most n_1 n_2 different partitions under Assumption 1. We observed that the more distinct partitions we used, the better the result of the averaging scheme; however, the rate of improvement drops as the number of partitions increases, as shown in Table 2. For this reason, we use only three different partitions in the remaining experiments.
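For reference, PSNR as used in these comparisons can be computed as follows (the peak value is an assumption that depends on the image scale):

```python
import numpy as np

def psnr(M_hat, M, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak = 255 for 8-bit images)."""
    mse = np.mean((np.asarray(M_hat, float) - np.asarray(M, float)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

M = np.zeros((8, 8)); M_hat = M + 5.0   # uniform error of 5 gray levels
print(round(psnr(M_hat, M), 2))          # 34.15 dB
```

Averaging j recoveries with roughly independent errors reduces the MSE by about a factor of j, which translates into the PSNR gains of M_av^j over the individual M_i reported in Table 2.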
Next, we compare Algorithm 1 with the two different dictionaries and Algorithm 2 on the four images shown in Figure 3, all of which are unrelated to the learned dictionary D. To show the effectiveness of (6), we also included in the comparison a TV-based method for the first two A's and an overlapping patch-based method for the third kind of A. The TV-based method solves

(15)  min_M ||M||_TV, subject to ||A(M) − b||_2 ≤ σ̄,

where ||·||_TV denotes the TV semi-norm, and the overlapping patch-based method solves (7). We employed TVAL3 (version beta2.4) [11] to solve (15) with its default settings. The model (7) was solved by the algorithm in [6], whose code is available online from the authors' webpage. We set its maximum number of iterations to 10^4, which was sufficiently large for the algorithm to solve (7) to high accuracy. In their code, the second group of local dictionaries was used, and we tuned the parameters par.tau and par.c1 while keeping all other parameters at their default values. For color images, each of the RGB channels was recovered independently.
For A = P_Ω, we tested SR = 30% and 50%, and for A = P_Ω ∘ C_2, we tested SR = 10%, 20%, and 30%. For each tested image, we chose three different partitions, whose upper-left corner patches were 8 × 8, 8 × 4, and 4 × 8, respectively; the same three partitions were used in both Algorithms 1 and 2. Table 3 lists the average results of five independent trials of the compared methods for A = P_Ω, Table 4 for A = P_Ω ∘ C_2, and Table 5 for image deblurring. From the results, we see that Algorithm 1 works better with the learned D than with the DCT, except for the Castle image when A is the average blurring operator and σ = 10%. Our method with the learned D is consistently better than the TV-based model (15) for A = P_Ω and much better for A = P_Ω ∘ C_2. For both blurring operators, our method is better than the algorithm of [6] for solving (7), except at noise level σ = 10%, where the latter performs better on the Boat image for average blurring and on the Castle image for both average and motion blurring. In addition, Algorithm 2 with the adaptively updated dictionary improves over Algorithm 1 in all cases except for the Castle and Boat images when A is the average blurring operator and σ = 10%. The improvement usually increases with SR, which is reasonable since higher SRs give cleaner images, which in turn generate better dictionaries.
We provide open-source codes on our websites and welcome the interested reader to try them on more datasets.

5. Conclusions. Dictionary learning has been widely applied to image denoising, super-resolution, classification, and feature extraction. Various algorithms have been proposed for learning dictionaries to achieve different goals. In this paper,

Figure 1. Image denoising comparison of two methods: (left) solving (6) with all patches used at once; (right) solving (6) with one subset of non-overlapping, covering patches. In (6), ν = 0.05, and D was learned according to section 4.2.

Figure 2. Three different ways of partitioning a 100 × 100 image into non-overlapping patches (partition 1, partition 2, partition 3), where each patch is no greater than 8 × 8 and all interior patches have size 8 × 8.
For patch "C" in Figure 2, we define:
• R_ij(M): first generate an 8 × 8 zero matrix, and then replace its upper-right 4 × 4 corner submatrix with the upper-right 4 × 4 corner patch of M;
• R_ij^*(x): first generate a 100 × 100 zero matrix, and then replace the upper-right 4 × 4 corner patch corresponding to "C" with the upper-right 4 × 4 corner submatrix of x.
Here D̂^k and Ŷ^k denote extrapolated points with weights ω_d^k, ω_y^k ≤ 1, and L_d^k and L_y^k are taken as Lipschitz constants of ∇_D ℓ(D, Y^{k−1}) and ∇_Y ℓ(D^k, Y) with respect to D and Y, respectively.

Remark 3. Our algorithm uses proximal updates for both D and Y. It differs from methods such as KSVD and OLM, which perform exact minimization to update D and/or Y. Maintaining closed-form solutions for both the D- and Y-subproblems gives the algorithm a low per-iteration complexity, and the extrapolation technique lets it reach a faithful solution in a small number of iterations.

Figure 3. Four tested images. From left to right: Castle, Lena, Plane, Boat.

Table 1. Average running time and recovery rates of 50 independent runs by Algorithm 3, KSVD, and the online learning method (OLM).