An efficient dictionary learning algorithm for sparse representation

Leyuan Fang¹ and Shutao Li¹
¹College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China
E-mail: [email protected], [email protected]

Abstract: Sparse and redundant representation of data assumes an ability to describe signals as linear combinations of a few atoms from a dictionary. If the model of the signal is unknown, the dictionary can be learned from a set of training signals. Like the K-SVD, many practical dictionary learning algorithms are composed of two main parts: sparse coding and dictionary update. This paper first proposes a Stagewise least angle regression (St-LARS) method for performing the sparse coding operation. The St-LARS applies a hard-thresholding strategy to the original least angle regression (LARS) algorithm, which enables it to select many atoms at each iteration and thus yields fast solutions while still providing good results. Then, a dictionary update method named approximated singular value decomposition (ASVD) is used in the dictionary update stage. It is a quick approximation of the exact SVD computation and reduces its complexity. Experiments on both synthetic data and 3-D image denoising demonstrate the advantages of the proposed algorithm over other dictionary learning methods, not only in terms of the quality of the trained dictionary but also in terms of computation time.

Key Words: Dictionary learning, sparse representation, least angle regression, hard thresholding

1 INTRODUCTION

Sparse and redundant representation of a signal y ∈ ℝ^N over a dictionary D ∈ ℝ^{N×K} (with K columns denoted as atoms) means that one can find a linear combination of a few atoms from D that is "close" to the signal y. Previous works [1-4] have shown that modeling a signal with such a sparse decomposition is very effective in many signal and image processing applications. A fundamental consideration in employing the above sparse model is the choice of the dictionary D. The majority of works on this topic fall into two main categories: analytical and learning-based. The analytical approaches construct the dictionary based on various types of wavelets [5] and their variants [6-7]. The learning approaches use machine learning techniques to infer the dictionary from a set of training examples [8-10]. Their advantages are the finer dictionaries they produce compared to analytical constructions, and the significantly better performance in applications. However, the complexity constraints of these algorithms often limit the size of the dictionaries that can be trained and the dimensions of the signals that can be processed. Reducing the complexity of these learning algorithms is therefore a substantial challenge.

The main process of many learning algorithms [8-10] can be divided into two stages: sparse coding and dictionary update. Sparse coding finds the sparsest solution for the training signals and dominates the complexity of dictionary learning. Commonly used strategies for sparse coding are typically based on greedy pursuit or convex relaxation. Greedy pursuit employs greedy algorithms (e.g., matching pursuit (MP) [11] and orthogonal matching pursuit (OMP) [12]) to obtain an approximate solution of the sparse representation. Since the MP and OMP select only one atom at each iteration, Donoho et al. recently proposed the stagewise orthogonal matching pursuit (StOMP) [13], which applies a thresholding strategy named false discovery rate (FDR) control to the atom selection process. The StOMP can therefore select more than one atom per iteration and has lower computational complexity than the MP and OMP, especially for large-scale problems. However, because of their high non-convexity, these greedy algorithms usually cannot find the optimal solution, which may cause the dictionary learning algorithm to get caught in local minima or even saddle points [2, 14]. The convex relaxation approaches (e.g., basis pursuit (BP) [4] and the LASSO [15]) use the ℓ_1-norm as a sparseness measure and have been proven to obtain the sparsest solution. But they run much slower than the above greedy algorithms and thus place a heavy computational burden on the whole learning algorithm. A fast algorithm called least angle regression (LARS), introduced in [16], solves the LASSO problem with a small modification, and its computational complexity is very close to that of the greedy methods. However, the LARS also allows only one atom to be chosen in each atom selection step, so there is a strong incentive to select more atoms per iteration in order to accelerate convergence, as in [13]. Following this idea, we propose a method named St-LARS, which applies a hard-thresholding strategy to the LARS. This enables it to select more than one atom at each iteration while still keeping performance close to that of the LARS. It is worthwhile to note that, because hard thresholding is less time consuming than the thresholding controlled by the FDR, the St-LARS generally runs much faster than the StOMP, as will be demonstrated in the experimental part. As a result, the St-LARS greatly accelerates the sparse coding stage and provides a good sparse solution for the dictionary update stage.


Updating the dictionary, once the sparse representation is found, is comparatively easier. The SVD decomposition used in the K-SVD has been shown to be a better choice than many other approaches [8-10, 14, 17]. In this paper, we replace the exact SVD computation with a simpler approximated one (ASVD) [18] and thus obtain a further acceleration of the dictionary learning algorithm.

The rest of this paper is organized as follows. In Section 2, we introduce the proposed dictionary learning algorithm. Our experimental results on both synthetic data and 3-D image denoising are presented in Section 3. Section 4 concludes this paper and suggests future work.

2 DICTIONARY LEARNING

Dictionary learning is the task of learning or training a dictionary such that it is well adapted to its training data. Usually the objective is to give a sparse representation of the training set while making the total error as small as possible, i.e. minimizing the sum of squared errors. Let the training data constitute the columns of a matrix Y, and let the sparse coefficient vectors be the columns of a matrix X. The objective of dictionary learning can then be stated formally as the minimization problem

min_{D,X} ||Y − DX||_F^2   subject to   ||X||_p ≤ T,        (1)

where ||·||_p denotes the ℓ_p-norm. A practical optimization strategy splits the problem into two parts that are alternately solved within an iterative loop [2, 8-10]. The two parts are: 1) sparse coding: keeping D fixed, find X; and 2) dictionary update: keeping X fixed, find D. Since the proposed dictionary learning algorithm is also based on these two parts, the following subsections give detailed descriptions of our improvements to each of them.
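To make the alternating structure concrete, the following Python/NumPy sketch (not part of the original paper; the function names and interfaces are illustrative assumptions) shows one way the two-stage loop could be organised, with `sparse_coding` and `dictionary_update` standing in for the St-LARS and ASVD procedures described in the next two subsections.

```python
import numpy as np

def learn_dictionary(Y, K, n_iter, sparse_coding, dictionary_update):
    """Generic two-stage dictionary learning loop (illustrative sketch).

    Y : (N, M) matrix whose columns are the training signals.
    K : number of atoms.
    sparse_coding(D, Y) -> X              : stage 1, D fixed.
    dictionary_update(D, Y, X) -> (D, X)  : stage 2, support of X fixed.
    """
    N, M = Y.shape
    # Initialise D with randomly chosen, normalised training signals.
    D = Y[:, np.random.choice(M, K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12

    for _ in range(n_iter):
        X = sparse_coding(D, Y)            # stage 1: keep D fixed, find X
        D, X = dictionary_update(D, Y, X)  # stage 2: keep X fixed, update D
    return D, X
```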

2.1 Sparse coding

Consider solving the optimization problem (1) with the ℓ_p-norm penalty over the sparse matrix X while keeping the dictionary D fixed. This problem can be solved by optimizing over each column x of the sparse matrix individually:

min_x ||y − Dx||_2^2   subject to   ||x||_p ≤ t,        (2)

where y represents one signal (column) of the training matrix Y. Notice that the above optimization task can equivalently be rewritten as

min_x (1/2)||y − Dx||_2^2 + λ||x||_p.        (3)

For a proper choice of λ, the two problems are equivalent. If p in ||·||_p is set to 0, the above problem is known to be NP-hard in general [19], and greedy algorithms [11-13] cannot guarantee the optimal sparse solution. The BP [4] and the LASSO [15] replace the combinatorial ℓ_0-norm with the ℓ_1-norm. As reported in [4], they achieve sparser solutions but are slower than the greedy approaches in most experiments. In [16], an algorithm called LARS was introduced to solve the LASSO problem with minor modifications, and its computational complexity is linear in the size of the input signals, like the greedy algorithms. However, the LARS computes the solution by adding only one atom to its active set at each iteration. Therefore, we adapt the idea from [13] and propose the St-LARS algorithm to solve problem (3).

Fig. 1. The scheme of the St-LARS algorithm.

The scheme of the St-LARS is presented in Fig. 1 and consists of the following seven steps.

Step 1: Set the initial solution x_0 = 0, the initial residual r_0 = y, the initial active set I_0 = ∅, the initial threshold λ_0 = ||D^T y||_∞, and the counter s = 1.

Step 2: Apply D to the current residual r_{s−1} to obtain a vector of residual correlations c_s:

c_s(k) = d_k^T r_{s−1},   k = 1, ..., K,        (4)

for each corresponding atom (K is the number of atoms in D, d_k denotes the k-th atom of D, and c_s(k) stands for the k-th element of c_s). The residual correlations c_s are expected to contain a small fraction of significant non-zeros.

Step 3: Perform hard thresholding with threshold λ_s to find the significant non-zeros in c_s:

I_s = {k : |c_s(k)| > λ_s}.        (5)

The threshold λ_s is calculated as λ_s = μλ_{s−1}, where μ is the threshold decrease step.

Step 4: Compute the update direction dx_s by projecting the residual r_{s−1} onto the subspace spanned by the columns of D belonging to I_s:

dx_s(I_s) = (D_{I_s}^T D_{I_s})^{−1} D_{I_s}^T r_{s−1}.        (6)

Step 5: Update the solution x_s by augmenting x_{s−1} in the direction dx_s with a step size ε (typically chosen equal to the threshold decrease step):

x_s = x_{s−1} + ε dx_s.        (7)

Step 6: Construct the new approximation Dx_s using x_s. The current residual is then calculated by

r_s = y − Dx_s.        (8)

Step 7: Check the stopping condition. The procedure stops and outputs the final solution x̂_s when the ℓ_2-norm of the current residual r_s reaches an error goal, or when the threshold λ_s falls below a pre-chosen value. Otherwise, we set s = s + 1 and go to Step 2.

This algorithm is a fast stagewise approximation to the LARS and provides a good approximation to the solution of (3). Note that the matrix inversion in Step 4 is usually not carried out explicitly because of its high computational cost. Instead, we employ a progressive Cholesky update to reduce the work involved in the matrix inversion and thus further accelerate the St-LARS. For more details about the Cholesky process, the reader is referred to [18, 20].
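For concreteness, the following Python/NumPy sketch mirrors Steps 1-7 above. It is an illustrative approximation rather than the authors' implementation: in particular, it solves the least-squares projection of Step 4 with a dense solver instead of the progressive Cholesky update, and the default parameter values are assumptions.

```python
import numpy as np

def st_lars(D, y, mu=0.95, eps=0.95, err_goal=1e-6, lambda_min=1e-3, max_iter=200):
    """St-LARS sketch following Steps 1-7 of Section 2.1.

    D   : (N, K) dictionary with unit-norm columns.
    y   : (N,) signal to be coded.
    mu  : threshold decrease step (lambda_s = mu * lambda_{s-1}).
    eps : step size along the update direction.
    Returns the sparse coefficient vector x.
    """
    N, K = D.shape
    x = np.zeros(K)                       # Step 1: initial solution
    r = y.copy()                          # initial residual
    lam = np.max(np.abs(D.T @ y))         # initial threshold ||D^T y||_inf

    for _ in range(max_iter):
        c = D.T @ r                       # Step 2: residual correlations
        lam *= mu                         # Step 3: decrease the threshold
        I = np.flatnonzero(np.abs(c) > lam)   # hard thresholding
        if I.size > 0:
            # Step 4: least-squares projection onto the selected columns.
            # (The paper uses a progressive Cholesky update here; this sketch
            #  simply calls a dense solver for clarity.)
            sol, _, _, _ = np.linalg.lstsq(D[:, I], r, rcond=None)
            dx = np.zeros(K)
            dx[I] = sol
            x = x + eps * dx              # Step 5: move along the direction
            r = y - D @ x                 # Step 6: recompute the residual
        # Step 7: stopping conditions
        if np.linalg.norm(r) <= err_goal or lam < lambda_min:
            break
    return x
```

In the two-stage loop sketched earlier, `sparse_coding` could then simply apply `st_lars` to every column of Y.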

2.2 Dictionary update

Given a fixed sparse matrix X, the dictionary D can be updated by solving the following problem:

min_D ||Y − DX||_F^2.        (9)

In general, this problem can be solved using gradient descent with iterative projection [17] or other least-squares methods [8]. However, it can be solved much more efficiently using the SVD decomposition. For a given atom k, the quadratic term in (9) is rewritten as

||Y − Σ_{j≠k} d_j g_j^T − d_k g_k^T||_F^2 = ||E_k − d_k g_k^T||_F^2,        (10)

where g_j^T are the rows of the sparse matrix X, d_k denotes the k-th atom of the dictionary D, and E_k stands for the residual matrix. After the SVD decomposition of the matrix E_k, both the atom d_k and the row g_k^T can be updated. To avoid introducing new non-zeros in X, the update uses only the signal vectors whose current representations use the atom d_k. Since the exact SVD computation is time consuming, adopting a much quicker approach to obtain a solution of (10) is an attractive option. Therefore, our implementation uses the ASVD, which employs a single iteration of alternate optimization over the atom d_k and the sparse matrix row g_k^T, given by

d_k = E_k g_k / ||E_k g_k||_2,
g_k = (E_k)^T d_k.        (11)

This operation has been shown to both reduce the complexity and give results very close to those of the full decomposition of E_k [18].
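A compact Python/NumPy sketch of this update is given below. It is illustrative only; the variable names and the handling of unused atoms are assumptions not specified in the text.

```python
import numpy as np

def asvd_dictionary_update(D, Y, X):
    """Approximate SVD (ASVD) dictionary update sketch for problems (9)-(11).

    D : (N, K) dictionary, Y : (N, M) training signals, X : (K, M) sparse codes.
    Each atom d_k and its coefficient row g_k^T are refined by one
    alternate-optimisation step, using only the signals that currently use atom k.
    """
    K = D.shape[1]
    for k in range(K):
        omega = np.flatnonzero(X[k, :])   # signals whose representation uses atom k
        if omega.size == 0:
            continue                      # unused atom: left unchanged in this sketch
        g = X[k, omega]                   # current coefficient row (restricted)
        # Residual E_k restricted to omega, with atom k's own contribution removed.
        E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], g)
        # Eq. (11): one alternate-optimisation step instead of a full SVD.
        d = E @ g
        d /= np.linalg.norm(d) + 1e-12    # d_k = E_k g_k / ||E_k g_k||_2
        g_new = E.T @ d                   # g_k = E_k^T d_k
        D[:, k] = d
        X[k, omega] = g_new
    return D, X
```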

3 EXPERIMENTS

We evaluate the proposed method with synthetic and real data. Using synthetic data with random dictionaries helps us examine the ability of the proposed method to recover dictionaries exactly (to within an acceptable squared error). This synthetic experiment is very similar to that in [10], except that the dimensions of the data and of the dictionary we generate are slightly higher. To test the performance on real data, we choose 3-D CT volumes, which require a comparatively larger dictionary. These experiments thus demonstrate that our method is more applicable to large-scale dictionary learning problems than other approaches.

3.1 Synthetic experiments

In these simulations, a 50 × 100 dictionary D is first generated by normalizing a matrix with i.i.d. uniform random entries. Then, we produce 1500 data signals of dimension 50, each created as a linear combination of ten different dictionary atoms, with independently chosen locations and i.i.d. uniformly distributed coefficients. After that, white Gaussian noise with varying signal-to-noise ratio (SNR) is added to the resulting data signals. The training dictionary is initialized with data signals, and the number of training iterations in this setting is set to 100. If the squared error between a learnt and a true dictionary element is below 0.2, the atom is classified as correctly recovered. To allow a fair comparison, the simulations are repeated five times and the average values are reported. In Fig. 2, the dictionary recovery success rates of our method are compared with those of the K-SVD [10], StOMP-ASVD, and LARS-ASVD. The StOMP-ASVD and LARS-ASVD are the methods that first apply the StOMP and LARS, respectively, to perform the sparse coding, and then employ the ASVD to conduct the dictionary update. The stopping criterion for the LARS in LARS-ASVD and for the St-LARS in our method is λ = 0.3 in (3). The OMP in the K-SVD and the StOMP in the StOMP-ASVD are stopped when a fixed number of ten coefficients is found. The threshold decrease step μ and the step size ε in (7) are both set to 0.95. As can be seen from Fig. 2, the LARS-ASVD and our method recover nearly the same number of atoms and perform better than the other two methods in all the tested cases. In Fig. 3, we also compare the computation times of the four algorithms for the above simulations. The simulations are done on a PC with an AMD Athlon 2.81 GHz CPU and 2.00 GB RAM, running Matlab 7.10.0. As we can see, the speed advantage of our method is obvious: it is about two times faster than the other approaches.

Note also that the StOMP-ASVD does not run very fast in these simulations, and is even slower than the K-SVD, although the StOMP can also select many atoms at each iteration of the sparse coding stage. The main reason is that the FDR control, which calculates the threshold at each iteration of the StOMP, requires considerable computation, far more than the hard thresholding used in our method. We expect that, for larger dictionaries or higher-dimensional signals, the StOMP-ASVD would become faster than the K-SVD.
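As a concrete illustration of the recovery test described above, the following sketch generates synthetic data with the stated dimensions and counts an atom as recovered when its squared error to some learned atom is below 0.2. The sign-ambiguity handling and the exact SNR scaling are our assumptions, not details given in the paper.

```python
import numpy as np

def make_synthetic_data(N=50, K=100, M=1500, sparsity=10, snr_db=20, seed=None):
    """Random dictionary and noisy sparse training signals (illustrative setup)."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1, 1, size=(N, K))
    D /= np.linalg.norm(D, axis=0)
    X = np.zeros((K, M))
    for m in range(M):
        idx = rng.choice(K, sparsity, replace=False)   # ten random atoms per signal
        X[idx, m] = rng.uniform(-1, 1, sparsity)
    Y = D @ X
    noise = rng.standard_normal(Y.shape)
    noise *= np.linalg.norm(Y) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return D, Y + noise

def recovery_rate(D_true, D_learned, tol=0.2):
    """Fraction of true atoms whose squared error to some learned atom is below tol."""
    recovered = 0
    for k in range(D_true.shape[1]):
        d = D_true[:, [k]]
        # squared error to every learned atom, taking the better of the two signs
        err = np.minimum(np.sum((D_learned - d) ** 2, axis=0),
                         np.sum((D_learned + d) ** 2, axis=0))
        if err.min() < tol:
            recovered += 1
    return recovered / D_true.shape[1]
```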

Fig. 2. Comparison of the dictionary recovery success rates using different dictionary learning methods


Fig. 3. Comparison of the computation costs of the dictionary learning methods

3.2 3-D image denoising experiments

In this subsection, our method is tested on two 3-D CT volumes: Visible Male-Head and Visible Female-ankle. Like the K-SVD denoising algorithm in [1], we first train an over-complete dictionary using blocks from the noisy image and then denoise the image using this dictionary. The peak signal-to-noise ratio (PSNR) is used as the objective denoising measure. The size of the training blocks is 8×8×8 and the size of the dictionary is 512×1000. The number of training blocks is set to 80000 and the number of training iterations is set to 5. We should mention that some recent denoising techniques, e.g. the multi-scale framework [2] and non-local simultaneous sparse coding [21], could be used here to further improve our performance. However, our work here focuses only on the original denoising formulation for simplicity, and thus we do not conduct a thorough comparison with the state-of-the-art works.
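The block-based pipeline described above can be sketched as follows. This is a heavily simplified illustration: the mean subtraction, the weighted averaging with the noisy volume, and other details of the K-SVD denoising scheme in [1] are omitted, and the block step size is an assumption.

```python
import numpy as np

def denoise_volume_sketch(noisy, D, sparse_coding, block=8, step=4):
    """Reduced sketch of block-based 3-D denoising with a learned dictionary.

    noisy         : 3-D numpy array (the noisy CT volume).
    D             : (block**3, K) dictionary trained on blocks of the noisy volume.
    sparse_coding : function mapping (D, y) -> sparse code x for one block.
    Overlapping block estimates are simply averaged.
    """
    out = np.zeros_like(noisy, dtype=float)
    weight = np.zeros_like(noisy, dtype=float)
    nx, ny, nz = noisy.shape
    for i in range(0, nx - block + 1, step):
        for j in range(0, ny - block + 1, step):
            for k in range(0, nz - block + 1, step):
                patch = noisy[i:i+block, j:j+block, k:k+block].reshape(-1)
                x = sparse_coding(D, patch)                     # code the block
                est = (D @ x).reshape(block, block, block)      # denoised block
                out[i:i+block, j:j+block, k:k+block] += est
                weight[i:i+block, j:j+block, k:k+block] += 1.0
    return out / np.maximum(weight, 1.0)
```

With the `st_lars` routine of Section 2.1 as the sparse coder, `sparse_coding=st_lars` would be one possible instantiation of this sketch.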

Table 1: CT denoising results using the K-SVD and our method. Values represent PSNR (dB) and are averaged over 5 executions. The best results in this table are labeled in bold.

Noise level σ   Vis. F. Ankle              Vis. M. Head
                K-SVD     Our method       K-SVD     Our method
20              37.52     37.55            37.62     37.66
50              33.11     33.22            32.79     32.93
75              31.20     31.38            30.35     30.53
100             29.71     29.88            29.11     29.36

Fig. 4. Denoising results for Visible Female-ankle. Images are mainly provided for qualitative evaluation. (a) Original image ( σ = 50 ); (b) Noisy image; (c) Denoised using the K-SVD; (d) Denoised using our method.

The denoising results are summarized in Table 1. We can see from the table that the PSNR gain of our method over the K-SVD is consistent, though not very large. Some actual denoising results are shown in Fig. 4. It can be seen that the visual quality of the K-SVD is very close to that of our method. Although the differences in visual quality are typically small, the main appeal of our method is its substantially lower complexity, as depicted in Table 2. As can be seen, the complexity advantage of our method translates into a 2x to 4x reduction in denoising time compared to the K-SVD. It can therefore be concluded that our method is more suitable for large-scale dictionary learning problems.

Table 2: Running times (in seconds) of the K-SVD and our method for the results in Table 1.

Method / σ      Vis. F. Ankle                          Vis. M. Head
                20      50      75      100            20      50      75      100
K-SVD           8467    3126    2193    1830           9662    3814    2336    1970
Our method      2152    1223    1095    965            2365    1396    1151    1032

4 CONCLUSION

In this paper, we combine the St-LARS and the ASVD into an efficient dictionary learning algorithm. The St-LARS gives a better solution for the sparse coding stage and greatly reduces its complexity by exploiting a simple hard-thresholding strategy. The ASVD is a quick approximate way of updating the dictionary, and thus further accelerates the whole dictionary learning algorithm. The experimental results demonstrate the superior performance of the proposed method, both in training a better dictionary and in reducing the computational complexity. This learning algorithm is therefore well suited to large-scale signal processing applications (such as the 3-D CT denoising above). However, the full potential of this algorithm remains to be explored, and other signal processing tasks (including color image denoising, deblurring, and inpainting) are expected to benefit from it.

ACKNOWLEDGEMENTS

This paper is supported by the National Natural Science Foundation of China (No. 60871096), the Ph.D. Programs Foundation of Ministry of Education of China (No. 200805320006), the Key Project of Chinese Ministry of Education (2009-120), and the Open Projects Program of National Laboratory of Pattern Recognition, China.

REFERENCES

[1] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15(12): 3736-3745, 2006.
[2] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM J. Multiscale Model. Simul., 7(1): 214-241, 2008.
[3] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. Proc. ICCV 2009, Tokyo, Japan, 2009, pp. 2272-2279.
[4] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Rev., 43(1): 129-159, 2001.
[5] S. Mallat. A Wavelet Tour of Signal Processing, third edition. Academic Press, 2009.
[6] M. N. Do and M. Vetterli. The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans. Image Process., 14(12): 2091-2106, 2005.
[7] D. L. Donoho. Wedgelets: nearly minimax estimation of edges. The Annals of Statistics, 27(3): 859-897, 1999.
[8] K. Engan, S. O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. Proc. ICASSP 1999, Washington, USA, 1999, Vol. 5, pp. 2443-2446.
[9] S. Lesage, R. Gribonval, F. Bimbot, and L. Benaroya. Learning unions of orthonormal bases with thresholded singular value decomposition. Proc. ICASSP 2005, Philadelphia, USA, 2005, Vol. 5, pp. 293-296.
[10] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process., 54(11): 4311-4322, 2006.
[11] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process., 41(12): 3397-3415, 1993.
[12] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. Proc. Asilomar Conf. on Signals, Systems and Computers, California, USA, 1993, Vol. 1, pp. 40-44.
[13] D. L. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck. Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Preprint, 2006. [Online]. Available: http://stat.stanford.edu/~idrori/StOMP.pdf
[14] R. Rubinstein, A. M. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6): 1045-1057, 2010.
[15] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1): 267-288, 1996.
[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2): 407-499, 2004.
[17] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, 1997.
[18] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical Report, CS Technion, 2008.
[19] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13(1): 57-98, 1997.
[20] T. Blumensath and M. Davies. Gradient pursuits. IEEE Trans. Signal Process., 56(6): 2370-2382, 2008.
[21] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process., 16(8): 2080-2095, 2007.