MULTILEVEL DICTIONARY LEARNING FOR SPARSE REPRESENTATION OF IMAGES

Jayaraman J. Thiagarajan, Karthikeyan N. Ramamurthy and Andreas Spanias
SenSIP Center, School of ECEE, Arizona State University, Tempe, AZ 85287-5706 USA.
Email: {jjayaram,knatesan,spanias}@asu.edu

ABSTRACT

Adaptive data-driven dictionaries for sparse approximations provide superior performance compared to predefined dictionaries in applications involving representation and classification of data. In this paper, we propose a novel algorithm for learning global dictionaries particularly suited to the sparse representation of natural images. The proposed algorithm uses a hierarchical, energy-based learning approach to learn a multilevel dictionary. The atoms that contribute the most energy to the representation are learned in the first level, and those that contribute progressively less energy are learned in the subsequent levels. The learned multilevel dictionary is compared to a dictionary learned using the K-SVD algorithm. Reconstruction results using a small number of non-zero coefficients demonstrate the advantage of exploiting the energy hierarchy with multilevel dictionaries, pointing to potential applications in low bit-rate image compression. Superior performance in compressed sensing with optimized sensing matrices and a small number of measurements is also demonstrated.

Index Terms— dictionary learning, sparse representations, natural image statistics, K-hyperline clustering, compressed sensing

1. INTRODUCTION

Parsimonious representations are useful in analysis-synthesis applications in signal and image processing. The primary motivation for considering sparse representations of images is the fact that natural images can be sparsely decomposed into a set of elementary features [1] such as edges and lines. Sparse representation problems require a good approximation of a signal as a linear combination of elementary signals, with a constraint on coefficient sparsity. Learning dictionaries from data allows us to extract key patterns specific to natural images, thereby providing a better representation than predefined dictionaries. Sparse approximations evaluated using learned overcomplete dictionaries are useful in many signal/image processing applications such as compression [2], denoising [3], and pattern classification [4]. In overcomplete models, the number of basis functions is greater than the dimensionality of the input signal. It has been shown that the approximation capability is significantly improved when an overcomplete set of basis functions is used [5].

The linear model considered for representation is Y = ΨA + N, where the collection of T training vectors is given by Y = [y_1 y_2 ... y_T] ∈ R^{M×T} and each training vector y_i ∈ R^M. The dictionary, a collection of representative patterns, is given by Ψ = [ψ_1 ψ_2 ... ψ_K] ∈ R^{M×K}. The set of T coefficient vectors is denoted by the matrix A = [a_1 a_2 ... a_T] ∈ R^{K×T}. Additive noise is modeled by the matrix N, whose elements are independent realizations from the Gaussian distribution N(0, σ²). The column vectors of Ψ are assumed to be normalized. Following [2], the joint optimization of sparse representation and dictionary learning is posed as

\min_{\Psi, A} \|Y - \Psi A\|_F^2 \;\; \text{subject to} \;\; \|a_i\|_0 \le s, \; \forall i, \;\; \text{and} \;\; \|\psi_j\|_2 = 1, \; \forall j,   (1)

where s is the sparsity (the maximum number of non-zero elements) of the representation of a training vector, ‖·‖_F is the Frobenius norm, ‖·‖_0 is the ℓ_0 norm and ‖·‖_2 is the ℓ_2 norm.

In this paper, we consider the class of dictionary learning algorithms that perform a two-step iterative procedure: (a) given the dictionary, evaluate the coefficients (sparse coding) and (b) given the coefficients, update the dictionary so as to achieve a better representation. A wide range of such learning algorithms have been proposed in the literature [6]; they differ in the method used to evaluate the coefficients and the procedure used to update the dictionary elements. The two-step iterative procedure of dictionary learning has a strong connection to the problem of data clustering, and hence algorithms such as K-means and K-hyperline clustering [7] are special cases of (1). In particular, the K-SVD algorithm proposed in [2] is a direct generalization of the K-hyperline clustering algorithm. K-SVD allows the use of any pursuit algorithm in its implementation and has been highly successful in image coding and denoising applications.

In this paper, we propose a multilevel dictionary (MLD) learning framework that designs global representative dictionaries for natural images. We use a hierarchical, energy-based learning approach that uses the K-hyperline clustering algorithm to learn representative atoms for each level of the dictionary. In the first level, patches from the training images are used for learning the dictionary atoms. For the subsequent levels, the residuals of the representation from the previous level are used as the training data.
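To make the two-step procedure concrete, the following is a minimal NumPy sketch of the sparse coding step in (1) for a fixed, column-normalized dictionary, using plain matching pursuit; the random dictionary, the function name, and the parameter choices are ours for illustration and are not part of the algorithms discussed in this paper.

```python
import numpy as np

def matching_pursuit(y, Psi, s):
    """Greedy s-term approximation of y over a column-normalized dictionary Psi."""
    a = np.zeros(Psi.shape[1])
    residual = y.astype(float).copy()
    for _ in range(s):
        corr = Psi.T @ residual            # correlations with the current residual
        j = int(np.argmax(np.abs(corr)))   # atom with the largest absolute correlation
        a[j] += corr[j]                    # accumulate the coefficient for that atom
        residual -= corr[j] * Psi[:, j]    # peel off the selected component
    return a, residual

# toy usage with a random unit-norm dictionary (illustrative only)
rng = np.random.default_rng(0)
M, K, s = 64, 256, 4
Psi = rng.standard_normal((M, K))
Psi /= np.linalg.norm(Psi, axis=0)
y = rng.standard_normal(M)
a, r = matching_pursuit(y, Psi, s)
print(np.count_nonzero(a), np.linalg.norm(r))
```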


The algorithm we proposed in [8] learns a graph-structured dictionary for shift-invariant representations and uses a graph-based pursuit for reconstruction. In contrast, the algorithm proposed in this paper builds a global dictionary, and its pursuit algorithm for reconstruction also differs substantially from the one in [8]. We show the effectiveness of the MLD learning algorithm using a global dictionary trained on a limited set of patches from natural images. For comparison, we use a global dictionary learned with the K-SVD algorithm on the same training set. We demonstrate the utility of multilevel global dictionaries in the compression of natural images by showing that a representation computed with very few non-zero coefficients using an MLD achieves a lower approximation error than one computed with a K-SVD dictionary. Furthermore, we observe that the multilevel dictionary achieves a significant improvement in performance in compressive sensing of images with a highly reduced number of measurements.

The paper is organized as follows. In Section 2, we provide a brief theoretical exposition of K-hyperline clustering, which is the building block of MLD learning. Section 3 describes the proposed MLD learning algorithm and the procedure for computing coefficients using an MLD. Section 4 presents simulation results and Section 5 concludes the paper.

2. K-HYPERLINE CLUSTERING

The K-hyperline clustering algorithm is an iterative, K-means-like procedure that performs a least squares fit of K 1-D subspaces to the training data [7]. The algorithm is obtained by setting s = 1 in (1), i.e., by constraining each coefficient vector to be 1-sparse. The columns of the dictionary Ψ, {ψ_j}_{j=1}^K, are the K normalized cluster centroids. We denote the clustering operation by {Ψ, A} = KHYPL(Y, K).

In order to perform a theoretical analysis of the clustering procedure, we assume that the data Y lies in R^M and define the probability space (Y, Σ, P), where P is an unknown probability measure. The training samples {y_i}_{i=1}^T are T i.i.d. realizations from this probability space. We also define an empirical probability measure P_T that assigns mass T^{-1} to each of the T training samples [9].

K-hyperline clustering is an alternating minimization procedure that proceeds in two stages after initialization: the cluster assignment stage and the cluster centroid update stage. In the cluster assignment stage, the training vector y_i is assigned to a cluster j based on the minimum distortion criterion, H(y_i) = argmin_j d(y_i, ψ_j), which is equivalent to H(y_i) = argmax_j |y_i^T ψ_j|. The distortion measure is

d(y, \psi) = \|y - \psi (y^T \psi)\|_2^2,   (2)

where ψ is assumed to be normalized. In the cluster centroid update stage, we perform a singular value decomposition (SVD) of Y_j = [y_i]_{i ∈ C_j}, where C_j = {i | H(y_i) = j} contains the training vector indices corresponding to cluster j. The left singular vector corresponding to the largest singular value of the decomposition is the centroid of cluster j.
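A compact NumPy sketch of the two-stage iteration described above (assignment by maximum |y_i^T ψ_j|, centroid update by the dominant left singular vector); the initialization scheme, iteration count, and function name are our own choices and are not specified by [7] or [9].

```python
import numpy as np

def khyperline(Y, K, n_iter=50, seed=0):
    """K-hyperline clustering: least-squares fit of K 1-D subspaces to the columns of Y.
    Returns unit-norm centroids Psi (M x K) and 1-sparse coefficients A (K x T)."""
    M, T = Y.shape
    rng = np.random.default_rng(seed)
    # initialize centroids with randomly chosen, normalized training vectors
    Psi = Y[:, rng.choice(T, size=K, replace=False)].astype(float)
    Psi /= np.linalg.norm(Psi, axis=0) + 1e-12
    for _ in range(n_iter):
        # cluster assignment stage: H(y_i) = argmax_j |y_i^T psi_j|
        labels = np.argmax(np.abs(Psi.T @ Y), axis=0)
        # centroid update stage: dominant left singular vector of each cluster's data
        for j in range(K):
            Yj = Y[:, labels == j]
            if Yj.shape[1] > 0:
                Psi[:, j] = np.linalg.svd(Yj, full_matrices=False)[0][:, 0]
    # final 1-sparse coefficients: column i holds y_i^T psi_{H(y_i)} in position H(y_i)
    labels = np.argmax(np.abs(Psi.T @ Y), axis=0)
    A = np.zeros((K, T))
    A[labels, np.arange(T)] = np.einsum('mt,mt->t', Psi[:, labels], Y)
    return Psi, A
```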

2.1. Distortion Function for Clustering

K-hyperline clustering can be posed as a problem of empirical risk minimization over the distortion function class [10]

G_K = \Big\{ g_\Psi(y) = d(y, \psi_j), \; j = \arg\max_{l \in \{1, \dots, K\}} |y^T \psi_l| \Big\}.   (3)

The class G_K is obtained by taking the functions corresponding to all possible combinations of K hyperlines from the R^M space for the set Ψ. Minimizing the distortion for {y_i}_{i=1}^T is equivalent to finding

g_{\hat{\Psi}} = \arg\min_{g \in G_K} \frac{1}{T} \sum_{i=1}^{T} g_\Psi(y_i).   (4)

This is the same as finding the function g corresponding to the minimum expected risk of learning, i.e., g_{\hat{\Psi}} = \arg\min_{g \in G_K} E_{P_T}[g_\Psi], for the probability measure P_T [11]. The covering number for the function class G_K with respect to the supremum norm is upper bounded by (Lemma 2.1 in [9])

N_\infty(\epsilon, G_K) \le \left( \frac{8 R^3 K + \epsilon}{\epsilon} \right)^{MK}.   (5)

This shows that N_\infty(\epsilon, G_K) grows polynomially and hence the class G_K of functions is uniform Donsker [12].

2.2. Characteristics

The three important characteristics of the K-hyperline clustering algorithm, as described in [9], are (a) local optimality, (b) Voronoi tessellation of the R^M space and (c) stability. Local optimality is proved by showing that the clustering algorithm satisfies Lloyd's optimality conditions [13]. Furthermore, it is also shown that the K cluster centroids tessellate the R^M space into 2K convex Voronoi regions. The stability of the clustering algorithm is established by proving that the cluster centroids learned are not significantly different when different i.i.d. training sets from the same probability space are used for training. The proof of stability of the clustering algorithm is outlined below and more details on the characteristics of the algorithm are given in [9].

When there is no unique global minimizer of the objective function (4), K-hyperline clustering is stable with respect to a change of o(\sqrt{T}) samples between two i.i.d. training sets of T samples each, as T → ∞. Let Ψ = {ψ_k}_{k=1}^K and Λ = {λ_k}_{k=1}^K be the cluster centroids trained using the two different i.i.d. training sets. Since the distortion function class for K-hyperline clustering is uniform Donsker, \|g_\Psi - g_\Lambda\|_{L_1(P)} → 0 as T → ∞. This implies that the distortion functions for the clusterings Ψ and Λ become arbitrarily close to each other. In order to show that the clustering is stable, we prove that the cluster centroids themselves are close to each other. We define a distance measure

\Delta(\Psi, \Lambda) = \max_{1 \le j \le K} \min_{1 \le l \le K} \big[ (d(\psi_j, \lambda_l))^{1/2} + (d(\psi_l, \lambda_j))^{1/2} \big].

Using Theorem 3 in [9], we show that the cluster centroids also become arbitrarily close to each other, i.e., Δ(Ψ, Λ) → 0 as \|g_\Psi - g_\Lambda\|_{L_1(P)} → 0. The optimality and stability characteristics of the K-hyperline clustering algorithm make this a candidate for sophisticated dictionary learning in sparse representations. The stability properties will be useful in learning dictionaries that faithfully identify the underlying stable patterns in natural signals and images.
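For concreteness, a small helper (our own naming and interface) that evaluates the distance measure Δ(Ψ, Λ) defined above, with d(·, ·) as in (2) and unit-norm centroids assumed.

```python
import numpy as np

def hyperline_distortion(y, psi):
    # d(y, psi) = ||y - psi (y^T psi)||_2^2, with psi assumed unit-norm
    return float(np.sum((y - psi * (y @ psi)) ** 2))

def delta(Psi, Lam):
    """Distance between two sets of K unit-norm centroids (columns of Psi and Lam)."""
    K = Psi.shape[1]
    worst = 0.0
    for j in range(K):
        best = min(
            np.sqrt(hyperline_distortion(Psi[:, j], Lam[:, l]))
            + np.sqrt(hyperline_distortion(Psi[:, l], Lam[:, j]))
            for l in range(K)
        )
        worst = max(worst, best)
    return worst
```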

3. PROPOSED MULTILEVEL DICTIONARY LEARNING ALGORITHM

3.1. Motivation

The statistical structure of natural images allows for their efficient representation using a set of elementary features. Global dictionaries that learn moderately complex features from a small training set can generalize well to a wide range of natural images. Learning an elementary feature set for natural images often involves dividing the images into patches and using the patches as training data. Natural image patches are highly self-similar because they can be represented by a small set of shared elementary features. High similarity means that the normalized inner product between the vectorized image patches is high. In this paper, we propose a learning approach that builds global dictionaries capturing the energy hierarchy in image patches, in addition to identifying the elementary features. Atoms that contribute the most to the energy of the representation are learned in the first level, followed by the next set of highest-energy-contributing atoms, and so on.

[Fig. 1. (a) 1-D submanifold lying on a 3-D unit sphere, (b) normalized residual vectors of the 1-D submanifold after removing the contribution of 4 cluster centers.]

In order to illustrate the reason for multilevel learning on natural image patches, we use the synthetic data shown in Figure 1(a). The highly self-similar synthetic training data, given by the columns of Y, lie on a 1-D submanifold of a 3-D sphere. The residual from K-hyperline clustering is given by R = Y − ΨA, where A is the coefficient matrix. On removing the contribution of the 4 cluster centers learned using K-hyperline clustering from the training data and normalizing the residual, we observe that the normalized residual is more spread out on the 3-D sphere than the training vectors themselves, as seen in Figure 1(b). This can also be seen from the average variance of the training vectors across the three dimensions, 0.0767, compared to the average variance of the normalized residuals, 0.3321. Comparing Figures 1(a) and 1(b), we also observe that the similarity between Y and R is low. Summarizing our observations: (a) the residual vectors (columns of R) have less self-similarity than the training vectors, and (b) the similarity of the residuals to the original training vectors is also low. Therefore, we propose to learn a new set of dictionary atoms for representing the residual instead of optimizing atoms across both the data and its residual. This makes multilevel learning a non-iterative procedure, unlike the K-SVD algorithm, which updates the entire dictionary in every iteration.

[Fig. 2. Multilevel dictionary with 16 levels of 16 atoms each learned from the Berkeley segmentation dataset.]

3.2. Algorithm

We denote the MLD as Ψ = [Ψ_1 Ψ_2 ... Ψ_L] and the coefficient matrix as A = [A_1^T A_2^T ... A_L^T]^T. Here, Ψ_l is the dictionary in level l and A_l is the coefficient matrix in level l. The approximation in level l is expressed as

R_{l-1} = \Psi_l A_l + R_l, \quad \text{for } l = 1, \dots, L,   (6)

where R_{l−1} and R_l are the residuals for levels l − 1 and l respectively, and R_0 = Y, the matrix of training image patches. This implies that the residual matrix in level l − 1 serves as the training data for level l. Note that the sparsity of the representation in each level is fixed at 1. K-hyperline clustering is employed to learn the dictionary Ψ_l from the training matrix R_{l−1} for that level. MLD learning can be formally stated as an optimization problem that proceeds from the first level until the stopping criterion is reached. For level l, the optimization problem is

\arg\min_{\Psi_l, A_l} \|R_{l-1} - \Psi_l A_l\|_F^2 \;\; \text{subject to} \;\; \|a_{l,i}\|_0 \le 1, \; \text{for } i = 1, \dots, T,   (7)

where a_{l,i} is the i-th column of A_l and T is the number of columns in A_l. The stopping criterion is given either by imposing a limit on the maximum residual representation error or by the maximum number of levels (L). Note that the total number of levels is the same as the maximum number of non-zero coefficients (sparsity) of the representation. The error constraint can be stated as ‖r_{l,i}‖_2^2 ≤ ε, ∀i = 1, ..., T, for some level l. Table 1 lists the MLD learning algorithm with both sparsity and error constraints.

    Input:
        Y = [y_i]_{i=1}^T, the M × T matrix of training vectors.
        L, the maximum number of levels of the dictionary.
        K_l, the number of dictionary elements in level l, l = 1, 2, ..., L.
        ε, the error goal of the representation.
    Output:
        Ψ_l, the adapted dictionary for level l.
    Algorithm:
        Initialize: l = 1 and R_0 = Y.
        Λ_0 = {i | ‖r_{0,i}‖_2^2 > ε}, the indices of training vectors with squared norm greater than the error goal.
        R̂_0 = [r_{0,i}]_{i ∈ Λ_0}.
        while Λ_{l−1} ≠ ∅ and l ≤ L
            Initialize: A_l, the K_l × T coefficient matrix, all zeros.
            Initialize: R_l, the M × T residual matrix, all zeros.
            {Ψ_l, Â_l} = KHYPL(R̂_{l−1}, K_l).
            R̂_l = R̂_{l−1} − Ψ_l Â_l.
            r_{l,i} = r̂_{l,j}, where i = Λ_{l−1}(j), ∀j = 1, ..., |Λ_{l−1}|.
            a_{l,i} = â_{l,j}, where i = Λ_{l−1}(j), ∀j = 1, ..., |Λ_{l−1}|.
            Λ_l = {i | ‖r_{l,i}‖_2^2 > ε}.
            R̂_l = [r_{l,i}]_{i ∈ Λ_l}.
            l ← l + 1.
        end

Table 1. Algorithm for building the multilevel dictionary.

In MLD learning, for a given level l, the residual r_{l,i} is orthogonal to the representation Ψ_l a_{l,i}. This directly leads to the relation

\|R_{l-1}\|_F^2 = \|\Psi_l A_l\|_F^2 + \|R_l\|_F^2.   (8)

Detailed analysis of the convergence and stability properties of the MLD algorithm will be included in an expanded version of this paper.
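The following NumPy sketch follows the structure of Table 1 in a simplified form: it uses only the level-count stopping criterion, so the error goal ε and the index bookkeeping over Λ_l are omitted, and a minimal K-hyperline routine is inlined. It is an illustration of eqs. (6)-(7), not the reference implementation used for the experiments.

```python
import numpy as np

def _khyperline(Y, K, n_iter=30, seed=0):
    # minimal K-hyperline step: assignment by |y^T psi|, centroid = dominant left singular vector
    rng = np.random.default_rng(seed)
    Psi = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    Psi /= np.linalg.norm(Psi, axis=0) + 1e-12
    for _ in range(n_iter):
        labels = np.argmax(np.abs(Psi.T @ Y), axis=0)
        for j in range(K):
            Yj = Y[:, labels == j]
            if Yj.shape[1] > 0:
                Psi[:, j] = np.linalg.svd(Yj, full_matrices=False)[0][:, 0]
    labels = np.argmax(np.abs(Psi.T @ Y), axis=0)
    A = np.zeros((K, Y.shape[1]))
    A[labels, np.arange(Y.shape[1])] = np.einsum('mt,mt->t', Psi[:, labels], Y)
    return Psi, A

def learn_mld(Y, n_levels=16, atoms_per_level=16):
    """Multilevel dictionary: level l is K-hyperline-trained on the residual of level l-1 (eq. (6))."""
    residual = Y.astype(float)
    dictionaries, coeffs = [], []
    for l in range(n_levels):
        Psi_l, A_l = _khyperline(residual, atoms_per_level, seed=l)
        dictionaries.append(Psi_l)
        coeffs.append(A_l)
        residual = residual - Psi_l @ A_l   # R_l = R_{l-1} - Psi_l A_l
    Psi = np.hstack(dictionaries)           # Psi = [Psi_1 ... Psi_L]
    A = np.vstack(coeffs)                   # A   = [A_1^T ... A_L^T]^T
    return Psi, A, residual
```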

3.3. Using an MLD for Approximation

When using the MLD to compute sparse representations, we need to ensure that the dictionary generalizes well to unknown test data, in addition to imposing the energy hierarchy. The representation is computed using a modified version of the Matching Pursuit (MP) method. In every step of the pursuit algorithm, we build a sub-dictionary with atoms selected from the current level as well as the two immediately preceding and following levels. The sub-dictionary is used to compute a 1-sparse representation for that step of the pursuit. This procedure generalizes the patterns learned in multiple levels to unknown test images, apart from imposing the energy hierarchy. Though more sophisticated adaptive schemes can be designed for computing sparse approximations using the MLD, the results show that even a simple setup such as this achieves significant approximation performance.
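A rough sketch of the modified matching pursuit described above: step l draws its sub-dictionary from levels l − 2 through l + 2 (clipped at the ends of the hierarchy) and adds one coefficient per step. The function name and interface are ours, and the five-level window is our reading of "the two immediately preceding and following levels".

```python
import numpy as np

def mld_pursuit(y, level_dicts, n_steps=None):
    """One coefficient per step; step l searches levels max(0, l-2) .. min(L-1, l+2)."""
    L = len(level_dicts)
    steps = L if n_steps is None else min(n_steps, L)
    sizes = [D.shape[1] for D in level_dicts]
    offsets = np.cumsum([0] + sizes)             # atom index offsets into the full MLD
    coeffs = np.zeros(offsets[-1])
    residual = y.astype(float).copy()
    for l in range(steps):
        lo, hi = max(0, l - 2), min(L - 1, l + 2)
        sub = np.hstack(level_dicts[lo:hi + 1])  # sub-dictionary around level l
        corr = sub.T @ residual
        j = int(np.argmax(np.abs(corr)))         # best 1-sparse choice for this step
        coeffs[offsets[lo] + j] += corr[j]
        residual -= corr[j] * sub[:, j]
    return coeffs, residual
```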

4. SIMULATION RESULTS

The simulation results presented in this section use global dictionaries learned with the MLD and K-SVD algorithms. The training data consisted of 3750 patches of size 8 × 8 extracted from the 250 grayscale training images of the Berkeley segmentation dataset (BSDS) [14] (15 randomly chosen patches from each image). No noise was added to the image patches when learning the dictionaries, and the image intensities are on a 0−255 scale. For the K-SVD dictionary, 256 atoms were learned using the MATLAB toolbox available online [15]. For MLD learning, the number of atoms was fixed at 16 per level and the number of levels at 16, which leads to a total of 256 atoms. Initialization for each level was performed using the K-means algorithm. For both learning procedures, the mean value was subtracted from each training patch during preprocessing.

The training image patches can contain geometric patterns, stochastic textures, or a combination of both. This fact is demonstrated in [16], where the authors define explicit manifolds of low dimensions for geometric patterns and implicit manifolds for stochastic textures, and learn the manifolds using information projection.
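A sketch of the training-data preparation described above (15 random 8 × 8 patches per image, mean removal); loading the BSDS images themselves is left out, since the file layout is not specified here, and the variable `bsds_training_images` is a placeholder.

```python
import numpy as np

def extract_training_patches(images, patches_per_image=15, patch_size=8, seed=0):
    """Randomly sample patches from grayscale images (0-255) and remove each patch's mean."""
    rng = np.random.default_rng(seed)
    cols = []
    for img in images:                      # each img: 2-D NumPy array of intensities
        H, W = img.shape
        for _ in range(patches_per_image):
            r = rng.integers(0, H - patch_size + 1)
            c = rng.integers(0, W - patch_size + 1)
            patch = img[r:r + patch_size, c:c + patch_size].astype(float)
            cols.append((patch - patch.mean()).ravel())   # mean subtraction, vectorize to 64-D
    return np.stack(cols, axis=1)           # M x T training matrix, M = 64

# with the 250 BSDS training images this gives 250 * 15 = 3750 columns:
# Y = extract_training_patches(bsds_training_images)     # image loading left to the reader
```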

[Fig. 3. MSE versus sparsity (number of non-zero coefficients) for the representation of standard images using MLD and K-SVD dictionaries: (a) Boat, (b) House, (c) Lena.]

Measurement  Dictionary |    Barbara    |     Boat      |    Couple     |     House     |     Lena      |      Man
SNR (dB)                | N=16    N=32  | N=16    N=32  | N=16    N=32  | N=16    N=32  | N=16    N=32  | N=16    N=32
-----------------------------------------------------------------------------------------------------------------------
 0           K-SVD      | 525.0   374.9 | 450.5   272.8 | 434.2   267.8 | 364.6   193.6 | 273.7   168.1 | 367.1   235.3
 0           MLD        | 440.75  391.6 | 328.7   246.6 | 326.6   243.7 | 248.9   178.5 | 213.9   148.1 | 263.4   202.4
15           K-SVD      | 289.6   103.9 | 228.4    55.5 | 232.6    56.0 | 176.9    31.6 | 121.7    27.7 | 181.0    48.1
15           MLD        | 212.9   109.7 |  85.7    44.4 |  82.8    41.4 |  49.5    23.6 |  43.1    22.6 |  72.2    37.4
25           K-SVD      | 275.5    83.9 | 212.4    38.3 | 220.9    38.2 | 161.2    17.9 | 112.4    17.2 | 175.4    33.9
25           MLD        | 200.8    93.3 |  73.2    32.4 |  70.1    28.9 |  37.6    14.0 |  35.0    14.3 |  61.1    27.0

Table 2. Comparison of performance in compressed sensing using the MLD and K-SVD dictionaries. For each measurement SNR, the first row gives the MSE obtained with the K-SVD dictionary and the second row the MSE obtained with the MLD.

The learned MLD shown in Figure 2 contains geometric patterns in the first few levels, stochastic textures in the last few levels, and a combination of both in the middle levels. This also shows that, in natural images, the geometric patterns contribute the most energy to the representation, whereas the stochastic textures contribute the least.

4.1. Energy Hierarchy in Representation

Using the learned K-SVD and multilevel dictionaries, we computed representations of standard images with different numbers of non-zero coefficients. The standard images were divided into non-overlapping 8 × 8 patches. For all three cases shown in Figure 3, the mean square error (MSE) of the representation obtained with the MLD is lower than that obtained with the K-SVD dictionary when the number of non-zero coefficients is small. This demonstrates that the atoms that contribute the most energy are given preference when computing a representation using the MLD. This property of the MLD will result in superior PSNR during compression at low bit-rates.


4.2. Compressed Sensing

We demonstrate the performance of the MLD in compressive sensing with optimized measurements. We use the approach proposed in [17] to design measurement matrices optimized to the learned MLD and K-SVD dictionaries and compare their performance. Table 2 shows the MSE obtained for different images using the K-SVD dictionary and the MLD. The SNR of the measurement process is fixed at 0 dB, 15 dB and 25 dB, and we show the performance for N = 16 and N = 32 measurements per non-overlapping 8 × 8 patch. As can be clearly seen, the MLD outperforms the K-SVD dictionary when the number of measurements is small, at all noise levels. The performance of the two dictionaries becomes comparable as the number of measurements increases. The improvement in reconstruction with a small number of measurements, particularly of the edges, is apparent from the images shown in Figure 4.

[Fig. 4. Reconstructed images from N = 16 compressive measurements (SNR of measurement process = 25 dB): (a) K-SVD dictionary (MSE = 175.4), (b) multilevel dictionary (MSE = 61.1).]
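A hedged sketch of the patch-level sensing and recovery pipeline: the optimized measurement design of [17] is not reproduced here, so a random Gaussian matrix stands in for it, and recovery uses a plain greedy pursuit on the projected dictionary ΦΨ rather than the level-windowed pursuit of Section 3.3; the function name and defaults are ours.

```python
import numpy as np

def sense_and_recover(x, Psi, N=16, snr_db=25, sparsity=4, seed=0):
    """Compressively sense one vectorized 8x8 patch x and recover it with a greedy pursuit."""
    rng = np.random.default_rng(seed)
    M = Psi.shape[0]
    Phi = rng.standard_normal((N, M)) / np.sqrt(N)   # random stand-in for the optimized matrix of [17]
    z_clean = Phi @ x
    noise_std = np.linalg.norm(z_clean) / np.sqrt(N) * 10 ** (-snr_db / 20)
    z = z_clean + noise_std * rng.standard_normal(N)  # noisy measurements at the chosen SNR
    B = Phi @ Psi                                     # effective dictionary in the measurement domain
    scale = np.linalg.norm(B, axis=0) + 1e-12
    D = B / scale                                     # unit-norm columns for the pursuit
    a = np.zeros(Psi.shape[1])
    r = z.astype(float).copy()
    for _ in range(sparsity):                         # plain matching pursuit on the measurements
        corr = D.T @ r
        j = int(np.argmax(np.abs(corr)))
        a[j] += corr[j]
        r -= corr[j] * D[:, j]
    return Psi @ (a / scale)                          # synthesize the patch estimate with Psi
```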

5. CONCLUSIONS

We presented a multilevel dictionary learning algorithm that exploits the energy hierarchy found in natural image features to learn efficient global dictionaries. The proposed setup employs the K-hyperline clustering algorithm to learn the features for each level of the structured dictionary. The initial results in representation and compressive sensing applications show that exploiting the energy hierarchy is particularly useful when a small number of non-zero coefficients is used. Future extensions of this work include proving the convergence and stability properties of the MLD learning algorithm.

6. REFERENCES

[1] D. J. Field, "What is the goal of sensory coding?," Neural Computation, vol. 6, pp. 559–601, 1994.
[2] M. Aharon et al., "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.
[3] M. Aharon and M. Elad, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[4] J. J. Thiagarajan et al., "Sparse representations for pattern classification using learned dictionaries," Proceedings of the Twenty-eighth SGAI International Conference on Artificial Intelligence, 2008.
[5] B. D. Rao and Y. Bresler, "Signal processing with sparseness constraints," in Proceedings of IEEE ICASSP, Seattle, 1998.
[6] K. Kreutz-Delgado et al., "Dictionary learning algorithms for sparse representation," Neural Computation, vol. 15, no. 2, pp. 349–396, 2003.
[7] Z. He et al., "K-hyperline clustering learning for sparse component analysis," Signal Processing, vol. 89, pp. 1011–1022, 2009.


[8] J. J. Thiagarajan et al., "Shift-invariant sparse representation of images using learned dictionaries," Proceedings of the IEEE Workshop on MLSP, pp. 145–150, 2008.
[9] J. J. Thiagarajan et al., "Optimality and stability of the K-hyperline clustering algorithm," Pattern Recognition Letters (submitted), 2010.
[10] J. M. Buhmann, "Empirical risk approximation: An induction principle for unsupervised learning," Tech. Rep. IAI-TR-98-3, The University of Bonn, 1998.
[11] A. Caponnetto and A. Rakhlin, "Stability properties of empirical risk minimization over Donsker classes," Journal of Machine Learning Research, vol. 7, pp. 2565–2583, 2006.
[12] A. Rakhlin and A. Caponnetto, "Stability of K-means clustering," in Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman, Eds., Cambridge, MA, 2007, vol. 19, MIT Press.
[13] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, 1992.
[14] "Berkeley segmentation dataset," Available at http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/.
[15] "K-SVD MATLAB toolbox," Available at http://www.cs.technion.ac.il/~elad/software/.
[16] S. Zhu et al., "Learning explicit and implicit visual manifolds by information projection," Pattern Recognition Letters, vol. 31, pp. 667–685, 2010.
[17] J. M. Duarte-Carvajalino et al., "Learning to sense sparse signals: simultaneous sensing matrix and sparsifying dictionary optimization," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1395–1408, 2009.