HYPERSPECTRAL IMAGE CLASSIFICATION BASED ON DIRICHLET PROCESS MIXTURE MODELS

Hao Wu, Saurabh Prasad, Minshan Cui, Nam Tuan Nguyen, Zhu Han

Department of Electrical and Computer Engineering, University of Houston, Houston, TX, USA

ABSTRACT

In this work, we propose a new density estimation method for hyperspectral image data based on Dirichlet process Gaussian mixture models (also known as infinite Gaussian mixture models, IGMMs), which captures the complex multi-modal (potentially non-Gaussian) statistical structure of hyperspectral data. The resulting mixture model is then applied to the classification problem. This IGMM-based approach is a non-parametric Bayesian method that circumvents the model selection problem, which is unavoidable and often difficult when employing traditional parametric Gaussian mixture models (GMMs). Gibbs sampling is employed to infer the model parameters. As a preprocessing step, we use Local Fisher's Discriminant Analysis (LFDA) for dimensionality reduction, since it preserves the multi-modal, non-Gaussian structure of hyperspectral data while substantially reducing the computational cost. We compare the proposed IGMM-based classification method with existing state-of-the-art classification methods on popular hyperspectral imagery datasets. The experimental results show that the proposed LFDA-IGMM method performs on par with LFDA-GMM (sometimes outperforming it), and that both outperform the other commonly used classification approaches when a sufficient number of training samples is available.

Index Terms— GMM, Dirichlet Process Mixture Model, IGMM, Gibbs Sampler, LFDA

1. INTRODUCTION

Hyperspectral imagery (HSI) provides rich information captured over a wide range of the electromagnetic spectrum, usually hundreds of bands, for each pixel in remotely sensed images. This abundant spectral information yields more precise measurements and allows very accurate characterization of the materials present in the observed scene. It is therefore better suited for land-cover classification than natural or multispectral images, which have far fewer bands (often fewer than 10). Although HSI has many attractive properties, the classification of hyperspectral images under the Bayesian framework remains a very challenging problem due to the high dimensionality of the data and its complex statistical distributions. Much research on this problem has been conducted during the past decade. The high-dimensional feature space of hyperspectral data often leads to statistical ill-conditioning and requires a large number of training samples. Thus, a variety of dimensionality reduction methods such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and their variants such as Regularized LDA and Kernel PCA are frequently employed as a preprocessing step for HSI classification. In order to preserve the relevant statistical information of the hyperspectral data, we employ Local Fisher's Discriminant Analysis (LFDA) for dimensionality reduction, since it has been shown to preserve the statistical structure of multi-modal and non-Gaussian data [1]. To capture the complex local statistical structure of hyperspectral data, a Gaussian mixture model fits the data well, as also illustrated in [1]. When using a GMM-based classifier, the traditional Expectation Maximization (EM) algorithm [2] is employed to iteratively estimate the parameters (mixture weights, mean vectors, and covariance matrices) of the Gaussian components. However, EM cannot estimate the number of components in the mixture model; determining it is a difficult task, often referred to as the model selection problem. Thus, before running EM, we first need to estimate the number of mixture components. A typical approach for determining the number of modes/components is to employ the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) [1, 3]. Although LFDA-GMM achieves good classification performance, the estimated number of mixture components may be inaccurate, and the approach requires a tuning process in which the number of modes (mixtures) is learned by observing AIC or BIC as a function of the number of modes over a range of suitable values. To address this problem, we introduce the Dirichlet process mixture model, a non-parametric Bayesian method, which circumvents model selection by assuming a potentially infinite number of components in the mixture model. Note that only a finite number of components can be detected from a finite number of observations. From this infinite Gaussian mixture model, we can update the posterior distribution of the desired parameters. Then, using the Gibbs sampling method, we can sample the parameters and indicators (class labels) for each observation from their posterior distributions. Upon convergence, these samples approximate the posterior distribution accurately [4].

The remainder of this paper is organized as follows. In Section 2, we describe the infinite Gaussian mixture model, and in Section 3 we explain how a Gibbs sampler is used to infer its parameters. In Section 4, we review the dimensionality reduction algorithm, LFDA. In Section 5, we describe the datasets and present the classification results of the different algorithms under different conditions. We conclude by summarizing our results in Section 6.
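To make the model-selection step concrete, the following is a minimal sketch, not the authors' code, of choosing the number of GMM components by scanning an information criterion. It uses scikit-learn's GaussianMixture on a small synthetic stand-in for LFDA-projected pixels; the function name, candidate range, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_components(features, max_components=10, criterion="bic", seed=0):
    """Fit GMMs with 1..max_components modes and keep the one that
    minimizes the chosen information criterion (BIC or AIC)."""
    best_model, best_score = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(features)
        score = gmm.bic(features) if criterion == "bic" else gmm.aic(features)
        if score < best_score:
            best_model, best_score = gmm, score
    return best_model

# Illustrative usage on synthetic three-mode data standing in for projected pixels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (-2.0, 0.0, 2.0)])
print(select_num_components(X).n_components)
```

This is exactly the tuning loop that the non-parametric approach below is designed to remove.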


2. GENERATIVE MODEL: INFINITE GAUSSIAN MIXTURE MODEL

For hyperspectral data, let $\{x_i\}_{i=1}^{N}$ be the $N$ pixels in the image and $c_i$ the corresponding class labels. The hyperspectral data can be modeled as being generated from a Gaussian mixture model, described as

c_i \mid \vec{\pi} \sim \mathrm{Multinomial}(\cdot \mid \vec{\pi}),   (1)
x_i \mid c_i = k \sim \mathrm{Gaussian}(\cdot \mid \vec{\theta}_k),   (2)

where $\vec{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\vec{\theta}_k$ stands for $(\vec{\mu}_k, \Sigma_k)$, the mean vector and covariance matrix of each Gaussian component, respectively. Depending on whether we have prior knowledge of the number of mixtures, two kinds of generative model can be used to represent the data, as depicted in Fig. 1: the finite case and the infinite case.

Fig. 1. Gaussian mixture model: (a) the finite case (FGMM) and (b) the infinite case (IGMM).

A. Finite Gaussian Mixture Model

The parameters $\vec{\pi}$ and $\vec{\theta}_k$ are defined as follows:

\vec{\pi} \mid \alpha \sim \mathrm{Dirichlet}\!\left(\cdot \mid \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right), \qquad \vec{\theta}_k \sim H.   (3)

The concentration parameter of the Dirichlet distribution, $\alpha$, encodes our beliefs about how uniform or skewed the mixture weights will be [5]. In this case, since we know the exact number of mixtures $K$, we can use a simple EM method to iteratively compute the values of the parameters $\pi_k$ and $\vec{\theta}_k$ for $k = 1, \ldots, K$.

B. Infinite Gaussian Mixture Model

The IGMM is an extension of the FGMM obtained by letting $K \to \infty$, which gives the following model:

\vec{\pi} \mid \alpha \sim \mathrm{Stick}(\alpha), \qquad \vec{\theta}_k \sim H,   (4)

where $\vec{\pi} \mid \alpha \sim \mathrm{Stick}(\alpha)$ stands for

\beta_k \sim \mathrm{Beta}(1, \alpha); \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \quad k \to \infty,   (5)

and $\vec{\theta}_k \sim H$ is shorthand for

\vec{\mu}_k \sim \mathrm{Gaussian}(\vec{\mu}_0, \Sigma_k / \kappa_0), \qquad \Sigma_k \sim \mathrm{InverseWishart}_{v_0}(\Lambda_0^{-1}).   (6)

Here, for mathematical convenience, the Gaussian and Inverse-Wishart distributions above are chosen [6] because they are conjugate priors for the Gaussian distribution, which is instrumental in deriving a closed-form solution for the posterior distributions of $C = \{c_i\}_{i=1}^{N}$ and $\Theta = \{\vec{\theta}_k\}_{k=1}^{K}$ when $K \to \infty$.
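For intuition, the stick-breaking construction (5) and the conjugate base measure (6) are straightforward to simulate. The sketch below truncates the infinite process at a finite number of sticks purely for illustration; the function name, hyperparameter defaults, and truncation level are our own illustrative choices, not part of the model or the authors' code.

```python
import numpy as np
from scipy.stats import invwishart

def sample_igmm_prior(alpha=1.0, dim=3, k_max=50, seed=0,
                      mu0=None, kappa0=1.0, nu0=None, lambda0=None):
    """Draw mixture weights via truncated stick-breaking (eq. 5) and
    component parameters from the Normal/Inverse-Wishart base measure (eq. 6)."""
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(dim) if mu0 is None else mu0
    nu0 = dim + 2 if nu0 is None else nu0
    lambda0 = np.eye(dim) if lambda0 is None else lambda0

    beta = rng.beta(1.0, alpha, size=k_max)               # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * remaining                                  # pi_k = beta_k * prod_{l<k}(1 - beta_l)

    components = []
    for k in range(k_max):
        Sigma_k = invwishart.rvs(df=nu0, scale=lambda0, random_state=seed + k)
        mu_k = rng.multivariate_normal(mu0, Sigma_k / kappa0)
        components.append((pi[k], mu_k, Sigma_k))
    return components

weights = [w for w, _, _ in sample_igmm_prior()]
print(f"probability mass captured by 50 sticks: {sum(weights):.4f}")
```

With a moderate concentration parameter, almost all of the mixture weight is captured by the first few dozen sticks, which is why only a finite number of components is ever instantiated in practice.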

3. INFERENCE MODEL: GIBBS SAMPLING

In the inference model, given the observations $\{x_i\}_{i=1}^{N}$, our goal is to infer the parameters of the generative model from their posterior probabilities, which cannot be computed analytically. Instead, we obtain samples of them [7] using a Gibbs sampler, a widely used Markov Chain Monte Carlo (MCMC) method that iteratively samples a new value for each variable conditioned on the current values of all other variables [8]. Upon convergence, these samples approximate the posterior distribution precisely. Under the IGMM, the posterior distribution of the unknown parameters is given by [5]

P(C, \Theta, \vec{\pi}, \alpha \mid \vec{x}) \propto P(\vec{x} \mid C, \Theta)\, P(\Theta \mid H) \left[ \prod_{i=1}^{N} P(c_i \mid \vec{\pi}) \right] P(\vec{\pi} \mid \alpha)\, P(\alpha).   (7)

Our goal is to find $\{C, \Theta\}$. Based on the generative model in (4), (5) and (6), we can integrate out $\vec{\pi}$ and obtain the posterior distributions for $\Theta$ and $C$ (for more details see [4]).


The posterior distribution for $\Theta$ is

P(\vec{\theta}_k \mid C, \vec{x}, \Theta_{-k}, \vec{\pi}, \alpha) \propto \prod_{i:\, c_i = k} P(\vec{x}_i \mid c_i, \vec{\theta}_k)\, P_H(\vec{\theta}_k),   (8)

where $\Theta_{-k} = \{\vec{\theta}_1, \ldots, \vec{\theta}_{k-1}, \vec{\theta}_{k+1}, \ldots, \vec{\theta}_K\}$ and $P_H(\vec{\theta}_k)$ is the probability of $\vec{\theta}_k$ under $H$. The posterior distribution of $C$ is

P(c_i = k \mid C_{-i}, \vec{x}, \Theta, \vec{\pi}, \alpha) \propto P(\vec{x}_i \mid c_i, \Theta)\, P(c_i = k \mid C_{-i}, \alpha),   (9)

where the prior is

P(c_i = k \mid C_{-i}, \alpha) = \begin{cases} \dfrac{n_{k,-i}}{N-1+\alpha}, & c_i \in C_{-i}, \\ \dfrac{\alpha}{N-1+\alpha}, & c_i \notin C_{-i}. \end{cases}   (10)

Here $C_{-i} = \{c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N\}$ refers to all indicators except the $i$th, and $n_{k,-i}$ is the number of observations assigned to the $k$th cluster, excluding the $i$th observation $\vec{x}_i$. The generative process in (10) is known as the Chinese restaurant process [9]: it assigns an existing label ($c_i \in C_{-i}$) to the current pixel with probability $\frac{n_{k,-i}}{N-1+\alpha}$, and a new label ($c_i \notin C_{-i}$) with probability $\frac{\alpha}{N-1+\alpha}$. Following the above discussion, we sample new values for $\Theta$ and $C$ according to (8) and (9) iteratively until convergence. The Gibbs sampler we used is the Matlab implementation provided by Wood et al. [5] (http://www.gatsby.ucl.ac.uk/~fwood/code.html).
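As a rough illustration of one sweep of the indicator updates in (9)-(10), the sketch below conditions on the current component parameters and, for the new-cluster case, scores the pixel against a single fresh draw from the prior, a crude stand-in (in the spirit of auxiliary-variable samplers [8]) for the closed-form marginal available under the conjugate prior (6). All names and the toy usage are illustrative; this is not the Matlab sampler used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_assignments(X, c, params, alpha, draw_from_prior, rng):
    """One Gibbs sweep over the indicators c following (9)-(10).
    `params` maps cluster id -> (mu, Sigma); `draw_from_prior()` returns a
    fresh (mu, Sigma) drawn from H, used to score a potential new cluster."""
    N = X.shape[0]
    for i in range(N):
        counts = {k: np.sum(c == k) - (c[i] == k) for k in params}   # n_{k,-i}
        for k in [k for k, n in counts.items() if n == 0]:           # drop emptied clusters
            del params[k]; del counts[k]
        labels = list(params)
        fresh = draw_from_prior()
        prior = np.array([counts[k] for k in labels] + [alpha]) / (N - 1 + alpha)
        like = np.array([multivariate_normal.pdf(X[i], *params[k]) for k in labels]
                        + [multivariate_normal.pdf(X[i], *fresh)])
        p = prior * like
        choice = rng.choice(len(p), p=p / p.sum())
        if choice == len(labels):                                    # open a new cluster
            new_label = max(labels, default=-1) + 1
            params[new_label] = fresh
            c[i] = new_label
        else:
            c[i] = labels[choice]
    return c, params

# Toy usage with a fixed, hypothetical prior-draw routine.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
c = np.zeros(100, dtype=int)
params = {0: (X.mean(axis=0), np.cov(X.T))}
prior = lambda: (rng.normal(size=2), np.eye(2))
c, params = sample_assignments(X, c, params, alpha=1.0, draw_from_prior=prior, rng=rng)
```

Alternating this assignment sweep with per-cluster parameter updates following (8) yields the full sampler.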

4. LOCAL FISHER'S DISCRIMINANT ANALYSIS

As stated in Section 1, densely sampled HSI data is of very high dimension, which often results in high correlation between different bands. Thus, dimensionality reduction is a critical preprocessing step for HSI analysis. LFDA, an extension of LDA, avoids LDA's restrictive assumption of homoscedastic Gaussian class-conditional distributions, whereas the spectral feature space of HSI data may be non-Gaussian or even significantly multi-modal. In previous work, LFDA coupled with traditional GMMs [1] resulted in very effective classification of HSI data. In this work, LFDA also serves as a useful preprocessing step for the IGMM classifier, owing to its ability to yield good between-class separation in the projection while preserving the within-class local structure, i.e., neighboring data points in the original space remain neighbors after the projection and vice versa. Refer to [1, 10] for further details of LFDA and its application to hyperspectral classification.
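To make the projection step concrete, here is a compact, unoptimized sketch of LFDA in the form described in [10]: local-scaling affinities, local within- and between-class scatter matrices, and a generalized eigenvalue problem. It is an illustrative reimplementation, not the code used in the experiments; the `knn` default and the small regularizer added to the within-class scatter are arbitrary choices.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda_projection(X, y, n_components, knn=7):
    """Return a d x n_components projection matrix T; embed data as X @ T."""
    n, d = X.shape
    Ww = np.zeros((n, n))                  # local within-class weights
    Wb = np.full((n, n), 1.0 / n)          # between-class weights (different-class pairs)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        nc = idx.size
        D = cdist(X[idx], X[idx])
        k = min(knn, nc - 1)
        sigma = np.sort(D, axis=1)[:, k] if k > 0 else np.ones(nc)
        A = np.exp(-(D ** 2) / (np.outer(sigma, sigma) + 1e-12))   # local-scaling affinity
        Ww[np.ix_(idx, idx)] = A / nc
        Wb[np.ix_(idx, idx)] = A * (1.0 / n - 1.0 / nc)
    def scatter(W):
        # 0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T  =  X^T (diag(W 1) - W) X
        return X.T @ (np.diag(W.sum(axis=1)) - W) @ X
    Sb, Sw = scatter(Wb), scatter(Ww)
    # generalized eigenproblem Sb v = lambda Sw v; keep the leading eigenvectors
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order]
```

The local weighting is what distinguishes LFDA from LDA: distant same-class pairs receive little weight, so multi-modal classes are not forced to collapse into a single cluster in the projected space.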

5. EXPERIMENTAL SETUP AND RESULTS

In this section, the classification results of our proposed algorithm, which we call LFDA-IGMM, are compared with baseline algorithms including LDA-ML (LDA-based projection followed by a quadratic Gaussian maximum-likelihood classifier), support vector machines (SVM), and LFDA-GMM. The hyperspectral datasets employed in this work are the "Indian Pines", "University of Pavia", and "Pavia Center" datasets. The "Indian Pines" dataset was acquired using NASA's AVIRIS sensor and was collected over northwest Indiana's Indian Pine test site in June 1992 (ftp://ftp.ecn.purdue.edu/biehl/MultiSpec). The image represents a vegetation-classification scenario with 145 × 145 pixels and 220 bands in the 0.4- to 2.45-µm region of the visible and infrared spectrum, with a spatial resolution of 20 m. Of the 16 land-cover classes in the image, 7 are discarded due to an insufficient number of training samples. The "University of Pavia" and "Pavia Center" datasets were collected using the Reflective Optics System Imaging Spectrometer (ROSIS) sensor [11]. The images have 115 spectral bands covering 0.43 to 0.86 µm and a spatial resolution of 1.3 m, and contain 14 classes. For all GMM experiments, we evaluated the AIC as a function of the number of mixtures to determine the optimal number of mixtures. Tables 1, 2 and 3 show the overall classification accuracy of the different algorithms versus the number of training samples employed per class for the three datasets. For "Indian Pines", all data except the training data are used as testing samples. For the other two datasets, 500 samples are randomly chosen from the remaining data as testing samples.

Table 1. Indian Pines: Overall accuracy (%) for different numbers of training samples per class

  Algorithms    40      70      100     150
  LDA-ML        60.98   69.78   74.65   77.02
  SVM           71.15   76.81   78.81   81.9
  LFDA-GMM      70.01   73.42   76.08   77.11
  LFDA-IGMM     69.2    73.64   77.6    80.47

Table 2. University of Pavia: Overall accuracy (%) for different numbers of training samples per class

  Algorithms    40      70      100     150
  LDA-ML        74.57   81.62   83.24   86.05
  SVM           84.28   84.58   88.13   89.9
  LFDA-GMM      82.94   88.7    90.6    91.47
  LFDA-IGMM     83.07   88.44   90.55   92.23

Table 3. Pavia Center: Overall accuracy (%) for different numbers of training samples per class

  Algorithms    40      70      100     150
  LDA-ML        87.13   91.58   92.31   93.52
  SVM           91.25   93.88   94.72   95.55
  LFDA-GMM      91.5    94.26   95.28   95.98
  LFDA-IGMM     91.02   93.71   95.44   96.42
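For reference, the decision rule behind the mixture-based classifiers compared above is a class-conditional Bayes rule: fit one mixture density per class on the projected training pixels and assign each test pixel to the class with the highest prior-weighted likelihood. The sketch below illustrates this rule with scikit-learn's GaussianMixture standing in for the per-class density model; it shows the decision rule only and is not the authors' experimental code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_conditional_gmms(Z_train, y_train, n_components=3):
    """Fit one GMM per class on projected features Z_train.
    With AIC-tuned or IGMM-inferred models, n_components would vary per class."""
    models, priors = {}, {}
    for c in np.unique(y_train):
        Zc = Z_train[y_train == c]
        models[c] = GaussianMixture(n_components=n_components,
                                    covariance_type="full", random_state=0).fit(Zc)
        priors[c] = len(Zc) / len(Z_train)
    return models, priors

def classify(Z_test, models, priors):
    """Assign each pixel to argmax_c [ log p(z | c) + log P(c) ]."""
    classes = sorted(models)
    scores = np.column_stack([models[c].score_samples(Z_test) + np.log(priors[c])
                              for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```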

The proposed LFDA-IGMM has performance comparable to LFDA-GMM (sometimes outperforming it) and provides an important benefit: one degree of freedom, the number of mixtures, has been removed from the statistical learning process for hyperspectral classification. Both methods outperform the other methods when the number of training samples is sufficiently large. For visualization, Fig. 2 displays the Gaussian mixture models obtained by our algorithm.

Fig. 2. Gaussian mixture models of the 9 classes in "Indian Pines" for 300 training samples per class. The number of mixture components for each class is: 4, 5, 4, 4, 4, 6, 5, 5, 5.

6. CONCLUSIONS

In this paper, we presented a classification framework based on LFDA, a locality-preserving dimensionality reduction approach, and IGMM, a non-parametric Bayesian classifier. The IGMM is more accurate and easier to implement than the GMM for building class-conditional models of hyperspectral imagery. LFDA also proved to be an effective dimensionality reduction method for preprocessing our data. Experimental results show that the proposed LFDA-IGMM algorithm for hyperspectral image classification outperforms the other algorithms when the training sample size is relatively large.

7. REFERENCES

[1] W. Li, S. Prasad, J. E. Fowler, and L. M. Bruce, "Locality-preserving dimensionality reduction and classification for hyperspectral image analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 4, pp. 1185–1198, Apr. 2012.

[2] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, no. 1, pp. 77–87, 2002.

[3] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, Mar. 1978.

[4] C. E. Rasmussen, "The infinite Gaussian mixture model," in Advances in Neural Information Processing Systems, vol. 12, 2000.

[5] F. Wood, S. Goldwater, and M. J. Black, "A non-parametric Bayesian approach to spike sorting," in Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2006, pp. 1165–1168.

[6] N. T. Nguyen, R. Zheng, and Z. Han, "On identifying primary user emulation attacks in cognitive radio systems using nonparametric Bayesian classification," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1432–1445, Mar. 2012.

[7] R. M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," University of Toronto, Tech. Rep. CRG-TR-93-1, 1993.

[8] R. M. Neal, "Markov chain sampling methods for Dirichlet process mixture models," Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249–265, Jun. 2000.

[9] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," Gatsby Computational Neuroscience Unit, Tech. Rep. 2005-001, May 2005.

[10] M. Sugiyama, "Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis," The Journal of Machine Learning Research, vol. 8, pp. 1027–1061, May 2007.

[11] P. Gamba, "A collection of data for urban area characterization," in Proc. IEEE International Geoscience and Remote Sensing Symposium, 2004, vol. 1.
