
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 130

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 10, NO. 1, JANUARY 2013

Unsupervised Satellite Image Classification Using Markov Field Topic Model

Kan Xu, Wen Yang, Member, IEEE, Gang Liu, and Hong Sun, Member, IEEE

Abstract—Recently, the combination of topic models and random fields has been frequently and successfully applied to image classification because of their complementary effects. However, the number of classes usually has to be assigned manually. This letter presents an efficient unsupervised semantic classification method for high-resolution satellite images. We add a label cost, which penalizes a solution according to the set of labels that appear in it during energy optimization, to the random fields of latent topics, and we propose an iterative algorithm that makes the number of classes converge to an appropriate level. Compared with the other classification algorithms considered, our method not only obtains accurate semantic segmentation results for larger scale structures but also automatically determines the number of segments. The experimental results on several scenes demonstrate its effectiveness and robustness.

Index Terms—Label cost, Markov random field (MRF), satellite image, topic model, unsupervised classification.

Manuscript received January 15, 2012; revised February 28, 2012; accepted March 25, 2012. This work was supported in part by the National Natural Science Foundation of China under Grant 40801183 and in part by the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing Special Research Funding. The authors are with the Signal Processing Laboratory, School of Electronic Information, and the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LGRS.2012.2194770

I. INTRODUCTION

Partitioning satellite images into semantically meaningful regions, namely, classification with consistent semantics, has played an important role in recent years. However, manual interpretation requires extensive and expensive human effort. Thus, automatic and efficient extraction of image semantics has become one of the most challenging problems in remote sensing applications. To extract the content of satellite images efficiently, many clustering algorithms based directly on image features have been proposed. However, low-level features cannot precisely represent the semantics of images. Hence, the relationship between low-level features and image semantics has recently become a central issue, and many studies have focused on mapping low-level features to high-level semantics and narrowing the gap between them.

Some authors argue that topic models, such as probabilistic latent semantic analysis (pLSA) [1] and latent Dirichlet allocation (LDA) [2], which were originally developed for topic discovery in the text domain, are well suited to this kind of work. In topic models, features are modeled as “visual words” by vector quantization, and images are regarded as documents and modeled as mixtures of latent topics. The classification or segmentation results derived from them therefore rely more on content coherence. In addition, efficient computation based on approximate inference maps the high-dimensional feature counts into low-dimensional topic vectors, which makes topic models attractive. It has been shown that representing images by topic mixtures outperforms using low-level features [3], and it has been reported that the annotation of large satellite images can benefit from topic models [4].

Despite the successful application and impressive performance of topic models, they suffer from poor spatial coherence because of the independence assumption on visual words and images. A random field model such as the Markov random field (MRF) has been employed to solve this problem, since it enforces spatial consistency between neighboring regions in an image. In [5] and [6], the authors defined an MRF over the hidden topic assignments obtained by pLSA to describe the spatial relationship of latent topics. The experimental results in supervised and weakly supervised settings demonstrate that the two models are complementary and that the segmentation and recognition accuracy is clearly improved. However, these methods were proposed for supervised image classification. Furthermore, the number of classes, which has a significant influence on the classification result, is usually specified from prior knowledge or an educated guess in most classification algorithms. Moreover, because of their high resolution, satellite images carry relatively rich information, and the semantics of large structures cannot be represented properly by topic models alone.
In this letter, we extend the combination of topic models and MRFs to unsupervised satellite image classification. More precisely, we employ an MRF prior over latent topic labels, which are obtained by LDA, to enhance spatial coherence. Furthermore, the label cost introduced in [7], which generalizes well and makes its own contribution to image segmentation, is added to the random fields of latent topics. Automatically segmenting an image into coherent parts has always been an important issue in segmentation. The minimum description length criterion was first proposed in [8] for unsupervised segmentation, to represent the image more compactly. In [7], the authors pointed out that using α-expansion can be more powerful, because the segmentation based on such an algorithm relies on contour evolution and explicit

1545-598X/$31.00 © 2012 IEEE


merging of adjacent regions. Based on the label cost and the Bayesian information criterion (BIC), we introduce an iterative algorithm over latent topics, through which the number of classes eventually converges to an appropriate value. The whole process works automatically, instead of fixing the number of classes beforehand as a constant. Meanwhile, because of the global coupling provided by the label cost, the consistency of semantics is well preserved and oversmoothing is avoided. Our experimental results show that, based on the inferred number of segments, the proposed method outperforms the other methods considered. Moreover, the efficiency of semantics extraction is also demonstrated.

The rest of this letter is organized as follows. In Section II, we review previous and related works. In Section III, we describe our approach in detail. In Section IV, we evaluate the experimental results qualitatively and quantitatively on four scenes of satellite images. Finally, we conclude in Section V.

II. PREVIOUS WORK

This section briefly summarizes topic models and random fields. First, we give an overview of the LDA model. Then, we introduce the MRF model.

A. LDA Model

As explained in [4], LDA was originally developed for text document modeling and is a generative probabilistic model for collections of discrete data. Contrary to the pLSA model, LDA makes it possible to assign probability to documents outside the training corpus and can easily generalize to new documents. It assumes that each document is a mixture of latent topics and that each topic is a distribution over words. The topic assignments of the words in a document are assumed to be conditionally independent of each other. In LDA, each document is a sequence of $N$ words $w_n$, denoted by $w = \{w_1, w_2, \ldots, w_N\}$.
Because this generative model is transferred from the text domain to the image domain, some terminology needs to be defined: the corpus $D$ denotes the image data set, documents correspond to subimages, and a word $w_n$ corresponds to a patch of a subimage. Each document is thereby regarded as a sequence of $N$ visual words. Its generative process is as follows.
1) Choose a $K$-dimensional Dirichlet random variable $\theta \sim \mathrm{Dir}(\alpha)$; here, $K$ denotes the number of topics.
2) For each of the $N$ words $w_n$:
   a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b) Choose a word $w_n$ from $P(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$ with probability matrix $\beta$.

According to the aforementioned assumptions, the likelihood of this model can be written as

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d. \tag{1}$$
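The generative process in steps 1), 2a), and 2b) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `generate_document`, the number of topics `K = 5`, the symmetric Dirichlet parameter 0.5, and the randomly drawn topic-word matrix `beta` are all hypothetical choices. The vocabulary size (300) and the document length (100) follow the experimental setup described in Section IV.

```python
import numpy as np


def generate_document(alpha, beta, n_words, rng):
    """Sample one document (subimage) following steps 1), 2a), and 2b).

    alpha: (K,) Dirichlet parameter.
    beta:  (K, V) matrix whose row k is the word distribution of topic k.
    Returns the topic mixture theta, the topic assignments z, and the words w.
    """
    n_topics, vocab_size = beta.shape
    theta = rng.dirichlet(alpha)                      # step 1: theta ~ Dir(alpha)
    z = rng.choice(n_topics, size=n_words, p=theta)   # step 2a: z_n ~ Multinomial(theta)
    # step 2b: draw each word from the word distribution of its topic
    w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
    return theta, z, w


rng = np.random.default_rng(0)
K, V, N = 5, 300, 100                      # K is assumed; V and N follow Section IV
alpha = np.full(K, 0.5)                    # assumed symmetric Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=K)   # assumed random topic-word matrix
theta, z, w = generate_document(alpha, beta, N, rng)
print(theta.round(3), np.bincount(z, minlength=K))
```

Inference runs in the opposite direction: given only the words `w`, it estimates `theta` and `z`. That is the step that requires the approximate methods discussed in the text.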

In (1), $M$ is the number of documents in the corpus, and $\alpha$ and $\beta$ are hyperparameters. Unfortunately, (1) is intractable for exact parameter inference. The usual solution is variational inference or another approximate inference algorithm, such as Gibbs sampling.

B. MRF

The lack of spatial information is an obvious drawback of topic models, owing to the conditional independence assumption. Using an MRF to capture the local correlations between spatially adjacent neighbors is an effective way to solve this problem. The MRF prior over the nodes is as follows [5]:

$$P(Z \mid \theta_d) \propto \exp\!\left( \sum_{i} \log \theta_{d z_i} + \sum_{i \sim j} \phi(z_i, z_j) \right). \tag{2}$$

Here, $i \sim j$ denotes all pairs of spatially neighboring nodes $i$ and $j$. The terms $\phi(z_i, z_j)$ are edge potentials between neighboring nodes. As described in [5], these potentials are parameterized according to the Potts model

$$\phi(z_i, z_j) = \sigma \cdot [z_i = z_j]. \tag{3}$$
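To make (2) and (3) concrete, the following sketch evaluates the unnormalized log prior of a label grid. It is an illustration under assumptions rather than the method of [5]: the function name `log_mrf_prior`, the tiny 3 × 3 grids, the uniform two-topic mixture, and the value `sigma = 1.0` are hypothetical, and an eight-neighbor grid is assumed in which each unordered neighbor pair is counted once.

```python
import numpy as np


def log_mrf_prior(z, theta, sigma):
    """Unnormalized log of (2) with the Potts potentials of (3).

    z:     (H, W) integer grid of topic labels.
    theta: (K,) topic mixture of the document.
    sigma: Potts weight applied to every agreeing neighbor pair.
    """
    unary = np.log(theta[z]).sum()              # sum_i log theta_{d z_i}
    agree = 0
    agree += np.sum(z[:, :-1] == z[:, 1:])      # horizontal pairs
    agree += np.sum(z[:-1, :] == z[1:, :])      # vertical pairs
    agree += np.sum(z[:-1, :-1] == z[1:, 1:])   # diagonal pairs
    agree += np.sum(z[:-1, 1:] == z[1:, :-1])   # anti-diagonal pairs
    return unary + sigma * agree                # sum_{i~j} sigma * [z_i = z_j]


theta = np.array([0.5, 0.5])
smooth = np.zeros((3, 3), dtype=int)
checker = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(log_mrf_prior(smooth, theta, sigma=1.0))
print(log_mrf_prior(checker, theta, sigma=1.0))
```

Both grids have the same unary term, so the difference comes only from the pairwise term: the uniform grid has 20 agreeing neighbor pairs and the checkerboard has 8, which is why a positive weight favors the smoother labeling.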

The weight $\sigma$ in (3) is set empirically. When positive, it rewards configurations in which neighboring nodes have the same label; when negative, it penalizes them.

III. OUR APPROACH

In this section, we describe our approach. First, we introduce the LDA–MRF combination model. Second, we explain how the iterative algorithm, which is based on the label cost and BIC, works.

A. LDA–MRF Model

Similar to the work in [5] and as explained in Section II-B, we place an MRF prior with eight-neighbor connectivity over the latent topic labels, which are inferred by LDA. See Fig. 1 for the graphical representation.

B. Energy Optimization

Combining topic models with an MRF compensates for the loss of contextual information. However, local coupling alone cannot preserve the semantic consistency of large structures in satellite images, because, depending on the value of the smoothing factor in the MRF, the result is either oversegmented or oversmoothed. To balance these effects, the α-expansion algorithm was extended in [7] to incorporate label costs at each expansion; the energy is then the sum of three terms:

$$E(f) = \underbrace{\sum_{p \in P} D_p(f_p)}_{\text{data cost}} + \underbrace{\sum_{pq \in N} V_{pq}(f_p, f_q)}_{\text{smooth cost}} + \underbrace{\sum_{L \subseteq \mathcal{L}} h_L \cdot \delta_L(f)}_{\text{label cost}}. \tag{4}$$
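The three terms of (4) can be evaluated for a given labeling as in the sketch below. It only computes the energy and does not perform the α-expansion optimization of [7]. Several simplifications are assumptions made for brevity: the name `energy` is hypothetical, the smooth cost is a Potts penalty on a four-neighbor grid, and label costs are attached only to single labels (singleton subsets $L$), so the label cost is the sum of `label_cost[l]` over the labels that actually appear.

```python
import numpy as np


def energy(labels, data_cost, smooth_weight, label_cost):
    """Evaluate E(f) of (4) for one labeling f.

    labels:        (H, W) integer labeling f.
    data_cost:     (H, W, K) array with D_p(l) for every pixel p and label l.
    smooth_weight: Potts penalty for each pair of differing 4-neighbors (assumed form of V_pq).
    label_cost:    (K,) cost h_l paid once for each label that is used at least once.
    """
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    data = data_cost[rows, cols, labels].sum()              # data cost
    differing = (np.sum(labels[:, :-1] != labels[:, 1:])
                 + np.sum(labels[:-1, :] != labels[1:, :]))
    smooth = smooth_weight * differing                      # smooth cost
    used = np.unique(labels)                                # labels with delta_L(f) = 1
    label = label_cost[used].sum()                          # label cost
    return float(data + smooth + label)


data_cost = np.zeros((2, 2, 3))            # assumed: every label fits every pixel equally
label_cost = np.array([2.0, 2.0, 2.0])     # assumed constant per-label cost
one_label = np.zeros((2, 2), dtype=int)
two_labels = np.array([[0, 0], [1, 1]])
print(energy(one_label, data_cost, 1.0, label_cost))
print(energy(two_labels, data_cost, 1.0, label_cost))
```

With equal data costs, the labeling that uses a single label has the lower energy, because the second label adds both a label cost and a boundary penalty. This is the mechanism by which label costs push the number of classes down unless the data term justifies an extra label.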


Fig. 1. MRF prior over latent topic labels with eight-neighbor connectivity: w nodes are visual words, and z nodes are latent topics.

In (4), the indicator function $\delta_L(f)$ is defined on the label subset $L$ as

$$\delta_L(f) \overset{\text{def}}{=} \begin{cases} 1, & \exists p : f_p \in L \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

This extended algorithm optimizes the “smooth cost” and the “label cost” simultaneously, so that the quality of the approximate solution is kept at a certain level. The authors of [7] note that information criteria are commonly used in statistical model selection to avoid overfitting and to determine the order of the model over several iterations. Since information criteria prefer to explain the data with fewer and simpler models, they penalize overly complex models [9]. Here, BIC is preferred, and it can be written as

$$\min_{\theta} \; -2 \ln \Pr(X \mid \theta) + |\theta| \cdot \ln |P|. \tag{6}$$

Here, $\theta$ is a model, $\Pr(X \mid \theta)$ is the likelihood function, $|\theta|$ denotes the number of parameters in $\theta$ that can vary, and $|P|$ is the number of observations. Following BIC, the label costs can be scaled in some proportion (e.g., linearly) to the estimated number of observations per model. The number of models can then be determined by examining the convergence of the iterations.

In our approach, the label cost is added to the label optimization of the latent topic fields. At the same time, by combining the label cost with BIC, the number of segments in the image is determined automatically by iteration. In this way, our method not only adds global information to the topic model but also automatically obtains the number of segments. The iteration proceeds as follows.
1) Constrain the latent topics, which are obtained by LDA, with the MRF spatial prior, and set the initial number of latent topics (segments).
2) Optimize the resulting latent topic fields with the label cost. During the optimization, check whether the resulting fields satisfy BIC according to (6). If the minimum is reached, go to step 3); if not, i.e., the number of latent topics (segments) is still decreasing, continue the iteration.
3) Take the number of latent topics (segments) determined in step 2) and use the energy minimization algorithm to infer the final labels.

IV. EXPERIMENT

In this section, we present experimental results on four satellite scenes, all obtained from Google Earth. For generality, we use only pixel-based scale-invariant feature transform (SIFT) features. Each image, with a size of 800 × 800 pixels, is treated as a corpus and is divided into 6400 subimages of 10 × 10 pixels, which are regarded as documents. To build a visual vocabulary, we use k-means to quantize the SIFT descriptors into 300 clusters. The centroids are regarded as words, so each document accordingly has 100 words. In addition, the number of latent topics (segments) is initially set to 20.

We have evaluated the performance of the proposed method, the LDA–MRF method, k-means clustering, and iterative self-organizing data analysis technique (ISODATA) clustering [10] from two aspects. On the one hand, the results are evaluated qualitatively in terms of the semantics of the segmentation. On the other hand, they are evaluated quantitatively in terms of purity and entropy.

The segmentation results are shown in Fig. 2. The first column shows the original images. The second column shows the corresponding ground truth. The results of our method [algorithm (I)], the LDA–MRF method [algorithm (II)], k-means clustering [algorithm (III)], and ISODATA clustering [algorithm (IV)] are presented in the third to sixth columns, respectively. Note that, in algorithms (II) and (III), the number of classes must be fixed beforehand. Thus, we first use our method to determine the number of classes and set the same number in those two algorithms. Moreover, the smoothness coefficient of the MRF in algorithm (II) is set empirically. The initial number of classes in algorithm (IV) is set to 20.

The qualitative evaluation comes directly from Fig. 2. The image in the first row was acquired in Oberpfaffenhofen, Germany. The major objects include forest, grass, ground, building, and road. The forest area produced by algorithm (III) has many speckles, and the grass surrounding the building is not recognized. With algorithm (II), because of the smoothing of the MRF, the forest and grass are confused, and the grass surrounding the building is partially misclassified as building. With algorithm (IV), the number of classes converges to three, and there are also many speckles in the grass and forest.


Fig. 2. Classification results of four different scenes. The first column shows the original images. The second column shows the hand-labeled ground truth. The third to sixth columns show the classification results of [algorithm (I)] our method, [algorithm (II)] LDA–MRF method, [algorithm (III)] k-means clustering, and [algorithm (IV)] ISODATA clustering, respectively.

With algorithm (I), the number of classes converges to five, and the result appears more compact than those of the other algorithms. The forest, grass, and building are well separated from each other.

The image in the second row was acquired in Stuttgart, Germany. The survey area includes residential area, forest, road, and grass (two kinds). As in the first row, with algorithm (III), the forest and residential areas still contain many speckles, and the two kinds of grass are treated as one class. Algorithm (II) clearly improves the classification of forest, but the problems in extracting grass and residential area remain. With algorithm (IV), the number of classes converges to four; the two kinds of grass are successfully separated, but the forest and residential area are confused. With algorithm (I), the number of classes converges to five. The forest area and the two kinds of grass are clearly distinguished, and the contour of the residential area is also effectively extracted. Unfortunately, the narrow path near the residential area is not completely recognized.

The image in the third row was acquired in Beijing, China, and the major objects include bridge, river, road, grass, and ground. With algorithm (III), the bridge and road are confused with each other, and grass and river are also confused. With algorithm (II), the speckle is eliminated; however, the bridge and river are recognized as one class, and the confusion between grass and river remains. With algorithm (IV), the number of classes converges to four, and the result is very similar to that of k-means clustering. With algorithm (I), the number of classes converges to six, and the bridge, river, grass, and road are all effectively distinguished. The only flaw is that part of the road on the right side is misclassified as bridge.

The image in the last row was acquired in Tucson, AZ. The major objects include woodland, water, building, ground, and grass. The water area and building are well extracted by algorithm (III), probably because of their regular shapes. The speckle in the woodland and grass is similar to that in the previous experiments. Compared with algorithm (III), algorithm (II) extracts the grass and woodland effectively, without speckle noise. In particular, the small road running from the woodland to the grass is mostly classified correctly. However, the water area and part of the building are misclassified. With algorithm (IV), the number of classes converges to three; the water area is partially misclassified, and there are also many speckles in the woodland. With algorithm (I), the number of classes converges to five. The result locates each part accurately, particularly in distinguishing the water area from the ground. On the other hand, the small road detected by algorithm (II) is not correctly classified, and part of the ground is misclassified as building. Although the proposed method has a few shortcomings on this image, it still visibly outperforms the other algorithms.


The quantitative evaluation results are shown in Table I. The purity (higher is better) and entropy (lower is better) defined in [11] are employed. According to Rosenberg and Hirschberg [11], the clustering problem is treated as a mapping from each data point to its cluster assignment. The target partitions are referred to as classes, and only the hypothesized clusters are referred to as clusters. The two measures of homogeneity are defined as follows:

$$\text{Purity} = \sum_{r=1}^{k} \frac{1}{n} \max_{i}\left(n_r^i\right), \qquad \text{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right). \tag{7}$$

Here, $q$ is the number of classes, $k$ is the number of clusters, $n$ is the total number of data points, $n_r$ is the size of cluster $r$, and $n_r^i$ is the number of data points in class $i$ that are assigned to cluster $r$. As shown in Table I, the results from algorithm (I) significantly outperform those from the other two algorithms.

TABLE I
PURITY AND ENTROPY OF THE EXPERIMENTAL RESULTS

V. CONCLUSION

In this letter, an unsupervised semantic classification algorithm has been proposed for high-resolution satellite images. The iterative algorithm based on the label cost and BIC can automatically determine the number of classes in the classification. Moreover, it preserves the consistency of semantics as well. The evaluation on four scenes has shown that the proposed method achieves better classification performance.

REFERENCES

[1] W. Yi, H. Tang, and Y. Chen, “An object-oriented semantic clustering algorithm for high-resolution remote sensing images using the aspect model,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 3, pp. 522–526, May 2011.
[2] D. Larlus and F. Jurie, “Latent mixture vocabularies for object categorization and segmentation,” Image Vis. Comput., vol. 27, no. 5, pp. 523–534, Apr. 2009.
[3] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification via pLSA,” in Proc. Eur. Conf. Comput. Vis., Graz, Austria, 2006, vol. 4, pp. 517–530.
[4] M. Liénou, H. Maître, and M. Datcu, “Semantic annotation of satellite images using latent Dirichlet allocation,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 1, pp. 28–32, Jan. 2010.
[5] J. Verbeek and B. Triggs, “Region classification with Markov field aspect models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–8.
[6] W. Yang, D. Dai, J. Wu, and C. He, “Weakly supervised polarimetric SAR image classification with multi-modal Markov aspect model,” in Proc. ISPRS, TC VII Symp. (Part B), 100 Years ISPRS—Advancing Remote Sensing Science, Vienna, Austria, Jul. 5–7, 2010, pp. 669–673.
[7] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov, “Fast approximate energy minimization with label costs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 2173–2180.
[8] S. C. Zhu and A. L. Yuille, “Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 9, pp. 884–900, Sep. 1996.
[9] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, vol. 8. Cambridge, U.K.: Cambridge Univ. Press, 2003, p. 12.
[10] N. Memarsadeghi, D. M. Mount, N. S. Netanyahu, J. Le Moigne, and M. de Berg, “A fast implementation of the ISODATA clustering algorithm,” Int. J. Comput. Geom. Appl., vol. 17, no. 1, pp. 71–103, 2007.
[11] A. Rosenberg and J. Hirschberg, “V-measure: A conditional entropy-based external cluster evaluation measure,” in Proc. Joint Conf. EMNLP-CoNLL, 2007, pp. 410–420.