
Remote Sensing Image Scene Classification Using Multi-Scale Completed Local Binary Patterns and Fisher Vectors

Longhui Huang 1, Chen Chen 2, Wei Li 1,* and Qian Du 3

1 College of Information Science and Technology, Beijing University of Chemical Technology, 100029 Beijing, China; [email protected]
2 Department of Electrical Engineering, University of Texas at Dallas, Dallas, TX 75080, USA; [email protected]
3 Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA; [email protected]
* Correspondence: [email protected]; Tel.: +86-1814-6529-853

Academic Editors: Gonzalo Pajares Martinsanz, Xiaofeng Li and Prasad S. Thenkabail Received: 18 February 2016; Accepted: 30 May 2016; Published: 8 June 2016

Abstract: An effective remote sensing image scene classification approach using patch-based multi-scale completed local binary pattern (MS-CLBP) features and a Fisher vector (FV) is proposed. The approach extracts a set of local patch descriptors by partitioning an image and its multi-scale versions into dense patches and using the CLBP descriptor to characterize local rotation-invariant texture information. Then, Fisher vector encoding is used to encode the local patch descriptors (i.e., patch-based CLBP features) into a discriminative representation. To improve the discriminative power of the feature representation, multiple sets of parameters are used for CLBP to generate multiple FVs that are concatenated as the final representation for an image. A kernel-based extreme learning machine (KELM) is then employed for classification. The proposed method is extensively evaluated on two public benchmark remote sensing image datasets (i.e., the 21-class land-use dataset and the 19-class satellite scene dataset) and leads to superior classification performance (93.00% for the 21-class dataset, an improvement of approximately 3% over the state-of-the-art MS-CLBP, and 94.32% for the 19-class dataset, an improvement of approximately 1%).

Keywords: remote sensing image scene classification; completed local binary patterns; multi-scale analysis; Fisher vector; extreme learning machine

1. Introduction

Remote sensing is an effective tool for Earth observation and has been widely applied in surveying land-use and land-cover classifications and in monitoring their dynamic changes. With the improvement of spatial resolution, remote sensing images present more detailed information, such as spatial arrangement information and textural structures, which are of great help in recognizing different land-use and land-cover scene categories. The goal of image scene classification is to recognize the semantic categories of a given image based on some a priori knowledge. Due to intra-class variations and a wide range of illumination and scale changes, scene classification of high-resolution remote sensing images remains a challenging problem.

The last decade saw considerable efforts to employ computer vision techniques to classify aerial or satellite image scenes. The bag-of-visual-words (BOVW) model [1], one of the most popular approaches in image analysis and classification applications, provides an efficient approach to the problem of scene classification. The BOVW model, derived from document classification in text analysis, represents an image as a histogram of frequencies of a set of visual words by mapping the local features to a visual vocabulary.


The vocabulary is pre-established by clustering the local features extracted from a collection of images. The traditional BOVW model ignores spatial and structural information, which severely limits its descriptive ability. To overcome this issue, a spatial pyramid matching (SPM) framework was proposed in [2]. This approach partitions an image into sub-regions, computes a BOVW histogram for each sub-region, and then concatenates the histograms from all sub-regions to form the SPM representation of an image. However, SPM only considers the absolute spatial arrangement, and the resulting features are sensitive to rotation variations. Thus, a spatial co-occurrence kernel, which is general enough to characterize a variety of spatial arrangements, was proposed in [3] to capture both the absolute and relative spatial layout of an image. In [4], a multi-resolution representation was incorporated into the bag-of-features model, and two modalities of horizontal and vertical partitions were adopted to partition all resolution images into sub-regions to improve the SPM framework. In [5], a concentric circle-structured multi-scale BOVW model was presented to incorporate rotation-invariant spatial layout information into the original BOVW model. The aforementioned BOVW variants focus on capturing the spatial layout information of scene images. However, the rich texture and structure information in high-resolution remote sensing images has not been fully exploited, since they merely use scale-invariant feature transform (SIFT) [6] descriptors to capture local features.

There has also been a great effort to evaluate various features and combinations of features for scene classification. In [7], a local structural texture similarity descriptor was applied to image blocks to represent structural texture for aerial image classification. In [8], semantic classifications of aerial images based on Gabor and Gist descriptors [9] were evaluated individually. In [10], four types of features consisting of raw pixel intensity values, oriented filter responses, SIFT-based feature descriptors, and self-similarity were used within the framework of unsupervised feature learning. In [11], global features extracted using the enhanced Gabor texture descriptor (EGTD) and local features extracted using the SIFT descriptor were fused in a hierarchical approach to improve the performance of remote sensing image scene classification.

Recently, deep learning has received great attention. Different from the aforementioned BOVW model and its variants, which are considered mid-level representations, deep learning is an end-to-end feature learning method (e.g., from an image to a semantic label). Among deep learning-based networks, convolutional neural networks (CNNs) [12,13] may be the most popular for learning visual features in computer vision applications, such as remote sensing and large-scale visual recognition. Nogueira et al. [14] presented PatreoNet, which has the capability to learn specific spatial features from remote sensing images without any pre-processing step or descriptor evaluation. AlexNet, proposed by Krizhevsky et al. [15], was the first to employ non-saturating neurons, a GPU implementation of the convolution operation, and dropout to prevent overfitting. GoogLeNet [16] deepened the CNN architecture and utilized filters of different sizes at the same layer to reduce the number of parameters of the network.
However, CNNs have an intrinsic limitation, i.e., a complicated pre-training process to adjust the parameters. In [17], multi-scale completed local binary pattern (MS-CLBP) features were utilized for remote sensing image classification. The extracted features can be considered global features of an image. However, a global feature representation may not be able to characterize detailed structures and distinct objects. For example, some land-use and land-cover classes are defined mainly by individual objects, e.g., baseball fields and storage tanks.

In this paper, we propose a local feature representation method based on patch-based MS-CLBP, which can be used to extract features complementary to global features. Specifically, an image is partitioned into dense patches and the CLBP descriptor is applied to extract a set of local patch descriptors, which characterize the detailed local structure and texture information in high-resolution remote sensing images. Since the CLBP [18] operator is a gray-scale and rotation-invariant texture operator, the extracted local descriptors are robust to image rotation. Then, the Fisher kernel representation [19] is employed to encode the local descriptors into a discriminative representation (i.e., a Fisher vector (FV)). The FV describes patch descriptors by their deviation from a "universal" generative Gaussian mixture model (GMM).


To improve the discriminative power of the feature representation, multiple sets of parameters for the CLBP operator (i.e., MS-CLBP) are utilized to generate multiple FVs. The final representation for an image is achieved by concatenating all the FVs. For classification, the kernel-based extreme learning machine (KELM) [20] is adopted for its efficient computation and good classification performance.

There are two main contributions of this work. First, a local feature representation method using patch-based MS-CLBP features and FV is proposed. The MS-CLBP operator is applied to the partitioned dense regions to extract a set of local patch descriptors, and then the Fisher kernel representation is used to encode the local descriptors into a discriminative representation of remote sensing images. Second, the two implementations of MS-CLBP are combined into a unified framework to build a more powerful feature representation. The proposed local feature representation method is evaluated using two public benchmark remote sensing image datasets. The experimental results verify the effectiveness of the proposed method as compared to state-of-the-art algorithms.

The remainder of the paper is organized as follows. Section 2 presents related work, including CLBP and the Fisher vector. Section 3 describes the two implementations of MS-CLBP, patch-based MS-CLBP feature extraction, and the details of the proposed feature representation method. Section 4 provides the experimental results. Finally, Section 5 concludes the paper.

2. Related Works

2.1. Completed Local Binary Patterns

Local binary patterns (LBP) [21,22] are an effective measure of the spatial structure information of local image texture. Consider a center pixel with gray value t_c. Its neighboring pixels are equally spaced on a circle of radius r centered at the location of t_c. If the coordinates of t_c are (0, 0) and m neighbors {t_i}_{i=0}^{m-1} are considered, the coordinates of t_i are (-r sin(2πi/m), r cos(2πi/m)). The LBP is calculated by thresholding the neighbors {t_i}_{i=0}^{m-1} with the center pixel t_c to generate an m-bit binary number. The resulting LBP for t_c, expressed as a decimal number, is

LBP_{m,r}(t_c) = \sum_{i=0}^{m-1} s(t_i - t_c)\, 2^i = \sum_{i=0}^{m-1} s(d_i)\, 2^i, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}    (1)

where d_i = (t_i - t_c) is the difference between each neighbor and the center pixel, which characterizes the spatial local structure at the center location. The resulting d_i is robust to illumination changes, and the differences are more efficient than the original image in pattern classification because the central gray level t_c is removed. The difference vector d_i can be further decomposed into two components: the signs and the magnitudes (absolute values of d_i, i.e., |d_i|). The original LBP only uses the sign information of d_i while ignoring the magnitude information. In the improved CLBP [18], the signs and magnitudes are complementary, and from them the difference vector d_i can be perfectly reconstructed. Figure 1 illustrates an example of the sign and magnitude components of the CLBP extracted from a sample block, where Figure 1a-d denote the original 3 × 3 local structure, the difference vector, the sign vector, and the magnitude vector, respectively. Note that "0" is coded as "-1" in CLBP (as seen in Figure 1c). Two operators, CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are used to encode these two components. CLBP_S is equivalent to the traditional LBP operator, while the CLBP_M operator can be expressed as

CLBP\_M_{m,r} = \sum_{i=0}^{m-1} f(|d_i|, c)\, 2^i, \qquad f(x, y) = \begin{cases} 1, & x \geq y \\ 0, & x < y \end{cases}    (2)

where c is a threshold that is set to the mean value of |d_i|. Using Equations (1) and (2), two binary strings can be produced and denoted as the CLBP_S and CLBP_M codes, respectively. Two ways to combine the CLBP_S and CLBP_M codes are presented in [18].


Here, the first way (concatenation) is used, in which the histograms of the CLBP_S and CLBP_M codes are calculated separately and then concatenated. Note that there is also a CLBP-Center part, which codes the values of the center pixels in the original CLBP. Here, only the CLBP_S and CLBP_M operators are considered for computational efficiency.

Figure 1. (a) 3 × 3 sample block; (b) the local differences; (c) the sign component of CLBP; (d) the absolute value of local differences; (e) the magnitude component of CLBP.

Figure 2 presents an example of the CLBP_S and CLBP_M coded images corresponding to an input aerial scene (viaduct scene). It can be observed that the CLBP_S and CLBP_M operators can both capture the spatial pattern and the contrast of local image texture, such as edges and corners.

Figure 2. (a) Input image; (b) CLBP_S coded image; (c) CLBP_M coded image.
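As a concrete illustration of Equations (1) and (2), the following is a minimal sketch (not the authors' code) of computing the CLBP_S and CLBP_M coded images with NumPy. The function name, the integer-offset approximation of the circular m = 8 neighborhood, and the omission of the rotation-invariant code mapping are simplifying assumptions.

```python
import numpy as np

def clbp_s_m(image, r=1):
    """CLBP_S and CLBP_M coded images for a gray-scale image (m = 8 neighbors, radius r).

    The eight neighbors are taken at integer offsets, a simplification of the usual
    interpolated sampling on a circle of radius r.
    """
    img = image.astype(np.float64)
    offsets = [(-r, 0), (-r, r), (0, r), (r, r), (r, 0), (r, -r), (0, -r), (-r, -r)]
    # d_i = t_i - t_c for every pixel and every neighbor
    diffs = np.stack([np.roll(img, shift, axis=(0, 1)) - img for shift in offsets])
    c = np.abs(diffs).mean()                                 # threshold: mean of |d_i|
    weights = (2 ** np.arange(8)).reshape(8, 1, 1)
    clbp_s = ((diffs >= 0) * weights).sum(axis=0)            # sign code, Equation (1)
    clbp_m = ((np.abs(diffs) >= c) * weights).sum(axis=0)    # magnitude code, Equation (2)
    return clbp_s.astype(np.uint8), clbp_m.astype(np.uint8)

# usage: s_img, m_img = clbp_s_m(gray_image, r=1)   # gray_image is a 2-D array
```

The two occurrence histograms computed from these coded images, concatenated, give the CLBP feature used in the following sections.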

2.2. Fisher Vector

After local feature extraction (especially patch-based feature extraction), the popular BOVW model is usually employed to encode features into histograms. However, the BOVW model has an intrinsic limitation, namely the computational cost of assigning local features to visual words, which scales as the product of the number of visual words, the number of regions, and the local feature dimensionality [23]. Several extensions to the basic BOVW model that build compact vocabularies have been proposed. The most appealing one is the Fisher kernel image representation [19,24], which uses a high-dimensional gradient representation to represent an image. Due to its informative representation with compact vocabularies, this representation contains more information than a simple histogram representation.

An FV is a special case of Fisher kernel construction. Let X = {x_t, t = 1, ..., T} be the set of local patch descriptors extracted from an image. A Gaussian mixture model (GMM) is trained on the training images using Maximum Likelihood (ML) estimation [25,26]. Let P denote the probability density function of the GMM with parameters λ = {ω_i, μ_i, Σ_i, i = 1, ..., K}, where K is the number of components, and ω_i, μ_i, and Σ_i are the mixture weight, mean vector, and covariance matrix of the i-th Gaussian component, respectively. The image can be characterized by the gradient of the log-likelihood of the data on the model:

G_λ^X = ∇_λ \log P(X|λ)    (3)

The gradient describes the direction along which the parameters are to be adjusted to best fit the data. Under an independence assumption, the covariance matrices are diagonal, i.e., Σ_i = diag(σ_i^2). Then, following [27], L(X|λ) = \log P(X|λ) is written as


L(X|λ) = \sum_{t=1}^{T} \log P(x_t|λ)    (4)

The probability density function of x_t generated by the GMM is

P(x_t|λ) = \sum_{i=1}^{K} ω_i\, p_i(x_t|λ)    (5)

Let γ_t(i) be the occupancy probability, i.e., the probability that descriptor x_t is generated by the i-th Gaussian:

γ_t(i) = P(i|x_t, λ) = \frac{ω_i\, p_i(x_t|λ)}{\sum_{j=1}^{K} ω_j\, p_j(x_t|λ)}    (6)

With the Bayes formula, mathematical derivations provide the following results:

∂L(X|λ)/∂ω_i = \sum_{t=1}^{T} \left[ \frac{γ_t(i)}{ω_i} - \frac{γ_t(1)}{ω_1} \right]  for i ≥ 2    (7)

∂L(X|λ)/∂μ_i^d = \sum_{t=1}^{T} γ_t(i) \left[ \frac{x_t^d - μ_i^d}{(σ_i^d)^2} \right]    (8)

∂L(X|λ)/∂σ_i^d = \sum_{t=1}^{T} γ_t(i) \left[ \frac{(x_t^d - μ_i^d)^2}{(σ_i^d)^3} - \frac{1}{σ_i^d} \right]    (9)

where d denotes the d-th dimension of a vector. The diagonal closed-form approximation in [27] is used to normalize the gradient vector by multiplying it by the square root of the inverse of the Fisher information matrix, i.e., F_λ^{-1/2}. Let f_{ω_i}, f_{μ_i^d}, and f_{σ_i^d} denote the diagonal terms of F_λ corresponding to ∂L(X|λ)/∂ω_i, ∂L(X|λ)/∂μ_i^d, and ∂L(X|λ)/∂σ_i^d, respectively; they have the following approximations:

f_{ω_i} = T \left( \frac{1}{ω_i} + \frac{1}{ω_1} \right)    (10)

f_{μ_i^d} = \frac{T ω_i}{(σ_i^d)^2}    (11)

f_{σ_i^d} = \frac{2 T ω_i}{(σ_i^d)^2}    (12)

Thus, the normalized partial derivatives are f_{ω_i}^{-1/2} ∂L(X|λ)/∂ω_i, f_{μ_i^d}^{-1/2} ∂L(X|λ)/∂μ_i^d, and f_{σ_i^d}^{-1/2} ∂L(X|λ)/∂σ_i^d. The final gradient vector (i.e., the FV) is the concatenation of all the partial derivative vectors. Therefore, the dimensionality of the FV is (2D + 1) × K, where D denotes the dimensionality of the local descriptors.
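To make Equations (4)-(12) concrete, here is a minimal sketch (not the authors' implementation) of Fisher vector encoding with a diagonal-covariance GMM from scikit-learn; the function name fisher_vector and the use of sklearn.mixture.GaussianMixture are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Fisher vector of a T x D set of patch descriptors X (cf. Equations (4)-(12))."""
    T, _ = X.shape
    K = gmm.n_components
    w, mu, sigma = gmm.weights_, gmm.means_, np.sqrt(gmm.covariances_)  # diagonal model
    gamma = gmm.predict_proba(X)                 # occupancy probabilities, Equation (6)

    # weight gradients (i >= 2), Equation (7), normalized with Equation (10)
    g_w = (gamma / w - gamma[:, [0]] / w[0]).sum(axis=0)[1:]
    g_w /= np.sqrt(T * (1.0 / w[1:] + 1.0 / w[0]))

    # mean and standard-deviation gradients, Equations (8)-(9), normalized with (11)-(12)
    g_mu, g_sigma = [], []
    for i in range(K):
        diff = (X - mu[i]) / sigma[i]
        g_mu.append((gamma[:, [i]] * diff / sigma[i]).sum(axis=0)
                    * sigma[i] / np.sqrt(T * w[i]))
        g_sigma.append((gamma[:, [i]] * (diff ** 2 - 1.0) / sigma[i]).sum(axis=0)
                       * sigma[i] / np.sqrt(2.0 * T * w[i]))
    return np.concatenate([g_w, np.concatenate(g_mu), np.concatenate(g_sigma)])

# the GMM is fitted beforehand on descriptors pooled from the training images, e.g.
# gmm = GaussianMixture(n_components=35, covariance_type='diag').fit(training_descriptors)
```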

3. Proposed Feature Representation Method

Inspired by the success of CLBP and FV in computer vision applications, we propose an effective image representation approach for remote sensing image scene classification based on patch-based MS-CLBP features and FV. The patch-based MS-CLBP provides the local patch descriptors, and the FV is chosen as the encoding strategy to generate a high-dimensional representation of an image.


3.1. Two Implementations of Multi-Scale Completed Local Binary Patterns

CLBP features computed from a single scale may not be able to detect the dominant texture features of an image. A possible solution is to characterize the image texture at multiple resolutions, i.e., MS-CLBP. There are two implementations of the MS-CLBP descriptor [17].

In the first implementation, the radius of the circle is altered to change the spatial resolution. The multi-scale analysis is accomplished by combining the information provided by multiple operators of varying (m, r). For simplicity, the number of neighbors is fixed to m, and different values of r are tuned to achieve the optimal combination. An example of a 3-scale (three r values) CLBP operator is illustrated in Figure 3. The CLBP_S and CLBP_M histogram features extracted from each scale are concatenated to form an MS-CLBP representation. One disadvantage of this multi-scale analysis implementation is that the computational complexity increases due to the multiple resolutions.

Figure 3. An example of the first implementation of a 3-scale CLBP operator (m = 8, r1 = 1, r2 = 2, and r3 = 3).

In the second implementation, the original image is down-sampled using bicubic interpolation to obtain multiple images at different scales. The value of the scale is between 0 and 1 (here, 1 denotes the original image). Then, the CLBP_S and CLBP_M operators with a fixed radius and number of neighbors are applied to the multiple-scale images. The CLBP_S and CLBP_M histogram features extracted from each scale image are concatenated to form an MS-CLBP representation. An example of the second implementation of the MS-CLBP descriptor is shown in Figure 4.

Figure 4. An example of the second implementation of a 3-scale CLBP operator (m = 8, r = 2).
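A minimal sketch of the two implementations follows (an illustration rather than the authors' code); it reuses the clbp_s_m helper sketched in Section 2.1, and the histogram helper and the use of skimage.transform.rescale for bicubic down-sampling are assumptions.

```python
import numpy as np
from skimage.transform import rescale

def clbp_histogram(coded_image, bins=256):
    """Normalized occurrence histogram of a CLBP coded image."""
    hist, _ = np.histogram(coded_image, bins=bins, range=(0, bins))
    return hist / max(hist.sum(), 1)

def ms_clbp_first(image, radii=(1, 2, 3)):
    """First implementation: vary the radius r and concatenate the per-scale histograms."""
    feats = []
    for r in radii:
        s_img, m_img = clbp_s_m(image, r=r)
        feats.append(np.concatenate([clbp_histogram(s_img), clbp_histogram(m_img)]))
    return np.concatenate(feats)

def ms_clbp_second(image, r=2, scales=(1.0, 0.5, 1.0 / 3.0)):
    """Second implementation: bicubic down-sampling of the image with a fixed (m, r)."""
    feats = []
    for s in scales:
        img_s = image if s == 1.0 else rescale(image, s, order=3, anti_aliasing=True)
        s_img, m_img = clbp_s_m(img_s, r=r)
        feats.append(np.concatenate([clbp_histogram(s_img), clbp_histogram(m_img)]))
    return np.concatenate(feats)
```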

3.2. Patch-Based MS-CLBP Feature Extraction

Given an image, the CLBP [18] operator with a parameter set (m, r) is applied to generate two CLBP coded images, one corresponding to the sign component (i.e., the CLBP_S coded image) and the other to the magnitude component (i.e., the CLBP_M coded image). These two complementary components of CLBP (CLBP_S and CLBP_M) capture the spatial patterns and the contrast of local image texture, such as edges and corners. Then, the CLBP coded images are partitioned into B × B overlapped patches on an image grid. For simplicity, the overlap between two patches is half of the patch size (i.e., B/2) in both the horizontal and vertical directions. To incorporate spatial structures of an image at different scales (or sizes) and to create more patch descriptors, the second implementation of MS-CLBP is employed here by resizing the original image to different scales (e.g., 1/2 and 1/3 of the original image). Specifically, the CLBP operator with the same parameter set is applied to the multi-scale images to generate patch-based CLBP histogram features. For patch i, two occurrence histograms (i.e., nonparametric statistical estimates) are computed from the sign component (CLBP_S) and the magnitude component (CLBP_M). A histogram feature vector denoted by h_i is formed by concatenating the two histograms. If M patches are extracted from the multi-scale images, a feature matrix denoted by H = [h_1, h_2, ..., h_M] is generated to represent the original image. Each column of the matrix H is a histogram feature vector for a patch. The proposed patch-based CLBP feature extraction method is illustrated in Figure 5.

If patches areimages, extracted from thematrix multi-scale images, a feature by H “torh = [h1 , h 2matrix ,..., h M ]denoted is generated represent theMmulti-scale a feature denoted by H Ms 1 , h2 , ..., hthe is generated to represent the original image. EachHcolumn of the matrix H is avector histogram vector original image. Each column of the matrix is a histogram feature for feature a patch. The for a patch. The proposed patch-based CLBP feature extraction method is illustrated in Figure 5. proposed patch-based CLBP feature extraction method is illustrated in Figure 5.

Figure 5. 5. Patch-based feature extraction. extraction. Figure Patch-based CLBP CLBP feature

As noted in [21], LBP features computed from a single scale may not be able to represent As noted in [21], LBP features computed from a single scale may not be able to represent intrinsic intrinsic texture features. Therefore, different parameter sets ( m, r ) are utilized for the CLBP texture features. Therefore, different parameter sets pm, rq are utilized for the CLBP operator to achieve operator to achieve the first implementation of the MS-CLBP as described in [17]. Specifically, the the first implementation of the MS-CLBP as described in [17]. Specifically, the number of neighbors is fixed andused multiple ( ) are used in feature the patch-based feature number of neighbors ( )radii (m) is fixed and multiple (r) are in theradii patch-based CLBP extractionCLBP as shown in ( are considered, extraction shown in Figure (m,considered, r1 ), (m, r2 ),...,a(set m, rof q, pm, r2 q, ...,sets pm, (i.e., rq q ) are Figure 5. Ifas q!parameter sets (i.e.,5. Ifpm, r1)parameter matrices q ) q) feature

{

{

}

}

denoted Hpm,r1matrices , ..., Hpm,rqby obtained for an image. is worth that Itthe q , Hpm,r2 qdenoted q can H ( mbe a set of byfeature can be It obtained fornoting an image. is , r1 ) , H ( m , r2 ) ,..., H ( m , rq ) proposed patch-based MS-CLBP feature extraction effectively unifies the two implementations of the worth noting that the proposed patch-based MS-CLBP feature extraction effectively unifies the two MS-CLBP [17]. implementations of the MS-CLBP [17]. 3.3. A Fisher Kernel Representation 3.3. A Fisher Kernel Representation Fisher kernel representation [19] is an effective patch aggregation mechanism to characterize Fisher representation is an effective aggregation to characterize a a sample of kernel low-level features, and[19] it exhibits superiorpatch performance over mechanism the BOVW model. Therefore, sample ofkernel low-level features, isand it exhibits superior performance over the BOVW model. the Fisher representation employed to encode the dense !local patch descriptors. ) Therefore, the Fisher kernel representation is employed to encode the dense local patch descriptors. r 1 s r 2 Given NT training images with NT feature matrices, H , H s , ..., Hr NT s representing [2] T ] H[1] , H ,..., H[ Nimage Given patchtraining images with featureCLBP matrices, representing the the local descriptors (i.e., patch-based features) of each are obtained using the feature extraction method illustrated in Figure 5. Since q parameter sets (i.e., local patch descriptors ((i.e., patch-based CLBP features) of each image are obtained using the pm, r1 q, pm, r!2 q, ..., pm, rq q ) are employed CLBP operator, each image yields q feature matrices ) for the feature extraction method illustrated in Figure 5. Since parameter sets (i.e., r js r js r js denoted by Hpm,r q , Hpm,r q , ..., Hpm,r q , where j P r1, 2, ..., NT s. For each CLBP parameter set, the q 2 1 ) are employed for data the are CLBP operator, each yields feature corresponding feature matrices of the training used to estimate the image GMM parameters via the

{

{

}

}

[ j] [ j] Expectation-Maximization CLBP sets,CLBP q GMMs are created. matrices denoted by H[( mj ](EM) j ∈q[1, 2,..., parameter NT ] . For each parameter set, , r1 ) , H ( algorithm. m , r2 ) ,..., H ( m , rqTherefore, ) , where for After estimating the GMM parameters, q FVs are obtained for an image. Then, the q FVs are simply the corresponding feature matrices of the training data are used to estimate the GMM parameters concatenated as the final feature representation. Figure 6 shows the detailed procedure for generating via the Expectation-Maximization (EM) algorithm. Therefore, for q CLBP parameter sets, q FVs. As illustrated in Figure 6, the stacked FVs (f) from the q CLBP parameter sets serve as the final feature representation of an image before being fed into a classifier.


Figure 6. Fisher vector representation.
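Putting the pieces together, the encoding pipeline of Figure 6 can be sketched as below (again an illustration under the assumptions of the earlier helpers patch_based_clbp and fisher_vector): one GMM is fitted per CLBP parameter set on descriptors pooled from the training images, and the resulting per-set FVs are stacked into the final representation f.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmms(train_images, radii, patch_size=32, n_components=35):
    """One diagonal-covariance GMM per CLBP parameter set (m = 8, r)."""
    gmms = {}
    for r in radii:
        pooled = np.hstack([patch_based_clbp(img, r=r, patch_size=patch_size)
                            for img in train_images]).T        # rows = patch descriptors
        gmms[r] = GaussianMixture(n_components=n_components,
                                  covariance_type='diag').fit(pooled)
    return gmms

def encode_image(image, gmms, radii, patch_size=32):
    """Stacked Fisher vectors f: one FV per parameter set, concatenated."""
    fvs = [fisher_vector(patch_based_clbp(image, r=r, patch_size=patch_size).T, gmms[r])
           for r in radii]
    return np.concatenate(fvs)
```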

4. Experiments

Two standard public domain datasets are used to demonstrate the effectiveness of the proposed image representation method for remote sensing land-use scene classification. In the experiments, KELM with a radial basis function (RBF) kernel is employed for classification due to its generally excellent classification performance and low computational cost. The classification performance of the proposed method is compared with the state-of-the-art in the literature.

4.1. Experimental Data and Setup

The first dataset is the well-known UC-Merced land-use dataset [28]. It is the first public ground truth land-use scene image dataset and consists of 21 land-use classes; each class contains 100 images with a size of 256 × 256 pixels. The images were manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map. This is a challenging dataset due to the variety of spatial patterns in the 21 classes. Sample images of each land-use class are shown in Figure 7. To facilitate a fair comparison, the same experimental setting reported in [28] is followed. Five-fold cross-validation is performed, in which the dataset is randomly partitioned into five equal subsets; there are 20 images from each land-use class in a subset. Four subsets are used for training and the remaining subset for testing. The classification accuracy is the average over the five cross-validation evaluations.
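For the classifier, the paper adopts KELM with an RBF kernel. The following is a hedged sketch of a kernel ELM using the standard closed-form solution, together with the five-fold protocol described above; the class name, the regularization constant C, the kernel width gamma, and the splitting helper are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

class KELM:
    """Kernel extreme learning machine: beta = (I / C + Omega)^(-1) T on one-hot targets."""
    def __init__(self, C=100.0, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X, self.classes = X, np.unique(y)
        targets = (y[:, None] == self.classes[None, :]).astype(float)   # one-hot labels
        omega = rbf_kernel(X, X, self.gamma)
        self.beta = np.linalg.solve(np.eye(len(X)) / self.C + omega, targets)
        return self

    def predict(self, X_test):
        scores = rbf_kernel(X_test, self.X, self.gamma) @ self.beta
        return self.classes[scores.argmax(axis=1)]

# five-fold evaluation sketch (20 test images per class in each fold):
# accuracies = []
# for train_idx, test_idx in five_fold_splits(labels):          # assumed splitting helper
#     clf = KELM(C=100.0, gamma=1.0).fit(features[train_idx], labels[train_idx])
#     accuracies.append((clf.predict(features[test_idx]) == labels[test_idx]).mean())
# mean_accuracy = float(np.mean(accuracies))
```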


Figure 7. Examples from the 21-class land-use dataset: (1) agricultural; (2) airplane; (3) baseball diamond; (4) beach; (5) buildings; (6) chaparral; (7) dense residential; (8) forest; (9) freeway; (10) golf course; (11) harbor; (12) intersection; (13) medium density residential; (14) mobile home park; (15) overpass; (16) parking lot; (17) river; (18) runway; (19) sparse residential; (20) storage tanks; (21) tennis courts.

The second dataset used in our experiments is the 19-class satellite scene dataset [29]. It consists of 19 classes of high-resolution satellite scenes. There are 50 images with sizes of 600 × 600 pixels for each class. The images are extracted from large satellite images on Google Earth. An example of each class is shown in Figure 8. The same experimental setup as in [30] was used. Here, 30 images are randomly selected per class as training data and the remaining images as testing data. The experiment is repeated 10 times with different realizations of randomly selected training and testing images. Classification accuracy is averaged over the 10 trials.

Figure 8. Examples from the 19-class satellite scene dataset: (1) airport; (2) beach; (3) bridge; (4) commercial; (5) desert; (6) farmland; (7) football field; (8) forest; (9) industrial; (10) meadow; (11) mountain; (12) park; (13) parking; (14) pond; (15) port; (16) railway station; (17) residential; (18) river; (19) viaduct.


Note that the original images in these two datasets are color images; the images are converted from the RGB color space to the YCbCr color space, and the Y component (luminance) is used for scene classification.
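As a brief illustration of this pre-processing step (a sketch assuming the common ITU-R BT.601 weights, which the paper does not specify), the luminance channel can be obtained as:

```python
import numpy as np

def luminance(rgb_image):
    """Y (luma) channel of an RGB image using the BT.601 weights."""
    rgb = rgb_image.astype(np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
```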

4.2. Parameter Settings

The number of neighbors (m) in the CLBP operator has a direct impact on the dimensionality of the FV since patch-based CLBP features are used as the local patch descriptors. A large value of m will increase the feature dimensionality and thus the computational complexity. Based on the parameter tuning results in [17], m = 8 is empirically chosen for both the 21-class land-use dataset and the 19-class satellite scene dataset, as it balances classification performance and computational complexity. In addition, the parameter settings in [17] are adopted for the MS-CLBP descriptor. Specifically, 6 radii (i.e., r = [1 : 6]) are used for the MS-CLBP descriptor, resulting in 6 parameter sets {(m = 8, r1 = 1), ..., (m = 8, r6 = 6)}.

Then, the number of scales used to generate multi-scale images for the MS-CLBP operator and the number of Gaussians (K) in the GMM are studied. For the 21-class land-use dataset, 80 images are randomly selected per class for training and the remaining images for testing. For the 19-class satellite scene dataset, 30 images per class are randomly selected for training and the remaining images for testing. Different numbers of Gaussians are examined along with different choices of multiple scales, including {1, 1/[1:2], ..., 1/[1:6]}. For instance, 1/[1:2] indicates that scale = 1 (the original image) and scale = 1/2 (the image down-sampled to half of the original size) are used to generate two images at two scales. Figures 9 and 10 present the classification results with different numbers of Gaussians in the GMM and different numbers of scales for the two datasets, respectively.

Figure 9. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 21-class land-use dataset.


Figure 10. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 19-class satellite scene dataset.

Thus, the optimal number of Gaussians for the 21-class land-use dataset is 35, and the optimal set of multiple scales is 1/[1:4], simultaneously considering classification accuracy and computational complexity. Similarly, the optimal number of Gaussians for the 19-class satellite scene dataset is 20, and the optimal set of multiple scales is 1/[1:4].

Since the proposed method extracts dense local patches, the size of the patch (B × B) is determined empirically. The classification accuracies with varying patch sizes are illustrated in Figure 11. It is obvious that B = 32 achieves the best classification performance for the 21-class land-use dataset. The size of the images in the 19-class dataset is 600 × 600 pixels, which is about twice the size of the images in the 21-class dataset. Therefore, the patch size is set to B = 64 for the 19-class dataset.

In addition, to gain computational efficiency, principal component analysis (PCA) [31,32] is employed to reduce the dimensionality of the FV features. The PCA projection matrix is calculated using the features of the training data, and the principal components that account for 95% of the total variation of the training features are considered in our experiments.


Figure 11. Classification accuracy (%) versus varying patch sizes for the 21-class land-use dataset.
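A hedged sketch of the dimensionality-reduction step with scikit-learn (an implementation choice assumed here, not stated in the paper) fits the projection on the training FVs only and keeps the components explaining 95% of the variance:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)                      # keep 95% of the total variance
train_fv_reduced = pca.fit_transform(train_fv)    # train_fv: assumed (num_images x dim) array
test_fv_reduced = pca.transform(test_fv)          # apply the training projection to test FVs
```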


4.3. FV Representation vs. BOVW Model

Figure 12.12. proposed method andMS-CLBP+BOVW MS-CLBP+BOVW the 21-class Figure Per-class accuracy (%) on the 21-class Figure 12.Per-class Per-classaccuracy accuracy(%) (%) of of the the proposed proposed method method and and MS-CLBP+BOVW onon the 21-class land-use dataset. land-use dataset. land-use dataset.

Figure 12. Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 21-class land-use dataset.

4.4. Comparison to the State-of-the-Art Methods Comparison State-of-the-ArtMethods Methods 4.4.4.4. Comparison toto thetheState-of-the-Art In Inthis thissection, section,the theeffectiveness effectivenessof ofthe theproposed proposedimage imagerepresentation representationmethod methodisisevaluated evaluatedby by In this section, the effectiveness of the proposed image representation method is evaluated by comparing comparingits itsperformance performancewith withpreviously previouslyreported reportedperformance performancein inthe theliterature. literature.Specifically, Specifically,the the comparing its performance with previously reported performance in the literature. Specifically, the proposed proposed method method isis compared compared with with the the MS-CLBP MS-CLBP descriptor descriptor [17] [17] applied applied to to an an entire entire remote remote proposed method is compared with the MS-CLBP descriptor [17] applied to an entire remote sensing sensing sensing image image to to obtain obtain aa global global feature feature representation. representation. The The comparison comparison results results are are reported reported in in image to obtain athe global feature representation. The comparison results are reported in Table 1. Table Table 1.1. From From the comparison comparison results, results, the the proposed proposed method method achieves achieves superior superior classification classification From the comparison results, the proposed method achieves superior classification performance over

Remote Sens. 2016, 8, 483

13 of 17

other existing methods, which demonstrates the effectiveness of the proposed image representation for remote sensing land-use scene classification. The improvement of the proposed method13over the Remote Sens. 2016, 8, 483 of 17 global representation developed in [17] is 2.4%. This improvement is mainly due to the proposed local featureperformance representation which unifieswhich the two implementations of the MS-CLBP descriptor. overframework other existing methods, demonstrates the effectiveness of the proposed image for remote land-use4.7% sceneimprovement classification. over The the improvement the Moreover, therepresentation proposed approach is ansensing approximately MS-CLBP of + BOVW proposed method over the global representation developed in [17] is 2.4%. This improvement is method, which verifies the advantage of the Fisher kernel representation as compared to the BOVW mainly due to the proposed local feature representation framework which unifies the two model. Figure 14 shows the confusion matrix of the proposed method for the 21-class land-use dataset. implementations of the MS-CLBP descriptor. Moreover, the proposed approach is an The diagonal elements of the matrix denote the mean class-specific classification accuracy (%). We find approximately 4.7% improvement over the MS-CLBP + BOVW method, which verifies the an interesting phenomenon from Figure 14 that diagonal elements for beach and forest are extremely advantage of the Fisher kernel representation as compared to the BOVW model. Figure 14 shows large but forthe storage tankmethod is relatively small. The land-use reasons are that images of beach thediagonal confusionelements matrix of proposed for the 21-class dataset. The diagonal and forest present richmatrix texturedenote and structures similarity(%). for We the beach elements of the the mean information; class-specific within-class classification accuracy find anand forest categories is high but relatively low storagefor tank; and some images of storage interesting phenomenon from Figure 14for thatcategory diagonalof elements beach and forest are extremely large but diagonal for as storage tank is relatively small. The reasons are that images of beach tank are similar to otherelements class such buildings. and forest present rich texture and structures information; within-class similarity for the beach and forest categories is high but of relatively low for category offorthe storage tank; and somedataset. images of storage Table 1. Comparison classification accuracy (%) 21-class land-use tank are similar to other class such as buildings. Method Accuracy(Mean ˘ std) Table 1. Comparison of classification accuracy (%) forthe 21-class land-use dataset. 
Table 1. Comparison of classification accuracy (%) for the 21-class land-use dataset.

Method                                               Accuracy (Mean ± std)
BOVW [28]                                            76.8
SPM [28]                                             75.3
BOVW + Spatial Co-occurrence Kernel [28]             77.7
Color Gabor [28]                                     80.5
Color histogram (HLS) [28]                           81.2
Structural texture similarity [7]                    86.0
Unsupervised feature learning [33]                   81.7 ± 1.2
Saliency-Guided unsupervised feature learning [34]   82.7 ± 1.2
Concentric circle-structured multiscale BOVW [5]     86.6 ± 0.8
Multifeature concatenation [35]                      89.5 ± 0.8
Pyramid-of-Spatial-Relatons (PSR) [36]               89.1
MCBGP + E-ELM [37]                                   86.52 ± 1.3
ConvNet with specific spatial features [38]          89.39 ± 1.10
Gradient boosting random convolutional network [39]  94.53
GoogLeNet [40]                                       92.80 ± 0.61
OverFeat ConvNets [40]                               90.91 ± 1.19
MS-CLBP [17]                                         90.6 ± 1.4
MS-CLBP + BOVW                                       89.27 ± 2.9
The Proposed                                         93.00 ± 1.2

Figure 14. Confusion matrix of proposed method for the 21-class land-use dataset.
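The class-specific accuracies on the diagonal of Figure 14 are simply the row-normalized entries of the confusion matrix. As a brief illustration (not taken from the authors' code; the label arrays below are hypothetical placeholders), such per-class accuracies can be computed as follows.

```python
# Illustrative per-class accuracy from a row-normalized confusion matrix;
# y_true / y_pred are hypothetical placeholders for ground-truth and predicted labels.
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred).astype(float)
    cm = cm / cm.sum(axis=1, keepdims=True)   # each row now sums to 1
    return 100.0 * np.diag(cm)                # diagonal = class-specific accuracy (%)

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
print(per_class_accuracy(y_true, y_pred))     # approximately [50. 100. 66.7]
```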


When compared with CNNs, the classification accuracy of the CNN-based methods is close to that of our method. Even though the performance of some CNNs is better than that of the proposed method, they require a pre-training process with a large amount of external data. Thus, our method is still competitive in terms of its limited requirement for external data.

The comparison results for the 19-class satellite scene dataset are listed in Table 2, which indicates that the proposed method outperforms the other existing methods and achieves the best performance. The proposed method provides about a 7% improvement over the method in [31], which utilized a combination of multiple sets of features, indicating the superior discriminative power of the proposed feature representation. The confusion matrix of the proposed method for the 19-class satellite scene dataset is shown in Figure 15. From the diagonal elements of the matrix, the classification accuracy for bridges is relatively small because some texture information in the images of bridges is similar to that in the images of ports.

Table 2. Comparison of classification accuracy (%) for the 19-class satellite scene dataset.

Method                                  Accuracy (Mean ± std)
Bag of colors [25]                      70.6 ± 1.5
Tree of c-shapes [25]                   80.4 ± 1.8
Bag of SIFT [25]                        85.5 ± 1.2
Multifeature concatenation [25]         90.8 ± 0.7
LTP-HF [23]                             77.6
SIFT + LTP-HF + Color histogram [23]    93.6
MS-CLBP [17]                            93.4 ± 1.1
MS-CLBP + BOVW                          89.29 ± 1.3
The Proposed                            94.32 ± 1.2

Figure 15. Confusion matrix of proposed method for the 19-class satellite scene dataset.

5. Conclusions

In this paper, an effective image representation method for remote sensing image scene classification was introduced. The proposed representation method is based on multi-scale completed local binary pattern features and Fisher vectors. The MS-CLBP was applied to the partitioned dense regions of an image to extract a set of local patch descriptors, which characterize the detailed structure and texture information in high-resolution remote sensing images.
The Fisher vector was then employed to encode the local descriptors into a high-dimensional gradient representation, which enhances the discriminative power of the feature representation. Experimental results on two land-use scene datasets demonstrated that the proposed image representation approach obtains superior performance compared to existing methods for scene classification, with an improvement of approximately 3% over the state-of-the-art MS-CLBP for the 21-class land-use dataset and approximately 1% for the 19-class satellite scene dataset. In future work, combining global and local feature representations for remote sensing image scene classification will be investigated.

Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grants No. NSFC-61571033 and 61302164, and partly by the Fundamental Research Funds for the Central Universities under Grants No. BUCTRC201401, BUCTRC201615, and XK1521.

Author Contributions: Longhui Huang, Chen Chen and Wei Li provided the overall conception of this research, and designed the methodology and experiments. Longhui Huang and Chen Chen carried out the implementation of the proposed algorithm, conducted the experiments and analysis, and wrote the manuscript. Wei Li and Qian Du reviewed and edited the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LBP       Local binary patterns
CLBP      Completed local binary patterns
MS-CLBP   Multi-scale completed local binary patterns
FV        Fisher vector
ELM       Extreme learning machine
KELM      Kernel-based extreme learning machine
BOVW      Bag-of-visual-words
SPM       Spatial pyramid matching
SIFT      Scale-invariant feature transform
EGTD      Enhanced Gabor texture descriptor
GMM       Gaussian mixture model
CLBP_S    Completed local binary patterns sign component
CLBP_M    Completed local binary patterns magnitude component
RBF       Radial basis function
USGS      United States Geological Survey
PCA       Principal component analysis

References

1. Yang, J.; Jiang, Y.-G.; Hauptmann, A.G.; Ngo, C.-W. Evaluating bag-of-visual-words representations in scene classification. In Proceedings of the International Workshop on Multimedia Information Retrieval, the 15th ACM International Conference on Multimedia, Augsburg, Bavaria, Germany, 23–28 September 2007; pp. 197–206.
2. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 2169–2178.
3. Yang, Y.; Newsam, S. Spatial pyramid co-occurrence for image classification. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1465–1472.
4. Zhou, L.; Zhou, Z.; Hu, D. Scene classification using a multi-resolution bag-of-features model. Pattern Recognit. 2013, 46, 424–433. [CrossRef]
5. Zhao, L.-J.; Tang, P.; Huo, L.-Z. Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4620–4631. [CrossRef]
6. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
7. Risojevic, V.; Babic, Z. Aerial image classification using structural texture similarity. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao, Spain, 14–17 December 2011; pp. 190–195.
8. Risojević, V.; Momić, S.; Babić, Z. Gabor descriptors for aerial image classification. In Adaptive and Natural Computing Algorithms; Springer: Berlin/Heidelberg, Germany, 2011; pp. 51–60.
9. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [CrossRef]
10. Zheng, X.; Sun, X.; Fu, K.; Wang, H. Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint. IEEE Geosci. Remote Sens. Lett. 2013, 10, 652–656. [CrossRef]
11. Risojevic, V.; Babic, Z. Fusion of global and local descriptors for remote sensing image classification. IEEE Geosci. Remote Sens. Lett. 2013, 10, 836–840. [CrossRef]
12. Goodfellow, I.; Courville, A.; Bengio, Y. Deep Learning; Book in preparation for MIT Press; The MIT Press: Cambridge, MA, USA, 2016.
13. Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127. [CrossRef]
14. Yue, J.; Zhao, W.; Mao, S.; Liu, H. Spectral–spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sens. Lett. 2015, 6, 468–477. [CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2012; pp. 1106–1114.
16. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
17. Chen, C.; Zhang, B.; Su, H.; Li, W.; Wang, L. Land-use scene classification using multi-scale completed local binary patterns. Signal Image Video Process. 2015, 10, 1–8. [CrossRef]
18. Guo, Z.; Zhang, L.; Zhang, D. A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process. 2010, 19, 1657–1663. [PubMed]
19. Perronnin, F.; Sánchez, J.; Mensink, T. Improving the Fisher kernel for large-scale image classification. In Computer Vision—ECCV 2010; Lecture Notes in Computer Science; Springer-Verlag: Berlin, Germany, 2010; pp. 143–156.
20. Huang, G.-B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. 2012, 42, 513–529. [CrossRef] [PubMed]
21. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [CrossRef]
22. Li, W.; Chen, C.; Su, H.; Du, Q. Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3681–3693. [CrossRef]
23. Krapac, J.; Verbeek, J.; Jurie, F. Modeling spatial layout with Fisher vectors for image categorization. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1487–1494.
24. Sánchez, J.; Perronnin, F.; Mensink, T.; Verbeek, J. Image classification with the Fisher vector: Theory and practice. Int. J. Comput. Vis. 2013, 105, 222–245. [CrossRef]
25. Liu, C. Maximum likelihood estimation from incomplete data via EM-type algorithms. In Advanced Medical Statistics; World Scientific Publishing Co.: Hackensack, NJ, USA, 2003; pp. 1051–1071.
26. Jaakkola, T.S.; Haussler, D. Exploiting generative models in discriminative classifiers. Adv. Neural Inf. Process. Syst. 1999, 11, 487–493.
27. Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
28. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; pp. 270–279.
29. Dai, D.; Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176. [CrossRef]
30. Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412. [CrossRef]
31. Ren, J.; Zabalza, J.; Marshall, S.; Zheng, J. Effective feature extraction and data reduction in remote sensing using hyperspectral imaging [applications corner]. IEEE Signal Process. Mag. 2014, 31, 149–154. [CrossRef]
32. Chen, C.; Li, W.; Tramel, E.W.; Fowler, J.E. Reconstruction of hyperspectral imagery from random projections using multihypothesis prediction. IEEE Trans. Geosci. Remote Sens. 2014, 52, 365–374. [CrossRef]
33. Cheriyadat, A.M. Unsupervised feature learning for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 52, 439–451. [CrossRef]
34. Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2175–2184. [CrossRef]
35. Shao, W.; Yang, W.; Xia, G.-S.; Liu, G. A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization. In Computer Vision Systems; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 324–333.
36. Chen, S.; Tian, Y. Pyramid of spatial relatons for scene-level land use classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1947–1957. [CrossRef]
37. Cvetković, S.; Stojanović, M.B.; Nikolić, S.V. Multi-channel descriptors and ensemble of extreme learning machines for classification of remote sensing images. Signal Process. 2015, 39, 111–120. [CrossRef]
38. Keiller, N.; Waner, O.; Jefersson, A.; Dos, S. Improving spatial feature representation from aerial scenes by using convolutional networks. In Proceedings of the SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 August 2015; pp. 44–51.
39. Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1793–1802. [CrossRef]
40. Keiller, N.; Otavio, P.; Jefersson, S. Towards better exploiting convolutional neural networks for remote sensing scene classification. arXiv 2016, arXiv:1602.01517. Available online: http://arxiv.org/abs/1602.01517.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).