Deep neural networks for learning classification features and generative models from synthetic aperture sonar big data

Johnny L. Chen and Jason E. Summers

Citation: Proc. Mtgs. Acoust. 29, 032001 (2016); doi: 10.1121/2.0000458
View online: http://dx.doi.org/10.1121/2.0000458
View Table of Contents: http://asa.scitation.org/toc/pma/29/1
Published by the Acoustical Society of America

Volume 29

http://acousticalsociety.org/

172nd Meeting of the Acoustical Society of America Honolulu, Hawaii 28 November - 02 December 2016

Interdisciplinary: Paper 5pID1

(Topical Meeting on Data Science and Acoustics II)

Deep neural networks for learning classification features and generative models from synthetic aperture sonar big data

Johnny L. Chen and Jason E. Summers
ARiA, Washington, DC; [email protected], [email protected]

Autonomous synthetic aperture sonar (SAS) imaging by unmanned underwater vehicles (UUVs) provides an abundance of high-resolution acoustic imagery useful for studying the seafloor and identifying targets of interest (e.g., unexploded ordnance or mines). Unaided manual processing is cumbersome because the amount of data gathered by UUVs can be enormous. Computer-vision and machine-learning techniques have helped to automate classification and object-recognition tasks, but they often rely on hand-built features that fail to generalize. Deep-learning algorithms, facilitated by the emergence of graphics-processing-unit (GPU) hardware and highly optimized neural-network implementations, have recently enabled great improvements in computer vision. Autoencoders allow for deep unsupervised learning of features based on a reconstruction objective. Here, we present unsupervised feature learning applied to seafloor classification of SAS images. Deep architectures are also capable of serving as generative models; we illustrate this with generative networks that produce realistic SAS images of different seafloor bottom types. Deep models allow us to construct algorithms that learn hierarchical and higher-order SAS features, which promise to improve automatic target recognition (ATR) and aid operators in processing the large data volumes generated by UUV-based SAS imaging. (Work supported by the Office of Naval Research.)

Published by the Acoustical Society of America
© 2017 Acoustical Society of America [DOI: 10.1121/2.0000458]
Proceedings of Meetings on Acoustics, Vol. 29, 032001 (2017)


1. INTRODUCTION

Remote sensing via synthetic aperture sonar (SAS) has provided high-resolution imagery for studying and understanding the seafloor. One important application of SAS is finding and identifying man-made objects such as unexploded ordnance (UXO) and underwater mines. Utilization of unmanned underwater vehicles (UUVs) for this purpose has produced an enormous amount of data for which manual analysis is prohibitive without computational aid. Although imagery of the seafloor is abundant, the majority of these data are unlabeled, both with respect to the items present in the image (e.g., clutter including rocks and anthropogenic objects such as crab traps) and with respect to the local seafloor characteristics (surface sediment type, morphology, and distribution). Moreover, there is a critical lack of labeled SAS imagery of real targets such as mines and UXO. This is a significant obstacle to the development and training of classification algorithms that generalize well, and algorithms developed from existing datasets often fail in new environments. We address this issue by investigating deep-learning methods for semi-supervised learning of deep representations and features for classification and generation (i.e., synthesis) of SAS imagery.

Deep learning has surpassed many classification benchmarks owing to the availability of large labeled datasets such as ImageNet.1 Annotations are far sparser in the SAS imaging domain, however: the realistic scenario is an abundance of unlabeled data and several orders of magnitude fewer labeled samples. Some in the field have addressed this issue by training classification algorithms on data synthesized from physics-based simulations; however, this approach is prone to errors because simulations cannot reproduce all features of, or fully model the physics driving, the SAS imaging process. Improving such methods requires understanding of the underlying physics, which is often unknown or cannot be captured in current state-of-the-art simulations. Learning which image or physical features are missing from simulations is therefore valuable, and one of our goals is to use deep learning to elucidate the properties missing from simulated data.

Currently, hand-crafted features are commonly used to identify targets in sonar imagery.2 These methods can fail to generalize in new “contexts” or environments, such as when an algorithm trained to identify targets on one type of seafloor is tested on a new type of seafloor, or when a new type of clutter is present. Deep networks for transfer learning may be able to address such scenarios, but the first priority is understanding how deep neural networks learn from SAS imagery.

Here we report the first application of deep convolutional neural networks for semi-supervised learning and generative modeling of SAS images. Building on prior work by Summers et al.,3 we describe preliminary empirical results in utilizing deep convolutional autoencoders (CAEs) for semi-supervised learning and classification, and we gain insight into the learned hierarchical features by visualizing the feature maps. As a test of our network, we use it for naïve segmentation of images by seafloor type using a sliding window. As an alternative to physics-based simulations for synthesizing objects of interest in varying contexts and environments, we investigate two recent architectures, generative adversarial networks (GANs) and style-transfer networks. We find that these networks generate synthetic SAS images that are subjectively among the most realistic produced to date.


2. METHODS

A. SYNTHETIC APERTURE SONAR DATA

Data acquisition is performed by the Small Synthetic Aperture Minehunter II (SSAMII) UUV system developed by the Naval Surface Warfare Center Panama City Division and the Applied Research Laboratory, Penn State University. SAS images are formed by coherent summation of acoustic signals from multiple pings as the UUV moves along-track. The received data are beamformed using proprietary software. Each sea-bottom image consists of 32 million pixels. Each image was normalized to [0, 1] and labeled as one of five classes by sea-bottom type.
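For illustration only, the following is a minimal sketch of the kind of preprocessing described here and in Section 2.B (normalization to [0, 1], tiling into 400x400 patches, and downsampling to the 32x32 network input size). It is not the pipeline used in this work; the function names, the use of scikit-image for resampling, and the single-channel handling are assumptions of the sketch.

```python
import numpy as np
from skimage.transform import resize  # resampling choice is an assumption of this sketch

def normalize_image(img):
    """Scale a beamformed SAS magnitude image to the range [0, 1]."""
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

def extract_patches(img, patch=400, out_size=32):
    """Tile a normalized image into non-overlapping patches and downsample
    each patch to the network input resolution (cf. Section 2.B)."""
    patches = []
    rows, cols = img.shape[:2]
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            p = img[r:r + patch, c:c + patch]
            patches.append(resize(p, (out_size, out_size), anti_aliasing=True))
    return np.stack(patches)
```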

B. NEURAL-NETWORK ARCHITECTURE AND TRAINING

i. Convolutional Autoencoder

To evaluate semi-supervised classification, we implemented a stacked CAE, as shown in Fig. 1, which was modeled after recent architectures for computer-vision tasks.4,5 Specifically, the encoder comprised repeated motifs of convolutional layers and subsampling/pooling layers, and the decoder performed up-sampling to return to the original resolution. Small convolutional filters (3x3) were used throughout all layers; stacking multiple convolutional and pooling layers allows for a larger receptive field as the network grows deeper. Each convolutional layer was followed by a rectified-linear (ReLU) activation, except the last layer, which used a sigmoid activation. To regularize the network and force the lower-dimensional layers to learn important features without overfitting, we used both Gaussian noise at the input (viz., a “denoising autoencoder”) and dropout6,7 in the hidden layers.

The first step was “pretraining” the network entirely unsupervised on a large set of unlabeled data. Although the network can be trained in a greedy layer-wise fashion,8 we trained the autoencoder end-to-end via reconstruction loss and backpropagation. Next was “fine-tuning,” in which the decoder was discarded and replaced with a softmax layer to output class probabilities, and the network was trained with a small subset of labeled data. Pretraining and fine-tuning were done with 32x32x3 image patches after whole SAS images were divided into 400x400 patches and downsampled. Pretraining was performed for 100 epochs with binary cross-entropy (log) loss and the gradient-descent Adaptive Moment Estimation (Adam) optimizer.9 Activation sparsity was enforced using an L1 penalty (1e-5) in the lowest-dimensional layer. Fine-tuning was performed for 200 epochs with multiclass cross-entropy (log) loss and the Adam optimizer. To estimate the number of labeled samples necessary per class label, we varied the number of labeled samples and tested against a separate validation dataset. For the training dataset, we balanced the number of samples per class. Validation was performed with a dataset of 200 images per class.
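A minimal Keras sketch of a stacked convolutional autoencoder of the kind described above (3x3 convolutions, 2x2 max-pooling and upsampling, Gaussian input noise, dropout, an L1 activity penalty at the bottleneck, Adam, and binary cross-entropy loss) is shown below. The filter counts and depth are assumptions of the sketch and do not reproduce the exact architecture of Fig. 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_cae(input_shape=(32, 32, 3), noise_std=0.1, drop=0.25):
    """Stacked convolutional autoencoder: conv/pool encoder, upsampling decoder."""
    inputs = layers.Input(shape=input_shape)
    x = layers.GaussianNoise(noise_std)(inputs)      # denoising-autoencoder input corruption
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(drop)(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(drop)(x)
    # Bottleneck with an L1 activity penalty to encourage sparse activations
    encoded = layers.Conv2D(16, 3, padding='same', activation='relu',
                            activity_regularizer=regularizers.l1(1e-5))(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(encoded)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.UpSampling2D(2)(x)
    outputs = layers.Conv2D(input_shape[-1], 3, padding='same', activation='sigmoid')(x)
    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, encoded)
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(),
                        loss='binary_crossentropy')
    return autoencoder, encoder

# Unsupervised pretraining on unlabeled patches x_unlabeled (N x 32 x 32 x 3, values in [0, 1]):
# autoencoder, encoder = build_cae()
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=100, batch_size=128)
```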

ii. Generative Adversarial Network

For synthesizing SAS images, we investigated generative adversarial networks (GANs).10,11 GANs consist of a deconvolutional generator network for generating images coupled to a discriminator network for distinguishing real from generated images. The generator’s goal is to generate realistic SAS images that will fool the discriminator. We implemented the same architecture as in Salimans et al.,10 including techniques for stabilizing training. Specifically, the generator comprised transposed convolutional + ReLU layers mapping a random noise vector, sampled from a uniform distribution on [0, 1], to an image. The discriminator comprised convolutional + leaky ReLU (slope = 0.2) layers and discriminated between generated and real data.

Figure 1: Architecture of the convolutional autoencoder for unsupervised learning. Convolutional layers are indicated by Conv-[number of filters]. Each reduction in size represents 2x2 max-pooling and each increase in size represents 2x2 upsampling.
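As an illustrative sketch only, a DCGAN-style generator/discriminator pair matching the description above (transposed convolutions with ReLU in the generator, convolutions with leaky ReLU of slope 0.2 in the discriminator, uniform input noise) could look as follows. The latent dimension, filter counts, and single-channel output are assumptions of the sketch, and the additional stabilization techniques of Salimans et al.10 are not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 100  # illustrative noise-vector length

def build_generator():
    """Map a uniform noise vector to a 32x32 image via transposed convolutions."""
    z = layers.Input(shape=(latent_dim,))
    x = layers.Dense(8 * 8 * 128, activation='relu')(z)
    x = layers.Reshape((8, 8, 128))(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(32, 4, strides=2, padding='same', activation='relu')(x)
    img = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return models.Model(z, img)

def build_discriminator():
    """Classify 32x32 patches as real or generated (leaky ReLU, slope 0.2)."""
    img = layers.Input(shape=(32, 32, 1))  # single-channel patches assumed for simplicity
    x = layers.Conv2D(32, 3, strides=2, padding='same')(img)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(64, 3, strides=2, padding='same')(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation='sigmoid')(x)
    return models.Model(img, out)

# Noise is drawn from a uniform distribution on [0, 1], as described in the text:
# z = tf.random.uniform((batch_size, latent_dim), 0.0, 1.0)
```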

iii. Style Transfer Networks

As another method of generating SAS images, we investigated transfer of SSAMII image “style” onto images of underwater target-like objects, using an ImageNet-pretrained VGG19 network12 as a feature extractor. Although these methods were developed mainly in the context of image processing for computer graphics,13–16 we adapted the approach for generating targets in different contexts and environments. We utilized manual semantic-map annotations16 to correctly transfer the style corresponding to target shadow and highlight. The images were generated via gradient descent using the latent representations of the VGG19 network after feeding forward an exemplar SAS image. A patch-based approach15,16 was used to preserve local distributions during image synthesis.
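The sketch below shows, for illustration only, the basic Gatys-style transfer loop13,14 using an ImageNet-pretrained VGG19 as a feature extractor: Gram-matrix style statistics plus a content term, optimized by gradient descent on the synthesized image. It does not reproduce the semantic-map conditioning or the patch-based matching of Refs. 15 and 16 used in this work; the layer choices, weights, and step counts are assumptions.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(weights='imagenet', include_top=False)
vgg.trainable = False
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1']  # illustrative choice
content_layer = 'block4_conv2'
extractor = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in style_layers + [content_layer]])

def gram(x):
    """Gram matrix of a feature map; captures texture (style) statistics."""
    x = tf.reshape(x, (-1, tf.shape(x)[-1]))
    return tf.matmul(x, x, transpose_a=True) / tf.cast(tf.shape(x)[0], tf.float32)

def features(img):
    """Return style Gram matrices and the content feature map for an image in [0, 1]."""
    feats = extractor(tf.keras.applications.vgg19.preprocess_input(img * 255.0))
    return [gram(f) for f in feats[:-1]], feats[-1]

def transfer(content_img, style_img, steps=200, style_weight=1e-2):
    """Gradient-descent synthesis of an image matching style and content features."""
    style_grams, _ = features(style_img)
    _, content_feat = features(content_img)
    synth = tf.Variable(content_img)
    opt = tf.keras.optimizers.Adam(learning_rate=0.02)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            grams, feat = features(synth)
            loss = style_weight * tf.add_n(
                [tf.reduce_mean((g - s) ** 2) for g, s in zip(grams, style_grams)])
            loss += tf.reduce_mean((feat - content_feat) ** 2)
        grad = tape.gradient(loss, synth)
        opt.apply_gradients([(grad, synth)])
        synth.assign(tf.clip_by_value(synth, 0.0, 1.0))
    return synth
```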

3. EXPERIMENTAL RESULTS

A. UNSUPERVISED PRETRAINING

During pretraining of the CAE, the reconstruction loss decreases and plateaus after 20 epochs (Fig. 2a). The quality of reconstruction was monitored during training: it visibly improves with more epochs, with more details being reconstructed by the decoder as training progresses (Fig. 2b). Overall, the dimensions were reduced 6-fold in the bottleneck of the CAE, and the decoder successfully reconstructed images from the lower-dimensional representation. We found that deeper networks with more pooling failed to converge and reconstruction failed. This could be due to loss of information/resolution, as max-pooling layers throw away information. Another possibility is internal covariate shift, which causes deep networks to train poorly because small errors in shallow layers can propagate and grow in deeper layers. This may be remedied by batch normalization to reduce internal covariate shift during training.17

Figure 2: (a) Reconstruction loss during training and (b) decoder reconstructions of input (top row) after one epoch (middle row) and 100 epochs (bottom row).

B. SUPERVISED FINE-TUNING

After unsupervised pretraining, the decoder was discarded and a softmax layer was placed on top of the last encoder layer, resulting in a classifying CNN. The CNN was fine-tuned with a small set of labeled data. We varied the number of labeled samples for training and found that pretraining significantly improved classification accuracy, especially when the number of labeled samples per class was ≤ 100 (Fig. 3a). For all cases, pretraining improved classification accuracy. Additionally, without any pretraining the network often failed to converge and needed to be reinitialized. Pretraining improves performance by learning a feature extractor in the encoder portion of the autoencoder (i.e., it learns to represent the distribution of the image data). One hypothesis is that this initializes the network at a more advantageous starting point,8 allowing convergence to a better minimum.

We found that the CNN performed best at classifying sand ripple and worst at classifying rocky sand (Fig. 3b). Our assumption that sea-bottom types are mutually exclusive, and our use of hard labels, may not be appropriate for this task. As seen in Fig. 4, patches can contain more than one class; for example, rocky sand is a mixture of the sand and rock classes. We are exploring the use of soft labeling (i.e., nonexclusive labels) for training to improve classification performance.

We evaluated the trained CNN on bottom-type segmentation by using a sliding window to feed patches into the CNN and assign class labels (Fig. 5). The CNN is able to segment the scene by bottom type with good visual accuracy. This method assumes each patch is conditionally independent of its neighbors, but in reality there are correlations between neighboring patches. Thus, we expect segmentation quality and accuracy to improve with recent deep architectures such as fully convolutional networks and those that utilize conditional random fields on top of deep representations.19,20 Owing to the lack of ground-truth labeled datasets in this domain, we have not done a rigorous hyperparameter search for optimal classification or segmentation accuracy.
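A sketch of the fine-tuning and sliding-window steps described above is given below, assuming the `encoder` object from the earlier autoencoder sketch. Flattening before the softmax layer, the five-class output, and the window/stride values are assumptions of the sketch rather than the exact configuration used here.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(encoder, num_classes=5):
    """Replace the decoder with a softmax head on top of the pretrained encoder."""
    x = layers.Flatten()(encoder.output)
    out = layers.Dense(num_classes, activation='softmax')(x)
    clf = models.Model(encoder.input, out)
    clf.compile(optimizer=tf.keras.optimizers.Adam(),
                loss='categorical_crossentropy', metrics=['accuracy'])
    return clf

# Fine-tuning on a small labeled subset (one-hot labels y_labeled):
# clf = build_classifier(encoder)
# clf.fit(x_labeled, y_labeled, epochs=200, batch_size=64,
#         validation_data=(x_val, y_val))

def segment(image, clf, window=32, stride=32):
    """Naive patchwise segmentation: slide a window and classify each patch independently."""
    rows, cols = image.shape[:2]
    rs = range(0, rows - window + 1, stride)
    cs = range(0, cols - window + 1, stride)
    labels = np.zeros((len(rs), len(cs)), dtype=np.int32)
    for i, r in enumerate(rs):
        for j, c in enumerate(cs):
            patch = image[r:r + window, c:c + window][np.newaxis]
            labels[i, j] = int(np.argmax(clf.predict(patch, verbose=0)))
    return labels
```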


Figure 3: (a) Validation accuracy as a function of number of labeled samples per class with unsupervised pretraining and without pretraining (random initialization of network). Error bars represent +/- one standard deviation. (b) Confusion matrix for the validation dataset.

Figure 4: Two-dimensional t-SNE18 embedding of feature vectors of image patches from the validation set. There are five distinct clusters, one corresponding to each sea-bottom class.
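An embedding like that of Fig. 4 can be produced, for example, by flattening the encoder's bottleneck activations and applying scikit-learn's t-SNE; this is an illustrative sketch, and the perplexity and initialization are assumptions rather than the settings used here.

```python
from sklearn.manifold import TSNE

# Flatten the encoder's bottleneck activations for the validation patches
# (assumes 'encoder' and 'x_val' from the sketches above).
feats = encoder.predict(x_val).reshape(len(x_val), -1)
emb = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(feats)
# emb[:, 0], emb[:, 1] can then be scatter-plotted, colored by sea-bottom class.
```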


Figure 5: Segmentation of SAS images patchwise using the fine-tuned CNN. A sliding window is used to feed patches into the CNN and the class probability is converted to a color coding corresponding to the legend.


Figure 6: Input images (a,d) and their corresponding feature maps after ReLU (b,c,e,f). (b,e) Feature maps for a representative node in layer Conv-256. (c,f) Feature maps from one node in layer Conv-16 activated on ripple.


Figure 7: Feature maps for all nodes in Conv-16 for different sand-ripple orientations. Node 16 (bottom) is activated on ripple of a particular orientation, as the activations disappear when the image patch is flipped.

C. VISUALIZING LEARNED FEATURES

In order to understand the learned features of the network, we visualized the feature maps after feeding whole SAS images forward through the network. As expected, there was indication of hierarchical features for these SAS images (Fig. 6). In the lower layers, the nodes appear to be activated by local features such as image intensity (Fig. 6b,e). Activations become more class-specific in deeper layers; for example, we found nodes in Conv-16 that are highly activated by sand ripple (Fig. 6f) and do not activate on sandy ridges (Fig. 6c). Focusing on Conv-16 for the case of ripple (Fig. 7), we observed that multiple nodes activated on ripple, suggesting that there was co-adaptation.6 Some nodes were activated by ripple imaged at a particular orientation, and thus when we flipped or rotated the image patch, the activation disappeared. Using these visualization methods, we did not find any explicit object detectors (viz., for targets or clutter). This may be because bottom types are more texture-like and lack specific objects or object parts. It is also possible that co-adaptation was preventing formation of more object-like detectors. We are investigating ways of regularizing and enforcing sparsity to prevent co-adapting features, for example using networks such as winner-take-all autoencoders.21
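As a sketch of how such feature maps can be read out, one can build a sub-model up to a named intermediate layer and feed images forward; the layer name below is illustrative, not the actual name of the Conv-256 or Conv-16 layer in our network.

```python
import tensorflow as tf

def feature_maps(model, layer_name, images):
    """Return activations of one convolutional layer for a batch of images."""
    probe = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    return probe.predict(images)

# Example (layer name is hypothetical):
# maps = feature_maps(clf, 'conv2d_2', image_patches)
# maps[..., k] is the feature map of node (filter) k and can be displayed as an image.
```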

D. GENERATIVE ADVERSARIAL NETWORKS

Training a conventional generative adversarial network (GAN) proved difficult, as one network often overpowered the other and training failed to converge. The generative network may also collapse onto a single image that continuously fools the discriminator, so that the parameters collapse and the network generates repeated images. We found that the techniques described by Salimans et al.10 stabilized training and helped generate more plausible images. The GAN had trouble representing unbalanced datasets, which was a challenge for this dataset because it contained several-fold more of the most frequently occurring class (flat sand) than of the least frequently occurring class (sand ripple). To address this, we trained on sandy ridges and ripple. The GAN successfully captured important physical features such as bifurcations in sand ripple and shadow. To ensure that the GAN was not memorizing examples from the training data, we calculated the L2 distance pixel-wise and feature-wise (after feeding training examples forward through the fine-tuned CNN classifier) between generated samples and the training samples. We plotted the ten closest images according to these two metrics and found that the network does not copy training samples (Fig. 8). This is a promising step towards semi-supervised training of generative models, as understanding the learned features may greatly help in classification and underwater target recognition.

Figure 8: Synthesized GAN images [first column in both subfigure (a) and subfigure (b)] and closest patches from the training dataset via L2 distance calculated pixel-wise (a) or feature-wise (b).
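A sketch of this memorization check is shown below: pairwise L2 distances between generated samples and training patches, computed either on raw pixels or in a feature space. Using the penultimate layer of the fine-tuned classifier for the feature-wise distance is an assumption of the sketch; the text does not specify which layer was used.

```python
import numpy as np
import tensorflow as tf

def closest_training_patches(generated, training, k=10, feature_model=None):
    """Indices of the k nearest training patches to each generated sample,
    by L2 distance computed pixel-wise or in a feature space."""
    if feature_model is not None:
        g = feature_model.predict(generated).reshape(len(generated), -1)
        t = feature_model.predict(training).reshape(len(training), -1)
    else:
        g = generated.reshape(len(generated), -1)
        t = training.reshape(len(training), -1)
    # Pairwise squared L2 distances between generated and training vectors
    d2 = (np.sum(g ** 2, axis=1, keepdims=True)
          - 2.0 * g @ t.T
          + np.sum(t ** 2, axis=1))
    return np.argsort(d2, axis=1)[:, :k]

# Feature-wise variant using an intermediate layer of the fine-tuned CNN (choice is illustrative):
# feat = tf.keras.Model(clf.input, clf.layers[-2].output)
# idx = closest_training_patches(gen_samples, train_patches, feature_model=feat)
```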

E. STYLE TRANSFER NETWORKS

The VGG19-based network was able to perform style transfer to synthesize realistic SAS textures such as ripple and rocky terrain. When images were heterogeneous and contained transition boundaries, however, the conventional style-transfer network did not capture this global information and failed to reproduce the boundaries. Building on recent improvements achieved by adding a semantic map that conditions the network to maintain semantically relevant image boundaries,16 we were able to transfer SAS textures onto images of simulated targets and maintain boundaries between textures (Fig. 9).

To investigate the capacity of the network to represent important physical features such as shadow orientation, we generated image analogies given semantic annotations only of target highlights and background environment. We found that for distributions close to the exemplar image (Fig. 10, top row), the network synthesized the correct directionality of the shadow (to the left of the highlight). However, the shadow shape/length was more variable, as it depends on grazing angle and target height, knowledge of which was not incorporated into the network. We believe that incorporating this side information will improve the quality and physical accuracy of the synthesis. For highlights of different orientation, the network failed to synthesize the shadow in some cases (Fig. 10, bottom two rows).

Figure 9: Using semantic maps, a SAS image was used to synthesize a simulated target in a rocky environment.

Figure 10: We investigated the network’s ability to infer shadow given only annotations of the highlight (yellow) in different orientations and the background (blue).

4. CONCLUSION

We have demonstrated the first application of deep convolutional neural networks for semi-supervised learning and generative modeling of SAS images. Significant improvement in classification accuracy was achieved by pretraining a convolutional autoencoder unsupervised, followed by fine-tuning with labeled data. This is particularly valuable in the context of big data with very few labels, such as SAS imagery gathered by UUVs. Pretraining allows us to utilize the whole dataset for learning the important structure of the data, which will be important in developing systems for detection of underwater targets such as mines.

Next, we pursued training generative models to synthesize SAS imagery. We found that GANs were able to generate realistic SAS images. Generation was unconditional, but it is possible to condition the generation to produce class-specific images.22 Style-transfer networks trained on ImageNet proved sufficient as feature extractors and were able to perform high-quality transfers of SSAMII textures and bottom types such as ripple and rocky terrain. Our goal is to generate underwater targets in variable contexts (environments and operating conditions), as target data are often sparse in this image domain. Unsupervised learning and generative models will elucidate which image features are necessary for improved target recognition.

These preliminary results warrant further investigation into deep-learning approaches for developing computer-aided systems for automated target recognition. Creation of a reliable benchmark is required for development and testing of new algorithms, and thus one necessary effort is to create a ground-truth labeled dataset. This work is a first step in utilizing deep learning to help us understand why current algorithms fail and to help identify deficiencies in current physics-based simulations.

ACKNOWLEDGMENTS

This work was supported by the Office of Naval Research. The authors acknowledge Shawn Johnson (ARL/PSU) for selection and preparation (beamforming) of the SAS data and for technical discussions on seafloor classification.

REFERENCES

1. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

2. Scott Reed, Yvan Petillot, and Judith Bell. An automatic approach to the detection and extraction of mine features in sidescan sonar. IEEE Journal of Oceanic Engineering, 28(1):90–105, 2003.

3. Jason E. Summers, Timothy C. Havens, and Thomas K. Meyer. Learning environmentally dependent feature representations for classification of objects on or buried in the seafloor (A). Journal of the Acoustical Society of America, 135(4):2296–2297, 2014.

4. Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.

5. Jie Geng, Jianchao Fan, Hongyu Wang, Xiaorui Ma, Baoming Li, and Fuliang Chen. High-resolution SAR image classification via deep convolutional autoencoders. IEEE Geoscience and Remote Sensing Letters, 12(11):2351–2355, 2015.

6. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

7. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929–1958, 2014.

8. Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.

9. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

10. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.

11. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

12. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

13. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

14. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

15. Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. arXiv preprint arXiv:1601.04589, 2016.

16. Alex J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.

17. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

18. Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

19. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

20. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.

21. Alireza Makhzani and Brendan J. Frey. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, pages 2791–2799, 2015.

22. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
