Master’s Thesis

Semantic Image Segmentation Combining Visible and Near-Infrared Channels with Depth Information

submitted to

Department of Computer Science
Bonn-Rhein-Sieg University of Applied Sciences
in partial fulfillment of the requirement for the Master of Science degree in Computer Science

by

Maurice Velte, B.Sc.

First Examiner: Prof. Dr. Norbert Jung
Second Examiner: Prof. Dr. Wolfgang Heiden
Supervisor: Holger Steiner, M.Sc.

15th June, 2015

Statement of Originality / Eidesstattliche Erklärung

I hereby declare that the thesis at hand is my own work. Information directly or indirectly derived from work of others has been acknowledged in the text. The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution.

Hiermit erkläre ich an Eides statt, dass ich die vorliegende Arbeit selbstständig angefertigt habe. Die aus fremden Quellen wörtlich oder sinngemäß übernommenen Textstellen wurden als solche kenntlich gemacht. Die Arbeit wurde bisher in gleicher oder ähnlicher Form noch keiner anderen Prüfungsbehörde vorgelegt.

Bonn, 15th June, 2015

Maurice Velte

Acknowledgements

Foremost, I would like to express my gratitude to my examiners Prof. Dr. Norbert Jung and Prof. Dr. Wolfgang Heiden as well as my advisor Holger Steiner for their support and advice and for giving me the opportunity to pursue this exciting research project in my master's thesis. I would also like to thank all my colleagues at the ISF for their help, especially Sebastian Sporrer, who designed the camera rig prototype and helped me to capture my dataset, and Tobias Scheer, who set up the labeling tool used to create my ground truth data. My special thanks go to David Scherfgen for his invaluable proof reading skills and to Alexander Dieckmann for the many insightful discussions on computer vision, machine learning and thesis writing in general. Finally, and most importantly, I want to thank Tintu Mathew, for everything.

Abstract

Image understanding is a vital task in computer vision that has many applications in areas such as robotics, surveillance and the automobile industry. An important precondition for image understanding is semantic image segmentation, i.e. the correct labeling of every image pixel with its corresponding object name or class. This thesis proposes a machine learning approach for semantic image segmentation that uses images from a multi-modal camera rig. It demonstrates that semantic segmentation can be improved by combining different image types as inputs to a convolutional neural network (CNN), when compared to a single-image approach. In this work a multi-channel near-infrared (NIR) image, an RGB image and a depth map are used. The detection of people is further improved by using a skin image that indicates the presence of human skin in the scene and is computed based on NIR information. It is also shown that segmentation accuracy can be enhanced by using a class voting method based on a superpixel pre-segmentation. Models are trained for 10-class, 3-class and binary classification tasks using an original dataset. Compared to the NIR-only approach, average class accuracy is increased by 7% for 10-class, and by 22% for 3-class classification, reaching a total of 48% and 70% accuracy, respectively. The binary classification task, which focuses on the detection of people, achieves a classification accuracy of 95% and true positive rate of 66%. The report at hand describes the proposed approach and the encountered challenges and shows that a CNN can successfully learn and combine features from multi-modal image sets and use them to predict scene labeling.

List of Figures

1. Scene parsing system of Farabet et al.
2. CNN scene parsing results from RGBD
3. Prism-based multi-spectral camera
4. Non-linear activation functions
5. LeNet-5 network architecture
6. Pixel context patches of all scales
7. System overview of presented semantic segmentation approach
8. Blooming on NIR images
9. RGB-to-NIR image registration
10. Depth map reprojection
11. Sharpness vs. distance for 8 mm and 50 mm lenses
12. Epipolar geometry
13. Comparison of depth from Kinect and from cross-spectral stereo
14. Pre-processing pipeline
15. Learned features and filter responses
16. Input layer of CNN
17. First convolution layer of CNN
18. Second convolution layer of CNN
19. Third convolution layer of CNN
20. Multi-scale convolutional neural network
21. Superpixel class voting
22. Loss and accuracy during training for 3- and 10-class models
23. Pixel- and average class-accuracy (10 classes)
24. Per-class accuracy (10 classes)
25. Pixel- and average class-accuracy (3 classes)
26. Per-class accuracy (3 classes)
27. Accuracy and true positive rate etc. (human detection)
28. Examples for predicted semantic image segmentation
29. Accuracy for 3-class and 10-class classification models - alternative plot
30. Pinhole camera model
31. Real optical system model
32. NIR image before and after undistortion
33. Experimental camera rig
34. Images before and after pre-processing
35. More examples of filter responses
36. Matches of HOG descriptors
37. Problematic image - no human detection
38. Convolution layers of the CNN
39. Loss and accuracy during training for human detection models

List of Tables

1. All input modalities for CNN models
2. Class label ground truths
3. Accuracies (10-class)
4. Accuracies (3-class)
5. Accuracy, TP, FN, TN and FP rates for human detection
6. Accuracies with class voting (10-class)
7. Per-class accuracies (10-class)
8. All accuracies (pixel, class, per-class) with class voting (3-class)

List of Equations

1. Single-layer perceptron
2. Logistic sigmoid
3. Hyperbolic tangent
4. Rectifier linear unit (ReLU)
5. Neuron activation (output value)
6. Weight matrix
7. Neuron activation (output value) in vectorized form
8. Softmax function
9. Softmax derivatives
10. Logistic loss (cross-entropy loss)
11. Derivative of multinomial logistic loss
12. Hinge loss
13. Loss function with L2 regularization
14. Loss in output layer of MLP
15. Loss in output layer of MLP in vectorized form
16. Loss in a layer in terms of loss of next layer
17. Gradient of loss with respect to biases
18. Gradient of loss with respect to weights
19. Gradient descent rule (generalized)
20. Gradient descent weight rule
21. Gradient descent bias rule
22. Stochastic gradient descent weight rule
23. Stochastic gradient descent bias rule
24. Stochastic gradient descent weight rule with L2 regularization
25. Convolution filter kernel
26. Convolution operation
27. Downscaling factor for RGB images
28. Crop rectangle for RGB image
29. Projection error for homography matrix estimation
30. Projection of depth map to 3D space
31. Projection from 3D space into 2D space
32. Vector L2 unit normalization
33. Normalized differences for skin classification
34. Image normalization to zero mean and unit variance
35. Running average
36. Standard deviation for running average
37. Per-pixel accuracy for multi-class classification
38. Per-class accuracy for multi-class classification
39. Relationship between 3D world and 2D image coordinates
40. Camera matrix
41. Screen projection
42. Approximate correction of radial distortion
43. Approximate correction of tangential distortion
44. Image undistortion (correction of radial and tangential distortion)
45. Rotation matrices around x-, y- and z-axis
46. Translation matrix
47. Planar homography
48. Homography matrix in column-wise notation
49. First constraint for solving camera calibration
50. Second constraint for solving camera calibration
51. Ideal world-to-image point transformation with zero distortion
52. Solving for image distortion coefficients
53. Rotation and translation between cameras
54. Proposed separation of luma and "chroma" components for NIR images

List of Algorithms

1. Back-propagation of errors
2. Pre-processing of RGBDNIR images


Abbreviations

ANN  artificial neural network
CA  chromatic aberration
CCD  charge coupled device
CNN  convolutional neural network
CRF  conditional random field
DFD  depth from defocus
DFF  depth from focus
DNIRS  Depth + NIR + skin
DNIR  Depth + NIR
FeGeb  fake detection for face biometrics (Fälschungserkennung für die Gesichtsbiometrie)
FPA  focal plane array
FPGA  field-programmable gate array
GigE  gigabit ethernet
GPU  graphics processing unit
HOG  histogram of oriented gradients
InGaAs  indium gallium arsenide
IR  infrared
ISF  Safety and Security Research Institute
LED  light-emitting diode
MLP  multi-layer perceptron
MRF  Markov random field
NIR  near-infrared
PoE  power-over-ethernet
RANSAC  random sample consensus
ReLU  rectifier linear unit
RGBDNIRS  RGB + Depth + NIR + skin
RGBDNIR  RGB + Depth + NIR
RGBD  RGB + Depth
RGBNIR  RGB + NIR
RGB  RGB color model - red, green and blue
RNN  recurrent neural network
SGD  stochastic gradient descent
SGM  semi-global matching
SIFT  scale-invariant feature transform
SLP  single-layer perceptron
SPAI  safe person recognition in the working area of industrial robots
SVM  support vector machine
TOF  time-of-flight

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Overview of Basic Principles
    1.3.1 Convolutional Neural Networks (CNNs)
    1.3.2 Superpixels
    1.3.3 Depth Estimation
  1.4 Related Work & State of the Art
    1.4.1 Semantic Segmentation with CNNs
    1.4.2 Semantic Segmentation using RGB and Depth
    1.4.3 Semantic Segmentation using RGB and NIR
    1.4.4 Human Detection
  1.5 Scope of Research

2 Basic Principles
  2.1 Multi-Spectral Image Acquisition with NIR Camera
  2.2 Camera Calibration
  2.3 Artificial Neural Networks
    2.3.1 Multi-Layer Perceptron
    2.3.2 Loss Function
    2.3.3 Back-Propagation
    2.3.4 Gradient Descent
    2.3.5 Convolutional Neural Networks
    2.3.6 CNNs for Semantic Segmentation

3 Approach for Semantic Multi-Modal Image Segmentation
  3.1 System Overview
  3.2 Problems Occurring in Captured Images
  3.3 Camera Calibration
  3.4 Image Registration
    3.4.1 Image Registration for RGB Images
    3.4.2 Image Registration for Depth Maps
  3.5 Depth Estimation
    3.5.1 Depth from Chromatic Aberration
    3.5.2 Cross-Spectral Stereo Matching
  3.6 Pre-Processing Pipeline
    3.6.1 Labeling for Ground Truth
  3.7 Multi-Scale CNN for Semantic Segmentation
    3.7.1 Preparation of Input Data
    3.7.2 Network Architecture
  3.8 Post-Processing: Class Voting in Superpixels

4 Evaluation
  4.1 Training Analysis
  4.2 Comparison of Trained Models
    4.2.1 Models for 10-Class Classification
    4.2.2 Models for 3-Class Classification
    4.2.3 Human Detection
  4.3 Discussion

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

A Camera Calibration
  A.1 Geometry of Lens and Optical Systems
  A.2 Planar Homography
  A.3 Estimation of Calibration Parameters
  A.4 Stereo Calibration and Rectification

B Experimental Camera Setup

C Additional Figures, Tables and Equations

D CD Contents

1. Introduction

1.1. Motivation

In computer vision, the recognition of objects in a digital image or video sequence is of crucial importance and has a wide range of applications, such as robot localization and visual positioning, human detection in surveillance systems, industrial manufacturing supervision and quality control, automotive parking and driving aid systems, etc. The correct recognition of objects in an observed scene is also known as "image understanding". While humans easily localize and identify hundreds of different objects in their visual field and on images, it is still a difficult task within computer vision, because image information can be ambiguous. The projection of an object can come from an infinite number of image configurations, with varying orientations and/or viewpoints causing different two-dimensional images on the sensor (which can be the camera sensor or the retina of the human eye). The object can also be partially occluded, incomplete (i.e. missing a part) and observed under different illumination conditions, causing further variations in the resulting image. (Biederman, 1987)

One important step towards image understanding for computer systems is semantic (image) segmentation, also known as scene parsing, scene labeling or object-class segmentation. In this work the terms are used interchangeably. It is the task of sectioning a digital image into disjunct labeled regions that delineate objects, so that every pixel in the image is labeled with the class of the object to which it belongs (transparency is generally ignored). This also implies that neighboring pixels within segment boundaries should belong to the same class. Scene parsing is a very challenging task, since it includes solving the problems of detection, segmentation and multi-class recognition in one process. Short- and long-range information in the image has to be included in the method, because the class of a pixel can depend on its more immediate surroundings (e.g. a car tire often indicates the presence of a car in the image) as well as on more global information (e.g. a green pixel can only be correctly classified as belonging to a meadow, a tree or a green car using a large contextual window). (Farabet et al., 2013; Pinheiro and Collobert, 2014)

Semantic segmentation can be improved by using images of different modalities. While humans can only perceive the visible part of the electromagnetic spectrum, machines are not limited to that: they can obtain data from infrared (IR) cameras, near-infrared (NIR) cameras or other sensor devices. The reflective properties of materials can vary greatly along the spectrum, so that features present in one sensor image may be absent in another.


Thus, images from different sensors are not globally, and often not even statistically, correlated (Irani and Anandan, 1998). This also means that other image types may contain additional information, i.e. objects whose boundaries and features may be easier to distinguish in one spectral range than in another. Depth perception, on the other hand, is natural to most binocular creatures and is crucial to the way humans and animals perceive and interact with their surroundings. Depth information can be obtained in a number of ways (mentioned below) and yields valuable cues for segmentation. For instance, object boundaries may be easier to detect in a depth map than in a color image, due to color ambiguity or adverse lighting conditions. In summary, images of varied modalities yield different information and can thus be expected to improve semantic segmentation results when combined. The main goal of this thesis is to quantify this improvement for different combinations of image types.

Traditional approaches for semantic segmentation typically consist of creating a segmentation hypothesis, i.e. segmenting the image into superpixels, in order to guarantee visual consistency and account for similarities of neighboring segments, and of using engineered feature detectors (e.g. SIFT (Lowe, 1999)) to encode these segments. A graphical model such as a conditional random field (CRF) or a Markov random field (MRF) is then trained with the extracted features in order to ensure global consistency while maximizing the likelihood of correct classification. The principal drawback of these approaches is their high computational cost.

In this thesis a deep learning approach is employed instead. More specifically, a convolutional neural network (CNN) is trained to perform scene parsing. A CNN allows the supervised learning of feature extractors in the form of convolution filters, which tend to generalize better than engineered ones[1] and may represent previously unknown relations in the input data. In order to improve the capability of the network to model global relationships in the scene, a multi-scale CNN using spatial pyramids is employed, as has been proposed by Farabet et al. (2013) and Schulz and Behnke (2012). These works also show that CNNs can be significantly faster than competing methods while achieving state-of-the-art accuracy. Furthermore, CNNs can be efficiently implemented in hardware using field-programmable gate arrays (FPGAs)[2] (Omondi and Rajapakse, 2006; Zhu and Sutton, 2003). Such custom hardware implementations can speed up the performance even more and make real-time semantic segmentation possible (Farabet et al., 2010).

[1] Engineered features rely on prior knowledge of the data and can therefore only model relations in the data that are understood by the human engineer.
[2] FPGAs are also energy-efficient, which is advantageous for mobile and autonomous applications.


1.2. Problem Statement

This thesis contributes to the research on multi-modal semantic image segmentation of indoor scenes using a convolutional neural network, with a multi-channel NIR camera as the primary image source and an RGB and a depth camera as secondary image sources. Scene parsing involves classifying all pixels in an image into multiple classes and applying post-processing steps to improve the prediction. This problem can be divided into the following subtasks:

Capture of a multi-modal image dataset: First a suitable training and testing dataset has to be obtained. This is done by capturing sets of multi-modal images with an experimental rig featuring multiple cameras that are calibrated in a separate step. The image sets consist of NIR and RGB images plus a depth map from a structured light sensor. For the sake of brevity such a set will be called an RGBDNIR image below. Acronyms for other combinations of image types follow the same convention.[3]

Image pre-processing: In a pre-processing step the captured RGBDNIR images have to be undistorted and rectified using intrinsic and extrinsic camera parameters obtained during calibration. Depth and RGB images have to be mapped to the coordinate system of the NIR camera in order to achieve maximal overlap of the scene objects. Additionally, two alternative depth estimation methods are tested: firstly, a cross-spectral stereo matching method is implemented, which uses images from two different parts of the electromagnetic spectrum, e.g. NIR and RGB images. Secondly, the applicability of a depth from focus method from a previous research project of the author is investigated. The quality of the resulting depth maps is analyzed and it is determined whether one of these methods can replace the need for a depth camera. Furthermore, a binary skin image is computed, which indicates image areas in which human skin has been detected, as described in Schwaneberg et al. (2013) and Steiner et al. (2015). The complete set with all four image types will be called an RGBDNIRS image.

Training of convolutional neural network models: In the next step variations of a multi-scale CNN will be trained using varying subsets of the pre-processed image sets, generating a number of models which are to be compared during evaluation. The goal is to train six models with different input modalities, see table 1 below. All models are trained for three classification tasks, namely 10-class and 3-class classification as well as human detection (where the only two classes are person and other).

[3] Following the current naming trend in the field of computer vision and robotics, the acronym begins with "RGB", such as in RGBD (RGB + Depth), although the primary image source in this work is the NIR camera.
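To make the kind of undistortion and registration operation described above concrete, the following is a minimal Python/OpenCV sketch. All calibration matrices, distortion coefficients and the RGB-to-NIR homography are placeholders (identity/zero values); the actual pipeline, including the reprojection of the depth map, is described in sections 3.4 and 3.6.

    import cv2
    import numpy as np

    # Placeholder intrinsics and distortion coefficients from the calibration step.
    K_nir, dist_nir = np.eye(3), np.zeros(5)
    K_rgb, dist_rgb = np.eye(3), np.zeros(5)
    H_rgb_to_nir = np.eye(3)   # homography estimated from cross-spectral correspondences

    def preprocess(nir_raw, rgb_raw):
        # 1. Undistort each image with its own camera parameters.
        nir = cv2.undistort(nir_raw, K_nir, dist_nir)
        rgb = cv2.undistort(rgb_raw, K_rgb, dist_rgb)
        # 2. Warp the RGB image into the NIR camera's image plane so that
        #    scene objects overlap as well as possible.
        h, w = nir.shape[:2]
        rgb_registered = cv2.warpPerspective(rgb, H_rgb_to_nir, (w, h))
        return nir, rgb_registered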


  NIR        near-infrared
  DNIR       Depth + NIR
  DNIRS      Depth + NIR + skin
  RGBNIR     RGB + NIR
  RGBDNIR    RGB + Depth + NIR
  RGBDNIRS   RGB + Depth + NIR + skin

Table 1: All input modalities for CNN models

Post-processing: The post-processing step will attempt to improve the scene parsing result by applying a number of fast intensity-based segmentation methods and performing superpixel class voting (see section 3.8) on the dense pixel classification.

Evaluation: Finally, the goal of the evaluation is to determine whether and how much the use of the additional information (RGB, depth and binary skin image) can help to improve the semantic segmentation result compared to NIR-only segmentation. This is achieved by investigating the performance of the different models in terms of pixel and class accuracy, for the three classification tasks. Improvement through post-processing is also evaluated. Furthermore, training loss and accuracy curves of the models are compared.
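As a rough illustration of the superpixel class voting mentioned in the post-processing step, the sketch below assigns to every pixel of a superpixel the most frequent predicted class inside that superpixel. Array names are hypothetical; the actual method and the segmentation algorithms used are described in section 3.8.

    import numpy as np

    def superpixel_class_voting(pred, superpixels):
        # pred: (H, W) array of predicted class ids.
        # superpixels: (H, W) array of superpixel ids from a pre-segmentation.
        voted = np.empty_like(pred)
        for sp_id in np.unique(superpixels):
            mask = superpixels == sp_id
            # Majority vote: most frequent predicted class inside this superpixel.
            classes, counts = np.unique(pred[mask], return_counts=True)
            voted[mask] = classes[np.argmax(counts)]
        return voted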

1.3. Overview of Basic Principles

This section gives an overview of basic principles which are of relevance to the related work mentioned below, but does not go into much detail for all of them. A more thorough explanation of CNNs and artificial neural networks (ANNs) in general follows in section 2 alongside other basic principles. More details on depth estimation approaches that have been implemented and tested in the course of this work follow in section 3. Please note that whenever the term image is used, a digital image is meant, which can be thought of in terms of a pixel grid with rows and columns.

1.3.1. Convolutional Neural Networks (CNNs)

A convolutional neural network is a type of ANN. CNNs were first introduced by Fukushima (1980) and have since been continuously improved, arguably most notably in LeCun et al. (1989)'s work on handwritten character recognition and his LeNet-5 convolutional network (LeCun et al., 1998). CNNs have since shown state-of-the-art performance in various other image-related areas such as face recognition and image classification. In recent times CNNs have outperformed other methods on important benchmark datasets such as PASCAL VOC and ImageNet (Deng et al., 2009; Everingham et al., 2010). This improvement is due to the availability of large training datasets and the practicality of large models with millions of trainable parameters through powerful GPU implementations (Zeiler and Fergus, 2014). The ImageNet Large Scale Visual Recognition Challenge, which has been held annually since 2010, saw almost exclusively CNN-based submissions in 2014, the winner of the image classification challenge being GoogLeNet (Szegedy et al., 2014), a 22-layer deep network. GoogLeNet has been tested against human test subjects in an extensive experiment, and it was shown that trained human annotators performed better than the network only by a small (≈ 2%) margin (Russakovsky et al., 2014).

In the words of LeCun et al. (2010), a CNN is a "biologically-inspired trainable architecture that can learn invariant features". It consists of multiple stages, thus enabling it to learn multi-level feature hierarchies. Each stage but the last is composed of a convolution filter bank layer (thus the term convolutional), a non-linearity layer and a feature pooling layer, whose task is to guarantee robustness against small feature translations by averaging and downsampling. The last stage generally consists of a pure filter bank layer which is fed into a fully connected two-layer classifier, i.e. a multi-layer perceptron (MLP). The biological inspiration comes from the work of Hubel and Wiesel (1962), which investigates a cat's primary visual cortex. In this work the authors identify two important types of cells: simple cells with local receptive fields, whose function inspired the concept of filter banks in a CNN, and complex cells, whose function is similar to the pooling layer. (LeCun et al., 2010)

The training of CNN models is supervised, i.e. the training and testing datasets consist of labeled images, so that the input value is the image array and the desired output is the label or class of the image. By extending this concept, one can also perform semantic segmentation on densely labeled images. Instead of the whole image, pixel context patches, i.e. small local windows around the currently observed pixel, are used as input to the network together with the pixel's class. The CNN is trained to classify these patches, thereby ultimately learning pixel-wise classification. This means that the predicted label for each context patch corresponds to the label of the central pixel of the patch. Works which apply this concept are mentioned in section 1.4.1, and a more detailed explanation of its functionality follows in section 2.3.5.
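The patch-based labeling idea can be sketched as follows. The classify_patch function stands in for a trained CNN and the patch size is arbitrary here; the actual multi-scale network and its input preparation are described in sections 2.3.5 and 3.7.

    import numpy as np

    def dense_labeling(image, classify_patch, patch_size=46):
        # Label every pixel by classifying the context patch centred on it.
        # image is a (H, W, C) multi-channel array; classify_patch returns a class id.
        r = patch_size // 2
        h, w = image.shape[:2]
        # Pad so that border pixels also get a full context window.
        padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
        labels = np.zeros((h, w), dtype=np.int64)
        for y in range(h):
            for x in range(w):
                patch = padded[y:y + patch_size, x:x + patch_size]
                labels[y, x] = classify_patch(patch)  # class of the central pixel
        return labels

In practice the per-pixel loop is of course folded into batched network evaluations; the sketch only illustrates the correspondence between context patches and central-pixel labels.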


1.3.2. Superpixels

Image segmentation aims at creating a compact representation of the image which emphasizes "interesting" properties by grouping similar pixels. A wide variety of segmentation techniques exist, all of which use some model of similarity. Naturally, i.e. for the human eye, this pixel similarity is defined by color homogeneity and affiliation with the same object, and the goal of segmentation is precisely to isolate objects or parts of objects from the rest of the image, thereby creating meaningful patches of similar pixels, also called superpixels.[4] Superpixels can help to simplify methods, improve results and reduce processing time in a number of ways, e.g. an object classifier might only consider boxes around superpixels. (Bradski and Kaehler, 2008, Chap. 9) (Forsyth and Ponce, 2002, Chap. 15)

In our case the semantic segmentation of the image is improved by class voting in superpixels, i.e. identifying the class prediction with the highest occurrence in each superpixel and setting all its pixels to this class. Two methods are tested: the one proposed by Felzenszwalb and Huttenlocher (2004), as well as the SEEDS algorithm (Van den Bergh et al., 2012).

[4] Note that image segmentation per se does not produce labeled superpixels. While pixels in a superpixel definitely share similarities, object classification is yet a further step.

1.3.3. Depth Estimation

The depth of image points, i.e. their distance from the camera, is of great importance for a number of computer vision problems. Most importantly, a depth map is necessary to reconstruct a three-dimensional model of the observed scene, but it can also help to better identify object boundaries and therefore improve segmentation. There are a number of methods for depth estimation and this section gives an overview of the ones relevant to the thesis at hand.

Stereo Vision: The most well-known method is binocular stereo vision, since it relates directly to the way humans and most animals perceive depth using their two eyes. Here the distance of an object can be approximated by exploiting the parallax effect, i.e. the apparent difference in position of one point viewed from two different viewpoints. Simply put, the larger the displacement of a point in the left image with regard to the right image, the closer the point is to the viewer. Stereo vision in general, and especially for cross-spectral vision systems (e.g. composed of NIR and RGB cameras like in the work at hand), will be explained in more detail in section 3.5.2.
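For a rectified stereo pair this intuition has a simple closed form. As a reminder of standard stereo geometry (not necessarily the exact cross-spectral formulation derived in section 3.5.2), with focal length f, baseline b and disparity d, the depth Z of a point is

    Z = \frac{f \cdot b}{d}

so a large disparity, i.e. a large apparent displacement between the two views, corresponds to a small depth and thus a point close to the cameras.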


Depth From Focus: Another class of approaches, commonly known as depth from focus (DFF) or depth from defocus (DFD), estimates depth using a local measure of defocusing or blurring effects caused by the camera lens. The main assumption is that parts of the scene which are blurred, i.e. unfocused, lie outside of the lens's focal plane. With known focal length and a measure of sharpness, e.g. local contrast or the Sobel operator[5], their distance from the focal plane can be inferred. A DFF approach for multi-channel NIR cameras is illustrated in section 3.5.1.

[5] Sharpness can only be effectively computed for edges, therefore DFF produces sparse depth maps.

Structured Light Sensor: The methods above are passive, since they do not require any interaction with the captured scene. In contrast, a depth map can be obtained actively using a structured light sensor, which projects some sort of pattern into the scene. The projection appears distorted when observed from a point of view other than that of the projector, and this effect can be used to reconstruct the geometry of the scene. This can be seen as a variation of classic stereo vision: one camera is substituted by a projector and the parallax effect causes the displacement or deformation of the projection. The projected pattern can be a dot from a laser beam or a light plane (a common 3D scanning technique). In these cases scanning along one or two axes is necessary. Less time-consuming methods involve the projection of two-dimensional patterns such as parallel lines, concentric circles or a dot matrix. Projecting patterns in the invisible range of the spectrum has a major advantage over visible patterns for being non-invasive, i.e. the pattern is not visible in an RGB image. (Fofi et al., 2004)

Recent years have seen affordable structured light sensors, most notably the Kinect (Zhang, 2012). This device constructs a depth map by analyzing a speckle pattern of IR laser light through a special astigmatic lens, which has different focal lengths in the horizontal and vertical direction. In addition to the perspective deformation of the pattern due to parallax, this special lens causes a depth-dependent elliptical deformation of the circular dots in the speckle pattern. Thus, the device employs concepts of both DFF and stereo vision for depth inference (MacCormick, 2011). The depth map is computed internally on the device's hardware, so that little additional processing is necessary. Studies such as Smisek et al. (2013) show that the accuracy of the Kinect is comparable to a standard time-of-flight (TOF) camera or stereo camera rig, which makes it suitable for a number of tasks. The range of the sensor is from 80 cm to 4 m.

In the thesis at hand a Kinect is used to capture the primary depth map for the RGBDNIR image set and as ground truth for the evaluation of the secondary depth estimation methods (cross-spectral stereo matching and depth from chromatic aberration (CA), see section 3.5). The choice of this device is motivated by its low cost and ease of use, as well as by the fact that the projected IR pattern is invisible in both the visible and the NIR spectral range.
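As a side note to the DFF idea above, the local sharpness measure can be as simple as the local variance of a Laplacian response; the following is a minimal sketch (window size and operator choice are arbitrary and not the method of section 3.5.1).

    import cv2
    import numpy as np

    def sharpness_map(gray, ksize=7):
        # Local sharpness as the variance of the Laplacian response in a small window.
        # High values indicate in-focus edges; flat regions stay near zero (sparse measure).
        lap = cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F)
        mean = cv2.blur(lap, (ksize, ksize))
        mean_sq = cv2.blur(lap * lap, (ksize, ksize))
        return mean_sq - mean * mean  # local variance of the Laplacian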

1.4. Related Work & State of the Art This section will give an overview of works from other authors which attempt to solve related problems and/or use similar approaches. Please note that in contrast to the works mentioned below, in the present work the NIR camera is the main image source, while RGB and depth sensors serve as additional inputs. To the best of the author’s knowledge, this particular configuration of multi-modal semantic image segmentation using multi-scale CNNs has not been the subject of a scientific investigation to this day. 1.4.1. Semantic Segmentation with CNNs The last years have seen a number of approaches that use CNNs to perform scene parsing. Common to all approaches is the fact that as in semantic segmentation approaches via CRFs, features are extracted on patches around each pixel. This context patch is fed to the network so that a model can be trained to optimize a loss function which penalizes deviations from correct class predictions. Another similarity is the use of stochastic gradient descent (SGD) as optimization method. One of the first approaches is the work of Grangier et al. (2009) in which an iteratively growing, partly pretrained model is employed. The first model consists of 3 convolution layers, each containing one filter bank and a non-linearity unit, followed by a linear classifier. This model is trained till convergence. In each of the following iterations the classifier is substituted by a new convolution layer followed by a new linear classifier, while the original layers are kept with their learned parameters. Again, this model is trained until it converges. A total of four models are computed, and the authors report an increase in class prediction accuracy with every added convolution layer with the described training method. On the other hand, when the differently sized models are trained individually without incremental pre-training, their performances are roughly similar. According to the article, the approach outperforms CRF-based models by > 1% on the MSRC-9

6

(Microsoft, 2015) dataset in terms of pixel-wise accuracy.

A similar architecture is used by Schulz and Behnke (2012), with some notable additions, which leads to an improvement of almost 2% pixel accuracy compared to Grangier et al. (2009). While scarce on implementation details, the article mentions reuse of pre6

Available at http://research.microsoft.com/en-us/projects/ObjectClassRecognition/


A similar architecture is used by Schulz and Behnke (2012), with some notable additions, which leads to an improvement of almost 2% pixel accuracy compared to Grangier et al. (2009). While scarce on implementation details, the article mentions the reuse of pre-training results, which are max-pooled and convolved and serve as additional input to the next output layer, so that a layer can focus on learning the difference to the previous layer's output. In order to capture long-range relationships between classes, their approach also allows input from coarser scales of an image pyramid into higher layers and tries to repair wrong class predictions with a pairwise class location prior. Moreover, besides the RGB channels, two additional feature maps are extracted via a convolutional zero component analysis transform and histogram of oriented gradients (HOG), and used as inputs to the network.

Capturing long-range information in the image is a very important aspect on which the correct classification of a pixel may depend, e.g. a wooden surface may only be correctly classified as belonging to a chair or a table when the surrounding objects are known. Based on this idea Farabet et al. (2013) propose a multi-scale CNN, in which multiple instances of the convolution layers share training weights while each instance is trained with pixel context patches from a different scale of an image pyramid. The three-scale image pyramid is computed in a preprocessing step that also includes conversion to the YUV color space[7] and channel-wise local normalization (zero mean and unit variance). The instances are trained in parallel and their output feature vectors are concatenated and fed to a linear classifier. In a second training step the classifier is discarded and a multi-layer perceptron (MLP) is trained with the concatenated feature maps from the three scales of convolution layers. Farabet et al. report close to state-of-the-art pixel accuracy and improved average class accuracy on the Stanford Background dataset[8] (Gould et al., 2009) and record performance on the SIFT Flow dataset[9] (Liu et al., 2011), while being superior by several seconds in terms of processing time. Observe figure 1 for a depiction of their scene parsing system. This approach resembles the CNN used in the work at hand, with the difference that in the approach presented here the multi-scale convolution layers are trained directly with the MLP classifier, and that the architecture is designed to extend to various input channels from different image modalities. Furthermore, sample context patches are not sampled according to balanced class frequencies (which may increase class accuracy), but using natural frequencies.

[7] Actually, the YCbCr color space, where Y is the light intensity component, while Cb and Cr describe the blue-difference and red-difference in the image, respectively.
[8] Available at http://dags.stanford.edu/projects/scenedataset.html
[9] Available at http://people.csail.mit.edu/celiu/SIFTflow/
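To make the multi-scale input idea concrete, the following is a minimal sketch of a three-scale pyramid with per-channel normalization to zero mean and unit variance. It uses a simplified, image-global variant of the normalization (the cited works use a local variant), and the scale factors are placeholders; the preparation actually used in this thesis is described in section 3.7.1.

    import cv2
    import numpy as np

    def normalize_channels(img, eps=1e-6):
        # Per-channel normalization of an (H, W, C) image to zero mean, unit variance.
        img = img.astype(np.float32)
        flat = img.reshape(-1, img.shape[2])
        return (img - flat.mean(axis=0)) / (flat.std(axis=0) + eps)

    def three_scale_pyramid(img):
        # Scales 1, 1/2 and 1/4, each normalized independently.
        scales = []
        for factor in (1.0, 0.5, 0.25):
            scaled = cv2.resize(img, None, fx=factor, fy=factor,
                                interpolation=cv2.INTER_AREA)
            scales.append(normalize_channels(scaled))
        return scales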


Figure 1: Scene parsing system of Farabet et al. (2013). This approach bears many similarities to the presented approach, including multi-scale convolution layers and a post-processing step in which predicted labels are improved by superpixel averaging. However, while Farabet et al. train the multi-scale convolution layers with a linear classifier first and train an MLP with the feature maps produced by the trained filters, in the presented approach training is performed in a single step. (Farabet et al., 2013)

Another way of capturing long-range relationships between pixels is to use a recurrent convolutional neural network, as proposed by Pinheiro and Collobert (2014). Like recurrent neural networks (RNNs), recurrent CNNs are fed with the current input plus the network output (i.e. prediction) of the previous time step, which can help to discover time-dependent information in the data, correct mistakes from previous instances and learn label dependencies. In each time step the network is fed with a different version of the image from an image pyramid, plus the predictions from the last time step (in the first time step the predictions are zero). This method allows long-range relationships to be learned effectively with a smaller-sized network. Recurrent CNNs with 2 and 3 instances have been tested, the latter proving too slow to train on a single computer. While not outperforming Farabet et al. (2013), their approach achieves state-of-the-art accuracy on the Stanford Background and SIFT Flow datasets.

1.4.2. Semantic Segmentation using RGB and Depth

Additional depth information can significantly improve scene understanding and object detection, as has been shown in a number of recent works. RGBD sensors have become more affordable in the last years due to developments in the entertainment industry, which mass-produces consumer depth sensors such as the Kinect. Holz et al. (2012) show how combining RGB and depth information can improve real-time obstacle detection for robot navigation, especially for objects which do not protrude much from the ground plane. Here color information is used essentially to improve scene understanding from depth.


Also in the field of robotics, Hernández-López et al. (2012) introduce a real-time method for object detection that combines color and depth segmentation techniques. They show that while transforming the image from RGB to the CIE-L*a*b* color space makes segmentation more robust to lighting changes, objects of similar colors continue to pose a problem. This challenge is met by using depth as additional data, since it allows discriminating objects that lie on different planes.

The problem of full-scene parsing, i.e. dense pixel-wise labeling, is the subject of the work of Silberman and Fergus (2011). The article describes the use of a CRF-based segmentation model which is trained with intensity and depth data from a Kinect sensor, while exploring a range of different representations for depth information. Most notably, a 3D location prior is introduced, which is computed from depth values that have been normalized to the maximum extent of the bounding hull of the current room (similarly to the work at hand, only indoor scenes are considered). The prior acts on the class probability for a pixel and penalizes probabilities which do not fit the prior, e.g. class chair on the ceiling or class wall in the middle of the room. Their model also features a gradient-based spatial transition cost, which inhibits or encourages prediction-independent class transitions for pixels that lie close to an intensity, depth or superpixel edge. Class probabilities are averaged for each superpixel of an initial segmentation. The work of Silberman and Fergus is furthermore noteworthy for the creation of an original RGBD dataset with hundreds of densely labeled images, known as the NYU depth dataset, which at the current date is in its second version. The use of depth, 3D location priors and spatial transition potentials helps to improve the per-pixel classification accuracy by 13% compared to the pure RGB model. This results in an accuracy of 56.6% on the first version of their dataset, i.e. NYU depth v1 with 13 classes.

The same research group used a modified approach on the second, more varied version of this dataset (NYU depth v2). Silberman et al. (2012) explore 3-dimensional support relations by computing surface normals on the depth data and dividing the scene into different planes. The main idea is that certain objects are supported by others, e.g. a table stands on the floor and a cup generally resides on top of a table. An initial superpixel segmentation is computed and a logistic regression classifier is trained to predict the correct class for each image region, using a feature vector paired with a support relation term. This approach yields a pixel accuracy of 58.6% for 4 classes in the NYU depth v2 dataset.

This dataset is also used in the work of Couprie et al. (2013), who extend the CNN from Farabet et al. (2013) (see section 1.4.1 above) by adding an extra channel for depth data. Their approach outperforms Silberman et al. (2012) by 5.9%, resulting in 64.5% accuracy. Observe in figure 2 how depth improves segmentation results for some classes while impairing others. The authors observe that objects with low depth variance within the class are recognized better with depth information. Overall, accuracy is improved by 1.5% when using depth. It is also interesting to note that the CNN is fed with the raw color and depth images, apparently learning useful 3D relationships, though due to the nature of CNNs it is difficult to reconstruct how these relationships are represented in the network.


Figure 2: Exemplary CNN scene parsing result from an RGBD image from the work of Couprie et al. (2013). From left to right: depth image (near = blue, far = red), ground truth labeling, result from RGB only, result for RGBD. (Couprie et al., 2013)

1.4.3. Semantic Segmentation using RGB and NIR

As mentioned above, intensity values in NIR images are uncorrelated with RGB values for the same point in space. Values in the NIR range tend to be more uniform across the same materials, while in contrast RGB values tend to be more distinct for different natural and man-made objects, since the appearance of our environment is molded to human and animal perception (and vice versa). This section presents works which combine RGB and NIR in order to improve semantic segmentation.

Since silicon-based image sensors are sensitive to near-infrared light, NIR images can be obtained with relative ease by removing a blocking filter between the lens and the image sensor of a common RGB camera.[10] This concept is used by Brown and Susstrunk (2011), who modify a conventional camera accordingly by removing the said filter from the lens and taking double-exposure images with external band-pass filters for visible and near-infrared light. The authors propose a multi-spectral SIFT for image classification that finds features on 4-channel images, i.e. RGB plus one NIR channel. Note that, in contrast to this work, Brown and Susstrunk do not capture multi-channel NIR data, i.e. their NIR image is gray-scale. On their original RGBNIR dataset, featuring 477 images divided into 9 categories, the authors report a categorization accuracy of 73.1% for multi-spectral SIFT compared to 59.8% for normal SIFT.

[10] Silicon sensors are responsive only up to ≈ 1000 nm, in contrast to the indium gallium arsenide (InGaAs) sensor used in the work at hand, which has a range of 900 nm to 1700 nm. NIR and RGB images are obtained with separate cameras, which leads to the necessity of multi-spectral image registration, which is addressed in section 3.4.


Salamati et al. (2012) use a labeled subset of the same dataset as test and training data for semantic segmentation. Their segmentation framework is CRF-based and computes probabilities for superpixel patches of pre-segmented images. SIFT descriptors are computed separately for each channel and then concatenated. They are combined with a color feature descriptor which encodes channel-wise intensities using the mean and standard deviation for each descriptor bin. The best results are achieved using SIFT for the NIR channel only and color descriptors for all images, resulting in 85.7% overall pixel-wise accuracy. As the authors point out, NIR is particularly useful to discriminate water, sky and cloud from other classes. Since the present work deals with indoor scenes only, these classes are of minor significance. Still, an information gain is expected from combining RGB and NIR data.

Kang et al. (2011) apply a multi-spectral approach for road scene understanding, using a hierarchical bag-of-textons model which has been trained with features from multiple resolutions of the input image. A prism-based camera is used to obtain the multi-spectral image, i.e. the spectral bands (visible and NIR) are separated by a coating on the prism, see figure 3 for more detail. This model represents object classes by frequencies of "visual words", i.e. basic texture elements called textons (for more detail on bag-of-textons please refer to Csurka et al. (2004) and Willamowski et al. (2004), among others). The image is convolved with a bank of linear spatial filters in order to provide good local descriptors for image patches. Filter responses are clustered and textons are defined as cluster centers, so that image pixels can be textonized to the nearest cluster. Bags-of-textons are computed over local neighborhoods at three scales and concatenated to form the final feature vectors. A CRF is used for optimization. For road scene segmentation into 8 disjunct classes an overall pixel-wise accuracy of 87.3% has been achieved using the multi-scale approach for RGB and NIR images, which is an improvement of ≈ 1% over the color-only method.

Figure 3: Prism-based multi-spectral camera used by Kang et al. (2011). This setup allows for a good overlay of the RGB and NIR images, whereas the presented approach relies on cross-spectral image registration due to a setup with separate cameras. (Kang et al., 2011)


1.4.4. Human Detection

Since the third classification task focuses on human detection, i.e. the detection of people in images, relevant work from this research area is also studied. Approaches that rely on hand-crafted features, as well as CNN-based methods in which features for human detection are learned (similar to the work at hand), are discussed below.

The aforementioned HOG descriptor has been proposed by Dalal and Triggs (2005) for human detection in combination with a support vector machine (SVM) classifier. In this approach a dense grid of descriptors indicates gradient orientations (or edge directions) in the image. A person is recognized based on silhouette contours defined by these oriented gradients. This concept can be expanded by combining edge with color and texture information. The approach of Schwartz et al. (2009) uses co-occurrence matrices to extract texture features and counts color frequencies for each HOG descriptor, resulting in a high-dimensional feature vector. Partial least squares analysis is performed to reduce the dimensionality of this vector before feeding it to a classifier. Rather than using feature descriptors to directly infer the presence of a person in an image, Yao et al. (2014) apply text-recognition techniques to human detection. Here a visual alphabet of the human body is learned, which is composed of small patches showing body parts, e.g. legs, arms and torsos. During detection, patches are matched using HOG descriptors, and a pose dictionary is used to verify hypotheses at runtime.

In recent years CNNs have been successfully applied to human detection. Sermanet et al. (2013), Ouyang and Wang (2013) and Luo et al. (2014) report state-of-the-art performance for deep convolutional networks on pedestrian datasets such as Caltech[11] (Dollár et al., 2009) and ETH[12] (Ess et al., 2008). CNNs can also be used to estimate human poses. The methods of Toshev and Szegedy (2014) and Jain et al. (2014) achieve state-of-the-art accuracy on the FLIC[13] human pose dataset (Sapp and Taskar, 2013). Poses are predicted for detected people in terms of limb positions and orientations. Gkioxari et al. (2014) take this concept a step further by training the CNN to not only predict the pose but also the action performed by a detected person.

[11] Available at http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/
[12] Available at https://data.vision.ee.ethz.ch/cvl/aess/dataset/
[13] Available at http://vision.grasp.upenn.edu/cgi-bin/index.php?n=VideoLearning.FLIC
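OpenCV ships a pedestrian detector in the spirit of Dalal and Triggs (2005), i.e. a linear SVM over a dense grid of HOG descriptors. The following usage sketch is only an illustration of that HOG+SVM pipeline, not a detector evaluated in this thesis; the window stride and scale step are arbitrary defaults.

    import cv2

    # Pre-trained HOG + linear SVM people detector bundled with OpenCV.
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    def detect_people(image):
        # Multi-scale sliding-window detection; returns bounding boxes and SVM scores.
        boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
        return list(zip(boxes, weights))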


1.5. Scope of Research

Multi-modal semantic image segmentation is a vast topic with many ramifications, not all of which can be addressed in this thesis. This section ascertains its limits and mentions aspects of the topic that will not be examined in the scope of this work.

Image Modes and Hardware Setup: In this thesis, the multiple modes of images are NIR, RGB, depth and binary skin images in various combinations. The NIR image is the main data source and the other images are used as additional information, since this research contributes to projects of the Safety and Security Research Institute (ISF) and shares their hardware setup. The use of other image types such as IR or certain feature maps (e.g. HOG as in Schulz and Behnke (2012)) is feasible but is not pursued in this work. The images are not well aligned, since the distance between the cameras is approximately 20 cm, dictated by the size of the custom multi-channel light-emitting diode (LED) ring flash, which can be seen in figure 33. This makes it necessary to register the images, i.e. find transformations that approximate an optimal alignment of the scene objects. An improved hardware setup is possible in theory, e.g. the radius and central opening of the ring flash could be enlarged to accommodate all three cameras in order to minimize their distance to each other, but this is beyond the scope of this thesis.

Types of Scenes: The observed scenes are exclusively indoor scenes, and the number of images of rooms with direct sunlight (e.g. through windows) has been kept low, since sunlight interferes with the active illumination of both the multi-channel NIR camera and the Kinect depth sensor. The study of multi-modal scene parsing on outdoor scenes is not part of this work.

Real-Time: This thesis focuses on the acquisition of a multi-modal image dataset for training and validation (i.e. camera rig setup and calibration, image pre-processing, registration and labeling of ground truth images), and on the architecture and training/testing of various model configurations of multi-modal CNNs. The framework used, CAFFE (Jia et al., 2014), allows to easily design, train and test sophisticated CNN architectures but does not allow exporting performance-optimized implementations of the trained network models. Thus, the real-time capability of the current system is very limited. Recent works have shown that semantic segmentation with CNNs is possible in real time though, especially when employing FPGAs (refer to Farabet et al. (2010), Omondi and Rajapakse (2006) and Zhu and Sutton (2003) for more information). It is therefore safe to say that the developed method can be implemented as a real-time solution in future research.


3D-Reconstruction: The main goal of this work is semantic segmentation, i.e. the correct classification of all image segments, or rather the correct labeling of all image pixels. 3D-reconstruction is a separate step which can be combined with a full scene labeling in order to classify reconstructed objects. This can be done by reprojecting a 3D scene onto the image plane and inferring object classes from the corresponding segment labels. However, 3D-reconstruction is outside the scope of this thesis.

Research on Semantic Segmentation Approaches: In this work a CNN is used as a state-of-the-art approach, which has been shown to outperform other approaches, see section 1.4.1. Implementation and evaluation of other scene parsing methods is not pursued. The area of CNN research itself is very broad and active, and innovative approaches have appeared regularly in the last years. Deep networks with 20 and more layers are state of the art and could certainly be adapted for scene parsing. It is important to note that extensive research on convolutional neural networks is not part of this thesis. The main goal here is to study the influence of multi-modal image configurations on scene parsing accuracy when they are used as input to a CNN. The comparison is limited to multi-scale CNNs with three convolution layers and an MLP classifier with one hidden layer. Models with different multi-modal input configurations, specifically NIR-only, DNIR, DNIRS, RGBNIR, RGBDNIR and RGBDNIRS, have been trained in order to determine the performance gain achieved by using additional image modes. All model configurations are trained and tested for three generalization levels of object classes, i.e. with 2, 3 and 10 disjoint classes, see table 2 in section 3.

2. Basic Principles

This chapter defines a number of basic principles that are relevant to this thesis, in addition to the ones mentioned in section 1.3. It starts with a description of the multi-channel NIR camera used in this work. Next, it briefly addresses camera calibration (a more detailed explanation of this subject can be found in appendix A). The chapter concludes with an overview of the principles involved in training artificial neural networks and of how these are extended to convolutional neural networks.

2.1. Multi-Spectral Image Acquisition with NIR Camera

This thesis has been carried out at the ISF of the Bonn-Rhein-Sieg University of Applied Sciences, where a custom-built multi-channel NIR camera is used as the primary image source in two of the institute's projects: safe person recognition in the working area of industrial robots (SPAI) and fake detection for face biometrics (Fälschungserkennung für die Gesichtsbiometrie, FeGeb), from the research areas concerning human skin detection and material classification for safety and security applications. For more details please refer to Steiner et al. (2015).

In an RGB camera the channel separation is achieved by using a mosaic color filter array such as the Bayer filter (Bayer, 1976). This is sufficient for human color vision since every channel represents one primary color (short, medium and long wavelengths in the range of visible light).14 In the case of a NIR camera, however, generally no such filter array is present15, i.e. the resulting image is an array of intensity values in the infrared sensitivity range of the camera, commonly displayed as a gray scale image. To overcome this limitation, an experimental multi-channel NIR camera has been developed at the ISF. The same camera is also part of the camera rig used in this master thesis project.

In this camera images are acquired in the near-infrared range using an InGaAs CMOS sensor sensitive to radiation between 900 nm and 1700 nm, with a resolution of 636×508 pixels. In order to obtain a multi-channel image, the experimental camera features a custom-built monochromatic LED ring flash with three types of LED arrays, one for each of three wavelengths, see figure 33 in appendix B. For each frame three successive pictures are taken, each illuminated with one type of LED, thereby showing intensity values for a very small band of the near-infrared spectrum, ideally monochromatic. To obtain clean monochromatic images, the dark image, i.e. an image recorded without flash, is subsequently subtracted. Thus, the resulting channel images show only the reflected radiation from the scene caused by a monochromatic infrared light pulse. It must be noted, however, that the LEDs do not emit ideal monochromatic light, but typically have a bandwidth of approximately 40 - 100 nm, which is much broader than that of a laser (Paschotta, 2008). The wavebands used in SPAI, FeGeb and this thesis project lie at 970 nm, 1300 nm and 1600 nm. This choice is due to the fact that human skin shows distinctive reflective properties in these wavebands (Schwaneberg, 2013) and that such LEDs are commercially available. Note that all NIR images shown in this work are false color images, i.e. the wavebands are displayed as the red, green and blue channels of an RGB image.

14 According to Grassmann's law, every color perceptible by humans can be achieved through combination of three linearly independent primary colors.
15 There are exceptions, e.g. the company Pixelteq produces a NIR camera with custom filter arrays, but at a very high price range (http://www.pixelteq.com/product/pixelcam/).
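The dark-image subtraction can be illustrated with a short sketch. This is a minimal illustration in Python/NumPy under assumed array inputs (one frame per LED waveband plus the dark frame), not the camera's actual acquisition code.

```python
import numpy as np

def build_multichannel_nir(frame_970, frame_1300, frame_1600, dark):
    """Subtract the dark image from each flash-illuminated frame and stack
    the results into a three-channel NIR image (shown as false color)."""
    channels = []
    for frame in (frame_970, frame_1300, frame_1600):
        # Use a signed type so that the subtraction cannot wrap around.
        diff = frame.astype(np.int32) - dark.astype(np.int32)
        channels.append(np.clip(diff, 0, np.iinfo(np.uint16).max).astype(np.uint16))
    # Channel order 970/1300/1600 nm, displayed as R, G and B.
    return np.dstack(channels)
```

Ambient (non-flash) radiation is present in both the dark frame and the flash frames and therefore cancels out, which is exactly the effect described above.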

2.2. Camera Calibration

In general, computer vision methods perform best on "undistorted" images, i.e. images with corrected lens distortion. Furthermore, when working with multiple cameras, their relative transformations must be known so that the images can be related to each other, i.e. by projection onto a common image plane and alignment of corresponding pixel rows. Thus, when acquiring camera images for any computer vision task, it is imperative that the intrinsic and extrinsic parameters and distortion coefficients of the cameras are known, so that projection and undistortion transformations can be performed. The intrinsic parameters consist of focal length and displacement from the optical axis; the extrinsic parameters are the coordinate system transformations (rotation and translation) from world to camera coordinates. Lens distortion is described by parameters for tangential and spherical distortion.16 Please refer to appendix A for details on how the calibration parameters can be obtained and how camera images are undistorted and rectified.

16 Other lens distortions, like chromatic aberration (CA), are not considered in the used calibration model.
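As a rough illustration of this calibration step, the following sketch estimates intrinsic parameters and distortion coefficients from chessboard views and undistorts an image with OpenCV. The pattern size and square size are assumed example values; the actual procedure used in this work is the one described in appendix A.

```python
import cv2
import numpy as np

def calibrate_and_undistort(chessboard_images, pattern_size=(9, 6), square_size=0.025):
    # 3D coordinates of the chessboard corners in the pattern's own frame.
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

    obj_points, img_points = [], []
    for img in chessboard_images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
        found, corners = cv2.findChessboardCorners(gray, pattern_size)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    h, w = chessboard_images[0].shape[:2]
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, (w, h), None, None)
    # Undistort one image using the estimated camera matrix and distortion coefficients.
    undistorted = cv2.undistort(chessboard_images[0], K, dist)
    return K, dist, undistorted
```

The same corner detections, recorded simultaneously by camera pairs, are later reused for the extrinsic (camera rig) calibration described in section 3.3.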


2.3. Artificial Neural Networks

The CNN used in this work is a special case of an artificial neural network (ANN). This section provides a short history and mentions fundamental principles which are common to both network types and important to their understanding, before the next section delves into CNNs. ANNs constitute a much broader area of research than can be covered in the scope of this thesis. Interested readers are therefore referred to relevant literature such as Schalkoff (1997), Bishop et al. (1995), Zurada (1992) and Pao (1989), among many other publications.17

ANNs are a branch of the field of artificial intelligence used for learning statistical models for pattern recognition tasks, one of the earliest and most well-known examples being handwritten digit recognition. With the ability to learn, adapt and generalize on input data, ANNs can solve natural, non-linear tasks which are difficult to model using algorithms based on static rules. The inspiration for their development comes from biological neural networks found in the animal and human brain and other parts of the central nervous system. Biological neural networks are composed of millions of neurons18, i.e. cells which exchange information using electrical and chemical signals. ANNs mimic the functionality of neurons: the network nodes are linked through connections with adaptive weights and have an activation function that models the electric excitability of neurons and (depending on the function) guarantees non-linearity, and their output is a function of the weighted sum of their inputs. The weights and the activation threshold constitute the trainable parameters of the network.

In the late 1950s Frank Rosenblatt and colleagues developed the Mark I Perceptron, one of the first ANNs (or neuro-computers, since at that time the implementation was done solely in hardware), which could perform binary classification of digits on a 20 × 20 image (Rosenblatt, 1961). A perceptron, or rather a single-layer perceptron (SLP), is a layer of parallel neurons, so that for an input vector x = [x_1, \dots, x_n] an output scalar f(x) is computed using weights w_i and bias b as

f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}    (1)

Here the bias expresses the neuron's threshold, b = -threshold, and belongs to the left-hand side of the inequality, which is especially useful for more complex activation functions (see below). Rosenblatt advanced the concept of SLPs in his later work, but its real potential only became apparent in the 1980s with the development of the back-propagation of error learning procedure by Rumelhart et al. (1988) and the use of multi-layer perceptrons.

17 The author would also like to recommend the online books Kriesel (2007) and Nielsen (2015), which give a comprehensive introduction to the subject aimed at students and researchers.
18 The human brain consists of about 86 billion neurons (Azevedo et al., 2009).


2.3.1. Multi-Layer Perceptron

A multi-layer perceptron (MLP) is a perceptron with two or more layers of trainable parameters, capable of modeling far more complex relations in the input data than a SLP. While a SLP can divide the data space with a hyperplane and thus perform binary classification, a MLP with one hidden layer, i.e. a total of three neuron layers, is capable of performing multi-class classification.

A crucial aspect of MLPs is the activation function \sigma, which simulates the activation of a biological neuron that causes it to "fire" an electrical impulse. The activation function is defined globally, i.e. it is identical for all neurons. It must be differentiable to be used in back-propagation, which is the preferred learning technique for MLPs and will be explained in section 2.3.3. The neurons generally feature a non-linear activation function such as the logistic sigmoid and the hyperbolic tangent:

f_{\mathrm{sigmoid}}(x) = \frac{1}{1 + e^{-x}}    (2)

\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{2x} - 1}{e^{2x} + 1}    (3)

The logistic sigmoid yields numbers in the interval [0, 1], which is suitable for classification problems where training data and predicted values represent probabilities. The hyperbolic tangent's range is [-1, 1] and is mapped to [0, 1] for such tasks. This work uses a third alternative called a rectified linear unit (ReLU), which outputs non-negative numbers and is defined as

f_{\mathrm{relu}}(x) = \max(0, x),    (4)

which in recent works has been shown to both improve and speed up the performance of recognition systems (Russakovsky et al., 2014; Jarrett et al., 2009). It has a hard threshold at zero and is differentiable everywhere except at that point, which suffices for back-propagation. Glorot et al. (2011) show that besides being computationally efficient, ReLUs are also more biologically plausible than the logistic sigmoid or the hyperbolic tangent. See figure 4 for a comparison of the three activation functions.
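As a small illustration, the three activation functions from equations (2)-(4) can be written directly in NumPy; this is an explanatory sketch and not code from the framework used in this thesis.

```python
import numpy as np

def sigmoid(x):
    # Equation (2): maps any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Equation (3): range (-1, 1); 0.5 * (tanh(x) + 1) maps it to [0, 1].
    return np.tanh(x)

def relu(x):
    # Equation (4): hard threshold at zero.
    return np.maximum(0.0, x)
```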


Figure 4: Non-linear activation functions. The hyperbolic tangent is normalized to [0, 1].

The activation, i.e. the output value, of each neuron \Omega in a MLP is the value of \sigma, and the input to \sigma is the sum of the weighted outputs of connected neurons and the bias b. Let \Omega^l_i denote the ith neuron in the lth layer and let w^l_{j,k} be the weight for the connection from \Omega^{l-1}_k to \Omega^l_j. Let b^l_j be the bias and a^l_j the activation for neuron \Omega^l_j. The activation a^l_j of \Omega^l_j can be computed by the equation

a^l_j = \sigma\left( \sum_{k=1}^{m} a^{l-1}_k w^l_{j,k} + b^l_j \right) = \sigma(z^l_j).    (5)

Here the sum is taken over all m neurons in the (l-1)th layer and z^l_j is a short notation for the input into the activation function. For a clearer notation, one can collect all weights, biases and activations in matrices and vectors. Let n and m be the number of neurons in the lth and (l-1)th layer, respectively. Then let a^l = [a^l_1, \dots, a^l_n] be the activation vector and b^l = [b^l_1, \dots, b^l_n] the bias vector for that layer. The weight matrix is

w^l = \begin{pmatrix} w^l_{1,1} & \dots & w^l_{1,m} \\ \vdots & \ddots & \vdots \\ w^l_{n,1} & \dots & w^l_{n,m} \end{pmatrix}.    (6)

Thus, equation (5) can be written in vectorized form as

a^l = \sigma\left( w^l a^{l-1} + b^l \right) = \sigma(z^l).    (7)
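Equation (7) translates almost directly into code. The following sketch (Python/NumPy, with ReLU assumed as the activation function) is only meant to illustrate the forward pass; the actual networks in this work are defined and executed by the CAFFE framework.

```python
import numpy as np

def mlp_forward(x, weights, biases, activation=lambda z: np.maximum(0.0, z)):
    """Forward pass following equation (7): a^l = sigma(w^l a^{l-1} + b^l),
    starting with a^0 = x. weights[l] has shape (n_l, n_{l-1})."""
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l
        a = activation(z)  # activation a^l
    return a
```

The activations of the last layer are the class predictions p_j that the softmax described in the next section normalizes into class probabilities.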

When used for classification problems, the output layer of the network has as many neurons as there are classes. The output values of these neurons can be merged into a class prediction vector. This vector contains in its elements the predicted probabilities for the input data belonging to each of the classes, and the highest probability designates the predicted class. MLPs are trained using the back-propagation of error procedure. The main goal of back-propagation is to find the partial derivatives \partial L / \partial w and \partial L / \partial b of the loss function L, see the next section, for any weight w and bias b in order to adapt the weights using stochastic gradient descent and minimize L.

2.3.2. Loss Function

The loss function L computes the classification error, or loss, which is to be minimized by an optimization method such as gradient descent, see section 2.3.4 below. In order to be used with back-propagation, the loss function must fulfill two requirements: it must be computable for individual training data points and it must be defined as a function of the outputs of the network. For a multi-class problem the network has multiple outputs (as many as there are classes) and therefore a multinomial logistic loss19 function is used, which fulfills both requirements. It encompasses the use of the softmax function, which normalizes a vector so that its elements lie in [0, 1] and sum up to 1. It computes \bar{c} = [\bar{c}_1, \dots, \bar{c}_K], the normalized vector of a predicted probability distribution over all K classes. Let p_j be the output (prediction) of the MLP for a given input data point and class j. The normalized predicted class probability20 \bar{c}_j for class j can be written as

\bar{c}_j = \frac{e^{p_j}}{\sum_{k=1}^{K} e^{p_k}},    (8)

where e is Euler's constant. Evidently the sum of all normalized probabilities is 1. The derivatives with respect to p_i (again, important for back-propagation) are:

\frac{\partial \bar{c}_j}{\partial p_i} = \begin{cases} \bar{c}_j (1 - \bar{c}_j) & \text{if } i = j \\ -\bar{c}_i \bar{c}_j & \text{if } i \neq j \end{cases}    (9)

Given the target class distribution vector c = [c_1, \dots, c_K] (defined by the ground truth), the multinomial logistic loss can be computed on the softmax values \bar{c}_k for each input data point as

L = - \sum_{k=1}^{K} c_k \ln \bar{c}_k.    (10)

19 Logistic loss is also called cross-entropy loss.
20 That is, the predicted probability of the data point belonging to a certain class.


The target vector c could be a class probability distribution for the input, i.e. for fuzzy classifications / fuzzy sets. However, in the work at hand the hand-crafted ground-truth labeling defines exactly one class per pixel. Thus, the classes are disjoint and c is a hard target vector with c_k = 1 if the class of a data point equals k and c_k = 0 otherwise. If for example the target class is k and the predicted class probability for k is P(k) = 0.5, the loss would be -1 \cdot \ln(0.5) \approx 0.693. The derivative of L with respect to a prediction p_i is simply the difference between target class and predicted class and thus very cheap to compute, and is given by

\frac{\partial L}{\partial p_i} = - \sum_{k=1}^{K} c_k \frac{\partial \ln \bar{c}_k}{\partial p_i} = \dots = \bar{c}_i \sum_{k=1}^{K} c_k - c_i = \bar{c}_i - c_i.    (11)
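The softmax, the loss from equation (10) and the gradient from equation (11) can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the loss layer actually used by the training framework; the final line reproduces the worked example above.

```python
import numpy as np

def softmax(p):
    # Equation (8); subtracting the maximum is a common numerical-stability trick.
    e = np.exp(p - np.max(p))
    return e / e.sum()

def multinomial_logistic_loss(p, target_class):
    # Equation (10) for a hard target vector (exactly one class per data point).
    return -np.log(softmax(p)[target_class])

def loss_gradient(p, target_class):
    # Equation (11): dL/dp_i = c_bar_i - c_i.
    grad = softmax(p)
    grad[target_class] -= 1.0
    return grad

# Two equally likely classes -> predicted probability 0.5 -> loss ~ 0.693.
print(multinomial_logistic_loss(np.array([0.0, 0.0]), target_class=0))
```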

The work at hand solves the binary classification task (human detection) using a CNN with two output neurons, one for each class. Another way of doing binary classification is with a single target value c \in \{-1, 1\}. In this case there is only one network output and the simpler "one-vs.-all" (quadratic) hinge loss can be used. It decreases as the prediction p approaches the target value c, becomes zero once cp \geq 1, and is given by

L_h(p) = \max(0, 1 - cp)^2.    (12)

One problem that can arise is overfitting, i.e. the model does not generalize well and instead specializes on the training data. As an effect, the classification results might be good for the training set, but poor when the model is applied to other data.21 Overfitting can be mitigated by using a regularization technique that penalizes large weights in the network, since large weights can be considered a sign of specialization on a dataset. In the present work, L2 regularization is used, which extends a loss function L to

L' = L + \frac{\mu}{2} \sum_{i} w_i^2,    (13)

where \mu is the regularization parameter and \sum_i w_i^2 the sum of squares of all weights in the network.22 Clearly, using a small \mu favors the original loss function while a higher \mu value increases the penalty for large weights. Note that regularization is applied only to weights, since large biases do not lead to a specialization of the network and can even be desirable in some cases.

21 To generate a meaningful evaluation of the model, cross-validation should be used, see section 4.
22 Division by two is a matter of convention; it makes the derivative more intuitive, see equation (24).

2.3.3. Back-Propagation

Backward propagation of errors is a procedure that constantly adjusts the network's weights in order to minimize the difference between the actual and the desired output vector and thus minimizes the loss function. Through this adjustment, hidden network units are trained to represent features in the input data. Back-propagation is computed once in every iteration, i.e. separately for every input data point. The algorithm is as follows:

Algorithm 1: Back-propagation of errors
1. Input: Set all activations for the input layer (l = 0) from one training data point, e.g. pixel intensities in the case of image classification.
2. Feedforward step: For each neuron in all following layers, compute the input into its activation function and its output, i.e. for l = 1, 2, \dots, L compute z^l and a^l (see equation (7)).
3. Loss: Compute the error (or loss) \delta^L in the output layer.
4. Back-propagation step: Go backwards from the output layer and compute the loss for each preceding layer, i.e. for l = L-1, L-2, \dots, 1 compute \delta^l.
5. Output: The gradient vector \nabla L of the loss function L for all weights and biases.

Evidently four more equations are needed. The output layer loss \delta^L is given for each component, i.e. each neuron output, by

\delta^L_j = \frac{\partial L}{\partial p_j} \, \sigma'(z^L_j),    (14)

where \partial L / \partial p_j is the loss derivative with respect to the class prediction p_j from equation (11), i.e. p_j is the activation at the jth output layer neuron and p_j = a^L_j. Furthermore \sigma' is the derivative of the activation function and z^L_j is its input. As shown for equation (7) this can be written in vectorized form as

\delta^L = \nabla_a L \circ \sigma'(z^L),    (15)


where \nabla_a L is a gradient vector containing the partial derivatives of the loss with respect to the output layer activations from equation (14) and \circ denotes element-wise multiplication of vectors. In order to iterate backwards through the network, the loss of a layer and its transposed weight matrix (w^l)^T can be used to compute the loss of the preceding layer:

\delta^{l-1} = \left( (w^l)^T \delta^l \right) \circ \sigma'(z^{l-1}).    (16)

The gradient of the loss is its rate of change with respect to the network weights and biases. The loss with respect to biases is exactly the loss at each neuron output in a given layer:

\frac{\partial L}{\partial b^l_j} = \delta^l_j.    (17)

Finally, the loss with respect to weights is given by

\frac{\partial L}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.    (18)
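A compact sketch of the backward pass defined by equations (14)-(18) is given below (Python/NumPy). The list layout is an assumption made for illustration: `activations[l]` holds the activations feeding into the layer whose weights are `weights[l]`, `zs[l]` holds that layer's weighted inputs, and `delta_L` is the output-layer loss from equation (15).

```python
import numpy as np

def backprop(activations, zs, weights, delta_L, sigma_prime):
    """Backward pass: returns the gradients dL/dw and dL/db for every layer."""
    num_layers = len(weights)
    deltas = [None] * num_layers
    deltas[-1] = delta_L
    # Equation (16): propagate the loss from layer l+1 back to layer l.
    for l in range(num_layers - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigma_prime(zs[l])
    grad_b = deltas                                                  # equation (17)
    grad_w = [np.outer(d, a) for d, a in zip(deltas, activations)]   # equation (18)
    return grad_w, grad_b
```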

2.3.4. Gradient Descent

The gradient descent optimization method attempts to minimize the loss function using the loss gradient vector \nabla L computed through back-propagation. Gradient descent adjusts the network parameters in the opposite direction of the gradient, so that they "move" towards the next local (and hopefully global) minimum. The learning rate \alpha is a small positive parameter that, as the name suggests, controls the training velocity. The choice of \alpha is an important factor, since a high learning rate can cause the method to overshoot and miss the (global or at least the best reachable) minimum, while one that is too small can slow convergence down considerably or leave the method stuck in a poor local minimum. In general terms, if one defines \Delta W as the vector containing all changes (also called "deltas") to be applied to all trainable parameters, updates are computed by

\Delta W = -\alpha \nabla L.    (19)


The update for the lth layer from the current weight and bias vectors w^l, b^l to the adjusted \tilde{w}^l, \tilde{b}^l is done using the corresponding entries \partial L / \partial w^l_{jk} and \partial L / \partial b^l_j from the gradient vector \nabla L. The speed of the adjustment is regulated by the learning rate \alpha, which defines how fast the solution moves towards the minimum. In a more detailed form, the gradient descent rules are given by the following equations:

\tilde{w}^l = w^l - \alpha \frac{\partial L}{\partial w^l_{jk}} = w^l - \alpha \, \delta^l (a^{l-1})^T    (20)

\tilde{b}^l = b^l - \alpha \frac{\partial L}{\partial b^l_j} = b^l - \alpha \, \delta^l.    (21)

The immediate problem is that in order to compute the gradient vector \nabla L it is necessary to compute the gradients of all n training inputs x_1, \dots, x_n in the training set X and compute the average \nabla L = \frac{1}{n} \sum_{i=1}^{n} \nabla L_{x_i} before any adjustment can be done. Because of this, stochastic gradient descent uses a randomly chosen subset of the training data in each iteration, called a "mini-batch", which depending on its size gives a good approximation of the average \nabla L.23 Let \bar{X} = \bar{x}_1, \dots, \bar{x}_m denote such a mini-batch of size m. The rules from equations (20) and (21) become:

\tilde{w}^l = w^l - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial L}{\partial w^{\bar{x}_i,l}_{jk}} = w^l - \frac{\alpha}{m} \sum_{i=1}^{m} \delta^{\bar{x}_i,l} (a^{\bar{x}_i,l-1})^T    (22)

\tilde{b}^l = b^l - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial L}{\partial b^{\bar{x}_i,l}_{j}} = b^l - \frac{\alpha}{m} \sum_{i=1}^{m} \delta^{\bar{x}_i,l}.    (23)

With respect to the aforementioned L2 regularization, see equation (13), the learning rule for weights becomes:

\tilde{w}^l = w^l - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \frac{\partial L}{\partial w^{\bar{x}_i,l}_{jk}} + \mu w^l \right).    (24)

23 The mini-batch size necessary for a good approximation depends heavily on the dataset and must be determined empirically.
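For illustration, one mini-batch update following equations (22)-(24) could look as follows; this is a schematic sketch (Python/NumPy) with assumed per-sample gradient inputs, not the solver used by the training framework.

```python
import numpy as np

def sgd_step(w, b, grads_w, grads_b, alpha, mu=0.0):
    """One mini-batch update. grads_w / grads_b hold the per-sample
    gradients; mu is the L2 regularization parameter from equation (13)."""
    m = len(grads_w)
    avg_gw = sum(grads_w) / m
    avg_gb = sum(grads_b) / m
    w_new = w - alpha * (avg_gw + mu * w)  # equation (24); mu = 0 gives equation (22)
    b_new = b - alpha * avg_gb             # equation (23); biases are not regularized
    return w_new, b_new
```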

There exist many techniques to improve gradient descent, such as momentum, which accumulates fractions of previous weight updates in a velocity vector and adds them to the update in the current iteration (Sutskever et al., 2013). Similar to a physical momentum, this technique can prevent the model from getting stuck in a poor local minimum and increases the speed of convergence, but it also introduces the risk of overshooting the target minimum. Another popular technique, called dropout (Hinton et al., 2012), decreases the chance of overfitting by randomly omitting hidden units in the network. This prevents co-adaptations on the training data, in which one unit is only helpful in the context of other units. By "dropping" random units, multiple thinned versions of the same network are created during training, and predictions are effectively averaged over all thinned networks. Thus, dropout approximates the effect of training multiple networks and averaging their predictions, which is an effective but resource-intensive way of improving prediction.

2.3.5. Convolutional Neural Networks

As mentioned earlier, a CNN is a type of artificial neural network. It is a variation of the MLP with typically at least two convolution layers followed by three normal perceptron layers. CNNs are currently the state of the art in the area of image classification. They were first introduced in 1980 as the neocognitron proposed by Fukushima (1980), who himself was inspired by the groundbreaking work of Hubel and Wiesel (1959) on the cat's primary visual cortex. According to this work, the biological neural network in the visual cortex is hierarchical, ranging from simple to complex to hyper-complex cells. Certain kinds of simple cells respond more to certain kinds of stimuli (or features), e.g. vertical or horizontal lines. Cells in higher stages bundle the responses of several simpler cells and tend to have a more selective response to a more complex feature of the visual stimulus. Complex cells also have a larger receptive field, making them less sensitive to spatial shifts.

In CNNs this structure is modeled by cascaded layers of convolution filter banks and sub-sampling units. Rather than using the entire image as network input, multiple filters with relatively small kernels (e.g. 7 × 7) iterate over the image, each responding to a different type of feature (≈ visual stimulus) and convolving the image into a feature map. Since the weights of each filter kernel do not depend on position, the same kind of feature is extracted for every point in the image. Sub-sampling operations downsize these maps so that filters in the next layer "see" a larger context window of the previous feature map. Filters in higher layers have all or a subset of the previous feature maps as inputs, thereby extracting more complex features. Please observe the often-cited diagram in figure 5, from the work of LeCun et al. (1989), which introduced gradient-based learning for CNNs.


Figure 5: Architecture of LeCun et al. (1998)'s LeNet-5 convolutional neural network for digit recognition. Features are extracted in two convolution and sub-sampling layers that feed into a MLP classifier. (LeCun et al., 1998)

In image processing, a convolution filter F is defined by its kernel, which can be written in matrix form:

F = \begin{pmatrix} f_{1,1} & \dots & f_{1,k} \\ \vdots & \ddots & \vdots \\ f_{k,1} & \dots & f_{k,k} \end{pmatrix},    (25)

where k is the size (height = width) of the kernel and each f_{i,j} is a filter weight. A convolution operation "folds" the weighted neighbors of a pixel onto the central pixel. The weights come from the kernel mirrored about its center.24 For a pixel intensity value P_{x,y} at position (x, y) in the image, the new value P'_{x,y} is computed by

P'_{x,y} = \sum_{i=-r}^{r} \sum_{j=-r}^{r} f_{i,j} P_{x-i, y-j},    (26)

where k = 2r + 1. Clearly, k must always be an odd number. Like the weighted inputs into neurons, filter responses are thresholded using a bias and are fed into an activation function. Filter weights and biases constitute the trainable parameters of a filter unit.

The sub-sampling, or spatial pooling, operation provides a larger receptive field and invariance to small transformations in the image. The two most common pooling types are sum- and max-pooling, which compute the averaged sum or the maximum over a neighborhood of activation values and write the result to the down-sampled image. The size of the pooling neighborhood equals the step size in the original image, while the step size in the down-sampled image is 1. In this work, max-pooling is used, which implements selective response by taking the highest activation in a neighborhood.

24 In convolution, the filter is mirrored around the center pixel. If the kernel matrix is used without being mirrored, the operation is called correlation.
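The following sketch implements equation (26) directly, together with a simple max-pooling step; it is a naive reference implementation for illustration (border pixels are skipped), not the optimized filtering used inside a CNN framework.

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct implementation of equation (26) for a k x k kernel, k = 2r + 1."""
    k = kernel.shape[0]
    r = k // 2
    h, w = image.shape
    out = np.zeros((h - 2 * r, w - 2 * r))
    for y in range(r, h - r):
        for x in range(r, w - r):
            acc = 0.0
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    # f_{i,j} * P_{x-i, y-j}: the kernel is mirrored about its center.
                    acc += kernel[i + r, j + r] * image[y - j, x - i]
            out[y - r, x - r] = acc
    return out

def max_pool(feature_map, size=2):
    """Max-pooling with a stride equal to the pooling neighborhood size."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))
```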


By cascading multiple filter layers, CNNs are capable of learning complex features during training. This approach is superior to using handcrafted visual features because it does not require manual feature design based on prior knowledge of the data, and it might even produce useful feature extractors which could not be conceived by a human. The last layers of the network are composed of fully connected perceptrons, i.e. a MLP, which performs classification based on the extracted feature maps. As can be seen in figure 5, the last 16 feature maps of the LeNet-5 network have a size of 5 × 5 pixels, which makes for 400 input values into the MLP. Depending on the architecture, the last filter layer might also produce single pixels, so that the feature maps have a size of 1 × 1 and the input into the classification layers is actually a vector of filter responses.

2.3.6. CNNs for Semantic Segmentation

While CNNs are most commonly used for the classification of whole images, they can also be adapted to perform pixel-wise classification for semantic segmentation. In this case, rather than using the entire image as input into the network, small context patches around each pixel are used as input, with the label of the object to which the pixel belongs as target class. This way, each patch provides contextual information about the pixel besides its intensity value: surrounding pixels, edges and features that may help to identify the class of a pixel. For instance a blue pixel surrounded by white specks might belong to the class "sky", while a blue pixel above a sandy colored region might belong to the class "water". The size of the context patch is obviously an important factor: the bigger the window, the more information is available. At the same time, big patches slow down learning and ultimately semantic segmentation, because in order to parse one image all patches for all image pixels have to be classified. This constitutes a non-trivial amount of processing time, e.g. for a 320 × 240 sized image, over seventy thousand context patches have to be fed to the network. To circumvent this problem, a multi-scale approach can guarantee long-range context information for each pixel while maintaining a relatively small context patch size, as sketched below. An image pyramid composed of several versions of the image at multiple smaller scales is computed, and same-sized patches are taken from all scales for each pixel. Consequently, patches from smaller scales contain more long-range information at the cost of image resolution, see figure 6. Section 3.7 gives a detailed description of the multi-scale CNN architecture developed for semantic segmentation in this thesis.
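A minimal sketch of this multi-scale patch extraction is shown below (Python with OpenCV/NumPy). The patch size of 46 pixels and the three scales are illustrative assumptions, not the exact parameters of the network described in section 3.7.

```python
import cv2
import numpy as np

def multiscale_patches(image, x, y, patch=46, scales=(1.0, 0.5, 0.25)):
    """Extract same-sized context patches around pixel (x, y) from an image
    pyramid; smaller scales cover a larger context at lower resolution."""
    half = patch // 2
    patches = []
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s, interpolation=cv2.INTER_AREA)
        # Reflect-pad so that patches near the image border are well-defined.
        padded = cv2.copyMakeBorder(scaled, half, half, half, half, cv2.BORDER_REFLECT)
        cx, cy = int(round(x * s)) + half, int(round(y * s)) + half
        patches.append(padded[cy - half:cy + half, cx - half:cx + half])
    return patches
```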


Figure 6: Pixel context patches of all scales. For the sake of demonstration, patches have been upsampled to fit the original size and arranged on top of each other. Note how the patches from smaller scales (1/2 and 1/4) provide larger context fields at lower resolution.

3. Approach for Semantic Multi-Modal Image Segmentation

This chapter describes the approach for semantic image segmentation using multi-modal image sets that has been developed in this thesis, motivating the design decisions, giving details and mentioning encountered challenges. It commences with an overview of the scene parsing system used to train and test multiple CNN models, and continues with a description of how the RGBDNIRS dataset has been generated, including pre-processing and image labeling. Next, the CNN architecture for all model types is illustrated and training details and examples of extracted features are given. The chapter concludes with an explanation of the class voting algorithm using superpixels, which is used in a post-processing step to improve the results.

3.1. System Overview

This work addresses semantic image segmentation using multi-modal images. More precisely, each scene which is to be segmented is captured with RGB, NIR and depth cameras. Since camera resolutions and visual fields vary due to different sensors, camera positions and lenses, which also introduce distortions into the image (see appendix A.1), a pre-processing step is necessary in order to undistort and align the images. Furthermore, a binary skin image is computed, in which detected (human) skin is denoted by "1". Pre-processed images are then used to build an image pyramid with smaller versions of the original images in order to train the network to model long-range pixel relations. Small context patches around each pixel are extracted from each pyramid scale and fed to three instances of the CNN's convolution layers (more detail on these layers is found in figure 38 in appendix C). Since corresponding filter weights for each scale are shared with weights in the other scales, the network is able to learn scale-invariant features. The outputs of all convolution layer instances are concatenated into one feature vector which serves as input to a MLP classifier. The MLP predicts each pixel's class and a dense label prediction for an entire image can thus be computed. Lastly, the predicted label image is improved by superpixel class voting in terms of object borders and label consistency along segments.

The CNN to be trained consists of convolution layers and an MLP. Pre-processing and patching is identical for training and test iterations. Scene parsing and class voting are separate steps performed only in the final semantic segmentation in order to evaluate the system. Please see figure 7 for an exemplified depiction of the system's processing pipeline.



Figure 7: Exemplified System Overview: An RGBDNIR image is captured, then undistorted and rectified using intrinsic and extrinsic camera parameters. RGB image and depth map are registered with the NIR image and the skin image is computed. A three-scale image pyramid is created and the CNN receives a patch for each pixel (pixel window/neighborhood). The patch is filtered by three convolution layers with learned filter kernels, and the resulting feature vector is fed to a MLP which performs class prediction. Finally, the result is improved by superpixel class voting.

3.2. Problems Occurring in Captured Images

In order to train the CNN models, a set of RGBDNIR images has to be captured to obtain the ground truth images used as training and testing dataset. For a detailed description of the experimental camera setup and of how the images have been obtained see appendix B. The captured images can suffer from a number of visual artifacts which can deteriorate the image quality in a way that makes them unsuitable for CNN model training. This is the case when large parts of the image are affected by blooming or when images are too dark to contain sufficiently distinct intensity values. Such images have been discarded. At the same time, the presence of some artifacts can help the network to adapt to realistic image conditions, so some less problematic images have been kept in the training/test set. Captured images have been screened manually and discarded where required.

Insufficient illumination for NIR images: As with any light source, the radiation of the NIR flash has a limited intensity and decreases proportionally to the square of the distance from the source. Thus, the overall intensity that can be achieved in the multi-channel NIR images decreases quadratically with the size of the scene that is being captured, i.e. the size of the room. This is especially problematic when another, stronger light source, such as the sun, illuminates the scene. In this case the aperture has to be narrowed (in comparison to scenes with artificial lighting) so as not to overexpose the image.


Figure 8: Blooming on NIR images caused by sunlight, visible on the dark image (a) and the multi-channel image (b), and blooming caused by the NIR flash (c). Note that blooming from sunlight appears as a black "blob" in (b) because the dark image is subtracted from the channel images.

Compared to the sun's radiance, the flash is not strong enough to illuminate the scene sufficiently. As aforementioned, a dark image is recorded without illumination and subsequently subtracted from the other channels so that the images show only the reflected radiation caused by the NIR flash. As an effect, after subtracting a dark image showing much sunlight, the channel images contain very little usable information.

Blooming: The NIR camera uses an InGaAs focal plane array (FPA) sensor. InGaAs is a semiconductor that provides very good photo-sensitivity in the NIR region of the spectrum. The FPA consists of a photo-diode array connected to a readout integrated circuit (Princeton Instruments, 2012). Similar to charge coupled device (CCD) sensors, InGaAs sensors suffer from blooming. This is a phenomenon which occurs when the signal exceeds a specific level due to high-intensity light reaching the photosensitive area. The excess charge overflows to adjacent sensor elements and causes the blooming effect, visible in the image as bright "blobs" of pixels with the highest possible intensity. These blobs cover much more pixel area than the one pixel receiving the high charge. Several anti-blooming techniques exist which provide drains to carry the excess charge away (Hamamatsu Photonics, 2014). Anti-blooming is implemented by default in all modern CCD sensors, but seems to be missing from InGaAs sensors to the best of our knowledge. The blooming artifacts make the image information in the affected areas unusable. If only minor areas of the image have been affected, the blobs have been marked as belonging to the class "sensor artifacts" and are thus treated as a kind of object whose recognition the CNN can, in theory, learn. If larger areas have been affected, the images are discarded.


Blooming caused by sunlight: The sun emits strong radiation not only in the visible but also in the NIR spectrum. Because of this, even indirect sunlight such as from brightly illuminated windows or light specks can cause blooming when capturing images in indoor environments. The blooming effect is visible on all channel images including the dark image, and since the dark image is subtracted from the other channels, this causes a dark blob in the multi-channel NIR image, as can be observed in figure 8b.

Blooming caused by the NIR flash: Another type of blooming occurs due to the light from the LED flash, which is reflected strongly by some surfaces (mostly glass and metal). Depending on the angle of an object, the reflected light can fall directly into the camera lens. In this case, the blooming appears as a bright blob in the channel images and does not appear in the dark image, since it is caused by the flash rather than sunlight or another light source, see figure 8c.

3.3. Camera Calibration

Camera calibration is carried out in two steps: first, intrinsic parameters and distortion coefficients are computed for the individual cameras, then the extrinsic parameters are determined for the camera rig, using the previously obtained parameters. Extrinsic, intrinsic and distortion parameters are encoded in the aforementioned undistortion and rectification maps25, i.e. lookup tables of corrected pixel positions for each camera. See appendix A for the related formulas.

Single camera calibration: Multiple views of a chessboard calibration pattern (see figure 32 in appendix A) are recorded for each camera and the intrinsic parameters and distortion coefficients are determined. In order to calibrate the Kinect depth sensor, raw IR images are recorded, since the flat calibration pattern is not visible in the depth image. Kinect uses the IR camera to capture the projected pattern and estimate depth, see section 1.3.3, therefore the camera parameters are identical for depth and IR image. The IR pattern projector is covered during calibration to avoid interference. The results are stored as camera matrices and distortion coefficient vectors for each camera, see appendix A.1.

Camera rig calibration: The chessboard pattern is recorded simultaneously by the relevant camera pairs for multiple views. Since RGB and depth images are mapped to NIR images because of the central position of the NIR camera, the resulting image pairs are (I_nir, I_rgb) and (I_nir, I_ir). RGB images have a higher resolution than NIR images and must therefore be resized.26 The correct downscaling factor must be found which guarantees that RGB and NIR pixel metrics are similar, i.e. pixel distances in both image types translate to the same real-world distance. In order to do so, all (I_nir, I_rgb)-pairs are temporarily undistorted using the corresponding camera matrices and distortion vectors. Chessboard corner detection is performed on the undistorted images, yielding n corner points P_{nir,i} and P_{rgb,i}, i = 1, \dots, n. The downscaling factor f is computed from the ratio of the summed distances between adjacent corner points:

f = \left( \sum_{i=1}^{n-1} \lVert P_{nir,i} - P_{nir,i+1} \rVert \right) \Bigg/ \left( \sum_{i=1}^{n-1} \lVert P_{rgb,i} - P_{rgb,i+1} \rVert \right).    (27)

25 Undistortion and rectification maps will be referred to in short form as "rectification maps" below.
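Equation (27) amounts to a ratio of summed corner-to-corner distances. As a small illustration (Python/NumPy, assuming the detected corners are given as (n, 2) arrays in the same order for both cameras):

```python
import numpy as np

def downscale_factor(corners_nir, corners_rgb):
    """Equation (27): ratio of summed adjacent-corner distances in the
    undistorted NIR and RGB chessboard views."""
    d_nir = np.linalg.norm(np.diff(corners_nir, axis=0), axis=1).sum()
    d_rgb = np.linalg.norm(np.diff(corners_rgb, axis=0), axis=1).sum()
    return d_nir / d_rgb  # < 1, since the RGB image has the higher resolution
```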

After downscaling an RGB image by factor f, the image must also be cropped to the same size as the NIR images to make alignment possible. Since both cameras are at the same height, one can assume the middle part of I_rgb to coincide most with I_nir. Furthermore, the wide-angle lens used with the RGB camera causes strong vignetting (see figures 34b and 34i in appendix C), a fact that also recommends a centralized crop. Thus, a crop rectangle Rect_rgb is defined in terms of the pixel matrix by

Rect_{rgb} = \begin{bmatrix} x \\ y \\ width \\ height \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2}(I_{rgb}.width - I_{nir}.width) \\ \tfrac{1}{2}(I_{rgb}.height - I_{nir}.height) \\ I_{nir}.width \\ I_{nir}.height \end{bmatrix}.    (28)

Now the undistorted RGB calibration images are resized and cropped and intrinsic and distortion parameters are estimated once more for the smaller versions of these images. Finally, chessboard corners are detected on all views of all image pairs (with resized RGB images) and used alongside camera matrices and distortion vectors to estimate the translation and rotation vectors between each two camera coordinate systems. The results are encoded in rectification maps Mnir and Mrgb for NIR and RGB cameras. For depth / IR only rotation and translation vectors Rir and Tir from IR to NIR camera space are needed, because in order to register depth to NIR, 3D-point cloud reprojection is performed as explained in section 3.4.2. It is also for that reason that depth images do not need to be resized.

26 The RGB images are downscaled rather than upscaling the other images, because after pre-processing all images are downscaled to 320 × 240 pixels before being fed to the CNN, for performance reasons.


3.4. Image Registration

Image registration describes a number of methods to transform different source images into one coordinate system so that corresponding image regions overlap. For segmentation it is particularly important that object edges are aligned. Ideally, after successful registration all pixels have the same coordinates as the corresponding pixels in a destination image. This correspondence can be based on intensity values in the simplest case, e.g. block matching (Konolige, 1998), or on feature descriptors such as those produced by SIFT (Lowe, 1999) or SURF (Bay et al., 2006). The transformations used to register the source to the destination image can be divided into linear and nonrigid transformations. Linear transformations range from translation and rotation to scaling and other affine transformations. Nonrigid transformations additionally model local geometric differences between images, i.e. they apply local deformations to the image. One example of this is the thin-plate splines algorithm by Bookstein (1989).

3.4.1. Image Registration for RGB Images

The immediate issue when attempting to register RGB to NIR images is the same encountered in every multi-modal vision system: corresponding data points from different sensors are very challenging to register since the signals do not have matching physical properties (in this case, intensity values come from different parts of the electromagnetic spectrum). Similarities in terms of gradients, structures and color, which provide constraints for common feature matching approaches, are weak or not present at all. There is no computable correlation between intensity values from RGB and NIR images and therefore gradients are of different strengths and often even reversed (Krotosky and Trivedi, 2007; Shen et al., 2014). A common approach in medical imaging is to use a statistical method such as mutual information to register images from varying domains. In this work a different approach using HOG feature descriptors has been implemented in order to find corresponding image points. It is also used in the cross-spectral stereo matching method described below, refer to section 3.5.2 for more detail.

NIR and RGB images are first stereo-rectified (see appendix A.4), so that pixel rows are aligned, then registration is performed. Three RGB-to-NIR registration techniques have been implemented and tested, see figure 9. All techniques use points that have been matched through the HOG descriptor approach. The closest matches are pre-selected using an empirically derived distance threshold d_thresh = 4.0, so that only matched feature point pairs (p_1, p_2) with a vector distance of \lVert p_1 - p_2 \rVert \leq d_thresh are kept. Furthermore, since the images are stereo-rectified, only matches that lie in the same pixel row are kept.

Figure 9: RGB-to-NIR image registration: (a) NIR image, (b) registration via shift, (c) via homography, (d) thin-plate splines, (e)-(g) difference images. Simple horizontal shift yields the best result (b). Due to the poor quality of cross-spectral HOG feature matches, the homography approximation is inadequate (c) and the thin-plate splines algorithm creates many erroneous local deformations (d). (e), (f) and (g) show difference images for edges detected on the NIR image and the registered images.

In all techniques only these best matches are used in order to minimize the influence of false matches.

Registration via Horizontal Shift: The median Euclidean distance \tilde{d}_f is computed in pixel units for all best matches. The rectified RGB image is shifted horizontally by offset = \frac{1}{2} \tilde{d}_f pixels in order to minimize the average offset between correspondences. This approach shows the best results and is used in the final algorithm.

Registration via Homography: Assuming that the points p_{A_i} = [x_{A_i}, y_{A_i}]^T and p_{B_i} = [x_{B_i}, y_{B_i}]^T, i \in [1, m], of all m best matches lie on the planes A and B, registration can be performed by transforming A into the coordinate system of B. A homography matrix H is estimated (see appendix A.2) so that p_{A_i} \approx H p_{B_i} for all i \in [1, m]. This is done by minimizing the projection error defined as

\sum_{i=1}^{m} \left[ \left( x_{A_i} - \frac{h_{1,1} x_{B_i} + h_{1,2} y_{B_i} + h_{1,3}}{h_{3,1} x_{B_i} + h_{3,2} y_{B_i} + h_{3,3}} \right)^2 + \left( y_{A_i} - \frac{h_{2,1} x_{B_i} + h_{2,2} y_{B_i} + h_{2,3}}{h_{3,1} x_{B_i} + h_{3,2} y_{B_i} + h_{3,3}} \right)^2 \right],    (29)

where h_{1,1}, \dots, h_{3,3} are the elements of H. To further minimize the influence of outliers, the random sample consensus (RANSAC) method is used (Fischler and Bolles, 1981).


RANSAC iterates through randomly chosen four-element subsets of the matched point pairs, chooses the best subset according to the projection error and computes an initial homography H_0. Outliers are identified and rejected based on H_0 and a threshold, and a final H is approximated as described above. Of course this method works best for the registration of chiefly planar objects such as x-rays, satellite images or the calibration patterns mentioned in appendix A.3, since a homography can only encode the translation and rotation of planes. Nevertheless, it also allows for good registration of camera images from different perspectives, given that enough good feature matches can be found.

Thin-Plate Spline Algorithm: Bookstein (1989) proposes an algebraic approach for describing local deformations which are defined by scattered point matches. This makes the thin-plate spline algorithm suitable for the presented problem, since HOG descriptor matches are sparsely and irregularly spaced when computed on RGB and NIR images. Furthermore, the different camera perspectives create discrepancies in the image which can be approximated by local deformations for want of a real 3D-reconstruction of the scene.

In the final algorithm, RGB-to-NIR image registration is implemented using the horizontal shift as described above. Albeit very simple and relying solely on translation, it showed the best results. This might be partly due to the relatively large offset between RGB and NIR camera (≈ 20 cm), but also to the poor quality of the cross-spectral feature matches. The approximated homography matrix suffers from the fact that an insufficient number of good matches is found, see figure 36 in appendix C. Therefore some parts of the image are well registered, while other areas, where no matches are found, are ignored. The thin-plate spline approach in particular performs very poorly, creating many unnecessary deformations based on poor matches.

3.4.2. Image Registration for Depth Maps

If depth is obtained through stereo matching or a DFF approach that involves the NIR camera, the depth map is already aligned with the NIR images. When depth is acquired using another sensor, e.g. the Kinect, the spatial offset and different perspective make image registration necessary. This is a special case though, because a 3-dimensional point cloud is available, since the depth value of each pixel can be interpreted as a z-axis coordinate. Thus, instead of having to register 2D images, a point p_d in the depth map with coordinates (x_d, y_d) and depth value d can be projected to 3D space.


Using the intrinsic parameters f_{dx}, f_{dy}, c_{dx} and c_{dy} of the depth camera (see section 3.3 for depth camera calibration), the resulting point p_{3D} = [X, Y, Z]^T in 3D space is:

p_{3D} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} d(x_d - c_{dx})/f_{dx} \\ d(y_d - c_{dy})/f_{dy} \\ d \end{bmatrix}.    (30)

After transforming it with the rotation and translation (R_ir, T_ir) obtained in section 3.3, this 3D point can be projected into the NIR camera space onto the 2D point p_{2D} using

p_{2D} = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} X f_{nx}/Z + c_{nx} \\ Y f_{ny}/Z + c_{ny} \end{bmatrix},    (31)

where f_{nx}, f_{ny}, c_{nx} and c_{ny} are the intrinsic parameters of the NIR camera. The reprojected point cloud creates a 2D image with missing information due to the different perspective of the cameras, visible as black "shadows" on the lower side of objects. These regions are visible from the NIR camera, which is located below the Kinect in the camera rig, but not from the depth sensor.27 In addition, specular surfaces and spots of sunlight can distort or obscure the projected IR pattern. As a result, depth cannot be inferred at these points, which causes missing information in the depth map even before reprojection.

In order to remove these "holes" in the depth map, adjacent depth values must be used to infer the missing ones. Silberman and Fergus (2011) use the cross-bilateral filter of Paris (Paris and Durand, 2006), which guides diffused depth values into the missing regions. However, while this seems to work well for the relatively close-set RGB and depth sensors of the Kinect, it gives poor results for cameras with a larger offset. In this work a more direct approach is used that iterates over the image array in horizontal and vertical direction, filling missing pixels with the depth of the last visited non-empty pixel in the same pixel row. See the result of reprojection and depth filling in figure 10. A relatively large number of missing depth values are not filled correctly and because of that some objects are "erased". It can be argued that such an approximated dense depth map is still more advantageous for scene labeling than an unaltered one, since missing depths in the latter cause edges that might be wrongly interpreted as object boundaries.

27 On the other hand, some original points might overlap in the reprojected image, i.e. points visible from the depth sensor but not from the NIR camera, so that effectively some depth information is lost.
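A schematic version of this registration and the row-wise hole filling is sketched below (Python/NumPy). The camera matrices `K_d`, `K_n` and the transformation `R`, `T` from section 3.3 are assumed inputs; only the horizontal filling pass is shown, and the per-pixel loop is written for clarity rather than speed.

```python
import numpy as np

def register_depth_to_nir(depth, K_d, K_n, R, T):
    """Project each depth pixel to 3D (equation (30)), transform it into the
    NIR camera frame with R and T, and project it back (equation (31))."""
    fdx, fdy, cdx, cdy = K_d[0, 0], K_d[1, 1], K_d[0, 2], K_d[1, 2]
    fnx, fny, cnx, cny = K_n[0, 0], K_n[1, 1], K_n[0, 2], K_n[1, 2]
    h, w = depth.shape
    out = np.zeros(depth.shape, dtype=float)
    ys, xs = np.nonzero(depth > 0)
    for yd, xd in zip(ys, xs):
        d = depth[yd, xd]
        p3d = np.array([d * (xd - cdx) / fdx, d * (yd - cdy) / fdy, d])  # eq. (30)
        X, Y, Z = R @ p3d + T                                            # IR -> NIR frame
        x2 = int(round(X * fnx / Z + cnx))                               # eq. (31)
        y2 = int(round(Y * fny / Z + cny))
        if 0 <= x2 < w and 0 <= y2 < h:
            out[y2, x2] = Z
    return out

def fill_rows(depth):
    """Row-wise hole filling: copy the last valid depth value to the right."""
    filled = depth.copy()
    for row in filled:
        last = 0
        for i in range(row.size):
            if row[i] > 0:
                last = row[i]
            elif last > 0:
                row[i] = last
    return filled
```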


Figure 10: Depth map reprojection: (a) original depth map, (b) filled depth map, (c) re-projection, (d) filled re-projection. The original depth map has missing information (a) which is corrected by a filling operation (b). Re-projection causes "shadows" due to the change in perspective (c), which are again corrected through filling (d).


3.5. Depth Estimation

The depth of image points can be inferred in a number of ways, such as by a structured light sensor, through stereo vision or depth from focus. Two approaches have been tested and rejected in this work, which led to the use of the structured light sensor, as will be discussed in section 3.6. Although rejected, these approaches give valuable insight into the challenges posed by single-camera depth estimation and cross-spectral stereo vision and shall therefore be described in this section.

3.5.1. Depth from Chromatic Aberration

The aforementioned single-camera depth estimation techniques known as depth from focus (DFF) rely on the analysis of the level of sharpness, or blurriness, of different image regions. These methods can be roughly divided into two groups. Early approaches have been known since 1968, such as the works of Horn (1968), Krotkov (1988), Tenenbaum (1970) and more recently Subbarao and Surya (1994), among others. Here the optimal focus for an object of interest is approximated by informed search, i.e. a sequence of images is captured while shifting the camera focus with the intent of minimizing the blurriness of the image, similar to the auto-focus function of some modern low-end cameras. Since focal length and lens position are known, the approximate depth for objects in that area can be computed using the thin lens formula

\frac{1}{D_O} + \frac{1}{D_S} = \frac{1}{f},

which defines the relationship of the lens-to-object distance D_O, the lens-to-sensor distance D_S and the focal length f.

More recent methods approximate depth from single images without interaction with the camera. The main assumption is that objects which lie outside the focal plane of the camera are more blurred than the ones inside, allowing inference of their depth.28 Since sharpness can only be measured where some kind of structure or texture is present in the image, depth can only be inferred at edges and thus the resulting depth map is always sparse. To obtain a dense depth map, depth values have to be guided into image regions bordering those edges in a post-processing step by segmentation- and interpolation-based techniques.

The concept of DFF is expanded through the consideration of the chromatic aberration (CA) effect. CA is a distortion resulting from the failure of the lens to perfectly focus light of all wavelengths, which stems from the fact that lenses have differing refraction indices for different wavelengths in the electromagnetic spectrum. A depth-from-CA method exploits this effect by analyzing the relationship of sharpness values in the different image channels, e.g. the red, green and blue channels in an RGB image.

28 However, it is not trivial to determine on which side of the focal plane an object lies. This ambiguity can be resolved by using information from multiple channels as in depth-from-CA, see below.
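As a small worked example of the thin lens formula (with illustrative values, not measurements from this work), solving for the object distance gives:

```python
def object_distance(f, d_s):
    """Thin lens formula 1/D_O + 1/D_S = 1/f, solved for the
    lens-to-object distance D_O (all values in meters)."""
    return 1.0 / (1.0 / f - 1.0 / d_s)

# A 50 mm lens whose sensor sits 51 mm behind the lens is in focus
# at roughly 2.55 m.
print(object_distance(0.050, 0.051))
```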

Figure 11: Sharpness values vs. distance plot for different camera lenses: (a) 50 mm lens, (b) 8 mm wide-angle lens. The values are averages of multiple measurements in 5 mm steps. For details on the experimental setup please refer to Velte (2014). 50 mm lens (a): focal lengths vary greatly between different channels of the NIR camera, causing characteristic sharpness curves; depth can be inferred using sharpness relations. 8 mm wide-angle lens (b): focal lengths for different channels are very close due to CA correction, which causes similar sharpness curves; depth estimation using sharpness relations between channels is therefore impractical.

Since every channel has a slightly different focal length due to the CA of the lens, some proportion measure can be used to directly infer the distance from the camera. Evidently, a lens with pronounced CA is desirable in this approach, which stands in contrast to the fact that modern lenses almost always feature some kind of CA correction in order to improve photographic results. This concept is utilized in the visual spectrum by Atif (2013) and Trouvé et al. (2013), who deliberately employ lenses with pronounced or even enhanced CA.

Within the context of a pre-study to this work29, a depth-from-CA method for the near-infrared spectrum was developed for use with the ISF's aforementioned multi-channel NIR camera. In that study, a 50 mm camera lens with little CA correction is used, which provides sufficiently distinct focal lengths for the different channels. Depth can be inferred with an absolute error of less than 18 cm in a range from 1.5 m to 3.0 m from the camera. Please refer to the (unpublished) tech report (Velte, 2014) for more information on the approach. In this work, this approach was tested with the present hardware setup, which uses an 8 mm wide-angle lens for the NIR camera. This lens has an efficient CA correction though, which reduces the measurable effect significantly and makes sharpness differences between channels very small, see figure 11. Depth estimation by CA is thereby impractical in this setup and this approach has been rejected.

29 The so-called "Masterprojekt" is part of the computer science master studies at the Bonn-Rhein-Sieg University of Applied Sciences and directly precedes the master's thesis.


though, which reduces the measurable effect significantly and makes the sharpness differences between channels very small, see figure 11. Depth estimation by CA is therefore impractical in this setup and this approach has been rejected.

3.5.2. Cross-Spectral Stereo Matching

As previously mentioned, binocular stereo vision exploits the fact that a real-world point projects onto different coordinates in the two images of a stereo-vision system. The images have to be rectified so that the epipoles and epipolar lines of both cameras are aligned. The epipole of one camera is the projection of the other camera's center of projection onto the first camera's image plane, see figure 12. The left epipolar line is the projection onto the left image plane of the line connecting the right camera's center of projection with the real-world point, and vice versa. Stereo rectification ensures that the projections of a point lie on the same pixel row in both images. Corresponding points P_L and P_R in the (undistorted and aligned) left and right camera images are searched for, these points being projections of the same real-world point. The depth of this point is inversely proportional to the disparity d = P_L − P_R between the views. Finding these correspondences is not trivial, but a number of efficient methods exist, such as the block matching algorithm by Konolige (1998) and the semi-global matching (SGM) method by Hirschmuller (2008).

Figure 12: Epipolar geometry: The epipoles E_L and E_R are the projections of the other camera's center of projection, O_R and O_L respectively, i.e. the projection of the line P − O_L onto the right camera's image plane. For stereo matching algorithms, the epipoles and epipolar lines of both cameras should be aligned to ensure that corresponding point projections lie in the same row in both image planes. (ZooFari, 2009)

When attempting to perform stereo matching using images from different spectral ranges, such as RGB and NIR as in our case, one faces the problem that there is no direct relation between the pixel values, making the methods mentioned above perform poorly. For this reason the cross-spectral stereo matching method proposed by Pinggera et al. (2012), which uses HOG descriptors for pixel matching, has been implemented and evaluated.


Figure 13: Comparison of depth from Kinect sensor (a) and from cross-spectral stereo matching using HOG descriptors (b). The latter approach has been rejected due to the poor quality of estimated depth values.

This method has been chosen since, in contrast to other approaches like Barrera Campo et al. (2012) and Krotosky and Trivedi (2007), it produces dense stereo maps for the whole image. HOG descriptors have been proposed by Dalal and Triggs (2005) for human detection and, unlike other feature descriptors, have the advantage of not using intensity values but gradients, while ignoring the sign of the gradient direction. Intensity values and gradient directions can differ greatly between images from different parts of the spectrum, but a gradient's unsigned orientation generally does not. For instance, the same edge might be caused by a dark-to-bright transition in the NIR image and a bright-to-dark transition in the RGB image, but the orientation of the edge is the same in both.

In order to compute a HOG descriptor, a window around the pixel of interest is divided into a number of cells, for each of which a histogram of unsigned oriented gradients is computed using the simple gradient filter kernels [−1, 0, 1] and [−1, 0, 1]^T. The descriptor for each block is a concatenation of each cell's histogram. Following the work of Pinggera et al. (2012), descriptors are computed densely, i.e. for every pixel. Each HOG descriptor vector v = [v_1, ..., v_n]^T is normalized to unit Euclidean norm in order to maximize invariance:

\[ v' = \frac{v}{\|v\|} = v \left( \sqrt{\sum_{i=1}^{n} v_i^2} \right)^{-1} \tag{32} \]


The disparity image is computed using a modified SGM algorithm, where the cost function for pixel similarities is given by the L1 distance $\sum_{i=1}^{n} |v_{L,i} - v_{R,i}|$ between the descriptor vectors v_L and v_R. Although the results in Pinggera et al.'s work are promising, the algorithm performs poorly on the stereo pairs obtained for this project, as can be seen in figure 13. Apparently an insufficient number of good matches is found for corresponding points, an assumption that is confirmed by close study of visualizations such as figure 36 in appendix C. This method has therefore also been rejected.
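For illustration, the following is a minimal sketch of the matching features and cost just described: dense unsigned-orientation histograms built with the simple [−1, 0, 1] kernels, descriptor normalization as in equation (32), and the L1 distance used as pixel similarity. It is an assumption-laden simplification (cell size, block size and bin count are placeholders), not the implementation evaluated in this work; OpenCV and NumPy are assumed to be available.

```python
# Minimal sketch (not the thesis implementation): dense unsigned-orientation
# histograms built with the simple [-1, 0, 1] kernels and an L1 matching cost.
# Cell size, block size and bin count are illustrative assumptions.
import cv2
import numpy as np

def unsigned_orientation_histograms(gray, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientations (0..180 degrees)."""
    g = gray.astype(np.float32)
    gx = cv2.filter2D(g, -1, np.array([[-1, 0, 1]], np.float32))
    gy = cv2.filter2D(g, -1, np.array([[-1], [0], [1]], np.float32))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0   # the sign of the gradient is ignored
    h, w = g.shape
    hist = np.zeros((h // cell, w // cell, bins), np.float32)
    for cy in range(h // cell):
        for cx in range(w // cell):
            sl = (slice(cy * cell, (cy + 1) * cell), slice(cx * cell, (cx + 1) * cell))
            b = np.clip((ang[sl] / 180.0 * bins).astype(int), 0, bins - 1)
            np.add.at(hist[cy, cx], b.ravel(), mag[sl].ravel())
    return hist

def descriptor(hist, cy, cx, block=2):
    """Concatenate a block of cell histograms and normalize to unit L2 norm (eq. 32)."""
    v = hist[cy:cy + block, cx:cx + block].ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def l1_cost(v_left, v_right):
    """Pixel similarity cost used inside the modified SGM: L1 descriptor distance."""
    return float(np.abs(v_left - v_right).sum())
```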

3.6. Pre-Processing Pipeline

Once all necessary parameters have been estimated, any image captured with the camera rig can be undistorted, rectified and registered (as long as the relative camera positions and lenses are unaltered). Note that stereo rectification is used to align the RGB and NIR images since this facilitates registration, although no stereo matching algorithm is used. Image registration for RGB and depth images has been explained in section 3.4. The Kinect depth map is used – the cross-spectral stereo vision (section 3.5.2) and depth-from-CA (section 3.5.1) approaches have been discarded due to the problems mentioned earlier. A comparison of original and processed images can be found in figure 34 in appendix C.

Figure 14: Pre-processing pipeline: Intrinsic, extrinsic and distortion parameters that have been estimated during calibration are used to find the best possible alignment of the images.

Binary skin image: Pre-processing also includes the computation of a binary skin image according to the approach of Steiner et al. (2015). In order to determine whether a pixel in a multi-channel NIR image shows human skin, normalized differences d_{i,j} are computed for all combinations of the intensity values v_1, v_2, v_3 from the three wavebands (970 nm, 1060 nm and 1300 nm) as

\[ d_{i,j} = \frac{v_i - v_j}{v_i + v_j}, \quad 1 \le i, j \le 3,\ i \ne j \tag{33} \]

and thus lie in the range [−1, 1]. Normalized differences make the algorithm robust against illumination variances, since they are independent of the absolute brightness of the analyzed pixels. Skin-like materials are classified by defining lower and upper thresholds t_1 and t_2 for each difference d_{i,j} and computing binary images in which pixels are set to "1" if d_{i,j} ∈ [t_1, t_2]. These filter masks are multiplied in order to keep only pixels which have been classified as skin by all filters. Since this method alone creates false positives for some materials, Steiner et al. (2015) apply an additional SVM classifier to improve classification. This second classifier has not been implemented in this work due to limited time, and the images have been corrected manually to emulate an optimal classification.30

Minimal common image area: After all images have been processed, all are of the same size. Nonetheless, a minimal common image area has yet to be defined, since the right-shift in RGB image registration and the reprojection of the depth map result in undefined regions at the image borders, see figure 10d for an example of the latter. A rectangle defining the usable area in the registered RGB image is given for the computed offset as [offset, 0, I_rgb.width − offset, I_rgb.height].31 For the depth map, this rectangle is defined by the projections of the original image's corner points into NIR camera space. Eventually remaining undefined pixels are removed using the aforementioned filling operation, see section 3.4.2. Subsequently, the intersection of the two resulting rectangles is determined and used to crop all images.

Algorithm 2: Pre-processing of RGBDNIR images
1. Undistort NIR images and undistort, downscale and crop RGB images.
2. Rectify RGB and NIR images using the rectification maps M_nir and M_rgb.
3. Register the RGB to the NIR image using a simple shift, see section 3.4.1.
4. Register the depth map by reprojection to NIR camera space using R_ir, T_ir and the depth camera intrinsics according to equations (30) and (31). Fill holes in the depth map.
5. Use the rectified NIR image to compute the binary skin image.
6. Define a minimal common image area and crop all images accordingly.

Footnote 30: In contrast to the other images, the skin image is not an independent image modality, but rather an engineered feature, since it is computed using the multi-channel NIR image as input.
Footnote 31: Compare the rectangle notation in equation (28).
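For step 5 of the pipeline, a minimal sketch of the normalized-difference skin mask from equation (33) is given below. The threshold values are hypothetical placeholders, not the calibrated values from Steiner et al. (2015), and the SVM refinement stage is omitted, as in this work.

```python
# Minimal sketch of the normalized-difference skin mask from equation (33).
# The threshold values are hypothetical placeholders, not the calibrated values
# from Steiner et al. (2015), and the SVM refinement stage is omitted.
import numpy as np

def skin_mask(nir, thresholds):
    """nir: (H, W, 3) float array holding the 970/1060/1300 nm channels.
    thresholds: dict mapping a channel pair (i, j) to its interval (t1, t2)."""
    mask = np.ones(nir.shape[:2], dtype=bool)
    for (i, j), (t1, t2) in thresholds.items():
        vi, vj = nir[..., i], nir[..., j]
        d = (vi - vj) / (vi + vj + 1e-8)       # normalized difference, range [-1, 1]
        mask &= (d >= t1) & (d <= t2)          # keep only pixels passing this filter
    return mask.astype(np.uint8)               # multiplying the masks = logical AND

# Usage with purely hypothetical thresholds:
# mask = skin_mask(nir, {(0, 1): (-0.1, 0.2), (0, 2): (0.1, 0.6), (1, 2): (0.0, 0.5)})
```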


3.6.1. Labeling for Ground Truth

Ground truth data for semantic image segmentation has to be obtained in the form of manually labeled images (see figure 21c) in order to train the CNN. Labeling has to be done based on the pre-processed NIR images since the NIR camera is in the rig's central position, to which the other coordinate systems are mapped. Images have been labeled by a group of helpers (friends and colleagues) using the online labeling tool LabelMe (Russell et al., 2008), which allows assigning classes to image areas by drawing polygons around objects. The tool outputs an XML file which is parsed using an adaptive dictionary to account for synonyms and spelling mistakes, creating 8-bit gray-scale label images. Pixel classes are encoded as intensity values, i.e. the brightness of a pixel designates its class, yielding a total of 255 possible classes. The choice to encode the pixel classes in an image rather than in a different format, e.g. a text file, has been made out of convenience, because it allows the label images to be resized easily alongside the RGBDNIR images. Label ground truth data is created at three different generalization levels, so that the models can be trained to perform binary and multi-class classification. The following table details the three ground truth levels:

|    | 10 classes             |     | 3 classes              |     | Binary |     |
| 0  | person                 | 6%  | person                 | 6%  | other  | 94% |
| 1  | wall                   | 30% | background/structure   | 48% | person | 6%  |
| 2  | floor                  | 9%  | object (miscellaneous) | 43% |        |     |
| 3  | ceiling                | 3%  | undefined              | 3%  |        |     |
| 4  | door                   | 4%  |                        |     |        |     |
| 5  | window                 | 6%  |                        |     |        |     |
| 6  | chair                  | 3%  |                        |     |        |     |
| 7  | table                  | 7%  |                        |     |        |     |
| 8  | object (miscellaneous) | 27% |                        |     |        |     |
| 9  | sensor artifacts       | 2%  |                        |     |        |     |
| 10 | undefined              | 3%  |                        |     |        |     |

Table 2: Class label ground truths at three generalization levels. The class undefined designates pixels that have not been labeled and is ignored during validation.

Next to each label is its percentage with respect to the whole dataset (with a total of ≈ 6.6 million labeled pixels). Note that in “3 classes” the background/structure label comprises walls, floor, ceiling and windows, since these are structural elements which


can be considered static, e.g. for robot navigation. The binary set of label images is used to train the CNN for the human detection task. Observe figures 21a and 21c below for an example of manual labeling.

3.7. Multi-Scale CNN for Semantic Segmentation

This section describes the implementation details and architecture of the multi-scale convolutional neural network used in the presented semantic segmentation approach. For a thorough explanation concerning the loss function, back-propagation, gradient descent learning, etc., please refer to section 2.3. The CNN has been implemented by adapting the GPU-accelerated deep learning framework Caffe (Jia et al., 2014) for multi-modal pixel classification, i.e. predicting a class based on multiple types of images showing the same scene. Features are learned on images that have been resized to 320 × 240 pixels, using linear interpolation for NIR and color images, and nearest-neighbor interpolation for depth and label images.

3.7.1. Preparation of Input Data

Randomness: Stochastic gradient descent estimates the loss gradient on the basis of randomly selected single data points or mini-batches, so the randomness of the training data is very important. In image classification, the number of training samples generally does not exceed a few thousand; these are stored on disk, can easily be referenced in a list, and randomness can be guaranteed by shuffling the list with any established algorithm. For semantic image segmentation however, where one context patch is created for every image pixel, the number of examples is much higher. For instance, the dataset used in this work contains 89 images with a resolution of 320 × 240, resulting in over 6 million patches. Storing so many images on disk is inefficient, therefore patches are created at runtime. To guarantee randomness without repetitions, a list of shuffled pixel positions is computed beforehand for every image. These lists are kept in an outer list, also shuffled, together with an index into each list that is increased after each retrieval. Every new patch is created using the next pixel position in the next shuffled list, and after each traversal the outer list is shuffled again. Shuffling is performed with uniform distribution. Note that traversing shuffled lists does not provide true randomness, since pixel positions are not repeated: every pixel in the training set is retrieved exactly once before any repetition occurs. After all pixels have been viewed, shuffling is repeated for all lists.
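The following sketch illustrates one way to implement the doubly shuffled traversal described above; it is an illustrative reconstruction of the mechanism under stated assumptions, not the code used in this work.

```python
# Minimal sketch (an assumption about the mechanism, not the thesis code): every
# pixel position of every training image is visited in a doubly shuffled order,
# without storing the individual patches on disk.
import random

class PatchSampler:
    def __init__(self, image_shapes):
        # one shuffled list of pixel positions per image, plus a per-image cursor
        self.positions = []
        for h, w in image_shapes:
            pos = [(x, y) for y in range(h) for x in range(w)]
            random.shuffle(pos)
            self.positions.append(pos)
        self.cursors = [0] * len(self.positions)
        self.order = list(range(len(self.positions)))
        random.shuffle(self.order)
        self.next_image = 0

    def draw(self):
        """Return (image_index, (x, y)) of the next patch center."""
        img = self.order[self.next_image]
        x, y = self.positions[img][self.cursors[img]]
        self.cursors[img] += 1
        if self.cursors[img] == len(self.positions[img]):   # this list exhausted
            random.shuffle(self.positions[img])
            self.cursors[img] = 0
        self.next_image += 1
        if self.next_image == len(self.order):               # outer list traversed
            random.shuffle(self.order)
            self.next_image = 0
        return img, (x, y)
```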


Image pyramid and patching: An image pyramid is constructed from each image, composed of the original and two smaller versions scaled by factors 0.5 and 0.25, using linear and nearest-neighbor interpolation for NIR/RGB and depth/label images, respectively. Small patches of size 46 × 46 around each pixel of interest are "cut out" to provide context information for classification.32 These context patches are created around the same relative position in all scales of the pyramid in order to consider short- and long-range information, see figure 6. For a pixel position (x, y) at the original scale, the positions (x/2, y/2) and (x/4, y/4) are used in the downsized images. To account for pixels at image borders, each scale in the pyramid is zero-padded with a 23 pixel wide border.

Conversion to Y'CbCr: In order to separate brightness from color information, the patches from multi-channel NIR and RGB images are converted to Y'CbCr color space, where Y' is the luma component and Cb, Cr are the chroma components for blue-difference and red-difference, respectively. The Y'CbCr patches are then split into a single-channel Y' patch and a two-channel CbCr patch, each of which is convolved with a dedicated set of filters. It must be noted that the conversion of multi-channel NIR to Y'CbCr images is essentially an arbitrary operation, since Y'CbCr is a way of encoding RGB images and as such is based on the human perception of visible radiation (= light). This means, among other things, that the green channel has a much higher influence on brightness than the blue and red channels. Although the NIR channels are ordered similarly to RGB, i.e. ranging from the lowest to the highest wavelength, the resulting false color images have no relation to the human perception of light. The choice to convert NIR to Y'CbCr is ultimately motivated by positive results of early test runs, in which this conversion caused an increase in test accuracy. Nevertheless, a more suitable conversion method is conceivable, see section 5.2.

Normalization: In order to account for differences in brightness and contrast, each patch I_p is normalized to zero mean and unit variance, using the image mean and standard deviation. Note that despite the term "unit variance" the denominator in equation (34) is stdDev, since unlike the variance (= stdDev²) it has the same dimension as the values.

\[ \mathrm{norm}(I_p) = \frac{I_p - \mathrm{mean}}{\mathrm{stdDev}} \tag{34} \]

Footnote 32: The patch size is chosen so that a single pixel is output after all convolution and max-pooling operations, see section 3.7.2. Since it is even-numbered, the pixel of interest is not exactly centralized.
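A minimal sketch of the pyramid construction, zero padding and patch extraction with the parameter values stated above (patch size 46, scales 1/0.5/0.25, 23 px border) follows. For brevity, the normalization here uses the patch statistics, whereas the text normalizes with the image mean and standard deviation; OpenCV and NumPy are assumed.

```python
# Minimal sketch of the pyramid/patch preparation (assumed parameters match the
# text: patch size 46, scales 1, 0.5, 0.25, 23 px zero padding). Normalizing
# with patch statistics here is a simplification of the image-wide statistics.
import cv2
import numpy as np

PATCH, PAD = 46, 23

def build_pyramid(img, interp=cv2.INTER_LINEAR):
    scales = [img,
              cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=interp),
              cv2.resize(img, None, fx=0.25, fy=0.25, interpolation=interp)]
    # zero-pad every scale so that border pixels also get a full context window
    return [cv2.copyMakeBorder(s, PAD, PAD, PAD, PAD, cv2.BORDER_CONSTANT, value=0)
            for s in scales]

def extract_patches(pyramid, x, y):
    """Context patches around (x, y), (x/2, y/2) and (x/4, y/4)."""
    patches = []
    for level, s in zip(pyramid, (1, 2, 4)):
        px, py = x // s + PAD, y // s + PAD
        p = level[py - PAD:py + PAD, px - PAD:px + PAD].astype(np.float32)
        patches.append((p - p.mean()) / (p.std() + 1e-8))   # zero mean, unit std
    return patches
```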


Transformation invariance: The quality of a learned model depends directly on the quality of the training data, which is determined by its size and diversity. One way to artificially increase the size of image training data is to randomly apply elastic distortions (also called jitter) to the images, i.e. transformations such as rotation, translation and skewing. In particular, jitter helps the network to learn invariance to transformations (Simard et al., 2003). With respect to semantic segmentation as a special case of image classification, translation and skewing are of less importance, while at the same time the need for invariance to object orientation and size arises. This suggests horizontal mirroring and scaling as important transformations. Thus, following the approach of Farabet et al. (2013), input patches are randomly mirrored, rotated and resized. Mirroring is applied with a probability of 0.5, while the rotation angles and resizing factors are sampled uniformly from [−8, 8] degrees and [0.9, 1.1], respectively. Multi-scale neighborhoods of the same pixel should coincide though, so the same transformation is applied to the patches from all scales.
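A minimal sketch of this jitter, assuming OpenCV is available: one randomly drawn combination of mirroring, rotation and scaling is applied to the patches of all scales, matching the ranges given above.

```python
# Minimal sketch of the jitter: one randomly drawn transformation (mirror,
# rotation in [-8, 8] degrees, scale in [0.9, 1.1]) applied to all scales.
import random
import cv2

def jitter(patches):
    """patches: list of the three context patches belonging to one pixel."""
    mirror = random.random() < 0.5
    angle = random.uniform(-8.0, 8.0)
    scale = random.uniform(0.9, 1.1)
    out = []
    for p in patches:
        h, w = p.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
        q = cv2.warpAffine(p, m, (w, h), flags=cv2.INTER_LINEAR)
        out.append(cv2.flip(q, 1) if mirror else q)   # horizontal mirroring
    return out
```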

Figure 15: Learned feature filters from the first NIR layer of the DNIR model trained for 10-class classification, and their responses: The first 10 filter kernels in (a) are used to convolve the Y' channel, the remaining 6 are combined kernels for the Cb, Cr channels. The original image is shown in (b); the filter responses (c) and rectified filter responses (after ReLU activation) (d) are shown in equal order.


3.7.2. Network Architecture

The CNN used in this work consists of the input layer with one to four inputs33, three convolution layers and an MLP classifier with one hidden layer. The last perceptron layer of the MLP has as many neurons as there are classes. Since there are ground truth label image sets for binary, 3-class and 10-class classification (see table 2), there are versions of the network with two, three and ten output neurons.34 The first two convolution layers consist of filter banks followed by ReLUs for non-linearity and max-pooling sub-sampling units with a step size of two, i.e. the feature map is scaled down to half its size. The third convolution layer contains only filters. All filter weights are randomly initialized from a normal distribution with zero mean and standard deviation one.

Figure 16: Input layer for NIR, RGB, depth and skin images.

The network contains three instances of the input and convolution layers, one for each scale of the image pyramid, each of which produces one feature vector. The filter weights and biases are shared across scales, so that effectively all three instances of the convolution layers consist of identical filter banks. This allows the network to learn scale-invariant features. Training occurs in parallel for all scales and weight updates are performed during gradient descent with regard to the loss produced by all three images from the pyramid. The feature vectors from all scales are concatenated and fed to the MLP classifier, which outputs a vector of normalized probabilities for the classes, see figure 20. Input layers receive random context patches of all scales that have been normalized and, during training iterations only, processed with elastic transformations. As mentioned before, the input patches have size 46 × 46 and the final outputs of the convolution layers have single-pixel size. All filters have a kernel size of 7 × 7 and convolution is performed without border padding, therefore the outputs of the first filter bank have size 40 × 40. This is reduced by max-pooling to feature maps of size 20 × 20. Accordingly, the second convolution layer outputs maps of size 7 × 7 and the third convolves these into single pixel values, so that the last set of 1 × 1 feature maps can be considered a feature vector.

Footnote 33: For a maximum of four image types – NIR, RGB, depth and skin.
Footnote 34: As mentioned in section 2.3.2, binary classification is carried out using a two-class MLP classifier and the logistic loss function. Alternatively one could use a single output neuron and the hinge loss.
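The size arithmetic implied by this architecture can be verified with a few lines (7 × 7 kernels, no padding, max-pooling with step size two after the first two convolution layers):

```python
# Size check for the stated architecture: 7x7 kernels, no border padding,
# max-pooling with step size two after the first two convolution layers.
def conv(size, kernel=7):
    return size - kernel + 1

def pool(size, step=2):
    return size // step

s = pool(conv(46))   # first layer:  46 -> 40 -> 20
s = pool(conv(s))    # second layer: 20 -> 14 -> 7
s = conv(s)          # third layer:   7 -> 1
assert s == 1        # a 46x46 patch collapses to a single output value
```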


Figure 17: First convolution layer with separate filter bank columns for each image type.

The first two convolution layers are divided into filter bank "columns" for each image type present in the current model (see figures 17 and 18). These columns are fully connected internally, but have no connections between them, thus producing separate sets of feature maps. The maps are concatenated and then fed to the third convolution layer, which is fully connected, see figure 19. The reasoning behind the separation of the first two layers by image type is that different kinds of features can be extracted from different image modalities, and those features should only be combined at a higher abstraction level. Furthermore, suboptimal image registration, e.g. for RGB images, is less detrimental in higher convolution layers, which are more invariant to translations. This reasoning has been verified by early results – DNIR models with fully connected convolution layers have an average accuracy roughly 2% lower than models with separate filter columns.

Figure 18: Second convolution layer with separate filter bank columns for each image type.

There exist six different versions of the convolution layer architecture, each for a different subset of the RGBDNIRS training images, see table 1. For each output feature map of a convolution layer multiple filters are applied, one for every input map, whose outputs are summed and offset by the bias before being rectified by the ReLU activation


function. Different image types and channels are convolved with different numbers of filters. The first NIR filter bank, for instance, contains 10 filters for the Y' channel and 6 combined filters for the Cb, Cr channel pair, producing 16 feature maps. The next filter bank quadruples this amount, producing 64 feature maps, for which 16 × 64 = 1024 filters are needed, since every map is computed from the summed and rectified filter responses of all preceding maps. These feature maps are concatenated with the maps from the other filter bank columns (from RGB, depth and/or skin images) and convolved in the third convolution layer, producing a 576-dimensional feature vector (in the case of the complete RGBDNIRS network). Figure 15 shows NIR feature filters extracted from the first convolution layer of a network trained for 10-class classification, and filter responses for an example image. A visual examination reveals predominantly edge detection and smoothing filters in this first filter bank. Please also see figure 35 in appendix C for more examples of filter responses.

Figure 19: Concatenation of all feature maps and third convolution layer.

Additional image types, i.e. depth, RGB and skin, are convolved with smaller filter banks, for varying reasons. For RGB images this is due to the fact that the registration to the NIR images is not optimal and therefore the influence of wrongly registered image areas should be kept smaller than the influence of the corresponding NIR information. Depth maps, on the other hand, are gray-scale images which contain planes of continuous depth values and fewer edges than color images, and are therefore expected to contain fewer features. The skin images are binary images containing only edges, which also suggests the need for fewer filters. Note that these architectural details are design decisions based on informed estimates, and have not been confirmed through tests due to time limitations. Another reason for the limited number of filters are performance concerns, since every filter introduces a multitude of trainable parameters. The RGBDNIRS model, which is the largest CNN trained in this work, has over 80,000 filters with over 4 million weights


and biases. See figure 38 in appendix C for a graphical representation of all RGBDNIRS convolution layers for one out of three scales.

The loss function used for multi-class classification is the logistic loss introduced in section 2.3.2. Regarding training parameters, this work closely follows Farabet et al. (2013). The learning rate is fixed to 10⁻³ and L2 regularization with parameter µ = 10⁻⁵ is used. The network is trained without a learning rate adaptation method or momentum, and the input is a mini-batch of size one, i.e. one image patch is viewed per iteration. Different parameter settings have been tested early during training but have proven to perform worse; in particular, higher mini-batch sizes cause the loss to stagnate early. In order to test the network architecture, a model was trained on a subset of the Stanford background dataset (Gould et al., 2009) consisting of 80 training and 20 test images, which is close in terms of size to the RGBDNIR dataset. After 2 million iterations, an accuracy of ≈ 65% was achieved.
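For illustration, the update rule implied by these settings (fixed learning rate, L2 regularization, no momentum, mini-batch of size one) reduces to a single expression per parameter; this is a conceptual sketch, not the Caffe implementation.

```python
# Conceptual sketch (not the Caffe code) of the update implied by the settings
# above: fixed learning rate 1e-3, L2 regularization mu = 1e-5, no momentum,
# one patch per iteration. `grad` stands for the back-propagated loss gradient.
LR, MU = 1e-3, 1e-5

def sgd_step(w, grad):
    # the L2 term enters the update as an additional mu * w contribution
    return w - LR * (grad + MU * w)
```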


Figure 20: The complete multi-scale convolutional neural network. An image pyramid is computed for the input images and the three scales are processed in parallel with shared weights. An MLP performs classification on the resulting feature vector, producing class probabilities. See figure 38 in appendix C for a detailed depiction of the convolution layers.


3.8. Post-Processing: Class Voting in Superpixels

The semantic segmentation result can be improved by labeling each region of an over-segmentation, i.e. each superpixel, with the class that has the most occurrences within that region. This approach has two major advantages: Firstly, the voting removes outliers and prediction "noise", guaranteeing unambiguous classification for minimal regions. Secondly, it helps to define object borders in the predicted label image, which poses an especially difficult challenge for the scene parsing method. This is due to the fact that while the context window around each pixel helps to model short- and long-range relationships, it can also make classification at object borders ambiguous. An (ideal) superpixel segmentation, on the other hand, clearly separates objects based on intensity values.

Two algorithms for superpixel segmentation have been tested. The first is the well-known method proposed by Felzenszwalb and Huttenlocher (2004), which is a graph-based approach. Virtual edges are laid between neighboring pixels and weighted according to color similarity. The edges are arranged in a graph based on their weight and iteratively collapsed according to a threshold, which is updated continuously for each joined region. Thus, the color similarity inside superpixels and the color difference between them is maximized. The second approach, called SEED, is from the work of Van den Bergh et al. (2012) and aims at providing real-time performance. Superpixels are initialized in a grid and their boundaries are modified continuously using a fast hill-climbing optimization. The energy function enforces color similarity based on a color histogram for each segment.35

Once superpixels have been computed, class voting can be performed, taking the predicted labels and the superpixels as input. First, a superpixel segmentation of the NIR image36 is created using either one of the algorithms described above (their contribution to accuracy improvement is compared in section 4). The result is a 16-bit single-channel image in which each superpixel is identified by a distinct intensity value, which can be interpreted as an ID for that superpixel. Next, the algorithm iterates simultaneously over the superpixel image and the predicted label image and counts the occurrences of each predicted class for each superpixel. The votes are stored in a two-dimensional array and counted, and an output image is created in which all pixels belonging to one region are set to its prevalent class. If no majority of votes can be determined, the algorithm randomly draws a winner from the equally represented classes.

Footnote 35: See http://cs.brown.edu/~pff/segment/ and http://www.mvdblive.org/seeds/ for examples and source code of these superpixel segmentation methods.
Footnote 36: Superpixels are computed on the NIR image since it is the primary image and has been used for ground truth data labeling.
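A minimal sketch of the voting step, assuming the superpixel ID image and the predicted label image are available as integer NumPy arrays of equal size:

```python
# Minimal sketch of the class voting described above; superpixels and labels
# are assumed to be integer NumPy arrays of the same size.
import numpy as np

def superpixel_vote(superpixels, labels, num_classes):
    out = np.empty_like(labels)
    for sp in np.unique(superpixels):
        region = superpixels == sp
        votes = np.bincount(labels[region], minlength=num_classes)
        winners = np.flatnonzero(votes == votes.max())
        out[region] = np.random.choice(winners)   # a random draw breaks ties
    return out
```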

Figure 21: Superpixel class voting for 10 classes, for labels predicted with the RGBDNIRS model: (a) original NIR image, (b) superpixel segmentation, (c) ground truth, (d) predicted labels, (e) improved labels. The superpixel segmentation helps to fill out connected areas belonging to the same object and to correct object boundaries when comparing (d) to (e). Note however that it can also generate errors due to suboptimal segmentation in some areas, e.g. the torso of the person.

4. Evaluation

The objective of this master's thesis is to examine if and to what extent the combination of images from different modalities can improve semantic image segmentation. To this end, multiple CNN models with different input image type configurations have been trained using a custom dataset consisting of 89 multi-modal images of indoor scenes. For the ground truth the images have been manually labeled, producing over 6.6 million classified pixels. This chapter commences with a brief analysis of the training and continues with a detailed evaluation of all models for 3- and 10-class classification, followed by a separate evaluation of the presented method applied to human detection. The results for different models are compared, i.e. networks trained with varied combinations of input image data, namely NIR, DNIR, DNIRS, RGBNIR, RGBDNIR and RGBDNIRS, see table 1. Note that below these acronyms will be used to name the models rather than the images. The chapter concludes with a discussion of the evaluation results.

4.1. Training Analysis

Training is carried out for five-fold cross-validation with a 20% holdout for validation. The 89 images from the dataset are divided into five groups, and for every fold one group is used for validation while the remaining four serve as training data. Hence, five models have to be trained for every one of the six model types and for every classification modality, i.e. 10-class, 3-class and binary (human detection).37 The models are trained for two million iterations, which takes an average of ≈ 8 hours of training time. The total training time for all models is over three continuous weeks on an NVIDIA GeForce GTX 780 GPU (2304 CUDA cores). The number of iterations was defined based on the analysis of logs from early training runs, which show convergence occurring between one and two million iterations, i.e. after 15% to 30% of all pixel patches have been viewed.

Figure 22 shows plots of the training loss and accuracy of all models for the 10-class and 3-class classification tasks. Training plots for human detection can be found in figure 39 in appendix C. Values have been averaged over all folds of the cross-validation. For the sake of clarity the loss curve is shown as a running average with a window size of N = 10,000. The running average and the standard deviation σ_avg with respect to the running average are given by the following equations:

\[ \mathrm{avg}(l_i) = \frac{1}{N} \sum_{j=0}^{N-1} l_{i-j} \tag{35} \]

\[ \sigma_{\mathrm{avg},i} = \left( \frac{1}{N} \sum_{j=0}^{N-1} \big( \mathrm{avg}(l_i) - l_{i-j} \big)^2 \right)^{0.5} \tag{36} \]

Footnote 37: For human detection only the DNIRS and RGBDNIRS models have been trained.
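Equations (35) and (36) can be reproduced with a short NumPy sketch; the exact window alignment is an assumption here (a window over the N most recent iterations is intended).

```python
# Sketch of equations (35) and (36): running average of the loss over a window
# of N iterations and the standard deviation with respect to that average.
import numpy as np

def running_stats(losses, N=10000):
    losses = np.asarray(losses, dtype=np.float64)
    avg = np.convolve(losses, np.ones(N) / N, mode="valid")     # avg(l_i)
    sigma = np.array([np.sqrt(np.mean((avg[k] - losses[k:k + N]) ** 2))
                      for k in range(len(avg))])                # sigma_avg,i
    return avg, sigma
```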

Accuracy increases fast during the first 100,000 iterations, which corresponds to ≈ 1.5% of the dataset, reaching top accuracies of approximately 60%, 70% and 95% for 10-class, 3-class and binary classification, respectively. These values may differ slightly from the final evaluation results because during training only 15% of the testing dataset is evaluated in every test phase, due to the very high number of data points (almost 700,000 context patches). Tests are performed every 100,000 iterations; the remaining values have been interpolated for the plots. The loss for models trained for 3-class classification starts lower and drops faster than the loss for 10-class models. However, taking into account the different chance levels for different numbers of classes,38 the curve progression is roughly similar. For both models the accuracy converges fastest during the first 100,000 iterations and stagnates in the vicinity of 1.5 million iterations, which confirms the previously estimated convergence interval.

In all loss curves a steep downwards slope followed by a short plateau in the first few thousand iterations can be observed. One possible interpretation of this phenomenon is that this point might coincide with the transgression of the naïve baseline (random guess) for accuracy. This goal might be rapidly achieved by relatively simple weight adaption in the classifier. In contrast, feature learning through adjustments of the filter weights might take a longer time to show results, while being responsible for the more significant increase in accuracy.

In general, the loss drops faster when training with multi-modal input rather than NIR images only, and faster learning models achieve lower loss values and vice versa, as expected. However, the loss rate does not allow one to draw conclusions about accuracy. When comparing the loss curves in figure 22c for instance, one might come to the conclusion that the DNIR model performs poorly. In reality though, this model achieves the best results for 10-class classification, as will be shown in section 4.2. Apparently a lower loss is more closely related to the size of the convolution layers, which increases with the number of image types, than to scene labeling accuracy. There are two explanations: To begin with, training loss and accuracy reflect the model's performance on the training data, while accuracy must be measured on the testing data, both sets being disjoint. Since models with more input image types have a larger parameter space, i.e. more filter banks and consequently more trainable parameters, they can more easily fit a set of observations. Therefore, a low loss might merely indicate higher specialization on the training dataset, but this does not necessarily mean better performance on test or live data. Furthermore, an inspection of the source code of the deep learning framework Caffe reveals that rather than outputting the total loss at the output layer, the training loss is computed as an average of the partial derivatives of the loss at each weight and bias. Thus, the relationship of loss and accuracy can vary between models with different numbers of parameters.

Footnote 38: Chance level ≈ class probability, e.g. the probability for 1 out of 10 compared to 1 out of 3 classes.

Figure 22: Plots (a) and (c) show training loss and accuracy for all model types for the 10-class and 3-class classification problems, respectively. The loss curve shows a running average of the loss. The standard deviation (SD) with respect to the running average is exemplified in (b) and (d) for two models. As becomes apparent in the plots, the different curve progressions for the averaged losses are meaningful even when considering variance. While the curve progressions show the training advance, the loss level is not directly related to the accuracy. A low loss only indicates that the model fits the training data well.

4.2. Comparison of Trained Models

This section compares the results of all model types. First the performance of these models on a 10-class and a generalized 3-class classification problem will be analyzed. Based on the results of these models, a reduced set of models for person detection, i.e. binary classification, has been trained and is analyzed separately. See table 2 for an overview of the class label sets. All predicted label images are post-processed through superpixel class voting in an attempt to improve semantic segmentation accuracy, see section 3.8. Results for class voting with Felzenszwalb and SEED superpixels are compared. The parameters of the methods have been determined empirically to yield the best possible segmentation and a roughly similar number of superpixels.39

As mentioned above, due to the small dataset the models are evaluated using 5-fold cross-validation. The stated values are averages over all folds. For multi-class classification problems, per-pixel and per-class accuracies are computed using the following formulas, respectively:

\[ \text{Pixel Accuracy} = \frac{\text{correctly classified pixels}}{\text{total number of labeled pixels}} \tag{37} \]

\[ \text{Class Accuracy} = \frac{\text{true positives for this class}}{\text{total number of pixels labeled with this class}} \tag{38} \]
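A minimal sketch of equations (37) and (38), assuming integer label images and that pixels of the class undefined are excluded, as stated in the text:

```python
# Minimal sketch of equations (37) and (38); pixels of the class "undefined"
# are excluded from both measures.
import numpy as np

def accuracies(pred, gt, num_classes, undefined_id):
    valid = gt != undefined_id
    pixel_acc = np.mean(pred[valid] == gt[valid])
    per_class = []
    for c in range(num_classes):
        members = valid & (gt == c)
        if members.any():
            per_class.append(np.mean(pred[members] == c))
    return pixel_acc, float(np.mean(per_class))   # per-pixel and average class accuracy
```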

It is debatable which of these values better represents the quality of a model. Pixel accuracy is the more intuitive quality measure, but the influence of minority classes40 on its value is small. If the correct detection of small objects is also important, e.g. light switches or small electronic devices in a robotics task, then class accuracy may be a more adequate indicator of quality. For a binary classification problem, per-pixel and per-class accuracies are identical. Additionally, other measures such as the true positive rate are given, see section 4.2.3. The quality measures are computed with regard to all labeled pixels in the ground truth; unlabeled pixels (class undefined, see table 2) are ignored. Examples of predicted label images for the three classification tasks can be found in figure 28.

Footnote 39: Ideally, the best possible parameters can be found using an automated grid search. While the SEED algorithm allows defining a fixed number of superpixels, the parameters for the Felzenszwalb algorithm have been set to produce a similar number.
Footnote 40: Minority classes are classes that have relatively few members, i.e. pixels labeled as belonging to them, compared to the whole dataset.

4.2.1. Models for 10-Class Classification

The best results are achieved by the models DNIR and DNIRS, which achieve the highest pixel accuracy of 58.1% and the highest average class accuracy of 48.4%, respectively. Depth and the skin images clearly play the most important role in improving semantic segmentation. For instance, depth is responsible for a 5% increase in pixel accuracy and a 5.5% increase in average class accuracy when comparing the models NIR and DNIR. The addition of skin information increases class accuracy by another 1.5% when comparing the DNIR and DNIRS models. The combination of NIR with RGB yields a 2% improvement in class accuracy, mostly due to a 17% increase for the class person and a 21% gain for the class lens artifacts, but it does not alter pixel accuracy. In general, combinations of NIR with RGB and skin images improve scene parsing, but the fact that the DNIRS, RGBDNIR and RGBDNIRS models perform worse than the DNIR model suggests that the combination of these image types with depth creates some kind of interference in the neural network.

|                       | pixel accuracy | class accuracy |
| NIR                   | 53.5%          | 39.2%          |
| DNIR                  | 58.1%          | 44.4%          |
| DNIR (Felzenszwalb)   | 61.2%          | 46.8%          |
| DNIRS                 | 56.9%          | 45.9%          |
| DNIRS (Felzenszwalb)  | 60.4%          | 48.4%          |
| RGBNIR                | 53.5%          | 41.4%          |
| RGBDNIR               | 55.5%          | 42.8%          |
| RGBDNIRS              | 56.0%          | 43.9%          |

Table 3: 10-class classification: Pixel and class accuracies for all models.

Semantic segmentation results are improved further through superpixel class voting in a post-processing step. The Felzenszwalb algorithm gives better results than the SEED algorithm for all models, causing an average 3.2% increase in pixel accuracy. Prediction


accuracy of the DNIR model is improved to 61.2% by class voting with Felzenszwalb superpixels. Table 3 shows pixel and average class accuracies for all models and also shows the impact of class voting for the winning models DNIR and DNIRS. A visual comparison of the original predictions and the class voting improvements for all models can be seen in figure 23, and a complete overview of the results is given in table 6 in appendix C. Another important measure for the quality of semantic segmentation is the accuracy per individual class, which is compared for all models in figure 24. It becomes evident that depth aids the correct classification of pixels belonging to bordering scene elements such as walls, floor and windows, while prediction for the class person is chiefly improved by skin and RGB images. However, some correlations are difficult to interpret. For instance, the accuracy of the classes ceiling, chair and artifacts clearly benefits from skin information. Furthermore, lens artifacts (i.e. blooming) are best recognized by the models RGBNIR and RGBDNIRS, which implies that the absence of blooming from the RGB images is an important cue; in contradiction to that, however, the model RGBDNIR yields a much lower accuracy for that class than the other two models with RGB input. Table 7 in appendix C gives a complete listing of per-class accuracies for all models.


Figure 23: 10 classes: Pixel and average class accuracy for all CNN models. The bar plots show the original segmentation results and the accuracies improved through superpixel class voting with the SEED and Felzenszwalb algorithms, see section 3.8. The horizontal lines indicate baseline accuracies for the majority classifier (also known as ZeroR), which predicts the class with the largest population. The use of depth clearly improves the scene parsing results. Using RGB and skin images increases the performance of the models slightly, but seems to be disadvantageous in certain combinations, most notably in the difference between DNIR and DNIRS.

Figure 24: 10 classes: Per-class-accuracy for all CNN models (without class voting). Next to the labels the class percentage with respect to the total number of pixels is given (the missing 3% belong to class undefined from the ground truth, also see table 2). Observe how skin and RGB images significantly improve accuracy for the classes person and artifacts, while depth seems to play an important role in the classification of structural elements (wall, floor, ceiling). Some correlations are difficult to interpret though, e.g. the improvement from DNIR to DNIRS for class ceiling.


4.2.2. Models for 3-Class Classification

The highest pixel accuracy for this reduced classification problem with the classes person, background and object is achieved by the DNIRS model with 70.6%, followed closely by RGBDNIRS with 69.0%. These models also yield the highest average class accuracies with 69.2% and 67.3%, respectively. The great performance leap for the class person is caused primarily by combinations of NIR with skin images. The results of the RGBNIR and RGBDNIR models suggest that RGB images also contain important cues for human detection – an observation which is consistent with the results from the 10-class classification models. Figure 26 shows the per-class accuracies for all models as a bar plot. See table 4 for a detailed list of pixel, average class and per-class accuracies for all models. For the winning model DNIRS, the improvement through class voting is also shown.

|                       | pixel acc. | class acc. | person | backgr. | obj.  |
| NIR                   | 61.8%      | 47.7%      | 12.3%  | 63.7%   | 66.9% |
| DNIR                  | 66.6%      | 53.9%      | 22.5%  | 69.8%   | 69.5% |
| DNIRS                 | 70.6%      | 69.2%      | 65.6%  | 70.9%   | 71.1% |
| DNIRS (SEED)          | 71.3%      | 68.6%      | 61.8%  | 71.5%   | 72.5% |
| DNIRS (Felzenszwalb)  | 72.4%      | 70.1%      | 64.4%  | 73.5%   | 72.4% |
| RGBNIR                | 63.4%      | 56.6%      | 39.1%  | 59.6%   | 71.2% |
| RGBDNIR               | 67.2%      | 60.2%      | 42.4%  | 65.6%   | 72.5% |
| RGBDNIRS              | 69.0%      | 67.3%      | 62.5%  | 66.7%   | 72.6% |

Table 4: 3-class classification: Pixel, average class and per-class accuracies for all models.

Superpixel class voting is applied to the predicted label images and improves the results, see figure 25. As before, class voting with Felzenszwalb superpixels outperforms the approach using SEED superpixels, raising pixel accuracy by 2.2% on average. Note however that while accuracy increases for the classes background and object, it decreases for the class person when class voting is used. This is likely caused by the suboptimal segmentation of human forms by the two approaches, which has been confirmed visually, e.g. in figure 21b in the previous chapter. For a complete overview of all class voting results, see table 8 in appendix C.


Figure 25: 3 classes: Pixel- and average class-accuracy for all CNN models. For this reduced classification problem, the advantage of using skin information is immediately visible and depth continues to play an important role in improving accuracy. Interestingly, the use of RGB images clearly improves segmentation when combined with NIR and depth in contrast to the results from 10-class classification, see figure 23. However, it seems disadvantageous in combination with skin images when comparing results for DNIRS and RGBDNIRS.

Figure 26: 3 classes: Per-class-accuracy for all CNN models. It is immediately apparent that skin images greatly improve the detection of people, while RGB aids the recognition of both the person and the object class. Class percentages are given with respect to all pixels, the missing 3% belonging to the class undefined.


4.2.3. Human Detection

Human detection is a binary classification task, since the CNN is trained to distinguish between person and other only. A subset of size 64, containing only images in which at least one person is visible, has been selected from the original dataset. The number of labeled pixels for this subset is ≈ 5 million. Human detection is applicable, for example, in ISF's SPAI project, where the main concern is safe person recognition in a robotized manufacturing line. If a person entering a previously defined danger zone is recognized, the system triggers an alarm and the industrial robots are stopped so as not to endanger the person.

In such a safety application, the number of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP) is more important than segmentation accuracy, because they give a measure of the safety of the system. The true positive rate TPR = TP/P is the proportion of pixels belonging to a person (or rather the image of a person) that have been correctly classified as such. Complementary to that is the false negative rate FNR = FN/P, which is a very important measure since a person who is not detected by a safety system might come to harm. The false positive rate FPR = FP/N is the proportion of background pixels that have been wrongly identified as belonging to a person. This rate is also important because a "false alarm" of a safety system can cause disruptions and imply additional expenses, e.g. when industrial robots and other machinery are unnecessarily stopped. The FPR is complementary to the true negative rate TNR.
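A minimal sketch of these rates for the binary person/other task, assuming boolean pixel masks (True = person) for prediction and ground truth and that both classes occur in the data:

```python
# Minimal sketch of the rates above for the binary person/other task, assuming
# boolean pixel masks (True = person) and that both classes occur in the data.
import numpy as np

def detection_rates(pred_person, gt_person):
    tp = np.sum(pred_person & gt_person)
    fn = np.sum(~pred_person & gt_person)
    fp = np.sum(pred_person & ~gt_person)
    tn = np.sum(~pred_person & ~gt_person)
    p, n = tp + fn, fp + tn
    return {"accuracy": (tp + tn) / (p + n),
            "TPR": tp / p, "FNR": fn / p,
            "FPR": fp / n, "TNR": tn / n}
```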

| Model    |                  | accuracy | TPR   | FNR   | TNR   | FPR  |
| DNIRS    | orig. prediction | 94.6%    | 59.5% | 40.5% | 98.0% | 2.0% |
|          | SEED             | 94.8%    | 55.7% | 44.3% | 98.6% | 1.4% |
|          | Felzenszwalb     | 94.9%    | 56.6% | 43.4% | 98.6% | 1.4% |
| RGBDNIRS | orig. prediction | 95.0%    | 65.7% | 34.3% | 97.8% | 2.2% |
|          | SEED             | 95.1%    | 61.9% | 38.1% | 98.3% | 1.7% |
|          | Felzenszwalb     | 95.3%    | 63.8% | 36.2% | 98.4% | 1.6% |

Table 5: Accuracy, TP, FN, TN and FP rates for human detection.

Table 5 shows accuracy and the aforementioned positive and negative rates for human detection. A plot of these values can be seen in figure 27. The RGBDNIRS model outperforms the DNIRS model by 0.4% in terms of accuracy, but more importantly by 6.2% in terms of TPR and FNR, achieving 95.0% accuracy and a true positive rate of 65.7%. The number of images in which a human has been detected is 63 out of 64 images for both models (as mentioned above, a person is visible in all images in this set).41

Superpixel class voting slightly improves the accuracy for both models and again the Felzenszwalb algorithm yields better results than SEED. This improvement is due to the fact that class voting increases the proportion of true negatives, and negatives (i.e. non-person pixels) constitute 94% of the image pixels, thus influencing accuracy more strongly than positives. However, class voting deteriorates the true positive rate, presumably because of the suboptimal superpixel segmentation for the class person that has already been observed in the 3-class approach, see section 4.2.2. The proportion of images in which a person has been detected also decreases with class voting, e.g. from 63 to 61 images for the Felzenszwalb algorithm and the RGBDNIRS model. Evidently, post-processing by class voting is disadvantageous for the presented human detection method.

To summarize, when considering the true positive rate, the results for human detection are not as good as the high accuracy of 95% seems to suggest at first glance. This is also substantiated by the fact that this value lies only a small margin above the 91.3% accuracy of the naive majority predictor, which simply predicts the majority class other/not-person.

Figure 27: Human detection (2 classes): Accuracy, true positive, false negative and false positive rate. Note that although accuracy is roughly similar for the DNIRS and RGBDNIRS models, the true positive rate is higher for the latter.

Footnote 41: In the same image in which human detection fails, the class person could also not be identified by the multi-class models, possibly due to bad illumination. This "problematic" image is shown in figure 37 in appendix C.


Figure 28: Examples of semantic image segmentation for the 10-class (a) and 3-class (b) classification tasks and for human detection / person segmentation (c) – comparison of ground truths (left) and predictions (right).


4.3. Discussion

The basic assumption of the presented approach is that since different types of images, namely NIR, RGB, depth and skin42, contain different kinds of information, the combination of these images constitutes an information gain. When used as input to a semantic segmentation method, such combinations of image types are therefore expected to enhance accuracy. Furthermore, if pairwise combinations improve the semantic segmentation results, e.g. DNIR and RGBNIR, the combination of more than two image types, e.g. RGBDNIR, should result in greater (or at least equal) improvements. This has proven to be only partially true. All pairwise combinations that have been tested enhance pixel and/or average class accuracy for all classification tasks when compared to the NIR-only input into the CNN. However, the evaluation results also show that there is no monotonically increasing relationship between the number of input modalities and accuracy.

Footnote 42: As mentioned before, the skin image is not a real image type but rather a hand-crafted feature map, since it is computed based on the NIR image using knowledge from prior research. Strictly speaking, the same can be said of a depth map. Nevertheless, in the scope of this evaluation these images are considered to be independent types.

Figure 29: Accuracies for 3-class and 10-class models. Evidently, accuracy does not always increase with the number of image types. The negative effect of RGB information on pixel accuracy is clearly visible.

For instance, based on this assumption one would expect the RGBDNIRS models to perform better than, or at least as well as, the DNIRS models, since both RGB and depth information have been shown to increase accuracy in combination with NIR for 3-class and 10-class classification. However, the contrary is the case for both classification tasks. Skin information was expected to only improve segmentation of the class person and

to not interfere with other classes; however, a similar phenomenon is observed. In 10-class classification one would expect the DNIRS model to achieve higher accuracy than the DNIR model, since RGBDNIRS outperforms RGBDNIR, but in this case the addition of skin information decreases pixel accuracy by more than 1%. Of the secondary image types, only depth always causes an increase in accuracy. Evidently RGB, depth and skin images are not always complementary, and their combinations can even cause some sort of interference in the CNN. In particular, RGB information seems to conflict with depth. The reason for this cannot be fully determined without a detailed examination of all filters and feature maps, which is beyond the scope of this thesis. It is arguable that one important issue is the suboptimal registration of the RGB to the NIR images. Since objects in an RGB image are not well aligned with respect to the other images, small objects may not be correctly recognized (see class object in figure 24). An RGB image may also cause more area of the image to be recognized as belonging to a certain class than is actually occupied by the corresponding object. Hence class accuracy increases for objects whereas it decreases for the background. Thus, even though depth and RGB information enhance accuracy when paired with NIR, features extracted from these image types can interfere with each other. Interestingly, for the human detection task the best result is achieved with the model that uses RGB information, although the results for 3-class classification suggest otherwise. This might be due to the fact that a reduced dataset is used for this task. An examination of the convolution layers is necessary to understand the relationship between the features extracted for different classification tasks, and how these features affect the results.

5. Conclusion and Future Work

5.1. Conclusion

A successful approach for semantic multi-modal image segmentation has been developed to show that semantic segmentation accuracy can be enhanced by combining different types of images as inputs to a multi-scale CNN, and that the results can be improved further by superpixel class voting. The image types are NIR, RGB and depth, which have been captured with a custom-built camera rig composed of a multi-channel NIR camera, an industrial RGB camera and a Kinect. A fourth image type that indicates the presence of human skin in the image is computed based on the NIR image, and is used for the purpose of further improving the recognition of people.

Evaluation for three different classification tasks, with ten, three and two classes, has been performed. For 10-class classification the DNIR model increases the pixel accuracy by 4.6% and the DNIRS model raises the average class accuracy by 6.7%, with respect to the NIR-only model. If one assumes class accuracy to be the more important measure, the winning model is DNIRS, which achieves, after superpixel class voting, a total of 60.4% pixel accuracy and 48.4% class accuracy. For 3-class classification, the findings are similar. The winning model DNIRS shows enhancements of 8.8% and 21.5% for pixel and average class accuracies, achieving, with class voting, a total of 72.4% and 70.1% respectively. For human detection, which is a binary classification task, only the two models that performed best with respect to the class person have been trained. A true positive rate of 65.7% and a classification accuracy of 95.3% have been achieved by the RGBDNIRS model. In contrast to the multi-class tasks, superpixel class voting proved to be disadvantageous in this case.

On a class level, depth information is responsible for enhancing accuracy for most of the classes, while RGB and skin images notably support the detection of people. While combining images of different modalities generally improves the results, accuracies do not necessarily increase with a higher number of image types. The combination of RGB images and depth maps seems to be particularly problematic. This is presumably due to the suboptimal registration of the RGB images, and the consequential ambiguity with respect to object borders.

RGB-to-NIR image registration has proven to be a difficult task. Of the three different approaches that have been proposed and tested, the simple horizontal shift has been identified as the best method. In light of the results, this part of the system needs to be improved. Furthermore, a depth estimation approach using a cross-spectral stereo image


from two cameras of different spectral regions (NIR and RGB) has been implemented and tested in order to dispense with the third (depth) camera. However, the computed depth map is of very poor quality, which led to the rejection of this method. To summarize, it has been shown that a CNN is capable of extracting and combining features from multi-modal image sets and of using them to predict scene labeling, and that the combination of different image types constitutes a significant improvement for solving this task. The quality of image registration has proven to be crucial to segmentation accuracy when using images from different cameras. The presented findings offer many starting points for future research in the area of semantic image segmentation, which is far from being solved.

5.2. Future Work

In the course of this thesis, issues have arisen that could not be addressed within its scope. The evaluation has also identified parts of the proposed system that need improvement. This section gives a brief outlook on possible future research with respect to these issues. In order to improve RGB-to-NIR image registration, reprojection could be used instead of geometric transformations. The depth map could be mapped onto the RGB image using the RGB camera parameters, and the color image pixels could then be reprojected into the NIR camera space using depth as the third dimension, similar to the method described in section 3.4.2. Instead of using an additional RGB camera, using the Kinect's RGB camera would be advantageous because of the small distance between its RGB camera and depth sensor. Alternatively, the camera rig could be modified to accommodate both the NIR and the RGB camera in the central orifice of the ring flash (see figure 33) in order to minimize the distance between the cameras. While superpixel class voting considerably increases segmentation accuracy in most cases, there is room for improvement. An automated grid search for optimal parameters of the Felzenszwalb algorithm could be implemented, with pixel and/or class accuracy as the target value (a sketch of such a search is given at the end of this section). A genetic algorithm could also be used for parameter optimization. The conversion to Y’CB CR color space is not well suited for multi-channel NIR images, as discussed in section 3.7.1. In order to separate brightness from “color” (or rather relationships between wavebands, since the term color is restricted to the visible spectrum), a more adequate conversion could be found that weighs all channels equally. Such a conversion is proposed in equation (54) in appendix C. As mentioned before, a detailed analysis of the filters and feature maps of the CNN can help to understand and improve the network architecture. Similar to the filter responses


that have been shown in figures 15 and 35, all convolution layers could be visualized and used in a diagnostic role following the approach of Zeiler and Fergus (2014). In this work the CNN is trained with natural class frequencies, i.e. each pixel in the dataset is used as training input with the same probability. Farabet et al. (2013) propose the use of balanced class frequencies as an alternative approach that can help to enhance per-class accuracy for minority classes, i.e. classes that comprise a relatively small number of pixels. In its simplest form balanced frequencies guarantee that all classes are equally represented in the training data. Depending on the task, the probability of occurrence can also be set individually for each class. Finally, there are a number of techniques that can enhance the performance of the proposed CNN approach, which could not be implemented in the given time frame. Among those are random dropout (mentioned in section 2.3.4), local normalization of feature maps and unsupervised pre-training (Hinton et al., 2012; Jarrett et al., 2009; Erhan et al., 2010). Using these techniques in combination with a deeper CNN architecture (20 layers and more) can drastically improve image classification performance (Szegedy et al., 2014), and can thus also be expected to increase accuracy when using CNNs for scene labeling. Now that the most promising models for semantic RGBDNIRS image segmentation have been identified, such a deeper network could be built and tested based on the proposed architecture.
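As an illustration of the parameter search suggested above, the following sketch performs an exhaustive grid search over the three parameters of the Felzenszwalb algorithm using its scikit-image implementation. It is a minimal sketch under stated assumptions, not the pipeline used in this thesis: the scoring function voted_class_accuracy(), the candidate parameter values and the data structures are hypothetical placeholders.

```python
import itertools
import numpy as np
from skimage.segmentation import felzenszwalb

def grid_search_felzenszwalb(images, predictions, ground_truths, voted_class_accuracy):
    """Exhaustively search Felzenszwalb parameters (scale, sigma, min_size).

    voted_class_accuracy(superpixel_maps, predictions, ground_truths) is a
    hypothetical scoring function that performs superpixel class voting on the
    CNN predictions and returns the average class accuracy over the dataset.
    """
    scales = [50, 100, 200, 400]        # example candidate values
    sigmas = [0.5, 0.8, 1.0]
    min_sizes = [20, 50, 100]

    best_score, best_params = -np.inf, None
    for scale, sigma, min_size in itertools.product(scales, sigmas, min_sizes):
        # Pre-segment every image with the current parameter combination.
        superpixel_maps = [
            felzenszwalb(img, scale=scale, sigma=sigma, min_size=min_size)
            for img in images
        ]
        score = voted_class_accuracy(superpixel_maps, predictions, ground_truths)
        if score > best_score:
            best_score, best_params = score, (scale, sigma, min_size)
    return best_params, best_score
```

The same loop could be driven by a genetic algorithm instead of itertools.product once the parameter space becomes too large for an exhaustive search.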

References

Atif, M. Optimal Depth Estimation and Extended Depth of Field from Single Images by Computational Imaging using Chromatic Aberrations. PhD thesis, 2013. Azevedo, F. A., Carvalho, L. R., Grinberg, L. T., Farfel, J. M., Ferretti, R. E., Leite, R. E., Lent, R., Herculano-Houzel, S., et al. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. Journal of Comparative Neurology, 513(5):532–541, 2009. Barrera Campo, F., Lumbreras Ruiz, F., and Sappa, A. D. Multimodal stereo vision system: 3d data extraction and algorithm evaluation. Selected Topics in Signal Processing, IEEE Journal of, 6(5):437–446, 2012. Bay, H., Tuytelaars, T., and Van Gool, L. Surf: Speeded up robust features. In Computer vision–ECCV 2006, pages 404–417. Springer, 2006. Bayer, B. Color imaging array, July 20 1976. URL http://www.google.com/patents/ US3971065. http://www.google.com/patents/US3971065. Biederman, I. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987. Bishop, C. M. et al. Neural networks for pattern recognition. Clarendon press Oxford, 1995. Bookstein, F. L. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6):567–585, 1989. Bouguet, J. Y. Camera calibration toolbox for matlab - fifth calibration example - calibrating a stereo system, stereo image rectification and 3d stereo triangulation. URL http://www. vision.caltech.edu/bouguetj/calib_doc/htmls/example5.html. Bradski, G. and Kaehler, A. Learning OpenCV: Computer vision with the OpenCV library. ” O’Reilly Media, Inc.”, 2008. Brown, D. C. Close-range camera calibration. Photogramm. Eng, 37:855–866, 1971. Brown, M. and Susstrunk, S. Multi-spectral sift for scene category recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 177–184. IEEE, 2011. Carr, H., Snoeyink, J., and van de Panne, M. Progressive topological simplification using contour trees and local spatial measures. In 15th Western Computer Graphics Symposium, British Columbia, volume 86, 2004. Couprie, C., Farabet, C., Najman, L., and LeCun, Y. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572, 2013. Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague, 2004.


Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. Doll´ ar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detection: A benchmark. In CVPR, June 2009. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010. Ess, A., Leibe, B., Schindler, K., , and van Gool, L. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’08). IEEE Press, June 2008. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. Farabet, C., Martini, B., Akselrod, P., Talay, S., LeCun, Y., and Culurciello, E. Hardware accelerated convolutional neural networks for synthetic vision systems. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 257–260. IEEE, 2010. Farabet, C., Couprie, C., Najman, L., and LeCun, Y. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1915–1929, 2013. Felzenszwalb, P. F. and Huttenlocher, D. P. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004. Fischler, M. A. and Bolles, R. C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24 (6):381–395, 1981. Fofi, D., Sliwa, T., and Voisin, Y. A comparative survey on invisible structured light. In Electronic Imaging 2004, pages 90–98. International Society for Optics and Photonics, 2004. Forsyth, D. A. and Ponce, J. Computer vision: a modern approach. Prentice Hall Professional Technical Reference, 2002. Fryer, J. G. and Brown, D. C. Lens distortion for close-range photogrammetry. Photogrammetric engineering and remote sensing, (52):51–58, 1986. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980. Gkioxari, G., Hariharan, B., Girshick, R., and Malik, J. R-cnns for pose estimation and action detection. arXiv preprint arXiv:1406.5212, 2014.


Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Volume, volume 15, pages 315–323, 2011. Gould, S., Fulton, R., and Koller, D. Decomposing a scene into geometric and semantically consistent regions. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1–8. IEEE, 2009. Grangier, D., Bottou, L., and Collobert, R. Deep convolutional networks for scene parsing. In ICML 2009 Deep Learning Workshop, volume 3. Citeseer, 2009. Hamamatsu Photonics, K. K. Opto-Semiconductor Handbook, Chapter 5 - Image SSensor (Hamamatsu). Hamamatsu Photonics, K. K., 2014. Hern´ andez-L´ opez, J.-J., Quintanilla-Olvera, A.-L., L´opez-Ram´ırez, J.-L., Rangel-Butanda, F.-J., Ibarra-Manzano, M.-A., and Almanza-Ojeda, D.-L. Detecting objects using color and depth segmentation with kinect sensor. Procedia Technology, 3:196–204, 2012. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):328–341, 2008. Holz, D., Holzer, S., Rusu, R. B., and Behnke, S. Real-time plane segmentation using rgb-d cameras. In RoboCup 2011: Robot Soccer World Cup XV, pages 306–317. Springer, 2012. Horn, B. K. P. Focusing. 1968. Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology, 148(3):574, 1959. Hubel, D. H. and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962. Irani, M. and Anandan, P. Robust multi-sensor image alignment. In Computer Vision, 1998. Sixth International Conference on, pages 959–966. IEEE, 1998. Jahne, B. Digital Image Processing: Concept, Algorithms, and Scientific Applications. Springer, 1997. Jain, A., Tompson, J., LeCun, Y., and Bregler, C. Modeep: A deep learning framework using motion features for human pose estimation. In Computer Vision–ACCV 2014, pages 302–315. Springer, 2014. Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.


Kang, Y., Yamaguchi, K., Naito, T., and Ninomiya, Y. Multiband image segmentation and object recognition for understanding road scenes. Intelligent Transportation Systems, IEEE Transactions on, 12(4):1423–1433, 2011. Konolige, K. Small vision systems: Hardware and implementation. In Robotics Research, pages 203–212. Springer, 1998. Kriesel, D. A Brief Introduction to Neural Networks. 2007. URL http://www.dkriesel.com. Krotkov, E. Focusing. International Journal of Computer Vision, 1(3):223–237, 1988. Krotosky, S. J. and Trivedi, M. M. Mutual information based registration of multimodal stereo videos for person tracking. Computer Vision and Image Understanding, 106(2):270–287, 2007. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. LeCun, Y., Kavukcuoglu, K., and Farabet, C. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 253–256. IEEE, 2010. Liu, C., Yuen, J., and Torralba, A. Sift flow: Dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978–994, 2011. Lowe, D. G. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999. Luo, P., Tian, Y., Wang, X., and Tang, X. Switchable deep network for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 899–906. IEEE, 2014. MacCormick, J. How does the kinect work? Presented at Dickinson College, 6, 2011. Microsoft. Object class recognition - microsoft research, 2015. URL http://research.microsoft.com/en-us/projects/ObjectClassRecognition/. Nielsen, M. A. Neural Networks and Deep Learning. 2015. URL http://neuralnetworksanddeeplearning.com/.

Omondi, A. R. and Rajapakse, J. C. FPGA implementations of neural networks, volume 365. Springer, 2006. Ouyang, W. and Wang, X. Joint deep learning for pedestrian detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2056–2063. IEEE, 2013. Pao, Y. Adaptive pattern recognition and neural networks. Reading, MA (US); Addison-Wesley Publishing Co., Inc., 1989.


Paris, S. and Durand, F. A fast approximation of the bilateral filter using a signal processing approach. In Computer Vision–ECCV 2006, pages 568–580. Springer, 2006. Paschotta, R. Encyclopedia of laser physics and technology, volume 1. Wiley-vch Berlin, 2008. URL http://www.rp-photonics.com/fiber_lasers_vs_bulk_lasers.html. Pascucci, V. and Cole-McLaughlin, K. Efficient computation of the topology of level sets. In Proceedings of the conference on Visualization’02, pages 187–194. IEEE Computer Society, 2002. Pinggera, P., Breckon, T., and Bischof, H. On cross-spectral stereo matching using dense gradient features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012. Pinheiro, P. and Collobert, R. Recurrent convolutional neural networks for scene labeling. In Proceedings of The 31st International Conference on Machine Learning, pages 82–90, 2014. Princeton Instruments, I. G. Introduction to scientific ingaas fpa cameras. Technical report, Princeton Instruments, Inc., 2012. Reeb, G. Sur les points singuliers d’une forme de pfaff complètement intégrable ou d’une fonction numérique. CR Acad. Sci. Paris, 222:847–849, 1946. Rosenblatt, F. Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical report, DTIC Document, 1961. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Cognitive modeling, 1988.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014. Russell, B. C., Torralba, A., Murphy, K. P., and Freeman, W. T. Labelme: a database and webbased tool for image annotation. International journal of computer vision, 77(1-3):157–173, 2008. Salamati, N., Larlus, D., Csurka, G., and S¨ usstrunk, S. Semantic image segmentation using visible and near-infrared channels. In Computer Vision–ECCV 2012. Workshops and Demonstrations, pages 461–471. Springer, 2012. Sapp, B. and Taskar, B. Modec: Multimodal decomposable models for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3674– 3681. IEEE, 2013. Schalkoff, R. J. Artificial neural networks. McGraw-Hill New York, 1997. Schulz, H. and Behnke, S. Learning object-class segmentation with convolutional neural networks. In 11th European Symposium on Artificial Neural Networks (ESANN), volume 3, 2012. Schwaneberg, O. Concept, system design, evaluation and safety requirements for a multispectral sensor. PhD thesis, 2013.


Schwaneberg, O., K¨ ockemann, U., Steiner, H., Sporrer, S., Kolb, A., and Jung, N. Material classification through distance aware multispectral data fusion. Measurement Science and Technology, 24(4):045001, 2013. Schwartz, W. R., Kembhavi, A., Harwood, D., and Davis, L. S. Human detection using partial least squares analysis. In Computer vision, 2009 IEEE 12th international conference on, pages 24–31. IEEE, 2009. Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. Pedestrian detection with unsupervised multi-stage feature learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3626–3633. IEEE, 2013. Shen, X., Xu, L., Zhang, Q., and Jia, J. Multi-modal and multi-spectral registration for natural images. In Computer Vision–ECCV 2014, pages 309–324. Springer, 2014. Silberman, N. and Fergus, R. Indoor scene segmentation using a structured light sensor. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 601–608. IEEE, 2011. Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012, pages 746–760. Springer, 2012. Simard, P. Y., Steinkraus, D., and Platt, J. C. Best practices for convolutional neural networks applied to visual document analysis. In 2013 12th International Conference on Document Analysis and Recognition, volume 2, pages 958–958. IEEE Computer Society, 2003. Smisek, J., Jancosek, M., and Pajdla, T. 3d with kinect. In Consumer Depth Cameras for Computer Vision, pages 3–25. Springer, 2013. Steiner, H., Sporrer, S., Kolb, A., and Jung, N. Design of an active multispectral swir camera system for skin detection and face verification. Journal of Sensors, 501:456368, 2015. Subbarao, M. and Surya, G. Depth from defocus: a spatial domain approach. International Journal of Computer Vision, 13(3):271–294, 1994. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013. Suzuki, S. et al. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014. Tenenbaum, J. M. Accommodation in computer vision. Technical report, DTIC Document, 1970. Toshev, A. and Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653– 1660. IEEE, 2014.


Trouv´e, P., Champagnat, F., Le Besnerais, G., Sabater, J., Avignon, T., and Idier, J. Passive depth estimation using chromatic aberration and a depth from defocus approach. Applied optics, 52(29):7152–7164, 2013. Tsai, R. Y. A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses. Robotics and Automation, IEEE Journal of, 3 (4):323–344, 1987. Van den Bergh, M., Boix, X., Roig, G., de Capitani, B., and Van Gool, L. Seeds: Superpixels extracted via energy-driven sampling. In Computer Vision–ECCV 2012, pages 13–26. Springer, 2012. Velte, M. Depth estimation from single images in the near-infrared spectrum using chromatic aberrations. Technical Report on Research Project in Master Studies (Computer Science), 2014. Willamowski, J., Arregui, D., Csurka, G., Dance, C. R., and Fan, L. Categorizing nine visual classes using local appearance descriptors. illumination, 17:21, 2004. Yao, C., Bai, X., Liu, W., and Latecki, L. J. Human detection using learned part alphabet and pose dictionary. In Computer Vision–ECCV 2014, pages 251–266. Springer, 2014. Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014. Zhang, Z. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, 2000. Zhang, Z. Microsoft kinect sensor and its effect. MultiMedia, IEEE, 19(2):4–10, 2012. Zhu, J. and Sutton, P. Fpga implementations of neural networks–a survey of a decade of progress. In Field Programmable Logic and Application, pages 1062–1066. Springer, 2003. ZooFari. Example of epipolar geometry. Public Domain, 2009. URL http://upload.wikimedia. org/wikipedia/commons/f/f2/Epipolar_Geometry1.svg. accessed: 05.25.2015. Zurada, J. M. Introduction to artificial neural systems. West St. Paul, 1992.


Appendix A. Camera Calibration

The multi-modal images used in the present scene parsing approach come from separate cameras at different positions in world space. They must therefore be registered, i.e. corresponding segments should overlap and object borders should be aligned, before they can be used for training the segmentation models. In order to do so, certain camera and lens parameters must be known. This section explains the principles of camera geometry and lens distortion and how the appropriate parameters can be estimated. It closes with a description of stereo calibration and rectification and the employed rectification method. (Camera calibration is in large part implemented using convenience methods from the OpenCV library; rather than merely naming these methods, their functionality is surveyed here from a scientific standpoint.)

A.1. Geometry of Lens and Optical Systems

The basic geometry of ray projection in an optical system is best explained using the simple pinhole camera model. It is composed of an infinitely small pinhole, through which passes a light ray from an object point [X, Y, Z]^T to meet the image plane at [x, y, -d_0]^T. Thus, the 2-dimensional image coordinates are [x, y]^T. Z is the distance of the object point from the focal plane and d_0 the distance between focal plane and image plane, see figure 30. For convenience one can interchange pinhole and image plane, reinterpreting the pinhole as the center of projection and effectively modeling a perspective projection. Since in the pinhole camera model the focal length f is exactly the distance from pinhole to image plane, so that f = d_0, the simplified real-world-to-image coordinate transformation can be written as

$$x = f \frac{X}{Z}, \qquad y = f \frac{Y}{Z}. \qquad (39)$$

The pinhole camera model is a simplification of a real imaging system. It assumes perfect rectilinear projection and ignores lens distortion and aspects of camera geometry that are present in a real-world optical system.



Figure 30: Pinhole camera model: An object point in 3D world coordinates is projected onto the image plane by a single ray falling through the pinhole, resulting in a 2D image of the point. (Jahne, 1997, fig. 7.3)

Figure 31: Model of a real optical system (approximation): Principal points P1 and P2 describe the lens thickness. An object at distance d from outer focal point F1 causes an image to be formed at distance d0 from inner focal point F2 . (Jahne, 1997, fig. 7.7)

Intrinsic camera parameters: The camera geometry involves the relative positions of lens and image sensor. It defines the focal length f and the coordinates [c_x, c_y]^T, which describe a point on the sensor onto which all points along the optical axis are mapped, modeling a possible deviation of the camera sensor's center from the optical axis. These parameters are known as the intrinsic camera parameters. In a real optical system (see figure 31 for an approximation of such) a sharp image is formed only within a certain distance range defined by f. The actual focal length is commonly represented by two separate parameters f_x and f_y in pixel units, which are products of the actual focal length f and the length and height s_x and s_y of the image sensor elements (which are in general not perfectly square), i.e. f_x = f s_x and f_y = f s_y. Actual focal length and sensor elements cannot be measured directly unless the camera is disassembled and have to be inferred via camera calibration. Intrinsic parameters are commonly written as a camera matrix (Jahne, 1997, chap. 7; Bradski and Kaehler, 2008, chap. 11):

$$A = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \qquad (40)$$

With these parameters a 3D point [X, Y, Z]^T in the real world is projected onto the 2D image point [x, y]^T using the following formula:

$$x = f_x \frac{X}{Z} + c_x, \qquad y = f_y \frac{Y}{Z} + c_y. \qquad (41)$$
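For illustration, the projection of equations (40) and (41) can be written in a few lines of NumPy. The intrinsic values below are made up for the example and do not correspond to any of the cameras used in this work.

```python
import numpy as np

# Hypothetical intrinsic parameters in pixel units.
fx, fy = 800.0, 805.0
cx, cy = 320.0, 240.0
A = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])          # camera matrix from equation (40)

def project(point_3d, camera_matrix):
    """Project a 3D point in camera coordinates to 2D pixel coordinates (equation (41))."""
    p = camera_matrix @ np.asarray(point_3d, dtype=float)
    return p[:2] / p[2]                  # divide by Z

print(project([0.1, -0.05, 2.0], A))     # -> [360.0, 219.875]
```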

Lens distortion occurs because most imaging systems use spherical lenses, which are much easier to manufacture and thus more affordable than the mathematically more appropriate aspherical or parabolic lenses. Distortion causes straight lines to appear curved on the image plane, i.e. squares yield barrel- or cushion-shaped images. The two most important types of lens distortion are radial distortion, which is caused by the shape of the spherical lens, and tangential distortion, which arises from the imperfect manufacturing process, i.e. lens and image sensor are not perfectly parallel. Radial distortion is perceived as the so-called barrel effect that can be seen in figure 32a, causing image points outside the image center to be shifted outwards. It equals zero in the optical center and increases towards the sides of a lens. The corrected coordinates x' and y' can be approximated for every image pixel [x, y]^T by the first two (for cameras with low distortion) or three (for high distortion) terms k_1, k_2, k_3 of a Taylor series expansion around r = 0:

$$x' = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad y' = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad (42)$$

r being the radius of the pixel, i.e. its distance from the image center. Tangential distortion can be corrected with two parameters p_1 and p_2, by which the deviation from perfect alignment of lens and image sensor is characterized. The corrected coordinates x'' and y'' are computed via

$$x'' = x + \bigl(p_1 (r^2 + 2x^2) + 2 p_2 x y\bigr), \qquad y'' = y + \bigl(p_2 (r^2 + 2y^2) + 2 p_1 x y\bigr). \qquad (43)$$


It follows that a complete undistortion (i.e. distortion correction) can be achieved by computing

$$\begin{aligned} x_{\mathrm{undist}} &= x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + \bigl(p_1 (r^2 + 2x^2) + 2 p_2 x y\bigr) \\ y_{\mathrm{undist}} &= y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + \bigl(p_2 (r^2 + 2y^2) + 2 p_1 x y\bigr) \end{aligned} \qquad (44)$$

according to this lens distortion model. Thus, the distortion caused by a camera lens can be expressed as one vector containing those five distortion coefficients: [k_1, k_2, p_1, p_2, k_3]. Equation (44) is used to compute corrected coordinates for all pixels using this coefficient vector. Missing information, i.e. “empty” or overlapping corrected pixels, is dealt with via interpolation of intensity values. Observe in figure 32b how all lines are parallel after undistortion (Bradski and Kaehler, 2008, chap. 11). A thorough elaboration of the lens distortion model and the derivations of the above equations is beyond the scope of this thesis; please refer to (Brown, 1971) and (Fryer and Brown, 1986) for more information. Other, less impactful types of lens distortion do exist, e.g. chromatic aberration and astigmatism, but are ignored in this context. (Chromatic aberration can, however, be used for depth estimation; see section 3.5.2.)

Extrinsic camera parameters: Poses of objects and cameras relative to each other are important information for camera calibration and for the rectification of images from multiple cameras, which will be explained in the following sections. If one thinks of every object (including cameras) as having its own coordinate system with itself at the origin, the relative position of one object in the coordinate system of another can be expressed by rotation and translation. (In the case of cameras, their reproduction ratios have to be matched first by scaling, so that real-world distances create the same distance in pixels on all images.) The translation vector $\vec{T}$ is the offset from one point in space to another. In terms of two coordinate systems A and B this simply means that $\vec{T} = \mathrm{origin}_A - \mathrm{origin}_B$. The 3-dimensional rotation matrix R is the product of two-dimensional rotations around each axis. If ψ, φ and θ are the rotation angles around the x-, y- and z-axes, then R = R_x(ψ) R_y(φ) R_z(θ). A point P can thus be rotated in space to a new position P' by matrix multiplication: P' = RP. The inverse rotation can be achieved by multiplying a point with the inverse rotation matrix, so that

P = R^T P' = R^T (RP) (since R is orthogonal, R^{-1} = R^T). The two-dimensional rotation matrices are

$$R_x(\psi) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & \sin\psi \\ 0 & -\sin\psi & \cos\psi \end{pmatrix}, \quad R_y(\phi) = \begin{pmatrix} \cos\phi & 0 & -\sin\phi \\ 0 & 1 & 0 \\ \sin\phi & 0 & \cos\phi \end{pmatrix}, \quad R_z(\theta) = \begin{pmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (45)$$

Evidently, as a product of these three matrices, R is also a 3 × 3 matrix. R and T are known as the extrinsic camera parameters. In order to compute the coordinates of point P_A, given in coordinate system A, in terms of coordinate system B, one would compute $P_{A_B} = R(P_A - \vec{T})$, where $P_{A_B}$ are P_A's coordinates in B.

When using homogeneous coordinates (see below), translation can also be expressed as a matrix multiplication, so that for the translation vector $\vec{T} = [t_x, t_y, t_z]$ a translation matrix T can be defined as

$$T = \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix}. \qquad (46)$$
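A minimal NumPy sketch of equations (45) and (46), composing the full rotation from the three axis rotations and building the homogeneous translation matrix, could look as follows; the angles and offsets are arbitrary example values, not calibration results from this work.

```python
import numpy as np

def rot_x(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]])   # R_x from equation (45)

def rot_y(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]])   # R_y

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]])   # R_z

def translation(tx, ty, tz):
    """Homogeneous 4x4 translation matrix from equation (46)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

# Full rotation R = R_x(psi) R_y(phi) R_z(theta); example angles in radians.
R = rot_x(0.1) @ rot_y(0.2) @ rot_z(0.3)
assert np.allclose(R @ R.T, np.eye(3))   # R is orthogonal, so R^-1 = R^T
```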

A.2. Planar Homography

In computer vision, planar homography is a relation between two planar surfaces, or rather the projective mapping between them. (Meanings of the term “homography” vary slightly among different sciences.) Homogeneous coordinates have the advantage that, by using them, projective transformations can be easily computed via matrix multiplication. Every Cartesian point P = [x, y]^T can be represented by a set of homogeneous coordinates, so that [xs, ys, s]^T represents P for every non-zero real number s (the arbitrary scaling factor). If one chooses s = 1, then [x, y, 1]^T is a homogeneous coordinate for P. The homography H is composed of the physical transformations W = [R T] from one plane to another, i.e. the extrinsic parameters for rotation and translation, and the projection represented by the camera matrix M containing the intrinsic parameters. As will be explained in the following appendix A.3, a chessboard pattern printed on a flat (planar) object is used for calibration, so that the projection is between planes. The object plane can be defined to have Z = 0 without loss of generality if one assumes that the calibration object is at a fixed place and the camera moves through space. Thus, the third column of rotation matrix R can be ignored. This means that the transformation W = [R T] is a 3 × 3 matrix where the first two columns correspond to the first two columns of R, and the last column is the last column of translation matrix T shortened by one row, see equation (46). The homography can therefore be defined as the 3 × 3 matrix H = sMW (here s is the arbitrary scaling factor from above, factored out of H). A point P_A on source plane A is related to P_B on destination plane B by

$$\begin{aligned} P_B &= H P_A & \begin{pmatrix} x_B \\ y_B \\ 1 \end{pmatrix} &= H \begin{pmatrix} x_A \\ y_A \\ 1 \end{pmatrix} \\ P_A &= H^{-1} P_B & \begin{pmatrix} x_A \\ y_A \\ 1 \end{pmatrix} &= H^{-1} \begin{pmatrix} x_B \\ y_B \\ 1 \end{pmatrix} \end{aligned} \qquad (47)$$
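Applying a homography to image points as in equation (47) amounts to a matrix multiplication followed by a division by the homogeneous coordinate. The sketch below illustrates this with an arbitrary example matrix and, for comparison, OpenCV's perspectiveTransform() convenience function; it is an illustration only, not the registration code used in this thesis.

```python
import numpy as np
import cv2

# Hypothetical 3x3 homography (e.g. as it could be estimated by cv2.findHomography).
H = np.array([[1.02, 0.01,  5.0],
              [0.00, 0.98, -3.0],
              [0.00, 0.00,  1.0]])

# Manual mapping of a single point according to equation (47).
p_a = np.array([100.0, 50.0, 1.0])           # homogeneous source point
p_b = H @ p_a
p_b = p_b[:2] / p_b[2]                       # back to Cartesian coordinates

# The same mapping via OpenCV; perspectiveTransform expects shape (N, 1, 2).
pts = np.array([[[100.0, 50.0]]], dtype=np.float64)
p_b_cv = cv2.perspectiveTransform(pts, H)[0, 0]
assert np.allclose(p_b, p_b_cv)
```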

A.3. Estimation of Calibration Parameters

In order to estimate lens distortion coefficients and intrinsic and extrinsic camera parameters, at least two views of a calibration object with known geometry must be captured. As mentioned above, a chessboard pattern is used as calibration object, see figure 32 for an example. Each chessboard yields a number of planar objects with the same orientation, i.e. black-and-white squares, which are detected as rectangles defined by four corners. For this, the image is thresholded and contours are retrieved from the resulting binary image using an approach based on Suzuki et al. (1985). After that a topological cleanup is performed using the concept of contour trees, first proposed by Reeb (1946) and improved in later works such as Pascucci and Cole-McLaughlin (2002) and Carr et al. (2004), among others. For more information on contour retrieval and chessboard corner detection in particular please refer to those authors and pertinent literature. A total of 10 parameters must be solved for in each view of the calibration pattern, i.e. four intrinsic parameters (f_x, f_y, c_x, c_y) and six extrinsic parameters (rotation angles ψ, φ, θ and entries t_x, t_y, t_z of the translation vector). In practice, rather than angles, OpenCV uses the Rodrigues rotation vector representation, which is more efficient for numerical optimization procedures. Since the chessboard calibration object is planar, a homography matrix can be used to solve for intrinsic and extrinsic camera parameters. While intrinsic and extrinsic parameters depend on 3D geometry, the distortion coefficients describe a transformation in two dimensions and are therefore estimated separately (distortion coefficients can be inferred using equation (44), see below). The 10 parameters imply that a minimum of 10 constraints is necessary. Although a chessboard pattern yields multiple squares, effectively only four corners can be used per view because all lie in the same plane. Thus, considering that each corner point has two coordinates (x, y), a minimum of two views at different angles is necessary to solve for the 10 parameters. In practice, though, generally more than 10 views are recorded to account for noise and guarantee numeric stability. Although in reality the chessboard pattern is captured at different angles with a fixed camera, one can assume a moving camera and a fixed chessboard without loss of generality. For a fixed object it can be assumed that Z = 0, so that the homography H is the 3 × 3 matrix from equation (47). It was said above that H = sMW, where M is the camera matrix and W is a matrix composed of the first two columns r_1 and r_2 of rotation matrix R and one column t corresponding to the translation vector, so that t = [t_x, t_y, t_z]^T. Furthermore, s is an arbitrary scaling factor extracted from H. If one considers H to be composed of column vectors so that H = [h_1, h_2, h_3], this yields

$$[h_1, h_2, h_3] = sM[r_1, r_2, t] \quad\Rightarrow\quad r_1 = \tfrac{1}{s} M^{-1} h_1, \qquad r_2 = \tfrac{1}{s} M^{-1} h_2, \qquad t = \tfrac{1}{s} M^{-1} h_3. \qquad (48)$$

Since the scaling factor s is extracted, the orthogonal rotation vectors are orthonormal, which results in a first constraint:

$$r_1^T r_2 = 0 \quad\Leftrightarrow\quad (M^{-1} h_1)^T M^{-1} h_2 = 0 \qquad (49)$$

The magnitudes of the vectors being equal, the following also holds: r_3 = r_1 × r_2 and particularly

$$r_1^T r_1 = r_2^T r_2 \quad\Leftrightarrow\quad (M^{-1} h_1)^T M^{-1} h_1 = (M^{-1} h_2)^T M^{-1} h_2, \qquad (50)$$


Figure 32: NIR image before and after undistortion: In the distorted image (a), the chessboard calibration pattern is barrel-shaped and lines are curved. In the undistorted image (b), all lines are straight and the pattern's structure is rectangular (considering perspective).

which is a second constraint. Following this reasoning one can solve for all intrinsic and extrinsic camera parameters. For a detailed description please refer to Zhang (2000) and Bradski and Kaehler (2008, chap. 11). In order to estimate the lens distortion coefficients, the results of the previous intrinsic/extrinsic calibration are used as starting points while assuming zero distortion. Let P_ud = [x_ud, y_ud]^T be an undistorted point and P_d = [x_d, y_d]^T the corresponding distorted point. Assuming zero distortion, P_ud can be computed using intrinsic and extrinsic parameters with the real-world-to-image transformation from equation (39):

$$\begin{pmatrix} x_{ud} \\ y_{ud} \end{pmatrix} = \begin{pmatrix} f_x \frac{X_w}{Z_w} + c_x \\ f_y \frac{Y_w}{Z_w} + c_y \end{pmatrix} \qquad (51)$$

Here X_w, Y_w and Z_w are the calibration object's coordinates mapped to the camera coordinate system using the extrinsic parameters W = [R T]. When substituting with the undistortion formula in equation (44) one obtains:

$$\begin{pmatrix} x_{ud} \\ y_{ud} \end{pmatrix} = \begin{pmatrix} x_d \\ y_d \end{pmatrix} (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + \begin{pmatrix} p_1 (r^2 + 2x_d^2) + 2 p_2 x_d y_d \\ p_2 (r^2 + 2y_d^2) + 2 p_1 x_d y_d \end{pmatrix} \qquad (52)$$

Note again that [xud , yud ]T are the ideal coordinates computed assuming no distortion, but what is really perceived in the image are coordinates [xd , yd ]T . From this discrepancy


the distortion coefficients (k_1, k_2, k_3, p_1, p_2) can be inferred by solving these equations for numerous points. This approach is based on Brown (1971); see this work for more details. Once k_1, k_2, k_3, p_1 and p_2 have been found, the intrinsic and extrinsic parameters must be estimated again, this time considering lens distortion. The error for the estimated parameters is defined by the distances between measured and reprojected points, for a transformation from the chessboard plane in one view to another. This procedure is repeated until the reprojection error is less than a predefined threshold or a maximum number of iterations is reached. (Bradski and Kaehler, 2008, chap. 11)
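The estimation procedure described in this section is conveniently wrapped by OpenCV. The following sketch outlines how the chessboard views could be processed with findChessboardCorners(), calibrateCamera() and undistort(); the board size and file paths are hypothetical, and the error handling of an actual implementation is omitted.

```python
import glob
import cv2
import numpy as np

board_size = (9, 6)   # number of inner chessboard corners (hypothetical)

# 3D object points of the chessboard corners; the board is planar, so Z = 0.
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)

obj_points, img_points, image_size = [], [], None
for path in glob.glob("calib_images/*.png"):          # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]                     # (width, height)
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Jointly estimates the camera matrix, the five distortion coefficients and the
# per-view extrinsics by iteratively minimizing the reprojection error.
rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)

# Undistort an image with the estimated parameters (cf. equation (44)).
img = cv2.imread("calib_images/example.png")          # hypothetical file
undistorted = cv2.undistort(img, camera_matrix, dist_coeffs)
```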

A.4. Stereo Calibration and Rectification

Image rectification is a transformation which projects multiple images onto a common plane. Stereo image rectification in particular also guarantees that corresponding pixel rows in the images are aligned, which is fundamental for finding correspondences in a pair of stereo images in order to compute a disparity map, which can be used as a map of relative depths. This is of importance for the cross-spectral stereo matching method that is investigated, see section 3.5.2. It is also very useful for image registration, see section 3.4. It has been shown above how to estimate parameters for rotation and translation from a calibration object (the chessboard pattern) to the camera coordinate system. The same concept applies to transformations between two camera coordinate systems, e.g. from a stereo pair. If R_l, T_l and R_r, T_r are the rotation matrices and translation vectors for the left and right cameras and R, T are the transformations between the camera views, then

$$R = R_r R_l^T, \qquad T = T_r - R\,T_l. \qquad (53)$$

In order to keep the change for each image to a minimum, rather than rotating one camera view completely, the rotation matrix is divided, so that each camera is rotated by only half the value. Next the epipolar lines of the camera images must be aligned (which will cause pixel rows to be aligned), see section 3.5.2 for more detail. In the present work an unpublished algorithm by J.Y. Bouguet from the Matlab camera calibration toolbox is used, which is an improvement of the approach presented in Tsai (1987) and Zhang (2000). Please refer to these articles and the online source Bouguet for more detail. The results of calibration and rectification are encoded in the form of an undistortion and rectification map for each camera, which relates it to another camera in a stereo


pair. The map is essentially a 2-dimensional lookup table in matrix form, containing corrected and rectified positions for all image pixels.
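A condensed sketch of how such undistortion and rectification maps can be produced with OpenCV is given below. It assumes that the intrinsic parameters and chessboard correspondences of both cameras are already available from the single-camera calibration of appendix A.3; all variable names are placeholders, and the sketch does not reproduce the Bouguet-based method used in this work.

```python
import cv2

def rectify_stereo_pair(obj_points, img_pts_left, img_pts_right,
                        M_left, dist_left, M_right, dist_right,
                        image_size, img_left, img_right):
    """Stereo-calibrate, rectify and remap an image pair (sketch)."""
    # Estimate the rotation R and translation T between the two cameras,
    # keeping the previously estimated intrinsics fixed.
    _, M1, d1, M2, d2, R, T, E, F = cv2.stereoCalibrate(
        obj_points, img_pts_left, img_pts_right,
        M_left, dist_left, M_right, dist_right, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)

    # Rectification transforms and new projection matrices for both cameras.
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(M1, d1, M2, d2, image_size, R, T)

    # Per-camera undistortion/rectification lookup maps (the maps described above),
    # then remap both images so that corresponding pixel rows are aligned.
    map1 = cv2.initUndistortRectifyMap(M1, d1, R1, P1, image_size, cv2.CV_32FC1)
    map2 = cv2.initUndistortRectifyMap(M2, d2, R2, P2, image_size, cv2.CV_32FC1)
    left = cv2.remap(img_left, map1[0], map1[1], cv2.INTER_LINEAR)
    right = cv2.remap(img_right, map2[0], map2[1], cv2.INTER_LINEAR)
    return left, right
```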

B. Experimental Camera Setup

Figure 33: Experimental camera rig consisting of one RGB and one NIR camera, a multi-spectral NIR ring flash and a structured light depth sensor.

The experimental hardware setup used to acquire the testing and training images consists of cameras for different image types (RGB, NIR and depth). We use gigabit ethernet (GigE) RGB and NIR cameras which are connected via a power-over-ethernet (PoE) switch (which dispenses with the need for separate camera power supplies) to a laptop running the image capturing software. In addition, the camera rig includes a multi-spectral NIR ring-flash composed of three different LED types (of 970 nm, 1300 nm and 1550 nm wavelength) plus dispersion lenses. Please refer to section 2.1 for details on multi-channel NIR image acquisition. The depth map is obtained using a Kinect structured light depth sensor (see section 1.3.3). The capturing software was designed to allow recording subsets of the RGBDNIR image array, e.g. only NIR and RGB, for calibration purposes. It successively triggers all cameras to capture a multi-modal image set of a scene (the NIR camera is triggered four times: once for the dark image and once for each waveband) and also features a self-timer for convenience. Note that the GigE cameras used in this project cannot be triggered in parallel using a single PoE switch, due to packet collision problems. In order to obtain the training and testing datasets, images have been captured in multiple places on the university campus, including offices, hallways, workshops and common rooms. To make transportation easier, the camera rig, laptop, PoE switch and power supplies are fixed on a mobile equipment table. This experimental setup can be seen in figure 33. A total of more than 130 images have been captured, and after sorting out near-duplicates and low-quality images (due to the problems mentioned in section 3.2), 89 images remain for the training and testing set.


C. Additional Figures, Tables and Equations

Figure 34: Images before and after pre-processing. Panel labels: (a) captured NIR, (b) captured RGB, (c) captured depth, (d) pre-processed NIR, (e) pre-processed RGB, (f) pre-processed depth, (g) skin map; (h)–(n) a second set of the same image types.

Figure 35: More examples of images and their filter responses from the first NIR layer of the DNIR model trained for 10-class scene labeling. Panel labels: (a) original image, (b) filter responses, (c) rectified filter responses.


Figure 36: Matches of HOG descriptors between the NIR and RGB images, based on L1 vector distance. Close inspection of the image reveals that although only a subset of the best descriptors is shown, many descriptors have erroneous matches.

Figure 37: Problematic image in the dataset: The class person could not be detected by any of the models, in particular not by the human detection models. A possible reason is the poor lighting in the background caused by the strongly illuminated table in the foreground.


Figure 38: Depiction of the convolutional layers of the CNN for the RGBDNIRS model for one scale of the image pyramid. NIR and RGB images are split into Y’ and CR CB images. The first two layers convolve the input maps without padding, followed by ReLU and max-pooling units. The last layer is a pure convolution layer. The filter banks are fully connected, producing many-dimensional feature maps, i.e. in each column of each bank there are |input| × |output| filters. Thus the convolution layers have over 4 million trainable parameters.


Figure 39: The plot shows training loss and accuracy for DNIRS and RGBDNIRS models for human detection classification task. The loss curve shows a running average of the loss. Furthermore, the standard deviation (SD) with respect to the running average is shown.

Model       Prediction         Pixel accuracy   Class accuracy
NIR         orig. prediction   53.5%            39.2%
NIR         SEED               54.9%            39.5%
NIR         Felzenszwalb       57.3%            42.3%
DNIR        orig. prediction   58.1%            44.4%
DNIR        SEED               59.9%            45.0%
DNIR        Felzenszwalb       61.2%            46.8%
DNIRS       orig. prediction   56.9%            45.9%
DNIRS       SEED               58.5%            46.6%
DNIRS       Felzenszwalb       60.4%            48.4%
RGBNIR      orig. prediction   53.5%            41.4%
RGBNIR      SEED               55.0%            41.6%
RGBNIR      Felzenszwalb       56.6%            42.4%
RGBDNIR     orig. prediction   55.5%            42.8%
RGBDNIR     SEED               56.5%            42.9%
RGBDNIR     Felzenszwalb       57.6%            44.7%
RGBDNIRS    orig. prediction   56.0%            43.9%
RGBDNIRS    SEED               57.5%            44.4%
RGBDNIRS    Felzenszwalb       59.6%            46.7%

Table 6: 10-class classification: Pixel and class accuracies and improvements through class voting for all models. The best performing model is DNIR in terms of pixel accuracy and DNIRS in terms of class accuracy.


Model       person  wall    floor   ceiling  door   window  chair  table  object  artifacts
            (6%)    (30%)   (9%)    (3%)     (4%)   (6%)    (3%)   (7%)   (27%)   (2%)
NIR         27%     65%     78%      9%      17%    36%     20%    49%    60%     33%
DNIR        31%     78%     79%     21%      14%    48%     22%    54%    55%     43%
DNIRS       49%     67%     76%     31%      18%    42%     25%    46%    60%     45%
RGBNIR      42%     66%     56%     11%      21%    39%     15%    49%    59%     54%
RGBDNIR     48%     73%     61%     15%      22%    40%     22%    47%    55%     45%
RGBDNIRS    50%     70%     66%      8%      23%    34%     20%    52%    59%     59%

Table 7: 10-class classification: Per-class accuracies for all models (without class voting). Next to the class names their percentage with respect to the whole dataset is given. The missing 3% belong to the class undefined.

Model       Prediction         Pixel acc.  Class acc.  person (6%)  backgr. (48%)  object (43%)
NIR         orig. prediction   61.8%       47.7%       12.3%        63.7%          66.9%
NIR         SEED               62.8%       47.9%       10.7%        65.0%          68.0%
NIR         Felzenszwalb       63.5%       48.5%       11.0%        66.2%          68.2%
DNIR        orig. prediction   66.6%       53.9%       22.5%        69.8%          69.5%
DNIR        SEED               67.6%       53.9%       19.9%        70.9%          71.0%
DNIR        Felzenszwalb       68.6%       54.7%       20.3%        72.4%          71.4%
DNIRS       orig. prediction   70.6%       69.2%       65.6%        70.9%          71.1%
DNIRS       SEED               71.3%       68.6%       61.8%        71.5%          72.5%
DNIRS       Felzenszwalb       72.4%       70.1%       64.4%        73.5%          72.4%
RGBNIR      orig. prediction   63.4%       56.6%       39.1%        59.6%          71.2%
RGBNIR      SEED               64.0%       56.6%       37.1%        59.3%          73.3%
RGBNIR      Felzenszwalb       66.2%       58.4%       38.2%        62.4%          74.6%
RGBDNIR     orig. prediction   67.2%       60.2%       42.4%        65.6%          72.5%
RGBDNIR     SEED               68.5%       61.0%       41.9%        66.9%          74.2%
RGBDNIR     Felzenszwalb       69.5%       61.4%       40.8%        68.8%          74.6%
RGBDNIRS    orig. prediction   69.0%       67.3%       62.5%        66.7%          72.6%
RGBDNIRS    SEED               70.1%       67.8%       61.4%        67.8%          74.0%
RGBDNIRS    Felzenszwalb       71.8%       69.3%       62.8%        70.4%          74.8%

Table 8: 3-class classification: Pixel, average class and per-class accuracies for all models with class voting. Next to the class names their percentage in terms of the whole dataset is given.


Y’CC conversion for NIR images: A proposal for an adequate separation of luma and “chroma” components for NIR images is given by the following formula:

$$\begin{aligned} Y' &= (Ch_1 + Ch_2 + Ch_3)/3 \\ Chroma_1 &= 127.5 + 0.75\,(Ch_1 - Y') \\ Chroma_2 &= 127.5 + 0.75\,(Ch_3 - Y'), \end{aligned} \qquad (54)$$

where Ch1 , Ch2 , Ch3 are the channels of the multi-channel NIR image. All channels are equally weighted. The formula guarantees values in the range [0, 255], suitable for 8-bit encoding. The term chroma is used in the style of Y’CB CR to describe the relation between channels, although NIR lies in the non-visible part of the spectrum and therefore does not contain “real” color information.
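The conversion of equation (54) translates directly into NumPy. The following sketch assumes the multi-channel NIR image is an 8-bit H × W × 3 array; it is an illustration rather than the pre-processing code of the actual pipeline.

```python
import numpy as np

def nir_to_ycc(nir):
    """Convert an H x W x 3 multi-channel NIR image according to equation (54)."""
    nir = nir.astype(np.float32)
    ch1, ch2, ch3 = nir[..., 0], nir[..., 1], nir[..., 2]
    y = (ch1 + ch2 + ch3) / 3.0                 # equally weighted "luma"
    chroma1 = 127.5 + 0.75 * (ch1 - y)          # relation of channel 1 to the mean
    chroma2 = 127.5 + 0.75 * (ch3 - y)          # relation of channel 3 to the mean
    # For 8-bit input all three components stay within [0, 255] by construction.
    return np.stack([y, chroma1, chroma2], axis=-1).astype(np.uint8)
```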


D. CD Contents

masterthesis mvelte.pdf

This master’s thesis as PDF document

expose mvelte.pdf

The master’s thesis exposé as PDF document.

/source

Folder containing the source code of the capturing software, the pre-processing pipeline and the modified Caffe deep learning framework.

/calib

Folder containing calibration parameter files for the RGBDNIR camera rig.

/dataset

Folder containing all RGBDNIRS images of the training and testing dataset (original and processed).
