Under review as a conference paper at ICLR 2016

ROBUST CONVOLUTIONAL NEURAL NETWORKS UNDER ADVERSARIAL NOISE

arXiv:1511.06306v1 [cs.LG] 19 Nov 2015

Jonghoon Jin, Aysegul Dundar and Eugenio Culurciello
Electrical and Computer Engineering, Purdue University
Biomedical Engineering, Purdue University
{jhjin,adundar,euge}@purdue.edu

ABSTRACT

Recent studies have shown that Convolutional Neural Networks (CNNs) are vulnerable to small input perturbations called “adversarial examples”. In this work, we propose a new feedforward CNN that improves robustness in the presence of adversarial noise. Our model adds stochastic additive noise to the input image and to the CNN model itself. The proposed model operates in conjunction with a CNN trained with the standard backpropagation algorithm. In particular, the convolution, max-pooling, and ReLU layers are modified to benefit from the noise model. Our model is parameterized by only a mean and variance per pixel, which simplifies computations and makes our method scalable to deep architectures. The proposed model outperforms the standard CNN by 13.12% on ImageNet and 7.37% on CIFAR-10 under adversarial noise, at the expense of a 0.28% accuracy drop on the original dataset with no added noise.

1 INTRODUCTION

Convolutional neural networks (CNNs) (LeCun et al., 1998) have shown great success in visual and semantic understanding. They have been applied to solve visual recognition problems where hard-to-describe objects or multiple semantic concepts are present in images. CNNs provide a unified framework for vision tasks without hand-engineering and have achieved ground-breaking results on ImageNet (Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan & Zisserman, 2014) and face identification (Schroff et al., 2015), as well as on temporal data when combined with sequence-generating models (Hochreiter & Schmidhuber, 1997).

Given the widespread global use of cameras on mobile phones, CNNs are already a candidate to perform categorization of user photos. Device manufacturers use various types of cameras, each with very different sensor noise statistics, as shown in figure 1 (Tian, 2000). Also, recent phone cameras can record video at hundreds of frames per second, where more frames per second translates into higher image noise (Tian, 2000). Unfortunately, CNNs are vulnerable to artificial noise and can be fooled by perturbing just a few pixels (Szegedy et al., 2013; Goodfellow et al., 2014; Nguyen et al., 2014). Moreover, the noise distribution of the original dataset can differ from the noise of newer image sensors. This problem arises because standard CNNs are discriminative models and do not explicitly take into account the probabilistic nature of the input. This work provides a solution that mitigates this instability so that CNNs can be applied to practical problems, for example security applications.

The main contribution of this work is a robust feedforward CNN model under adversarial noise, the type of noise that affects performance the most. The goal is to provide high classification accuracy in an environment where images are obtained under various noise conditions. To achieve this, we add stochasticity to the CNN model under the assumption that the perturbation can be seen as a sample drawn from white Gaussian noise. We tested the proposed model on the CIFAR-10 and ImageNet datasets in the presence of adversarial noise. Without loss of generality, our model relies on a parametric formulation, which makes it possible to scale our method up and apply it to a large-scale dataset such as ImageNet. The proposed model can be combined with other methods such as model averaging or multi-view voting (Krizhevsky et al., 2012; Sermanet et al., 2013) to make the output more reliable.


Figure 1: Examples of different noise profiles from cameras or camera settings. (a) Noise from different camera sensors: the left photo was taken with six different iPhone generations¹. (b) Noise from different frames per second: the right photo was taken with a Nexus 6P at 60, 120 and 240 frames per second (from left to right).

2 RELATED WORK

There are two main approaches related to this work: one is a generative approach and the other is an ensemble approach.

The generative approach considers an entire distribution derived from a prior distribution instead of a single example. Though this approach integrates knowledge over a posterior and does not suffer from overfitting, the algorithm is computationally expensive and thus not scalable to very large systems. In neural networks, Bayesian learning has been applied as a way of model averaging (Neal, 1995). The stochastic feedforward neural network proposed by Tang & Salakhutdinov (2013) uses a hybrid structure whose hidden layers consist of both deterministic and stochastic variables. The combination of the two types of neurons is trained using the standard backpropagation algorithm, and the mixture of such neurons can approximate multimodal distributions, which gives more flexibility in inferring the model distribution. However, training stochastic feedforward networks is difficult, as confirmed in the follow-up study by Raiko et al. (2014). In the CNN literature, generative aspects of CNNs were investigated by Dai & Wu (2014): generative pre-training followed by discriminative gradient refining of standard CNNs improves accuracy on the ImageNet test set. The stochastic pooling method (Zeiler & Fergus, 2013), selectively applied to layers in CNNs, replaces deterministic pooling with a stochastic procedure in the training phase; backpropagation then proceeds through the network according to a multimodal distribution over the pooling region. Both methods are compatible with the generic CNN framework and have been successfully applied to large-scale models.

Other studies have applied regularization methods to improve classification accuracy or stabilize prediction. Ensemble methods are commonly used in the literature; they remove biases and high variability from a single prediction by synthesizing results from different models. For example, the dropout regularizer can be considered an equally-weighted average of exponentially many models (Hinton et al., 2012). Averaging probabilities from multiple models with different initializations (Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan & Zisserman, 2014) is now also a standard technique to increase accuracy. In addition to the model ensemble, data ensembles by multi-view voting (Krizhevsky et al., 2012) are used, where image patches are sampled from different scales or locations with strides (Sermanet et al., 2013). Goodfellow et al. (2014) used adversarial perturbations during training to regularize models, although the improvement was marginal. In contrast to training with adversarial examples, our proposed method only alters the feedforward model.

¹ http://petapixel.com/2013/05/14/comparing-the-quality-of-iphone-cameras-over-the-years/



Figure 2: Behavior of the ReLU and max-pooling layers with an example of two stochastic input random variables (the panels plot probability mass/density against activation value). All input distributions are assumed to be normally distributed. (a) Rectified linear unit (ReLU) layer: the area Φ(·) under the red curve over (−∞, θ] is left-censored and reported as a probability mass of Y at the threshold θ. (b) Max-pooling layer: the red and blue curves illustrate the exact distribution P(Y) of the max of two input random variables X1, X2 centered at µ_X1, µ_X2, and its normal approximation P(Ŷ) calculated from equation 6.
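
Equation 6, referenced in the caption, lies outside this excerpt (the max-pooling subsection is truncated). As a hedged stand-in, the sketch below computes a first-two-moment normal approximation of max(X1, X2) for independent Gaussian inputs using the classic Clark-style formulas, which is the kind of approximation the caption describes; whether this matches the paper's exact equation 6 is an assumption, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def max_gaussian_approx(mu1, var1, mu2, var2):
    """Moment-matched normal approximation of Y = max(X1, X2) for independent
    Gaussians X1 ~ N(mu1, var1), X2 ~ N(mu2, var2) (Clark-style formulas;
    assumed here, since the paper's equation 6 is not part of this excerpt)."""
    a = np.sqrt(var1 + var2) + 1e-12              # std. dev. of X1 - X2
    alpha = (mu1 - mu2) / a
    mean = mu1 * norm.cdf(alpha) + mu2 * norm.cdf(-alpha) + a * norm.pdf(alpha)
    second = ((mu1 ** 2 + var1) * norm.cdf(alpha)
              + (mu2 ** 2 + var2) * norm.cdf(-alpha)
              + (mu1 + mu2) * a * norm.pdf(alpha))
    return mean, second - mean ** 2               # approximate mean and variance

# Example matching figure 2b: two unit-variance inputs centered at different means
mu_yhat, var_yhat = max_gaussian_approx(0.0, 1.0, 1.5, 1.0)
print(mu_yhat, var_yhat)
```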

3 CONVOLUTIONAL NEURAL NETWORKS WITH NOISE MODEL

Our method of stochastic input modeling is designed to make CNNs robust to adversarial examples. The proposed feedforward model uses a noise distribution applied to each pixel. The following subsections explain the stochastic model of each layer, including convolution, pooling, and nonlinear activation, which are used in standard CNNs.

3.1 INPUT NOISE MODEL

Let X be a single value of the input image at an arbitrary location in R³ (channel, height, width). Since CNNs are deterministic models with no underlying probability assumptions, each X_ijk is fixed to a raw pixel value, which we denote as a constant µ_Xijk. Decision regions in discriminative CNN models are locally linear in high-dimensional space and occupy a larger area than the population of training samples. We add uncertainty to the input images so that the output vector in hyperspace becomes a cloud carrying marginal information. We hypothesize that referring to this marginal data during classification helps CNNs to be robust toward adversarial examples. We limit the scope of adversarial examples to indirect encodings (Nguyen et al., 2014), more specifically natural images with artificial noise (Goodfellow et al., 2014), i.e. samples near a decision boundary.

We then model an additive noise N with zero mean and noise power σ_N² at each pixel location. As a result, the input is modeled as a normal distribution with conditional independence among neighboring pixels. To clarify, we adopt this artificial noise distribution in order to improve the robustness of CNNs; it is unrelated to natural image statistics. Although Gaussian models may appear unsuitable for modeling edge information, they have been shown to give comparable results in graphical models while keeping the model tractable (Bouman et al., 1995). Each pixel becomes a random variable X that follows a normal distribution with the mean of the original pixel value µ_Xijk and a constant variance σ_N²:

$$X_{ijk} \triangleq \mu_{X_{ijk}} + N \;\Longrightarrow\; X_{ijk} \sim \mathcal{N}(\mu_{X_{ijk}}, \sigma_N^2) \tag{1}$$

where all input pixels have the same noise power σ_N². The conditional independence among pixels in a neighborhood helps us simplify the model and makes it scalable to deep networks.
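
To make the parameterization concrete, the sketch below shows one way the input model of equation 1 could be represented in code: each pixel is carried forward as a (mean, variance) pair, with the mean set to the raw pixel value and a shared variance σ_N². The function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def to_stochastic_input(image, sigma_n):
    """Wrap a raw image of shape (C, H, W) into the Gaussian input model of equation 1.

    Each pixel X_ijk is treated as N(mu_ijk, sigma_n^2), where mu_ijk is the observed
    pixel value and sigma_n^2 is a single noise power shared by all pixels.
    """
    mean = image.astype(np.float64)            # mu_X: the raw pixel values
    var = np.full_like(mean, sigma_n ** 2)     # sigma_N^2: constant per pixel
    return mean, var

# Example: a 3x32x32 CIFAR-like image with an assumed noise power of 0.1^2
image = np.random.rand(3, 32, 32)
mu_x, var_x = to_stochastic_input(image, sigma_n=0.1)
```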



3.2 CONVOLUTION LAYER

While the CNN inputs are modeled as random variables, all remaining CNN parameters, such as weights and biases, are deterministic. Convolution is a weighted sum of random variables, and the first- and second-order moments of the convolution layer output are given in equation 2:

$$E[Y] = \sum \omega\, E[X] + b, \qquad \mathrm{Var}[Y] = \sum \omega^2\, \mathrm{Var}[X] \tag{2}$$

X and Y correspond to a single element of the input and output of the convolution layer, respectively; ω and b are the weights and biases, and the pixel indices i, j and k are omitted for conciseness. We are interested in the first- and second-order statistics because we want to keep a parametric model throughout the layers, which simplifies the overall computation. In the very first convolution layer, the input distribution of each pixel is exactly Gaussian. Therefore the resulting distribution is also Gaussian, with mean and variance determined by the linear combination of independent random variables:

$$Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2), \qquad \mu_Y = E[Y], \quad \sigma_Y^2 = \mathrm{Var}[Y] \tag{3}$$

where µ_Y and σ_Y denote the mean and standard deviation of the output random variable Y. For the following convolution layers, we no longer have the normality assumption, since the pooling and non-linear activation layers change the distribution to a non-parametric one. However, considering that the number of connections for a single neuron in a convolution layer is often greater than 512, the summation of a sufficiently large number of independent random variables is approximately Gaussian by the central limit theorem. In a preliminary experiment with 20 randomly generated variables, this approximation yields a negligible error (less than 1% of the mean or variance) compared to the exact calculation. When projected to practical CNNs with a large number of connections, equation 3 therefore remains valid for the rest of the convolution layers.
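
As a hedged illustration of equation 2, the single-channel sketch below propagates the per-pixel mean and variance through a convolution: the mean passes through the ordinary convolution with the weights, while the variance passes through a convolution with the squared weights (inputs assumed independent). scipy's correlate2d is used only for brevity; the paper does not prescribe an implementation, and the function names here are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_moments(mu_x, var_x, w, b):
    """Propagate mean/variance through a single-channel convolution (equation 2).

    E[Y]   = sum_i w_i * E[X_i] + b
    Var[Y] = sum_i w_i^2 * Var[X_i]   (independence assumed among inputs)
    """
    mu_y = correlate2d(mu_x, w, mode="valid") + b
    var_y = correlate2d(var_x, w ** 2, mode="valid")
    return mu_y, var_y

# Example: a 5x5 kernel applied to the mean/variance maps of the input noise model
mu_x = np.random.rand(32, 32)           # mean map (raw pixel values)
var_x = np.full_like(mu_x, 0.1 ** 2)    # constant noise power per pixel
w = np.random.randn(5, 5) * 0.1
mu_y, var_y = conv_moments(mu_x, var_x, w, b=0.0)
```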

3.3 RECTIFIED LINEAR UNIT LAYER

Rectified linear units (ReLU) (Krizhevsky et al., 2012) apply a non-linearity to the decision function, which makes CNNs more discriminative. With the notation of input X and output Y the same as in the convolution layer, the module computes the operation Y = max(X, θ) in an element-wise manner; it replaces negative activations with a threshold θ (or zero). In other words, the distribution of the ReLU output Y is left-censored: Y = X for X > θ and is otherwise reported as the single value θ. The mean and variance (Greene, 2008) of the output Y for a given normal distribution of the input X are:

$$E[Y] = E[Y \mid X \le \theta]\,\Pr(X \le \theta) + E[Y \mid X > \theta]\,\Pr(X > \theta) = \theta\,\Phi(\alpha) + \big(\mu_X + \sigma_X \lambda(\alpha)\big)\big(1 - \Phi(\alpha)\big) \tag{4}$$

$$\mathrm{Var}[Y] = E_X\big[\mathrm{Var}[Y \mid X]\big] + \mathrm{Var}_X\big[E[Y \mid X]\big] = \sigma_X^2\,\big(1 - \Phi(\alpha)\big)\Big[\big(1 - \delta(\alpha)\big) + \big(\alpha - \lambda(\alpha)\big)^2\,\Phi(\alpha)\Big]$$

where δ(α) = λ(α)(λ(α) − α), λ(α) = φ(α)/(1 − Φ(α)), the standard score α = (θ − µ_X)/σ_X, the standard normal density φ and the cumulative normal distribution Φ are used. Note that the denominator of λ needs to be regularized with a small ε to avoid division by zero. We still hold the normality condition for the input since the layer preceding the ReLU is either a convolution layer or a max-pooling layer (section 3.4), whose output can be approximated by a Gaussian. The example in figure 2a illustrates the input and output distributions of a single neuron in the ReLU layer, where the stem at θ indicates the point mass probability of the deactivated area with respect to the threshold θ. The output of the ReLU layer is a censored distribution whose shape differs from a Gaussian; therefore, approximating the distribution causes error during the feedforward computation.

The ReLU does not propagate negative activations to the next layer, and thus it controls the flow of critical information to be considered in prediction. The critical information comes not from a single neuron but from multiple neurons, and the dominating features among them predict the class category.

However, the non-linearity could filter out necessary information by mistake due to noise or variation in an object's representation. The hard thresholding completely removes this information from the pipeline and leaves no room for reconsideration in standard feedforward CNNs. In such a scenario, the ReLU with the stochastic input model delivers the information present in the right tail of the distribution regardless of the neuron's activation. This information is later taken into account for decision making in the higher layers of the CNN, and it therefore reduces the risk of misclassification. Compared to the output value in standard CNNs, the mean of the output distribution tends to increase by incorporating the tail information. The proposed model is equivalent to a CNN with deterministic input as the variance of the input approaches zero:

$$\lim_{\sigma_X^2 \to 0} E[Y] = \max(\theta, \mu_X), \qquad \lim_{\sigma_X^2 \to 0} \mathrm{Var}[Y] = 0$$
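
To make equation 4 concrete, the sketch below computes the censored-Gaussian output moments of a ReLU with threshold θ for a Gaussian input; the epsilon guard mirrors the regularization of λ's denominator mentioned above. In the σ_X² → 0 limit the returned mean approaches max(θ, µ_X) and the variance approaches 0, matching the limit stated above. Function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def relu_moments(mu_x, var_x, theta=0.0, eps=1e-12):
    """Mean and variance of Y = max(X, theta) for X ~ N(mu_x, var_x) (equation 4)."""
    sigma = np.sqrt(var_x)
    alpha = (theta - mu_x) / (sigma + eps)       # standard score of the threshold
    Phi = norm.cdf(alpha)                        # P(X <= theta): the censored mass
    lam = norm.pdf(alpha) / (1.0 - Phi + eps)    # inverse Mills ratio lambda(alpha)
    delta = lam * (lam - alpha)
    mean = theta * Phi + (mu_x + sigma * lam) * (1.0 - Phi)
    var = var_x * (1.0 - Phi) * ((1.0 - delta) + (alpha - lam) ** 2 * Phi)
    return mean, var

# Sanity checks: a standard-normal input with theta=0 gives mean ~0.3989, var ~0.3408;
# a tiny input variance recovers the deterministic ReLU output max(theta, mu_x).
print(relu_moments(0.0, 1.0))        # ~ (0.3989, 0.3408)
print(relu_moments(0.7, 1e-12))      # ~ (0.7, 0.0)
```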