
An Efficient and Effective Convolutional Neural Network for Visual Pattern Recognition

Liew Shan Sung
Supervisor: Prof. Dr. Mohamed Khalil Mohd. Hani
Co-supervisor (Former): Dr. Rabia Bakhteri

June 15, 2016


Outline

1 Introduction
  Introduction
  Problem Statement
  Research Objectives
  Scope of Work
  Research Methodology

2 Proposed Models and Learning Algorithms
  Proposed Models
  Proposed Learning Algorithms

3 Results and Analysis
  Results of Proposed Convolutional Neural Network Models
  Results of Proposed Learning Algorithm and Its Distributed Computing Implementation

4 Conclusion
  Conclusion and Contributions
  Future Work
  Publications


Introduction: Artificial Neural Networks

An artificial neural network (ANN) is a bio-inspired mathematical model that can approximate functions.

Figure: A multilayer perceptron (MLP) with a single hidden layer (input, hidden, and output layers with fan-in and fan-out connections).

Typically trained in a supervised manner [3] using the gradient descent (GD) method [7].
Shortcomings:
  Poor generalization ability [8].
  Unsuitable for interpreting multi-dimensional data [10].


Introduction: Convolutional Neural Networks

A convolutional neural network (CNN) is a variant of ANNs that is optimized for visual recognition tasks [9]. It incorporates feature extraction, dimensionality reduction, and classification into a single trainable model.

Figure: The LeNet-5 CNN architecture [9]: Input 1@32×32 → C1 6@28×28 (5×5 convolutions) → S1 6@14×14 (2×2 pooling) → C2 16@10×10 (5×5 convolutions) → S2 16@5×5 (2×2 pooling) → C3 128@1×1 (5×5 convolutions) → F4 84@1×1 (full connection) → F5 10@1×1 (full connection).

Advantages over conventional ANNs [9]:
  Weight sharing minimizes the number of parameters.
  Requires minimal preprocessing of raw inputs.


Introduction: Deep Learning and Distributed Machine Learning

Deep learning (DL)
  A subset of machine learning (ML) techniques that learn hierarchical representations of data in deep-layered model architectures [14].
  Most common DL model: the deep neural network (DNN) [120].
  CNNs are well-established and show great potential.

Issue
  Training deeper models on large datasets is extremely computationally expensive [22]. This leads to the concept of distributed ML.

Distributed ML
  Distributes the training process across multiple machines in a parallel computing platform to achieve parallelism speedup [19].
  Distributed versions of learning algorithms have been developed, mostly first order methods [19, 25, 28].


Problem Statement

Outstanding issues with current models and learning algorithms:

CNN model (architecture)
  1. Slow computations in the convolutional layer due to kernel weight flipping.
  2. Gradient diffusion problem and training instability due to inappropriate activation functions.

Learning algorithm
  3. Slow convergence of conventional algorithms (i.e. first order).
  4. Difficulty of tuning many hyperparameters.

Distributed machine learning
  5. Training inefficiency of conventional distributed learning algorithms.
  6. Little discussion on mapping a learning algorithm for distributed computing models.


Problem Statement 1: Convolutional Layer

Research issue
  Propagations through a convolutional layer require flipping of the kernel weights [32], which slows down the computation of a CNN model.

Figure: (a) The original kernel, and (b) convolution with the flipped kernel.

Previous related works:
  Some works that report performing convolutions actually use cross-correlations instead [34-38].
  The effect of weight flipping on the CNN learning performance remains an open question.
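As a concrete illustration of the difference, the following minimal C++ snippet (illustrative only, not code from the thesis) computes one output element of a 3×3 filtering both ways: cross-correlation indexes the kernel directly, while convolution reads it with flipped indices, as in the figure above.

```cpp
#include <cstdio>

// One output element of a 3x3 "valid" filtering over a toy 5x5 image, computed
// as a cross-correlation (kernel read directly) and as a convolution (kernel
// read with flipped indices).
int main() {
    const int K = 3;
    double img[5][5];
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c)
            img[r][c] = r * 5 + c;
    const double ker[K][K] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};

    const int x = 0, y = 0;   // top-left output position
    double conv = 0.0, corr = 0.0;
    for (int u = 0; u < K; ++u) {
        for (int v = 0; v < K; ++v) {
            corr += img[y + u][x + v] * ker[u][v];                   // no flipping
            conv += img[y + u][x + v] * ker[K - 1 - u][K - 1 - v];   // flipped kernel
        }
    }
    std::printf("cross-correlation = %.1f, convolution = %.1f\n", corr, conv);
    return 0;
}
```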


Problem Statement 2: Activation Function

Research issue
  Common (bounded) activation functions: gradient diffusion problem [45].
  Other (unbounded) activation functions: training instability [46].

Figure: Activation and gradient curves of the activation functions: (a) logistic (logsig), (b) rectified linear unit (ReLU/relu) [47], and (c) bi-firing (bifire) [46].


Problem Statement 2: Activation Function (Cont’d)

Previous related works to reduce the training (numerical) instability:
  Weight regularization [46].
  Input and output normalization [60, 88].
  Modification of the loss function [91, 92].
  Gradient clipping [96, 97].
  Proper selection of hyperparameter values [99].

Issues with these approaches:
  More computations are required in the training process.
  Few in-depth studies on the impact of an activation function on training stability.


Problem Statement 3: Learning Algorithm

Research issue
  Common first order learning algorithms suffer from slow convergence.

First order methods (e.g. stochastic gradient descent (SGD)):
  Simple and effective.
  Have a higher chance of reaching poor local minima, especially for ill-conditioned problems [8, 49].

Research issue
  Most second order methods are computationally expensive.

Second order methods (e.g. the Levenberg-Marquardt algorithm (LMA)):
  Converge faster than first order methods.
  Require inversion of the Hessian matrix, which is compute-intensive [49].


Problem Statement 4: Learning Algorithm (Cont’d)

Research issue
  Some learning algorithms have many hyperparameters that must be tuned manually [52], making it more difficult to find a good solution.

Previous related works:

Table: List of known supervised learning algorithms for training NN models.

  Learning algorithm                             Total hyperparameters
  AdaGrad [107]                                  >1
  AdaDelta [55]                                  2
  BGD with learning rate schedule [48]           >2
  SGD with learning rate schedule [48]           >2
  Bold driver [106]                              3
  Rprop [108]                                    5
  iRprop [7]                                     5
  GD with adaptive step size controller [109]    >5
  AdaDec [48]                                    6
  BFGS [51]                                      >1
  L-BFGS [111]                                   >1
  Layer-specific SDLM [112]                      2
  SDLM [9]                                       3
  B-SDLM (this work)                             1


Problem Statement 5: Distributed Learning Algorithm

Most distributed learning algorithms are based on first order methods [45, 49].
  Issue: They converge slowly.

Some other algorithms are effective in training deep models.
  Issue: They are computationally expensive [19, 55].

Previous related works:

Table: List of known recent previous works on distributed supervised learning algorithms.

  Work                    Year  Learning algorithm         Characteristic
  Dettmers et al. [169]   2015  Mini-batch SGD             First order method
  Shokri et al. [172]     2015  Distributed selective SGD  First order method
  Heigold et al. [28]     2014  A-SGD                      First order method
  Kumar et al. [30]       2014  Distributed SGD            First order method
  Li et al. [173]         2014  EMSO-GD                    First order method
  Zhang et al. [174]      2014  Elastic averaging SGD      First order method
  Dean et al. [19]        2012  Downpour SGD               First order method
  Zeiler [55]             2012  AdaDelta                   First order method
  Zinkevich et al. [171]  2010  Parallel SGD               First order method
  Agarwal et al. [53]     2014  SGD + L-BFGS               Second order method
  Dean et al. [19]        2012  SandBlaster L-BFGS         Second order method
  Suri et al. [54]        2002  Parallel LMA               Second order method
  This work               2016  Distributed B-SDLM         Efficient and effective


Problem Statement 6: Mapping of Learning Algorithm

Research issue
  There are inadequate discussions in the literature on the process of mapping a learning algorithm for parallel computation [30, 56, 57]. This leaves open questions on the mapping considerations:
    Types of parallelism.
    Thread models.
    Communication between physical machines.
    Task scheduling and synchronization mechanisms.


Research Objectives

1. To propose an efficient convolutional neural network (CNN) model
   Consists of convolutional layers with correlation filtering and bounded activation functions.
   Faster computation, improved generalization performance, and better training stability.

2. To develop an effective stochastic second order learning algorithm, i.e. the bounded stochastic diagonal Levenberg-Marquardt (B-SDLM) algorithm
   Converges faster and is computationally efficient.
   Alleviates the hyperparameter overfitting problem.

3. To propose a distributed second order learning algorithm
   Converges faster and better than common distributed first order learning algorithms.
   Includes a systematic methodology for mapping the proposed learning algorithm for parallel computation.


Scope of Work

C/C++ based software implementation
  Pthreads library
  MPICH library

Platforms
  Quad-core CPU computing platform.
  CPU cluster with four of the aforementioned platforms.

Case studies (visual pattern recognition)
  MNIST: basic handwritten digit classification.
  mnist-rot-bg-img: complex handwritten digit classification.
  AR Purdue: face recognition.

Figure: Examples of the handwritten digit images in the mnist-rot-bg-img database.


Research Methodology

1. Develop baseline DNN and CNN models, as well as a training procedure with SGD as the baseline learning algorithm.
2. Propose an improved CNN model that has convolutional layers with correlation filtering and bounded activation functions.
3. Derive an efficient B-SDLM learning algorithm to train the NN models effectively.
4. Implement the proposed distributed B-SDLM learning algorithm to achieve fast parallelism speedup.

Figure: General approach taken in the research work.


Proposed Model: Convolutional Layer with Correlation Filtering

Problem 1: Weight flipping slows down the CNN computation.

Proposed solution 1: Convolutional layer with correlation filtering
  Replace convolutions with cross-correlations to eliminate the weight flipping operation.

Original (convolution, where $W_{ji}^{f(l)}$ denotes the flipped kernel weights):

$$Y_j^{(l)}(x, y) = f\!\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) W_{ji}^{f(l)}(u, v) + B_j^{(l)} \right) \tag{1}$$

Proposed (cross-correlation, no flipping):

$$Y_j^{(l)}(x, y) = f\!\left( \sum_{i \in N^{(l-1)}} \sum_{u \in K_x^{(l)}} \sum_{v \in K_y^{(l)}} Y_i^{(l-1)}\!\left(S_x^{(l)} x + u,\; S_y^{(l)} y + v\right) W_{ji}^{(l)}(u, v) + B_j^{(l)} \right) \tag{2}$$

Contribution 1: The proposed convolutional layer achieves faster execution speed and better learning performance.
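The forward pass of Eq. (2) can be sketched in C++ as a plain nested loop over one input/output map pair. This is an illustrative sketch only (assuming row-major flat storage and a tanh activation), not the thesis implementation; a full layer would additionally loop over all connected input maps and accumulate into each output map.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of Eq. (2): forward propagation through a convolutional layer that
// uses cross-correlation, so the kernel is indexed directly (no flipping).
struct Map2D {
    int w, h;                    // width and height
    std::vector<double> data;    // row-major storage
    double at(int x, int y) const { return data[static_cast<std::size_t>(y) * w + x]; }
    double& at(int x, int y) { return data[static_cast<std::size_t>(y) * w + x]; }
};

Map2D correlate_forward(const Map2D& in, const Map2D& ker,
                        double bias, int sx = 1, int sy = 1) {
    Map2D out;
    out.w = (in.w - ker.w) / sx + 1;   // "valid" output size
    out.h = (in.h - ker.h) / sy + 1;
    out.data.assign(static_cast<std::size_t>(out.w) * out.h, 0.0);
    for (int y = 0; y < out.h; ++y) {
        for (int x = 0; x < out.w; ++x) {
            double sum = bias;
            for (int u = 0; u < ker.w; ++u)
                for (int v = 0; v < ker.h; ++v)
                    // Y_i(Sx*x + u, Sy*y + v) * W_ji(u, v): no index reversal.
                    sum += in.at(sx * x + u, sy * y + v) * ker.at(u, v);
            out.at(x, y) = std::tanh(sum);   // f(.): any activation; tanh here
        }
    }
    return out;
}
```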


Proposed Model: Bounded Activation Functions

Problem 2: Training instability due to unbounded outputs.

Proposed solution 2: Derive new activation functions with a bounded output range based on the universal approximation theorem (UAT) [71].

ReLU (relu) [47]:
$$f(x) = \max(0, x) \tag{3}$$

Bounded ReLU (brelu):
$$f(x) = \min(\max(0, x), A) = \begin{cases} 0 & x \le 0 \\ x & 0 < x \le A \\ A & x > A \end{cases} \tag{4}$$

Leaky ReLU (lrelu) [83]:
$$f(x) = \begin{cases} x & x > 0 \\ 0.01x & \text{otherwise} \end{cases} \tag{5}$$

Bounded leaky ReLU (blrelu):
$$f(x) = \begin{cases} 0.01x & x \le 0 \\ x & 0 < x \le A \\ 0.01(x - A) + A & x > A \end{cases} \tag{6}$$

Bi-firing (bifire) [46]:
$$f(x) = \begin{cases} -x - \frac{A}{2} & x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & x > A \end{cases} \tag{7}$$

Bounded bi-firing (bbifire):
$$f(x) = \begin{cases} B & x < -B - \frac{A}{2} \\ -x - \frac{A}{2} & -B - \frac{A}{2} \le x < -A \\ \frac{x^2}{2A} & -A \le x \le A \\ x - \frac{A}{2} & A < x \le B + \frac{A}{2} \\ B & x > B + \frac{A}{2} \end{cases} \tag{8}$$
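The piecewise forms above map directly onto code. Below is a minimal C++ sketch of the three proposed bounded functions, with A and B as the bounding hyperparameters of Eqs. (4), (6), and (8); it follows the reconstructions given above (in particular the assumed leaky branch of blrelu for x > A) and is illustrative rather than the thesis implementation.

```cpp
#include <algorithm>

// Sketch of the proposed bounded activation functions; A and B are the
// bounding hyperparameters of Eqs. (4), (6), and (8).

// Bounded ReLU (Eq. (4)): ReLU clamped to [0, A].
double brelu(double x, double A) {
    return std::min(std::max(0.0, x), A);
}

// Bounded leaky ReLU (Eq. (6), as reconstructed above): small slope 0.01
// outside the linear region [0, A] so the gradient never vanishes entirely.
double blrelu(double x, double A) {
    if (x <= 0.0) return 0.01 * x;
    if (x <= A)   return x;
    return 0.01 * (x - A) + A;   // assumed leaky saturation branch for x > A
}

// Bounded bi-firing (Eq. (8)): the bi-firing "valley" of Eq. (7) saturated
// at B on both sides once |x| exceeds B + A/2.
double bbifire(double x, double A, double B) {
    if (x < -B - A / 2.0) return B;
    if (x < -A)           return -x - A / 2.0;
    if (x <= A)           return x * x / (2.0 * A);
    if (x <= B + A / 2.0) return x - A / 2.0;
    return B;
}
```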


Proposed Model: Bounded Activation Functions (Cont’d)

Figure: Activation and gradient curves of the proposed activation functions: (a) bounded ReLU (brelu), (b) bounded leaky ReLU (blrelu), and (c) bounded bi-firing (bbifire).

Can alleviate training instability without additional techniques that impose more computation.

Contribution 2: Bounded activation functions improve the generalization performance and training stability of an NN model.


Proposed Learning Algorithm: Bounded SDLM (B-SDLM)

Problem 3: Second order methods are generally computationally expensive due to the Hessian calculation.

Stochastic diagonal Levenberg-Marquardt (SDLM) [9]
  More efficient than most second order methods.
  Diagonal Hessian estimation with the running average method [9]:

$$\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle_{(t+1)} = \gamma \left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle_{(t)} + (1 - \gamma) \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right|_{(t)} \tag{9}$$

Issue: Still imposes certain computational overhead.

Proposed solution 3: Perform simple averaging of the estimated Hessians over a small subset of the training set instead of the running average:

$$\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle = \frac{1}{M_H} \sum_{m=1}^{M_H} \left. \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right|_{(m)} \tag{10}$$

Simpler computation and fewer hyperparameters.
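In code, the simple averaging of Eq. (10) is just a division of the accumulated per-sample estimates by M_H, with no decay factor γ to tune. A minimal sketch (illustrative names; the second order backward pass that produces the per-sample estimates is not shown):

```cpp
#include <cstddef>
#include <vector>

// Sketch of Eq. (10): simple averaging of per-sample diagonal Hessian estimates.
// hessSum[k] holds the sum of d2E/dW_k^2 accumulated over the M_H samples of
// the Hessian estimation subset.
std::vector<double> average_hessian(const std::vector<double>& hessSum, int M_H) {
    std::vector<double> hAvg(hessSum.size());
    for (std::size_t k = 0; k < hessSum.size(); ++k)
        hAvg[k] = hessSum[k] / static_cast<double>(M_H);
    return hAvg;
}
```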


Proposed Learning Algorithm: B-SDLM (Cont’d)

Problem 4: Hyperparameter overfitting problem due to many hyperparameters to be tuned manually.

$$\eta_{W_{ji}^{(l)}} = \frac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle + \mu} \tag{11}$$

Layer-specific SDLM (L-SDLM) [112]
  Removes the hyperparameter µ, but adds more computation.

Proposed solution 4: Replace µ with a boundary condition that serves the same purpose of ensuring the learning stability:

$$\eta_{W_{ji}^{(l)}} = \begin{cases} \dfrac{\eta}{\left\langle \frac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle} & \text{if } \left\langle \dfrac{\partial^2 E}{\partial W_{ji}^{(l)\,2}} \right\rangle > 1 \\[2ex] \eta & \text{otherwise} \end{cases} \tag{12}$$

Consists of only a single hyperparameter.
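The boundary condition in Eq. (12) caps every per-weight learning rate at the global rate η: the division by the curvature estimate is applied only when that estimate exceeds 1. A minimal sketch, assuming the averaged Hessian vector from Eq. (10) (illustrative names, not the thesis code):

```cpp
#include <cstddef>
#include <vector>

// Sketch of Eq. (12): per-weight learning rates bounded by the global rate eta.
// hAvg holds the averaged diagonal Hessian estimates from Eq. (10); the
// boundary condition replaces the regularization constant mu of plain SDLM.
std::vector<double> bounded_learning_rates(const std::vector<double>& hAvg, double eta) {
    std::vector<double> lr(hAvg.size());
    for (std::size_t k = 0; k < hAvg.size(); ++k)
        lr[k] = (hAvg[k] > 1.0) ? eta / hAvg[k] : eta;
    return lr;
}
```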


Proposed Learning Algorithm: B-SDLM (Cont’d)

  // Initialization stage
  initialize all weights W(l) and biases B(l)
  t = 0
  repeat
      shuffle the M training samples
      // Hessian estimation stage
      if t mod tupdt = 0 then
          for m = 1 to MH do
              forward propagation
              second order backward propagation
              accumulate Hessians
          end for
          calculate average Hessians using Eq. (10)    <= Proposed solution 3
          calculate learning rates using Eq. (12)      <= Proposed solution 4
      end if
      // Training stage
      for m = 1 to M do
          forward propagation
          first order backward propagation
          if m mod MB = 0 then
              weight update using Eq. (13)
          end if
      end for
      calculate average loss value for the training samples
      // Testing stage
      for m = 1 to MT do
          forward propagation
      end for
      calculate average loss value for the testing samples
      t = t + 1
  until E < ε or t > tmax


Proposed Learning Algorithm: B-SDLM (Cont’d)

Supports the mini-batch learning mode:

$$W_{ji}^{(l)} = W_{ji}^{(l)} - \eta_{W_{ji}^{(l)}} \left( \frac{1}{M_B} \sum_{m=1}^{M_B} \left. \frac{\partial E}{\partial W_{ji}^{(l)}} \right|_{m} \right) \tag{13}$$

Contribution 3: B-SDLM achieves fast and better convergence while having minimal computational overhead compared to SGD.
  Computational complexity overhead over SGD: $O\!\left(h M \log_{t_{updt}} t_{max}\right)$

Contribution 4: B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.
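For completeness, a matching sketch of the mini-batch update in Eq. (13), where gradients accumulated over M_B samples are averaged and scaled by the per-weight learning rates of Eq. (12) (illustrative names, not the thesis code):

```cpp
#include <cstddef>
#include <vector>

// Sketch of Eq. (13): mini-batch weight update with per-weight learning rates.
// gradSum[k] accumulates dE/dW_k over the last M_B training samples and is
// reset after every update.
void update_weights(std::vector<double>& W, std::vector<double>& gradSum,
                    const std::vector<double>& lr, int M_B) {
    for (std::size_t k = 0; k < W.size(); ++k) {
        W[k] -= lr[k] * (gradSum[k] / static_cast<double>(M_B));
        gradSum[k] = 0.0;
    }
}
```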


Proposed Learning Algorithm: Distributed B-SDLM

Problem 5: Most distributed learning algorithms are based on first order methods that converge slowly.

Proposed solution 5: Propose a distributed learning algorithm based on B-SDLM (a stochastic second order method).

Problem 6: Inadequate discussions in the literature on the mapping of a learning algorithm for distributed computing models.

Proposed solution 6: Formulate a systematic methodology based on the existing general approach of mapping an algorithm for parallel computation.


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

General methodology of mapping algorithms for parallel computation:

Figure: Phases or layers of implementing an algorithm in software or hardware for parallel computations [179] (Layer 5: application; Layer 4: algorithm design; Layer 3: parallelization and scheduling; Layer 2: coding with VLSI tools or concurrency platforms; Layer 1: custom hardware or software implementation).

Layer 5: Application phase
  Aim: To accelerate the NN training through parallelism.


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

Layer 4: Algorithmic development phase

Table: List of tasks in the training procedure using B-SDLM.

  Stage               Node  Task
  Initialization      0     Calculate fan-in and fan-out
                      1     Initialize weights
                      2     Shuffle dataset
  Hessian estimation  3     Fetch data for Hessian estimation
                      4     Forward propagation
                      5     Second order backward propagation
                      6     Accumulate Hessian
                      7     Calculate average Hessian
                      8     Calculate learning rates
  Training            9     Fetch data for gradient computation
                      10    Forward propagation
                      11    Calculate error
                      12    Accumulate misclassification error
                      13    First order backward propagation
                      14    Update weights
                      15    Calculate accuracy
  Testing             16    Fetch data for testing
                      17    Forward propagation
                      18    Calculate error
                      19    Accumulate misclassification error
                      20    Calculate accuracy

Figure: DCG of the B-SDLM algorithm.


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

Layer 3: Parallelization and scheduling phase

Parallelization

Figure: DCG of the proposed distributed B-SDLM algorithm.

  Hessian estimation and testing stages: parallel processing of data batches.
  Training stage: asynchronous weight updates with a parameter server [19].
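The asynchronous training stage can be sketched with standard C++ threads: each worker repeatedly fetches the latest weights from a parameter server, computes a gradient on its own mini-batch (stubbed out below), and pushes it back; the server applies the update immediately under a lock, so workers never wait for one another. This is a simplified shared-memory illustration of the scheme, not the thesis' Pthreads code; the function names loosely mirror those in the sequence diagram (fetchModelParam(), pushGradient()).

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Simplified shared-memory parameter server for asynchronous updates.
class ParameterServer {
public:
    ParameterServer(std::size_t n, double eta) : weights_(n, 0.0), lr_(n, eta) {}

    // fetchModelParam(): return a snapshot of the current weights.
    std::vector<double> fetch() {
        std::lock_guard<std::mutex> lock(m_);
        return weights_;
    }
    // pushGradient() + updateParam(): apply one asynchronous update.
    void push_gradient(const std::vector<double>& g) {
        std::lock_guard<std::mutex> lock(m_);
        for (std::size_t k = 0; k < weights_.size(); ++k)
            weights_[k] -= lr_[k] * g[k];
    }

private:
    std::mutex m_;
    std::vector<double> weights_;
    std::vector<double> lr_;   // per-weight learning rates, e.g. from Eq. (12)
};

// Stub: a real worker would run forward/backward propagation on its mini-batch.
std::vector<double> compute_gradient(const std::vector<double>& w, int /*worker_id*/) {
    return std::vector<double>(w.size(), 0.01);
}

int main() {
    ParameterServer ps(100, 0.001);
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id) {
        workers.emplace_back([&ps, id] {
            for (int step = 0; step < 1000; ++step) {
                std::vector<double> w = ps.fetch();          // no barrier between workers
                ps.push_gradient(compute_gradient(w, id));   // asynchronous update
            }
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```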


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

Scheduling and synchronization

Figure: Sequence diagram of the proposed distributed B-SDLM algorithm, showing the interactions between the parameter server, the shared memory, and a model replica (e.g. fetchModelParam(), pushGradient(), updateParam()).

UML sequence diagram [180]
  Depicts the interactions between a parameter server and a worker.


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

Layer 2: Coding phase

Concurrency platforms:
  1. Shared-memory multiprocessor system: Pthreads implementation
  2. Distributed-memory multiprocessor system: MPICH implementation

Parameter server thread model:

Figure: The parameter server thread model for the (a) Pthreads and (b) MPICH implementations.
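On the distributed-memory side, the same parameter server pattern maps onto MPI point-to-point messages: rank 0 holds the weights and answers gradient messages from any worker with the updated parameter vector. The following MPICH-style skeleton is an illustrative sketch under simplifying assumptions (a fixed number of pushes per worker, a constant stub gradient), not the thesis implementation.

```cpp
#include <mpi.h>
#include <vector>

// Minimal MPI parameter-server skeleton: rank 0 serves the weights, the other
// ranks act as workers (model replicas) that push gradients and receive the
// updated weight vector in return.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 100;        // number of parameters (illustrative)
    const int STEPS = 1000;   // gradient pushes per worker (illustrative)
    const double eta = 0.001; // global learning rate (illustrative)

    if (rank == 0) {          // parameter server
        std::vector<double> w(N, 0.0), g(N);
        int remaining = (size - 1) * STEPS;
        MPI_Status st;
        while (remaining-- > 0) {
            // Accept a gradient from whichever worker arrives first.
            MPI_Recv(g.data(), N, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            for (int k = 0; k < N; ++k) w[k] -= eta * g[k];   // apply the update
            // Reply with the freshly updated weights.
            MPI_Send(w.data(), N, MPI_DOUBLE, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
    } else {                  // worker (model replica)
        std::vector<double> w(N, 0.0), g(N, 0.01);   // gradient stubbed as a constant
        for (int step = 0; step < STEPS; ++step) {
            MPI_Send(g.data(), N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);                    // push gradient
            MPI_Recv(w.data(), N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // fetch weights
        }
    }
    MPI_Finalize();
    return 0;
}
```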


Proposed Learning Algorithm: Distributed B-SDLM (Cont’d)

Layer 1: Implementation phase

Implementation platforms:
  1. Pthreads implementation: a quad-core CPU computing platform.
  2. MPICH implementation: a CPU cluster of four of the aforementioned platforms.

Contribution 5: Training using the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup.

Contribution 6: A systematic methodology for mapping a learning algorithm onto parallel computing platforms is presented.


Results: Convolutional Layer with Correlation Filtering

  Faster execution speed (up to 1.4× faster).
  Better learning performance (up to 14.5% improvement in terms of testing misclassification error rates (MCRs)).

Table: MCRs and average execution time for CNN models composed of convolutional layers with different weight flipping modes. Mode 0 denotes the proposed convolutional layer.

  Dataset                 Flipping  MCR (%)            Average time   Computation time
                          mode      Training  Testing  per epoch (s)  overhead (%)
  MNIST [9]               2         0.01      1.17     35.82          26.44
                          1         0.01      1.19     29.36          3.64
                          0         0.01      1.02     28.33          -
  mnist-rot-bg-img [194]  2         11.69     26.20    26.99          35.56
                          1         13.03     26.87    21.98          10.40
                          0         5.72      22.97    19.91          -
  AR Purdue [195]         2         12.85     21.00    9.43           11.33
                          1         7.30      21.83    9.17           8.26
                          0         11.85     19.67    8.47           -

Weight flipping modes:
  0: no flipping (proposed).
  1: flipping of the kernel's indices.
  2: flipping of the kernel's values.


Results: Bounded Activation Functions - Benchmarking

Perform comparably well or even better than other activation functions (with bbifire achieving a slightly higher testing MCR than bifire).

Table: Testing MCRs of the MLPs with different activation functions on the mnist-rot-bg-img dataset. brelu, blrelu, and bbifire are the proposed functions.

  Activation function  Testing MCR (%)
  logsig [46]          62.64
  relu [46]            51.98
  brelu                50.09
  lrelu                50.52
  blrelu               49.88
  bifire [46]          48.75
  bbifire              48.97


Results: Bounded Activation Functions - Comparative Analysis

The proposed bounded functions outperform their original forms in terms of the testing MCRs.

Table: Testing MCRs (%) of the CNN models using different activation functions. brelu, blrelu, and bbifire are the proposed functions.

  Activation  MNIST              mnist-rot-bg-img    AR Purdue
  function    MSE   Softmax+CE   MSE    Softmax+CE   MSE    Softmax+CE
  relu        0.97  1.02         25.22  22.97        20.50  19.67
  brelu       0.94  0.92         23.60  22.59        17.33  16.50
  lrelu       0.95  1.06         25.70  25.79        16.67  7.67
  blrelu      0.96  0.96         24.44  23.42        4.17   7.00
  bifire      0.89  1.04         25.02  25.61        N/A    7.33
  bbifire     0.90  0.86         23.09  24.05        7.17   6.67


Results: Bounded Activation Functions - Training Stability

The bounded functions improve the training stability significantly compared to their unbounded counterparts.

Table: Probability of numerical instability for the CNN models using different activation functions on three different datasets. brelu, blrelu, and bbifire are the proposed functions.

  Activation  Probability of numerical instability (%)
  function    MNIST              mnist-rot-bg-img    AR Purdue
              MSE    Softmax+CE  MSE    Softmax+CE   MSE     Softmax+CE
  relu        0.00   40.00       0.00   0.00         0.00    0.00
  brelu       0.00   8.57        0.00   0.00         0.00    0.00
  lrelu       60.00  40.00       60.00  60.00        80.00   0.00
  blrelu      51.43  8.57        25.71  0.00         31.43   0.00
  bifire      95.00  93.33       92.50  0.00         100.00  45.00
  bbifire     26.67  53.33       0.00   0.00         0.00    0.00

Probability of numerical instability: the proportion of experimental settings for an activation function that result in numerical instability.


Results: B-SDLM - Benchmarking

  MNIST: lower testing MCR.
  AR Purdue: can classify all testing samples correctly.

Table: Benchmarking of learning algorithms with previous existing works on the MNIST dataset.

  Work                 Year  Learning algorithm  Testing MCR (%)
  Schaul et al. [113]  2013  vSGD-g              3.65
                             vSGD-l              2.16
                             SGD                 2.15
                             vSGD-b              2.05
  Zeiler [55]          2012  AdaDelta            2.00
  This work            2016  B-SDLM              1.98

Table: Benchmarking of face recognition using the AR Purdue face dataset.

  Work                         Year  Approach                      Accuracy (%)
  Roli et al. [198]            2006  Semi-supervised PCA           85.33
  Rose [197]                   2006  Gabor and log-Gabor filters   89.00
  Song et al. [199]            2007  Parameterized direct LDA      90.00
  Patel et al. [200]           2012  Dictionary-based recognition  93.70
  Jiang et al. [201]           2011  K-SVD                         97.80
  Syafeeza et al. (SDLM) [65]  2014  CNN                           99.50
  This work (B-SDLM)           2016  CNN                           100.00


Results: B-SDLM - Comparisons among Learning Algorithms

  Consistently outperforms other learning algorithms on all three datasets.
  Faster than other SDLM variants.

Table: MCRs and average execution time per epoch for various learning algorithms on the three datasets.

  Dataset           Learning algorithm  MCR (%)            Average time
                                        Training  Testing  per epoch (s)
  MNIST             SGD                 0.01      1.12     25.69
                    SDLM                0.01      1.08     34.13
                    L-SDLM              0.01      1.04     33.11
                    B-SDLM (this work)  0.01      0.90     28.40
  mnist-rot-bg-img  SGD                 13.49     25.61    20.19
                    SDLM                1.21      23.52    21.42
                    L-SDLM              5.72      22.97    19.91
                    B-SDLM (this work)  0.38      21.14    19.12
  AR Purdue         SGD                 16.95     24.67    8.34
                    SDLM                15.80     26.00    8.75
                    L-SDLM              11.85     19.67    8.47
                    B-SDLM (this work)  4.65      15.83    8.40


Results: Distributed B-SDLM - Learning Convergence

Outperforms the other distributed learning algorithms on both datasets.

Figure: Testing MCRs (best misclassification errors on the testing set) versus the total number of workers for various distributed learning algorithms (SGD, SDLM, L-SDLM, B-SDLM) based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img in the Pthreads implementation.


Results: Distributed B-SDLM - Parallelism Speedup

Execution time decreases significantly as more workers are assigned to the gradient computations.

Figure: Average execution time of a single training epoch versus the total number of workers for various distributed learning algorithms (SGD, SDLM, L-SDLM, B-SDLM) based on the parameter server thread model, on (a) MNIST and (b) mnist-rot-bg-img in the Pthreads implementation.


Results: Distributed B-SDLM - Convergence Rate

  Achieves the fixed accuracies faster than the distributed SGD algorithm.
  Compared to sequential NN training:
    MNIST: 5.5× faster to reach 98% testing accuracy.
    mnist-rot-bg-img: 4.3× faster to reach 75% testing accuracy.

Figure: Time taken to reach a fixed classification accuracy on the testing set versus the total number of workers for various distributed learning algorithms (SGD, SDLM, L-SDLM, B-SDLM) based on the parameter server thread model: (a) 98% on the MNIST dataset, and (b) 75% on the mnist-rot-bg-img dataset, in the Pthreads implementation.


Results: Distributed B-SDLM - Towards a Larger Computing Platform

Compared to sequential NN training:
  6× and 12.3× faster to reach training and testing loss values of 0.01 and 0.08, respectively.
  5.7× faster to reach 99% training accuracy.
  8.7× faster to reach 98% testing accuracy.

Figure: Time taken to reach a certain (a) loss value (0.01 training, 0.08 testing) and (b) classification accuracy (99% training, 98% testing) versus the total number of workers on the MNIST dataset, when training with batch size = 16 in the MPICH implementation.


Conclusion and Contributions

In response to Objective 1:
  1. The convolutional layer with correlation filtering achieves faster execution speed (up to 1.4× faster) and better learning performance (up to 14.5% improvement).
  2. The bounded activation functions improve the generalization performance and training stability of an NN model significantly (with training instability being eliminated in some cases).

In response to Objective 2:
  3. B-SDLM achieves fast and better convergence (up to 19.6% improvement) while having minimal computational overhead compared to SGD.
  4. B-SDLM alleviates the hyperparameter overfitting problem by having only a single hyperparameter.

In response to Objective 3:
  5. Training using the distributed B-SDLM learning algorithm achieves fast and scalable parallelism speedup (up to 12.3× faster to reach a certain loss value).
  6. A systematic methodology for mapping a learning algorithm onto parallel computing platforms is presented.


Future Work

  More features can be extracted by replacing convolutions with a trainable function to learn deeper representations.
  The hyperparameters of the bounded activation functions can be made trainable.
  Hyperparameter optimization is a viable approach to deal with the hyperparameter overfitting issue.
  Combining model and data parallelism is a promising approach to achieve further training speedup.
  The learning algorithm can be mapped onto larger scale computing platforms to expand its DL capability for big data applications.


Publications

Journals
  1. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. Neurocomputing. 2016. vol. 186, 74-89. (ISI, IF 2.083 (Q2)).
  2. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Bounded Activation Functions for Enhanced Training Stability of Deep Neural Networks on Visual Pattern Recognition Problems. Neurocomputing. 2016. (ISI, IF 2.083 (Q2)). Under revision.
  3. Liew, S. S., Khalil-Hani, M., Syafeeza, A. and Bakhteri, R. Gender Classification: A Convolutional Neural Network Approach. Turk. J. Elec. Engin. 2016. vol. 24. 1248-1264. (ISI, IF 0.407 (Q4)).

Conferences
  4. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Distributed Learning on Multi-Core Platform for Neural Network in Visual Pattern Recognition. 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2016). 2016. (Scopus). In review process.
  5. Liew, S. S., Khalil-Hani, M., and Bakhteri, R. Distributed B-SDLM: Accelerating the Training Convergence of Deep Neural Networks through Parallelism. 14th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2016). 2016. (Scopus).
  6. Khalil-Hani, M., Liew, S. S. and Bakhteri, R. An Optimized Second Order Stochastic Learning Algorithm for Neural Network Training. Arik, S., Huang, T., Lai, W. K. and Liu, Q., eds. Neural Information Processing. Springer International Publishing. 2015, Lecture Notes in Computer Science, vol. 9489. ISBN 978-3-319-26531-5. 38-45. (Scopus).
  7. Khalil-Hani, M. and Liew, S. S. A-SDLM: An Asynchronous Stochastic Learning Algorithm for Fast Distributed Learning. Javadi, B. and Garg, S., eds. 13th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2015). Sydney, Australia: ACS. 2015, CRPIT, vol. 163. 75-84. (Scopus).
  8. Khalil-Hani, M. and Liew, S. S. A Convolutional Neural Network Approach for Face Verification. High Performance Computing Simulation (HPCS), 2014 International Conference on. 2014. 707-714. (Scopus).


Publications (Cont’d)

Others
  9. Syafeeza, A. R., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Networks with Fused Layers Applied to Face Recognition. International Journal of Computational Intelligence and Applications, 2015. 14(03): 1550014. (Scopus).
  10. Syafeeza, A., Khalil-Hani, M., Liew, S. S. and Bakhteri, R. Convolutional Neural Network for Face Recognition with Pose and Illumination Variation. International Journal of Engineering and Technology, 2014. 6(1): 44-57. ISSN 0975-4024. (Scopus).