International Conference on Current Trends in Computer, Electrical, Electronics and Communication (ICCTCEEC-2017)

Distributed Deep Reinforcement Learning using TensorFlow

P Ajay Rao
Department of Computer Science & Engineering, RVCE, Bengaluru - 59, [email protected]

Navaneesh Kumar B
Department of Computer Science & Engineering, RVCE, Bengaluru - 59, [email protected]

Siddharth Cadabam
Department of Computer Science & Engineering, RVCE, Bengaluru - 59, [email protected]

Praveena T
Assistant Professor, Department of Computer Science & Engineering, RVCE, Bengaluru - 59, [email protected]

Abstract— Deep Reinforcement Learning combines reinforcement learning algorithms with deep neural networks and has had recent success in learning complicated, unknown environments. The trained model is a convolutional neural network trained with the Q-learning loss. The agent takes as input an observation, i.e. a raw pixel image, and the reward from the environment at each step. The deep Q-learning algorithm outputs the optimal action for every observation and reward pair, and the hyperparameters of the Deep Q-Network remain unchanged across environments. TensorFlow, an open-source machine learning and numerical computation library, is used to implement the deep Q-learning algorithm on GPU. The distributed TensorFlow architecture is used to maximize hardware resource utilization and reduce training time, and the use of Graphics Processing Units (GPUs) in the distributed environment accelerates the training of the deep Q-network. On implementing the deep Q-learning algorithm for several environments from OpenAI Gym, the agent outperforms a decent human reference player after a few days of training.

Keywords— Deep Reinforcement Learning, TensorFlow, Deep Q-Networks, Deep Q-Learning, Artificial General Intelligence

I. INTRODUCTION
Reinforcement Learning (RL) is a branch of machine learning that takes its inspiration from behavioural psychology: it is concerned with how software agents take actions in an environment in order to maximize cumulative reward. The environment is formulated as a Markov Decision Process (MDP) [1], and many reinforcement learning algorithms are based on dynamic programming techniques. Reinforcement learning algorithms can be trained to follow a dynamic programming path without actually having a perfect dynamic programming tree. RL is a special form of supervised learning in which no correct input/output pairs can be generated and suboptimal actions are never explicitly corrected. Further, RL is inherently online and must find a balance between exploration (of unknown territory) and exploitation (of existing knowledge). With recent achievements in areas such as deep learning, big data, increased computational power (GPUs) and new algorithmic techniques, the combination of reinforcement learning and Deep Neural Networks (DNN) has been successful.

The reinforcement learning agent receives a reward for each action taken and aims to perform only those actions that maximize the cumulative reward. The learning process can then be cast as a supervised learning problem in which each observation is given as input, the action value of every action is produced as output, and the action with the maximum action value is passed to the environment as the next action. Trained with an appropriate loss function, the network learns to recommend actions with a high degree of accuracy. The loss function is derived from the Q-learning algorithm and is used to train a specific type of network called a Deep Q-Network (DQN) [2]. The training process uses the RMSProp optimizer to minimize this loss.

The network is initialized using Xavier initialization [3]. During training, the network takes random actions with a certain probability so that it explores the dynamic programming state space. This exploration is continued for a large number of actions, after which the agent performs mostly greedy actions. After a large number of episodes, the agent learns to perform on par with human-level control. The same agent is used to learn a large number of environments with no change in the network configuration, leading to a generalized agent for reinforcement learning.
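The exploration strategy described above is an ε-greedy policy with an annealed ε. A minimal sketch is given below; the annealing constants and the helper names are illustrative assumptions, not values reported in this paper.

import random

import numpy as np

# Illustrative constants; the paper does not specify the schedule values.
EPSILON_START, EPSILON_END, ANNEAL_STEPS = 1.0, 0.1, 1_000_000


def epsilon_at(step):
    """Linearly anneal epsilon from EPSILON_START down to EPSILON_END."""
    fraction = min(step / ANNEAL_STEPS, 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)


def select_action(q_values, step):
    """Epsilon-greedy: random action with probability epsilon, otherwise greedy."""
    if random.random() < epsilon_at(step):
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))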

II. RELATED WORK
A number of research works have led to ground-breaking results in deep neural networks and machine learning that help in building an efficient, generalized reinforcement learning agent. The first model for achieving human-level control using deep reinforcement learning [4] was proposed by Volodymyr Mnih, Koray Kavukcuoglu et al.; it presented the first deep learning model to successfully learn control policies directly from raw pixel input. Atari 2600 games from the Arcade Learning Environment [5] were used for this purpose. Hado van Hasselt et al. [6] showed that a specific adaptation of DQN with Double Q-Learning achieves better performance. Bram Bakker [7] presented model-free reinforcement learning using long short-term memory networks. Tanmay Shankar et al. [8] proposed the implementation of reinforcement learning algorithms using recurrent convolutional neural networks. Jeffrey Dean et al. [9] proposed an architecture for large-scale distributed deep networks that scales to production level.



The most recent breakthrough in artificial general intelligence was proposed through a model called PathNet [10], which achieved notable performance using transfer learning.

III. THEORY AND SYSTEM ARCHITECTURE

A. Q-Learning
Q-Learning is a value-based RL algorithm. The state-action value function Q(s, a) represents the maximum discounted future reward obtainable by choosing action a in state s and then behaving optimally for all future actions:

Q(s_t, a_t) = max R_{t+1}

The Q-value of state s and action a can be written in terms of the Q-value of the next state s', as given in (3):

Q(s, a) = r + γ max_a' Q(s', a')        (3)

B. Deep Q-Learning Algorithm
Neural networks are exceptionally good at learning features of complicated data, so the Q-function is represented by a deep neural network that takes the last four game states as input and outputs the Q-value of every possible action. This approach has the advantage that, whether a Q-value update is performed or the action with the highest Q-value is chosen, only a single forward pass through the network is required and the Q-values of all actions are immediately available. The input to the network is the current observation (i.e. raw pixels) from the environment, and the output layer produces the Q-values for the different actions; the number of output units therefore equals the number of actions in the environment.

The network is a classical CNN with three convolutional layers followed by two fully connected (FC) layers, summarized in Table I. The network input is a stack of four 84×84 grayscale game states and the output is one real-valued Q-value per possible action. The Q-values are updated by minimizing the squared error loss given in (4),

L = ½ [ r + γ max_a' Q(s', a') − Q(s, a) ]²        (4)

for every transition <s, a, r, s'>.

C. Experience Replay
Approximating Q-values with a non-linear function such as a neural network is very unstable. The single most important stabilizing technique is experience replay. During play, all experiences <s, a, r, s'> are stored in a replay memory. When training the network, random mini-batches are drawn from the replay memory instead of the most recent transitions. This breaks the similarity of subsequent training samples, which would otherwise tend to drive the network into a local minimum. Experience replay also makes the training task more similar to traditional supervised learning, which simplifies debugging and testing.

The final deep Q-learning algorithm is as follows:

 1: initialize replay memory D
 2: initialize action-value function Q with random weights
 3: observe initial state s
 4: repeat
 5:   select an action a
 6:     with probability ε select a random action
 7:     otherwise select a = argmax_a' Q(s, a')
 8:   carry out action a
 9:   observe reward r and new state s'
10:   store experience <s, a, r, s'> in replay memory D
11:   sample a random mini-batch of transitions <ss, aa, rr, ss'> from D
12:   calculate the target for each mini-batch transition
13:     if ss' is a terminal state then tt = rr
14:     otherwise tt = rr + γ max_a' Q(ss', aa')
15:   train the Q-network using (tt − Q(ss, aa))² as loss
16:   s = s'
17: until terminated
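Steps 1, 10 and 11-14 of this listing (the replay memory and the target computation) can be sketched in Python as follows. This is a minimal sketch; the memory capacity, batch size and discount factor γ are assumed values rather than hyperparameters reported here.

import random
from collections import deque

import numpy as np


class ReplayMemory:
    """Fixed-size buffer of <s, a, r, s', done> transitions (Section III-C)."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones


def q_learning_targets(rewards, dones, next_q_values, gamma=0.99):
    """Targets tt of lines 12-14: tt = rr if ss' is terminal,
    otherwise tt = rr + gamma * max_a' Q(ss', aa')."""
    return np.where(dones, rewards, rewards + gamma * next_q_values.max(axis=1))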

TABLE I. DEEP Q-NETWORK ARCHITECTURE

Layer | Input    | Filter Size | Stride | No. of filters  | Activation | Output
conv1 | 84×84×4  | 8×8         | 4      | 32              | ReLU       | 20×20×32
conv2 | 20×20×32 | 4×4         | 2      | 64              | ReLU       | 9×9×64
conv3 | 9×9×64   | 3×3         | 1      | 64              | ReLU       | 7×7×64
fc1   | 7×7×64   | –           | –      | 512             | ReLU       | 512
fc2   | 512      | –           | –      | no. of actions  | Linear     | no. of actions
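The network of Table I and the loss of (4) can be expressed in TensorFlow roughly as follows (TensorFlow 1.x style API). This is a sketch, not the exact training graph used here: the learning rate is an assumed value and the helper name build_dqn is illustrative. The targets placeholder would be fed with the values tt computed as in the previous sketch, and the returned train_op applies the RMSProp update mentioned in the Introduction.

import tensorflow as tf  # TensorFlow 1.x style API


def build_dqn(n_actions, learning_rate=0.00025):
    # Input: stack of four 84x84 grayscale frames (Table I).
    states = tf.placeholder(tf.float32, [None, 84, 84, 4], name="states")
    actions = tf.placeholder(tf.int32, [None], name="actions")
    targets = tf.placeholder(tf.float32, [None], name="targets")  # tt in (4)

    # tf.layers uses Glorot (Xavier) uniform initialization by default.
    x = tf.layers.conv2d(states, filters=32, kernel_size=8, strides=4,
                         activation=tf.nn.relu, name="conv1")
    x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=2,
                         activation=tf.nn.relu, name="conv2")
    x = tf.layers.conv2d(x, filters=64, kernel_size=3, strides=1,
                         activation=tf.nn.relu, name="conv3")
    x = tf.layers.flatten(x)
    x = tf.layers.dense(x, 512, activation=tf.nn.relu, name="fc1")
    q_values = tf.layers.dense(x, n_actions, name="fc2")  # linear output, one Q-value per action

    # Q(s, a) for the actions actually taken in the mini-batch.
    q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, n_actions), axis=1)

    # Squared-error loss of (4) between the target tt and Q(s, a).
    loss = tf.reduce_mean(0.5 * tf.square(targets - q_taken))
    train_op = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
    return states, actions, targets, q_values, loss, train_op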


D. System Architecture
The environment gives an observation and a reward to the agent at every step. The observation is processed by the DQN, which returns the corresponding action values. The most optimal action is chosen and passed back to the environment, and this process continues until the environment terminates. The weights of the DQN are stored in a parameter server so that a common copy is retained across the replicated training systems. The loss is computed using the deep Q-learning algorithm, from which the weights of the DQN are updated.

Fig. 1. System Architecture for the implementation
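The replicated training set-up with a parameter server can be expressed with the distributed TensorFlow (1.x) API roughly as sketched below. The cluster addresses, job name and task index are illustrative placeholders, not the actual cluster configuration used here.

import tensorflow as tf  # TensorFlow 1.x distributed runtime

# Illustrative two-worker cluster; real host names and ports would differ.
cluster = tf.train.ClusterSpec({
    "ps": ["node0:2222"],                   # parameter server holding the DQN weights
    "worker": ["node1:2222", "node2:2222"]  # replicated training workers
})

# job_name and task_index would normally come from command-line flags.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server only serves variables
else:
    # Variables are placed on the parameter server, ops on the local worker/GPU.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        # ... build the DQN graph and its RMSProp train_op here ...

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0)) as sess:
        pass  # run training steps with sess.run(...)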

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A. TensorFlow
TensorFlow [11], a high-level library from Google Inc., is used for numerical computation. TensorFlow uses a computational graph model, which is efficient and convenient in practice, and is backed by Google and a large community of open-source contributors. It offers multi-GPU support for training, a collection of high-level machine learning and deep learning APIs for faster development, the ability to restore a program from checkpoints, and high-performance, out-of-the-box distribution of computations across GPUs. The computation graph itself can be visualized using TensorBoard, a web interface that helps visualize scalars, histograms, images, audio and distributions of tensors.

Fig. 2. TensorFlow generated graph for DQN

B. OpenAI-gym
OpenAI Gym [12] is an extensive toolkit for developing and comparing reinforcement learning algorithms. All of its environments expose a common interface, which makes it easy to try multiple environments against the same algorithm.
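A minimal sketch of this common interface (the classic Gym API of that period) is shown below; the environment name is one of those listed in Table II, and the random action merely stands in for the DQN policy.

import gym

# Any of the environments in Table II can be substituted here.
env = gym.make("Pong-v0")

observation = env.reset()
done, total_reward = False, 0.0
while not done:
    # A trained agent would choose argmax_a Q(observation, a) here;
    # a random action is used purely to illustrate the interface.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    total_reward += reward
print("total reward per episode:", total_reward)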

C. Hardware and Software Resources
The implementation is done on a cluster in which each node has two Intel Xeon processors (24 cores), 32 GB RAM, 500 GB HDD storage, an NVIDIA Tesla K20 GPU and a 1000 Mbps LAN connection. The software stack runs on Ubuntu and includes Python for the algorithms; HTML, CSS and JavaScript for result visualization; low-level libraries comprising the CUDA Toolkit and NVIDIA cuDNN; and the scientific Python packages TensorFlow, OpenAI Gym, Flask and NumPy.

D. Evaluation Metrics
• Average Loss per Episode: The loss is calculated using the deep Q-learning algorithm. The average loss per episode is used by the RMSProp trainer to train the DQN agent and update the weights; it gradually decreases over time as the agent's performance improves.
• Average Max Q-Value per Episode: The Q-value measures how well the action values for each action are updated over time. With effective training, the Q-value gradually increases.
• Duration per Episode: The number of steps the agent successfully plays in an episode. A good agent can play for as many steps as the environment provides; with effective training the agent collects more reward per episode, and the duration per episode gradually increases.
• Total Reward per Episode: The agent takes the optimal action for the current state to maximize reward. The reward is the environment's measure of the agent's performance. With effective training, the total reward per episode gradually increases.

E. Experimental Results
The implementation of the DQN algorithm is tested on a total of eight environments provided by OpenAI Gym. The summary of the results according to the evaluation metrics is given in Table II.

TABLE II. DQN SUMMARY FOR VARIOUS ENVIRONMENTS

Environment      | Average Max Q-Value / Episode | Duration / Episode | Total Reward / Episode
Assault-v0       | 5.734  | 10100 | 81
Atlantis-v0      | 2.403  | 3365  | 70
BeamRider-v0     | 2.5    | 6500  | 101
Breakout-v0      | 2.134  | 1487  | 26
MsPacman-v0      | 20.29  | 1442  | 114
Pong-v0          | 0.6    | 6275  | 10
Qbert-v0         | 12.9   | 731   | 54
SpaceInvaders-v0 | 6.167  | 1737  | 51

a. Maximum values of the evaluation metrics


TensorBoard outputs continuous, interactive graphs for each of the evaluation metrics. Example graphs for the environment Pong-v0 are shown for Average Loss per Episode (Fig. 3), Average Max-Q Value per Episode (Fig. 4), Duration per Episode (Fig. 5) and Total Reward per Episode (Fig. 6).
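These per-episode scalars are written to TensorBoard with the TF 1.x summary API; a minimal sketch is given below, where the log directory and placeholder names are arbitrary examples rather than the actual logging code of the implementation.

import tensorflow as tf  # TensorFlow 1.x summary API

# Scalar placeholders for the per-episode evaluation metrics of Section IV-D.
avg_loss = tf.placeholder(tf.float32, name="average_loss_per_episode")
avg_max_q = tf.placeholder(tf.float32, name="average_max_q_per_episode")
duration = tf.placeholder(tf.float32, name="duration_per_episode")
total_reward = tf.placeholder(tf.float32, name="total_reward_per_episode")

summaries = tf.summary.merge([
    tf.summary.scalar("average_loss_per_episode", avg_loss),
    tf.summary.scalar("average_max_q_per_episode", avg_max_q),
    tf.summary.scalar("duration_per_episode", duration),
    tf.summary.scalar("total_reward_per_episode", total_reward),
])
writer = tf.summary.FileWriter("./logs")  # log directory is an arbitrary example

# After each episode, inside a tf.Session `sess`:
# writer.add_summary(sess.run(summaries, feed_dict={...}), global_step=episode)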

Fig. 3. Breakout-v0 Average loss per episode
Fig. 4. Breakout-v0 Average Max-Q value per episode
Fig. 5. Breakout-v0 Duration per episode
Fig. 6. Breakout-v0 Total reward per episode

V. CONCLUSION
The implementation of the deep Q-learning algorithm worked successfully on small environments with a limited number of dynamic programming states. The results and analysis from the project show that the deep Q-learning algorithm can be used to generalize the working of autonomous systems in the real world by learning the environment. Out of eight environments, the agent successfully learned the state space of seven. The performance of the agent depended entirely on the training process and the network hyperparameters. The use of GPUs accelerated the implementation and training by tenfold. The implementation uses multiple distributed agents, which can be trained simultaneously; this is very useful, as the agent converges to the best action value in a very short time.

REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2016, pp. 47-74.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. A. Riedmiller, "Playing Atari with Deep Reinforcement Learning," CoRR, vol. abs/1312.5602, 2013.
[3] D. Silver and J. Heinrich, "Deep Reinforcement Learning from Self-Play in Imperfect-Information Games," CoRR, vol. abs/1603.01121, 2016.
[4] V. Mnih, K. Kavukcuoglu and D. Silver, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, February 2015.
[5] M. G. Bellemare, Y. Naddaf, J. Veness and M. Bowling, "The Arcade Learning Environment: An Evaluation Platform for General Agents," CoRR, vol. abs/1207.4708, 2012.
[6] H. van Hasselt, A. Guez and D. Silver, "Deep Reinforcement Learning with Double Q-learning," CoRR, vol. abs/1509.06461, 2015.
[7] B. Bakker, "Reinforcement Learning with Long Short-Term Memory," in NIPS, MIT Press, 2002, pp. 1475-1482.
[8] T. Shankar, S. K. Dwivedy and P. Guha, "Reinforcement Learning via Recurrent Convolutional Neural Networks," CoRR, vol. abs/1701.02392, 2017.
[9] J. Dean and G. S. Corrado, "Large Scale Distributed Deep Networks," Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1223-1231, 2012.
[10] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel and D. Wierstra, "PathNet: Evolution Channels Gradient Descent in Super Neural Networks," CoRR, vol. abs/1701.08734, 2017.
[11] J. Dean, "TensorFlow: A System for Large-Scale Machine Learning," Google Brain, 2015. [Online]. Available: www.tensorflow.org. [Accessed 9 April 2017].
[12] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, "OpenAI Gym," OpenAI, 5 June 2016. [Online]. Available: gym.openai.com. [Accessed 9 April 2017].
