REINFORCEMENT LEARNING FOR COORDINATED REACTIVE CONTROL

Larry D. Pyeatt    Adele E. Howe
Computer Science Department
Colorado State University
Fort Collins, CO 80523
email: {pyeatt,howe}@cs.colostate.edu
URL: http://www.cs.colostate.edu/~{pyeatt,howe}
tele: 1-970-491-7589    fax: 1-970-491-2466

ABSTRACT

The demands of rapid response and the complexity of many environments make it difficult to decompose, tune and coordinate reactive behaviors while ensuring consistency. Reinforcement learning networks can address the tuning problem, but do not address the problem of decomposition and coordination. We hypothesize that interacting reactions can often be decomposed into separate control tasks resident in separate networks and that the interaction can be coordinated through the tuning mechanism and a higher level controller. To explore these issues, we have implemented a reinforcement learning architecture as the reactive component of a two-layer control system for a simulated race car. By varying the architecture, we test whether decomposing reactivity into separate controllers leads to superior overall performance and learning convergence in our domain.

I. INTRODUCTION

Reactive systems have been favored for applications requiring quick responses, such as robotic or process control applications. The lack of deliberation means that while the action may not be perfect, at least something is done. One difficulty with such systems is ensuring consistency or coordination between reactions. Consistency and coordination are necessary for discouraging self-defeating behaviors and encouraging progress towards achieving the agent's goals. (This research was supported in part by ARPA-AFOSR contract F30602-93-C-0100 and by NSF Research Initiation Award #RIA IRI930857. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation herein.)

The reinforcement learning approach is desirable for control because of its automated training regimen and quick responses. The problem is that training these networks to produce particular behaviors is a difficult and time consuming process. Moreover, due to the indirect nature of the training, it may be difficult to achieve exactly the desired behavior. Consequently, considerable care must be devoted to identifying appropriate behaviors for each component and coordinating their interactions. Some autonomous robot control systems use a hierarchical approach in which the lower levels contain short term reactive mechanisms, while the higher levels are responsible for long term planning and goal attainment [15, 8]. This division allows the robot to react quickly to changes in its environment while it works to achieve long term objectives. The strength of purely reactive systems lies in dealing with unexpected changes in the environment, but they are not very effective at pursuing long term goals. Pure planning systems are good at finding efficient ways to achieve their goals, but have difficulty dealing with even small changes in their environment. The motivation behind the two-level approach is to exploit the benefits of both reaction and planning while avoiding their individual weaknesses. Taking the advantages of a hybrid system as given, this research addresses the specific question of how reactivity might be incorporated: in particular, should reactivity be divided into separate controllers or incorporated in a monolithic structure, and how can reactivity be coordinated in a two-layer control system? To explore these questions, we have implemented and tested three alternative two-layer control systems for a simulated robot race car (the RARS simulator [20]). The reactive component, which controls steering and acceleration, is implemented as a set of reinforcement learning networks.

The networks are coordinated through the credit assignment mechanism and, at a higher level, through a heuristic coordination mechanism. The heuristic coordination mechanism switches control between the neural networks and strategies for behaviors not addressed by the networks. At present, this mechanism is crude, but it serves as a placeholder in the architecture for the later addition of a planning component.

II. INTEGRATING REACTIVE AND DELIBERATIVE CONTROL

A variety of approaches have been adopted to coordinate reactivity: purely symbolic, pure neural network, or mixed. Cypress [21], PLASTYC [5] and Phoenix [10] are examples of purely symbolic approaches to two-level control of reactivity. Cypress implements a model in which the planner is responsible for only high level operation. The planner creates abstract plans that are given to an execution module to fill in details and execute. PLASTYC uses a blackboard-based planner and a collection of individual reactive modules. PLASTYC does not support direct interaction between the planning and reactive mechanisms, but combines the desired behaviors generated by each just prior to their being executed. Phoenix agents incorporate reactions, called reflexes, which change effector settings in response to sensory events; reflexes are activated and deactivated by the planner in the context of plan actions.

Pure neural network approaches have been used for control of autonomous robots for several years. Krishnaswamy [7] demonstrated a structured neural network approach to controlling a robotic manipulator. Liu [14] used neural networks to control grasping of a robotic hand. Comoglio [4] presented results of simulations for the control of an autonomous underwater vehicle using neural networks. The adaptability and noise rejection of a neural based robot controller was studied by Poo [16], who found that the neural based controller performed better than a standard model based control algorithm. Bullock [3] demonstrated a neural network system for positioning a robotic manipulator with various tools attached and under various hardware failure conditions.

Mixed systems typically use neural networks for low level control and symbolic planning for high level coordination. For example, Handelman [9] used a rule based system to train a neural network and to control a robotic system during training. Shavlik [17] proposed a method for encoding symbolic processing in a neural network structure. In Knick's system [11], a mobile robot is equipped with both symbolic and sub-symbolic processing. The sub-symbolic system learns and translates collections of information into new concepts that are integrated into the world model of the symbolic system, thereby allowing the robot to learn about its world and improve its performance over time.

III. SIMULATED ENVIRONMENT FOR EXPERIMENTATION

Simulation provides an alternative to the mechanical problems, high costs and long turnaround times associated with currently available robot technology. In response to concerns about realism, simulators are becoming increasingly detailed and sophisticated, such as the one described by Feiten [8] that attempts to accurately model robotic sensors and effectors. Dorigo [6] showed that reactive behaviors could successfully be transferred from a simulator to a physical robot.

As an initial testbed, our research uses the Robot Automobile Racing Simulator (RARS) [20]. This simulator is designed to allow different agents to compete in automobile races. It is complex enough to be interesting, while simple enough that all the variables can be examined and/or controlled. The knowledge we gain in this simple simulator can be applied to more complex simulators. For more advanced studies, we will use a modified version of the SRI Erratic simulator [12], which provides a reasonably detailed environment for a single autonomous agent.

The primary user interface to RARS consists of a rich set of command line options. The user controls the number of agents participating in the race and selects which agents to use, which track to use, how many laps in a race, how many races to run, and whether or not the physical model is deterministic. RARS provides the user with a real time display of race statistics, such as the maximum and average speeds for each car in the race, and the relative positions of the cars. Figure 1 shows the simulator during a race. RARS can simulate between 1 and 16 different agents simultaneously and can display the race in real time, or can be called to run the race at faster than real time, with or without the display. Each agent is implemented as a subroutine that is compiled into RARS. To achieve real-time simulation, each agent subroutine is required to execute within a limited time.

The exact time depends upon the speed of the processor and the size of the time slice used during simulation. RARS runs through the simulation in a fixed lockstep. At each time step, RARS updates the position and velocity of each car and then calls the subroutine for each agent in the race. As shown in Figure 2 for two agents, each agent routine receives a structure containing information about its position, current velocity, distance to the end of the current track segment, curvature of the track, and relative position and velocity of nearby cars. The agent routine calculates command signals for acceleration and steering, which are returned to RARS for updating and continuing the simulation.

FIGURE 1. RARS: the robot auto racing simulator.

FIGURE 2. Interaction between RARS and agents. (RARS supplies each agent with its velocity, position and track curvature; each agent returns steering and acceleration commands.)

Several agents are included in the standard RARS distribution. These agents were contributed by programmers from all over the world; they provide the competition for our test designs and were used to judge the relative performance of our system. All of the agents distributed with the version used for this study are heuristic and not capable of learning. However, most of them exhibit extremely good performance. Our goal was to develop an agent that could learn to perform about as well as the average distribution agent.
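Each RARS agent is actually a C/C++ subroutine compiled into the simulator, and the paper does not reproduce the interface itself. Purely to illustrate the per-time-step exchange described above, the following sketch uses hypothetical names and field types of our own choosing; it is not RARS's actual API.

    from dataclasses import dataclass, field

    @dataclass
    class Situation:
        # Illustrative stand-ins for the structure RARS passes to an agent.
        velocity: float            # current speed of the car
        to_end: float              # distance to the end of the current track segment
        curvature: float           # curvature of the current track segment
        nearby_cars: list = field(default_factory=list)  # relative position/velocity of nearby cars

    @dataclass
    class Command:
        steering: float            # steering adjustment, in radians
        acceleration: float        # requested speed change, in ft/sec

    def agent_step(s: Situation) -> Command:
        """One control decision per simulation time step."""
        # A real agent computes these from s; the constants are placeholders.
        return Command(steering=0.0, acceleration=12.0)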

IV. AGENT ARCHITECTURE

The environment, as reflected in the required control signals, naturally subdivides the agent's activity into two reactive control tasks: accelerating and steering. This arrangement requires the agent to make decisions about two complex interacting tasks at each time step. Thus, this environment leaves little question about which controllers to include in the architecture, but how to combine and coordinate them is an open question.

FIGURE 3. Overview of the agent architecture. (A high level heuristic coordinator enables one of three low level behaviors: a reinforcement learning racing behavior, a heuristic failure recovery behavior, and a heuristic passing behavior. The behaviors send control signals to and receive sensory information from the RARS simulator.)

As shown in Figure 3, our basic agent architecture consists of one reinforcement learning low level behavior, some heuristic low level behaviors, and a higher level coordinating mechanism. The reinforcement learning networks are similar to those used by Barto, Sutton and Watkins [2], Anderson [1] and Lin [13]. The coordinating level is a simple set of heuristics that are responsible for ensuring that the car moves around the track. The heuristic low level behaviors perform non-routine recovery behaviors. For example, if the car leaves the track, the low level reinforcement learning behaviors are disabled temporarily while a low level heuristic behavior pilots the car back onto the track. The heuristic behavior returns the car to a safe state and ensures that progress will be made in the correct direction on the next time step. The heuristic behavior is also given control whenever the car is moving in the wrong direction around the track, so that the car can be properly oriented when the neural network regains control.

A. Reinforcement learning of low level behaviors

A major benefit of reinforcement learning is that it requires only a scalar reinforcement signal as feedback from the environment in order to learn to control the system. Our system uses the Adaptive Heuristic Critic (AHC) architecture [1]. This reinforcement learning method has been shown to converge reliably on other control tasks. AHC divides learning into two subtasks: constructing an evaluation network that estimates the utility of each world state, and learning which actions result in world states with the highest utility. To accomplish the two subtasks, an AHC reinforcement learning network uses two feed-forward neural networks with the back-propagation learning algorithm and a stochastic action selector (see Figure 4). The upper network (called the utility network) learns to predict the utility of each world state, performing temporal credit assignment [18] by using temporal difference methods [19]. The output of the utility network trains the lower network (called the action network), which learns to select the actions that lead to world states with higher utility. The stochastic action selector is a mechanism that forces the action network to explore the space of possible actions [13] by occasionally choosing an action other than the one selected by the action network. Stochastic action selection occasionally results in actions that lead to world states with increased utility, and the action network then learns to perform those actions in similar situations. Without stochastic action selection, the network is more likely to learn a less general control strategy.

The utility network uses temporal difference methods, which are the standard back-propagation learning algorithm with one modification. In standard back-propagation, the error term for the output node is calculated by subtracting the actual output of the network from the desired output, and the learning algorithm then attempts to minimize that error. For temporal differencing, the error term is calculated as

    e_{t-1} = \gamma o_t - o_{t-1} + r    (1)

where o_t is the output of the network at time t, o_{t-1} is the output of the network at the previous time step, r is the reinforcement signal, and \gamma is a scaling parameter between 0 and 1. The inputs from time t-1 are used to train the network.

The action network also uses temporal difference methods. The idea is to increase the probability of choosing actions that lead to world states with higher utilities, while decreasing the probability of choosing actions that lead to lower utility world states. For the action that was chosen in the previous time step, the error term is e_{t-1} as calculated in Equation 1. For all other actions, the error term is -e_{t-1}. As for the utility network, the inputs from time t-1 are used for training the action network.
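As a concrete illustration of these error terms, the sketch below computes the utility-network error of Equation 1 and the corresponding action-network errors. The function name and the value of gamma are our own; the paper does not report the gamma it used.

    import numpy as np

    def td_errors(o_prev, o_now, r, chosen_action, n_actions, gamma=0.9):
        """Temporal-difference error terms for the AHC networks.

        o_prev, o_now : utility-network outputs at times t-1 and t
        r             : scalar reinforcement signal
        chosen_action : index of the action selected at time t-1
        gamma         : scaling parameter in (0, 1); 0.9 is an assumed value
        """
        # Utility-network error (Equation 1): e_{t-1} = gamma*o_t - o_{t-1} + r
        e = gamma * o_now - o_prev + r

        # Action-network errors: +e for the action actually chosen at t-1,
        # -e for every other action.
        action_errors = np.full(n_actions, -e)
        action_errors[chosen_action] = e
        return e, action_errors

    # Example: utility rose from 0.2 to 0.5 with no penalty, so the error is
    # positive, the chosen action is reinforced and the alternatives discouraged.
    e, act_err = td_errors(o_prev=0.2, o_now=0.5, r=0.0, chosen_action=1, n_actions=3)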

B. Heuristic low level behaviors

To augment the behavior of the reinforcement learning networks, our agent architecture also includes two heuristic behaviors and a higher level control function that selects which behavior is given control of the vehicle. One of the heuristic behaviors returns the car to the race track after a recoverable crash. The second heuristic behavior manages the car as it passes another vehicle. The heuristic failure recovery behavior is given control whenever the car is found to be going in the wrong direction around the track, or when the car is found to be off the track. It guides the car into a state where it is moving in the correct direction on the track. Since the car is not allowed to move backwards and the failure recovery algorithm makes sure that it moves forward for at least one time step, some forward progress is inevitable, even when the network is not trained and failures are very frequent.
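The paper does not spell out the recovery rules themselves; the following is a minimal sketch of one plausible version, with state fields, thresholds and sign conventions that are entirely our own assumptions.

    def failure_recovery(state):
        """Hypothetical failure-recovery heuristic (a sketch, not the authors' rules).

        Steers back toward the track centerline and keeps the car creeping forward
        until it is on the track and heading the right way, at which point control
        can be handed back to the learning networks.
        """
        # Steer toward the track center (sign convention assumed).
        steer = 0.01 if state["center_is_left"] else -0.01

        # Never reverse; keep moving forward so some progress is made each step.
        accel = 12.0 if state["speed"] < 15.0 else 0.0

        recovered = state["on_track"] and state["heading_ok"]
        return steer, accel, recovered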

C. Coordination mechanism

Performance is directly influenced by the interaction of acceleration and steering. Thus, for the agent to perform well, it must coordinate both control tasks. The networks are coordinated by providing each network with the set of actions chosen at the previous time step. This feedback mechanism allows each network to adjust for the action chosen by the other. The low level reinforcement learning networks are trained to drive the car around the track. While racing around the track comprises the bulk of the agent's activity, other behaviors are required to operate fully in the environment. The coordinating mechanism switches vehicle control between the two heuristic behaviors and the reinforcement learning behavior as appropriate.
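The two coordination devices described above can be summarized in a short sketch; the function names, state fields and switching conditions are our own assumptions, not the authors' implementation.

    def choose_behavior(state):
        """Hypothetical high level coordination rule: select which low level
        behavior drives the car on this time step."""
        if (not state["on_track"]) or (not state["heading_ok"]):
            return "failure_recovery"      # heuristic behavior
        if state["car_ahead_close"]:
            return "passing"               # heuristic behavior
        return "racing"                    # reinforcement learning behavior

    def network_input(sensors, prev_accel_action, prev_steer_action):
        """Each learning network also receives the actions chosen at t-1,
        so it can adjust for what the other network did."""
        return list(sensors) + [prev_accel_action, prev_steer_action]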

D. Alternative designs

Two distinct control signals are required for driving the car. To simplify the design, we divided each control signal into three possible actions.

FIGURE 4. Three reinforcement learning network designs for implementing the acceleration and steering behaviors: A) Single Utility, Single Action (SS); B) Single Utility, Multiple Action (SM); C) Multiple Utility, Multiple Action (MM). (Each design connects utility and action networks and stochastic action selectors to the environment simulator, exchanging reinforcement, sensory information, and acceleration and steering commands.)

The three actions for setting the acceleration are to accelerate by 12 ft/sec, decelerate by 12 ft/sec, or remain at the same speed. RARS limits the amount of acceleration to the physical constraints of the simulated system. The possible steering commands are to adjust steering to the left by 0.01 radians, do not adjust steering, or adjust steering to the right by 0.01 radians. We tested three alternative network architectures derived from the reinforcement learning network. The three architectures are characterized by the number of utility and action networks that they contain. The SS approach uses a single utility network and a single action network. The SM approach uses a single utility network but multiple action networks. The MM approach uses multiple utility and multiple action networks for acceleration and steering. Figure 4 shows the alternative designs.
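For concreteness, the discrete action sets and the way the three designs differ can be written down as data. The action values and hidden-unit counts come from the text above and from the tuning results in Section V; the dictionary layout, and the reading of "multiple" as one network per control signal, are our own shorthand.

    # Discrete action sets used by the low level controllers (values from the paper).
    ACCEL_ACTIONS = [+12.0, 0.0, -12.0]    # ft/sec: speed up, hold speed, slow down
    STEER_ACTIONS = [-0.01, 0.0, +0.01]    # radians: left, straight, right

    # The three designs differ in how many utility and action networks they use.
    # "Multiple" is assumed to mean one network per control signal (two in total);
    # how the single SS action network encodes both signals is not specified.
    DESIGNS = {
        "SS": {"utility_nets": 1, "action_nets": 1, "hidden_units_per_net": 6},
        "SM": {"utility_nets": 1, "action_nets": 2, "hidden_units_per_net": 5},
        "MM": {"utility_nets": 2, "action_nets": 2, "hidden_units_per_net": 4},
    }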

The reinforcement signal is constructed from the output of the environment simulator. After some experimentation to check that the signal would provide the correct response, we settled on one signal. The reinforcement signal is calculated as:

    if (car is off the track) OR
       (speed of car < 15 MPH) OR
       ((car just completed a lap) AND (time for this lap > time for last lap))
        reinforcement = -1;
    else
        reinforcement = 0;

This function negatively reinforces leaving the track, going too slowly, and completing a lap more slowly than the previous one. The network has a slight bias, due to Equation 1, so no positive reinforcement is necessary.
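A direct transcription of this signal into runnable form (the argument names are ours):

    def reinforcement(off_track, speed_mph, lap_completed, lap_time, last_lap_time):
        """Scalar reinforcement signal from the environment simulator (sketch)."""
        slow = speed_mph < 15.0
        slower_lap = lap_completed and lap_time > last_lap_time
        return -1.0 if (off_track or slow or slower_lap) else 0.0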

V. EVALUATION

The agent architecture was based on three design principles:
- behaviors should be tailored to suit the domain,
- implementing multiple control tasks in separate control mechanisms simplifies generalization and expedites scale-up, and
- a coordination mechanism provides long term strategic reasoning without sacrificing the real-time capabilities of reactive control.

Our long term goal is to test all of these design principles. For this study, we tested the second principle by implementing three reinforcement learning networks for the low level behaviors: one is monolithic, one is totally modular, and one is in between. The question is whether the modular approach results in the best overall performance.

A. Experiment design

Each neural network implementation was tuned, trained, and tested on a single track in RARS. Once the network parameters were tuned, the network was reset with random weights. Each network was then trained for 250 laps around the track with no other cars on the track. Performance was measured as the length of time needed to learn and as the failure rate after a certain number of races. A failure is considered to have occurred whenever the car leaves the track for any reason. The networks have several tuning parameters that were expected to influence the results: the number of hidden units in each network, whether or not the output node is directly connected to the input nodes, and the learning rate and momentum term for the neural network. Each network architecture was run with various parameter settings to find those that resulted in the best performance. In all of the networks, a learning rate of 0.0001 and a momentum term of 0.25 resulted in good performance. We also found that adding direct connections between the input and output nodes increased performance in all cases. The required size of the network was found to depend on the network architecture. For the SS network, 6 hidden nodes were used in each network; for the SM network, 5 hidden nodes; and for the MM network, 4 hidden nodes.
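A sketch of a network built to these settings is given below. The learning rate, momentum, hidden-unit counts and direct input-to-output connections are the paper's; the activation function, weight initialization and class structure are assumptions of ours.

    import numpy as np

    LEARNING_RATE = 0.0001
    MOMENTUM = 0.25
    HIDDEN_UNITS = {"SS": 6, "SM": 5, "MM": 4}   # hidden nodes per network

    class FeedForwardNet:
        """One hidden layer plus direct input-to-output connections."""
        def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
            rng = np.random.default_rng(seed)
            self.w_in_hid = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
            self.w_hid_out = rng.normal(0.0, 0.1, (n_hidden, n_outputs))
            self.w_in_out = rng.normal(0.0, 0.1, (n_inputs, n_outputs))  # direct connections

        def forward(self, x):
            h = np.tanh(x @ self.w_in_hid)              # tanh hidden layer (assumed)
            return h @ self.w_hid_out + x @ self.w_in_out

    # Example: a utility network for the MM design with 10 sensory inputs.
    utility_net = FeedForwardNet(n_inputs=10, n_hidden=HIDDEN_UNITS["MM"], n_outputs=1)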

B. Results

Given our principle of control task modularity, we expected the MM approach to outperform the other two architectures. Figure 5 compares the number of failures for each reinforcement learning network during training. As expected, the MM approach finds a good solution with less training than the other two approaches. We believe that the MM approach exhibits superior performance in part because each network in the MM approach has fewer hidden units; smaller networks are, in general, easier to train than large networks. Additionally, although the total number of weights in the MM approach is actually higher than in the other two approaches, the modular structure expedites training.

FIGURE 5. Learning times for the three reinforcement learning architectures (failures during a lap versus laps around the track, for the MM, SS and SM networks).

FIGURE 6. Time required for making one lap around the track as the networks learn (time to complete a lap versus laps around the track, for the MM, SS and SM networks).

Figure 6 shows the time for each lap during training. Again, it is the MM network that shows the best learning performance. This graph is similar to the graph in Figure 5, because fewer failures means less time spent recovering from failure, and because the two behaviors strongly interact.

The interaction is reflected in a slowly improving cycle of three steps forward and two backward. The MM network finds a solution in which it does not fail at all, then reduces the time to complete a lap until failures begin again, and then works to reduce the number of failures once more. Over time, this behavior minimizes both the time to complete a lap and the rate of failure. These results strongly support the idea that implementing multiple control tasks in separate control mechanisms expedites coordination. Since the networks are continually learning, they automatically compensate for the effects of interaction between the control tasks. This approach also yields superior learning rates due to the smaller size of each neural network. Although not addressed in this paper, the MM approach has the added benefit that each reinforcement learning network can have its own reinforcement signal from the environment. Separate reinforcement signals should give more accurate feedback to each network, resulting in better performance and faster learning.

VI. FUTURE WORK AND CONCLUSIONS

We have shown that low level behaviors can be implemented using a modular reinforcement learning approach. It remains to be seen how well this approach will work with the addition of new behaviors and how well it generalizes to other problem domains. Our future work includes replacing the heuristic passing strategy with reinforcement learning, adding a pit-stop behavior, and improving the high level control function that selects the appropriate reactive behavior. We will also experiment with other simulation environments, in particular a modified version of the SRI Erratic simulator [12]. The Erratic simulator is a real-time simulator for the SRI Flakey robot; it includes a detailed simulation of the servos and sensors available on that robot. We have already performed some preliminary experiments using genetic reinforcement learning to implement reactive behaviors and have found that it, too, can produce acceptable results. We intend to apply the approach used with the RARS simulator to the Erratic simulator and to compare the performance of genetic reinforcement learning with the method presented in this paper.

Our long term goal is to develop a reliable and efficient method for generating low-level behaviors that can be used by a higher level coordinating mechanism. We have found that reinforcement learning can be used to implement efficient and modular reactive behaviors. The modularity of this approach will allow new reactive behaviors to be added at any time while learning compensates for interaction effects.

REFERENCES

[1] Charles W. Anderson. Strategy learning with multilayer connectionist representations. In Proceedings of the Fourth International Workshop on Machine Learning, pages 103-114, 1989.

[2] A. G. Barto, R. S. Sutton, and C. J. Watkins. Learning and sequential decision making. COINS Technical Report 89-95, Dept. of Computer and Information Science, University of Massachusetts, 1989.

[3] Daniel Bullock, Stephen Grossberg, and Frank Guenther. A self-organizing neural network model for redundant sensory-motor control, motor equivalence, and tool use. In International Joint Conference on Neural Networks, pages 91-96, Baltimore, 1992. IEEE.

[4] R. F. Comoglio and A. S. Pandya. Using a cerebellar model arithmetic computer (CMAC) neural network to control an autonomous underwater vehicle. In International Joint Conference on Neural Networks, pages 781-786, Baltimore, 1992. IEEE.

[5] David S. Day. Integrating reaction and reflection in an autonomous agent: An empirical study. Unpublished, 1990.

[6] M. Dorigo and M. Colombetti. Robot shaping: developing autonomous agents through learning. Artificial Intelligence, 71(2):321-370, December 1994.

[7] Gita Krishnaswamy, Marcelo H. Ang Jr., and Gerry B. Andeen. Structured neural-network approach to robot motion control. In International Joint Conference on Neural Networks, pages 1059-1066. IEEE, 1991.

[8] W. Feiten, U. Wienkop, A. Huster, and G. Lawitsky. Simulation in the design of an autonomous mobile robot. In R. Trappl, editor, Cybernetics and Systems '94, pages 1499-1506. World Scientific Publishing, Singapore, 1994.

[9] D. A. Handelman, S. H. Lane, and J. J. Gelfand. Integrating knowledge-based system and neural network techniques for robotic skill acquisition. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), pages 193-198, Los Altos, CA, 1989. Morgan Kaufmann.

[10] Adele E. Howe and Paul R. Cohen. Responding to environmental change. In Katia P. Sycara, editor, Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling and Control, pages 85-92. Morgan Kaufmann Publishers, Inc., November 1990.

[11] M. Knick and F. J. Radermacher. Integration of sub-symbolic and symbolic information processing in robot control. In Proceedings of the Third Annual Conference on AI, Simulation and Planning in High Autonomy Systems, pages 238-243, Perth, Western Australia, July 1992.

[12] Kurt Konolige. Erratic robot simulator. Available by anonymous ftp from ftp://ftp.ai.sri.com/pub/konolige/erratic-ver2b.tar.Z, 1994.

[13] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4):69-97, 1992.

[14] Huan Liu, Thea Iberall, and George A. Bekey. Neural network architecture for robot hand control. In Proceedings of the IEEE International Conference on Neural Networks. IEEE, July 1989.

[15] Min Meng and A. C. Kak. Mobile robot navigation using neural networks and nonmetrical environment models. IEEE Control Systems, pages 30-39, October 1993.

[16] A. N. Poo, M. H. Ang Jr., C. L. Teo, and Qing Li. Performance of a neuro-model-based robot controller: adaptability and noise rejection. Intelligent Systems Engineering, 1(1):50-62, 1992.

[17] Jude W. Shavlik. Combining symbolic and neural learning. Machine Learning, 14(3):321-331, 1994.

[18] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Dept. of Computer and Information Science, University of Massachusetts, 1984.

[19] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

[20] Mitchell E. Timin. Robot Auto Racing Simulator. Available from http://www.ebc.ee/~mremm/rars/rars.htm, 1995.

[21] David E. Wilkins, Karen L. Myers, John D. Lowrance, and Leonard P. Wesley. Planning and reacting in uncertain and dynamic environments. Journal of Experimental and Theoretical AI, 7, 1994.