IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 5, NO. 3, AUGUST 1997

Letters

A Self-Learning Fuzzy Logic Controller Using Genetic Algorithms with Reinforcements

Chih-Kuan Chiang, Hung-Yuan Chung, and Jin-Jye Lin

Manuscript received May 20, 1996; revised December 12, 1996. This work was supported by the National Science Council of the Republic of China under NSC 83-0404-E008-024. The authors are with the Department of Electrical Engineering, National Central University, Chung-Li, Taiwan, 32054 R.O.C. Publisher Item Identifier S 1063-6706(97)05911-0.

Abstract—This paper presents a new method for learning a fuzzy logic controller automatically. A reinforcement learning technique is applied to a multilayer neural-network model of a fuzzy logic controller. The proposed self-learning fuzzy logic controller, which uses a genetic algorithm within a reinforcement learning architecture and is called a genetic reinforcement fuzzy logic controller (GRFLC), can learn fuzzy logic control rules even when only weak information, such as a binary “success” or “failure” signal, is available. In this paper, the adaptive heuristic critic (AHC) algorithm of Barto et al. is extended to include a priori control knowledge of human operators. It is shown that the system can solve a fairly difficult control learning problem in a more concrete way. The feasibility of the method is also demonstrated by applying it to the cart-pole balancing problem in digital simulations.

Index Terms—Fuzzy logic control, genetic algorithm, neural network, reinforcement learning.

I. INTRODUCTION

In recent years, there has been much interest in, and progress on, the development of fuzzy logic controllers and their applications to control problems. As is well known, fuzzy logic controllers, which do not require analytical models, have demonstrated a number of successful applications, for example, in water quality control [1], nuclear reactor control [2], and automobile transmission control [3]. These applications have largely been based on emulating the performance of a skilled human operator in the form of linguistic rules. Although there is extensive literature concerning various applications of fuzzy logic controllers, until now there have been few systematic procedures for the design of fuzzy logic systems. The most straightforward approach is to define the rules and membership functions subjectively, by studying a human-operated or human-controlled system or an existing controller, and then to test the design for proper output. The rules and membership functions are adjusted if the design fails the test.

The current research trend is to design a fuzzy logic system that has the capability to learn by itself, starting with the self-organizing control (SOC) techniques of Mamdani and his students [4]. The controller is expected to perform two tasks: 1) it observes the process environment while issuing the

appropriate control decisions and 2) it uses the results of those decisions for further improvement. Lin and Lee [5] proposed a fuzzy logic control/decision network that is constructed automatically and tuned by learning from training examples. In some real-life environments, however, obtaining exact training data may be expensive. A method has been introduced to regulate fuzzy control rules [6], based on control rules represented by an analytic expression with a regulating factor; this method, however, only addresses the set-point problem (that is, changing the set point of the plant). In [7], reinforcement learning is used to adjust the consequents of fuzzy logic rules for the one-dimensional (1-D) pole-balancing problem, in which the position of the cart is ignored. Berenji [8], [9] developed an architecture that learns, through reinforcements, to adjust the fuzzy membership functions of the linguistic labels used in different control rules. The learning task may include both the tuning of the fuzzy membership functions used in the control rules and the identification of the main control rules. In the following, we are concerned with the latter learning task and develop an architecture that can learn the fuzzy control rules automatically.

Unsupervised connectionist learning methods do not rely on a teacher to guide the learning process. In the supervised learning class, by contrast, the teacher provides the learning system with the desired output for each given input, and learning involves memorizing the desired outputs by minimizing the discrepancy between the actual outputs of the system and the desired outputs. In the reinforcement learning class, the teacher provides the learning system only with a scalar evaluation of the system's performance of the task according to a given index; the objective of the learning system is then to improve its performance (as evaluated by the critic) by generating appropriate outputs. It has been shown that when supervised learning can be used in control (e.g., when input/output training data sets are available), it is more efficient than reinforcement learning [10]. In many control problems, however, control actions must be selected whose consequences emerge over uncertain periods and for which input/output training data are not readily available. In such cases, a reinforcement learning system can be used: instead of the unknown desired outputs, it provides the system with a suitable evaluation of its performance. Here, reinforcement learning is more appropriate than supervised learning.




A genetic algorithm (GA) is a parallel global-search technique that emulates natural genetic operations. Because it simultaneously evaluates many points in the search space, it is more likely to converge toward the global solution of a given problem. A GA applies operators inspired by the mechanics of natural selection to a population of candidate solutions and uses binary strings to encode the parameter space. In each generation of the population, it explores different areas of the parameter space and then directs the search toward regions where there is a high probability of finding improved performance. In this way, a genetic algorithm can, in effect, examine many local minima and increase the likelihood of finding the global minimum representing the problem goal. In this paper, a GA is used to design the fuzzy logic control rules and to seek the optimal linguistic values of the consequents of the fuzzy control rules under reinforcement learning.

This paper is organized as follows. In Section II, traditional fuzzy logic control and reinforcement learning are introduced. In Section III, the proposed approach, a combination of techniques drawn from fuzzy logic, neural network, and genetic algorithm theory, is presented. In Section IV, computer simulation results for the cart-pole balancing problem are described. Finally, in Section V, the general conclusion is formulated.

II. A SURVEY OF FUZZY SETS, FUZZY LOGIC CONTROL, AND REINFORCEMENT LEARNING

A fuzzy logic controller comprises four principal components: a fuzzificator, a knowledge base, a decision-making logic (or inference engine), and a defuzzificator. The fuzzificator acquires and measures the values of the controller input variables and converts them into corresponding linguistic values, which may be viewed as labels of fuzzy sets. The knowledge base comprises knowledge of the application domain and the attendant control goals. The decision-making logic (or inference engine) has the capability of simulating human reasoning based on fuzzy concepts and of inferring fuzzy control actions by employing fuzzy implication and the rules of inference of fuzzy logic. The defuzzificator maps the inferred fuzzy control actions, defined over the universe of discourse of the output variables, onto corresponding nonfuzzy (crisp) control actions. Because of the partial-matching attribute of fuzzy control rules and the fact that the preconditions of rules overlap, more than one fuzzy rule can fire at a time. The methodology used in deciding which control action should be taken as the result of this conflict resolution is explained in the following example. Suppose that we have the two rules:

Rule 1: IF $X$ is $A_1$ and $Y$ is $B_1$, THEN $Z$ is $C_1$;

Rule 2: IF $X$ is $A_2$ and $Y$ is $B_2$, THEN $Z$ is $C_2$.

If we have $x_0$ and $y_0$ as the sensor readings for the fuzzy variables $X$ and $Y$, their truth values for Rule 1 are represented by $\mu_{A_1}(x_0)$ and $\mu_{B_1}(y_0)$, where $\mu_{A_1}$ and $\mu_{B_1}$ denote the membership functions of $A_1$ and $B_1$, respectively. Similarly, for Rule 2 we have $\mu_{A_2}(x_0)$ and $\mu_{B_2}(y_0)$, so the truth values of the preconditions are

$$ w_1 = \mu_{A_1}(x_0) \wedge \mu_{B_1}(y_0), \qquad w_2 = \mu_{A_2}(x_0) \wedge \mu_{B_2}(y_0) $$

where $\wedge$ denotes a conjunction or intersection operator, for which the minimum is used in fuzzy logic controllers. The control output of Rule 1 is calculated by applying the matching strength of its preconditions to its conclusion. We assume that

$$ \mu_{C_1}(z_1) = w_1 $$

and, for Rule 2, that

$$ \mu_{C_2}(z_2) = w_2. $$

The above equations show that Rule 1 is recommending the control action $z_1$ and Rule 2 the control action $z_2$. Generalizing to $N$ rules, a nonfuzzy control action $z^{*}$ is produced, which is calculated using a weighted averaging approach

$$ z^{*} = \frac{\sum_{i=1}^{N} w_i\, z_i}{\sum_{i=1}^{N} w_i} $$

where $N$ is the number of rules and $z_i$ is the amount of control action recommended by Rule $i$.
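As a concrete illustration of this conflict-resolution scheme, the following minimal sketch fires two rules with the min operator and combines their recommendations with the weighted average above; the triangular membership functions, the sensor readings, and all names are illustrative assumptions, not details taken from the letter.

def triangle(x, center, spread):
    """Triangular membership function with the given center and spread."""
    return max(0.0, 1.0 - abs(x - center) / spread)

def weighted_average_inference(x0, y0, rules):
    """Each rule is ((mu_A, mu_B), z): two antecedent membership functions and
    the crisp action z it recommends.  Rule strength = min of the precondition
    truth values; output = weighted average of the recommended actions."""
    strengths = [min(mu_a(x0), mu_b(y0)) for (mu_a, mu_b), _ in rules]
    total = sum(strengths)
    if total == 0.0:
        return 0.0
    return sum(w * z for w, (_, z) in zip(strengths, rules)) / total

# Two illustrative rules: IF X is A_i and Y is B_i THEN Z is C_i, with C_i
# represented here by the center z_i of a symmetric consequent set.
example_rules = [
    ((lambda x: triangle(x, 0.0, 1.0), lambda y: triangle(y, 0.0, 1.0)), 2.0),
    ((lambda x: triangle(x, 1.0, 1.0), lambda y: triangle(y, 1.0, 1.0)), -3.0),
]
print(weighted_average_inference(0.3, 0.6, example_rules))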

In reinforcement learning, one assumes that there is no supervisor to critically judge the control action chosen at each control step. The learning scheme instead modifies the behavior of the system based on a single scalar evaluation of the system's output. In supervised learning schemes, by contrast, a knowledgeable teacher explicitly specifies the desired output in the form of a reference signal. Since the evaluative signal contains much less information than the reference signal, reinforcement learning is appropriate for systems operating in knowledge-poor environments. The study of reinforcement learning is closely related to credit assignment: given the performance of a process, one has to assign reward or blame to the individual elements contributing to that performance. This may be complicated when a sequence of actions is collectively awarded a delayed reinforcement. In rule-based systems, for example, this is equivalent to assigning credit or blame to the individual rules involved in the problem-solving process. Samuel's checkers-playing program is probably the earliest artificial intelligence (AI) program that used this idea [11]. Michie and Chambers [12] used a reward–punishment strategy in their BOXES system, which learned cart-pole balancing by discretizing the state space into nonoverlapping regions and applying two opposite constant forces. Barto et al. [13] used two neuron-like elements to solve the learning problem in cart-pole balancing. These approaches partition the state space into smaller nonoverlapping regions, and credit assignment is performed on a local basis.



Fig. 1. The architecture of GRFLC.

The main advantage of reinforcement learning is that it is easy to implement: unlike backpropagation, which computes the effect of changing a local variable, the “credit assignment” does not require any special apparatus for computing derivatives. Therefore, reinforcement learning can be used in complex systems in which it would be very hard to compute reinforcement derivatives analytically.

Fig. 2. The action-evaluation network.

The objective of reinforcement learning is to maximize some function of the reinforcement signal, such as the expectation of its value on the upcoming time step or the expectation of some integral of its value over all future time, as appropriate for the particular task.
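As one common formalization of this objective (an assumption made here for concreteness, not a formula stated in the letter), the learning system seeks to maximize either the expected next-step reinforcement or a discounted sum of future reinforcements

$$ J = E\left[\, r(t+1) \,\right] \qquad \text{or} \qquad J = E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k+1)\right], \qquad 0 \le \gamma < 1 $$

where $r$ is the external reinforcement and $\gamma$ plays the same role as the discount rate used by the action-evaluation network in Section III.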

III. THE ARCHITECTURE OF GRFLC

In this section, we present the proposed fuzzy logic controller, which has the capability to learn its own control rules without expert knowledge. Fig. 1 shows the architecture of the genetic reinforcement fuzzy logic controller (GRFLC), which uses a GA to design the fuzzy logic controller under reinforcement learning. Its main elements are the action-evaluation network (AEN), which acts as a critic and provides advice to the main controller, and the action-generation network (AGN), which includes a fuzzy controller and the stochastic action modifier (SAM); using both the recommended action F and the evaluation v, the SAM produces the action F' actually applied to the plant. The accumulator of the AGN provides the relative measure of fitness required by the GA. The AGN is a multilayer-network fuzzy logic controller that consists of four components: a GA rule base, a fuzzificator, the decision-making logic, and the defuzzificator. Here, the GA rule base is a collection of fuzzy IF-THEN statements self-generated by a GA. The membership functions of the fuzzy variables in the antecedent parts have to be built in advance, whereas those in the consequent parts are also generated by the GA; using the GA, the optimal fuzzy singletons of the consequent parts can be obtained. In the following, the two networks are presented in more detail.
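The following is a minimal sketch of the GA outer loop implied by this architecture: a simple three-operator GA whose fitness function runs the controller for one rule set and returns the accumulator value. The crossover and mutation probabilities and the population size anticipate the values reported in Section IV; the function names, the number of generations, and the roulette-wheel selection scheme are illustrative assumptions rather than details given in the letter.

import random

def run_simple_ga(fitness_fn, string_length, pop_size=50,
                  p_crossover=0.7, p_mutation=0.001, generations=30):
    """Sketch of the three-operator GA driving GRFLC.

    Each individual is a bit string encoding the consequent labels of all
    fuzzy rules; fitness_fn evaluates one rule set by running the controller
    and returning the accumulator value (time steps before failure)."""
    population = [[random.randint(0, 1) for _ in range(string_length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [max(fitness_fn(ind), 1e-6) for ind in population]

        def select():
            # Roulette-wheel (fitness-proportionate) selection.
            return random.choices(population, weights=fitness, k=1)[0]

        next_population = []
        while len(next_population) < pop_size:
            parent1, parent2 = select()[:], select()[:]
            if random.random() < p_crossover:
                # One-point crossover: swap the tails of the two strings.
                point = random.randrange(1, string_length)
                parent1[point:], parent2[point:] = parent2[point:], parent1[point:]
            for child in (parent1, parent2):
                for i in range(string_length):
                    if random.random() < p_mutation:
                        child[i] ^= 1      # bitwise mutation
                next_population.append(child)
        population = next_population[:pop_size]
    return max(population, key=fitness_fn)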

A. The Action-Evaluation Network

The AEN proposed by Berenji [8] plays the role of an adaptive critic element (ACE) [13] that repeatedly predicts reinforcements associated with different input states. The only information received by the AEN is the state of the physical system, in terms of its state variables, and a signal indicating whether a failure has occurred. The AEN is a two-layer feedforward network with sigmoid activation functions everywhere except in the output layer. The structure of the AEN, shown in Fig. 2, contains $h$ hidden units and $n+1$ input units (including a bias unit). Each hidden unit receives $n+1$ inputs and has $n+1$ weights, while the output unit receives $n+1+h$ inputs and has $n+1+h$ weights. The learning algorithm is composed of Sutton's adaptive heuristic critic (AHC) algorithm [16] for the output unit and an error backpropagation algorithm [17] for the hidden units. The output of the $i$th unit of the hidden layer is given by

$$ y_i(t, t+1) = g\left( \sum_{j=1}^{n+1} a_{ij}(t)\, x_j(t+1) \right) $$

where

$$ g(s) = \frac{1}{1 + e^{-s}} $$

is the sigmoid function and $t$ and $t+1$ are successive sampling times. The output unit of the evaluation network receives inputs both from the units in the hidden layer and directly from the units in the input layer, so that the prediction of the reinforcement is

$$ v(t, t+1) = \sum_{i=1}^{n+1} b_i(t)\, x_i(t+1) + \sum_{i=1}^{h} c_i(t)\, y_i(t, t+1). $$


In the above equations, double time dependencies are used to avoid instabilities in the updating of weights [19]. The network evaluates the action recommended by the action network as a function of the failure signal and of the change in state evaluation based on the state of the system at time $t+1$:

$$ \hat r(t+1) = \begin{cases} 0, & \text{start state} \\ r(t+1) - v(t, t), & \text{failure state} \\ r(t+1) + \gamma\, v(t, t+1) - v(t, t), & \text{otherwise} \end{cases} $$

Fig. 3. The structure of the AGN.

where $0 \le \gamma \le 1$ is the discount rate. In other words, the change in the value of $v$ plus the value of the external reinforcement constitutes the heuristic or internal reinforcement $\hat r$, in which the future value of $v$ is discounted more the further it is from the current state of the system.

Learning Paradigm of the AEN: In this network, a learning paradigm similar to a reward/punishment scheme is used for updating the weights. The weights connecting the units in the input layer directly to the unit in the output layer are updated according to

$$ b_i(t+1) = b_i(t) + \beta\, \hat r(t+1)\, x_i(t) $$

where $\beta > 0$ is a constant and $\hat r(t)$ is the internal reinforcement at time $t$. Similarly, the weights connecting the hidden layer and the output layer are updated according to

$$ c_i(t+1) = c_i(t) + \beta\, \hat r(t+1)\, y_i(t, t) $$

where $\beta > 0$ is a constant. The weight update for the hidden layer is based on a modified version of the error backpropagation algorithm [17]. Since it is impossible to measure the error directly, $\hat r$ (as in Anderson [15], [18]) plays the role of an error measure in the weight update given by

$$ a_{ij}(t+1) = a_{ij}(t) + \beta_h\, \hat r(t+1)\, y_i(t, t)\,[1 - y_i(t, t)]\,\operatorname{sgn}\!\big(c_i(t)\big)\, x_j(t) $$

where $\beta_h > 0$. Note that the sign of the hidden unit's output weight is used rather than its value. This variation is based on Anderson's empirical result that the algorithm is more robust if the sign of the weight is used rather than its value.
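As a concrete reading of these equations, the following is a minimal sketch of the AEN with the AHC-style updates above, assuming a NumPy implementation. The class name, the learning-rate and discount values, and the random seed are illustrative assumptions; only the update rules themselves follow the equations in this subsection.

import numpy as np

class ActionEvaluationNetwork:
    """Sketch of the AEN: a two-layer critic trained with the AHC rule."""

    def __init__(self, n_inputs, n_hidden, beta=0.2, beta_h=0.05, gamma=0.9):
        rng = np.random.default_rng(0)
        # Weights initialized randomly in [-0.01, 0.01], as in Section IV.
        self.a = rng.uniform(-0.01, 0.01, (n_hidden, n_inputs + 1))  # input -> hidden
        self.b = rng.uniform(-0.01, 0.01, n_inputs + 1)              # input -> output
        self.c = rng.uniform(-0.01, 0.01, n_hidden)                  # hidden -> output
        self.beta, self.beta_h, self.gamma = beta, beta_h, gamma

    def _augment(self, state):
        return np.append(state, 1.0)           # append the bias unit

    def evaluate(self, state):
        """Forward pass: returns the prediction v and the hidden activations y."""
        x = self._augment(state)
        y = 1.0 / (1.0 + np.exp(-self.a @ x))  # sigmoid hidden units
        v = self.b @ x + self.c @ y            # linear output unit
        return v, y

    def internal_reinforcement(self, r, v_prev, v_new, failed, start=False):
        """Heuristic reinforcement r_hat from the piecewise definition above."""
        if start:
            return 0.0
        if failed:
            return r - v_prev
        return r + self.gamma * v_new - v_prev

    def update(self, state_prev, r_hat):
        """Reward/punishment update of b and c; sign-based update of a."""
        x = self._augment(state_prev)
        y = 1.0 / (1.0 + np.exp(-self.a @ x))
        self.b += self.beta * r_hat * x
        self.c += self.beta * r_hat * y
        self.a += self.beta_h * r_hat * (y * (1.0 - y) * np.sign(self.c))[:, None] * x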

B. The Action-Generation Network

The AGN, shown in Fig. 3, contains a fuzzy controller modeled by a five-layer neural network. It generates an action by implementing a fuzzy inference scheme, and the control rules are self-learned in the AGN. Each layer of the network performs one step of the fuzzy inference process. The roles of the individual layers are as follows (a short sketch of the coding, decoding, and defuzzification steps follows this list).

1) The nodes in this layer are the input linguistic variables in the antecedent parts of the control rules; they transmit the input values directly to the next layer, and no computation is done.

2) In this layer, single nodes are used to represent individual membership functions; each node outputs the value of its membership function. For example, if large is one of the values that $x$ can take, then a node computing $\mu_{\text{large}}(x)$ belongs to layer 2). It has exactly one input and feeds its output to all the rules using the clause “IF $x$ is large” in their IF parts. For the bell-shaped function,

$$ \mu(x) = \exp\!\left(-\frac{(x - c)^2}{\sigma^2}\right) $$

where $c$ is the center and $\sigma$ the spread of the membership function. For triangular shapes, this function is given by

$$ \mu(x) = \begin{cases} 1 - \dfrac{|x - c|}{\sigma}, & |x - c| \le \sigma \\ 0, & \text{otherwise.} \end{cases} $$

Triangular shapes are preferable here because they are simple and adequate for a large number of application domains.

3) This layer implements the conjunction of the antecedent conditions belonging to one rule. A node in layer 3) corresponds to a rule in the rule base, so the total number of nodes in this layer depends on the number of nodes in layer 2). For example, if there are two input variables in layer 1) and both of them have five membership functions in layer 2), then layer 3) will have up to 5 x 5 rule nodes. The node itself performs the fuzzy AND operation, for which the minimum operation is used instead of the product operation:

$$ w_r = \min_{j} \mu_j, \qquad r = 1, \ldots, R $$

where $\mu_j$ is the degree of match between a fuzzy label occurring as one of the antecedents of rule $r$ and the corresponding input variable, and $R$ is the number of rule nodes. As a result, $w_r$, the degree of applicability of rule $r$, is obtained.

4) The nodes in this layer correspond to the consequent labels. The inputs of an individual node come from all the rules that use this particular consequent label. For each of


the $w_r$ supplied to the node, this node computes the corresponding output action $z_r$, as suggested by rule $r$. The resulting mapping may be written as

$$ z_r = \mu_{C}^{-1}(w_r) $$

where $C$ indicates a specific consequent label and the inverse is taken to mean a suitable defuzzification procedure applicable to an individual rule. In general, the mathematical inverse of $\mu_C$ may not exist if the function is not monotonic. We usually use a simple procedure to determine this inverse [9]: $\mu_C^{-1}(w_r)$ is the coordinate of the centroid of the set $\{ z : \mu_C(z) \ge w_r \}$; it will be the center of the consequent fuzzy set if the membership function is symmetrical. Using a fuzzy singleton for the consequent label is similar to the procedure above. In general, it is easy to produce the antecedent part of a fuzzy control rule, but it is very difficult to produce the consequent part without expert knowledge; we do not know a priori which label to select for the THEN part of each rule. The nodes in layer 4) are associated with the corresponding rule nodes in layer 3), and the consequent label value of each node in this layer is self-generated by the GA. Here, a simple three-operator GA (reproduction, crossover, and mutation) is used. When applying such a GA to a search problem (which is not difficult), two decisions must be made: how to code the possible solutions to the problem as finite bit strings and how to evaluate the merit of each string (also called decoding).

a) Coding: To use the GA, we must first code the decision variables as finite-length strings. The decision variable, the consequent label of each node in this layer, is coded as a binary unsigned integer of length $l$. Since $R$ rules are possible and each rule is represented by an $l$-bit string, a string of length $R \cdot l$ represents every possible rule set for the fuzzy logic controller.

b) Decoding: The decoded value of each node is used as the fuzzy singleton of the consequent label. This value is obtained according to

$$ z = z_{\min} + \frac{b}{2^{l} - 1}\,(z_{\max} - z_{\min}) $$

where $z$ is the value of the parameter being coded, $b$ is the integer value represented by the $l$-bit string, $z_{\max}$ is the user-determined maximum, and $z_{\min}$ is the minimum.

5) This layer has as many nodes as there are output-action variables. Each output node combines the recommendations from all the fuzzy control rules in the rule base using the following weighted sum, the weights being the rule strengths:

$$ F = \frac{\sum_{r=1}^{R} w_r\, z_r}{\sum_{r=1}^{R} w_r} $$

where $R$ is the total number of rules. This layer acts as a defuzzifier.
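To make the coding and decoding of layer 4) and the defuzzification of layer 5) concrete, here is a minimal sketch. The function names are illustrative, and the 5-bit code length and the [-10, 10] N output range anticipate the choices reported in Section IV.

def decode_consequents(bit_string, n_rules, bits_per_rule=5,
                       z_min=-10.0, z_max=10.0):
    """Decode a GA chromosome into one fuzzy-singleton consequent per rule.

    bit_string is a sequence of 0/1 of length n_rules * bits_per_rule; each
    l-bit slice is read as an unsigned integer b and mapped linearly onto
    [z_min, z_max], following the decoding formula above."""
    max_int = 2 ** bits_per_rule - 1
    singletons = []
    for r in range(n_rules):
        chunk = bit_string[r * bits_per_rule:(r + 1) * bits_per_rule]
        b = int("".join(str(bit) for bit in chunk), 2)
        singletons.append(z_min + (z_max - z_min) * b / max_int)
    return singletons

def output_force(rule_strengths, singletons):
    """Layer 5): weighted sum of recommendations, weights = rule strengths."""
    total = sum(rule_strengths)
    if total == 0.0:
        return 0.0
    return sum(w * z for w, z in zip(rule_strengths, singletons)) / total

# Example: 97 rules coded with 5 bits each give a 485-bit string; each decoded
# value is the singleton used by the corresponding layer-4) node.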

C. Stochastic Action Modifier

The SAM is an abstract machine that randomly generates actions according to a stored probability distribution and receives feedback from the environment evaluating the effects of those actions. If the machine is capable of using this feedback to update its distribution so as to increase the expectation of favorable evaluations for its future actions, the machine represents a stochastic learning system. Here, we use the value of the state evaluation $v$ and the recommended action $F$ to generate at random an action $F'$

to be actually applied to the plant to be controlled; $F'$ is a normally distributed random variable with mean value $F$ and standard deviation $\sigma$. Here, the predicted reinforcement is used to compute the standard deviation as

$$ \sigma(t) = \sigma\big(v(t, t-1)\big) $$

where $\sigma(\cdot)$ is a monotonically decreasing nonnegative function. Moreover, $\sigma$ vanishes at the maximum predicted reinforcement, so that when the maximum reinforcement is expected, the standard deviation is zero. The stochastic perturbation of the suggested action leads to better exploration of the state space and better generalization ability. If the prediction of reinforcement is high, the magnitude of the deviation should be small, and so should $\sigma$; conversely, if the prediction of reinforcement is low, $\sigma$ should be large so that the unit explores a wider interval in its output range. The result is that a large random step away from the recommendation is taken when the last action performed was bad, whereas the controller remains consistent with the fuzzy control rules when the previous action selected was a good one. The actual form of the function $\sigma(\cdot)$, especially its scale and rate of decrease, should take into account the units and range of variation of the output variable.

D. Steps Accumulator

An accumulator plays the role of a relative performance measure: it accumulates the number of steps until a failure occurs. In this paper, the feedback takes the form of an accumulator that records how long the pole stays up and how long the cart avoids the ends of the track; this count is used as the relative measure of fitness for the genetic algorithm.

IV. SIMULATION RESULTS

The GRFLC approach proposed here was applied to the cart-pole balancing problem.

A. The Cart-Pole Balancing Problem

The cart-pole task involves a pole hinged to the top of a wheeled cart that travels along a track; the cart and pole are constrained to move within the vertical plane. The system state is specified by four real-valued variables: $x$, the horizontal position of the cart; $\dot x$, the velocity of the cart; $\theta$, the angle of the pole with respect to the vertical line; and $\dot\theta$, the angular velocity of the pole. The control input is the force $F \in [-10, 10]$ newtons applied to the cart. The dynamics of the cart-pole system are modeled


by the following nonlinear differential equations [13]:

$$ \ddot\theta = \frac{ g \sin\theta + \cos\theta \left[ \dfrac{ -F - m_p l \dot\theta^{2} \sin\theta + \mu_c\, \operatorname{sgn}(\dot x) }{ m_c + m_p } \right] - \dfrac{ \mu_p \dot\theta }{ m_p l } }{ l \left[ \dfrac{4}{3} - \dfrac{ m_p \cos^{2}\theta }{ m_c + m_p } \right] } $$

$$ \ddot x = \frac{ F + m_p l \left( \dot\theta^{2} \sin\theta - \ddot\theta \cos\theta \right) - \mu_c\, \operatorname{sgn}(\dot x) }{ m_c + m_p } $$

where $g$ is the acceleration due to gravity, $m_c$ is the mass of the cart, $m_p$ is the mass of the pole, $l$ is the half-pole length, $\mu_c$ is the coefficient of friction of the cart on the track, and $\mu_p$ is the coefficient of friction of the pole on the cart. These equations were simulated by the Euler method, which uses an approximation to the above equations with a time step of 20 ms (a brief simulation sketch is given after the list in the following subsection). The goal of the cart-pole task is to apply forces of unfixed magnitude to the cart such that the pole stays balanced and the cart does not hit the edge of the track. Bounds on the pole angle and on the cart's horizontal position specify the states for which a failure signal occurs. There is no unique solution to the problem: any trajectory through the state space that does not result in a failure signal is acceptable. The only information regarding the goal of the task is provided by the failure signal, which indicates either the pole falling past $\pm 12^{\circ}$ or the cart hitting the bounds of the track at $\pm 2.4$ m; these two kinds of failure are not distinguishable in the case considered herein. The goal, as just stated, makes this task very difficult: the failure signal is a delayed and rare performance measure.

B. Applying GRFLC to the Cart-Pole Balancing

Fig. 4 presents the GRFLC architecture as applied to the simulated problem. The AEN has four input units, a bias input unit, five hidden units, and an output unit. The input state vector is normalized so that the pole and cart positions lie in the range [0, 1]; the velocities are also normalized, but they are not constrained to lie in any range. The weights of this net are initialized randomly to values in $[-0.01, 0.01]$. The external reinforcement is received by the AEN and used to calculate the internal reinforcement. The action-generation network is selected as follows.

1) Layer 1) has four inputs, which are the system states $x$, $\dot x$, $\theta$, and $\dot\theta$.

2) This layer has 16 units, which are the linguistic values of the system states: two of the states are described by five linguistic values each, and the other two by three values each.

3) We use 97 rules rather than the full 225 rules ($5 \times 5 \times 3 \times 3$) in order to reduce the number of rule nodes for the GA, since the remaining state combinations are treated as don't-care conditions whenever one of the five-valued variables is PB (or NB). We start with random rules, and they are modified through learning.

4) Each decision variable is coded as a binary unsigned integer of length $l = 5$; thus, every rule set for the fuzzy logic controller is represented by a string of length $97 \times 5 = 485$. In decoding, $z_{\max}$ is 10 N and $z_{\min}$ is $-10$ N. The probability of crossover is 0.7 and that of mutation is 0.001 in our simulations; population sizes of 5, 50, and 100 are used later.

5) Only one output unit is used to compute the force.
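The following is a minimal sketch of one Euler step of these dynamics together with the failure test just described. The pole and cart masses and the half-pole length are those given in Section IV-C; the gravitational constant and the friction coefficients are not stated in this letter, so the values shown (taken from the common formulation of [13]) are assumptions, as are all names.

import math

GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
HALF_POLE_LENGTH = 0.5
MU_CART, MU_POLE = 0.0005, 0.000002   # friction coefficients: assumed values,
                                      # following Barto et al. [13]
TIME_STEP = 0.02                      # 20-ms Euler integration step

def cart_pole_step(x, x_dot, theta, theta_dot, force):
    """One Euler step of the cart-pole dynamics in the equations above."""
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (-force - POLE_MASS * HALF_POLE_LENGTH * theta_dot ** 2 * sin_t
            + MU_CART * math.copysign(1.0, x_dot)) / total_mass
    theta_acc = ((GRAVITY * sin_t + cos_t * temp
                  - MU_POLE * theta_dot / (POLE_MASS * HALF_POLE_LENGTH))
                 / (HALF_POLE_LENGTH
                    * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)))
    x_acc = (force + POLE_MASS * HALF_POLE_LENGTH
             * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_CART * math.copysign(1.0, x_dot)) / total_mass
    # Euler update of the four state variables.
    x += TIME_STEP * x_dot
    x_dot += TIME_STEP * x_acc
    theta += TIME_STEP * theta_dot
    theta_dot += TIME_STEP * theta_acc
    # Failure whenever the pole falls past 12 degrees or the cart leaves the track.
    failed = abs(theta) > math.radians(12.0) or abs(x) > 2.4
    return (x, x_dot, theta, theta_dot), failed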

Fig. 4. GRFLC applied to the cart-pole balancing.

C. The Simulation

We implemented the simulated system (shown in Fig. 4) on a Sun workstation. The parameter values used in the simulation are half-pole length 0.5 m, pole mass 0.1 kg, and cart mass 1.0 kg. The learning system was tested for six runs. A run is called a “success” whenever the number of steps before failure is greater than 500 000, as in Barto et al. [13] (this corresponds to about 2.8 h of real time). The external reinforcement is -1 when the failure signal occurs; otherwise, it is zero. Fig. 5(a) and (b) shows the results of our simulations with population sizes 50 and 100, respectively. In each figure, the system is tested for six runs, and a run is terminated after 500 000 time steps. Our simulations show that increasing the population size yields better performance, whereas smaller populations tend to learn faster but behave more poorly. Fig. 6(a) and (b) shows the values of the pole angle and the cart position for a half-pole length of 0.5 m and a pole mass of 0.1 kg. In each figure, the first 1000 time steps show the performance of the controller during the initial portion of a trial; the second 1000 time steps show the performance after 100 000 time steps; the third and fourth 1000 time steps show the performance after 200 000 and 300 000 time steps, respectively; and the last 1000 time steps show the end of the trial, in which the controller has learned to balance the system for at least 500 000 time steps.

To show the adaptability of the system, the length and mass of the pole were changed in four experiments. In the first two, the original mass of the pole is increased by a factor of five and ten, respectively. In the third, the original length of the pole is increased by a factor of ten and the mass of the pole is reduced to one-fifth of its original value. In the last,



Fig. 6. (a) State performance of the pole position. (b) State performance of the cart position.

Fig. 5. (a) Learning curves with population size 50. (b) Learning curves with population size 100.

the original length of the pole is increased by a factor of 20 and the mass of the pole is reduced to one-tenth of its original value. Without any further trials, the system successfully completed these tasks. Finally, we damaged the membership functions in the antecedents of the control rules by shifting the centers of the antecedent-label membership functions. For example, the membership functions positive small (PS) and negative small (NS) of one input variable are shifted by a factor of two to the right and to the left, respectively, and each membership function of two further input variables is also shifted by a factor of two. Fig. 7 shows the result of our simulation after damaging the membership functions with population size 100; the value of the pole angle is shown in Fig. 8.

Fig. 7. Learning curves with population 100 after damaging membership functions.

V. CONCLUSION

The proposed GRFLC represents a new way of building a self-learning fuzzy controller. The computer simulation results show that it can automatically generate the control rules and the consequent membership functions without expert knowledge. Because the GA can seek the optimal membership functions in the consequents of the control rules, GRFLC can compensate for inappropriate definitions of the membership functions in the antecedents of the control rules.

Fig. 8. State performance of the pole position after damaging membership functions.


GRFLC is applicable to control problems for which analytical models of the process are unknown, since it learns to predict the behavior of a physical system through its action-evaluation network. The approach described here may be viewed as a step in the development of a better understanding of how to combine a fuzzy logic controller with a neural network to achieve a significant learning capability.

REFERENCES

[1] O. Yagishita, O. Itoh, and M. Sugeno, “Application of fuzzy reasoning to the water purification process,” in Industrial Application of Fuzzy Control, M. Sugeno, Ed. Amsterdam, The Netherlands: North-Holland, 1985, pp. 19–40.
[2] J. A. Bernard, “Use of rule-based system for process control,” IEEE Contr. Syst. Mag., vol. 8, no. 5, pp. 3–13, 1988.
[3] Y. Kasai and Y. Morimoto, “Electronically controlled continuously variable transmission,” in Proc. Int. Congr. Transport. Electron., Dearborn, MI, Oct. 1988, pp. 69–85.
[4] T. J. Procyk and E. H. Mamdani, “A linguistic self-organizing process controller,” Automatica, vol. 15, no. 1, pp. 15–30, 1979.
[5] C. T. Lin and C. S. George Lee, “Neural-network-based fuzzy logic control and decision system,” IEEE Trans. Comput., vol. 40, pp. 1320–1336, Dec. 1991.
[6] W. Z. Qiao, P. Z. Wang, T. H. Heng, and S. S. Song, “A rule self-regulating fuzzy controller,” Fuzzy Sets Syst., vol. 47, pp. 13–21, 1992.
[7] C. C. Lee, “A self-learning rule-based controller employing approximate reasoning and neural net concepts,” Int. J. Intell. Syst., vol. 6, pp. 71–93, 1991.
[8] H. R. Berenji, “A reinforcement learning-based architecture for fuzzy logic control,” Int. J. Approx. Reasoning, vol. 6, pp. 267–292, 1992.
[9] H. R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers through reinforcements,” IEEE Trans. Neural Networks, vol. 3, pp. 724–740, 1992.
[10] A. G. Barto and M. I. Jordan, “Gradient following without backpropagation in layered networks,” in Proc. IEEE 1st Annu. Conf. Neural Networks, San Diego, CA, 1987, pp. 629–636.
[11] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM J. Res. Develop., vol. 3, pp. 210–229, 1959.
[12] D. Michie and R. A. Chambers, “BOXES: An experiment in adaptive control,” in Mach. Intell., vol. 2, pp. 137–152, 1968.
[13] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834–846, Sept./Oct. 1983.
[14] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[15] C. W. Anderson, “Learning and problem solving with multilayer connectionist systems,” Ph.D. dissertation, Univ. Massachusetts, Amherst, MA, 1986.
[16] R. S. Sutton, “Temporal credit assignment in reinforcement learning,” Ph.D. dissertation, Univ. Massachusetts, Amherst, MA, 1984.
[17] D. Rumelhart, G. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing, D. Rumelhart and J. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[18] C. W. Anderson, “Strategy learning with multilayer connectionist representations,” Tech. Rep. TR87-509.3, GTE Labs., May 1988.
[19] D. Rumelhart, G. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing, D. Rumelhart and J. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318–362.
