LEARNING TO DETECT AND ADAPT TO UNPREDICTED CHANGES

by Nadeesha Oliver Ranasinghe

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2012

Copyright 2012

Nadeesha Oliver Ranasinghe


Acknowledgements

First, I would like to express my deepest gratitude to my advisor Prof. Wei-Min Shen without whom this work would not even be conceived. In particular, Prof. Shen provided me with the necessary motivation, knowledge and most importantly permission to build on his own dissertation research. I have learned from him that consideration, determination, hard work and an endless strive for perfection are qualities of a good researcher and a great human being.

I would also like to thank Prof Ramakant Nevatia, Prof Michael Safonov, Prof Laurent Itti and Prof Yu-Han Chang for giving me their precious time by being on my committees, providing me with invaluable feedback and an abundance of encouragement.

I am eternally grateful to my wife, parents, sister and in-laws for their unwavering support throughout this process. My father, mother, wife and daughter (1+ year) have helped me focus on this research by addressing all other challenges around me. They are my light against the darkness of uncertainty.

Finally, thank you to all the members of the Polymorphic Robotics Laboratory for their help over the years. Special thanks to Harris Chiu, Dinesh Hiripitiyage, Kary Lau, Lizsl DeLeon, Feili Hou, Jacob Everist, Mike Rubenstein, Behnam Salemi, Akiya Kamimura, Prof Peter Will, Luenin Barrios, Teawon Han, Joseph Chen, TJ Collins and all my friends.


Table of Contents

Acknowledgements ... ii
List of Tables ... vii
List of Figures ... viii
Abstract ... xi
Chapter 1: Introduction ... 1
1.1 Problem Statement ... 1
1.2 Motivation of Our Approach ... 2
1.3 Scope and Assumptions ... 3
1.4 Scientific Contributions ... 5
1.5 Dissertation Organization ... 6
Chapter 2: Overview of the State-of-the-Art ... 7
2.1 Inspirations from Developmental Psychology ... 7
2.2 Artificial Intelligence ... 8
2.3 Adapting to Unpredicted Changes (including Fault & Failure Tolerance) ... 13
2.4 Reasoning with Unpredicted Interference ... 16
2.5 Summary ... 17
Chapter 3: Surprise Based Learning ... 21
3.1 Terminology ... 21
3.2 Illustrative Example - Constrained Box Environment ... 26
3.3 Learning a Prediction Model ... 29
3.4 Life Cycle of Prediction Rules ... 32
3.4.1 Rule Creation ... 34
3.4.2 Surprise Detection and Analysis ... 36
3.4.3 Rule Splitting ... 40
3.4.4 Rule Refinement ... 43
3.5 Using a Prediction Model ... 46
3.6 Goal Management (Knowledge Transfer) ... 48
3.7 Probabilistic Rules and Forgetting ... 50
3.8 Entity & Attribute Relevance ... 53
3.9 SBL Pseudocode ... 56
3.10 Summary ... 58
Chapter 4: Evaluation Strategy ... 59
4.1 Experimental Environments ... 59
4.1.1 Game Environments ... 60
4.1.2 Real-world Office Environment ... 62
4.2 Evaluation Methods ... 63
4.2.1 Comparison via Simulation ... 63
4.2.2 Feasibility in Real-World ... 66
4.3 Summary ... 67
Chapter 5: Structure Learning ... 68
5.1 Scalability by Exploiting Structure ... 68
5.1.1 Approach ... 69
5.1.2 Results & Discussion ... 70
5.2 Constructing Useful Models with Comparison Operators ... 76
5.2.1 Approach ... 77
5.2.2 Results & Discussion ... 78
5.3 Impact of Training ... 78
5.3.1 Approach ... 79
5.3.2 Results & Discussion ... 80
5.4 Summary ... 81
Chapter 6: Learning from Uninterpreted Sensors and Actions ... 82
6.1 Discretizing a Continuous Sensor ... 82
6.1.1 Approach ... 82
6.1.2 Results & Discussion ... 83
6.2 Combining Multiple Uninterpreted Sensors ... 84
6.2.1 Approach ... 85
6.2.2 Results & Discussion ... 86
6.3 Scalability in the Number of Sensors and Actions ... 89
6.3.1 Approach ... 89
6.3.2 Results & Discussion ... 90
6.4 Summary ... 94
Chapter 7: Detecting and Adapting to Unpredicted Changes ... 95
7.1 Unpredicted Directly Observable Goal Changes ... 95
7.1.1 Approach ... 95
7.1.2 Results & Discussion ... 97
7.2 Unpredicted Indirectly Observable Goal Changes ... 99
7.2.1 Approach ... 100
7.2.2 Results & Discussion ... 101
7.3 Unpredicted Configuration Changes in the Environment ... 105
7.3.1 Approach ... 105
7.3.2 Results & Discussion ... 106
7.4 Unpredicted Sensor Changes ... 108
7.4.1 Approach ... 108
7.4.2 Results & Discussion ... 109
7.5 Unpredicted Action Changes ... 110
7.5.1 Approach ... 110
7.5.2 Results & Discussion ... 111
7.6 Repairing vs. Rebuilding the Learned Model from Scratch ... 113
7.6.1 Approach ... 113
7.6.2 Results & Discussion ... 114
7.7 Relevance and Unpredicted Sensor & Action Changes ... 115
7.7.1 Approach ... 115
7.7.2 Results & Discussion ... 117
7.8 Simultaneous Unpredicted Changes in Sensors, Action & Goals ... 119
7.8.1 Approach ... 120
7.8.2 Results & Discussion ... 121
7.9 Simultaneous Unpredicted Changes in Sensors, Action, Goals & the Environment's Configuration ... 123
7.9.1 Approach ... 123
7.9.2 Results & Discussion ... 124
7.10 Summary ... 126
Chapter 8: Detecting and Reasoning with Unpredicted Interference ... 128
8.1 Noise & Missing Data ... 128
8.2 Experimental Setup ... 129
8.3 Approach ... 132
8.3.1 Model Learning Phase ... 133
8.3.2 Similarity Metric & Similarity Bounds Learning Phase ... 136
8.3.3 Testing Phase ... 137
8.3.4 Noisy and Gapped Recognition ... 138
8.4 Results & Discussion ... 140
8.4.1 Recognition ... 140
8.4.2 Gap Filling ... 144
8.5 Summary ... 146
Chapter 9: Conclusion ... 147
9.1 Summary and Contributions ... 147
9.2 Future Research Directions ... 150
References ... 152


List of Tables

Table 1: Comparison of some competitive learning algorithms ... 17
Table 2: Example of rule creation ... 35
Table 3: Example of surprise analysis ... 38
Table 4: Example of rule splitting using causes from Table 3 ... 41
Table 5: Example of rule refinement ... 44
Table 6: Prediction model discretizing a continuous sensor ... 83
Table 7: Adapting to action & sensor changes in the constrained box environment ... 117
Table 8: Summary of contributions ... 147


List of Figures

Figure 1: a) Overhead view b) SuperBot & sensors c) Vision & range sensor ... 26
Figure 2: SBL Process ... 29
Figure 3: SBL architecture ... 29
Figure 4: Life cycle of a prediction rule ... 32
Figure 5: a) Robot's location b) Base-condition c) Base-result ... 36
Figure 6: a) base-condition b) base-result/surprised-condition c) surprised result ... 38
Figure 7: a) Base-consequence b) Surprised-condition c) Surprised-consequence ... 45
Figure 8: a) After 1st/before 2nd b) After 2nd/before 3rd action c) after 3rd action ... 52
Figure 9: a) Hunter-goal layout b) Hunter-prey layout ... 60
Figure 10: Layout of my office room ... 62
Figure 11: Actions to learn each map w/o obstacles from every starting location ... 70
Figure 12: Data from Figure 11 excluding the "Random exploration" series ... 70
Figure 13: Learning time for each map w/o obstacles from every starting location ... 71
Figure 14: Actions to learn each map w obstacles from every starting location ... 72
Figure 15: Learning time for each map w obstacles from every starting location ... 73
Figure 16: Actions executed to learn each chaotic map ... 74
Figure 17: Data from Figure 16 excluding the "Random exploration" series ... 74
Figure 18: Learning time for each chaotic map ... 75
Figure 19: Actions to reach the goal after the specified amount of training runs ... 80
Figure 20: Proximity sensor response ... 83
Figure 21: Actions executed for hunter to catch prey in each map ... 86
Figure 22: Execution time for the hunter-prey solution for each map ... 87
Figure 23: Impact of increasing the available actions ... 90
Figure 24: Impact of increasing the number of dummy constant valued sensors ... 91
Figure 25: Impact of increasing the number of dummy random valued sensors ... 92
Figure 26: Scalability in the number of irrelevant sensors ... 93
Figure 27: Actions to reach dynamic goal locations in each map w/o obstacles ... 97
Figure 28: Actions to reach dynamic goal locations in each map with obstacles ... 98
Figure 29: Response of fixed platform with new starting locations at run 1, 4 & 7 ... 101
Figure 30: a) Initial b) Run 1 c) Run 2 d) Run 7 ... 102
Figure 31: Response of fixed start location with platform moving at run 1, 4 & 7 ... 103
Figure 32: a) Initial b) Run 1 c) Run 2 d) Run 7 ... 103
Figure 33: The hunter cannot traverse the gray obstacles ... 105
Figure 34: Actions executed to reach the goal in each run w environmental changes ... 106
Figure 35: No of surprises encountered during each run w environmental changes ... 107
Figure 36: Actions executed to reach the goal in each run with sensor changes ... 109
Figure 37: Number of surprises encountered during each run with sensor changes ... 109
Figure 38: Actions executed to reach the goal in each run with action changes ... 111
Figure 39: Number of surprises encountered during each run with action changes ... 112
Figure 40: Actions executed when the model is repaired vs. rebuilt ... 114
Figure 41: Surprises encountered with repair vs. rebuild ... 114
Figure 42: Goal locations in hunter layout ... 120
Figure 43: Actions executed to reach the goal in each run with several changes ... 121
Figure 44: Number of surprises encountered during each run with several changes ... 121
Figure 45: Actions executed to reach the goal in each run with several changes ... 124
Figure 46: Number of surprises encountered during each run with several changes ... 125
Figure 47: Action recognition system dataflow ... 130
Figure 48: a) Frame 1 b) Frame 5 ... 131
Figure 49: a) Frame 10 b) Frame 30 ... 131
Figure 50: SBL action recognition process ... 132
Figure 51: Relationship between examples, models and an action ... 134
Figure 52: Graphical depiction, sensor data and prediction model for "replace" ... 135
Figure 53: Segmented data with fired rules for "replace" ... 135
Figure 54: Markov chain for "replace" ... 135
Figure 55: a) Missing start data b) Expected results for postdiction ... 138
Figure 56: a) Stream with missing data b) Expected results for interpolation ... 139
Figure 57: a) Missing end data b) Expected results for prediction ... 139
Figure 58: Action recognition results with positive examples and positive models ... 142
Figure 59: Action recognition results with different combinations of models ... 143
Figure 60: Action recognition results with gap filling ... 145


Abstract

To survive in the real world, a robot must be able to intelligently react to unpredicted and possibly simultaneous changes to itself (such as its sensors, actions, and goals) and to dynamic situations/configurations in the environment. Typically a great deal of human knowledge is required to transfer essential control details to the robot, which precisely describe how to operate its actuators based on environmental conditions detected by sensors. Despite the best preventative efforts, unpredicted changes such as hardware failure are unavoidable. Hence, an autonomous robot must detect and adapt to unpredicted changes in an unsupervised manner.

This dissertation presents an integrated technique called Surprise-Based Learning (SBL) to address this challenge. The main idea is to have a robot perform both learning and representation in parallel by constructing and maintaining a predictive model which explains the interactions between the robot and the environment. A robot using SBL engages in a life-long cyclic learning process consisting of "prediction, action, observation, analysis (of surprise) and adaptation". In particular, the robot always predicts the consequences of its actions, detects surprises whenever there is a significant discrepancy between the prediction and observed reality, analyzes the surprises for their causes (correlations) and uses critical knowledge extracted from the analysis to adapt itself to unpredicted situations.


SBL provides four new contributions to robotic learning. The first contribution is a novel method for structure learning capable of learning accurate enough models of interactions in an environment in an unsupervised manner. The second contribution is learning directly from uninterpreted sensors and actions with the aid of a few comparison operators. The third contribution is detecting and adapting to simultaneous unpredicted changes in sensors, actions, goals and the environment. The fourth contribution is detecting and reasoning with unpredicted interference over a short period of time.

Experiments on both simulation and real robots have shown that SBL can learn accurate models of interactions and successfully adapt to unpredicted changes in the robot’s actions, sensors, goals and the environment’s configuration while navigating in different environments. Experiments on surveillance videos have shown that SBL can detect interference, and recover some information that was hidden from sensors, in the presence of noise and gaps in the data stream.


Chapter 1 Introduction

1.1 Problem Statement

No matter how carefully a robot is engineered, the initial knowledge of the robot is bound to be incomplete or incorrect with respect to the richness of the real world. Thus, an autonomous robot must deal with dynamic situations caused by changes in its sensors, actions and goals, changes in the environment and interference.

Specifically, we define "unpredicted changes" as the addition and deletion of sensors, actions and goals, alterations in the environment's configuration, and changes in the definition of sensors and actions. A definition-change in a sensor means that its typical response to stimuli has changed, e.g. a camera being accidentally "twisted". A definition-change in an action means that its actuator response has changed, e.g. a crossed wire has turned left-turn into right-turn and vice versa. These unpredicted changes occur instantaneously and could last over an indefinite period of time. Therefore, the robot must adapt to such changes as soon as possible. In contrast, we define "interference", such as noise and missing data or gaps in sensor data streams, as unpredicted changes that occur over a finite duration. Typically, the robot must be able to detect such situations and recover any missing information. So our objective in this dissertation is to develop a solution to address all of these unpredicted changes.

There are several major challenges for this problem. Unpredicted changes may occur simultaneously in sensors, actions, goals or other related aspects. Thus, a robot cannot assume that any particular component is fault-free, or that redundancy is guaranteed. Due to the lack of permanently correct models for sensors, actions and environments, the robot must constantly check and refine its models. The detection of change may not be aided by external supervision or guidance. A fast response time is required despite resource limitations on a robot, and in most cases the robot must cope with continuous (non-discrete), uncertain and vast information/action spaces.

1.2 Motivation of Our Approach

Events such as the Spirit Mars rover getting stuck in soft soil [Coul09] validated the fact that no matter how carefully a robot is engineered, its initial knowledge is bound to be incomplete or incorrect with respect to the richness of the real world. Similarly, the same robot received some damage to a control circuit resulting in it being unable to turn its right-front wheel, forcing it to drive backwards dragging the dead wheel [WB09]. This and many other examples justify the fact that an autonomous robot must deal with dynamic situations caused by unpredicted changes in its sensors, actions, goals and environmental configurations.

Motivated by children's developmental psychology, we are inspired to develop a lifelong cyclic learning process for an autonomous robot. We believe that a child is born with very limited knowledge or expectations of his or her actions and interactions with the environment, but as children interact with the world they extract valuable knowledge from these experiences to develop the ability to predict the future outcomes of their actions even in an ever changing world. As a human progresses through life, their physical characteristics or sensing and actuating capabilities change, such as changes in eyesight, hearing, physical strength etc. Clearly, we are able to adjust and continue with our goals, so a truly autonomous robot must also be able to handle changes in its sensors, actuators, environments and goals, as upgrades, failures and re-tasking are inevitable during its lifetime. Thus, our primary motivation was to create a learning algorithm for robots inspired by a human's process of "prediction, action, observation, analysis (of surprise), and adaptation". Eventually we hope that this will not only advance the fields of robotics and artificial intelligence, but also be a vehicle that will contribute towards the field of developmental learning.

1.3 Scope and Assumptions

Surprise-Based Learning was developed to facilitate life-long learning for an unsupervised robot or autonomous agent by systematically addressing the challenges outlined earlier. Throughout this dissertation we assume that the robot has sufficient redundancy in hardware to compensate for any "surprise" or unpredicted change. In other words we assume that a cause for a surprise (or a reason for an unpredicted change) is observable, such that the learning could converge to a solution that exists. This assumption is practical in sensor rich applications like robotic navigation. Yet, there are situations where the cause may not be directly observable due to hidden states. Although there are strategies such as "local distinguishing experiments" [Shen93a] that could be performed to disambiguate hidden states, they will not be addressed in the scope of this dissertation.

We refer to a structured response to a stimulus from the environment as an “entity”. This research assumes that a sensor maps to one or more entities, each entity has one or more attributes, and an attribute has a value obtained during an observation. For example, a camera sensor could return a set of entities such as uniquely identified objects or colored blobs, which could have the attributes size and location, while a proximity sensor would return an entity corresponding to distance with the attribute size. The mapping from a sensor to entities and their attributes must be provided by a user as preprocessing is required for complex sensors (i.e. multi-dimensional data), while raw or uninterpreted data from simple sensors (i.e. one-dimensional data) can be fed directly to the learner.

It is important to define the sensors, entities and attributes adequately as the learner can only converge on goals defined in terms of them. Any inadequate definitions may result in random action execution dominating over goal directed behavior as the learned model is unable to plan towards the goals. In order to determine adequate mappings a user should identify a set of goals that the robot is expected to achieve, then work backwards by identifying the desired attributes, corresponding entities and which sensors would most likely satisfy these goals. Note that it is not necessary to identify the most appropriate mapping. Providing one or more adequate mappings would ensure learning.
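As a concrete illustration of working backwards from a goal, the minimal sketch below (Python) writes down one adequate sensor-to-entity-to-attribute mapping and checks that a goal only refers to attributes the mapping exposes. The identifiers (camera, proximity, red_blob, the goal threshold) are hypothetical placeholders, not definitions from this work.

# Hypothetical sensor -> entity -> attribute mapping, chosen by working
# backwards from a goal: attributes the goal needs -> entities carrying them
# -> sensors producing those entities.
sensor_mapping = {
    "camera": {                       # complex sensor: preprocessed into color blobs
        "red_blob":   ["size", "x_location", "y_location"],
        "white_blob": ["size", "x_location", "y_location"],
    },
    "proximity": {                    # simple sensor: raw value as one entity
        "distance": ["size"],
    },
}

# A goal expressed purely in terms of mapped entities and attributes,
# e.g. "the red blob should occupy most of the view" (threshold is made up).
goal = ("red_blob", "size", ">", 500)

def mapping_is_adequate(mapping, goal):
    # An adequate mapping must expose every entity/attribute the goal refers to.
    entity, attribute, _, _ = goal
    return any(attribute in entities.get(entity, [])
               for entities in mapping.values())

print(mapping_is_adequate(sensor_mapping, goal))   # True for the mapping above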


This research does not assume that the robot can be reset to its initial configuration prior to each experiment in the real world. However, it does assume the continuity of entities in consecutive observations. For example, in an environment consisting of uniquely colored walls, if the robot perceives a blob of a certain color at a particular size in the first observation, and a blob of the same color but a different size in the next observation, it assumes that these blobs represent the same entity. To ensure the validity of this assumption the duration of an action between consecutive observations is defined to be sufficiently small.

Finally, when learning to detect and adapt to interference this research assumes that either no interference or only a tolerable amount of interference is present in the data during the learning phase, and that no further adaptation of the learned model is allowed during the testing phase.

1.4 Scientific Contributions

This dissertation investigates the problem of a robot autonomously detecting and adapting to unpredicted changes. There are four contributions to robotic learning.

The first contribution is a new machine learning technique called Surprise-Based Learning for structure learning in robotics. SBL captures the structure, i.e. the patterns in the sensed data associated with each action, in a set of prediction rules by detecting and eliminating surprises.

The second contribution is learning directly from uninterpreted sensors and actions. More precisely, SBL can learn from both interpreted and uninterpreted sensors by discretizing continuous sensor data through the application of comparison operators.

The third contribution is detecting and adapting to simultaneous unpredicted changes in sensors, actions, goals and the environment. SBL achieves this by combining its abilities of detecting and identifying the independent unpredicted changes.

The fourth contribution is detecting and reasoning with unpredicted interference over a short period of time. The interference considered here includes temporary noise and gaps in data.

1.5 Dissertation Organization

This dissertation is organized into 9 chapters. Following the introduction in this chapter, Chapter 2 outlines the background and related work in developmental learning, artificial intelligence, and autonomous robot adaptation. Chapter 3 details the Surprise-Based Learning approach. Chapter 4 describes experimental environments and a strategy for evaluating the contributions. Chapters 5 to 7 present experiments and results on the contributions of structure learning, learning from uninterpreted sensors & actions, and detecting and adapting to unpredicted changes. Chapter 8 describes the contribution of detecting and reasoning with unpredicted interference, with its results. Finally, Chapter 9 provides a summary and some future research directions.

Chapter 2 Overview of the State-of-the-Art

For autonomous detection and adaptation to unpredicted changes, a number of challenges must be addressed. This chapter presents inspirations drawn from human developmental psychology and reviews some current artificial intelligence algorithms capable of addressing these challenges, with specific attention to related research on adaptation.

2.1 Inspirations from Developmental Psychology

The basic concept of surprise-based learning was first proposed by Shen and Simon in 1989 [SS89] and later formalized as Complementary Discrimination Learning [Shen90, Shen93a, SS93]. This learning paradigm stems from Piaget’s theory of Developmental Psychology [Piag52], Herbert Simon’s theory on dual-space search for knowledge and problem solving [SL74], and C.S. Peirce’s method for science that “our idea of anything is our idea of its sensible effects” [Peir1878].

Over the years, researchers have attempted to formalize this intuitively simple but powerful idea into an effective and general learning technique. A number of experiments in discrete or symbolic environments have been carried out with successes, including the developmental psychology experiments for children to learn how to use novel tools [Shen94], scientific discovery of hidden features (genes) [Shen89, SS93, Shen95], game playing [Shen93b], and learning from large knowledge bases [Shen92]. Here, we generalize these previous results for real robots to learn autonomously from continuous and uncertain environments.

Another important inspiration drawn from developmental psychology is a behavioral procedure called the water maze [Morr84], which was designed by Richard G. Morris to test spatial memory. Although it was originally developed to test learning in animals, it has since been modified for evaluating the same in autonomous robots.

2.2 Artificial Intelligence

In computer science, SBL is related to several solutions for the inverse problem, such as Gold's algorithm for system identification in the limit, Angluin's L* algorithm [Angl87] for learning finite state machines with hidden states using queries and resets, the L*-extended algorithm by Rivest and Schapire [RS93] using homing sequences, and the D* algorithm based on local distinguishing experiments [Shen93a].

For learning from stochastic environments, SBL is related to learning hidden Markov models (HMM) [Murp04], partially observable Markov decision processes [Cass99], and most recently, predictive state representations [WJS05] and temporal difference (TD) algorithms [ST05].


Some systems also incorporate novelty [HW02], with a flavor of surprise, into the value function of states, although these signals are not used to modify the learned models. The notion of prediction is common in both developmental psychology and AI. Piaget's constructivism [Piag52] and Gibson's affordance [Gibs79] are two famous examples. In AI systems, the concepts of schemas [Dres91] and fluents [CAOB97] all resemble the prediction rules we use here, although their primary use is not for detecting and analyzing surprises as we do here. Different from these results, we present a new technique that capitalizes on predictions and surprises to facilitate simultaneous learning and representation.

At present most learning algorithms can be classified as supervised, unsupervised or reinforcement learning [KB72]. Supervised learning (SL) requires the use of an external supervisor that may not be present here. In contrast, unsupervised learning (UL) opts to learn without any feedback from the environment by attempting to remap its inputs to outputs, using techniques such as clustering. Hence, it may overlook the fact that feedback from the environment may provide critical information for learning.

Reinforcement learning (RL) receives feedback from the environment. Some of the more successful RL algorithms used in related robotic problems include Evolutionary Robotics [NF00] and Intrinsically Motivated Reinforcement Learning [SKB05]. However, most RL algorithms focus on learning a policy from a given discrete state model and the reward is typically associated with a single goal. This makes transferring the learned knowledge to other problems more difficult. Doya et al. addressed this shortcoming in Multiple Model-based Reinforcement Learning (MMRL) [DSKM02] by applying the concept of multiple paired forward-inverse models for motor control [WK98] to the RL paradigm. MMRL maintains several parallel models and switches between them as the goal changes, yet it needs a substantial amount of a priori knowledge about the environment and sensors. It is interesting to note that such techniques may have been inspired by unfalsified control theory [ST97], whereas SBL has a subtle difference in that it maintains a single model that adapts with experience.

The following are a few noteworthy extensions and applications of RL. Krichmar et al. [KNGE05] tested a brain-based device called “Darwin X” on a dry version of the water maze. This robot utilized visual cues and odometry as input and its behavior was guided by a simulated nervous system modeled on the anatomy and physiology of the vertebrate nervous system. Busch et al. [BSKS07] built on this idea by simulating a water maze environment to compare an attributed probabilistic graph search approach and a temporal difference reinforcement learning approach based solely on visual cues encoded via a self-organizing map which discretized the perceptual space. Stone et al. [SSK08] tested the simulated algorithm in a physical robot and extended it to facilitate adaptation of the reinforcement learning approach to the relocation of the hidden platform.

RL discretizes the perceptual space with a predefined approximation function. Instead, SBL learns a model of the environment as a set of prediction rules. RL requires a large number of training episodes to learn a path to a single goal. SBL learns via surprises with each action it takes, so it does not need separate training episodes. Most RL algorithms focus on learning a policy from a given discrete state model and the reward is typically associated with a single goal, making it difficult to transfer learned knowledge to other goals or to other problems which are similar. This fact was noted by Stone et al., prompting an extension of the original algorithm, but even with the short term and long term rewards the trajectories of the robot were still slightly biased when the goal was relocated. SBL accommodates multiple goals and can be transferred to other problems with minimal changes as the algorithm does not require any a priori information regarding the environment, the sensors or the actuators. In addition, SBL's ability to accommodate sensor and actuator failure during runtime without any external intervention makes it an ideal candidate for a physical robot operating in a real environment. Compared to the paradigm of reinforcement learning, SBL is model-based and offers quicker adaptation for large-scale and continuous problems.

Complementary Discrimination Learning (CDL) [Shen94] attempts to learn a model from a continuous state space and is capable of predicting future states based on the current states and actions. This facilitates knowledge transfer between goals and discovering new terms [Shen89]. However, CDL is a symbolic learning method that has not been applied to physical robots to learn directly from the physical world. In particular, CDL prescribes how the conditions in rules are to be altered to perform specialization and generalization, yet it does not specify how the prediction side should be utilized to capture relations between sensed entities. There are also some other activities that are necessary for robotic learning that have not been addressed in CDL, such as discretization of continuous sensors, rule forgetting and adapting to unpredicted changes.

Evolutionary Robotics (ER) is a powerful learning framework which facilitates the generation of robot controllers automatically using neural networks (NN) and genetic programming/algorithms (GA). The emphasis in ER is to learn a controller given some special parameters such as the number of neurons and layers in a NN or the fitness, crossover and mutation functions in a GA. These parameters are not easily discoverable for most robots in different environments. However, advances in ER such as the Exploration-Estimation Algorithm [LB04] proved that an internal model such as an action or sensor model of a robot can be learned without a priori knowledge, given that the environment and the robot can be reset to their initial configurations prior to each experiment, which may not be realistic in the real world.

Another promising approach is Intrinsically Motivated Reinforcement Learning (IMRL), where the robot uses a self-generated reward to learn a useful set of skills. IMRL has successfully demonstrated learning useful behaviors [OKH07], and a merger with ER as in [SMB07] was able to demonstrate navigation in a simulated environment. The authors have mentioned that IMRL can cope with situated learning, but the high-dimensional continuous space problem that embodiment produces is beyond its current capability.


There has been a large amount of research in model learning as in [PK97] and [SS06], yet the majority focus on action, sensor or internal models of the robot and not the external world. For the autonomous robotic learning problem we are interested in, the learner must accommodate learning models of interaction in an environment with limited processing, noisy actuation and noisy sensing available on a physical robot. Furthermore, the ability to predict and be “surprised” by any ill effects is also critical. The powerful paradigm of learning from surprises has been analyzed, theorized, explored and used in a few other applications such as traffic control [HJSL05] and computer vision [IB04].

Alternatively, for problems such as robot navigation, Simultaneous Localization and Mapping (SLAM) [Thru02] could be used to learn a map. Goals can be achieved by applying AI searching and planning techniques on this map. Accurate models for the robot’s actions and sensors must be given to facilitate this type of learning, yet such models are not readily available for most autonomous robots. Most importantly, SLAM techniques need carefully designed contingency plans to deal with unpredicted changes, whereas SBL can overcome such situations by autonomously adapting the learned model.

2.3 Adapting to Unpredicted Changes (including Fault & Failure Tolerance)

"Fault & failure tolerance", used in the context of a robot's ability to detect and recover from sensor and actuation failure, is related to adapting to unpredicted changes. It is important to note that unpredicted changes include interference in addition to fault & failure tolerance. Typically there are three attitudes towards dealing with unpredicted changes in robots, as identified by Saffiotti [Saff97]. The first is to get rid of it through precise engineering. The second is to tolerate it by utilizing redundancy with carefully designed contingency routines. The third is to reason about it by using techniques for the representation and manipulation of uncertain information. There is a rich body of research under all three attitudes, but we are interested in the second and third as they focus on imbuing intelligence in robots.

In order to tolerate unpredicted changes in sensors and actions the robot must be able to detect the change and then handle it. For this purpose one feasible approach is to provide models of the sensors, actions and environments, such that when an exception occurs it will be trapped and handled by separate contingency strategies. An early example of developing contingency strategies using sensor redundancy was given by Visinsky [Visi91]. Ferrell [Ferr94] demonstrated failure tolerance on a multi-legged robot that exploits redundancy in the number of legs and sensors using a predefined strategy to deactivate and reactivate each leg. Similarly, Kececi [KTT09] demonstrated a method for redundant manipulators. From Murphy’s [Murp96] early work to Stronger’s [SS06] recent work there has been research on learning sensor models under controlled conditions first, then using them in uncontrolled situations to detect anomalies. When goals change unpredictably methods such as model switching as shown by Doya [DSKM02] can be used. Unfortunately, in most practical applications it is not always possible to acquire accurate models of the sensors, actions or environments especially because hardware degrades over time and environments change.


The attitude towards reasoning about and manipulation of uncertainty, known as data-driven adaptation, aims to overcome the shortcomings of the previous attitude. Sensor fusion techniques demonstrated by Pierce [PK97] and Soika [Soik97] are good examples that show how probabilistic sensing from multiple modalities could be used to tolerate faults. Bongard [BZL06] further demonstrated how simulating and reasoning about uncertainties can be used to recover from unpredicted actuation errors in a legged robot. When the goals change unpredictably, as in the Morris water maze, Stone [SSK08] demonstrated the feasibility of updating the model and re-planning. However, most of these techniques rely on either the use of controllable environments, i.e. the ability to reset or alter them accordingly, or the availability of one or more models to verify another, i.e. using the sensor model to establish that an actuator has failed, etc. A very competitive solution for this problem was highlighted by Pierce [PK97], who demonstrated learning a map of the environment directly from uninterpreted sensors and actuators. They specify that the action model is learned with the aid of the previously learned sensor model, and the environment model is built using these two, yet they do not detail how to accommodate any unpredicted changes propagating from a previous model.

Despite numerous successes, none of these techniques attempt to deal with simultaneous unpredicted changes in all aspects, including changes in sensors, actions, goals and the environment’s configuration, in an unsupervised manner. This is a major motivation for the development of SBL [RS08a].


2.4 Reasoning with Unpredicted Interference

Hidden Markov models as demonstrated in [Diet02], conditional random fields as demonstrated in [BOP97] and neural networks have shown some promise towards detecting and reasoning with unpredicted interference. These techniques require human guidance to design approximation functions and determine the number of states or neurons. Although an exhaustive search for the number of states can be performed, this costs a large amount of time and data for training. In contrast, SBL is designed to learn the number of states from a few examples and reason with unpredicted interference.


2.5 Summary

Table 1: Comparison of some competitive learning algorithms
(Techniques compared: NN, GA, RL, TD, MMRL, IMRL, HMM, CDL and SBL. Capabilities compared: Solve Task, No Over Fitting, No Prior Models, Raw Data, Probabilistic Data, Knowledge Transfer, No Explicit Training, Fault Tolerance, Memory Forgetting, Runtime Change, Detect Anomaly and Recover Gap.)

Table 1 shows a comparison of the capabilities of various learning algorithms that are suitable for autonomous robotic learning. The columns are defined as follows:

- "Solve Task" means that the algorithm can learn to satisfy some goals.
- "No Over Fitting" marks scalability, as it implies generalization, resulting in compactness.
- "No Prior Models" means that the algorithm does not need an approximation function or a sensor, actuator or environment model in advance.
- "Raw Data" is the ability to handle data directly from the sensors rather than via a human-engineered approximation function or a preprocessor.
- "Probabilistic Data" means that non-deterministic environments can be dealt with.
- "Knowledge Transfer" facilitates changing goals without having to re-learn.
- "No Explicit Training" indicates that the learning is incremental such that training and testing are interleaved.
- "Fault Tolerance" is the ability to compensate for faults and failures provided there is sufficient redundancy.
- "Memory Forgetting" identifies the ability to reject irrelevant data over a period of time.
- "Runtime Change" permits changes to hardware (sensors, actuators) or software (operators) dynamically during runtime.
- "Detect Anomaly" is the ability to recognize an unpredicted or anomalous situation and identify possible causes.
- "Recover Gap" indicates that the algorithm can recover data hidden in a gap in a data stream over a period of time.

Note that these algorithms may be improved in certain ways, such as combining them into hybrid algorithms, but we only consider their pure forms here. These capabilities are important as they relate to the original challenges in the following ways:

1. Coping with a vast amount of non-discrete (continuous) information. a. No over fitting b. Raw data

2. Coping with uncertain information. a. Raw data b. Probabilistic data c. Fault tolerance d. Forgetting

3. Working without sufficient information such as action, sensor & environment models. a. No prior models b. Knowledge transfer


4. Achieving fast response times despite resource limitations. a. No over fitting b. No explicit training c. Forgetting

5. Dealing with unpredicted changes in sensors, actions, goals or the environment. a. Fault tolerance b. Runtime change c. Forgetting d. Probabilistic data

6. Learning with minimal or no human intervention. a. No prior models

This chapter described inspirations drawn from developmental psychology, and a survey of the state of the art in the areas of artificial intelligence and robotic learning. Although there are several competitive approaches for detecting and adapting to unpredicted changes on a robot, we identified that none of them focus on simultaneous unpredicted changes in sensors, actions, goals and the environment. The rest of this dissertation will present a new approach to address this problem.


Chapter 3 Surprise Based Learning

This chapter provides an introduction to the core concepts of Surprise-Based Learning. For this purpose the terminology used throughout this dissertation is defined first, followed by an illustrative example of a robot navigating inside a constrained box environment. The box environment is used in subsequent chapters to illustrate finer details of SBL. We describe the process of creating and maintaining a prediction model, followed by strategies for goal management, rule forgetting and entity & attribute relevance, which are needed for detecting and adapting to changes in goals, the environment, actions and sensors respectively.

3.1 Terminology

Action – An action is a physical change that occurs inside the learner, which in the case of a physical robot is a sequence of predefined actuator commands over a period of time. The set of actions given to the learner is A = {a1, a2, … , an}

Value – A value can be a numeric or categorical representation of data. Values may be ordered (i.e. numeric ordering) or unordered (i.e. categorical bins). A set of values is V = {v1, v2, … , vn}


Attribute – An attribute is a grouping of related values. An attribute may have a finite set of values or an infinite set of values bound by a range. A set of attributes is B = {b1= {v1, v2}, … , bk= {v1, … , vn}, bl=[vi, vj]}

Entity – An entity is a grouping of related attributes. It is a structured response to a stimulus from the environment. A set of entities is E = {e1= {b1}, … , en= {b1, … , bk}}

Sensor – A sensor is the only channel that allows data from the environment to flow into the robot. This raw input data is fed either directly to the learner or via some function that performs preprocessing. So in the case of raw data, SBL maps it to a single entity that has a single attribute (null preprocessing). For example a range sensor is mapped to a single entity which has a single attribute called size, while a camera sensor maps to several color blob entities which have the attributes size and relative location. The set of sensors given to the learner is S = {s1, s2, … , sn} A set of sensor to entity mappings is F = {s1= {e1}, … , sn= {e1, … , ek}}

Operator – An operator or comparison operator ⨀ is a mechanism that enables the learner to reason about an entity, or an attribute's values. These operators aid in the creation and evaluation of logic literals, which serve as the basic building block for our prediction model. ⨀ may be presence(%), absence(~), increase(↑), decrease(↓), greater-than(>), less-than(<), etc.

Conditions are sentences that describe the state of observed entities and attributes before an action, and each condition can be represented using a 4-tuple, i.e. Condition1 ≡ (entity1, attribute1, >, value1) means that Condition1 is true if attribute1 of entity1 is greater than value1. Several logically related conditions can be grouped together to form a sentence using 'And' and 'Not' logical operators. Predictions are sentences that describe the expected change in the state of observed entities and attributes as a result of performing a specific action, possibly a number of times. As seen in (3) a prediction can be represented using a 4-tuple, i.e. Prediction1 ≡ (entity1, attribute1, ↑, value1) means that if the rule is successful the value of entity1's attribute1 will increase by the amount indicated in value1.
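To make the 4-tuple notation concrete, the following minimal sketch (an illustrative reading in Python, not the dissertation's implementation; the numeric values such as Value3 = 40 are made up) represents conditions, predictions and a prediction rule, and evaluates a rule's conditions against an observation of the kind used in the box environment.

from collections import namedtuple

# Condition and Prediction are both 4-tuples (entity, attribute, operator, value).
Condition = namedtuple("Condition", "entity attribute op value")
Prediction = namedtuple("Prediction", "entity attribute op value")
# A prediction rule: conjunction of conditions -> action (one or more times) -> predictions.
Rule = namedtuple("Rule", "conditions action predictions")

# Operators evaluated over an observation of the form {entity: {attribute: value}}.
OPS = {
    "%": lambda obs, e, b, v: e in obs,                      # presence of entity
    "~": lambda obs, e, b, v: e not in obs,                  # absence of entity
    ">": lambda obs, e, b, v: obs.get(e, {}).get(b, 0) > v,  # greater-than
    "<": lambda obs, e, b, v: obs.get(e, {}).get(b, 0) < v,  # less-than
}

def condition_holds(cond, observation):
    return OPS[cond.op](observation, cond.entity, cond.attribute, cond.value)

# In the spirit of Rule3 from the box-environment example later in this chapter:
# "if the Red entity is present, FORWARD should make Red's size exceed Value3".
rule3 = Rule(conditions=[Condition("Red", "All", "%", 0)],
             action="FORWARD",
             predictions=[Prediction("Red", "S", ">", 40)])   # Value3 = 40 is assumed

obs = {"Red": {"S": 25, "Y": 10}, "White": {"S": 60, "Y": 5}}
print(all(condition_holds(c, obs) for c in rule3.conditions))  # True: the rule fires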

An important aspect of prediction rules is that they can be sequenced to form a prediction sequence [o0, a1, p1, …, an, pn], where o0 is the current observation at the current state, ai, 1≤i≤n, are actions, and pi, 1≤i≤n, are predictions. As the actions in this sequence are executed, the environmental states are perceived in a sequence o1, o2,…, on. A surprise occurs as soon as an environmental state does not match its corresponding prediction. Notice that a surprise can be "good" if the unpredicted result is desirable or "bad" otherwise. This notation is similar to the concept of "predictive state representations" recently proposed in [LSS02], but a prediction sequence here can be used to represent many other concepts such as "plan", "exploration", "experiment", "example" and "advice" in a unified fashion as follows:

- A plan is a prediction sequence where the accumulated predictions in the sequence are expected to satisfy a goal.

- Exploration is a prediction sequence where some predictions are deliberately chosen to generate surprises, so the learner can revise its knowledge.

- An experiment is a prediction sequence where the last prediction is designed to generate a surprise with a high probability.

- An example is a prediction sequence provided by an external source that the learner should go through and learn from the surprises generated during the execution of the sequence.

- Advice is a prediction sequence that should be put into the model as it is. The learner simply translates a piece of advice into a set of prediction rules. For example, "never run off a cliff" can be translated into a prediction rule as cliff_facing –forward→ destruction. The learner can weigh the advice ranging from "never" to "sometimes" depending on the seriousness of the consequence stated in the rule and use it when planning to reach a goal.
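The sketch below (a simplified illustration in Python, not the SBL code; the toy environment, the check predicate and all numeric values are assumptions) steps through a prediction sequence [o0, a1, p1, …, an, pn], executing each action and raising a surprise as soon as an observation contradicts the corresponding prediction.

def check(prediction, observation):
    # Hypothetical predicate: does the observation satisfy an
    # (entity, attribute, op, value) prediction?
    entity, attribute, op, value = prediction
    actual = observation.get(entity, {}).get(attribute)
    if actual is None:
        return False
    return actual > value if op == ">" else actual < value

def execute_sequence(o0, steps, act):
    # steps: list of (action, prediction) pairs; act: callback that performs an
    # action in the environment and returns the next observation.
    observation = o0
    for i, (action, prediction) in enumerate(steps, start=1):
        observation = act(action)
        if not check(prediction, observation):
            # Surprise: the sequence is abandoned and the discrepancy is handed
            # to surprise analysis (rule splitting / refinement).
            return {"surprise_at": i, "prediction": prediction,
                    "observation": observation}
    return {"surprise_at": None, "observation": observation}

# Toy environment: each FORWARD grows the red blob by 10 units.
state = {"Red": {"S": 20}}
def act(action):
    if action == "FORWARD":
        state["Red"]["S"] += 10
    return state

plan = [("FORWARD", ("Red", "S", ">", 25)),    # satisfied: size becomes 30
        ("FORWARD", ("Red", "S", ">", 100))]   # violated: size is only 40 -> surprise
print(execute_sequence({"Red": {"S": 20}}, plan, act))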

3.4 Life Cycle of Prediction Rules

Figure 4: Life cycle of a prediction rule

The illustration in Figure 4 depicts the life cycle of a prediction rule. The solid arrows indicate the evolution of a prediction rule, while the dotted lines express the input and possible output at each stage.


During learning and adaptation, prediction rules are created, split, and refined according to a set of templates described in equations (4)-(10) below. These form the core of the model modifier in the SBL architecture. They are in essence enhanced versions of rule creation, splitting and refinement as described in CDL. We first present the format of these templates and then give detailed examples and explanations in subsequent chapters.

Rule Creation – Let C0 represent an entity or an attribute. If C0 represents an attribute before an action then P0 indicates its change after the action. If C0 represents an entity before an action then P0 may indicate its change or another entity responsible for its change after the action. A new rule is created as follows:

Rule0 = C0 → Action+ → P0    (4)

Rule Splitting – If a surprise is caused by a single rule (e.g. Rule0 above), then for each possible cause CX identified by the analysis of the surprise, the rule is split into two complementary sibling rules as follows, where PX is a newly observed consequence of the action.

RuleA = C0 ∧ CX → Action+ → P0 ∧ ¬PX    (5)
RuleB = C0 ∧ ¬CX → Action+ → ¬P0 ∧ PX    (6)

Rule Refinement – If a surprise is caused by a RuleA that has a sibling RuleB, where C represents the rule's current condition minus C0, and P represents the prediction of the rule, as follows:

RuleA = C0 ∧ C → Action+ → P    (7)
RuleB = C0 ∧ ¬C → Action+ → ¬P    (8)

Then for each possible cause CX identified by the analysis of the surprise, the rules will be refined as follows:

RuleA = C0 ∧ C ∧ CX → Action+ → P    (9)
RuleB = C0 ∧ ¬(C ∧ CX) → Action+ → ¬P    (10)

Notice that equations (7)-(10) can be applied to any pair of complementary rules. In general, a pair of complementary rules can be refined multiple times, so that as many CX as needed can be inserted into their conditions according to this procedure. Whenever a rule is discriminated, its complementary rule will be generalized, hence the name complementary discrimination learning.
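Read procedurally, the splitting and refinement templates can be sketched as below. This is an illustrative reading only (not the dissertation's code), assuming conditions and causes are kept as lists of literal strings with a negation flag on the complementary sibling; the example causes ("proximity.size > 10", "%White") are made up.

from dataclasses import dataclass, field

@dataclass
class Rule:
    base: str                                   # C0: entity/attribute the rule was created on
    extra: list = field(default_factory=list)   # additional condition literals (C, CX, ...)
    negated: bool = False                       # True when 'extra' sits under a negation (RuleB)
    action: str = ""
    prediction: str = ""

def split(rule, cause, new_consequence):
    # Templates (5)-(6): a surprised rule becomes two complementary siblings.
    rule_a = Rule(rule.base, [cause], False, rule.action,
                  rule.prediction + " and not " + new_consequence)
    rule_b = Rule(rule.base, [cause], True, rule.action,
                  "not " + rule.prediction + " and " + new_consequence)
    return rule_a, rule_b

def refine(rule_a, rule_b, cause):
    # Templates (9)-(10): discriminate RuleA with a new cause; RuleB keeps the
    # negated conjunction of the same literals, i.e. it is generalized.
    rule_a.extra.append(cause)            # C0 and C and CX -> P
    rule_b.extra = list(rule_a.extra)     # C0 and not(C and CX) -> not P
    return rule_a, rule_b

# Example: the rule "%Red -FORWARD-> Red.size increases" is surprised twice.
r0 = Rule(base="%Red", action="FORWARD", prediction="Red.size increases")
ra, rb = split(r0, cause="proximity.size > 10", new_consequence="Red.size unchanged")
ra, rb = refine(ra, rb, cause="%White")
print(ra)
print(rb)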

3.4.1 Rule Creation

New rules are created according to equation (4). We call the observation made at time t-1 before executing an action the base-condition ‘BC’, and the observation made at time t after executing the action the base-result ‘BR’. Therefore, new rules are created and added to the model by comparing the base-condition and base-result of executing an action ‘a’ one or more times with the set of operators {%, ~, ↑, ↓}. The following functions return new prediction rules:


For each entity e in BC but not in BR, create (%e –a+→ ~e);    (11)

For each entity e not in BC but in BR, create (~e –a+→ %e);    (12)

For each entity e1 in BC not in BR and each e2 not in BC but in BR, create (%e1 –a+→ %e2);    (13)

For each entity e in BC and BR, do
for value v increased, create (e.b –a+→ e.b↑v);    (14)
for value v decreased, create (e.b –a+→ e.b↓v);    (15)

For each newly created rule, the robot records observations corresponding to its base-condition and base-result. Rule creation is invoked when the robot: i) has no predictions for the selected action, or ii) none of the predicted rules forecasted a change in the current observation, or iii) all predicted rules have been forgotten after rule analysis as detailed later.
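The creation functions (11)-(15) can be read procedurally as in the sketch below (an illustrative reading in Python, not the dissertation's code), assuming observations are dictionaries of the form {entity: {attribute: value}}; run on the observations of Table 2 it reproduces rules R1-R4.

def create_rules(bc, br, action):
    # Compare base-condition bc and base-result br and return new prediction
    # rules as readable strings, following functions (11)-(15).
    rules = []
    vanished = [e for e in bc if e not in br]
    appeared = [e for e in br if e not in bc]
    for e in vanished:                      # (11) entity present before, absent after
        rules.append(f"%{e} -{action}+-> ~{e}")
    for e in appeared:                      # (12) entity absent before, present after
        rules.append(f"~{e} -{action}+-> %{e}")
    for e1 in vanished:                     # (13) vanished entity paired with appeared entity
        for e2 in appeared:
            rules.append(f"%{e1} -{action}+-> %{e2}")
    for e in bc:                            # (14)-(15) attribute value increased or decreased
        if e not in br:
            continue
        for b, before in bc[e].items():
            after = br[e].get(b)
            if after is None or after == before:
                continue
            change = f"inc {after - before}" if after > before else f"dec {before - after}"
            rules.append(f"{e}.{b} -{action}+-> {e}.{b} {change}")
    return rules

# Reproducing Table 2: I vanishes, J appears, and G.size grows by 5.
bc = {"G": {"size": 10}, "H": {"size": 1}, "I": {"length": 2}}
br = {"G": {"size": 15}, "H": {"size": 1}, "J": {"size": 6}}
for r in create_rules(bc, br, "Action1"):
    print(r)   # %I ... ~I, ~J ... %J, %I ... %J, G.size ... inc 5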

Table 2: Example of rule creation

Base-condition: G.size = 10, H.size = 1, I.length = 2
Base-result: G.size = 15, H.size = 1, J.size = 6
Prediction Model:
R1: %I –Action1→ ~I
R2: ~J –Action1→ %J
R3: %I –Action1→ %J
R4: G.size –Action1→ G.size↑5


Figure 5: a) Robot's location b) Base-condition c) Base-result

Two examples of rule creation are shown in Table 2 and Figure 5 respectively. Consider the situation in Figure 5, where the robot first explores the action “forward” and the observations before and after the action are in Figure 5b and 5c respectively. There are three entities in these observations, namely red (wall), white (floor), and proximity. The rule creation mechanism detects four changes in the attributes before and after the action, but no entities have changed. So 4 new rules Rule1, Rule2, Rule3, and Rule4, are created as listed below:

Rule1: (White, All, %, 0) → FORWARD → (White, S, <, Value1)   // size of white decreased

Rule2: (White, All, %, 0) → FORWARD → (White, Y, >, Value2)   // y-location of white increased

Rule3: (Red, All, %, 0) → FORWARD → (Red, S, >, Value3)   // size of red increased

Rule4: (Red, All, %, 0) → FORWARD → (Red, Y, >, Value4)   // y-location of red increased

3.4.2 Surprise Detection and Analysis

As prediction rules are incrementally learned, the robot uses them to make forecasts or predictions whenever it can. When the conditions of a prediction rule 'R' are satisfied by the current observation, the rule's predictions are evaluated after the action is executed.


A surprise is detected if a prediction fails to be realized, i.e. the forecasted value was not observed. When a surprise occurs, we call the observation made at time t-1 before executing the action the surprised-condition ‘SC’.

The objective of surprise analysis is to identify the possible cause(s) of the surprise by comparing entities and attributes in the base-condition with those in the surprised-condition using the given set of comparison operators (typical set {%, ~, >, <}). The following functions return possible causes:

For each entity e in BC but not in SC, cause (%e);    (16)

For each entity e not in BC but in SC, cause (~e);    (17)

If values in attribute b are ordered, then for each entity e in BC and SC, do
for value e.b.vBC < e.b.vSC, cause (e.b < e.b.vSC);    (18)
for value e.b.vBC > e.b.vSC, cause (e.b > e.b.vSC);    (19)

If values in attribute b are unordered, then for each entity e in BC and SC, do
for value e.b.vBC != e.b.vSC, cause (e.b != e.b.vSC);    (20)
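The analysis step can be sketched in the same style as the rule-creation sketch above (again an illustrative reading in Python, not the dissertation's code), assuming dictionary-style observations; applied to the base-condition and surprised-condition of Table 3 below, it produces cause literals of the form shown there.

def analyze_surprise(bc, sc):
    # Compare the base-condition bc with the surprised-condition sc and return
    # candidate causes: literals that hold in bc but not in sc.
    causes = []
    for e in bc:                            # entity in BC but not in SC
        if e not in sc:
            causes.append(f"%{e}")
    for e in sc:                            # entity in SC but not in BC
        if e not in bc:
            causes.append(f"~{e}")
    for e in bc:                            # ordered attribute values that differ
        if e not in sc:
            continue
        for b, v_bc in bc[e].items():
            v_sc = sc[e].get(b)
            if v_sc is None or v_bc == v_sc:
                continue
            op = "<" if v_bc < v_sc else ">"
            causes.append(f"{e}.{b} {op} {v_sc}")
    return causes

# Base-condition and surprised-condition from Table 3 (surprised rule R4).
bc = {"G": {"size": 10}, "H": {"size": 1}, "I": {"length": 2}}
sc = {"G": {"size": 20}, "H": {"size": 2}, "J": {"size": 6}}
print(analyze_surprise(bc, sc))   # e.g. ['%I', '~J', 'G.size < 20', 'H.size < 2']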


Table 3: Example of surprise analysis

Surprised Rule: R4: G.size –Action1→ G.size↑5
Base-condition: G.size = 10, H.size = 1, I.length = 2
Base-result: G.size = 15, H.size = 1, J.size = 6
Surprised-condition: G.size = 20, H.size = 2, J.size = 6
Surprised-result: G.size = 20, H.size = 2, J.size = 3
Causes of Surprise:

[(%I), (~J), (G.size