Wrapper for Object Detection in an Autonomous Mobile Robot

Nicolas Bredèche LIMSI - Univ. Paris 11 [email protected]

Yann Chevaleyre LIP6 - Univ. Paris 6 [email protected]

Abstract In this paper, we address the problem of object detection in an autonomous mobile robot. The goal is to define a perceptual system that can quickly adapt itself to detect a specific object. To achieve this, we propose a goal-oriented approach that lets the robot build the best-fitted image descriptors for a given object.

1 Introduction In real-world robotics, object detection is needed for several tasks such as object identification, object tracking, scene description or grounded communication. An autonomous mobile robot can be confronted with countless situations where various objects or agents are likely to stimulate its percepts. To achieve accurate object detection, a robot must be able to adapt its perceptual system to specific tasks. For instance, learning to detect a human or a door does not lead to the same perceptual adaptation. This adaptation can be compared to the short-term perceptual learning that occurs in the early stages of human visual processing [4]. In this paper, we are concerned with the robot's ability to learn how to detect a specific object. The training set contains noisy images with or without the target object. This task is very close to image retrieval and relies both on a relevant image-description mechanism and on an efficient matching phase between new images and known images. In order to improve detection accuracy, we focus on defining a goal-oriented wrapper that the robot can use to find an image-description mechanism better adapted to the object to be detected. The contribution of this paper is a wrapper that seeks a good expressivity for the image representation while also reducing the complexity of the matching phase.

2 Problem setting

Louis Hugues LIP6 - Univ. Paris 6 [email protected]

The adaptation of robots to real situations cannot be obtained solely by modeling and design. This adaptation requires learning mechanisms that take into account sensor uncertainty and imprecision, as well as the incompleteness and dynamic nature of real-world models. Applications such as tracking or scene description do not need the target object to be explicitly recognized; detecting the presence of the object is often enough for these tasks. In the scope of this article, we define the robot's detection task as follows: to detect an object inside an image, it is enough to detect a relevant part of this object in some part of the image. Here, object means both an inanimate object (door, window, ashtray...) and a dynamic object (another robot, a human...). Moreover, each part is created according to a given structure and embeds one to many basic connected elements of the image, where an element is a clustered set of connected pixels (e.g. a single pixel, a grown color region, a localized color histogram, etc.). Figure 1 shows two examples of an image divided into parts, each part embedding a specific number of basic elements according to a given pre-defined structure. Since the matching stage between the current image description and the descriptions of known labeled images is limited to a single part-to-part matching, spatial relations are expressed only between embedded basic elements, thus explicitly limiting the complexity of the matching. From the robot's viewpoint, this approach is data-driven, as are many other approaches in robotic systems, and does not rely on a given representation of objects as found in the world. This detection task has much in common with image indexing, especially with image retrieval using shape and spatial properties such as [5][6][8]. As a matter of fact, the parts of an image in the detection framework can be seen as flexible templates as introduced by Lipson [7] for image classification, an approach also known as configural recognition.
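The part-based description above can be sketched in code. The following is an illustrative reconstruction, not the paper's implementation: the coarse per-channel histogram encoding, the grid geometry and the horizontal-pair part structure are all assumptions made for the example.

```python
# Illustrative sketch: describe an image as parts, where each part embeds
# one or more connected basic elements. Here an element is a grid cell
# summarized by a coarse local color histogram (an assumed encoding).

def local_histogram(pixels, bins=4):
    """Coarse per-channel histogram of a list of (r, g, b) pixels (0-255)."""
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        for ch, v in enumerate((r, g, b)):
            hist[ch * bins + min(v * bins // 256, bins - 1)] += 1
    return hist

def grid_elements(image, rows, cols):
    """Split an image (2-D list of (r,g,b)) into rows x cols cells,
    returning one histogram per cell, keyed by grid coordinates."""
    h, w = len(image), len(image[0])
    elems = {}
    for i in range(rows):
        for j in range(cols):
            cell = [image[y][x]
                    for y in range(i * h // rows, (i + 1) * h // rows)
                    for x in range(j * w // cols, (j + 1) * w // cols)]
            elems[(i, j)] = local_histogram(cell)
    return elems

def parts_pairs(elems):
    """One possible part structure: two horizontally connected elements."""
    return [(elems[(i, j)], elems[(i, j + 1)])
            for (i, j) in elems if (i, j + 1) in elems]
```

A different structural configuration (e.g. the L-like three-element parts of Figure 1) would simply change which neighboring cells each part embeds.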
Learning from such flexible templates has been formalized as a multiple-instance learning problem [2] and successfully used for learning natural scene concepts from few images [8, 10]. Our problem is driven by the fact that we are concerned with checking whether a specific property is hidden in the image. As a matter of fact, the robot's environment provides very similar images in which the global information is not bound to the target concept, whereas a standard image-retrieval task is about finding globally similar images among a set of very different images. Moreover, we intend to create a set of rules, which is known to be much faster to apply than any similarity measure, leading to nearly costless image classification that can easily be implemented in a real-time mobile robot. In this paper, we focus on finding the best structural configuration for matching single parts. This is done through successive reformulations (i.e. redefinitions) of the part's structural configuration in order to increase the detection accuracy.

1051-4651/02 $17.00 (c) 2002 IEEE

Figure 1. Two examples of image description.

3 Proposed approach As with other works based on abstraction [12], our approach relies on a representation bias achieved by reformulating the data for the given learning task. Abstraction has been widely studied in the fields of problem solving and planning, and is becoming a subject of great interest to the machine learning community as it offers a general framework for handling complex data [3]. Here, we use a multiple-instance representation of the data [2]. This kind of representation has already been successfully used for natural scene classification [8]. Within the multiple-instance framework [2], images are represented as bags of vectors of variable size; the vectors are also called instances. The size of a bag B is noted |B|, and its instances are noted x_1, ..., x_|B|. Let X be a feature vector space and Y the finite set of labels, or classes. The multiple-instance learning task consists in finding a classifier h that maps a bag of instances from X to a label in Y and accurately predicts the labels of unseen bags. Learning is achieved thanks to a dataset which contains images with or without the target object to be detected (labeled as positive or negative examples, respectively). In this case, vectors correspond to parts of the image. Thanks to an appropriate bias [2], classifying a single part of an image as positive is enough to classify the whole image, which

perfectly suits the detection task described in this paper. First, the images are described as basic elements connected to each other. These basic elements are given by any clustering algorithm (e.g. region growing based on color similarity). In order to avoid a complex matching phase due to the richness of the spatial properties of an image's basic elements, a new description usable for learning is built on top of this basic description. The new description relies on a set of parts to describe the image. As seen in the previous section, a part embeds one to several basic elements according to a specific structural configuration. As a consequence, the matching complexity is limited by the structure used. Figure 1 illustrates two different part structures based on a simple grid-based division of the image (e.g. local color histograms corresponding to grid elements). In the first example, a part embeds exactly one element, while in the second example a part embeds three elements in an L-like structural configuration. Many structural configurations are possible, and the problem is to find a good trade-off between expressivity and complexity for a part's structural configuration. Thus, in order to adapt the structural configuration of parts to the target object, the robot uses a wrapper that iteratively reformulates the description, starting from the simplest structural configuration, until the prediction accuracy is satisfying. The wrapper takes into account the results and the decision rules produced by the learning algorithm on specific structural configurations in order to create new descriptions. This goal-oriented adaptation can be referred to as an artificial perceptual learning task, since the goal is to find a relevant structure for a specific detection task.
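The multiple-instance bias described above, where one positive part suffices to label the whole image positive, can be sketched as follows. The rule encoding (conjunctions of threshold tests) and the feature names are hypothetical, chosen only to illustrate the mechanism.

```python
# Sketch of the multiple-instance bias: an image (a bag of part vectors)
# is classified positive as soon as ONE part satisfies a learned rule.
# Rules and feature names are illustrative, not the paper's actual output.

def satisfies(instance, rule):
    """rule: list of (feature, op, threshold) conditions, all of which
    must hold for the instance (a dict of feature values)."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(ops[op](instance[feat], thr) for feat, op, thr in rule)

def classify_bag(bag, rules):
    """bag: list of part feature dicts; positive if any part fires any rule."""
    return any(satisfies(x, rule) for x in bag for rule in rules)

# One hypothetical learned rule: red hue with stable brightness.
rules = [[("hue", ">=", 250), ("stddev_brightness", "<=", 56)]]
bag_pos = [{"hue": 10, "stddev_brightness": 80},
           {"hue": 254, "stddev_brightness": 40}]   # second part matches
bag_neg = [{"hue": 10, "stddev_brightness": 80}]    # no part matches
```

Note that the negative bag fails only because no part fires; nothing global about the image is tested, which matches the part-to-part matching constraint of Section 2.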

4 Experiments 4.1 Experimental setup In order to validate our approach and train the perceptual system, we built a learning set of 350 images taken by the robot's video camera in the corridors of our lab, with 24-bit colour information per pixel. An extinguisher (i.e. the target concept) can be seen in half of the images. The extinguishers appear in the images with different shapes, sizes and orientations. Finally, the images are labelled positive or negative according to the occurrence of an extinguisher. The multiple-instance rule learner RipperMi [1], developed in our lab, was used on the descriptions obtained from these images with a ten-fold cross-validation. Moreover, each experiment is repeated 10 times in order to get a good approximation of the results. RipperMi returns a set of rules that covers the positive examples. These rules are very short and contain only relevant attributes (thus limiting the

1051-4651/02 $17.00 (c) 2002 IEEE

matching complexity). Moreover, they can easily be interpreted by a wrapper for reformulation, or by a human interlocutor in human-robot communication. RipperMi is also very fast, which is of utmost importance given that it is embedded in a real-time robot: the induction time for all the experiments done with the wrapper amounted to about 10 seconds on a 600 MHz Pentium processor. In order to reformulate the structural configuration of parts, we developed the PLIC system (Perceptual Learning by Iterative Construction), which is both a structural reformulation tool and a wrapper that explores the possible structures for defining a part's structural configuration according to given search hypotheses. The wrapper starts with a basic definition of what a part is, and then builds new structurally enhanced parts thanks to the given construction rules and a specific heuristic. The descriptions generated by all of the structurally different parts are then evaluated with RipperMi, and the process is repeated until a predefined termination state is reached.
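A PLIC-like search loop might be sketched as below. This is a hypothetical reconstruction under stated assumptions: `evaluate` stands in for RipperMi plus cross-validation, `extend` generates structurally richer part configurations, and the greedy best-first strategy is an assumption, since the paper only specifies "construction rules and a specific heuristic".

```python
# Hypothetical sketch of a PLIC-like wrapper loop: start from the simplest
# part structure, extend it into richer candidates, evaluate each with the
# learner, and stop once the accuracy is satisfying (or no candidates remain).

def wrapper(initial, extend, evaluate, target_accuracy, max_rounds=5):
    best, best_acc = initial, evaluate(initial)
    frontier = [initial]
    for _ in range(max_rounds):
        if best_acc >= target_accuracy:
            break                      # predefined termination state reached
        candidates = [c for s in frontier for c in extend(s)]
        if not candidates:
            break                      # no further structural extension
        scored = [(evaluate(c), c) for c in candidates]
        acc, cfg = max(scored, key=lambda t: t[0])
        if acc > best_acc:
            best, best_acc = cfg, acc
        frontier = [c for _, c in scored]
    return best, best_acc

# Toy usage: a "structure" is just its number of embedded elements, and a
# fake evaluator favors two-element parts (mimicking the paper's findings).
toy_acc = {1: 0.70, 2: 0.90, 3: 0.85}
toy_extend = lambda n: [n + 1] if n < 3 else []
best_cfg, best_acc = wrapper(1, toy_extend, toy_acc.__getitem__, 0.88)
```

With these toy inputs the loop stops after one extension round, returning the two-element structure.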

4.2 Preliminary Results

On the one hand, we implemented a description mechanism based on a region-growing algorithm in order to obtain colour-homogeneous parts of the image described by color information (with hue, saturation, brightness and specialized hue detectors as pixel properties) and spatial properties. We implemented the region-growing algorithm known as Color Structure Code [11] in order to get a reliable description from the images. This linear-time algorithm is known to be quite efficient thanks to a specific up-and-down clustering which solves conflicts between shapes. However, the hue, saturation and brightness parameters used during region growing were tuned empirically to yield good results. On the other hand, we extended the global color histogram technique, which is known to yield good results [13]. To achieve this, the image is divided according to a grid with fixed dimensions, and each grid element is described by a local color histogram and several other pieces of information about localization and size. We first fixed the grid to a single cell (which is equivalent to a global color histogram), and then to a finer grid of 48 elements per image, with no further empirical tuning. From these two basic descriptions, we generated four datasets. The part's structural configuration for each dataset is defined as follows:
- a part embeds a couple of connected color regions,
- a single part embeds a global color histogram,
- a part embeds one local histogram,
- a part embeds a couple of connected local histograms.
As shown in Table 1, the local color histogram descriptions clearly outperform the region-growing descriptions under the given hypothesis. Local histograms are much less sensitive to noise and scale variability. Moreover, a single part can embed several local histograms with spatial properties, which is shown to improve the accuracy of learning. In the experiment where a part embeds two local histograms, the learning algorithm RipperMi generates an average of ten rules in order to cover all the positive examples, each rule being a set of conditions on 3 to 6 features among approximately 30. Here is an example of a rule given by RipperMi for the best experiment: true :- valH1 >= 254, stddevB1 <= 56, stddevS1 <= 58, valB2 <= 128 (13/0). This rule means that if there exists an element in the image with a red hue and a low standard deviation on both saturation and brightness, and if the brightness of the connected element is less than average, then we can conclude that the target object is detected (this rule covers 13 positive examples and no negative ones). In the following section, we present the results of the wrapper that iteratively builds structured descriptions, based on local histograms, which proved more reliable and easier to handle.

4.3 Wrapper Results

PLIC's wrapper tool was used to generate nine different part structural configurations. Each configuration is then used to generate a learning set based on a grid division of each image in the database into 48 elements (i.e. a granularity of 48 elements per image). In order to create the nine structural configurations, the wrapper starts from the simplest part structure (with no spatial properties, since it embeds only one element) and extends it into two distinct structural configurations, each embedding 2 elements with specific spatial properties. Then, six other new structural configurations are extended from the two-element configurations. To generate a learning dataset, one of the structural configurations is applied to every single element of every image, leading to up to 48 parts per image (with parts overlapping if the configuration embeds more than one element). Figure 2 shows the nine part structural configurations created by the wrapper, along with the results given by RipperMi on the corresponding learning sets. Results from the experiments show that the highest accuracy is obtained with one of the most complex part structures.

Table 1. Accuracy of RipperMi on the four datasets: two connected color regions; a global color histogram; one local histogram; two connected local histograms.
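The sample rule from Section 4.2 can be read as a predicate over a part embedding two connected local histograms. The feature names below follow the rule text (valH1, stddevB1, stddevS1, valB2); the dictionary encoding and the 0-255 value ranges are assumptions made for illustration.

```python
# Hypothetical reading of the RipperMi rule
#   true :- valH1 >= 254, stddevB1 <= 56, stddevS1 <= 58, valB2 <= 128
# as a predicate on a part embedding two connected local histograms.

def rule_fires(part):
    """part: dict of HSB statistics for element 1 and its connected element 2."""
    return (part["valH1"] >= 254        # element 1: red hue
            and part["stddevB1"] <= 56  # low brightness deviation
            and part["stddevS1"] <= 58  # low saturation deviation
            and part["valB2"] <= 128)   # element 2: below-average brightness

red_part = {"valH1": 255, "stddevB1": 30, "stddevS1": 40, "valB2": 100}
wall_part = {"valH1": 40, "stddevB1": 30, "stddevS1": 40, "valB2": 100}
```

Combined with the multiple-instance bias of Section 3, an image would be labeled positive as soon as any of its 48 parts makes such a predicate true.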