Visual approaches for handle recognition

E. Jauregi, E. Lazkano, J.M. Martínez-Otzeta, and B. Sierra
Robotics and Autonomous Systems Group, University of the Basque Country, Donostia
http://www.sc.ehu.es/ccwrobot
[email protected]

Summary. Objects can be identified in images by extracting local image descriptors for interesting regions. In this paper, instead of making the handle identification process rely only on the keypoint detection/matching step, we present a method that first extracts from the image a region of interest (ROI) that with high probability contains the handle. This subimage is then processed by the keypoint detection/matching algorithm. Two methods for extracting the ROI, the Circle Hough Transform (CHT) and blob extraction, are compared and combined with three descriptor extraction methods: SIFT, SURF and USURF.

1 Introduction

Door recognition is a key problem to be solved during mobile robot navigation. Many navigation tasks can be fulfilled by point-to-point navigation, door identification and door crossing [13]. Indoor semi-structured environments are full of corridors that connect different offices and laboratories, and doors give access to many of the locations that are defined as goals for the robot. Hence, endowing the robot with the ability to identify doors would undoubtedly increase its navigation capabilities.

Several references can be found that tackle the problem of door identification. Muñoz-Salinas et al. [16] present a visual door detection system based on the Canny edge detector and the Hough transform to extract line segments from images. Features of those segments are then used by a genetically tuned fuzzy system that analyzes the existence of a door. A vision-based system for detection and traversal of doors is presented in [4]: door structures are extracted from images using a parallel-line-based filtering method, and active tracking of the detected door line segments is used to drive the robot through the door. A different proposal for a vision-based door traversing behavior can be found in [19], where the PCA (Principal Component Analysis) pattern finding method is applied to the images obtained from the camera for door recognition. A door identification and crossing approach is also presented in [15]; there, a neural-network-based classification method was used for both the recognition and crossing steps. More recently, in [11] a Bayesian-network-based classifier is used to perform the door crossing task. The proposal in [10] differs in the sense that doors are located in a map and do not need to be recognized; instead, rectangular handles are searched for manipulation purposes. The handles are identified using cue integration by consensus: for each pixel, the probability of being part of a handle is calculated by combining its gradient and intensity to obtain a degree of membership, and a template model is used to obtain the consensus over a region.

However, navigating in narrow corridors makes it difficult to identify doors by line extraction, due to the viewpoint restrictions imposed by the limited distance to the walls. On the other hand, door handles are at the same height as the robot's camera and small in size. Moreover, door blades are rarely cluttered surfaces, and handles are commonly the only element doors have on them. Previously, we tackled the problem using the CHT for handle identification [9], but there the circle information was combined with color segmentation inside and around the circle. That approach proved to be too specific to the robot's particular environment and not easily generalizable.

Objects can be identified in images by extracting local image descriptors for interesting regions. These descriptors should be distinctive and invariant to image transformations. SIFT and SURF (together with its upright version, USURF) [2] are well-known methods for extracting this kind of descriptor. In this paper, instead of making the handle identification process rely only on the keypoint detection/matching step, we present a method that first extracts from the image a region of interest (ROI) that with high probability contains the handle. This subimage is then processed by the keypoint detection/matching algorithm. Two methods for extracting the ROI are compared and combined with the three descriptor extraction methods previously mentioned.

2 SIFT

SIFT (Scale Invariant Feature Transform) is a method for extracting features that are invariant to image scaling and rotation, and partially invariant to changes in illumination and 3D camera viewpoint. Those properties make it suitable for robotics applications, where changes in the robot's viewpoint distort the images taken by a conventional camera. SIFT can be used for different goals, such as object recognition in images [14], image retrieval [12], mobile robot localization [21, 7] and SLAM [18]. Keypoints are localized as extrema of a difference-of-Gaussians scale space; then, for each keypoint, a descriptor is computed by calculating a histogram of local oriented gradients around the interest point and storing the bins in a 128-dimensional vector. These descriptors can then be compared with stored ones for object recognition purposes.
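As an illustration of how such descriptors can be obtained and compared, the sketch below uses OpenCV's SIFT implementation together with a standard nearest-neighbour ratio test. The file names and the 0.75 ratio threshold are illustrative assumptions rather than the settings used in the paper, and cv2.SIFT_create requires OpenCV 4.4 or later.

```python
# Hedged sketch: SIFT extraction and ratio-test matching with OpenCV.
# File names and the 0.75 threshold are illustrative assumptions.
import cv2

ref = cv2.imread("reference_handle.png", cv2.IMREAD_GRAYSCALE)
test = cv2.imread("test_image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_ref, desc_ref = sift.detectAndCompute(ref, None)    # 128-D descriptors
kp_test, desc_test = sift.detectAndCompute(test, None)

# Nearest-neighbour matching: accept a match only if the best candidate is
# clearly better than the second best.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(desc_ref, desc_test, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

print(f"{len(good)} descriptor matches")
```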


3 SURF/USURF

SURF [2] is another detector-descriptor algorithm, developed with the aim of speeding up the keypoint localization step without losing discriminative power. Instead of using different measures for selecting the location and the scale of the keypoints, it relies on the determinant of the Hessian for both. The second-order Gaussian derivatives needed to compute the Hessian are approximated by box filters that are evaluated very fast using integral images. Such filters of any size can be applied at exactly the same speed directly on the original image. Therefore, the scale space is analyzed by upscaling the filter size instead of by iteratively reducing the image size, as occurs in the SIFT approach.

SURF descriptors are computed in two steps. First, a reproducible orientation is found based on the information of a circular region around the interest point. This is performed using Haar-wavelet responses in the x and y directions at the scale at which the interest point was detected; the dominant orientation is then estimated by summing all responses within a sliding window. Next, the region is split up into smaller square subregions and some simple features are computed (weighted Haar-wavelet responses in both directions and the sum of the absolute values of the responses). This yields a descriptor of length 64, half the size of the original SIFT descriptor, and hence a less computationally expensive matching process. The upright version of SURF, named USURF, skips the first step of the descriptor computation process, resulting in a faster version. USURF is proposed for those cases in which rotation invariance is not mandatory.
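A minimal sketch of how SURF and its upright variant can be instantiated with OpenCV follows; cv2.xfeatures2d is only available in opencv-contrib builds with the non-free algorithms enabled, and the Hessian threshold used here is an illustrative value.

```python
# Hedged sketch: SURF and USURF via OpenCV's contrib module (non-free build).
import cv2

img = cv2.imread("test_image.png", cv2.IMREAD_GRAYSCALE)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)                  # rotation-invariant, 64-D
usurf = cv2.xfeatures2d.SURF_create(hessianThreshold=400, upright=True)   # skips the orientation step

kp_s, desc_s = surf.detectAndCompute(img, None)
kp_u, desc_u = usurf.detectAndCompute(img, None)
print(len(kp_s), "SURF keypoints,", len(kp_u), "USURF keypoints")
```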

4 Extracting the interesting region of the image

Instead of computing the invariant features of the whole image, the approach presented here takes advantage of the properties of the robot environment and reduces the size of the image to be processed by extracting the portion of the image that, with high probability, contains most of the features. Since door blades in the robot environment are featureless surfaces, almost every keypoint is located on the door handle and only a few appear in its surroundings (see Figure 1). Therefore, there is no need to process the whole image: only the handle must be selected as the region of interest (ROI) and processed afterwards.

Fig. 1. SIFT keypoints in a 320 × 240 image

4.1 Hough transform for circle identification

Most handles in our environment are round in shape. Hence, the ROI can be located by finding circles and taking the subimage associated with the most probable circle. Although many circle extraction methods have been developed, probably the best-known algorithm is the Circle Hough Transform (CHT) [3].

4

E. Jauregi, E. Lazkano, J.M. Mart´ınez-Otzeta, and B. Sierra

Fig. 1. SIFT Keypoints in a 320 × 240 image

Yuen et al. [22] investigated five circle detection methods based on variations of the Hough transform. One of them, the Two-stage Hough Transform, is the one implemented in the OpenCV vision library (http://www.intel.com/research/mrl/research/opencv) used in the experiments described below. Unfortunately, the existence of a circle in the picture does not guarantee the existence of a handle. A deeper analysis of the function indicates that the circle must have a limited radius for the image to be a candidate for containing a door handle. In this manner, only the detected circumferences whose radius lies within the known range are considered handle candidates.

4.2 Extracting blobs

The Hough transform limits the approach to circular handles. Moreover, the robot's viewpoint affects the apparent shape of the handle: a circular handle gets distorted and often looks more like an ellipse than a circle. Instead of looking for circles, and again taking advantage of the textureless property of the door blades, the image can be scanned for continuous connected regions, or blobs. Blob extraction, also known as region labeling, is an image segmentation technique that categorizes the pixels of an image as belonging to one of many discrete regions. The image is scanned and every pixel is individually labeled with an identifier that signifies the region to which it belongs (see [8] for more details). Blob extraction is generally performed on the binary image resulting from a thresholding step; instead, we apply the SUSAN (Smallest Univalue Segment Assimilating Nucleus) edge detector [20], a more stable and faster operator. Again, some size restrictions are imposed on the detected blobs to ensure that the selected one corresponds, with high probability, to the door handle. Using the blob information, a square subimage is obtained based on the center of the blob and scaled to a fixed size. The OpenCV blobs library (http://opencvlibrary.sourceforge.net/cvBlobsLib) offers the basic functionality needed for blob extraction and analysis.
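The sketch below illustrates both ROI-extraction routes with plain OpenCV calls. It is not the exact system described above: the paper uses the SUSAN detector and cvBlobsLib, so here Canny edges plus connected-component labelling stand in for them, and the radius limits, area limits and fixed ROI size are illustrative assumptions.

```python
# Hedged sketch of the two ROI-extraction routes (CHT vs. blobs). All
# thresholds, radius/area limits and the fixed ROI size are illustrative.
import cv2
import numpy as np

ROI_SIZE = 100            # side of the fixed-size subimage fed to SIFT/SURF
R_MIN, R_MAX = 10, 40     # plausible handle radii (pixels)
A_MIN, A_MAX = 200, 4000  # plausible handle blob areas (pixels)

def crop_and_scale(gray, cx, cy, half):
    """Crop a square patch around (cx, cy) and scale it to ROI_SIZE x ROI_SIZE."""
    h, w = gray.shape
    x0, y0 = max(0, cx - half), max(0, cy - half)
    x1, y1 = min(w, cx + half), min(h, cy + half)
    return cv2.resize(gray[y0:y1, x0:x1], (ROI_SIZE, ROI_SIZE))

def roi_from_circles(gray):
    """CHT route: keep only circles whose radius lies in the allowed range."""
    circles = cv2.HoughCircles(cv2.medianBlur(gray, 5), cv2.HOUGH_GRADIENT,
                               dp=1, minDist=50, param1=100, param2=30,
                               minRadius=R_MIN, maxRadius=R_MAX)
    if circles is None:
        return None
    cx, cy, r = np.round(circles[0, 0]).astype(int)   # strongest circle
    return crop_and_scale(gray, cx, cy, 2 * r)

def roi_from_blobs(gray):
    """Blob route: connected regions of edge pixels, filtered by area."""
    edges = cv2.Canny(gray, 50, 150)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(edges)
    best = None
    for i in range(1, n):                              # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if A_MIN <= area <= A_MAX and (best is None or area > best[0]):
            best = (area, centroids[i])
    if best is None:
        return None
    cx, cy = best[1].astype(int)
    return crop_and_scale(gray, cx, cy, ROI_SIZE // 2)
```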


5 Off-line performance of the approach

Once a circle or a blob is found, a square subimage is obtained based on its radius (or area) and center, and scaled to a fixed size. The keypoint descriptor extraction procedure (SIFT/SURF/USURF) is then applied only to this ROI instead of to the whole image.

In order to measure the performance of the developed approach, experiments were carried out with a database of about 3000 entries. All the images (reference and testing databases) were taken while the robot followed the corridors using its local navigation strategies, and therefore at the distances at which the robot is allowed to approach the walls. This database contained positive and negative cases. It must be mentioned that the test cases were collected in a different environment from the one where the reference cases were taken; therefore, the test database did not contain images of the handles in the reference database.

Figure 2 compares the results obtained with the different approaches. The plot in figure 2(a) shows the classification accuracies. However, accuracy is a fairly crude score that does not give much information about the performance of a classifier. Instead, the F1 measure is employed as the main evaluation metric, as it combines precision and recall into a single metric and favors a balanced performance of the two (see figure 2(b)).
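For reference, the sketch below shows how accuracy and F1 are derived from the confusion counts of the handle/no-handle decision; it is a generic helper under the standard definitions, not the authors' own evaluation script, and the example counts are made up.

```python
# Standard accuracy and F1 from confusion counts (generic helper, not the
# authors' evaluation code).
def accuracy_and_f1(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Example with made-up counts:
print(accuracy_and_f1(tp=420, fp=30, tn=500, fn=50))
```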

Fig. 2. Results: a) accuracy. b) F1 measure

As mentioned in [17], vision turns out to be much easier when the agent interacts with its environment. Taking into account the robot's morphology and its environmental niche, more specifically the height at which the camera is mounted on the robot and the height at which the handles are located on the doors, handles should always appear at a specific height in the image. This information can help to eliminate false positive candidates that are proposed by the Hough transform or the blob identification and not discarded by the keypoint extraction processes. The improvement introduced by this morphological restriction (MR) is plotted in figure 3.
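A minimal sketch of such a filter is shown below: a ROI candidate is kept only if its centre row falls inside the image band where a handle can plausibly appear. The band limits are illustrative assumptions, not the values used on the robot.

```python
# Hedged sketch of the morphological restriction (MR): given the fixed camera
# and handle heights, a real handle can only appear inside a horizontal band
# of the image, so candidates outside it are discarded. Band limits below are
# illustrative assumptions.
IMAGE_HEIGHT = 240
BAND_Y_MIN = int(0.35 * IMAGE_HEIGHT)   # assumed upper limit of the handle band
BAND_Y_MAX = int(0.75 * IMAGE_HEIGHT)   # assumed lower limit

def passes_mr(candidate_cy):
    """Keep a ROI candidate only if its centre row lies within the band."""
    return BAND_Y_MIN <= candidate_cy <= BAND_Y_MAX

candidates = [(160, 60), (150, 130)]     # (cx, cy) pairs from the CHT or blobs
kept = [c for c in candidates if passes_mr(c[1])]
print(kept)                              # only the candidate inside the band survives
```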

Fig. 3. Morphological restriction: a) accuracy. b) F1 measure

Table 1 shows the best results obtained with each approach and the corresponding ROI size. Table 1(a) shows the results obtained by limiting only the circle radius or the blob area, while Table 1(b) adds the improvement obtained with the morphological restriction. These results clearly outperform those obtained when no ROI is extracted (see Table 1(c)).

Comparing the two ROI extraction procedures, blob information gives better subimages than the CHT does. Although the subimage resulting from the blob-based scaling yields fewer keypoints (on average about 90% of the number of keypoints obtained after the Hough-based scaling), the repeatability of the keypoints is higher according to the obtained results. On the other hand, among the keypoint extraction algorithms tested, USURF seems to outperform both SIFT and SURF with respect to classification accuracy and F1 measure. Although the Blobs+USURF combination requires smaller ROIs, when using the blob approach for ROI extraction the behavior of all three keypoint extraction processes remains stable as the ROI size increases. On the contrary, the performance of the pair Hough+USURF degrades strongly for large ROI sizes.

Table 1. Best results for each approach and corresponding ROI size

a) No MR        Hough            Blobs
                acc.    size     acc.    size
    SIFT        87.59   100      91.82   150
    SURF        87.16   100      92.94   100
    USURF       90.18   100      94.35   150

b) MR           Hough            Blobs
                acc.    size     acc.    size
    SIFT        92.55   150      94.48   240
    SURF        93.73   100      95.70   80
    USURF       94.35   150      96.09   150

c) No ROI       acc.    F1
    SIFT        62.39   59.25
    SURF        72.50   69.17
    USURF       72.50   69.19

In many real-world domains, errors differ in significance and may have different consequences. The system should predict in a way that minimizes unwanted side effects, namely costs. Cost-sensitive classification systems aim to minimize the total cost incurred by the prediction process. Since the conventional predictive accuracy metric does not include cost information, it is possible for a less accurate classification model to be more cost-effective in practice. This means that, to obtain the minimal cost, cost-sensitive learning systems may need to trade off some predictive accuracy and make more mistakes in quantity [5]. A common measure used to evaluate cost-sensitive systems is the Total Cost Ratio (TCR). Following the cues in [5] (λ = 7, and allowing a maximum global error of 0.02), we obtained the TCR values shown in figure 4.
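As an illustration, the sketch below computes a total cost ratio under one common formulation, in which false positives are weighted λ times more than false negatives and the classifier is compared against a trivial baseline; whether this matches the exact formulation derived from [5] and used in the paper is an assumption.

```python
# Hedged sketch of a Total Cost Ratio. One common formulation compares the
# cost of a trivial baseline with the cost of the classifier, weighting false
# positives lambda times more than false negatives; this exact formulation is
# an assumption, not taken from the paper.
def total_cost_ratio(fp, fn, n_positives, lam=7):
    baseline_cost = n_positives          # a baseline that never detects a handle misses them all
    classifier_cost = lam * fp + fn      # false alarms are lambda times more costly
    return baseline_cost / classifier_cost if classifier_cost else float("inf")

print(total_cost_ratio(fp=5, fn=20, n_positives=400))   # TCR > 1 means better than the baseline
```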

Fig. 4. Total cost ratios: a) Hough. b) Blobs

6 Door knocking behavior

Tartalo is a PeopleBot robot from MobileRobots. It is equipped with a Canon VCC5 monocular PTZ vision system, a SICK laser, several sonars and bumpers, and some other sensors. All the computation is carried out on its on-board Pentium (1.6 GHz). Player/Stage [6] is used to communicate with the different devices, and the control architecture is implemented with SORGIN [1], a specially designed framework that facilitates behavior definition.

To evaluate the robustness of the handle identification system, it has been integrated into a behavior-based control architecture that allows the robot to travel along corridors without bumping into obstacles. When the robot finds a door, it stops, turns to face the door and knocks on it with its bumpers a couple of times, asking for the door to be opened and waiting for someone to open it. If after a certain time the door is still closed, Tartalo turns back to face the corridor and continues looking for a new handle. If, on the contrary, someone opens the door, the robot detects the opening with its laser and activates a door crossing behavior module that allows it to enter the room.

Experiments were performed in a different environment from the one where the reference images were acquired (see figure 5(a)). The reference database contained the same 40 images used in the off-line experimental phase described above. Although the off-line experiments showed degraded accuracy for the ROI of size 40 extracted using the blob approach and SIFT with MR, the short time needed to compute the keypoints and perform the matching process (see Table 2), together with the lowest percentage of false positives, confirmed by the TCR, makes this combination the most appealing for the real-time problem stated in this paper.

Table 2. Computational payload (s) per ROI size

Method        40      80      100     150     200     No ROI
Hough+SIFT    0.093   0.209   0.243   0.488   0.777   0.307
Blobs+SIFT    0.068   0.184   0.218   0.463   0.752   0.307

To make the behavior more robust, instead of relying on the classification of a single image, the robot bases its decision on the sum of the descriptor matches accumulated over the last five consecutive images. Figure 5(b) shows the evolution of the sum of matching keypoints over time; the horizontal line represents the value at which the threshold was fixed. The 18 doors present in the environment were properly identified and no false positives occurred.
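A minimal sketch of this decision rule is given below; the window length follows the text, while the threshold value is an illustrative assumption rather than the one plotted in figure 5(b).

```python
# Sketch of the temporal smoothing used for the knocking decision: accumulate
# the descriptor-match counts of the last five images and declare a handle
# only when their sum exceeds a threshold (threshold value assumed).
from collections import deque

WINDOW = 5
THRESHOLD = 25                      # assumed value of the horizontal line in Fig. 5(b)
recent = deque(maxlen=WINDOW)       # match counts of the last WINDOW frames

def handle_detected(matches_in_frame):
    recent.append(matches_in_frame)
    return sum(recent) >= THRESHOLD

for n in [2, 3, 8, 9, 7]:           # simulated per-frame match counts
    if handle_detected(n):
        print("handle: stop, face the door and knock")
```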

Fig. 5. Door handle identification during navigation: a) environment. b) keypoint sum over time

7 Conclusions and further work

The experiments described here show an attempt to use scale-invariant image features to identify door handles during robot navigation. Taking advantage of the featureless property of the door blades, the area corresponding to the handle is extracted. ROI extraction improves the handle identification procedure and, depending on the ROI size, the computational time needed to classify an image can be considerably reduced. Blob-based scaling outperforms the Hough-based one, makes the approach more general and applicable to other types of handles (see figure 6), and requires less computational effort.


Fig. 6. Blob extraction on a non-circular handle

The developed system outperforms the results obtained without extracting the ROIs, and experiments carried out in a real robot-environment system show the adequacy of the approach. The system showed a very low tendency to give false positives while providing robust handle identification.

Optimizing the contents of the reference database would be desirable and could be achieved, for instance, by a genetic-algorithm-based search. The keypoint matching criteria also have to be analyzed more deeply: more sophisticated and efficient algorithms (e.g. SVM, AdaBoost) remain to be tested, and the performance of different distance measures still needs to be studied. Our interest now focuses on extending the approach to non-circular handles in order to generalize the behavior to cross every door in the environment. It is worth mentioning that the approach could also be applied to different tasks, such as face recognition, taking advantage of the performance of the descriptor extraction approaches at a lower computational cost.

Acknowledgments: This work has been supported by SAIOTEK (S-PE06UN16) and by the Gipuzkoako Foru Aldundia (OF 0105/2006).

References

1. A. Astigarraga, E. Lazkano, I. Rañó, B. Sierra, and I. Zarautz. SORGIN: a software framework for behavior control implementation. In CSCS14, volume 1, pages 243–248, 2003.
2. H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In Proceedings of the 9th European Conference on Computer Vision, 2006.
3. R. Duda and P. E. Hart. Use of the Hough transform to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.


4. C. Eberset, M. Andersson, and H. I. Christensen. Vision-based door-traversal for autonomous mobile robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 620–625, 2000.
5. C. Elkan. The foundations of cost-sensitive learning. In IJCAI, pages 973–978, 2001.
6. B. P. Gerkey, R. T. Vaughan, and A. Howard. The Player/Stage project: tools for multi-robot and distributed sensor systems. In Proc. of the International Conference on Advanced Robotics (ICAR), pages 317–323, 2003.
7. A. Gil, O. Reinoso, A. Vicente, C. Fernández, and L. Payá. Monte Carlo localization using SIFT features. Lecture Notes in Computer Science, 3522:623–630, 2005.
8. B. K. P. Horn. Robot Vision. MIT Press, 1986.
9. E. Jauregi, J. M. Martínez-Otzeta, B. Sierra, and E. Lazkano. Handle identification: a three-stage approach. In Intelligent Autonomous Vehicles, September 2007.
10. D. Kragic, L. Petersson, and H. I. Christensen. Visually guided manipulation tasks. Robotics and Autonomous Systems, 40(2–3):193–203, 2002.
11. E. Lazkano, B. Sierra, A. Astigarraga, and J. M. Martínez-Otzeta. On the use of Bayesian networks to develop behavior for mobile robots. Robotics and Autonomous Systems, doi:10.1016/j.robot.2006.08.003, 2006.
12. L. Ledwich and S. Williams. Reduced SIFT features for image retrieval and indoor localisation. In Australian Conference on Robotics and Automation, 2004.
13. W. Li, H. I. Christensen, and A. Orebäck. An architecture for indoor navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1783–1788, 2004.
14. D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, pages 1150–1157, Corfu, Greece, 1999.
15. I. Monasterio, E. Lazkano, I. Rañó, and B. Sierra. Learning to traverse doors using visual information. Mathematics and Computers in Simulation, 60:347–356, 2002.
16. R. Muñoz-Salinas, E. Aguirre, and M. García-Silvente. Detection of doors using a genetic visual fuzzy system for mobile robots. Technical report, University of Granada, 2005.
17. R. Pfeifer and J. Bongard. How the Body Shapes the Way We Think: A New View of Intelligence. MIT Press, 2006.
18. S. Se, D. G. Lowe, and J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotics Research, 21(9):735–758, 2002.
19. M. W. Seo, Y. J. Kim, and M. T. Lim. Door Traversing for a Vision Based Mobile Robot using PCA. In LNAI, pages 525–531. Springer-Verlag, 2005.
20. S. M. Smith and J. M. Brady. SUSAN: a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, May 1997.
21. H. Tamimi, A. Halawani, H. Burkhardt, and A. Zell. Appearance-based localization of mobile robots using local integral invariants. In Proc. of the 9th International Conference on Intelligent Autonomous Systems (IAS-9), pages 181–188, Tokyo, Japan, 2006.
22. H. K. Yuen, J. Princen, J. Illingworth, and J. Kittler. Comparative study of Hough transform methods for circle finding. Image and Vision Computing, 8(1):71–77, 1990.