Multilevel Image Segmentation in Computer-Vision Systems

Christian Leubner
Universität Dortmund, Lehrstuhl für Graphische Systeme, 44221 Dortmund

Abstract

In this contribution a multilevel image segmentation system is presented that is able to reliably separate objects in front of a static but arbitrary background. Our algorithm automatically adapts to background features and is not restricted to specific background content. The system consists of three different processing stages. The low-level processing stage works pixelwise. For the mid-level processing stage the image area is subdivided into small rectangular boxes, so that further classification results are obtained for these boxes. The high-level processing stage works on the box classifications of the mid-level stage and determines the final object of interest by employing further application-dependent knowledge. Depending on the high-level results, the parameters of the mid-level stage are modified. Finally, the mid-level processing stage advises the low-level stage to learn colors and edges of the recognized background regions. Due to this multilevel approach the algorithm is able to cope smoothly with varying illumination conditions and to remove noise almost completely. Moreover, our segmentation system is able to recognize permanent changes in the background and to reliably recall formerly learned situations.

1 Introduction

Video projection is in widespread use for multimedia presentations in classrooms and at conferences. It also plays an important role in group meetings for visualization purposes. For these types of application, back-projection walls become more and more important. Usually interaction is performed at a standard keyboard/mouse computer whose screen content is additionally directed to a video beamer. This type of interaction limits the possibilities of group meetings because the interaction has to be performed at the computer, although it would be more natural to interact directly at the back-projection wall. For that purpose, special displays augmented with sensors have been developed, such as the SmartBoard [STR94]. Another recent development is to use classical laser pointers by capturing the laser point on the projection screen with video cameras. Versions for front- and back-projection have been implemented based on that idea [KIR98, WIS01]. A step further is to let the user point directly at the projection wall, without any additional pointing tool. The idea is to observe the user with video cameras in order to recognize his arm and to calculate the three-dimensional pointing direction of the arm. There is an increasing number of projects concerning tracking of the human body, such as the Pfinder and Spfinder of the MIT [WRE95] for human-computer interaction. Most of these projects emphasize the recognition of symbolic static or motion gestures, while the precise determination of locations or directions is treated less often. Examples concerning pointing are Visualization Space and Dream Space by IBM [IBMDS, IBMVS]. These systems and others define the pointing direction by a line connecting the head and the hand of the pointing arm. According to our experience the postures that have to be performed are somewhat unnatural as far as pointing is concerned and thus inconvenient.

Our system uses two standard video cameras, one under the ceiling and one at the side, which observe the space in front of the back-projection wall. The user directly interacts with the graphical user interface by pointing with his straight arm at the wall. The arm has to be in a defined region in front of the wall of about 1 to 1.5 meters depth. Other dynamic objects, for example further users, are not allowed in this region. Typically the cursor of the application is displayed at the intersection of the pointing line with the wall and can be moved freely by moving the arm. Further instructions, for example initiating a mouse button click, are given by natural voice commands using a wireless headset microphone. Figure 1 shows an example of application.


Figure 1: User interacting with back-projection wall. Two cameras observe the arm of the user. The mouse cursor is moved to the position the user is aiming at.
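Placing the cursor as described amounts to a ray-plane intersection. The following sketch shows one way to compute it, assuming the 3D arm axis has already been reconstructed from the two camera views; the function and its parameters are illustrative and not part of the original system.

```python
import numpy as np

# Hypothetical helper: intersect the reconstructed pointing line with the
# projection wall, modeled as a plane given by a point and a unit normal.

def cursor_position(arm_base, arm_tip, wall_point, wall_normal):
    direction = arm_tip - arm_base            # 3D pointing direction of the arm
    denom = np.dot(wall_normal, direction)
    if abs(denom) < 1e-9:                     # arm parallel to the wall
        return None
    t = np.dot(wall_normal, wall_point - arm_base) / denom
    if t < 0:                                 # arm points away from the wall
        return None
    return arm_base + t * direction           # 3D point where the cursor appears
```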

In this contribution we focus on an important aspect of computer-vision systems, namely segmentation. As the first step of the image processing pipeline, segmentation - the separation of relevant and irrelevant image content - is important for the whole application because further processing steps have to rely on the provided preliminary results. Segmentation itself is as old as image processing. A good survey of early techniques is given in [HAR85] from 1985. Later, in 1993, a survey was published in [PAL93]. Another, more extensive survey can be found in [SKA94]. In our application segmentation is necessary to determine background regions and the position of the user. In order to achieve a stable segmentation, most systems impose a couple of restrictions: the user, for example, must not wear clothes of a specific color, or the background in front of which the segmentation is performed is not allowed to consist of arbitrary colors. Furthermore, many systems need specific expert knowledge, at least at system start-up, to work properly.

We present an image segmentation system which is able to segment a user in front of a more or less static, but arbitrary background. Our algorithm aims at finding contour pixels of the user and does not need any manually given information at start-up. As mentioned above, we require the background that can be seen by the cameras to be static. However, in many cases the background is not static and is changed within one or more parts of the image (for example, a chair may be moved). In order to cope with these permanent changes our segmentation system is able to recognize them if they persist for a defined time. Our system is intended to cope smoothly with varying illumination conditions and is able to remove noise almost completely. These are key problems of many segmentation techniques (for example thresholding or color-based segmentation methods).

The basic idea is to employ a three-level segmentation hierarchy that consists of a low-level, a mid-level and a high-level processing stage. Internal feedback mechanisms regulate and control parameters and adaptive learning of the underlying stages. The key benefits of our approach to segmentation are that noise and other disturbing influences (for example aliasing effects of the camera) can be eliminated almost completely.


The proposed classification technique that is used within the mid-level processing stage partly employs relative measures, so that the system is able to cope smoothly with varying illumination conditions. Moreover, the system can detect permanent changes in the background and is also able to recognize and recall previous situations.

The interaction system for back-projection walls was originally presented in [LEU01b]. Within this system the mere low-level segmentation approach that had been introduced in [LEU01a] was used. Although the results were already promising, the segmentation sometimes failed in certain situations, for example because of changing illumination conditions.

In section 2 a survey of our segmentation system is given. Sections 3, 4 and 5 explain the low-level, mid-level and high-level processing stages, respectively, in more detail. In section 6 the process of continuous learning and the internal feedback mechanisms are explained. In section 7 some examples of applications are presented. Finally, in section 8 the results are summarized.

2 Survey of the System

Our approach to segmentation consists of an initial learning phase and a subsequent application phase. During the learning phase the static background has to be presented to the system, which learns and acquires features of the background automatically. Afterwards, during the application phase, the system has sufficient information to perform the segmentation task.

Our segmentation system consists of a low-level, a mid-level and a high-level processing stage. Additionally, internal feedback mechanisms are employed to continuously adapt to possibly changing situations. Another important aspect of our three-level structure is the reduction of the computational effort that has to be spent at each processing stage. The low-level stage performs a pixelwise segmentation and provides for each pixel a fuzzy classification whether it is a "background" or a "foreground" pixel, where "foreground" in our case denotes the user. For the mid-level processing stage the image area is subdivided into small rectangular boxes. Both the results of low-level processing and box-wise calculated features are employed to assign to each box a classification result indicating whether the box content belongs to the background or to the user. The high-level processing stage incorporates application-dependent knowledge, for example the number of expected objects, their size or features of their shape. For this purpose the connected components of boxes which have been classified as foreground within the mid-level stage are used. As a result, the high-level stage passes the boxes determined as object of interest (in our case the user) back to the mid-level stage, where these results are used either to adapt parameters or to correct possibly wrong decisions. As a consequence, the mid-level stage advises the low-level processing stage to work out contour pixels of the user in the respective box areas and to learn and continuously adapt to slight changes in the background areas. The principal structure of the multi-level segmentation system is illustrated in figure 2.
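The data and feedback flow of figure 2 can be summarized in a few lines. The following is a purely structural sketch; all class and method names are illustrative and not taken from the paper.

```python
# Structural sketch of the three-stage pipeline with top-down feedback
# (class and method names are illustrative).

class SegmentationPipeline:
    def __init__(self, low_level, mid_level, high_level):
        self.low = low_level
        self.mid = mid_level
        self.high = high_level

    def process(self, frame):
        # Bottom-up pass: pixels -> boxes -> object of interest.
        pixel_memberships = self.low.classify_pixels(frame)
        box_labels = self.mid.classify_boxes(frame, pixel_memberships)
        user_boxes = self.high.select_object(box_labels)

        # Top-down feedback: the high-level result tunes the mid-level
        # parameters, and the mid-level stage advises the low-level stage
        # which box areas to keep learning as background.
        self.mid.adapt(user_boxes)
        self.low.learn_background(self.mid.background_boxes(user_boxes))
        return user_boxes
```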

3 Low-Level Processing Stage

The low-level processing stage performs a pixelwise segmentation. For this purpose both color and edge information are learned and stored in respective knowledge bases. Basically, the segmentation is performed as a kind of differencing between the initially learned background and the current video images. This approach is supplemented by the ability to store and recall a large number of color values for each pixel separately, enabling continuous learning and updating of the formerly learned color knowledge. In order to yield reliable results, color and edge segmentation are combined using fuzzy techniques. This approach has been introduced and extensively tested in [LEU01a].
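The details of the low-level stage are given in [LEU01a]. The following is only a minimal sketch of such a pixelwise combination, assuming exponential membership functions and a min-based fuzzy conjunction, neither of which is prescribed by the paper.

```python
import numpy as np

# Minimal sketch of a pixelwise fuzzy background test: difference the
# current frame against the learned background in color and edge space,
# then fuse the two memberships (fusion operator assumed, see [LEU01a]).

def low_level_foreground(frame_color, frame_edges, bg_color, bg_edges,
                         sigma_color=20.0, sigma_edge=0.2):
    color_diff = np.linalg.norm(frame_color.astype(float) - bg_color, axis=-1)
    edge_diff = np.abs(frame_edges - bg_edges)
    mu_bg_color = np.exp(-color_diff / sigma_color)  # fuzzy "is background" (color)
    mu_bg_edge = np.exp(-edge_diff / sigma_edge)     # fuzzy "is background" (edge)
    mu_bg = np.minimum(mu_bg_color, mu_bg_edge)      # fuzzy conjunction (assumed)
    return 1.0 - mu_bg                               # fuzzy "is foreground"
```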

4 Mid-Level Processing Stage

For the mid-level processing stage the image area is subdivided into small rectangular boxes; in our experiments we have worked with a regular grid of such boxes whose size is chosen relative to the camera image size. Basically, the mid-level processing stage realizes a pattern recognition system that is able to measure the similarity between the initially learned and the current box content.

Figure 2: Structure of the multi-level segmentation approach.

In order to obtain reliable classification results we employ three measures which incorporate different image features. The basic approach is to observe the background during an initial learning phase and to extract typical features. Afterwards the detectors are able to measure the similarity between the initially learned content and current video images. The definition of the detectors is general, and they are intended to be applied to each box area separately. As a result, for each box area a fuzzy membership value is obtained that describes the degree of membership to the semantic property "is background" for the respective box area. Applying a threshold to that fuzzy result finally allows the crisp classification into "background" and "foreground" boxes.
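The box subdivision and the final crisp decision are simple to realize. The sketch below uses an illustrative box size and threshold; neither value is specified in this section.

```python
import numpy as np

# Sketch: subdivide the image area into a regular grid of boxes and
# threshold the per-box fuzzy "is background" memberships into crisp labels.

def box_grid(height, width, box_h=32, box_w=32):
    # yields (y, x, h, w) for each rectangular box area
    for y in range(0, height, box_h):
        for x in range(0, width, box_w):
            yield y, x, min(box_h, height - y), min(box_w, width - x)

def crisp_labels(mu_background, threshold=0.5):
    # True = "background" box, False = "foreground" (user) box
    return mu_background >= threshold
```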

4.1 Edge-Based Relative Similarity Measure

The first detector does not work on color values, but on edge information within the image data. Edge information can be obtained by applying an arbitrary edge operator, for example the Sobel filter, to the image. For further edge filtering operations see for example [GON92]. In order to abstract from any specific edge operator, we assume the results of an edge filtering algorithm to be fuzzy membership values e(x, y) ∈ [0, 1] to the semantic property "is edge" for each pixel (x, y) of a given image. Such a general representation can easily be achieved, for example, by scaling. Our edge-based detector can be applied to an arbitrary number n_E of edge points defined by their respective pixel positions p_i (with i = 1, ..., n_E).

One major aspect of using edge information for similarity comparisons is the desire for an illumination-independent measure. Nevertheless, edge information also depends on varying illumination conditions, because a bright illumination leads to larger edge values than a darker illumination. In order to have a similarity measure that is able to cope with these influences, a relative comparison scheme for the edge values is introduced. For this purpose the relative differences between the n_E pixels are captured in an n_E × n_E feature matrix F whose elements are defined as

    f_{ij} = e(p_i) - e(p_j)    (1)

For each image I_k (with k = 1, ..., n_L) occurring during the learning phase a corresponding feature matrix F_k is determined. The matrix D is calculated as:

    D = \frac{1}{n_L} \sum_{k=1}^{n_L} F_k    (2)

Averaging the feature matrices of a sequence of n_L image frames thus yields the matrix D. For the elements d_{ij} of D the equations d_{ij} = -d_{ji} and d_{ij} = 0 for i = j hold. Thus it is sufficient to calculate only either the upper or the lower triangular matrix.

During the application phase the similarity between the current video image I_k and the initially learned images is measured by comparing the relative differences F_k between the edge values of I_k with D. The membership value to the semantic property "similar" is calculated as

    \mu_{edgeSim}(F_k) = \exp\left( -\frac{1}{c_1 n_T} \sum_{i=1}^{n_E - 1} \sum_{j=i+1}^{n_E} \left| f_{ij} - d_{ij} \right| \right)    (3)

where f_{ij} are the elements of the current feature matrix F_k, d_{ij} are the elements of the averaged differences matrix D, n_T is the number of elements in the triangular matrix and c_1 > 0 is a regulating factor. The number n_T of elements in the triangular matrices F_k and D can be calculated as

    n_T = \sum_{i=1}^{n_E - 1} \sum_{j=i+1}^{n_E} 1 = \frac{n_E (n_E - 1)}{2}.

The instability of digital video images over time raises the question of adaptiveness as far as the membership function is concerned. It has to be considered, however, that updates of the membership function with current feature matrices F_k should only be performed if it is certain that F_k still represents the background and not an occasionally moved-in foreground object. The centroid matrix D_k can be updated with a current feature matrix F_{k+1} to D_{k+1} by applying the following formula:

    D_{k+1} = \frac{k \, D_k + F_{k+1}}{k + 1}    (4)

The introduced detector \mu_{edgeSim} may work very well and reliably if the observed background area contains many edge pixels and if the differences between them are noticeable. Nevertheless, this detector may partly fail if the observed area is smooth, because a likewise smooth foreground object will not be recognized.
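The detector amounts to a few array operations per box. The sketch below follows equations (1) to (4); the regulating factor c1 defaults to an illustrative value.

```python
import numpy as np

# Sketch of the edge-based relative similarity measure, equations (1)-(4).
# `edges` holds the fuzzy edge memberships e(p_i) in [0, 1] for the n_E
# tracked edge points of one box area.

def feature_matrix(edges):
    # f_ij = e(p_i) - e(p_j); only one triangle carries information     (1)
    return edges[:, None] - edges[None, :]

def learn_centroid(edge_sequences):
    # D = average of the feature matrices over the n_L learning frames  (2)
    return np.mean([feature_matrix(e) for e in edge_sequences], axis=0)

def edge_similarity(edges, D, c1=1.0):
    # mu_edgeSim = exp(-(1/(c1*n_T)) * sum_{i<j} |f_ij - d_ij|)         (3)
    F = feature_matrix(edges)
    n_e = len(edges)
    n_t = n_e * (n_e - 1) // 2
    iu = np.triu_indices(n_e, k=1)
    return np.exp(-np.sum(np.abs(F[iu] - D[iu])) / (c1 * n_t))

def update_centroid(D_k, F_next, k):
    # running average: D_{k+1} = (k * D_k + F_{k+1}) / (k + 1)          (4)
    return (k * D_k + F_next) / (k + 1)
```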

4.2 Mean Color Similarity Measure

Realizing the problem that absolute values, whether color or edge values, heavily depend on the illumination conditions, we also propose a relative color measure that additionally comprises the structure of the considered box area. The box area itself is subdivided into an arbitrary number of parts. In each part the mean color value of each color channel is determined, and the difference to the mean color values of the other parts is calculated. The determined differences are employed to measure the similarity between current and initially learned images. In the examples of application in section 7 we usually subdivided the box areas into four equally sized quarters.

During the learning phase n_L images are used to calculate the averaged differences. For each of the n_C color channels a feature matrix D_c (with c = 1, ..., n_C) is required. Averaging the color values in the n_P parts P_r of the box area yields for each of the n_C color channels the mean color values m^k_{c,r} (with r = 1, ..., n_P and k = 1, ..., n_L). Moreover, the color values are transformed into the interval [0, 1], so that m^k_{c,r} ∈ [0, 1] holds. This transformation can easily be achieved, for example, by scaling. The difference between the box area parts is determined and averaged over all n_L images of the learning phase. This results in the n_P × n_P matrices D_c whose elements d_{c,ij} are calculated as:

    d_{c,ij} = \frac{1}{n_L} \sum_{k=1}^{n_L} \left( m^k_{c,i} - m^k_{c,j} \right)    (5)

where d_{c,ij} are the elements of the averaged difference matrix D_c, m^k_{c,i} is the mean color value of color channel c in part P_i of frame k, and n_L is the number of frames during the learning phase. Similar to the edge-based relative similarity measure introduced in section 4.1, for the elements d_{c,ij} of D_c the equations d_{c,ij} = -d_{c,ji} and d_{c,ij} = 0 for i = j hold. Thus for this detector as well it is sufficient to calculate only either the upper or the lower triangular matrix.

During the application phase a current image I, and more specifically the current box area, can be compared to the initially learned box areas. For that purpose the current mean color values m_{c,r} in each part P_r of the box area (with r = 1, ..., n_P and for each color channel c = 1, ..., n_C) are again calculated. The current feature matrices F_c are determined analogously to the learning phase: f_{c,ij} = m_{c,i} - m_{c,j}. The similarity between the initially learned image content and the current image content can be measured with the following fuzzy membership function:

    \mu_{meanColorSim}(F_1, ..., F_{n_C}) = \exp\left( -\frac{1}{c_2 n_T} \sum_{c=1}^{n_C} \sum_{i=1}^{n_P - 1} \sum_{j=i+1}^{n_P} \left| f_{c,ij} - d_{c,ij} \right| \right)    (6)

where f_{c,ij} are the elements of the current feature matrix F_c, d_{c,ij} are the elements of the averaged differences matrix D_c, n_T is the number of elements under consideration, n_P is the number of box area parts, n_C is the number of color channels under consideration and c_2 > 0 is a regulating factor. The number n_T of elements under consideration can be calculated as

    n_T = n_C \sum_{i=1}^{n_P - 1} \sum_{j=i+1}^{n_P} 1 = \frac{n_C \, n_P (n_P - 1)}{2}.

High membership values of \mu_{meanColorSim} denote a high similarity between the initially learned background and the current image content, and vice versa. As first experiments have shown, this detector is indeed rather insensitive to changing illumination conditions.
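Analogously, the mean color detector can be realized with a few array operations. The sketch below follows equations (5) and (6), using the four-quarter subdivision mentioned for section 7; the regulating factor c2 defaults to an illustrative value.

```python
import numpy as np

# Sketch of the mean color similarity measure, equations (5) and (6).
# `box` is an (H, W, n_C) array with channel values scaled to [0, 1],
# split into four equally sized quarters.

def part_means(box):
    h2, w2 = box.shape[0] // 2, box.shape[1] // 2
    parts = [box[:h2, :w2], box[:h2, w2:], box[h2:, :w2], box[h2:, w2:]]
    # (n_P, n_C) matrix of mean color values m_{c,r}
    return np.array([p.reshape(-1, box.shape[2]).mean(axis=0) for p in parts])

def color_feature_matrices(box):
    m = part_means(box)                        # shape (n_P, n_C)
    # f_{c,ij} = m_{c,i} - m_{c,j}: one (n_P, n_P) matrix per channel
    return (m[:, None, :] - m[None, :, :]).transpose(2, 0, 1)

def learn_color_centroids(learning_boxes):
    # D_c = averaged difference matrices over the n_L learning frames   (5)
    return np.mean([color_feature_matrices(b) for b in learning_boxes], axis=0)

def mean_color_similarity(box, D, c2=1.0):
    F = color_feature_matrices(box)            # shape (n_C, n_P, n_P)
    n_c, n_p = F.shape[0], F.shape[1]
    n_t = n_c * n_p * (n_p - 1) // 2
    iu, ju = np.triu_indices(n_p, k=1)
    diff = np.sum(np.abs(F[:, iu, ju] - D[:, iu, ju]))
    return np.exp(-diff / (c2 * n_t))          # membership "similar"    (6)
```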

4.3 Sum of Foreground Edges Measure

As already explained in section 3, the low-level segmentation performs a color and an edge segmentation, which are based on calculating differences between the initially learned background and the current video images.