Visual Search for Object Features

Predrag Neskovic and Leon N. Cooper
Institute for Brain and Neural Systems and Department of Physics, Brown University, Providence, RI 02912, USA
pedja@brown.edu, Leon_Cooper@brown.edu

Abstract. In this work we present a computational algorithm that combines perceptual and cognitive information during the visual search for object features. The algorithm is initially driven purely by bottom-up information, but during the recognition process it becomes increasingly constrained by top-down information. Furthermore, we propose a concrete model for integrating information from successive saccades and demonstrate the necessity of using two coordinate systems for measuring feature locations. During the search process, across saccades, the network uses an object-based coordinate system, while during a fixation the network uses a retinal coordinate system that is tied to the location of the fixation point. The only information that the network stores during saccadic exploration is the identity of the features on which it has fixated and their locations with respect to the object-centered system.

1 Introduction

When we look at the world around us, we perceive it as highly detailed, fully colored, and stable. However, our eyes neither process visual information with uniformly high resolution nor are they motionless. The only region of the visual scene that is processed with high resolution is the one very close to the fixation point. The acuity and color sensitivity of retinal cells rapidly decrease with distance from the fovea, the region of the retina that corresponds to only about the central 2 degrees of the viewed scene. In order to overcome this limitation of the optical structure of the eyes, our visual system uses saccades to reposition the fovea over different locations and thus obtain locally high-resolution samples of a visual scene.

The question is then: what information is retained during saccadic exploration, and how detailed is that information? It has been shown in numerous experiments that our visual system is fairly insensitive to visual changes in an image across a saccade, a phenomenon called change blindness [1]. As a consequence, it has been proposed [2] that because the "world is its own memory", the visual system does not need to store visual information from fixation to fixation. On the other hand, the visual memory theory of Henderson and Hollingworth [3] posits that a relatively detailed scene representation is built up in memory over time and across successive fixations. The question then becomes: how do we piece together information from different fixations? According to the composite sensory image hypothesis, the sensory images from consecutive fixations are spatially aligned and fused in a system that maps a retinal reference frame onto a spatiotopic frame [4]. However, numerous psychophysical and behavioral data from the vision and cognition literature have provided evidence against this hypothesis [5].


Another important question, related to saccadic exploration of a pattern, is what criterion our visual system uses when it selects the location on which it is going to fixate, the target location. Is this process driven purely by bottom-up information (the salient properties of the pattern), by top-down information (our expectations), or by a combination of the two? It has been known for a long time [6,7] that more informative scene regions receive more fixations, and thus informative regions are the most likely candidates for target locations. What is not known is how to define an informative region. Again, one can use only perceptual information (bottom-up), only cognitive information (top-down), or a combination of the two. Experimental evidence suggests that while initial fixations are controlled by bottom-up information, subsequent fixations are influenced by cognitive expectations [8]. However, exactly how and when (at what stages of the recognition process) these two sources of information interact with each other is still an open question.

In this work we address the previous questions and present an algorithm for searching for object features that combines perceptual and cognitive information. More specifically, the selection of the target feature is initially driven purely by bottom-up information but during the recognition process becomes increasingly constrained by top-down information. Furthermore, we propose a concrete model for integrating information from successive saccades. We show that the only information that must be retained across fixations, in order to segment and recognize an object, is the location and identity of some of the features on which the system has fixated. We also demonstrate that our model can be utilized for building a real-world recognition system. To this end, we constructed a working system and tested it on the difficult task of searching for letters in handwritten words.

The paper is organized as follows: In section 2 we describe the feature-based object representation and the architecture of the network that integrates information from different regions of the pattern. In section 3 we present a detailed algorithm for searching for object features and the mechanism that the system uses to resolve conflicting configurations. We illustrate the results of our algorithm when applied to a real-world dataset of cursive script in section 4. Final remarks and a summary are given in section 5.

2 Object Representation

In our model, an object is represented as a collection of features of specific classes arranged at specific locations with respect to one another [9,10]. Detecting an object is then equivalent to detecting its constituent features and estimating their locations. The main problem in detecting individual features is that the information contained in a local region of an image is often ambiguous and can therefore be interpreted in many different ways. The human visual system overcomes this ambiguity by incorporating contextual information. During fixation on a particular region of the object, we use contextual information, such as the locations and identities of surrounding features, to help determine the identity of the fixated region. Similarly, information from previous fixations is used as contextual information to help determine the identity of the region around the current fixation.

Feature Detectors. Let us assume that we have $N$ feature detectors, each selective to a feature of a particular class. For example, if the object is a face, then the features can be the nose, the mouth, and the eyes. If the object is a word, then the features can be the letters of the alphabet, $N = 26$. If we denote by $x_i$ the location of the pattern over which a feature detector is positioned, then the output of the feature detector, $d_i(x_i)$, is proportional to the confidence that the local region around $x_i$ represents the feature of class $i$, where $i = 1, \ldots, N$. The closer in appearance the region is to the feature that the detector is selective to, the higher the output of that feature detector.

Simple Units (SUs). Let us now choose one region of the pattern and assume that it represents a specific feature, say the letter "a" of the word "act". In order to incorporate contextual information, we construct a set of units, called simple units, that capture the locations and identities of the surrounding features. The sizes and distribution of the receptive fields of the (surrounding) simple units are designed in the following way. The receptive field of the simple unit that is selective to the letter "c" is constructed so that it can capture all possible variations in the location of the letter "c" given the location of the letter "a". We denote this unit SU(2|1). The simple unit that is selective to the letter "t", SU(3|1), is further away from the central unit, and its receptive field is larger than that of SU(2|1), since the variations in feature sizes and locations accumulate. In general, both the sizes and the overlap of the receptive fields of the simple units become progressively larger with distance from the central unit. However, the surrounding SUs that are nearest neighbors to the central SU do not overlap with the central SU, and therefore the ordering is preserved within the local neighborhood around the fixation point. If we denote by $R_{ji}$ the receptive field of the simple unit that is selective to the $j$-th feature, and by $x_j$ the location of the $j$-th feature, then the simple unit fires if the feature is detected (its value is above some threshold, $d_j(x_j) > \mathrm{threshold}$) and is located within the unit's receptive field ($x_j \in R_{ji}$). In our implementation, we set the threshold to a very small value close to zero. We denote by the symbol $y$ the location of a feature with respect to the location of the fixation point, and by the symbol $x$ the location of a feature with respect to a coordinate system that is fixed to the object, e.g. to a specific object feature. The activation of the simple unit whose center is at distance $y_j = x_j - x_i$ from the fixation point $x_i$ is calculated as

$$ s_{ji}(y_j) = s_{ji}(x_j \mid x_i) = \max_{x_j \in R_{ji}} \, [d_j(x_j)]. \qquad (1) $$
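As a minimal sketch (the function and variable names are ours, not the paper's), Eq. (1) can be written as follows, assuming the detector outputs have been precomputed for every candidate location:

```python
def simple_unit_activation(detector_output, receptive_field, threshold=1e-6):
    """Eq. (1): a simple unit outputs the maximum detector confidence
    d_j(x_j) over the locations x_j inside its receptive field R_ji,
    ignoring sub-threshold detections."""
    values = [detector_output[x] for x in receptive_field
              if detector_output[x] > threshold]
    return max(values, default=0.0)
```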

Complex Units (CUs). Although each simple unit processes only local information, the combination of all the simple units associated with a given central feature provides contextual information. Since feature detection and location estimation are much less reliable for features that are further away from the fixation point than for features that are closer, we weigh the contribution of each simple unit differently. The weighting factor in our implementation is set to be inversely proportional to the size of the simple unit's receptive field, $\omega_{ji} = 1/\mathrm{size}(R_{ji})$. In this way, the contribution of simple units that are closer to the fixation point (those that have smaller receptive fields) is larger than that of simple units that are further away. The output of the complex unit associated with the $i$-th object feature is given as

$$ c_i(x) = d_i(x_i) \cdot \frac{1}{N-1} \sum_{j=1,\, j \neq i}^{N} \omega_{ji} \cdot s_{ji}(x_j \mid x_i), \qquad (2) $$

where $d_i(x_i)$ is the activation of the feature detector positioned over the $i$-th object feature, and $N$ represents the number of object features. For a given object, there are as many complex units as there are features. The receptive fields of all the simple units that belong to the same complex unit form the complex unit's receptive field. Since the receptive fields of the simple units closer to the central SU are smaller than the receptive fields of those that are further away, the complex unit captures with high accuracy only the locations of the features that are close to the fixation point. As a consequence, a complex unit can determine only whether surrounding features are correctly positioned with respect to the central feature, not whether they are correctly positioned with respect to one another.

Object Units. In order to capture different regions of an object with high resolution, and in order to correctly estimate the locations of the features with respect to one another, the system has to probe the pattern at different locations. We call these exploratory movements of the system saccades. At the top of the processing hierarchy are the object units, one unit representing each object. The outputs of all the complex units are supplied to the object unit and combined in the following way:

$$ o(x) = \sum_{i=1}^{N} c_i(x), \qquad (3) $$

where $x = (x_1, \cdots, x_N)$ is a particular configuration of selected features. An object unit therefore represents an object regardless of any specific point of view or fixation point. This hierarchical representation, consisting of different collections of simple units, complex units, and an object unit, comprises a neural network-like architecture that we use to represent each object from a library of objects.

In summary, simple units provide local information about the presence of specific features within specific regions; complex units integrate information from different regions, given a specific fixation location; and object units integrate information from different fixations. In order to accomplish this task of integrating information across fixations, the system has to use two different coordinate systems. One system is tied to the fixation point, the retinal coordinate system, while the other is object-based and can be centered on any object feature. In the following section we will see how the system combines these two coordinate systems during the visual search for object features.
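Continuing the sketch above (again with our own naming and data layout: a list of per-class detector maps and one receptive field per simple unit), the complex and object units of Eqs. (2) and (3) might be written as:

```python
def complex_unit_output(i, detector_outputs, x, receptive_fields):
    """Eq. (2): the detector confidence d_i(x_i) at the fixated feature,
    modulated by the weighted average of the surrounding simple-unit
    activations; omega_ji = 1/size(R_ji) favors units close to fixation."""
    N = len(receptive_fields)
    context = 0.0
    for j in range(N):
        if j == i:
            continue
        omega = 1.0 / len(receptive_fields[j])   # 1 / size of R_ji
        context += omega * simple_unit_activation(detector_outputs[j],
                                                  receptive_fields[j])
    return detector_outputs[i][x[i]] * context / (N - 1)

def object_unit_output(complex_outputs):
    """Eq. (3): the object unit sums the complex-unit outputs obtained
    for a particular configuration of selected features."""
    return sum(complex_outputs)
```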

3 The Search Algorithm

In this section we present the algorithm that combines perceptual and cognitive information during the process of searching for object features. In order to simplify the description, we will assume that an object is a word and that its features are letters. However, the algorithm is general and can be applied to any other object consisting of different features. Instead of operating on the pattern, the network operates on the detection matrix that represents the sensory input and consists of the outputs of feature detectors (in this case letter detectors) whose receptive fields overlap and completely cover the pattern. A row of the detection matrix therefore represents a letter class, and a column corresponds to the position of the letter within the pattern. We will call the letter on which the system fixates the central letter, the corresponding location within the matrix the central column, and the (simple) unit that is positioned over the central letter the central unit. All the simple units that surround the central unit are called the surrounding units, and they provide contextual information. In the following, we will assume that a particular dictionary word is given and that the task of the recognition system (the network) is to select the elements from the detection matrix that correspond to the letters of the given word and thus segment the pattern into letters.
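As an illustration with made-up numbers (a toy three-letter alphabet over a short pattern, not the values from the figures below), a detection matrix might look like this:

```python
import numpy as np

# Rows are letter classes ("a", "c", "t"); columns are positions in the
# pattern. Entry [i, k] is the confidence that position k holds letter i.
detection_matrix = np.array([
    [0.0, 0.1, 0.8, 0.2, 0.0],   # "a"
    [0.0, 0.3, 0.1, 0.4, 0.7],   # "c"
    [0.1, 0.0, 0.2, 0.9, 0.0],   # "t"
])
```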

Fig. 1. The complex unit CU1 represents the word "act" from the perspective of the letter "a". The detection matrix corresponds to the pattern representing the word "act". The central letter is the letter "a" in the 7th column. The simple unit selective to the letter "c" selects the letter "c" from the 12th column, and the simple unit representing the letter "t" selects the letter "t" from the 11th column. The segmentation of the pattern determined by the complex unit CU1 is therefore "atc", which is obviously an incorrect segmentation.


The recognition process starts by selecting the most prominent letter from the pattern, the element of the detection matrix that has the highest value. If we think of the detection matrix as a saliency map, then this procedure is equivalent to the winner-take-all mechanism proposed by Koch and Ullman [11]. Note that at this stage the feature selection is purely a bottom-up process. Let us now assume that the selected letter, the central letter, is one of the letters of the given dictionary word. All complex units (more specifically, their central units) are then positioned over this letter, and we say that the system fixates on the central letter. However, not all the complex units will be equally activated: only those complex units whose central unit is selective to the central letter will fire. For example, if the given dictionary word is "again" and the network fixates on the letter "a", then two complex units will have central units that are selective to this letter: the complex unit CU(1), with surrounding simple units selective to the letters "-gain", and the complex unit CU(3), with surrounding simple units selective to the letters "ag-in". Which of those two complex units has higher activation depends on the presence of the letters to which each complex unit (and the corresponding simple units) is selective. If the network finds the letters "-gain" at the expected locations, then the unit CU(1) becomes activated. On the other hand, if the network finds the letters "ag-in" at the expected locations, then the unit CU(3) becomes activated. The complex unit with the highest activation then segments the pattern by choosing the letters that activate it the most. We will call this segmentation a tentative segmentation. Unfortunately, due to the often high overlap of the simple units' receptive fields, this segmentation can be incorrect in the sense that the selected letters are not at correct locations with respect to one another, as illustrated in Figure 1.

The only way to ensure that the selected letters are correctly positioned with respect to one another is for the network to fixate on each of them, since the ordering is preserved only locally, for the nearest-neighbor simple units. We will call the letters that are correctly positioned with respect to one another the active letters. Similarly, we call the complex unit with the highest activation, for a given fixation, the active complex unit. The first letter on which the network fixates therefore becomes the first active letter. The network then selects the target letter, the location within the pattern on which it is going to fixate next. The target letter is selected by one of the simple units of the active complex unit. More specifically, the new fixation point becomes the location of the letter that is selected by the simple unit with the highest activation. In this way, the network combines top-down information (the expectation about the location of the letter) with bottom-up information provided by the letter detectors.

The question is now: what information about the pattern, given the current fixation, is stored in short-term memory? The only information that is retained across fixations is the location and identity of each active letter. Information about the locations of the active letters is important so that the network does not make future fixations on the same locations. In effect, in this way the network implements an inhibition of return mechanism, which is necessary so that it does not enter an endless loop.
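A sketch of this bottom-up start of the search, under the same assumptions and naming conventions as the earlier snippets (and assuming, as the text does, that the fixated letter occurs in the given word):

```python
import numpy as np

def most_prominent_letter(detection_matrix):
    """Winner-take-all over the detection matrix [11]: the (letter class,
    column) with the strongest response becomes the first fixation."""
    return np.unravel_index(np.argmax(detection_matrix),
                            detection_matrix.shape)

def active_complex_unit(word, central_letter, complex_outputs):
    """Among the complex units whose central unit is selective to the
    fixated letter, the most active one becomes the active complex unit
    and produces the tentative segmentation."""
    candidates = [i for i, letter in enumerate(word) if letter == central_letter]
    return max(candidates, key=lambda i: complex_outputs[i])
```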


Fig. 2. The complex unit CU2 represents the word "act" from the perspective of the letter "c". The detection matrix corresponds to the pattern representing the word "act". The central letter is the letter "c" in the 9th column. The central unit of the complex unit CU2 is selective to the letter "c" and is located over the letter detector that has detected the letter "c" with confidence 0.7. The active letters are the letters "a" and "t". The location of the active letter "a" is within the receptive field of the corresponding simple unit selective to the letter "a", but the location of the active letter "t" is not within the expected region, which is the receptive field of the simple unit selective to the letter "t". We say that the central unit is in conflict with the active letter "t".

The locations of the active letters are measured with respect to some point within the object; for example, the location of the first active letter can be used as the center of the coordinate system. It is important to note that the spatial arrangement of the receptive fields of the simple units (that belong to the same complex unit) is fixed with respect to each other, but the position of the complex unit over the detection matrix is not. With each selection of a new target letter, the network is repositioned so that all the central units (of all the complex units) are placed over the target letter. If an active letter falls within the receptive field of the simple unit that is selective to it, then the simple unit does not process information using Eq. (1) but instead immediately chooses the active letter and outputs the confidence with which the active letter was detected.

As we mentioned earlier, one of the consequences of the "fovea-like" distribution of the receptive fields of the simple units is that feature ordering is preserved only with respect to the fixation point.
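The target-selection and inhibition-of-return bookkeeping described above might be sketched as follows; `simple_unit_choices` (our name) pairs each surrounding simple unit's activation with the pattern location it selected:

```python
def next_fixation(simple_unit_choices, active_locations):
    """Choose the next target: the location selected by the most active
    surrounding simple unit, skipping locations of letters that are
    already active (inhibition of return)."""
    candidates = [(activation, location)
                  for activation, location in simple_unit_choices
                  if location not in active_locations]
    if not candidates:
        return None   # every expected letter has already been visited
    activation, location = max(candidates)
    return location
```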


This means that the central letter is not always at the correct location with respect to one or more active letters, as illustrated in Figure 2. If this situation occurs, the network probes the local neighborhood around the central letter and tries to find a letter of the same class as the central letter whose location is not in conflict with the active letters. This small repositioning of the network is similar to the microsaccades performed by the eyes during visual search. If the network finds a new (central) letter that is not in conflict with the active letters, then this letter is added to the group of active letters and the network continues to search for the remaining letters. Otherwise, if the network cannot find a letter that is not in conflict with the existing active letters, it suppresses either the central letter or the active letters that are in conflict with it. This is done by comparing the values of the object unit for two scenarios: a) the existing active letters are accepted and the central letter is rejected, and b) the active letters that are in conflict with the central letter are rejected, and the central letter, together with the active letters that are not in conflict with it, is accepted. We call the value of the first configuration the old value of the object unit and the value of the second configuration the new value of the object unit. In order to calculate the new value (without the conflicting active letters), the network has to fixate on all the active letters that are not in conflict with the central letter, since previous fixations on those active letters (the previous values of the corresponding complex units) did not include the current central letter. This means that, for the network, the landscape of feature detector activations, the pattern, appears different as a consequence of exploring the pattern and accumulating new information. The central letter is accepted (and the conflicting active letters suppressed) only if the new object unit value is strictly greater than the old object unit value. Once the network accepts or rejects the central letter, it continues to search for a new target letter until all letters of the given dictionary word are discovered.

The visual search for object features when it is not known in advance what object the pattern represents does not significantly differ from the algorithm described above for searching for the features of a known object. We will again focus on handwriting recognition and assume that the pattern represents an unknown dictionary word. The network first selects the most prominent letter from the detection matrix, but this time, instead of using only the complex units of one object unit, the central units of the complex units of all the object units (all dictionary words) are positioned over this letter. The complex unit that has the highest activation propagates its output to the corresponding object unit, which becomes the active object unit. The word represented by this active object unit now imposes structure on the pattern, in the sense that the network starts to search for the letters of only the active object unit. In this way, the algorithm reduces to the previously described procedure for searching for the features of a given object.
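A sketch of the accept/reject decision for a conflicting central letter; `reevaluate` is a hypothetical callback (our invention) that re-fixates the non-conflicting active letters and returns the object-unit value of the new configuration:

```python
def accept_central_letter(old_object_value, central, conflicting, reevaluate):
    """Compare two configurations: (a) keep the conflicting active letters
    and reject the central letter (the old object-unit value), or (b) keep
    the central letter plus the non-conflicting active letters (the new
    value, computed by the reevaluate callback). The central letter is
    accepted only on a strict improvement of the object-unit value."""
    new_object_value = reevaluate(central, conflicting)
    return new_object_value > old_object_value
```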
Since the network always searches for the letters of the dictionary word that is associated with the object unit with the highest value, it might happen that during the search process the network switches from searching for the letters of one dictionary word to searching for the letters of some other dictionary word. The visual search is completed when all the letters of the active dictionary word are found and the object unit's value is above some threshold value.

4 Implementation and Results

We tested the search algorithm on a database of online cursive words, where the features were letters and the objects were dictionary words. The letter detectors were designed using a weight-sharing neural network [12], and the receptive fields of the simple units were designed using pairwise probabilities of letter locations, as described in [9]. In addition to the 26 letters of the alphabet, we introduced features that denote the beginning and end of a word, so our alphabet effectively consists of 28 symbols. The beginning and end features are important in order to provide context for one-letter words or for words that can also be part of longer words, such as the word "act", which is also part of the words "actual", "activation", "fact", "exact", etc.

The only way to verify the accuracy of the search algorithm directly is to compare the segmentation obtained with our algorithm to ground-truthed data, where the location of every letter of every dictionary word is known. However, since pre-segmented data is not available for this database, another possibility is to compare the recognition rates of our algorithm to those of some of the best recognition algorithms. We constructed two systems for recognition of online cursive script: one based on the Hidden Markov Model (HMM), which is the state-of-the-art model for handwriting recognition, and the other based on the Interactive Parts (IP) model [13]. The objective function for the IP model is exactly the same as the one we use in this paper, except that in the IP model only first-neighbor interactions are considered. This reduced contextual information has important consequences, since the model then becomes much more tractable and one can use dynamic programming to solve the objective function exactly.

Both the HMM and the IP model give comparable results, and both are slightly better than ours. The recognition accuracy of our system varies from around 65% to over 90% for different writers, depending on how clearly the words are written. On average, the accuracy of our system is about 4% lower than that of the HMM and IP models. However, we should emphasize that the recognition rate depends not only on the correct segmentation of the pattern but also on the way the outputs of the letter detectors are combined (the connections between the units of the network); the recognition accuracy is therefore just one way of testing the performance of the search algorithm.

5 Summary

In this work we presented a computational algorithm that combines perceptual and cognitive information during the process of searching for object features. Our algorithm, as suggested by numerous experiments [8], is initially driven purely by bottom-up information but during the recognition process becomes increasingly constrained by top-down information.


The network that performs the search algorithm utilizes contextual information on two levels. During a fixation, the locations and identities of surrounding features provide context, while during the search process contextual information is represented through the locations and identities of visited (active) features. We showed that in order to capture variations in feature locations, the receptive fields of the simple units, as well as their overlap, become progressively larger with distance from the fixation point. Therefore, the network can estimate with high resolution only the locations of features that are close to the fixation point. In order to estimate the locations of all the features with high resolution, and thus ensure that the features are correctly positioned with respect to one another, the network has to make saccadic movements. As a consequence of the foveal distribution of the receptive fields, some of the features that activate simple units may be incorrectly positioned with respect to one another. We described a detailed mechanism for resolving such conflicting configurations and showed that in some situations the network benefits from making microsaccades.

We also demonstrated the necessity of using two coordinate systems for measuring feature locations. During the search process, across saccades, the network uses an object-based coordinate system that can be centered on any feature of the object, while during a fixation the network uses a retinal coordinate system that is tied to the location of the fixation point. The only information that the network stores during saccadic exploration is the identity of the active features, on which it has fixated, and their locations with respect to the object-centered system. This information allows the network to effectively implement an inhibition of return mechanism and therefore enhance processing by withdrawing attention from previously attended locations.

We tested the search algorithm on real-world data of online cursive script and achieved very high recognition rates. The performance of the system compares favorably even to state-of-the-art systems for handwriting recognition. We believe that, in addition to providing an insight into information processing by the human visual system, one of the major strengths of our algorithm is that it demonstrates that some mechanisms of human vision can be successfully used in constructing an efficient system for real-world applications.

Acknowledgments. This work is supported in part by the ARO under contract W911NF-04-1-0357.

References

1. Simons, D.J., Levin, D.T.: Change blindness. Trends in Cognitive Sciences 1 (1997) 261–267
2. O'Regan, J.K.: Solving the 'real' mysteries of visual perception: The world as an outside memory. Canadian Journal of Psychology 46 (1992) 461–488


3. Hollingworth, A., Henderson, J.M.: Accurate visual memory for previously attended objects in natural scenes. Journal of Experimental Psychology: Human Perception and Performance 28 (2002) 113–136
4. Jonides, J., Irwin, D.E., Yantis, S.: Integrating visual information from successive fixations. Science 215 (1982) 188
5. Pollatsek, A., Rayner, K.: What is integrated across fixations? In: Eye Movements and Visual Cognition. Springer-Verlag (1992) 166–191
6. Buswell, G.T.: How People Look at Pictures. Univ. of Chicago Press, Chicago (1935)
7. Yarbus, A.L.: Eye Movements and Vision. Plenum, New York (1967)
8. Henderson, J.M., Hollingworth, A.: High-level scene perception. Annu. Rev. Psychol. 50 (1999) 243–271
9. Neskovic, P., Cooper, L.: Neural network-based context driven recognition of online cursive script. In: 7th International Workshop on Frontiers in Handwriting Recognition (2000) 352–362
10. Neskovic, P., Schuster, D., Cooper, L.: Biologically inspired recognition system for car detection from real-time video streams. In: Rajapakse, J.C., Wang, L. (eds.): Neural Information Processing: Research and Development. Springer-Verlag (2004) 320–334
11. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. Hum. Neurobiol. 4 (1985) 219–227
12. Rumelhart, D.E.: Theory to practice: A case study – recognizing cursive handwriting. In: Baum, E.B. (ed.): Computational Learning and Cognition: Proceedings of the Third NEC Research Symposium. SIAM, Philadelphia (1993)
13. Neskovic, P., Davis, P., Cooper, L.: Interactive parts model: an application to recognition of on-line cursive script. In: Advances in Neural Information Processing Systems (2000)