Role of gestalt principles in selecting attention areas ...

Role of gestalt principles in selecting attention areas for object recognition

Jixiang Shen, Amitash Ojha and Minho Lee*

School of Electronics Engineering, Kyungpook National University, 1370 Sankyuk-Dong, Puk-Gu, Taegu 702-701, Republic of Korea

Abstract. Human attention plays an important role in the human visual system. We assume that the Gestalt laws are among the important factors that guide human selective attention. In this paper, we present a series of studies testing the hypothesis that regions of an image that receive more attention in an object recognition task conform to one or more Gestalt principles, subconsciously attract human attention, and thereby help in object recognition. In our study, we collected the attended parts of images by analyzing the eye movements of participants. We then compared the Gestalt scores of highly attended parts with those of non-attended random parts. Our results suggest that continuity and symmetry of features attract human attention. We argue that analyzing parts with high Gestalt scores can yield better object recognition results than analyzing random parts of an image.

Keywords: Gestalt principle, selective attention, perception, symmetry, continuity, regularity

1 Introduction

Attention plays an important role in human cognition: it allows selective processing of useful information while less important information is ignored [1]. Attention can operate in multiple modalities, but the most important among them is visual attention. Various theories and models have been proposed to describe the process of attention and its role in visual perception [2]. For example, James proposed the “spotlight model” [3]. The model suggests that attention works with a “focus”, a “fringe”, and a “margin”. The “focus” extracts high-resolution information from the visual scene. The “fringe”, which surrounds the focus, extracts crude information. The edge of this fringe is called the “margin”. Eriksen proposed the “zoom-lens model”, which is similar to James’ model [4]. This model postulates that the deployment of attention can vary from a sharp focus to a broad window. Discussions of attention also commonly distinguish between bottom-up and top-down processes. According to James, the bottom-up process is driven by properties of the object [5] that can attract human attention subconsciously.

adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011

In AI, especially in computer vision, several attempts have been made to imitate the human visual system using attention models. For example, Park et al. [6] used independent component analysis (ICA) and entropy as a filter to select significant regions in an image. Jeong et al. developed another model that generates a saliency map based on bottom-up and top-down processes [7]. Gestalt theory [8] attempts to describe how people organize visual elements into groups or unified wholes by applying various principles such as similarity, continuity, regularity, and symmetry. In the larger scheme of our research, we are interested in developing an alternative cognitive model for object recognition using Gestalt principles. Moreover, we assume that Gestalt principles guide human attention. In other words, features with properties that support Gestalt grouping attract human attention and ultimately help in recognizing objects.

The rationale for our assumption is based on the perception-action cycle [9], which suggests that top-down and bottom-up processes work in tandem and mutually contribute to each other’s success. The bottom-up process collects low-level perceptual features from the environment and organizes them using various integration mechanisms (including Gestalt principles) into a partial perception (an estimate). In return, this partial perception guides attention, through the top-down process, toward a search for more specific features that confirm and complete the perception process. It is therefore assumed that features with similar patterns that allow possible grouping are attended to during perception. We try to confirm this assumption in our research.

In this paper, we present a series of three studies. We hypothesized that the parts of an image that receive maximum human attention in an object recognition task conform to at least one Gestalt principle more significantly than non-attended random parts of the image. To test our hypothesis, we compared the Gestalt scores (calculated using previously developed algorithms) of attended regions and non-attended random regions of an image. We call them “attention patches” and “random patches”, respectively.

The rest of the paper is organized as follows. In Section 2, we describe the Gestalt principles in general and the principles of (1) continuity, (2) regularity and (3) symmetry in particular. In Section 3, we present our study, which was conducted in three steps. In Section 4, we discuss the implications and shortcomings of our study, and in the last section we present our conclusions and future directions of research.

2 Gestalt principle

2.1 Principle of Regularity, Continuity and Symmetry

In general, Gestalt psychology attempts to describe how people organize visual elements into groups or unified wholes by applying various principles [10], such as similarity, proximity, continuity, regularity, and symmetry. We explored three of them in the attended regions of an image, namely the principles of (1) continuity, (2) regularity and (3) symmetry. The “principle of continuity” suggests that humans tend to organize continuous parts of objects as a whole. For example, in Fig. 1 (a), a cross is perceived more prominently than two right-angle shapes. The “principle of regularity” postulates that elements of objects are perceptually grouped together if they form a pattern that is regular, simple, and orderly. For example, in Fig. 1 (c), the two squares constructed from lines are arrayed in such an orderly way, and so enhanced by the irregular lines, that they are easily perceived against the complex background. The “principle of symmetry” suggests that humans perceive objects by considering parts that are symmetrical. For example, in Fig. 1 (d), three pairs of symmetrical brackets are perceived rather than six individual brackets.

Fig. 1. (a) Cross, (b) Celtic Knot, (c) Collinear pattern, (d) Law of symmetry

3 The experiment

To test the above-mentioned hypothesis, we conducted three related studies. In the first study, we collected regions of attention in an image by analyzing the fixation count and fixation time (using eye movement) of participants while they performed an object recognition task. In the second study, we calculated Gestalt scores of regularity, continuity and symmetry in the mean attention regions and in non-attended random regions, and then compared them. These scores were calculated using previously developed algorithms. A high score for a particular principle in a particular patch meant that the patch was more likely to be grouped under that principle. In the third study, we verified our assumption by highlighting (1) attention areas with high Gestalt scores and (2) non-attended random areas, and by presenting them to participants in an object recognition task. In this section we discuss the methods and results of our studies.

3.1 Study 1: Identifying attention regions using eye-movement

In the first study, we collected images of 60 objects with complex backgrounds in the following three categories: (1) animals in natural scenes, (2) objects of daily use in normal environments and (3) vehicles in urban and rural environments. The objective of this study was to identify the regions of each image that attracted the attention of participants while they performed an object recognition task.

3.1.1. Participants and procedure: Twelve graduate students (3 females and 9 males; mean age = 21) from Kyungpook National University, Korea, participated in the study. Participants were shown 60 images from the three above-mentioned categories (20 each) in random order on a 21 inch computer screen (1280x780). They were asked to look at each image and recognize the target object. There was no time limit, and they were asked to press the space bar once they had recognized the object. While participants performed the task, their eye movements were recorded using a Tobii eye tracker from a distance of 60-70 centimeters [11]. Although the viewing was binocular, we recorded only the movement of the left eye because of certain hardware constraints.

3.1.2. Output of the study:

Fig. 2. Sample images with gaze plot; (a) original image, (b) gaze plot of one participant, (c) average heat map

The output of this study comprised eye movement samples of 12 participants for 60 images with complex backgrounds in three categories. We calculated the mean (1) fixation count and (2) fixation time to identify the most attended regions of each image. Fig. 2 shows an example of a gaze plot and a heat map of mean fixation. Based on this analysis, the most attended regions were marked as areas of interest (AOI) and collected as “attention patches”. The same number of non-attended patches, each of the same size as an attention patch, was collected at random from all images as “random patches”. Both kinds of patches were used in the next study.

3.2 Study 2: Analysis and comparison of gestalt scores

In this study, we compared the Gestalt scores of “attention patches” and “random patches” for all images. To obtain the scores for the three Gestalt principles, namely (1) regularity, (2) continuity and (3) symmetry, we used previously developed algorithms. Our methods of calculating the scores for these three principles are described below.

3.2.1. Principle of continuity: The continuity score is defined by the longest contour line in a patch. We followed Dou and Kong’s [12] method for contour detection. The Canny operator was used to obtain the edge information. We then try to generate a smooth line that is as long as possible. To do this, we randomly take a start point A and consider the eight directions around it. If we find a neighboring edge point B near the start point, the direction of AB is taken as the initial direction of the line. To keep the line smooth, we search for an edge point C near B such that the direction of BC equals that of AB. If there is no such C, we search for an edge point C′ near B such that the direction of BC′ deviates from that of AB by ±45°. This search is repeated until no points are left.
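The tracing procedure described above can be illustrated with a minimal sketch. It is not the authors' or Dou and Kong's implementation. It assumes that the edge map is already available as a set of (row, column) pixels (for example, from a Canny detector), it tries every start point and direction instead of a random one, and the names `DIRS`, `trace_from` and `continuity_score` are hypothetical.

```python
# Eight neighbour directions (row, col), ordered in 45-degree steps.
DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]


def trace_from(edges, start, d):
    """Follow edge pixels from `start`, beginning in direction index `d`.

    At every step the current direction is tried first; if no edge pixel
    lies there, a turn of +45 or -45 degrees is allowed (assumption: this
    is how the "AB +/- 45 degrees" rule is applied). Returns the number of
    pixels on the traced line.
    """
    visited = {start}
    r, c = start
    length = 1
    while True:
        for turn in (0, 1, -1):  # straight first, then +/-45 degrees
            nd = (d + turn) % 8
            nxt = (r + DIRS[nd][0], c + DIRS[nd][1])
            if nxt in edges and nxt not in visited:
                visited.add(nxt)
                r, c = nxt
                d = nd
                length += 1
                break
        else:
            return length  # no admissible continuation: the line ends


def continuity_score(edges):
    """Length of the longest smooth line over all start points and directions."""
    best = 1 if edges else 0
    for start in edges:
        for d in range(8):
            first = (start[0] + DIRS[d][0], start[1] + DIRS[d][1])
            if first in edges:  # a neighbour B fixes the initial direction AB
                best = max(best, trace_from(edges, start, d))
    return best
```

With this sketch, a straight run of five edge pixels scores 5, whereas an L-shaped run of five pixels scores only 3, because the 90° corner exceeds the ±45° tolerance and ends the line.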

3.2.2. Principle of Regularity: The regularity of a point A is defined in terms of the “surrounding edge pixels expanding acceleration”. We use the Canny operator to obtain the edge information of the image. We then check the area surrounding A (a 5x5 box with A at its center) and count the number of edge pixels. We expand the box by 2 pixels in width each time (5x5, 7x7 …) and record the number of added pixels. We believe that in a regular area the rate at which edge pixels are added should approach a constant, which means that the edge pixels expanding acceleration should be small and approach zero. Suppose $i$ is the current expansion step and $P_i$ is the count of edge pixels inside the current box. The acceleration of the surrounding edge pixels is defined as:

$$R_i = (P_{i+1} - P_i) - (P_i - P_{i-1}) \qquad (1)$$

A smaller $R_i$ means higher regularity. To measure the regularity of an image, we compute the regularity $R_A$ of each edge point A and take the average value:

$$R_{image} = \frac{\sum_{A \in \text{edge of image}} R_A}{N} \qquad (2)$$
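Eqs. (1) and (2) can be illustrated with a short sketch under stated assumptions. The text does not say how the accelerations $R_i$ of one point are combined into a single value $R_A$, so the sketch assumes the mean of their absolute values. It also assumes boxes of 5x5, 7x7, … pixels (half-width 2, 3, …) and a fixed number of expansion steps. The edge map is taken as a set of (row, column) pixels, and all function and parameter names are hypothetical.

```python
def count_edges(edges, r, c, half):
    """P_i: number of edge pixels in the (2*half+1)-sided box centred on (r, c)."""
    return sum(1 for (er, ec) in edges
               if abs(er - r) <= half and abs(ec - c) <= half)


def point_regularity(edges, r, c, steps=4):
    """R_A for one point: mean |R_i| over the expansion steps (assumed aggregate).

    Boxes are 5x5, 7x7, ... so the half-width starts at 2 and grows by 1.
    `steps` (an assumed parameter) must be at least 3 to yield one R_i.
    """
    counts = [count_edges(edges, r, c, 2 + i) for i in range(steps)]
    # Eq. (1): R_i = (P_{i+1} - P_i) - (P_i - P_{i-1})
    accel = [(counts[i + 1] - counts[i]) - (counts[i] - counts[i - 1])
             for i in range(1, steps - 1)]
    return sum(abs(a) for a in accel) / len(accel)


def image_regularity(edges, steps=4):
    """Eq. (2): average of R_A over all N edge pixels (lower = more regular)."""
    if not edges:
        return 0.0
    total = sum(point_regularity(edges, r, c, steps) for (r, c) in edges)
    return total / len(edges)
```

For a point in the middle of a long straight line the box counts grow by a constant amount at every step, so the acceleration is zero. The brute-force counting is quadratic in the number of edge pixels and is meant only to make the definition concrete.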

In Eq. (2), $N$ is the number of edge pixels in the image.

3.2.3. Principle of Symmetry: To test the symmetry of an image, we use the bottom-up saliency map (SM) method with symmetry information [7]. To generate a symmetry feature map, we take K-directional orientation features as input and compare the orientation features located on opposite sides. From this information we generate Gaussian-blurred pyramid orientation features at various scales. Finally, the symmetry axis $S(f, n, k)$ at the $f$-th pyramid orientation feature is calculated using Eq. (3), where $n$ is the location within an image, $F$ is the total number of pyramids, $M$ is the level of blurred images, $k$ is the orientation direction, $GP^{m+f}$ is the $(m+f)$-th orientation Gaussian pyramid, and $m$ represents a level of blurred image. $\gamma_r$ and $\delta_r$ are weight parameters and $\varphi(x) = \max(x, 0)$. $\bar{k}$ denotes the direction opposite to $k$, and $K$ is the total number of directions being considered. For a location $n = (x, y)$, the locations $n_1 = (x_1, y_1)$ and $n_2 = (x_2, y_2)$ are obtained by Eq. (4):

$$S(f, n, k) = \varphi\left( \sum_{m} \sum_{r} \left[ \gamma_r \cdot \left( GP^{m+f}(n_1, k+r) + GP^{m+f}(n_2, K-k) \right) - \delta_r \cdot \left| GP^{m+f}(n_1, k+r) - GP^{m+f}(n_2, K-k) \right| \right] \right), \quad f = 0, 1, \cdots, F-M, \quad k = 0, 1, \cdots, \frac{K}{2} - 1 \qquad (3)$$

$$\begin{aligned} x_1 &= x + A_m \cos\alpha_1, & y_1 &= y + A_m \sin\alpha_1 \\ x_2 &= x - A_m \cos\alpha_1, & y_2 &= y - A_m \sin\alpha_1 \\ \alpha_2 &= \alpha_1 + \pi, & A_m &= 2^m A_0, \qquad \alpha_1 = \frac{2k\pi}{K} \end{aligned} \qquad (4)$$
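The pair of opposite locations in Eq. (4) can be computed directly, as in the following minimal sketch. The function name `opposite_locations`, the default number of directions `K=8` and the base distance `A0=1.0` are assumptions made for illustration; the text does not give these values.

```python
import math


def opposite_locations(x, y, k, m, K=8, A0=1.0):
    """Return n1 and n2, the two points on opposite sides of (x, y).

    k  : orientation index, angle alpha_1 = 2*k*pi/K
    m  : blur level, distance A_m = 2**m * A0
    K  : total number of directions (assumed default)
    A0 : base distance at level 0 (assumed default)
    """
    alpha1 = 2.0 * k * math.pi / K
    a_m = (2 ** m) * A0
    dx = a_m * math.cos(alpha1)
    dy = a_m * math.sin(alpha1)
    n1 = (x + dx, y + dy)
    n2 = (x - dx, y - dy)  # same as stepping in direction alpha_1 + pi
    return n1, n2
```

By construction the input location is always the midpoint of the two returned points, and the distance between them doubles with each blur level.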

In Eq. (4), $A_m$ is the $m$-th distance between the two opposite-side locations. Finally, center-surround difference and normalization (CSD&N) [9] is applied to the symmetry axes at different scales to obtain the feature map. We take the average pixel value of the feature map to represent how symmetric the image is.

3.2.4. Results: By applying the above-mentioned algorithms, we obtained the scores for the three Gestalt principles for all attention patches and random patches. We calculated the average score of all patches (attention and random) in all three categories, namely (1) animals in natural scenes, (2) objects of daily life, and (3) vehicles in urban and rural backgrounds. As shown in Fig. 3, the results suggest that the scores for the ‘principle of symmetry’ and the ‘principle of continuity’ were higher in attention patches than in random patches. This difference was statistically significant, F(2,118)=5.32, p