Efficient Spatiotemporal-Attention-Driven Shot Matching
Shan Li and Moon-Chuen Lee
Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong

{sli, mclee}@cse.cuhk.edu.hk

ABSTRACT
As human attention is an effective mechanism for prioritizing and selecting information, it provides a practical approach to intelligent shot similarity matching. In this paper, we propose an attention-driven video interpretation method using an efficient spatiotemporal attention detection framework. Motion attention detection in most existing methods is unstable and computationally expensive. Avoiding explicit motion calculation, the proposed framework generates motion saliency using the rank deficiency of grayscale gradient tensors. To address an ill-posed weight determination problem, an adaptive fusion method is proposed for integrating motion and spatial saliency by highlighting the more reliable saliency maps. An attention-driven matching strategy is proposed by converting attention values into importance factors, which subsequently boost the attended regions in region-based shot matching. A global feature-based matching strategy is also included in the attention-driven strategy to address cases where visual attention detection is less applicable. Experimental results demonstrate the advantages of the proposed method in similarity matching.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods.

Fig. 1. (a) The cartoon figure flies against a red background; (b) the same figure flies against a bluish background; (c) and (d) are the respective foreground objects in the two shots.

General Terms Algorithms, Design, Experimentation, Theory

In this paper, we present a spatiotemporal visual attention framework that simulates the behavior of the human vision system by automatically producing saliency maps of a given video sequence for both static attention and temporal attention (also referred to as motion attention). The spatial and temporal attention maps are combined adaptively to form an overall attention map, from which the focus of attention (FOA) can be extracted as the attended region in the scene. As an important application of the visual attention framework, an intelligent shot matching method is proposed to match frames with emphasis on the attended regions.

Keywords Focus of Attention, motion saliency, shot similarity.

1. INTRODUCTION
Shot-based similarity matching using features extracted from shot key frames is a fundamental operation of video indexing and retrieval systems. Different approaches and frameworks have been proposed to measure the similarity between shots using low-level features, such as texture or color histograms of frames. Low-level feature-based approaches, though widely used, are not effective enough due to the large gap between the semantic interpretation of a video and its low-level features. Fig. 1 gives a typical example where traditional low-level feature-based approaches may produce unsatisfactory matching results. The same cartoon figure (Lisa in the cartoon show 'The Simpsons') flies in different backgrounds in two shots, as shown in Fig. 1(a) and (b). In both shots, the camera tracks the flying figure, which is the attractive object. Despite the similar foreground objects, the two shots will be considered dissimilar by traditional shot matching approaches, due to the big difference in the backgrounds, which occupy a large portion of each frame.

An alternative approach to video understanding is to simulate the mechanism of human visual attention. Preliminary psychological studies of the human vision system (HVS) are helpful for understanding the visual attention mechanism. Given far more perceptual information than can be effectively processed, the HVS allows people to select the information that is most relevant to ongoing behaviors. This selective behavior is referred to as visual attention, a process involving complex activities in the retina and the cortex.

The rest of the paper is organized as follows. In Section 2, we briefly review the related work on visual attention modeling and shot similarity matching. The proposed efficient spatiotemporal attention detection framework is introduced in Section 3. A shot matching method is proposed in Section 4, and experimental results are reported in Section 5. Finally, Section 6 concludes the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’07, September 23–28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM 978-1-59593-701-8/07/0009…$5.00.

2. RELATED WORK
2.1 Visual Attention Modeling
Following different attention patterns, schemes proposed for FOA detection can be divided into two categories: saliency-oriented and task-oriented. A priori knowledge of attention objects is


usually used to generate top-down attention cues in the task-oriented attention detection models [13]. Since a priori knowledge of attention objects is usually hard to obtain, this paper focuses on saliency-oriented attention detection.

The saliency-oriented schemes detect interesting regions (i.e., the Focus of Attention, FOA) according to saliency information [9][10]. Itti et al. proposed one of the earliest saliency-based computational models of visual attention. The principal idea is based on the stimulus-driven characteristic of human vision that objects with features distinctive from their surroundings are often perceived. More elaborate models have been proposed by other researchers to generate local conspicuity maps for images [2][7]. Though much work has been done on static image attention analysis, the extension of visual attention models from static images to dynamic image sequences still needs to be explored.

A few approaches have been proposed to extend spatial attention to videos, where motion plays an important role. To find the motion activities in videos, these motion attention models use structure tensors to generate optical flow, or apply other motion vector estimation methods such as block-based matching or parametric motion models to obtain motion vector fields [2][15][19]. A common problem of the existing motion attention models is that the calculation of motion is highly computation intensive, which makes real-time analysis of videos with large numbers of frames impossible. A more efficient and reliable method for motion attention detection is needed to improve the applicability of visual attention models in multimedia applications.

Fig. 2. (a) A moving car attracts attention in a shot with a static camera; (b) a walking person attracts attention in a shot with a moving camera.

Assuming that objects are rigid and that motion vectors within a moving object tend to be locally consistent [2][12], we aim to find objects whose motion is both locally constant inside the object and salient with respect to the background. The optical flow f = (u, v, w) of a neighborhood can be estimated by solving the following structural tensor system [11]:

M \cdot [u\ v\ w]^T = O_{3\times 1}, with

M = \begin{bmatrix}
\sum_{i\in\Omega} w(c-i)\,g_x^2 & \sum_{i\in\Omega} w(c-i)\,g_x g_y & \sum_{i\in\Omega} w(c-i)\,g_x g_t \\
\sum_{i\in\Omega} w(c-i)\,g_x g_y & \sum_{i\in\Omega} w(c-i)\,g_y^2 & \sum_{i\in\Omega} w(c-i)\,g_y g_t \\
\sum_{i\in\Omega} w(c-i)\,g_x g_t & \sum_{i\in\Omega} w(c-i)\,g_y g_t & \sum_{i\in\Omega} w(c-i)\,g_t^2
\end{bmatrix}    (1)

where the weighting function w(c − i) selects the size of the neighborhood Ω centered at the pixel c. In implementations, the weighting is realized by a Gaussian smoothing kernel. ∇g = (g_x, g_y, g_t) denotes the space-time gradient of the intensity.
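For illustration, the tensor entries of Eq. (1) can be assembled from per-pixel gradient products followed by a Gaussian weighting over the neighborhood. The sketch below is not the authors' implementation; the three-frame input, the Gaussian width, and the function name are assumptions made only for this example.

```python
# Sketch: assembling the structural tensor of Eq. (1) for every pixel from
# space-time gradients, assuming a short grayscale clip of shape (T, H, W).
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(frames, sigma=1.5):
    """frames: float array (T, H, W); returns an (H, W, 3, 3) tensor field."""
    gt, gy, gx = np.gradient(frames.astype(np.float64))  # time, y, x gradients
    gx, gy, gt = gx[1], gy[1], gt[1]                      # gradients of the middle frame
    products = {
        (0, 0): gx * gx, (0, 1): gx * gy, (0, 2): gx * gt,
        (1, 1): gy * gy, (1, 2): gy * gt, (2, 2): gt * gt,
    }
    M = np.zeros(gx.shape + (3, 3))
    for (i, j), p in products.items():
        # the weighted sum over the neighborhood Omega acts as Gaussian smoothing
        s = gaussian_filter(p, sigma)
        M[..., i, j] = s
        M[..., j, i] = s
    return M
```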

2.2 Shot Similarity Matching
Videos are composed of elementary shots, which can be segmented using existing shot detection methods [17][18]. Shot matching, which measures the similarity between shots, is a fundamental operation of video indexing and retrieval systems. In general, the existing shot matching methods [4][6][16] use low-level features, such as color, texture, edge, and shape, to describe the key frames of the shots. The similarity of two shots is then measured based on the distance between the feature vectors of the respective key frames. Though low-level feature-based methods can be successful in cases where users show no preference for any foreground objects or events, such methods usually become less effective when the semantic concepts of the videos cannot be captured by the low-level features. Fig. 1 gives an example where the global similarity of the two shots is low, while their semantic similarity is high.

If rank(M) = 3, there are multiple motions within Ω; if rank(M) = 2, a distributed spatial brightness structure moves with a constant motion. For degenerate image structures (i.e., rank(M) = 0, 1), the motion of the image structure cannot be determined from Eq. (1). However, since a neighborhood with constant brightness or a homogeneous contour most likely belongs to the same object or to the background, the neighborhood can be assumed to move with a constant motion. Therefore, if the optical flow (u, v, w) is constant within the neighborhood, the coefficient matrix M should be rank deficient: rank(M) ≤ 2. In the presence of noise, M may almost always be full rank in real videos, so a normalized and continuous measure is used to quantify the matrix deficiency. Let λ1 ≥ λ2 ≥ λ3 be the eigenvalues of M. We define the continuous rank-deficiency measure d_M as:

Semantic similarity matching can be implemented by training concept templates of videos [14]. However, building a concept feature classifier is a difficult task, and prior knowledge of the semantic concepts may not always be available. As visual attention models imitate the mechanisms of the human vision system, a matching solution designed from the visual attention perspective is expected to alleviate the problems of current shot matching approaches.

d_M = \begin{cases} 0, & \mathrm{trace}(M) < \gamma \\[4pt] \dfrac{\lambda_3^2}{0.5\,\lambda_1^2 + 0.5\,\lambda_2^2 + \varepsilon}, & \text{otherwise} \end{cases}    (2)

where ε avoids division by zero. The threshold γ is used to handle the case rank(M) = 0; it is set as the 0.1 percentile point of the cumulative distribution of the trace values of the neighborhoods centered at each pixel.
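A minimal sketch of the measure in Eq. (2), applied to a field of 3×3 tensors such as the one produced above, might look as follows; everything except the 0.1-percentile rule and the eigenvalue ratio is an illustrative choice.

```python
# Sketch of the rank-deficiency measure of Eq. (2).
import numpy as np

def rank_deficiency(M_field, eps=1e-6):
    """M_field: (H, W, 3, 3) symmetric tensors; returns d_M per pixel."""
    traces = np.trace(M_field, axis1=-2, axis2=-1)
    gamma = np.percentile(traces, 0.1)                  # 0.1 percentile of the trace values
    lam = np.linalg.eigvalsh(M_field)                   # ascending eigenvalues
    l1, l2, l3 = lam[..., 2], lam[..., 1], lam[..., 0]  # relabel so that l1 >= l2 >= l3
    d = l3**2 / (0.5 * l1**2 + 0.5 * l2**2 + eps)
    d[traces < gamma] = 0.0                             # handle the rank(M)=0 case
    return d
```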

3. SPATIOTEMPORAL ATTENTION
3.1 Motion Attention Detection

In the HVS, temporal attention is selectively directed to objects that have salient motions [5]. Fig. 2 shows two examples.

Similar rank-based measures can be used to determine whether two image structures share similar motions. Given the spatiotemporal structural tensors M1 and M2 of a local neighborhood Ω1 and a background Ω2, respectively, we say

Ω1 and Ω2 share the same motion (u, v, w) if the combined coefficient matrix M12 is rank deficient:

M_{12} = M_1 + M_2, with Ω = Ω_1 + Ω_2    (3)

The integrated spatiotemporal structural tensor defined in Eq. (3) could be highly sensitive to noise: multiple motions in either Ω1 or Ω2 will yield high values of d_{M12}, and M12 could be full rank in most real videos, making d_{M12} an impractical measure for motion saliency estimation. The problem can be solved by computing M12 from the averaged motions in Ω1 and Ω2 respectively. Recall from Eq. (1) that a single uniform optical flow (u, v, w) can be calculated if rank(M) = 2. For rank(M) = 3, however,

an averaged optical flow \tilde f = (\tilde u, \tilde v, \tilde w) should satisfy:

\tilde M \cdot [\tilde u\ \tilde v\ \tilde w]^T = O, with
\tilde M = \begin{bmatrix}
\sum_{i\in\Omega} w(c-i)\,g_x^2 & \sum_{i\in\Omega} w(c-i)\,g_x g_y & \sum_{i\in\Omega} w(c-i)\,g_x g_t \\
\sum_{i\in\Omega} w(c-i)\,g_x g_y & \sum_{i\in\Omega} w(c-i)\,g_y^2 & \sum_{i\in\Omega} w(c-i)\,g_y g_t
\end{bmatrix}    (4)

Replacing the motions in Ω1 and Ω2 with the averaged motions respectively, we have:

\tilde M_i \cdot [\tilde u\ \tilde v\ \tilde w]^T = O,\ i\in\{1,2\};\quad \tilde M_{12} = [\tilde M_1\ \tilde M_2]^T,\quad \tilde M_{12} \cdot [\tilde u\ \tilde v\ \tilde w]^T = O    (5)

M^{*} \cdot [\tilde u\ \tilde v\ \tilde w]^T = O_{3\times 1},\quad M^{*}_{3\times 3} = \tilde M_{12}^T \tilde M_{12} = \tilde M_1^T \tilde M_1 + \tilde M_2^T \tilde M_2    (6)
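As a rough illustration of Eqs. (4)-(6), the 2×3 averaged-flow tensors of a center patch and its surround can be stacked and multiplied to form the combined 3×3 tensor M*, whose deficiency is then evaluated with the measure of Eq. (2). The sketch below uses uniform weights instead of the Gaussian w(c − i) and rectangular patches; both are simplifying assumptions, not the authors' implementation.

```python
# Sketch: build the 2x3 averaged-flow tensors of Eq. (4) for a center patch and
# its surround, stack them as in Eq. (5), and form M* of Eq. (6).
import numpy as np

def averaged_tensor(gx, gy, gt):
    """Rows of the 2x3 tensor accumulated over a patch (Eq. 4, uniform weights)."""
    return np.array([
        [np.sum(gx * gx), np.sum(gx * gy), np.sum(gx * gt)],
        [np.sum(gx * gy), np.sum(gy * gy), np.sum(gy * gt)],
    ])

def combined_tensor(center, surround):
    """center/surround: dicts with 'gx', 'gy', 'gt' gradient arrays for Omega_1 and Omega_2."""
    M1 = averaged_tensor(center['gx'], center['gy'], center['gt'])
    M2 = averaged_tensor(surround['gx'], surround['gy'], surround['gt'])
    M12 = np.vstack([M1, M2])   # Eq. (5): stacked constraints on the averaged flow
    return M12.T @ M12          # Eq. (6): 3x3 tensor M* = M1^T M1 + M2^T M2
```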

The measure of Eq. (2) can be applied to M* to form a rank-deficiency value d_{M*}. d_{M*} gives a low value if the averaged motions of Ω1 and Ω2 are consistent, and a high value if the averaged motion in Ω1 is salient with respect to Ω2. The following motion attention measure captures the degree of motion attention by combining local motion consistency with motion saliency:

r = d_{M^*} / (d_M + \varepsilon)    (7)

Motion attention maps P_m can be formed by regarding the motion attention value r as the pixel value in the map. To obtain multi-scale motion attention maps, Ω1 and Ω2 are extracted in the form of center-surround pyramids. Local areas can be sampled from a video frame at different scales 2^σ, σ = 0..8. Center-surround pyramids are formed by regarding areas at a "center" fine scale 2^c as Ω1 and areas at a "surround" coarser scale 2^s as Ω1 + Ω2. A set of six motion attention maps P_m^{c,s} can be formed, with c ∈ {2,3,4} and s = c + δ, δ ∈ {3,4}. One example of motion attention detection is shown in Fig. 3, where the camera moves rapidly to track a walking person. Note that in Fig. 3 there are motions in both the background and the foreground; it is motion salience, rather than motion itself, that is detected from the background in the maps with scales 6-3 and 7-3.

Center-surround pyramids are used in motion attention detection to produce salience maps with different precisions. Lower center-surround scales (e.g., 6-3, 7-3) tend to detect locally salient regions, and higher center-surround scales (7-4, 8-4, and 8-5) tend to detect globally salient regions. With the low center-surround scales (6-3, 7-3), some areas of the background are detected as motion salient because of the trivial waving of tree leaves and the noise in the input picture (e.g., noise introduced by video compression and decompression). The strategy of center-surround pyramids is commonly used in visual attention simulation [9]. While the high scales tend to deliver "global views", the lower scales tend to produce more accurate segmentation of the attended regions. Therefore, different center-surround scales are combined to produce a more "balanced" detection result.

Fig. 3. An example of motion saliency estimation. The first column shows the original images in the video sequence. Columns 2 to 6 are the generated motion maps with center-surround scales of 6-3, 7-3, 7-4, 8-4, and 8-5 respectively. The last column shows the combined motion saliency map. In this example, the camera tracks the walking person. While the background has intensive motion, the person's motion is largely canceled by the camera motion. As shown in the maps, the person's motion is salient with respect to his immediate background. Brighter colors indicate higher attention values in the attention map.

3.2 Static Attention Detection
Static scenes may also attract human attention, yet this cannot be estimated by the motion attention detection module. The Itti model is currently the most widely used in static attention analysis and simulation applications. The experiments performed by Itti et al. [9] suggest that a model based on the color, intensity, and orientation features is already capable of achieving good performance on complex natural scenes. The three features are selected because they are used in the primary visual cortex and have been well studied. Although other features such as edge and symmetry could be salient features of foreground objects, the existence of corresponding neural detectors for such features remains controversial. Besides, salient targets containing features such as edge and symmetry often also include one or more of the three implemented features as salient features. In addition to providing a plausible simulation of bottom-up attention in primates, the Itti model offers a computationally efficient solution for real-time visual attention simulation. Therefore, we adopt the Itti model in the static attention detection module.

Three contrast-based features are used: an intensity feature (I), a color feature (C) combining red-green and blue-yellow opponency, and an orientation feature (O). A set of saliency maps P_{s∈\{I,C,O\}} is generated by calculating center-surround differences (Θ) between different scales:

P_I^{c,s} = P_I^c \,\Theta\, P_I^s;\quad P_C^{c,s} = P_C^c \,\Theta\, P_C^s;\quad P_{O,k}^{c,s} = P_{O,k}^c \,\Theta\, P_{O,k}^s    (8)

where c ∈ {2,3,4}, s = c + δ, δ ∈ {3,4}, and k ∈ {0°, 45°, 90°, 135°}.

Eye tracking experiments suggest that participants tend to stare at areas near the center of the frame prior to stimulus onset. This preference can be simulated by weighting P_s with an anisotropic Gaussian centered at the frame center (cen_x, cen_y):

P'_s(x, y) = P_s(x, y) \cdot \exp\!\big(-(x - cen_x)^2 / 2\sigma_x^2 - (y - cen_y)^2 / 2\sigma_y^2\big)    (9)

σ_x = 2 is used by default; σ_y is obtained as σ_y = rσ_x, where r is the ratio of the frame's height to its width.

As people usually look more frequently towards the image center, this strategy simulates the most common situations. In cases where the attentive regions are away from the center, their attention values will still be large if there is nothing interesting in the center. If, on the other hand, there are attentive regions both in the center and away from the center, the regions away from the center will indeed be suppressed. This is similar to the mechanism of human visual attention, where foreground objects near the center are usually more eye-catching than objects far away from the center (disregarding the interference of top-down information). Fig. 4 gives an example of static attention generation.
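The center-bias weighting of Eq. (9) could be sketched as below. The coordinate normalization (so that σ_x = 2 is meaningful relative to the frame size) is an assumption made for illustration only; the paper does not specify the coordinate units.

```python
# Sketch of the anisotropic Gaussian center bias of Eq. (9).
import numpy as np

def center_weighted(P, sigma_x=2.0):
    """P: (H, W) saliency map; returns the center-weighted map P'."""
    h, w = P.shape
    r = h / w                        # frame height-to-width ratio, so sigma_y = r * sigma_x
    sigma_y = r * sigma_x
    ys, xs = np.mgrid[0:h, 0:w]
    # assumed normalization: map the frame width to the interval [-5, 5]
    x = (xs - w / 2.0) * (10.0 / w)
    y = (ys - h / 2.0) * (10.0 / w)
    weight = np.exp(-(x**2) / (2 * sigma_x**2) - (y**2) / (2 * sigma_y**2))
    return P * weight
```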

Fig. 4. Static attention generation. The first column shows the original images of the video sequence; from left to right, the remaining columns show the color, intensity, and orientation attention maps respectively.

3.3 Adaptive Fusion and FOA Detection
The motion attention map P_m and the static attention maps P_s are integrated as:

SM = w_m \cdot P_m + \sum_{s\in\{I,C,O\}} w_s \cdot P_s    (10)

Since each cue varies in its performance, it is not appropriate to weight the measurements equally. In this work, we exploit the fact that in the HVS, features with strong contrast are activated while similar features inhibit each other [3]. The variance of a map P is defined as:

E(P) = \mathrm{Max}(P) - \mathrm{Mean}(P)    (11)

The weights of the feature maps are then defined as:

w_i = E(P_i) \Big/ \sum_{i\in\{m,I,C,O\}} E(P_i)    (12)

When the weight of the motion attention map w_m is the maximum of all weights and is larger than 3 times the second largest weight, the corresponding shot is referred to as a motion-dominant shot; if w_m is smaller than 1/3 of the maximum weight, the shot is referred to as a static-attention-dominant shot; otherwise, the shot is referred to as a "motion-static shot". Fig. 5 gives three examples where the dominant cue varies across different shots.

Given the generated saliency map, the attended region (i.e., the Focus of Attention, FOA) is determined by placing a rectangle box around each attended object iteratively. A focus value F(x) is defined to measure the possibility that a pixel x is the center of an FOA:

F(x) = \sum_{X\in N_x} \big(SM(X)\cdot G(X) - Edge(X)\cdot G(X)\big)    (13)

where SM(x) and Edge(x) are the saliency and edge values of pixel x respectively, G is a normalized Gaussian function centered at x, and N_x is the local area centered at pixel x. For a 240×320 image, N_x is a 15×15 square region. The pixel with the highest F(x) becomes the center of the FOA.

A few pre-processing steps on the attention map are required before determining the size of the FOA. The attention map is first binarized using a small gating threshold Th. A "close" morphological operation is then employed to fill the holes in the binary attention map. Neighboring regions with high attention values are merged if their color distributions in the original image are similar. Finally, the four sides of the rectangle box expand from the center simultaneously until the box encloses all the attended pixels that are connected to each other in 8-connectivity. In the case of multiple foreground objects, there will be multiple rectangle boxes, each box surrounding one attended object. The iterative FOA determination process consists of the following steps:

Algorithm 1: Iterative FOA determination
Step 1: Apply the pre-processing on a given attention map SM to obtain a binary attention map SM'.
Step 2: Obtain the focus values F(x) of the remaining pixels on the attention map SM. Let the pixel with the highest F(x) be the center of the i-th FOA.
Step 3: Expand the rectangle box around the center of the i-th FOA until the box encloses all the pixels that are directly or indirectly connected to the FOA center in the binary attention map SM'. Remove the pixels within the box from the binary map SM'.
Step 4: If one of the termination conditions is satisfied, terminate the procedure; otherwise, go to Step 2 with i = i + 1. The termination conditions are: 1) i > N_FOA; 2) SM(C_i) < (2/3)·SM(C_1).

Fig. 6. Iterative FOA determination.

SM(C_i) denotes the attention value of the i-th FOA's center, and N_FOA is the number of FOA. Normally, N_FOA in a video frame will be no more than 3: since frames are played fast (24~40 frames/second), it is difficult for observers to attend to more objects simultaneously in the same frame. As a matter of fact, the observer might lose focus if his attention is distracted by too many salient objects.
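A heavily simplified sketch of Algorithm 1 is given below. It binarizes the attention map and repeatedly boxes the most attended 8-connected region; the focus value F(x) of Eq. (13), the morphological region merging, and the exact thresholds are replaced by simpler stand-ins, so this is only an approximation of the described procedure.

```python
# Simplified sketch of the iterative FOA determination (Algorithm 1).
import numpy as np
from scipy import ndimage

def iterative_foa(SM, th=0.3, n_foa=3):
    """Return up to n_foa bounding boxes (ymin, ymax, xmin, xmax) from attention map SM."""
    mask = SM > th * SM.max()                 # Step 1: gate and binarize
    mask = ndimage.binary_closing(mask)       # fill holes ("close" operation)
    boxes, centers = [], []
    for _ in range(n_foa):
        if not mask.any():
            break
        # Step 2 (simplified): take the most attended remaining pixel as the FOA center
        cy, cx = np.unravel_index(np.argmax(SM * mask), SM.shape)
        if centers and SM[cy, cx] < (2.0 / 3.0) * SM[centers[0]]:
            break                             # termination: SM(Ci) < 2/3 * SM(C1)
        centers.append((cy, cx))
        # Step 3: box the 8-connected attended region around the center
        labels, _ = ndimage.label(mask, structure=np.ones((3, 3)))
        region = labels == labels[cy, cx]
        ys, xs = np.where(region)
        boxes.append((ys.min(), ys.max(), xs.min(), xs.max()))
        mask &= ~region                       # remove the region from the binary map
    return boxes
```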


Fig. 5. Three examples of adaptive fusion with N = 10. Column (a) is the key frame of the given sequence; (b) shows the temporal attention map generated from the two neighboring frames of the sequence; (c)~(e) are the spatial attention maps using color, intensity, and orientation respectively; (f) is the combined spatiotemporal attention map; (g) shows the generated FOA in the original video frame.
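The adaptive fusion behind the examples in Fig. 5 reduces to Eqs. (10)-(12): each map is weighted by its "variance" E(P) = Max(P) − Mean(P), normalized over all cues. A minimal sketch, assuming all maps are already normalized to a common range, could be:

```python
# Sketch of the adaptive fusion of Eqs. (10)-(12).
import numpy as np

def fuse_maps(maps):
    """maps: dict like {'m': Pm, 'I': PI, 'C': PC, 'O': PO} of equally sized arrays."""
    E = {k: float(P.max() - P.mean()) for k, P in maps.items()}   # Eq. (11)
    total = sum(E.values()) + 1e-12
    w = {k: e / total for k, e in E.items()}                      # Eq. (12)
    SM = sum(w[k] * maps[k] for k in maps)                        # Eq. (10)
    return SM, w   # w['m'] can then be compared to the other weights to classify the shot
```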

4. APPLICATION TO SHOT MATCHING
4.1 Importance Factor
Using the method in [1], a set of key frames {f1,…,fn} is extracted from a given shot according to the motion and content complexity of the shot. The proposed spatiotemporal attention framework is applied to each selected key frame f_i to extract an attention map SM_i and a number of FOA {FOA_{i,1},…,FOA_{i,N_i}}, where N_i is the number of FOA in the i-th key frame. For each FOA in the i-th key frame, a normalized importance factor I(FOA_{i,q}) is calculated based on its average saliency value in the attention map SM:

I(FOA_{i,q}) = \frac{1}{NormFact \cdot N(FOA_{i,q})} \sum_{x\in FOA_{i,q}} SM(x)    (14)

with the constraint \sum_{q\in\{1,\dots,N_i\}} I(FOA_{i,q}) = 1. N(FOA_{i,q}) is the number of pixels in FOA_{i,q}. In the attention-driven matching strategy, a higher importance factor indicates that the FOA should contribute more to the shot matching. Therefore, the importance factor is used later in the paper as the weight factor for FOA-based shot matching.

4.2 Confidence Rate of FOA Detection
A shot is a collection of frames whose contents are related. It is therefore reasonable to assume that the change of FOA within a short temporal distance is smooth. A temporal window W = 5 centered at each key frame is applied to perform a temporal smoothness check; FOA detection is considered less reliable if the detected FOA vary greatly within the window. A set of spatial features, including the object centroid (c), the bounding box size (s), and the statistical distribution (h), is used for the temporal smoothness check. The confidence rate C(t) of frame t is calculated as:

C(t) = 1 - \frac{1}{M_t}\sum_{w\in W}\sum_{q} I(FOA_{t,q})\, I(FOA_{w,p^*})\, d(FOA_{t,q}, FOA_{w,p^*})    (15)

with M_t = \sum_{w\in W}\sum_{q} I(FOA_{t,q})\, I(FOA_{w,p^*})    (16)

where I(FOA_{t,q}) is the importance weight of FOA_{t,q}, the subscript q indicates the q-th FOA in frame t, and p* is the corresponding most similar FOA in frame w: p^* = \arg\min_p d(FOA_{t,q}, FOA_{w,p}).

Note that in Eq. (15) the distance between two FOA is adjusted by the importance factors of the two FOA. The adjustment ensures that FOA receiving more attention are assigned larger weights in the frame matching. The temporal smoothness between two FOA is computed from all features {c, s, h}:

d(FOA_{t,q}, FOA_{w,p^*}) = \frac{1}{3}\sum_{i\in\{c,s,h\}} d_i(FOA_{t,q}, FOA_{w,p^*})    (17)

where d_c is the centroid difference normalized by the frame size, d_s is the size difference of the two bounding boxes normalized by the size of the bigger box, and d_h is the normalized HSV color and edge histogram difference between the two FOA.

In Eq. (15), the distances in three cues (location, size, and color/edge histogram) collectively determine the confidence rate of the FOA detection. It should be stressed that the FOA location distance is a discriminative cue for confidence rate estimation. In unreliable FOA detection cases, the FOA detected from successive frames are inconsistent with each other, and their locations can be arbitrary across frames. In a reliable FOA detection case, even with fast FOA motion, the FOA usually moves smoothly across successive frames, and the location difference between these FOA is usually much smaller than in the unreliable cases. After all, cases where an object jumps across the entire screen within 3 or 4 frames are rare; in such cases, we could also consider the FOA detection as unreliable, because the object does not stay constantly within the temporal window and therefore should not be used to represent the shot.

We also notice that a very salient motion does not necessarily cause a large location distance. For example, in a tracking shot the camera pans fast to track the foreground moving object. The foreground object, with its motion largely canceled by the global motion, could stay around the center of the screen throughout the entire shot. In this case, the FOA motion is salient with respect to the background motion, yet the calculated location distance remains small.
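A small sketch of the confidence rate of Eqs. (15)-(17) is given below. An FOA is represented here as a dictionary with an importance value, and the three cue distances d_c, d_s, d_h are assumed to be supplied as normalized functions; this is an illustration of the formulas, not the authors' code.

```python
# Sketch of the confidence rate of Eqs. (15)-(17).
def foa_distance(a, b, d_c, d_s, d_h):
    # Eq. (17): average of centroid, size, and histogram distances
    return (d_c(a, b) + d_s(a, b) + d_h(a, b)) / 3.0

def confidence(frame_foa, window_foa, d_c, d_s, d_h):
    """frame_foa: FOA dicts of frame t; window_foa: list of FOA lists for frames w in W."""
    num, M_t = 0.0, 0.0
    for a in frame_foa:
        for neighbours in window_foa:
            if not neighbours:
                continue
            # p*: the most similar FOA in frame w
            b = min(neighbours, key=lambda f: foa_distance(a, f, d_c, d_s, d_h))
            pair_w = a["importance"] * b["importance"]
            num += pair_w * foa_distance(a, b, d_c, d_s, d_h)   # Eq. (15) numerator
            M_t += pair_w                                       # Eq. (16)
    return 1.0 - num / M_t if M_t > 0 else 0.0
```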

4.3 Integration of FOA and Global Features
If shots S_a and S_b are composed of m and n key frames respectively, S_a → {f1,…,fm} and S_b → {f'1,…,f'n}, the similarity of the two shots is measured by the difference between each key frame in S_a and its most similar key frame in S_b:

D(S_a, S_b) = \frac{1}{m}\sum_{i\in\{1,\dots,m\}} D(f_i, f'_{j^*}), with j^* = \arg\min_{j\in\{1,\dots,n\}} D(f_i, f'_j)    (18)

The difference between f_i and f'_j is composed of two parts: a global feature-based difference D_GLB and an FOA-based difference D_FOA. The confidence rate is used as a tuning factor between the two components:

D(f_i, f'_j) = \Big(1 - \frac{C(i)+C(j)}{2}\Big)\cdot D_{GLB}(f_i, f'_j) + \frac{C(i)+C(j)}{2}\cdot D_{FOA}(f_i, f'_j)    (19)

Simple features, including 64×16×16 HSV color histograms and 1×64 edge histograms, are used in the calculation of the global feature-based difference D_GLB(f_i, f'_j). The FOA-based difference D_FOA(f_i, f'_j) resembles the temporal smoothness calculation (see Eqs. 15-17), except that only the color and edge histogram difference (d_h) is applied. Note that the importance factors of the FOA are used as weights to combine the distances from different FOA pairs:

D_{FOA}(f_i, f'_j) = \frac{1}{M}\sum_{q}\big(I(FOA_{i,q})\cdot I(FOA_{j,p^*})\cdot d_h(FOA_{i,q}, FOA_{j,p^*})\big)    (20)

with M = \sum_{q}\big(I(FOA_{i,q})\cdot I(FOA_{j,p^*})\big) and p^* = \arg\min_p d(FOA_{i,q}, FOA_{j,p}).

The similarity measure D(S_a, S_b) is asymmetric. Therefore, we define the final similarity as:

Sim(S_a, S_b) = \min(D(S_a, S_b), D(S_b, S_a))    (21)

5. EXPERIMENTS
To demonstrate the effectiveness of the proposed attention detection framework, we conduct experiments on a large number of video shots extracted from TV programs and the TREC05 video repository. The frame size of the videos is 240×320. Table I summarizes the details of the testing videos.
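Before turning to the experiments, a minimal sketch of the shot-level matching of Eqs. (18)-(21) is shown below. The per-key-frame difference of Eq. (19) blends a global distance and an FOA distance using the confidence rates; the distance values and key-frame objects are treated as given, so this only illustrates the combination logic.

```python
# Sketch of the shot matching of Eqs. (18)-(21).
def frame_difference(c_i, c_j, d_glb, d_foa):
    # Eq. (19): confidence-weighted blend of global and FOA-based differences
    c = (c_i + c_j) / 2.0
    return (1.0 - c) * d_glb + c * d_foa

def directed_distance(shot_a, shot_b, frame_dist):
    # Eq. (18): each key frame in Sa is matched to its most similar key frame in Sb
    return sum(min(frame_dist(fi, fj) for fj in shot_b) for fi in shot_a) / len(shot_a)

def shot_similarity(shot_a, shot_b, frame_dist):
    # Eq. (21): the directed measure is asymmetric, so take the minimum of both directions
    return min(directed_distance(shot_a, shot_b, frame_dist),
               directed_distance(shot_b, shot_a, frame_dist))
```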

Table I. Summary of the testing videos
Video genre    | Total no. of shots | Duration (min.)
TV shows       | 2996               | 180
News           | 2070               | 180
Sports         | 977                | 120
Commercials    | 1603               | 60
Documentaries  | 1010               | 120
Total          | 8,656              | 660 (11 hours)

Fig. 7. Testing results on different videos: (a) the dominant cue is motion attention; (b) the dominant cue is static attention; (c) motion attention and static attention both contribute significantly.


5.1 Performance on FOA Detection
For each shot of the testing videos, we applied the proposed spatiotemporal attention detection approach. Given the frame size of 240×320, we construct attention maps at 5 different scales, with center-surround scales of 6-3, 7-3, 7-4, 8-4, and 8-5 respectively. FOA detection results on some testing videos are shown in Fig. 7. In Fig. 7(a), motion attention dominates the overall attention map. Motion saliency becomes a decisive cue when the static features of the foreground are not distinct (e.g., the top-right and bottom-right shots in Fig. 7(a)) or when the rich details of the background produce meaningless static attention maps (e.g., the top-left and bottom-left shots in Fig. 7(a)). On the contrary, when there is no prominent motion saliency or when the motions are complex, static attention dominates the overall attention determination (see Fig. 7(b)). Fig. 7(c) shows sequences with objects salient in both static features and motions. The static and motion attention maps agree on the attended regions, making the attended objects pop out from the background.

We randomly pick 100 shots from each video genre and ask 5 observers to rate them. Among the 5 observers, only 2 have a computer science background. Each observer is asked to grade the detected FOA using the criteria summarized in Table II.

Table II. Criteria for subjective evaluation
Statement   | Description of criteria
Good        | The detected FOA covers most attractive regions
Acceptable  | The detected FOA covers less attractive regions
Failed      | The attractive regions are not detected at all

Due to the great subjectivity of observer evaluation, the same detected FOA may be graded differently. It is important to consider the grading from all the observers and compute the statistics. The statistical results of the evaluation on all 500 shots are listed in Table III.

Table III. Subjective evaluation results
Test sets                          | Good | Acceptable | Failed
Motion-dominating videos           | 0.78 | 0.16       | 0.06
Static attention-dominating videos | 0.61 | 0.28       | 0.11
Motion-Static attention videos     | 0.85 | 0.10       | 0.05
Average                            | 0.75 | 0.18       | 0.07

The subjective evaluation results in Table III suggest that the proposed FOA detection method is able to detect the most interesting foreground objects in videos with a satisfaction rate of 0.93. FOA detection on Motion-Static attention videos has the best performance among the three test sets, as in such sequences the saliency of the foreground objects is boosted by both the motion and static attention detection results. We also notice that FOA detection on motion-dominating sequences outperforms detection on static attention-dominating sequences. In static attention, the semantic information of a target plays a role in determining observers' attention; since semantic interpretation varies among people, observers may have diverse opinions about the FOA. In motion-dominating sequences, on the other hand, the human vision system is attentive to the motion event, and observers tend to pick up motion-salient objects quickly with less confusion.

Some less satisfactory detection results are shown in Fig. 8. Most of these detections occur when the foreground objects are not salient in either motion or spatial features (Fig. 8(a) and (b)). The existence of many distracting objects that are salient in either visual features or motions can also lead to unsatisfactory detections (see Fig. 8(c) and (d)).

Fig. 8. Examples of unsatisfactory detection (a)-(d). The first row shows the original images and the second row shows the detected FOA.

With the same salience map, different FOA strategies could be developed for different applications. For example, Itti et al. detect FOA by placing uniform-size circles around the most salient locations. A series of subsequent FOA detected for each picture can be used to simulate the mechanism of eye scan paths when humans view a still picture freely. This strategy is not applicable to our application, since videos are formed of dynamic pictures and the viewing time for each picture is limited. Further, it is difficult to use uniform-size FOA to capture the actual regions of foreground objects, which affects the performance of attention-driven shot matching. The method proposed in [7] locates FOA by extracting the boundaries of the visual attention objects. However, object segmentation is time consuming, and the experimental results (see Section 5.3) show that FOA based on object segmentation do not necessarily outperform FOA contained in a square box in a video shot matching application.

The performance comparison among different FOA strategies is not performed here because: 1) if the comparison is based on shot matching performance, it may not be fair, since different FOA strategies have been designed for different applications; 2) if, on the other hand, the comparison is based on FOA detection accuracy, there are no commonly accepted performance measurement criteria available.

5.2 Performance on Shot Matching
A shot retrieval experiment is conducted on the database of 8,656 shots to evaluate the effectiveness of the proposed shot matching method. 100 queries are used, with 20 from each video genre, as summarized in Table IV. For each query, the k most similar shots are returned. Two human observers identify the scene relevancy, and the intersection of their identifications is used as the ground truth.


Table IV. Statistics of the query shots
Query genre   | # of queries | # of frames | # of key frames | # of relevant shots
TV shows      | 20           | 88          | 1.4             | 9.0
News          | 20           | 130         | 1.6             | 17.5
Sports        | 20           | 161         | 2.6             | 13.1
Commercials   | 20           | 63          | 1.9             | 6.5
Documentaries | 20           | 229         | 2.0             | 9.4

Precision-recall graphs are used to evaluate the retrieval performance over all queries. We set recall to different values by tuning the value of k. Precision at any recall rate is estimated using linear interpolation.

Three other shot matching strategies are used for performance comparison: 1) GB: shot matching using the global HSV and edge histograms of the key frames; 2) OBJ: shot matching using only the FOA-based shot difference in our proposal (no global features); 3) STS: shot matching using spatiotemporal statistics of shots [8]. In all four methods, histogram intersection is used as the histogram-based distance metric, the HSV color space is adopted, and key frame extraction follows the same method as used in our approach. For the method STS, we perform both shot-level global matching and key frame-level matching; in its weight determination, we tune the weights in [0, 1] at intervals of 0.05 and select the setting with the best overall performance. Fig. 9 shows the comparison results for all 5 categories of queries.

Fig. 10. Top 5 similar shots for query #1 from "TV shows": (a) the key frame of the query shot; (b) retrieval results using the proposed method; (c) retrieval results using "OBJ"; (d) retrieval results using "STS"; (e) retrieval results using "GB". Due to space limits, only one key frame of each retrieved shot is shown.
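For reference, the histogram intersection metric shared by all four compared methods can be written as a distance as sketched below; the L1 normalization of the histograms is an assumption for this illustration.

```python
# Sketch of histogram intersection used as a histogram distance.
import numpy as np

def histogram_intersection_distance(h1, h2):
    h1 = h1 / (h1.sum() + 1e-12)
    h2 = h2 / (h2.sum() + 1e-12)
    return 1.0 - np.minimum(h1, h2).sum()   # 0 for identical, 1 for disjoint histograms
```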

As shown in Fig. 9, for all query genres our method produces higher precisions than the other three methods, because it interprets videos from the perspective of human perception. In other words, the similarity between two shots is boosted if the attended targets of the two shots are the same, even though their backgrounds are different. As our method considers the contributions from both the foreground objects and the global features, it produces higher precisions and more stable performance than "OBJ", which matches shots based on FOA only. Though the method "STS" considers the temporal information of shots (i.e., motions), directly comparing two shots based on their motion distributions can be highly noisy.

Fig. 10 shows the top 5 shots returned for query #2 in the TV show "The Simpsons" using the four methods. Our method correctly detects the flying figure as the FOA and returns 5 shots with similar FOA. Without the confidence check for FOA detection, "OBJ" returns shots where the statistics of a falsely detected "FOA" happen to match the query shot (e.g., the yellow and bright background of the 3rd shot returned by "OBJ" is falsely detected as an FOA). The shot matching methods using only low-level features clearly produce far less satisfactory results; there is only one relevant shot in the results returned by "STS" and "GB".

5.3 Rectangle Box vs. Object Boundary
In the proposed method, an FOA is represented as a region surrounded by a rectangle box. Though easy to implement, this method may not give an accurate segmentation of the foreground objects. To examine whether this rather rough segmentation affects the shot matching performance, we conduct a comparison experiment using a boundary-based method (referred to as AOG).

In AOG, the proposed spatiotemporal attention framework is first used to produce the overall attention map. Instead of using a rectangle box, the 2D color object extraction method proposed in [7] is applied to generate the object boundary of each FOA, and the region inside the extracted boundary is used to represent the FOA. The shot retrieval performance comparison between the proposed method and AOG is shown in Table V.

Table V. Performance comparison of our method and AOG (precision at recall rates R)
              |        R=0.2       |        R=0.4       |        R=0.6
Query genre   | Our method | AOG   | Our method | AOG   | Our method | AOG
TV shows      | 0.76       | 0.77  | 0.42       | 0.40  | 0.31       | 0.35
News          | 0.66       | 0.60  | 0.45       | 0.41  | 0.29       | 0.31
Sports        | 0.81       | 0.75  | 0.59       | 0.63  | 0.44       | 0.42
Commercials   | 0.71       | 0.77  | 0.53       | 0.51  | 0.30       | 0.41
Documentaries | 0.82       | 0.76  | 0.52       | 0.61  | 0.37       | 0.36
Average       | 0.75       | 0.73  | 0.50       | 0.51  | 0.34       | 0.37

As the comparison experiment suggests, despite its heavy computation load, object boundary-based matching does not necessarily outperform the proposed method (our method slightly outperforms AOG at R=0.2, while for the two recall rates R=0.4 and R=0.6 AOG produces slightly better precisions). The potential


reasons for the similar performance of the two methods might be the following: 1) the existing object segmentation methods suffer from considerable segmentation noise; 2) as the foreground objects dominate the bounding box, the color and edge distributions within the bounding box represent the statistics of the foreground object well, so the influence of the background area is limited. Inaccurate segmentation by the AOG technique appears more frequently when the foreground objects do not have an easily identified, distinct boundary or when the foreground objects are composed of several regions with very different spatial contents. Some examples generated by AOG are shown in Fig. 11.

Fig. 11. Unsatisfactory segmentation examples using the AOG technique.

Fig. 9. Video retrieval performance in P-R graphs for queries of 5 different video genres: (a) TV shows; (b) News; (c) Sports; (d) Commercials; (e) Documentaries.

6. CONCLUSION

In this paper, we propose an attention-driven video interpretation method using an efficient spatiotemporal attention detection framework. In temporal attention detection, the motion saliency map is computed based on the rank deficiency of the structural grayscale gradient tensors of the frames, which requires no foreground/background motion segmentation or estimation. The method can also detect salient motions even when multiple complex activities occur simultaneously in the background. The motion attention and static attention maps are fused in an adaptive pattern in order to highlight the more reliable attention maps and to suppress the less reliable ones.

As an important application of the visual attention framework, a shot matching method is proposed based on the attention-driven matching strategy, where the attention values of the attended regions are converted into importance values for similarity matching. To address cases where visual attention detection is less applicable, a global feature-based matching strategy is also included in the shot matching method. Experimental results on a large number of videos show that our method compares favorably with conventional ones in shot matching.

Our method provides a way to measure shot similarities according to human attention. It can be combined with other features, such as text, speech, and even prior knowledge input by users, to obtain better similarity matching performance. The proposed spatiotemporal attention detection method automatically detects the interesting regions in a video from low-level features, without knowing what kind of objects the interesting regions really are. However, it cannot address the problem where semantically similar objects have different low-level features. For example, different cars may have different shapes and colors, and FOA-based matching between these cars may yield low similarities. To solve this problem, classifiers for each semantic category need to be trained using techniques such as SVMs and Gaussian mixture models.

7. REFERENCES
[1] Adjeroh, D. A. and Lee, M. C., "Scene-Adaptive Transform Domain Video Partitioning," IEEE Trans. Multimedia, 6(1), pp. 58-69, 2004.
[2] Boccignone, G., Chianese, A., Moscato, V. and Picariello, A., "Foveated shot detection for video segmentation," IEEE Trans. Circuits Syst. Video Technol., 15(3), pp. 365-377, 2005.
[3] Cannon, M. W. and Fullenkamp, S. C., "A model for inhibitory lateral interaction effects in perceived contrast," Vision Research, 36(8), pp. 1115-1125, 1996.
[4] Cheung, S.-S. and Zakhor, A., "Fast similarity search and clustering of video sequences on the world-wide-web," IEEE Trans. Multimedia, 7(3), pp. 524-537, 2005.


[5] Coull, J. T., "fMRI studies of temporal attention: allocating attention within, or towards, time," Brain Res. Cogn. Brain Res., 21(2), pp. 216-226, Oct. 2004.
[6] Courtney, J. D., "Automatic video indexing via object motion analysis," Pattern Recognition, 30(4), pp. 607-627, 1997.
[7] Han, J., Ngan, K. N., Li, M. and Zhang, H.-J., "Unsupervised extraction of visual attention objects in color images," IEEE Trans. Circuits Syst. Video Technol., 16(1), pp. 141-145, Jan. 2006.
[8] Ho, Y.-H., Lin, C.-W. and Chen, J.-F., "Fast coarse-to-fine video retrieval using shot-level spatio-temporal statistics," IEEE Trans. Circuits Syst. Video Technol., 16(5), pp. 642-648, May 2006.
[9] Itti, L. and Baldi, P., "A principled approach to detecting surprising events in video," IEEE CVPR, Vol. 1, pp. 631-637, 2005.
[10] Koch, C. and Ullman, S., "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, 4(4), pp. 219-227, 1985.
[11] Lucas, B. and Kanade, T., "An Iterative Image Registration Technique with an Application to Stereo Vision," Int. Joint Conf. on Artificial Intelligence, pp. 674-679, 1981.
[12] Ma, Y. F., Hua, X. S., Lu, L. and Zhang, H. J., "A generic framework of user attention model and its application in video summarization," IEEE Trans. Multimedia, 7(5), pp. 907-919, 2005.
[13] Navalpakkam, V. and Itti, L., "An Integrated Model of Top-Down and Bottom-Up Attention for Optimizing Detection Speed," IEEE CVPR, Vol. 2, pp. 2049-2056, 2006.
[14] Smeaton, A. F., Kraaij, W. and Over, P., "The TREC VIDeo Retrieval Evaluation (TRECVID): A Case Study and Status Report," RIAO, 2004.
[15] Cheng, W.-H., Chu, W.-T. and Wu, J.-L., "A Visual Attention Based Region-of-Interest Determination Framework for Video Sequences," IEICE Trans. Information and Systems, E88-D(7), pp. 1578-1586, 2005.
[16] Peng, Y. and Ngo, C.-W., "Clip-based similarity measure for query-dependent clip retrieval and video summarization," IEEE Trans. Circuits Syst. Video Technol., 16(5), pp. 612-627, 2006.
[17] Li, S. and Lee, M.-C., "An Improved Sliding Window Method for Shot Change Detection," IASTED Int. Conf. on Signal and Image Processing, pp. 464-468, 2005.
[18] Li, S. and Lee, M.-C., "Effective Detection of Various Wipe Transitions," IEEE Trans. Circuits Syst. Video Technol., 17(6), pp. 663-673, 2007.
[19] Zhai, Y. and Shah, M., "Visual attention detection in video sequences using spatiotemporal cues," ACM Multimedia, 2006.
