Learning Relations Among Movie Characters: A Social Network Perspective

Lei Ding and Alper Yilmaz
Photogrammetric Computer Vision Lab, The Ohio State University
[email protected], [email protected]

Abstract. If you have ever watched movies or television shows, you know how easy it is to tell the good characters from the bad ones. Little, however, is known about whether or how computers can achieve such high-level understanding of movies. In this paper, we take the first step towards learning the relations among movie characters using visual and auditory cues. Specifically, we use support vector regression to estimate a local characterization of adverseness at the scene level. These local properties are then synthesized via statistical learning based on Gaussian processes to derive the affinity between the movie characters. Once the affinity is learned, we perform social network analysis to find communities of characters and identify the leader of each community. We experimentally demonstrate that the relations among characters can be determined with reasonable accuracy from the movie content.

1 Introduction

During recent years, researchers have devoted considerable effort to object detection and tracking in order to understand scene content from motion patterns in videos [12, 7, 1, 6]. Most of these efforts, however, did not go beyond analyzing or grouping trajectories, or understanding individual actions performed by tracked objects [11, 8, 2]. Generally speaking, the computer vision community has not considered analyzing video content from a sociological perspective, which would provide a systematic understanding of the roles and social activities performed by actors based on their relations. In sociology, the social happenings in a society are conjectured to be best represented and analyzed using a social network structure [22]. The social network structure provides a means to detect and analyze communities in the network, which is one of the most important problems studied in modern sociology. Communities are generally detected based on the connectivity between the actors in a network. In the context of surveillance, a recent study [23] takes advantage of social networks to find such communities. The authors use a proximity heuristic to generate a social network, which may not necessarily represent the social structure in the scene. The communities in the network are then detected using a common social network analysis tool referred to as the modularity algorithm [17]. In a similar fashion, the authors of [10] generate social


relations based on the proximity and relative velocity between the actors in a scene, which are later used to detect groups of people in a crowd by means of clustering techniques.

In this paper, we attempt to construct social networks, identify communities and find the leader of each community in a video sequence from a sociological perspective using computer vision and machine learning techniques. Due to the availability of visual and auditory information, we chose to apply the proposed techniques to theatrical movies, which contain recordings of social happenings and interactions. The generality of relations among the characters in a movie introduces several challenges to the analysis of the movie content: (1) it is not clear which actors act as the key characters; (2) we do not know how low-level features relate to relations among characters; (3) no studies have been carried out on how to synthesize high-level relational information from local visual or auditory cues in movies. In order to address these challenges, our approach first aligns the movie script with the frames in the video using closed captions [5]. We should note that the movie script is used only to segment the movie into scenes and to provide a basis for generating the scene-character relation matrix. Alternatively, this information can be obtained using video segmentation [24] and face detection and recognition techniques [3].

A unique characteristic of our proposed framework is its applicability to an adversarial social network, which is a highly recognized but less researched topic in sociology [22], possibly due to the complexity of defining adversarial relations alongside friendship relations. Without loss of generality, an adversarial social network contains two disjoint rival communities C1 ∪ C2 = {c1, c2, · · · , cN} composed of actors, where members within a community have friendly relations and members across communities have adversarial relations.
In our framework, we use visual and auditory information to quantify adverseness at the scene level, which serves as soft constraints among the movie characters. These soft constraints are then systematically integrated to learn inter-character affinity. The adverse communities in the resulting social network are discovered by subjecting the inter-character affinity matrix to a generalized modularity principle [4], which is shown to perform better than the original modularity [17].

Social networks commonly contain leaders, who have the most important roles in their communities. The importance of an actor is traditionally quantified by computing degree, closeness, or betweenness centralities [9]. More recently, eigenvector centrality has been proposed as an alternative [19]. In this paper, due to its intrinsic relation to the proposed learning mechanism, we adopt the eigenvector centrality to find leaders in the two adverse communities. An illustration of the communities and their leaders discovered by our approach is given in Figure 1 for the movie titled G.I. Joe: The Rise of Cobra (2009).

The remainder of the paper is organized as follows. We start by providing the basics of the social network framework in the next section, which is followed by a discussion on how we construct social networks from movies in Section 3. The methodology used to analyze these social networks is described in Section


Fig. 1. Pictorial representation of communities in the movie titled G.I. Joe: The Rise of Cobra (2009). Our approach automatically detects the two rival communities (G.I. Joe and Cobra), and identifies their leaders (Duke and McCullen) visualized as the upscaled characters at the front of each community.

4, and is evaluated on a set of movies in Section 5. Our contributions in this paper can be summarized as follows:

– Proposal of principled methods for learning adversarial and non-adversarial relations among actors, which is new to both the computer vision and sociology communities;
– Understanding these relations using a modified modularity principle for social network analysis;
– A dataset of movies, containing scripts, closed captions, and visual and auditory features, for further research in high-level video understanding.

2 Social Network Representation

Following a common practice in sociology, we define interactions between the characters in a movie using a social network structure. In this setting, the characters are treated as the vertices V = {vi : vi represents ci}1 with cardinality |V|, and their interactions are defined as edges E = {(vi, vj) | vi, vj ∈ V} between the vertices in a graph G(V, E). The resulting graph G is a fully connected weighted graph with an affinity matrix K of size |V| × |V|.

In this social setting, the characters may have either adversarial or non-adversarial relations with each other. These relations can be exemplified by the characters in a war movie: non-adversarial (collaborative) within the respective armies, and adversarial (competing) across the armies. Sociology and computer vision researchers often neglect adversarial relations and only analyze non-adversarial relations, such as spatial proximity, friendship and kinship. The co-occurrence of both relations generates an adversarial network, which exhibits a heterogeneous social structure. Technically, the adversarial or non-adversarial relation between the characters ci and cj can be represented by a real-valued weight in the affinity matrix K(ci, cj), which will be decided by the proposed affinity learning method.

1 While conventionally v is used to represent a vertex in a graph, we will also use c in this paper; both v and c point to the same character in the movie.


A movie M is composed of M non-overlapping scenes, M = s1 ∪ s2 ∪ · · · ∪ sM, where each scene contains interactions among a set of movie characters. The appearance of a character in a scene can be encoded in a scene-character relation matrix denoted by A = {Ai,j}, where Ai,j = 1 if cj ∈ si. It can be obtained by searching for speaker names in the script. This representation is reminiscent of the actor-event graph in social network analysis [22]. While the character relations in A can be directly used for construction of the social network, we will demonstrate later that the use of visual and auditory scene features can lead to a better social network representation by learning the inter-character affinity matrix K.

Temporal Segmentation of Movie into Scenes. In order to align visual and auditory features with the movie script, we require temporal segmentation of the movie into scenes, which provides start and stop timings for each scene. This segmentation process is guided by the accompanying movie script and closed captions. The script is usually a draft version with no time tagging and lacks professional editing, while the closed captions are composed of lines di, which contain timed sentences uttered by characters. The approach we use to perform this task can be considered a variant of the alignment technique in [5] and is summarized in Figure 2:

1. Divide the script into scenes, each of which is denoted as si. Similarly, divide the closed captions into lines di.
2. Define C to be a cost matrix. Compute the percentage p of the words in closed caption dj matched with scene si while respecting the order of words. Set the cost as Ci,j = 1 − p.
3. Apply dynamic time warping to C to estimate the start time ti1 and stop time ti2 of si, which respectively correspond to the smallest and largest time stamps of the closed captions matched with si.
Because publicly available movie scripts are not perfectly edited, the temporal segmentation may not be precise. Regardless, our approach is robust to such inaccuracies in segment boundaries. A potential future modification of the temporal segmentation is to combine the proposed approach with other automatic scene segmentation techniques, such as [24].
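The alignment steps above can be sketched as follows. This is a minimal illustration: the function names and the greedy order-preserving word-matching heuristic are our own simplifications under stated assumptions, not the authors' implementation.

```python
import numpy as np

def match_percentage(scene_words, caption_words):
    """Fraction of caption words found in the scene text, respecting word order
    (a simple greedy heuristic; illustrative only)."""
    i, matched = 0, 0
    for w in caption_words:
        # advance through the scene words looking for w
        while i < len(scene_words) and scene_words[i] != w:
            i += 1
        if i < len(scene_words):
            matched += 1
            i += 1
    return matched / max(len(caption_words), 1)

def align_scenes_to_captions(scenes, captions):
    """scenes: list of word lists; captions: list of (t_start, t_end, word_list).
    Returns one (start, stop) time pair per scene via dynamic time warping
    on the cost matrix C[i, j] = 1 - p."""
    M, N = len(scenes), len(captions)
    C = np.ones((M, N))
    for i, s in enumerate(scenes):
        for j, (_, _, d) in enumerate(captions):
            C[i, j] = 1.0 - match_percentage(s, d)
    # accumulated DTW cost with steps (i-1, j-1), (i-1, j), (i, j-1)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack to recover which captions map to each scene
    path, i, j = [], M, N
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    # a scene's start/stop are the smallest/largest matched caption time stamps
    times = {}
    for si, dj in path:
        t1, t2, _ = captions[dj]
        lo, hi = times.get(si, (t1, t2))
        times[si] = (min(lo, t1), max(hi, t2))
    return [times[k] for k in range(M)]
```

Because dynamic time warping enforces monotonic ordering, a noisy match on one caption line cannot reorder scenes, which reflects the robustness to imprecise scene boundaries noted above.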

3 Learning Social Networks

Adversarial is defined as "to have or involve antagonistic parties or opposing interests" between individuals or groups of people [16]. In movies, or more generally in real-life environments, adversarial relations between individuals are exhibited in the words they use, the tones of their speech and the actions they perform. Considering that a scene is the smallest segment of a movie which contains a continued event, low-level features generated from the video and audio of each scene can be used to quantify adversarial and non-adversarial content. In the following, we conjecture that members of the same community co-occur more often in non-adversarial scenes than in adversarial ones, and we learn the social network formed by movie characters based on both the scene-character relations and the scene contents.


Fig. 2. Temporal segmentation of a movie into scenes. The colored numbers in the middle block indicate matched sentences in the closed captions shown on the right.

3.1 Scene Level Features and Scene Characterization

Movie directors often follow certain rules, referred to as film grammar or cinematic principles in the film literature, to emphasize the adversarial content in scenes. Typically, adversarial scenes contain abrupt changes in visual and auditory content, whereas this content changes gradually in non-adversarial scenes. We should, however, note that these clues can be dormant depending on the director's style. In our framework, we handle such dormant relations by learning a robust support vector regressor from a training set. The visual and auditory features, which quantify adversarial scene content, can be extracted by analyzing the disturbances in the video [18].

In particular, for measuring visual disturbance, we follow the cinematic principles and conjecture that for an adversarial scene the motion field is nearly evenly distributed in all directions (see Figure 3 for an illustration). For generating the optical flow distributions, we use the Kanade-Lucas-Tomasi tracker [20] within the scene bounds with good features to track. Alternatively, one can use a dense flow field generated by estimating optical flow at each pixel [15]. The visual disturbance in the observed flow field can be measured by the entropy of the orientation distribution, as shown in Figure 4. Specifically, we apply a moving window of 10 frames with an overlap of 5 frames for constructing the orientation histograms of optical flows, where the flow vectors are weighted by the magnitude of motion. The number of orientation bins is set to 10, and the number of entropy bins in the final feature vector is set to 5. As can be observed in Figure 5, flow distributions generated from adversarial scenes tend to be uniform and thus consistently have more high-entropy peaks compared to non-adversarial scenes. This observation serves as the basis for distinguishing the two types of scenes.
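The entropy feature described above can be sketched as follows, assuming the per-window flow vectors have already been extracted by the tracker (function names and the normalization are illustrative, not taken from the paper):

```python
import numpy as np

def flow_orientation_entropy(u, v, n_bins=10):
    """Entropy of the magnitude-weighted orientation histogram of a flow field.
    u, v: arrays of flow components for the tracked points in one temporal window."""
    mag = np.hypot(u, v)
    ang = np.arctan2(v, u)  # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    if hist.sum() == 0:
        return 0.0
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def scene_entropy_feature(windows, n_entropy_bins=5):
    """Histogram of per-window entropies over a scene: the 5-bin visual feature.
    windows: list of (u, v) flow arrays, one pair per 10-frame window."""
    ents = [flow_orientation_entropy(u, v) for u, v in windows]
    max_ent = np.log2(10)  # maximum entropy with 10 orientation bins
    feat, _ = np.histogram(ents, bins=n_entropy_bins, range=(0.0, max_ent))
    return feat / max(len(ents), 1)
```

A nearly uniform flow field yields an entropy close to log2(10) ≈ 3.32, while motion concentrated in one direction yields an entropy near zero, which is exactly the contrast between adversarial and non-adversarial scenes exploited above.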
Auditory features extracted from the accompanying movie audio are used together with the visual features to improve the performance. We adopt a combination of temporal and spectral auditory features discussed in [13, 18]: energy peak ratio, energy entropy, short-time energy, spectral flux and zero crossing rate. Specifically, these features are computed for sliding audio frames that are


Fig. 3. Visual and auditory characteristics of adversarial scenes. Top row: non-adversarial scenes from Year One (2009) and G.I. Joe: The Rise of Cobra (2009); bottom row: adversarial scenes from these two movies. Optical flow vectors are superimposed on the frames, and the computed features are shown as plots for a temporal window of 10 video frames, including the entropy distribution of optical flow vectors and the detected energy peaks (red dots in the energy signals).

400 ms in length. The means of these features over the duration of the scene constitute a feature vector. A sample auditory feature (energy peaks) is shown in Figure 3 for both adversarial and non-adversarial scenes. It can be observed that adversarial scenes have more peaks in their energy signals, which are moving averages of the squared audio signal.

The visual and auditory features provide two vectors per scene (a 5-dimensional visual vector and a 5-dimensional auditory vector), which are used to estimate a real value βi ∈ [−1, +1] quantifying the adverseness of the scene si. Broadly speaking, the more negative βi is, the more adversarial the scene is, and vice versa. In order to estimate βi, we use support vector regression (SVR) [21], which has been successfully used to solve various problems in the recent computer vision literature. We apply a radial basis function to both the visual and auditory feature vectors, which leads to two kernel matrices Kv and Ka, respectively. The two kernel bandwidths can be chosen by cross-validation. The joint kernel is then computed as the multiplication kernel K̂ = Kv Ka. Due to space limitations, we skip the details of the SVR and refer the reader to [21]. The final decision function is written as βi = g(si) = Σ_{j=1}^{L} (αj − αj∗) K̂_{lj,i} + b, where the coefficient


Fig. 4. Generation of the normalized entropy histogram from orientation distributions of optical flows detected from a scene.

Fig. 5. Visualization of entropy histogram feature vectors extracted from four example movies. The two classes (adversarial and non-adversarial) have distinct patterns, in that adversarial scenes tend to consistently produce strong peaks in high entropies. Best viewed in color.

b is the offset, αj and αj∗ are the Lagrange multipliers for the labeling constraints, L is the number of labeled examples, and lj is the index of the j-th labeled example. The training of the support vector regressor uses a set of scenes labeled as adversarial (βi = −1) or non-adversarial (βi = +1). We define a non-adversarial scene in the training and test sets as a scene which contains character members from only one group. Conversely, a scene in which members of rival groups co-occur is labeled as adversarial. Considering that the adverseness of a scene is sometimes unclear and involves high-level semantics instead of pure observations, the stated approach avoids subjectiveness in scene labeling. The labeling of scenes si in a novel movie M is then achieved by estimating the corresponding βi using the regression learned from labeled scene examples from other movies in a dataset.
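As one possible realization of this regression step, the joint kernel and the SVR can be implemented with scikit-learn's precomputed-kernel interface. The bandwidths (gamma values) and function names below are placeholders, not the authors' settings:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

def train_adverseness_regressor(Xv, Xa, y, gamma_v=1.0, gamma_a=1.0):
    """Fit an SVR on the product of visual and auditory RBF kernels.
    Xv, Xa: (n_scenes, 5) visual / auditory feature vectors;
    y: scene labels in {-1 (adversarial), +1 (non-adversarial)}."""
    K = rbf_kernel(Xv, gamma=gamma_v) * rbf_kernel(Xa, gamma=gamma_a)  # elementwise product
    model = SVR(kernel="precomputed")
    model.fit(K, y)
    return model

def predict_adverseness(model, Xv_tr, Xa_tr, Xv_te, Xa_te, gamma_v=1.0, gamma_a=1.0):
    """Estimate beta for new scenes, clipped to [-1, +1] per the paper's convention."""
    K_te = rbf_kernel(Xv_te, Xv_tr, gamma=gamma_v) * rbf_kernel(Xa_te, Xa_tr, gamma=gamma_a)
    return np.clip(model.predict(K_te), -1.0, 1.0)
```

Using a precomputed kernel makes the multiplication kernel K̂ = Kv Ka explicit, and the same test-versus-training kernel matrix applies the learned decision function g to scenes of a novel movie.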


3.2 Learning Inter-character Affinity

Let ci be character i, and let f = (f1, · · · , fN)ᵀ be the vector of community memberships containing ±1 values, where fi refers to the membership of ci. Let f be distributed according to a zero-mean, identity-covariance Gaussian process, P(f) = (2π)^(−N/2) exp(−fᵀf/2). In order to model the information contained in the scene-character relation matrix A and the aforementioned adverseness βk of each scene, we assume the following distributions: (1) if ci and cj occur in a non-adversarial scene k (βk ≥ 0), we assume fi − fj ∼ N(0, 1/βk²); (2) if ci and cj occur in an adversarial scene k (βk < 0), we assume fi + fj ∼ N(0, 1/βk²).

Therefore, if βk = 0, the constraint imposed by a scene becomes inconsequential, which corresponds to the least confidence in the constraint. On the other hand, if βk = ±1, the corresponding constraint becomes the strongest. Because of the distributions we use, none of the constraints is hard, making our method relatively flexible and insensitive to prediction errors. Applying Bayes' rule, the posterior probability of f given the constraints is defined by:

P(f | A, β) ∝ exp( −(1/2) fᵀf − Σ_{k: βk ≥ 0} Σ_{ci,cj ∈ sk} βk²(fi − fj)²/2 − Σ_{k: βk < 0} Σ_{ci,cj ∈ sk} βk²(fi + fj)²/2 ).
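The exponent of this posterior is quadratic in f, so the posterior is Gaussian with a precision matrix that accumulates one rank-one term per character pair per scene. The sketch below assembles that precision matrix; reading the double sums as pairwise accumulation over co-occurring characters is our interpretation, and the function name is illustrative:

```python
import numpy as np

def posterior_precision(A, beta):
    """Assemble the precision matrix I + sum_k beta_k^2 * e e^T, where
    e = e_i - e_j for non-adversarial scenes (beta_k >= 0) and
    e = e_i + e_j for adversarial scenes (beta_k < 0).
    A: (n_scenes, n_chars) 0/1 scene-character relation matrix;
    beta: per-scene adverseness values in [-1, +1]."""
    n_scenes, n_chars = A.shape
    P = np.eye(n_chars)  # identity from the zero-mean, identity-covariance prior
    for k in range(n_scenes):
        chars = np.flatnonzero(A[k])
        # sign of the second entry of e: -1 encodes (f_i - f_j), +1 encodes (f_i + f_j)
        sign = -1.0 if beta[k] >= 0 else 1.0
        for a in range(len(chars)):
            for b in range(a + 1, len(chars)):
                e = np.zeros(n_chars)
                e[chars[a]] = 1.0
                e[chars[b]] = sign
                P += beta[k] ** 2 * np.outer(e, e)
    return P
```

Each βk²(fi ∓ fj)²/2 term in the exponent contributes exactly one βk² e eᵀ summand here, so weak adverseness estimates (βk near 0) barely perturb the prior, matching the soft-constraint behavior described above.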