
Video Scene Determination using Audiovisual Data Analysis

Multimedia Network Systems & Applications 2004

SeungMin Rho and EenJun Hwang
Graduate School of Information and Communication
Ajou University

Motivation

- Use of shot/scene change detection
  - Shot boundaries are useful for video editing and for more detailed analysis of video content
  - Scene change detection is very important for video indexing and retrieval
- Existing algorithms
  - Mostly based on visual information such as color, motion, and edge features
  - Do not clearly distinguish between shot boundary detection and scene change detection


Our Goal

- Various audio feature analysis
  - Audio feature extraction
  - Classification of those features into 6 categories: silence, speech, music, speech w/ music, environmental sound, and speech w/ environmental sound
- Video scene classification using audio and visual information
  - Shot boundary detection: simultaneous change of color, motion, and audio characteristics
  - Scene determination: analyzed audio and video shots are classified into semantic scenes


Outline

- Related works
- Audio feature analysis
  - Short time average energy function
  - Average zero crossing rate
  - Energy distribution
  - Bandwidth
  - Harmonicity
- Scene determination
  - Video shot detection
  - Audio shot segmentation
  - Scene determination
- Experiments
- Conclusion and future works


Related Works

- Audio Segmentation and Classification
- Content-Based Audio Retrieval
- Audio Scene Analysis (ASA)
- Audio Analysis for Video Indexing
- Integration of Audio and Visual Information for Video Segmentation and Indexing


System Overview

[Figure: system architecture diagram. Block labels recovered from the diagram:
  Audiovisual Data (video signal, audio signal)
  Shot Analyzer: Shot boundary detection, Keyframe extraction, Audio feature extraction, Audio shot boundary detection
  Classifier: Classification of each segment, Characterization
  Energy function, Average zero crossing rate, Energy distribution, Bandwidth, Harmonicity
  Annotation tool, XML Database, Audiovisual Database]

Audio Feature Analysis

1. Short time average energy function
2. Average zero crossing rate
3. Energy distribution
4. Bandwidth
5. Harmonicity
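The slides name these features without defining them. As a minimal sketch, the first two can be computed per frame using their common textbook definitions: mean squared amplitude for the short time average energy, and the fraction of adjacent sample pairs that change sign for the zero crossing rate. The function names, the rectangular window, and the `frame_len` and `hop` parameters are assumptions for illustration, not taken from the presentation.

```python
def frames(samples, frame_len, hop):
    """Yield successive frames of frame_len samples, advancing by hop."""
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield samples[start:start + frame_len]


def short_time_energy(samples, frame_len, hop):
    """Mean squared amplitude of each frame (rectangular window assumed)."""
    return [sum(x * x for x in frame) / frame_len
            for frame in frames(samples, frame_len, hop)]


def zero_crossing_rate(samples, frame_len, hop):
    """Fraction of adjacent sample pairs in each frame whose signs differ.

    Zero is treated as non-negative, so a run of zeros has no crossings.
    """
    rates = []
    for frame in frames(samples, frame_len, hop):
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        rates.append(crossings / (frame_len - 1))
    return rates


if __name__ == "__main__":
    # A rapidly alternating frame followed by a silent frame.
    signal = [1, -1, 1, -1, 0, 0, 0, 0]
    print(short_time_energy(signal, frame_len=4, hop=4))
    print(zero_crossing_rate(signal, frame_len=4, hop=4))
```

In this toy signal the alternating frame has high energy and a high zero crossing rate, while the silent frame has zero for both, which is the kind of contrast the threshold tests on Em and ZCR rely on.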


Audio Features Computation

[Flowchart; the branch connections are not recoverable from the extracted text. Nodes:
  Audio Input
  Compute Average Energy (Em); decision: Is Em > Te?
  Compute Fundamental Frequency (Ff); decision: Is Ff > Tfreq?
  Compute Average Zero Crossing Rate (ZCR); decision: Is ZCR > Tzcr?
  Compute the Harmonicity
  Outcomes: Silence, Speech, Music or Environmental Sound]


Audio Feature Analysis (1)

[Figure: audio waveform and spectrogram of (a) speech and (b) speech w/ music]


Audio Feature Analysis (2)

[Figure: audio waveform and spectrogram of (a) music and (b) environmental sound]


Scene Determination Process (1)

[Flow diagram]
- AudioVisual Data -> DeMultiplexer -> Audio Signal, Video Signal
- Audio Signal, Video Signal -> Shot Analyzer -> Audio Shots, Video Shots
- Scene Determination Process
  - Find the candidate scene boundary
  - Compare the candidate audio shots and adjust candidate video shots to the starting time of the closer audio shots
  - Merge the consecutive shots
  - Scene boundary is determined


Scene Determination Process (2)

Step 1:
  If t(CSvi) = t(CSaj) then
    a candidate scene boundary is detected; go to Step 3
  else
    go to Step 2

Step 2:
  diff1 = Diff(t(CSvi), t(CSaj)), diff2 = Diff(t(CSvi), t(CSaj+1))
  If diff1 < diff2, the candidate shot boundary is adjusted to t(CSaj)
  If diff1 ≥ diff2, the candidate shot boundary is adjusted to t(CSaj+1)
  Then go to Step 3

where
  CSvi = candidate video shot boundaries (i = 1, …, n)
  CSaj = candidate audio shot boundaries (j = 1, …, m)
  t(CS) = starting time of a candidate shot boundary
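Steps 1 and 2 can be sketched as a small alignment routine: a video boundary that coincides with an audio boundary is kept as a candidate scene boundary, and otherwise it is moved to the nearer of two neighbouring audio boundaries. This sketch assumes that CSaj and CSaj+1 are the audio boundaries immediately before and after the video boundary, which the slides do not state explicitly. The function name and the handling of boundaries outside the audio range are also assumptions.

```python
from bisect import bisect_left


def align_video_boundary(t_video, audio_times):
    """Return (adjusted_time, exact_match) for one candidate video boundary.

    audio_times must be a non-empty list of candidate audio shot start
    times sorted in ascending order.
    """
    if not audio_times:
        raise ValueError("audio_times must not be empty")

    k = bisect_left(audio_times, t_video)

    # Step 1: the video boundary coincides with an audio boundary.
    if k < len(audio_times) and audio_times[k] == t_video:
        return t_video, True

    # Assumed edge handling: outside the audio range, snap to the end point.
    if k == 0:
        return audio_times[0], False
    if k == len(audio_times):
        return audio_times[-1], False

    # Step 2: compare distances to the previous (CSaj) and next (CSaj+1)
    # audio boundaries; a tie goes to the next one (diff1 >= diff2).
    before, after = audio_times[k - 1], audio_times[k]
    diff1 = abs(t_video - before)
    diff2 = abs(after - t_video)
    return (before if diff1 < diff2 else after), False


if __name__ == "__main__":
    audio = [0.0, 4.0, 10.0]
    for t in (4.0, 5.0, 7.0, 8.5):
        print(t, align_video_boundary(t, audio))
```

Keeping the audio boundaries sorted lets the two neighbours be found with a binary search instead of scanning every audio shot for each video boundary.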


Scene Determination Process (3)

Step 3:
  Max1 = Max(F(t(CSa))) between t(CSvi) and t(CSvi+1)
  Max2 = Max(F(t(CSa))) between t(CSvi+1) and t(CSvi+2)
  If dist(Max1, Max2) ≤ Tf then
    merge the consecutive video shots (CSvi and CSvi+1) and adjust the candidate shot boundary to t(CSvi+1)
  else
    the scene boundary is determined

where
  F(t(CSa)) = { Silence, Speech, Music, Speech w/ Music, Environmental Sound, Speech w/ Environmental Sound }
  Max = maximum value of the percentages of candidate audio shots within a candidate video shot
  dist(Max1, Max2) = Euclidean distance between the maximum values returned by the Max functions
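A literal reading of Step 3 can be sketched as follows. Two points are assumptions here, since the slides do not settle them: the percentage of each audio class is measured by duration inside the video shot (rather than by shot count), and Max1 and Max2 are single numbers, so the Euclidean distance reduces to an absolute difference. The function names and the `(start, end, label)` tuple layout are hypothetical.

```python
CLASSES = (
    "Silence", "Speech", "Music", "Speech w/ Music",
    "Environmental Sound", "Speech w/ Environmental Sound",
)


def class_percentages(audio_shots, start, end):
    """Fraction of the interval [start, end) covered by each audio class.

    audio_shots is a list of (shot_start, shot_end, label) tuples.
    """
    totals = {label: 0.0 for label in CLASSES}
    for shot_start, shot_end, label in audio_shots:
        overlap = min(shot_end, end) - max(shot_start, start)
        if overlap > 0:
            totals[label] += overlap
    span = end - start
    return {label: total / span for label, total in totals.items()}


def should_merge(audio_shots, t_i, t_i1, t_i2, threshold):
    """Step 3 test: merge the video shots [t_i, t_i1) and [t_i1, t_i2)
    when their dominant audio percentages differ by at most threshold (Tf)."""
    max1 = max(class_percentages(audio_shots, t_i, t_i1).values())
    max2 = max(class_percentages(audio_shots, t_i1, t_i2).values())
    # Euclidean distance between two scalars is their absolute difference.
    return abs(max1 - max2) <= threshold


if __name__ == "__main__":
    shots = [(0, 4, "Speech"), (4, 10, "Music"), (10, 20, "Music")]
    print(class_percentages(shots, 0, 10))
    print(should_merge(shots, 0, 10, 20, threshold=0.5))
```

Under this reading the test compares only how dominant the leading class is, not which class it is. Comparing the full six-element percentage vectors would be a different design that the slides neither confirm nor rule out.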


Scene Determination


Experiments

- Video data
  - 6 samples: movies, TV commercials, news
  - captured at 320x240 pixels, 24 bits, 30 fps
- Audio data
  - sampled as 16-bit stereo
  - sampled at 44.1 kHz
- Measurement


Scene Detection Rate by audiovisual features

Category               Sample     Shot  Scene  Correct  Miss  Fault  Precision  Recall
TV Commercials & News  VSample1     85     22       20     2      1       0.95    0.91
                       VSample2     54     16       13     3      2       0.87    0.81
                       VSample3     18     10        8     2      1       0.89    0.80
                       Average                                            0.90    0.84
Movies                 VSample4     68     26       23     3      1       0.96    0.88
                       VSample5    159     41       34     7      1       0.97    0.83
                       VSample6    243     67       61     6      2       0.97    0.91
                       Average                                            0.97    0.87
Total average                                                             0.93    0.86
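The formulas behind the Precision and Recall columns are not reproduced in the slide text. The reported figures are consistent with the usual definitions, precision = Correct / (Correct + Fault) and recall = Correct / (Correct + Miss), where Correct + Miss equals the Scene count. A short check under that assumption (the function name is illustrative):

```python
def precision_recall(correct, miss, fault):
    """Precision and recall from detection counts, rounded to 2 decimals.

    precision = correct / (correct + fault)
    recall    = correct / (correct + miss)
    """
    detected = correct + fault
    actual = correct + miss
    precision = correct / detected if detected else 0.0
    recall = correct / actual if actual else 0.0
    return round(precision, 2), round(recall, 2)


if __name__ == "__main__":
    # (Correct, Miss, Fault) for VSample1 and VSample5 from the table.
    print(precision_recall(20, 2, 1))
    print(precision_recall(34, 7, 1))
```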


Details of integrated evaluation

Each cell: Correct / Miss / Fault / Precision / Recall

Category               Sample    ①                   ②                   ③                   ④ (①+②)             ⑤ (①+③)             ⑥ (①+②+③)
TV Commercial & News   VSample1  14/8/4/0.78/0.64    6/16/13/0.32/0.27   8/14/7/0.53/0.36    15/7/6/0.71/0.68    19/3/3/0.86/0.86    20/2/1/0.95/0.91
                       VSample2  10/6/3/0.77/0.63    4/12/7/0.36/0.25    5/11/6/0.46/0.31    11/5/5/0.69/0.69    12/4/3/0.8/0.75     13/3/2/0.87/0.81
                       VSample3  6/4/2/0.75/0.6      2/8/3/0.4/0.2       3/7/4/0.43/0.3      6/4/4/0.6/0.6       8/2/2/0.8/0.8       8/2/1/0.89/0.8
Movies                 VSample4  17/9/4/0.81/0.65    8/18/3/0.73/0.31    10/16/13/0.43/0.38  19/7/5/0.79/0.73    21/5/3/0.88/0.81    23/3/1/0.96/0.88
                       VSample5  28/13/6/0.82/0.68   11/30/12/0.48/0.27  16/25/24/0.4/0.39   30/11/16/0.65/0.73  33/8/4/0.89/0.8     34/7/1/0.97/0.83
                       VSample6  47/20/9/0.84/0.7    24/43/19/0.56/0.36  21/46/33/0.39/0.31  51/16/24/0.68/0.76  58/9/4/0.94/0.87    61/6/2/0.97/0.91

Conclusion & Future Work

- A scheme for scene change determination based on the integration of audio and video information is proposed
- Various useful audio features are discussed, along with a method for classifying semantic scenes by analyzing audio and video data together
- Future work
  - Better audio features for scene classification
  - Better integration of audio/visual information for classification



Q&A

Thank you! Any questions?
Visit our homepage if you need additional information: http://adtl.ajou.ac.kr