A Hierarchical Signature Scheme for Robust Video Authentication using Secret Sharing

Pradeep K. Atrey, Wei-Qi Yan, Ee-Chien Chang, Mohan S. Kankanhalli
School of Computing, National University of Singapore

Abstract

Ensuring the integrity of a digital video is an important and challenging research problem arising in many video applications. In this paper, we present a hierarchical framework for video authentication based on cryptographic secret sharing that protects a video against spatial cropping and temporal jittering, yet is robust against frame dropping in the streaming video scenario. Our algorithm provides a trade-off between security and robustness through configurable inputs. The authentication signature is compact and very sensitive to spatial attacks such as region tampering, and to inter-frame attacks such as frame replacement, major frame dropping, and frame reordering. Given a video, we identify the key frames based on the differential energy between frames. Considering video frames as shares, we compute secrets at three hierarchical levels; the master secret is used as a digital signature to authenticate the video. We present extensive experimental results that show the utility of our technique.

1. Introduction

Video authentication techniques have witnessed a tremendous rise in interest over the past few years. By definition, authentication is a process whereby an entity proves its identity to another entity. In the multimedia context, video authentication aims to establish the veracity of a video in time, sequence, and content. A video authentication system ensures the integrity of a digital video and verifies that the video in use has not been tampered with. Video authentication is important in many applications such as surveillance, journalism, and video broadcasting. In surveillance camera systems, it is hard to establish that the digital video produced as evidence is the one actually shot by the camera. A journalist cannot prove that a video played by a news channel is trustworthy. There is therefore a compelling need that a video, wherever and in whatever form it is, should be made authenticable before use. In this paper, we propose a video authentication system that is sensitive to spatial and temporal tampering, yet robust against frame dropping. Our method is suitable for the scenario of video streaming through a communication channel. Due to the large size of video data, video streaming often suffers from congestion at bottlenecks in the network. To overcome this congestion, some data loss (for example, the loss of a few video frames) is inevitable. We exploit cryptographic secret sharing and the temporal relationships in video to tolerate frame drops while still maintaining the integrity of the video.

The core idea of our technique is to utilize three hierarchical levels of a video and to use cryptographic secret sharing [12] to create what we call a "secret frame". We authenticate a given video by computing secret frames based on randomly generated private keys at three hierarchical levels, i.e., the key frame level, the shot level, and the video level. We first segment the video into shots. Then, for each shot, we identify the key frames. At the key frame level, we compute the secret for each pair of key frames using secret sharing, considering all non-key frames between the two key frames as shares. The secrets computed at this level and the key frames are then treated as shares to compute the secret at the shot level. Finally, all shot secrets are used to compute a master secret that serves as the signature for the video. We provide a trade-off between security and robustness through configurable inputs to our algorithm. In our scheme, the size of the authenticating signature equals the size of a single video frame, irrespective of the length of the video.

We begin the paper with a brief overview of a typical video authentication system, its properties, and benign and tampering operations in section 2. In section 3, we discuss past work and introduce how cryptographic secret sharing is used in our framework. In section 4, we present our algorithm together with an analysis of its security against various attacks and its robustness to benign operations. We present the results in section 5. Finally, section 6 concludes the paper with discussions.

2. Video authentication

A typical video authentication system is shown in figure 1. In the authentication process, for a given video, the authentication data S is generated from the features f of the video. This authentication data S is encrypted and packaged with the video as a signature, or alternatively it can be embedded into the video content as a watermark. The video integrity is verified by computing new authentication data S' for the given video. The new authentication data S' is compared with the decrypted original authentication data S. If the two match, the video is treated as authentic; otherwise it is construed to be tampered with.

[Figure 1. A typical video authentication system: (a) the authentication process, in which features f are extracted from the video, the authentication algorithm produces data S, and S is encrypted with key K to yield EK(S), which accompanies the video; (b) the verification process, in which new authentication data S' is computed from the received video and compared with the decrypted S to declare the video authentic or tampered.]

An ideal video authentication system, to be efficacious, must adhere to properties such as sensitivity to malicious alterations, localization and self-recovery of altered regions, robustness to normal video processing operations, tolerance of some loss of information, compactness of the signature, a one-way signature, a low false alarm rate, and computational feasibility. Several video processing operations that do not modify content semantically, such as geometric transformations, image enhancement, and compression, are classified as benign operations. In addition to being robust against benign operations, an ideal video authentication system must resist all possible attacks and must verify whether a given video has been tampered with. It is also very useful to determine where (i.e., localization of alterations) and how the tampering has been done [5].

Based on the dimension, tampering is divided into spatial tampering and temporal tampering. Spatial tampering, also called intra-frame tampering, refers to alterations of frame content. The malicious operations include cropping, replacement, and content addition and removal at the pixel level, block level, frame level, sequence-of-frames level, and shot level. The manipulation may involve, for example, spatial cropping of a specified region in one frame, in several frames, or even in an entire shot or video. Temporal tampering is inter-frame jittering; it refers to manipulations performed with respect to time. The possible alterations are to drop (or remove), replace, add, or reorder video frames (or sequences of frames). These manipulations can be performed at the sequence-of-frames level, shot level, or video level. Due to the temporal redundancy in video data, dropping a few adjacent frames does not much affect the visual appearance or the semantic meaning. So we can afford to drop or reorder frames up to a certain extent if the application so demands.

3. Related work

There has been some relevant work in the area of video authentication, as typified by references [1]-[11]. Mainly two approaches have been used: digital signatures [6-8, 10-11] and digital watermarking [1-5, 9]. In both, cryptographic hash functions have been widely used to ensure security, by hashing the inherent features of the video to produce a compact signature or watermark. A cryptographic hash function is good for security, but it offers no robustness to benign operations; hence the need for visual hash functions [16] has been discussed. In the past, Lin et al. [8] and Peng et al. [5] proposed compressed-domain schemes that are robust against transcoding [5, 8] and editing [8] operations. To compute the signature, Lin [8] used the difference in DCT coefficients of frame pairs, which is vulnerable to a counterfeiting attack, since the values of the DCT coefficients can be modified while keeping their relationship preserved. Peng [5] used DC-DCT coefficients as features to build a watermark. Since the DC-DCT coefficients are computed by a linear equation over the pixel-value statistics, they can be compromised: an attacker can replace a block of 8×8 pixels with another 64 pixel values that retain the same DC-DCT coefficient. Features such as frame edges, used by Tzeng et al. [6], are also highly vulnerable to content modification if a smart attacker modifies the content while keeping the edges preserved. Recently, He et al. [9] used background features to embed a watermark into foreground objects, establishing a relation between the background and the foreground of a video. The scheme is interesting, but the temporal relationships in video data are not fully utilized.

In our method, we use cryptographic secret sharing to ensure security down to the pixel level by computing each pixel's secret along the temporal axis. Cryptographic secret sharing has been successfully used for non-traditional applications such as message authentication [14]. A notable feature of this method is that it ensures unconditional security [13]; in other words, the security of secret sharing does not rely on any unproven assumptions (unlike that of many other cryptographic schemes). This motivated us to use secret sharing in place of a traditional hash function, but with a fundamental difference. In normal secret sharing, one holds the original secret and uses the interpolating polynomial to compute the shares (corresponding to frames here). In our scheme, the shares are given, and we use them to compute the polynomial and then deduce the corresponding secret by extrapolating the polynomial at a position known privately to the authenticator and the verifier. We also exploit the temporal redundancy in video to reduce the number of shares required in the secret computation, which greatly reduces the computational effort. Moreover, in the secret sharing framework, by adding some redundant frames (like error-correcting codes) lying on the interpolating polynomial, we can afford to lose some frames yet compute the same secret [14]. This enhances the degree of robustness, which is novel compared to other hashing schemes.
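To make the drop-tolerance idea concrete, here is a toy numeric sketch (ours, not the paper's implementation): r shares determine a degree-(r-1) polynomial, and extra on-curve points act like error-correcting redundancy, so any r surviving points recover the same secret. Floating-point arithmetic is used for brevity; a real scheme would operate over a finite field.

import numpy as np

def lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at position x."""
    total = 0.0
    for j, (xj, yj) in enumerate(points):
        basis = 1.0
        for i, (xi, _) in enumerate(points):
            if i != j:
                basis *= (x - xi) / (xj - xi)   # Lagrange basis L_j(x)
        total += yj * basis
    return total

PK = 17.0                                       # private secret position
shares = [(1.0, 5.0), (2.0, 9.0), (3.0, 2.0), (4.0, 7.0)]   # r = 4 shares
# Two redundant points lying on the same degree-3 polynomial.
redundant = [(x, lagrange_eval(shares, x)) for x in (5.0, 6.0)]

secret = lagrange_eval(shares, PK)
# Drop two original shares; any 4 remaining on-curve points still recover it.
survivors = shares[2:] + redundant
assert np.isclose(lagrange_eval(survivors, PK), secret)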

4. Proposed framework

Our algorithm is designed to work in the uncompressed domain so that it remains independent of the video file type; however, it can be adapted to the compressed domain with minimal changes. The input to our algorithm is a video consisting of a sequence of frames. Since, for color video, we work on the luminance channel, each frame contains the luminance values (0-255) of all its pixels.

4.1. Overview of the method

The complete process of video authentication is described in the flowchart given in figure 2. The steps are as follows:
a) An input video is segmented into video shots, Vsj, 1 ≤ j ≤ p. We define a 'video shot' as a contiguous recording of one or more video frames describing a contiguous action in time and space.
b) For each of the video shots, we find the key frames (K-frames), Ki, 1 ≤ i ≤ q (see section 4.2.1). Then we normalize all frames and optimize the non-key frames to exploit the temporal redundancy in the video (see section 4.2.2). In these steps, we use the configurable inputs of our algorithm (DT, Tb, W, and Tp) to control the degree of security and robustness (see table 1).
c) Then we compute a secret frame at the key frame level (Sk-frame), Ski, 1 ≤ i ≤ q-1, based on a private key PK1 (i.e., the key at the key-frame level) using (r, r)-secret sharing. The r normalized and optimized non-key frames between two key frames are treated as shares, and the secret is computed by extrapolation as described in section 4.2.2.
d) Using the K1, Sk1, K2, Sk2, …, Kq-1, Skq-1, Kq frames as shares, we compute another secret frame at the video shot level based on a private key PK2 (i.e., the key at the shot level) using (2q-1, 2q-1)-secret sharing. We denote this secret frame as the Sv-frame, i.e., Svj for shot j, 1 ≤ j ≤ p (see section 4.2.3).
e) Finally, we compute a master secret (Ms-frame) from all Sv-frames based on a private key PK3 (i.e., the key at the video level) (see section 4.2.4). The Ms-frame, along with the configurable inputs (DT, Tb, W, Tp, and Ts), is encrypted using a private key PK4 (i.e., the encryption key) to form the signature for the video. A minimal sketch of this pipeline appears after table 1.

[Figure 2. Video authentication process: the video is segmented into p shots Vs1, …, Vsp; for each shot, q key frames K1, …, Kq are found; for each pair of consecutive key frames, a secret Ski is computed from the r normalized and optimized non-key frames between them using (r, r)-secret sharing with key PK1; the key frames and Sk-frames are combined into a shot secret Svj using (2q-1, 2q-1)-secret sharing with key PK2; the shot secrets yield the master secret Ms via (p, p)-secret sharing with key PK3; finally, DS = EncrPK4(Ms + configured inputs).]

Table 1. Configurable inputs to our algorithm

Input parameter | Description
DT              | Threshold for differential energy
Tb              | Threshold difference at block level
W               | Weight factor
Tp              | Threshold for pixel difference
Ts              | Threshold for signature similarity
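The following minimal Python sketch (ours, not the paper's code) strings steps (a)-(e) together. Here segment_shots is a placeholder for any shot-boundary detector [15], while find_key_frames, normalize, and extrapolate_secret are defined in the sketches accompanying sections 4.2.1 and 4.2.2 below; it assumes at least one non-key frame between consecutive key frames.

# Hypothetical top-level signing pipeline; helper names are our assumptions.
def sign_video(frames, DT, Tb, W, Tp, PK1, PK2, PK3):
    """frames: list of 2-D uint8 luminance arrays. Returns the Ms-frame."""
    shot_secrets = []
    for shot in segment_shots(frames):                     # step (a)
        key_idx = find_key_frames(shot, DT, Tb, W)         # step (b)
        norm = [normalize(f, Tp) for f in shot]
        shares = []
        for a, b in zip(key_idx[:-1], key_idx[1:]):        # step (c)
            sk = extrapolate_secret(norm[a + 1:b], x=PK1)  # (r, r)-sharing
            shares += [norm[a], sk]                        # K_i, Sk_i, ...
        shares.append(norm[key_idx[-1]])                   # last key frame K_q
        shot_secrets.append(extrapolate_secret(shares, x=PK2))  # step (d)
    return extrapolate_secret(shot_secrets, x=PK3)         # step (e): Ms-frame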

4.2. Authentication steps

4.2.1. Key frame extraction in a shot. Since video shot boundary detection is very well understood, we assume that the shot boundaries of the video have been computed [15]. After a video shot is acquired, we identify the key frames of each shot using a method similar to the one proposed in [15]. The first frame in a shot is designated as a key frame. Then we compute the differential energy between this key frame and the subsequent frames. We define differential energy as the weighted sum of block-wise Euclidean differences between two frames. Once the differential energy exceeds a threshold value, we designate the corresponding frame as the next key frame. This process continues until the last frame, which is by default designated as a key frame.

• Differential energy computation. We use pixel luminance values to compute the differential energy. More formally, D_{l,m}, the differential energy between frame l and frame m, is computed as

    D_{l,m} = \sum_{k=0}^{b-1} w_{l,m}^{k} \times d_{l,m}^{k}    (1)

where k is the block index and b is the number of blocks in a frame. Here, d^k_{l,m} is the Euclidean difference between frame l and frame m for the kth block, and w^k_{l,m} is a weight factor that multiplies d^k_{l,m} to exaggerate it; the sensitivity of specific blocks against spatial cropping can thus be increased by raising the weight factor. The value of the weight factor is determined as follows: if d^k_{l,m} ≥ Tb, then w^k_{l,m} = W (with W > 1); otherwise, w^k_{l,m} = 1.

• K-frame identification. The algorithmic steps to identify K-frames are as follows (a sketch follows this list):
1. All N frames f0, f1, …, fN-1 in the shot are the input.
2. Designate f0 as the first key frame: K-frame[0] = 0.
3. Set h = 1, i = 0.
4. If h ≥ N, go to step 8.
5. Compute the differential energy D between the current key frame fK-frame[i] and fh.
6. If D ≥ DT, then set i = i + 1 and K-frame[i] = h.
7. Set h = h + 1 and go to step 4.
8. Set i = i + 1, K-frame[i] = N-1; return K-frame.
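A minimal Python sketch (ours, not the paper's code) of equation (1) and the K-frame identification steps above, assuming 8×8 luminance blocks; the names diff_energy and find_key_frames are our own.

import numpy as np

def diff_energy(fl, fm, Tb, W, block=8):
    """Weighted sum of block-wise Euclidean differences between two frames."""
    total = 0.0
    h, w = fl.shape
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            d = np.linalg.norm(fl[y:y+block, x:x+block].astype(float)
                               - fm[y:y+block, x:x+block].astype(float))
            total += (W if d >= Tb else 1.0) * d   # w^k = W when d^k >= Tb
    return total

def find_key_frames(shot, DT, Tb, W):
    """Return indices of key frames in a shot (first and last always kept)."""
    keys = [0]
    for h in range(1, len(shot)):
        if diff_energy(shot[keys[-1]], shot[h], Tb, W) >= DT:
            keys.append(h)
    if keys[-1] != len(shot) - 1:
        keys.append(len(shot) - 1)                 # last frame is a key frame
    return keys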

We uniformly normalize the 256 luminance levels of the pixels in all frames to fewer levels. The configurable parameter Tp determines the number of reduced levels; for instance, Tp = 10 reduces 256 levels to (256/10) ≈ 26 levels. This gives robustness against a global increase or decrease of the luminance values by up to Tp, and it also reduces the number of shares in the secret computation. After normalization, many pixel values are normalized to the same level. We exploit this redundancy along the temporal axis by keeping only the unique luminance values and ignoring the repeated ones among the non-key frames. We call this step optimization of the non-key frames. The optimized non-key frames are used in the computation of the Sk-frames.

• Sk-frame extrapolation. The Sk-frame has the same size as any frame in the video, as shown in figure 3. The secret frame extrapolation is shown in figure 4: the x-coordinate indicates the position of the optimized non-key frames, and the y-coordinate indicates the luminance value of the pixels of each of these frames. Using these frames, we compute the interpolating polynomial for (r, r)-secret sharing as

    f(x) = \sum_{j=1}^{r} Y_j \prod_{i=1, i \neq j}^{r} \frac{x - x_i}{x_j - x_i}    (2)

This expression is essentially the Lagrange interpolation formula, where x_i is the x-position of the ith non-key frame and Y_i is the pixel value of that frame. The formulation is exactly that of secret sharing, with a fundamental difference: in equation (2) the shares (the non-key frames) are given, and we use them to compute the polynomial and then deduce the corresponding secret frame by extrapolating the polynomial at the x-position determined by the private key PK1. The steps to compute the Ski-frame are summarized here (a sketch follows this list):
1. The r normalized and optimized non-key frames between the Kith and Ki+1th key frames are the input.
2. Position all the input non-key frames along the x-axis.
3. Extrapolate the secret frame Ski at position x = PK1.
Each pixel of the ith Sk-frame is thus the secret of the corresponding pixels of the non-key frames between the ith K-frame and the (i+1)th K-frame.
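A minimal per-pixel sketch (ours) of normalization and Sk-frame extrapolation per equation (2). It assumes the shares sit at consecutive x-positions 1, …, r, omits the unique-value optimization step for brevity, and uses float64 arithmetic; a cryptographically sound implementation would work over a finite field.

import numpy as np

def normalize(frame, Tp):
    """Quantize 256 luminance levels down to roughly 256/Tp levels."""
    return np.asarray(frame, dtype=np.float64) // Tp

def extrapolate_secret(shares, x):
    """Evaluate, at position x, the Lagrange polynomial through the shares.

    shares: list of r equally-shaped arrays; share j is placed at x_j = j + 1.
    Returns an array of the same shape: the secret frame.
    """
    xs = np.arange(1, len(shares) + 1, dtype=np.float64)
    secret = np.zeros_like(shares[0], dtype=np.float64)
    for j, Yj in enumerate(shares):
        # Lagrange basis L_j(x) = prod_{i != j} (x - x_i) / (x_j - x_i),
        # applied element-wise to every pixel of share Y_j.
        Lj = 1.0
        for i in range(len(shares)):
            if i != j:
                Lj *= (x - xs[i]) / (xs[j] - xs[i])
        secret += Lj * np.asarray(Yj, dtype=np.float64)
    return secret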

[Figure 3. Sk-frame computation: the non-key frames between the ith and (i+1)th key frames serve as shares.]

4.2.3. Secret frame computation at shot level. The secret frame for shot j, i.e., the Svj-frame, encapsulates the information in the K-frames and the Sk-frames computed in the previous steps. We consider the normalized K-frames (Ki, 1 ≤ i ≤ q) and the Sk-frames (Ski, 1 ≤ i ≤ q-1) as the shares (figure 5). The secret is computed using (2q-1, 2q-1)-secret sharing and the private key PK2, assuming there are q key frames in the shot. We summarize the steps here (a sketch follows figure 5):
1. The normalized K-frames and the Sk-frames are the input.
2. Position the ith K-frame and the ith Sk-frame of the shot at x-positions 2(i-1)+1 and 2(i-1)+2, respectively. The last K-frame is positioned at 2(q-1)+1.
3. Extrapolate the secret frame (Svj-frame) of shot j at position x = PK2.

[Figure 4. Sk-frame extrapolation: the optimized non-key frames, placed at consecutive x-positions, are the shares, and the Sk-frame is extrapolated at the x-position given by PK1.]

[Figure 5. Secret-frame extrapolation at shot level: the K-frames and Sk-frames are interleaved at x-positions 1, …, 2(q-1)+1, and the Svj-frame is extrapolated at the x-position given by PK2.]
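A minimal sketch (ours) of the shot-level computation, reusing extrapolate_secret from the sketch in section 4.2.2; the interleaving follows step 2 above, and the function name shot_secret is our own.

def shot_secret(k_frames, sk_frames, PK2):
    """k_frames: q normalized key frames; sk_frames: q-1 Sk-frames."""
    shares = []
    for i, K in enumerate(k_frames[:-1]):
        shares.append(K)                 # K_i at x-position 2(i-1)+1
        shares.append(sk_frames[i])      # Sk_i at x-position 2(i-1)+2
    shares.append(k_frames[-1])          # last K-frame at 2(q-1)+1
    return extrapolate_secret(shares, x=PK2)   # (2q-1, 2q-1)-secret sharing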

4.2.4. Master secret frame computation and signature generation. At the top (master) level of the hierarchical structure of our framework, we utilize the secrets obtained for all p shots of the video (i.e., the Svj-frames, 1 ≤ j ≤ p) to compute the master secret frame, using (p, p)-secret sharing and the private key PK3 in the same way as described in the previous two steps. This master frame is treated as the signature for the video. The Ms-frame (with the configurable inputs) is encrypted with the private key PK4 and is provided with the video to the verifier.

4.2.5. Signature verification. We verify the authenticity of a given video as follows. We follow the steps described in sections 4.2.1-4.2.4 to compute a new Ms-frame using the same configurable inputs with which the original Ms-frame was computed. The new Ms-frame is compared to the original Ms-frame. If they match, we are guaranteed that the video has not been tampered with and that its content is the same as the original's; otherwise, the video is not to be trusted. We measure the similarity between the two master frames (or signatures) pixel by pixel, and call this similarity the sim value. The steps to compute the sim value are as follows (a sketch follows this list):
1. The original Ms-frame O and the new Ms-frame N, each of size w × h, are the input.
2. For each location (i, j) in O and N: if abs(O(i, j) - N(i, j)) ≤ Ts, then Count += 1.
3. Percentage sim = (Count / (w × h)) × 100.
The sim value lies in the range [0, 100]. If sim = 100, the two master frames are the same, whereas sim = 0 means the two master frames are completely different. The sim value is used to judge the authenticity of a video: we can establish a threshold on the sim value, depending upon the application requirements, to decide whether a given video is tampered or not.
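A minimal sketch (ours) of the sim-value computation above; the function name is our own.

import numpy as np

def sim_value(O, N, Ts):
    """Percentage of pixels where the two Ms-frames differ by at most Ts."""
    O = np.asarray(O, dtype=np.float64)
    N = np.asarray(N, dtype=np.float64)
    count = np.count_nonzero(np.abs(O - N) <= Ts)
    return 100.0 * count / O.size

# Usage: a video is accepted as authentic when sim_value(O, N, Ts) clears an
# application-chosen threshold (the experiments in section 5 use 80%).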

4.3. Analysis

4.3.1. Security analysis. Our algorithm ensures the authenticity of a video by computing secrets from the video data based on the private keys (PK1, PK2, and PK3), which are known to the authenticator (the owner) and to the trusted verifier (the user) of the video. By having a different set of private keys for each user, the owner of the video can individualize the signature to each user. Assuming the private keys are secured, our algorithm provides perfect security against spatial and temporal tampering attacks. We analyze below how secure our algorithm is.

• Sensitivity against spatial tampering. An important and inherent feature of our algorithm is that key frame selection is tightly dependent on the content of the non-key frames: if an attacker modifies the contents of a few frames, the key frame positions in the video change. Since key frame selection is based on how much a frame differs from its previous key frame, a change in the content of a frame can lead to its emergence as a new key frame or to the elimination of an existing key frame. Any change in key frame position affects the interpolating polynomial; hence, a different secret frame is generated. This change is propagated from one level to the next in our hierarchical framework, eventually changing the secrets at the other levels as well.

One may argue that if key frame selection depends only upon the differences between frames, an intelligent attacker could modify all the frames of a video in such a way that their respective differences remain the same and the key frame positions are preserved. We counter this as follows. Modifying all the frames necessarily includes modifying the key frames, and since key frame content is also used in the shot-secret computation, any change in the content of the key frames is reflected in the shot secret even if their positions remain the same.

Moreover, since we compute the differential energy at the block level, any significant block-wise difference (i.e., ≥ Tb) between two frames is multiplied by a weight factor (W > 1) to highlight the change. Computing the differential energy at the block level ensures that even small changes are highlighted. This is very useful in cases such as tampering with the face of a person: since a face may occupy several blocks in a frame, it is infeasible to replace it with another face that occupies the same number of blocks with similar content. We provide experimental evidence for this in the next section.

• Sensitivity against temporal tampering. Another main feature of our algorithm is that the secrets at each level are computed from the shares in a specific order. If any frame in the sequence changes (in content or position), the change is reflected in the secret frame. For instance, since all non-key frames serve as shares in computing the Sk-frame, any alteration such as dropping or reordering non-key frames is reflected in the Sk-frame, which in turn changes the shot secret frame and the master secret frame. Also, if a frame is replaced by an extraneous frame, or extra frames are added, the first new frame emerges as an extra key frame because the differential energy (between the first new frame and the old frame just before it) rises above the threshold DT. With additional key frames in the shot secret computation, we get an entirely different shot secret and hence a different master secret. Our algorithm is therefore very secure against all of these temporal attacks.

4.3.2. Robustness analysis. Robustness against frame dropping is achieved by exploiting the temporal redundancy in the video data: if two consecutive frames differ by a small amount, we can afford to drop the latter. The configurable parameter Tp used in the normalization process increases the degree of temporal redundancy and hence the degree of robustness. It is important to mention that we allow only non-key frames to be dropped, and only to the extent supported by the temporal redundancy: the dropping should neither change the signature significantly nor introduce visual artifacts into the video. Our method is also robust against frame reordering as long as it is performed within a group of non-key frames; robustness is maintained as long as the key frame sequence is preserved. Table 2 describes how the input parameters of our algorithm can be configured to tailor the security and robustness requirements.

4.3.3. Time complexity analysis. For a video of resolution u×v and length z frames, the worst-case time complexity of our method is O(u×v×z^2), assuming zero temporal redundancy in the video. In practice, video has a lot of temporal redundancy, so the number of frames used for interpolation is small compared to z. Note that we configure the parameter Tp to exploit this temporal redundancy, which significantly decreases the number of frames to be interpolated.

Table 2. Effect of inputs

Input   | Effect on the security and robustness
DT, Tb↑ | Security↓, Robustness↑, and vice versa
W↑      | Block-level security↑, Robustness↓, and vice versa
Tp↑     | Number of normalization levels↓ ⇒ Security↓, Robustness↑, and vice versa
Ts↑     | Security↓, Robustness↑, and vice versa

5. Results

Several video sequences were tested to substantiate the theory. We tested our scheme for the following:
• Temporal tampering (frame jittering): frame dropping, frame reordering, and frame replacement.
• Spatial tampering: spatial cropping with different cropping window sizes in a few frames or in all frames.
We present the results of tests on two MPEG videos, finger.mpg and shaking.mpg, both of resolution 352×272. We decoded the videos and extracted the luminance values of the frames. We configured the inputs as follows: Tp = 10, Tb = 16×Tp (for shaking.mpg) and 8×Tp (for finger.mpg), DT = Tb×b, W = 2, and Ts = 5.

5.1. Results of temporal tampering

We performed tests by reordering and replacing 0 to 10 frames (figure 6(a)). We also tested the degree of robustness by dropping 0 to 30 frames (figure 6(b)) of the finger.mpg video. Due to the randomness in the selection of frames, we report the average over ten instances of each test. We obtained the results as expected. In figures 6(a) and 6(b), the graphs show the change in sim value with respect to the number of frames replaced, reordered, and dropped; the graphs are plotted for a shot of 61 frames. We notice that the sim value drops very sharply as the number of frames replaced or reordered increases (figure 6(a)). With the replacement of even a single frame, the sim value drops below 20%, as expected. In the case of frame reordering, the sim value drops below 60%; this is because the reordering is random, and frames far apart on the time axis may be swapped. In the case of frame dropping, the sim value decreases slowly (figure 6(b)). Note that for up to 5 frame drops, the sim value remains above 80%. This demonstrates the robustness of dropping up to five frames in a shot of 61 frames while keeping the sim threshold at 80%.

We also tested robustness against frame reordering by limiting the reordering to within a group of non-key frames (figure 6(c)). We chose one pair of frames for reordering and observed a sim value above 80%. However, if the reordering is performed between non-key frames of different key-frame groups, the sim value decreases drastically, to below 50%. We notice that in the reordering of the (1, 40) frame pair, the key frame positions change: the 40th frame, reordered to the 1st frame's position, emerges as a new key frame because its differential energy with respect to its previous key frame (the 0th) is substantial. Similarly, a new key frame (the 2nd) emerges because it becomes substantially different from its previous key frame (the 1st). The same is observed for the (10, 50) frame-pair reordering. This test was performed for one frame-pair reordering; robustness against the reordering of more frame pairs can also be maintained as long as the reordering is done among neighboring frames (within a range) and does not affect the key frame positions.

[Figure 6. Temporal tampering: sim value (%) versus (a) the number of frames replaced or reordered (0-10), (b) the number of frames dropped (0-30), and (c) the frame pairs reordered in the special cases (0,1), (1,2), (3,4), (1,10), (1,40), and (10,50). For (c): the original video has K-frames 0, 24, 31, 61; reordering the pairs 0-1, 1-2, 3-4, and 1-10 leaves the K-frame sequence unchanged (K-frames 0, 24, 31, 61), whereas reordering the pair 1-40 changes it (K-frames 0, 1, 2, 24, 31, 40, 41, 61), as does the pair 10-50 (K-frames 0, 10, 11, 24, 31, 50, 51, 61).]

5.2. Results of spatial tampering

Spatial cropping was tested exhaustively on the shaking.mpg video for different sizes of cropping windows in a variable number of frames. We performed cropping tests with many possible combinations; some of the varying parameters of this test are given in table 3. We changed the pixel values by +10, +20, +50, and +100, from block position (0, 0) onwards, in all frames, in five frames (Nos. 7, 40, 60, 76, 80), and in a single frame (the 7th). The general trends of the results (figure 7) are as follows:
• Change in sim value ∝ (1/Change in cropping window size).
• Change in sim value ∝ (1/Change in crop value).

Table 3. Parameter set for spatial tampering test

Parameter                       | Values
Window size                     | 1×1, 2×2, 4×4, 8×8, 16×16, 32×32, 64×64, 128×128, and 256×256
Change in pixel value           | ±100, ±50, ±20, ±10, random value
Position of the cropping window | Beginning of any 8×8 block, or any random position
No. of frames cropped           | All, five, or single frame
Cropping position               | Same or different for all the frames

[Figure 7. Spatial tampering: sim value (%) versus cropping window size (1 to 256), for crop values of +10, +20, +50, and +100 at cropping location (0, 0), with separate curves for cropping a single frame, five frames, and all frames.]

As shown in figure 7, for a +10 change value, we notice that a change within the allowable pixel difference (Tp) applied to all the frames does not change the sim value much; due to this, we get sim = 95% even for the largest cropping window at a cropping value of +10. Also, for small regions, i.e., below 16×16, there is not much change in the sim value. For +20, the trend remains the same, but the sim value changes slightly more. For +50 and +100 change values, in all cases we notice a drastic drop in the sim value beyond a particular crop size. The reason is that a larger change in pixel value disturbs the differential energy between the frames and hence the key frame sequence. We illustrate this with an example. Suppose two pixel values along the temporal axis are 200 and 250. Since the differential energy is computed by squaring the difference of pixel values along the temporal axis, before cropping this difference is 250 - 200 = 50, and the value 50^2 = 2500 contributes to the differential energy. After a cropping of +50, these pixel values rise to 200 + 50 = 250 and 250 + 50 = 300 ≈ 255 (since a luminance value is clipped to the range 0-255). Now the difference between 255 and 250 is 5, and only 5^2 = 25 contributes to the differential energy. This shows that if such a change is applied to many pixels, it may lead to a major change in the differential energy between frames. We also performed a specific test by setting the pixel value to 255 for all pixels in a 128×128 window at (50, 54) (i.e., a non-block location). As expected, the sim value drops sharply from 100% to 17% for cropping in all frames, to 16% for cropping in five frames, and to 14% for cropping in a single frame. These results prove that our scheme is very sensitive to spatial cropping.

[Figure 8. Face tampering: (a) the 30th and 31st frames of the original video; (b) the same frames of the tampered video with the face replaced; (c) the 30th frame of the original and the tampered video side by side.]

5.3. Test for face tampering

We performed a specific test for face tampering by replacing only the face of a person with that of another person in several frames of a video. We modified a video from the 30th to the 59th frame. Figures 8(a) and 8(b) show two frames of the original and the modified video, respectively. During verification we obtained a sim value of 70%; in other words, 30% tampering is revealed between the original and the modified video. Our algorithm detects the change from the 29th to the 30th frame and identifies the 30th frame as a new key frame, and likewise at the end of the modified segment. The introduction of these two key frames into the secret computation results in a new secret.

6. Conclusion and future work

In this paper, we have proposed a novel technique for video authentication. Our framework authenticates a video based on an atypical use of cryptographic secret sharing. The experimental results show that the proposed method is very secure against malicious spatial cropping and frame jittering. It is also robust, up to a certain extent, against frame dropping and frame reordering, which is highly desirable in the video streaming scenario. Future work will further extend the algorithm to incorporate the type of alteration, the position of the alteration, and a method of recovery into the secret sharing framework with the aid of forward error correction techniques. Other issues, such as robustness against common video processing operations, will also be addressed in future work.

7. References

[1] Lang S., Thiemert S., Petitcolas F.A.P., "Authentication of MPEG-4 data: Risks and solutions", Security and Watermarking of Multimedia Contents V, Santa Clara, 2003.
[2] Celik M.U., Sharma G., Tekalp A.M., Saber E.S., "Video authentication with self-recovery", SPIE Electronic Imaging, San Jose, 2002.
[3] Du R., Fridrich J., "Lossless authentication of MPEG-2 video", IEEE ICIP, Rochester, NY, USA, 2002.
[4] Cross D., Mobasseri B., "Watermarking for self-authentication of compressed video", IEEE ICIP, Rochester, NY, USA, 2002.
[5] Peng H., "A semi-fragile watermarking system for MPEG video authentication", ICASSP, Orlando, 2002.
[6] Tzeng C.H., Tsai W.H., "A new technique for authentication of image/video for multimedia applications", ACM Multimedia, Ottawa, 2001.
[7] Mobasseri B., Sieffert M.J., Simard R.J., "Content authentication and tamper detection in digital video", IEEE ICIP, Vancouver, 2000.
[8] Lin C.Y., Chang S.F., "Issues and solutions for authenticating MPEG video", SPIE Electronic Imaging, San Jose, 1999.
[9] He D., Sun Q., Tian Q., "A semi-fragile object based video authentication system", IEEE ISCAS, Bangkok, 2003.
[10] Quisquater J.-J., Joye M., "Authentication of sequences with the SL2 hash function: Application to video sequences", Journal of Computer Security, 1997.
[11] Yan W., Kankanhalli M.S., "Motion trajectory based video authentication", IEEE ISCAS, Bangkok, 2003.
[12] Shamir A., "How to share a secret", Communications of the ACM, Vol. 22, No. 11, 1979, pp. 612-613.
[13] Stinson D.R., Cryptography: Theory and Practice, First Edition, CRC Press, Boca Raton, 1995, Chapter 11, p. 333.
[14] Eskicioglu A.M., "A prepositioned secret sharing scheme for message authentication in broadcast networks", Fifth Joint Working Conference on Communications and Multimedia Security (CMS'01), Darmstadt, May 21-22, 2001, pp. 363-373.
[15] Ma W.-Y., Zhang H.-J., "An indexing and browsing system for home video", 10th European Signal Processing Conference (invited paper), Finland, 2000.
[16] Radhakrishnan R., Xiong Z., Memon N., "On the security of visual hash function", SPIE 15th Annual Symposium on Electronic Imaging: Science and Technology, Santa Clara, January 20-24, 2003.