Seamless Wireless Networking for Video Surveillance Applications

D. Agrafiotis*, T-K. Chiew, P. Ferre, D.R. Bull, A.R. Nix, A. Doufexi, J. Chung-How(a), D. Nicholson(b)

CCR, Dept. of Electrical & Electronic Engineering, University of Bristol, Bristol, BS8 1UB, UK
(a) ProVision Communication Technologies Ltd, UK
(b) Thales Communications, France

ABSTRACT

The EU FP6 WCAM (Wireless Cameras and Audio-Visual Seamless Networking) project aims to study, develop and validate a wireless, seamless and secured end-to-end networked audio-visual system for video surveillance and multimedia distribution applications. This paper describes the video transmission aspects of the project, with contributions in the area of H.264 video delivery over wireless LANs. The planned demonstrations under WCAM include the transmission of H.264 coded material over 802.11b/g networks with TCP/IP and UDP/IP employed as the transport and network layers over unicast and multicast links. UDP based unicast and multicast transmissions pose the problem of packet erasures, while TCP based transmission is associated with long delays and the need for a large jitter buffer. This paper presents measurement data collected at the WCAM trial site along with analysis of the data, including characterisation of the channel conditions as well as recommendations on the optimal operating parameters for each of the above transmission scenarios (e.g. jitter buffer sizes, packet error rates, etc.). Recommendations for error resilient coding algorithms and packetisation strategies are made in order to moderate the effect of the observed packet erasures on the quality of the transmitted video. Advanced error concealment methods for masking the effects of packet erasures at the receiver/decoder are also described.

Keywords: Video Transmission, Error Resilience, Error Concealment, WLAN, H.264.

1. INTRODUCTION

Recently there has been growing interest in video surveillance applications for monitoring public and commercial premises with the aim of preventing crime and ensuring public safety. With digital cameras becoming cheaper, video coding schemes becoming more efficient [1] and (wired or wireless) networks becoming more widely available, video surveillance technology is migrating from analogue to digital [2]. Use of IEEE 802.11 Wireless Local Area Networks (WLANs) [3] as an extension to the existing wired infrastructure is growing rapidly, offering mobility and easy deployment of equipment. The cost of WLANs is decreasing, and portable equipment such as laptops and PDAs already includes a WLAN modem as a standard peripheral. The use of WLAN technologies [4][5][6] for data communication is well established and currently most WLANs are used for data transfer. Video communication/streaming, however, although possible due to the high bandwidth provided by WLANs, has yet to reach the same level of maturity as its data counterpart.

The EU FP6 WCAM (Wireless Cameras and Audio-Visual Seamless Networking) project addresses wireless video transmission trials for two types of applications: remote surveillance and entertainment. The project's aim is to study, develop and validate a wireless, seamless and secured end-to-end networked audio-visual system for video surveillance and multimedia distribution applications. The envisaged scenario is one where the network infrastructure of a specific facility (e.g. an exhibition centre) is used both for the delivery of entertainment type video and for remote video surveillance. The planned demonstrations include the transmission of H.264 coded video material over 802.11b/g networks with TCP/IP and UDP/IP employed as the transport and network layers over unicast and multicast links. TCP is preferred for storage of video surveillance material, where a large delay is acceptable and error free footage is needed for legal reasons. UDP usage is however planned for low latency point-to-point video links, such as video clip viewing pedestals, and for remote surveillance monitoring using tablet PCs and/or PDAs. The wireless video transmission aspects of the surveillance part of the trials are illustrated in Figure 1.

* [email protected]


UDP based unicast and multicast transmissions pose the problem of packet erasures. The protocol facilitates the transfer of time-sensitive video data; its lossy nature, however, creates the need for a certain amount of error resilience at the encoder and for some form of error concealment in the video decoder (receiver). Without these two measures, intra- and inter-frame propagation of errors can introduce excessive artifacts into the reconstructed video frames and, in adverse channel conditions, render them visually unacceptable.

This paper presents measurement data collected at the WCAM trial site with the aim of characterising the expected performance of the system from a video transmission point of view. Analysis of the data has been conducted and is included in this work in the form of a characterisation of the channel conditions, together with recommendations on the optimal operating parameters for each of the above transmission scenarios (e.g. jitter buffer sizes, packet error rates, preferred packet sizes etc.). Recommendations for error resilient coding algorithms and packetisation strategies are also made in order to moderate the effect of the observed packet erasures on the quality of the transmitted video. The error resilience tools investigated include flexible macroblock ordering (FMO), reference frame selection and the use of multiple slices. Packet size affects error resilience and throughput performance in opposite ways, and care has been taken to select packet sizes that meet the minimum throughput requirements without sacrificing error resilience performance under the measured conditions. The performance of the system is rated based on the final video quality at the receiving end, using peak signal to noise ratio (PSNR) as the quality measure. Error concealment for masking packet erasures at the receiver/decoder affects visual quality significantly, and advanced methods are proposed and described in this manuscript. The performance of the system using such advanced concealment methods is examined. The results presented reflect the expected performance during the planned trials.

The paper is organised as follows: the key components of the planned wireless video transmission trials - 802.11b/g WLANs and the H.264 video coding standard - are discussed in section 2. Section 3 describes the measurements conducted, including the measurement system and the data collected; buffering problems that need to be tackled with regard to UDP and TCP wireless video transmission are also discussed and recommendations are given based on the measurement data. Error resilience and packetisation strategies are presented in section 4. Advanced error concealment methods are proposed in section 5. Finally, in section 6, the performance of the system under the measured conditions and using the proposed concealment methods is examined. Conclusions are given at the end of the paper.

Figure 1 : Wireless video transmission for remote surveillance - description of planned demonstration system.


2. WIRELESS VIDEO TRANSMISSION – H.264 AND 802.11b/g WLANs

2.1. H.264 / Advanced Video Coding

The H.264/Advanced Video Coding (AVC) standard [7][8] is the latest video coding standard developed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). H.264 aims to provide enhanced coding efficiency as well as a 'network-friendly' representation of the encoded video which ensures its suitability for transmission over existing and future networks. The standard is divided into two layers. The first layer defines the syntax specification for the description of the video content together with the encoding/decoding tools and is known as the video coding layer (VCL). The second layer, known as the Network Abstraction Layer (NAL), provides a means to transport the video over numerous and heterogeneous networks by allowing seamless and easy integration of the coded video stream into current and future system architectures.

2.1.1. The H.264 Video Coding Layer

The coding scheme defined by H.264 is very similar to that employed in prior video coding standards. It is a hybrid codec that makes use of translational block-based motion compensation followed by block transformation of the displaced frame difference, scalar quantisation of the transform coefficients with an adjustable step size for bit rate control, and scanning and run-length variable length coding (VLC) of the quantised transform coefficients [1]. However, H.264 modifies and enhances almost all of the above operational blocks, thus achieving significant performance improvements, albeit at the cost of increased complexity [9].

The video coding layer also offers error resilience support in the form of optional coding modes. Apart from the error resilience options also found in past video coding standards [10][11], such as intra refresh coding, use of slices, data partitioning and reference picture selection, H.264 further offers flexible macroblock ordering (FMO), redundant slices and the use of parameter sets [12]. These error resilience tools aim at either preventing error propagation or enabling the decoder to perform error concealment more successfully. Particular combinations of coding and error resilience tools are specified in the three profiles defined by the standard. For complexity and delay reasons the baseline profile was chosen for implementation in the surveillance system under study. Some of the error resilience tools available in this profile and considered in this work are described next.

Intra refresh. Intra placement, or intra refresh, serves the purpose of preventing or reducing drifting errors caused by error propagation due to the predictive (temporal and spatial) nature of the codec. Intra placement can be applied at the picture level, slice level or MB level. At the picture level, a clearing of the multi-picture buffer is needed to stop any drifting effects, and this takes place with IDR (Instantaneous Decoder Refresh) pictures [7]. At the macroblock level, intra coding, unless otherwise specified in the sequence parameter set (SPS), may entail spatial prediction from neighbouring inter coded MBs, which can result in propagation of errors even to intra coded MBs. Hence, for this tool to be effective, the constrained intra prediction flag in the SPS has to be raised.

Slices. A slice is a collection of macroblocks in raster scan order which can range from one MB to all MBs in one picture.
Slices interrupt the in-picture coding mechanisms, thus limiting any spatial error propagation to the affected slice only. Additionally, the header included in each slice serves as a spatial resynchronisation marker, so slices can be decoded independently without requiring any other information. Slices are a prerequisite for error concealment methods since they can prevent the loss of entire pictures at the decoder. They are also useful for adapting the payload of packets and for interleaving purposes. At the same time, however, the in-picture prediction restrictions (intra and MV coding) and the increased overhead associated with small slices can harm the efficiency of the codec considerably [13].

Reference frame management. The multi-picture reference buffer supported by H.264 allows the encoder to select the reference picture used in inter prediction. This can be exploited for error resilience purposes. In a feedback-based system, reference picture selection on a slice or picture basis can stop error drifting. When no feedback is available, periodic referencing (every nth frame) of a specific past frame (the nth previous frame) can be employed, with the periodic frames being coded more robustly than other frames (e.g. using FEC codes and data partitioning) [14].

FMO. Flexible macroblock ordering permits the assignment of MBs to slice groups in orders other than the normal raster scan order, based on a macroblock allocation map. The available map types include, among others, dispersed and interleaved macroblock allocation, which can lead to very good results when combined with concealment, by preserving many neighbouring MBs in the case of errors [12].
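To make the dispersed allocation concrete, the following sketch (ours, not part of the WCAM software) evaluates the dispersed slice group map formula of the standard for a given picture size and number of slice groups; with two groups it yields the checkerboard pattern that keeps all four neighbours of a lost MB available for concealment.

```python
def dispersed_fmo_map(pic_width_in_mbs, pic_height_in_mbs, num_slice_groups):
    """Dispersed slice group map (H.264 slice_group_map_type 1).

    Returns, for each MB position, the slice group it belongs to,
    following the dispersed map formula as given in the standard.
    """
    n = num_slice_groups
    return [[(x + (y * n) // 2) % n for x in range(pic_width_in_mbs)]
            for y in range(pic_height_in_mbs)]

# Two slice groups on a CIF picture (22x18 MBs) give a checkerboard,
# so a lost slice leaves every damaged MB with four intact neighbours.
for row in dispersed_fmo_map(22, 18, 2)[:4]:
    print(''.join(str(g) for g in row))
```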


2.1.2. The H.264 Network Abstraction Layer

The Network Abstraction Layer (NAL) offers support for a packet-based approach with an RTP-based extension, as well as for byte-stream oriented protocols with the optional Annex B extension of the standard. It operates on NAL units (NALUs), each consisting of a 1-byte header followed by a byte string of variable length that contains syntax elements, i.e. in most cases the macroblocks (MBs) of one slice or one of its partitions. Depending on encoding constraints, a NAL unit can be one whole frame, it can contain a fixed (or limited by a maximum) number of MBs, and/or it can have a fixed (or limited by a maximum) number of bytes. The design of the Network Abstraction Layer has been performed with IP-based transmission in mind, and therefore includes RTP and UDP layers. The RTP payload format for H.264 [15] was designed so that the H.264 encoded output can be transmitted over packet-oriented networks. The payload of an RTP packet has a variable length and consists of the NAL header and its payload. No error correction or detection scheme is implemented.

2.2. IEEE 802.11 WLANs

WLAN technology is designed to provide wireless connectivity, even in less favourable non-line-of-sight conditions. Current popular standards include IEEE 802.11b and IEEE 802.11a/g [16][17]. The 802.11b standard offers data rates up to a maximum of 11 Mbps where environmental conditions allow. This standard, now known as WiFi, was the first to offer the possibility of digital video transmission; however, it was developed with data transfer in mind and no provision was made for the allocation of fixed bandwidths. This makes the transmission of video extremely challenging due to large latencies and extreme timing jitter. The 802.11b standard operates in the 2.4 GHz band and uses Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) to enable fair multiple access to the radio channel. 802.11a operates in the 5 GHz band and provides data rates up to 54 Mbps at the physical layer. 802.11g uses the same physical layer as 802.11a (Coded Orthogonal Frequency Division Multiplexing - COFDM) but operates at 2.4 GHz, and is backward compatible with 802.11b. In native mode (i.e. sacrificing backward compatibility), 802.11g can offer similar throughputs to 802.11a [18]. However, when backward compatibility is enabled, the throughput is considerably reduced due to overheads in the medium access control (MAC) layer.

2.2.1. The IEEE 802.11 Medium Access Layer

The IEEE 802.11 MAC is based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA). CSMA/CA implements a 'listen before talk' mechanism where stations are only allowed to transmit when the medium is sensed to be idle. It can be implemented in two different modes: the basic mode (two-way handshaking) and the Ready-To-Send/Clear-To-Send (RTS/CTS) mode, which is designed to avoid collisions (four-way handshaking). Figure 2 describes the access scheme with RTS/CTS. The IEEE 802.11 MAC relies on a stop-and-wait ARQ (Automatic Repeat Request) retransmission scheme. Transmitted frames are acknowledged within a Short Inter-Frame Spacing (SIFS). If the transmitter does not receive the ACK within SIFS, the frame is rescheduled and the transmitter contends again for the channel. MAC layer parameters such as SIFS or DIFS are PHY layer dependent. From Figure 2 we can see that, since the DIFS, SIFS, ACK, CTS and RTS durations are constant (PHY layer dependent only), the throughput is packet length dependent: longer packets make more efficient use of channel resources, whereas small packets incur larger MAC access overheads (Figure 3).
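The packet length dependence can be illustrated with a simple back-of-the-envelope model of the RTS/CTS exchange. All timing and rate values below are assumptions chosen for illustration (they are not the parameters behind Figure 3), and the PHY preamble is ignored.

```python
# Illustrative throughput model for the RTS/CTS access scheme.
# All timings are assumed values in the style of 802.11a; the PHY
# preamble and propagation delays are ignored for simplicity.
DIFS, SIFS, SLOT = 34e-6, 16e-6, 9e-6      # inter-frame spacings and slot time (s)
T_RTS, T_CTS, T_ACK = 52e-6, 44e-6, 44e-6  # assumed control frame durations (s)
CW_MIN = 15                                # mean backoff ~ CW_MIN/2 slots
MAC_HDR_BYTES = 28                         # MAC header plus FCS
PHY_RATE = 24e6                            # e.g. mode 5 (IEEE), 24 Mbit/s

def throughput(payload_bytes):
    """Estimated MAC throughput (bit/s) for a given payload size."""
    t_data = 8 * (payload_bytes + MAC_HDR_BYTES) / PHY_RATE
    overhead = DIFS + (CW_MIN / 2) * SLOT + T_RTS + 3 * SIFS + T_CTS + T_ACK
    return 8 * payload_bytes / (overhead + t_data)

for size in (200, 600, 1200, 1472):
    print(f"{size:5d} byte packets -> {throughput(size) / 1e6:5.2f} Mbit/s")
```

With these assumed figures the fixed per-packet overhead dominates for small payloads, reproducing the trend of Figure 3.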

Figure 2 : RTS/CTS Access Scheme

2.2.2. The IEEE 802.11a/g PHY layer

The physical layer technology for 802.11a/g is Coded Orthogonal Frequency Division Multiplexing (COFDM). COFDM is used to combat frequency selective fading and to randomise the burst errors caused by wireless channels. Full details on this PHY layer can be found in [4][19]. IEEE 802.11a/g operates in eight different modes, offering different coding rates (1/2, 9/16, 2/3, 3/4), different modulation schemes (BPSK, QPSK, 16QAM, 64QAM), different bit rates (from 6 to 54 Mbit/s) and different reliabilities, as shown in Table 1 and Figure 4 respectively. The packet error rate (PER) is packet length dependent, i.e. longer packets are more likely to be corrupted than smaller packets for a particular carrier-to-noise ratio (C/N).
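A common first-order way to see this dependence is to assume independent bit errors and rescale a PER measured at one packet length to another. The sketch below is such an approximation; real WLAN channels are bursty, so it should only be read as a trend.

```python
def rescale_per(per_ref, len_ref_bytes, len_new_bytes):
    """Rescale a measured PER to a new packet length, assuming i.i.d. bit errors."""
    p_bit_ok = (1.0 - per_ref) ** (1.0 / (8 * len_ref_bytes))  # implied per-bit success
    return 1.0 - p_bit_ok ** (8 * len_new_bytes)

# e.g. a PER of 1% measured with 188-byte packets (as in Figure 4)
# implies roughly 7.6% for 1472-byte packets at the same C/N
print(rescale_per(0.01, 188, 1472))
```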


Figure 3 : Throughput for the RTS/CTS access scheme.

Mode | Modulation        | Coding rate R | Nominal bit rate [Mbit/s]
 1   | BPSK              | 1/2           | 6
 2   | BPSK              | 3/4           | 9
 3   | QPSK              | 1/2           | 12
 4   | QPSK              | 3/4           | 18
 5   | 16QAM (H/2 only)  | 9/16          | 27
 5   | 16QAM (IEEE only) | 1/2           | 24
 6   | 16QAM             | 3/4           | 36
 7   | 64QAM             | 3/4           | 54
 8   | 64QAM (IEEE only) | 2/3           | 48

Table 1 : IEEE 802.11a/g modes of operation.

[Figure 4 plot: PER on a logarithmic scale (10^0 to 10^-5) versus C/N in dB (0 to 35) for modes 1 to 7]

Figure 4 : PER curves under the ETSI-BRAN Channel A model, with packet length of 188 bytes.

3. MEASUREMENTS

In order to gain insight into the expected WLAN performance levels of the system, live measurement data was collected using 802.11g cards in the locations where the planned trials will take place [20]. This section describes the server/client hardware and software that was used and developed within the WCAM project, the measurements conducted and the data collected. Also included is an analysis of the data and the recommendations that resulted from it.

3.1. Measurement hardware

The measurements were conducted using the hardware configuration shown in Figure 5. The system comprises two laptops connected in an ad-hoc network; one laptop is used as the server and the other as the client. Commercially available IEEE 802.11b/g cards were employed in these measurements (Table 2). An H.264 sequence encoded at a rate of 1-2 Mbit/s and stored on the server hard disk was used as the source for all WLAN transmissions. The software interface enabled a range of parameters to be varied at the PHY/network level. The software transmits UDP or TCP packets to the 802.11 modem via the standard NDIS protocol stack. At the client, the packets are received and passed up through the protocol stack. The client software extracts the packets from the NDIS interface and logs a range of parameters, such as transmission delay and signal level.

3.2. Measurement software

The measurement software is used to configure the cards, set up the wireless links, serve the video data and extract the critical received data at the client. UDP/IP and TCP/IP stacks are used as the intermediate network and transport layers between the application and the 802.11 link. The software has access to low-level network-specific information such as the MAC address, RSSI and link-speed values. Using this software the computers can be configured as a server and a client. The server software allows control of (amongst others) the following parameters: the choice of transport protocol (UDP or TCP); the video sequence to be served; the packet size to use (up to a maximum of 1472 bytes); and the video transmission data-rate (the controlled rate at which the video will be served, not to be confused with the wireless data-rate). The client software logs data in two files, one containing time based information (Table 3) and the other containing packet based information (Table 4). More information on the measurement system can be found in [20].

Figure 5 : Hardware configuration used for measurements.

Manufacturer | Card        | Standard
Buffalo      | WLI-CB-G54A | 802.11b/g
Belkin       | F5D7010     | 802.11b/g

Table 2 : Cards employed in measurements.

Table 3 : Time based logged data (one record per second):
- Time in seconds since the start of the session
- Bit-rate in kbit/s (averaged over 1 sec)
- Average delay in milliseconds (averaged over 1 sec)
- Standard deviation of the delay in milliseconds (averaged over 1 sec)
- Range (max-min) of the delay in milliseconds (observed over 1 sec)
- Packet error rate (averaged over 1 sec)
- RSSI in dBm at the client (sampled over a 1 sec interval)
- Link speed in Mbit/s at the server, e.g. 1, 2, 5.5, 6, 9, 10, 12, 18, 24, 36, 54, 108

Table 4 : Packet based logged data (one record per packet):
- Packet arrival time in msec since the start of the session
- Packet counter (a running number in the packet header)
- Current packet size (usually constant)
- Packet delay in msec (measured as the end-to-end video transport delay)
- Total number of bytes into the sequence for the data currently received (this tells the client exactly how many bytes the server has attempted to send, despite dropped packets)
- RSSI of the client
- Link speed of the server
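Logs like Table 4 lend themselves to simple scripted post-processing. The sketch below assumes a comma-separated file whose columns follow Table 4's field order; the actual on-disk format of the WCAM tools is not specified here, so both the layout and the field names are our assumptions.

```python
import csv
import statistics

# Assumed column order, matching Table 4
FIELDS = ["arrival_ms", "counter", "size", "delay_ms", "bytes_sent",
          "rssi", "link_speed"]

def load_packet_log(path):
    """Read one packet-based log file into a list of field dictionaries."""
    with open(path, newline="") as f:
        return [dict(zip(FIELDS, map(float, row))) for row in csv.reader(f)]

def delay_stats(records):
    """Mean, spread and a crude 99th percentile of the end-to-end delay."""
    delays = sorted(r["delay_ms"] for r in records)
    return {"mean_ms": statistics.mean(delays),
            "stdev_ms": statistics.stdev(delays),
            "p99_ms": delays[int(0.99 * (len(delays) - 1))]}
```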

3.3. Measurement data and analysis

Measurements taken at the WCAM trial site that correspond to actual scenarios planned for the first trial are described in Table 5. Analysis of the packet based logged data (Table 4) allows the end-to-end performance of the wireless video link to be determined. This includes evaluating, on a packet-by-packet basis, the following parameters: packet delay, packet jitter, packet error, packet loss, RSSI, buffer occupancy and link speed. Packet loss can occur for two reasons. First, the radio channel was not strong enough for the packet to be received within the allotted number of MAC level ARQs. Secondly, the server was unable to transfer the video packets to the WLAN because it was still busy sending previous packets (this occurs when the link throughput is less than the video rate and the WLAN becomes overloaded). The first loss mechanism can be identified by a jump in the packet counter; the second by a jump in the offset of the total number of bytes sent. A sketch of these two checks is given after Table 5.

Data7  : Indoor mobile measurements, AP on ground floor; from forum out main door, loop, back - 100 kbps UDP back channel
Data8  : Indoor mobile measurements, AP on ground floor; from forum out back door, loop, back - 100 kbps UDP back channel
Data9  : Indoor mobile measurements, AP on 1st floor; stairs to exhibition hall, out left exit, in, out main entrance, in, out right exit, then in again, back to stairs - 100 kbps reversed link data
Data10 : Indoor static measurements, AP on 1st floor, STA on second floor
Data11 : Indoor static measurements, AP on 1st floor, STA at nearby café
Data12 : Outdoor range test, _r card pointing away, _f card pointing towards tester

Table 5 : Measurements conducted.
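The two loss checks can be scripted directly on the packet log. The sketch below is one plausible reading of the rules above, reusing the field names assumed in the earlier log-parsing sketch: a gap in the packet counter is attributed to radio loss, while any extra gap in the attempted-bytes offset is attributed to packets dropped at the overloaded server.

```python
def classify_losses(records, packet_size):
    """Split losses into radio (MAC ARQ exhausted) and server (overload) drops.

    One plausible decoding of the two signatures described in the text;
    assumes constant-size packets and the field names of load_packet_log().
    """
    radio = server = 0
    for prev, cur in zip(records, records[1:]):
        counter_gap = int(cur["counter"] - prev["counter"]) - 1
        sent_gap = int(cur["bytes_sent"] - prev["bytes_sent"]) // packet_size - 1
        radio += max(counter_gap, 0)                      # lost over the air
        server += max(sent_gap - max(counter_gap, 0), 0)  # never left the server
    return radio, server
```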

Examples of the measurement data taken in the planned trial location are shown in Figures 6 to 9. Figure 6 and Figure 7 show data for an outdoor route measurement with UDP and TCP links respectively. In this case, the mobile client is moving away from, and then returning to, the server. We can see that the TCP link delivers error free data, whereas the UDP connection suffers missing packets. In the UDP case, when a packet is missing, it has been through the 32 ARQ processes (the maximum set in the card) and has thus experienced considerable delay; we therefore see an increase in delay when packet loss occurs. In the TCP case, delays are much higher than for UDP, since packets that fail the MAC ARQ are later corrected using the TCP ARQ process. This means that each time TCP retransmits a packet, the packet goes through another MAC ARQ process. In the TCP case, bad channel conditions, congestion or collisions can result in high levels of delay (often as high as 5-10 seconds for persistently difficult channels). For both cases, the link speed at the server varies as the RSSI decreases and the mobile moves away from the server. These plots show how the link adaptation algorithm works. In both cases, the target bit rate of 2000 kbit/s was maintained.

Figure 6 : Example Mobile Measurements - UDP.

Figure 7 : Example Mobile Measurements - TCP.

Figure 8 and Figure 9 show static measurements in an indoor environment with UDP and TCP links respectively. For both cases, as the measurements are static, the link speed does not vary over a large range.

Figure 8 : Example Static Measurements - UDP.

Figure 9 : Example Static Measurements - TCP.

3.4. Recommendations based on measurements

The measurements provide a characterisation of the channel conditions in the form of recorded packet error patterns, which are subsequently used for simulating video transmission and testing the received video quality. Furthermore, the collected data indicate the need for a pair of video packet buffers, as shown in Figure 10, in order to protect video playback at the client (delay) and to prevent packet dropping (blocking) at the server. These buffers are not present in the logging software, since the intention was to observe these effects and recommend a suitable buffer length. Figure 11 describes the buffer operation in a variable bandwidth channel. The source link pushes encoded video packets into the server WLAN. The destination link pulls received data from the client WLAN for H.264 decoding and display. The source link data arrives at a constant bit-rate (CBR), or at a variable bit-rate (VBR) bounded by an upper level (say 2 Mbit/s). If we consider the CBR case, the destination link must pull data at a constant rate in order to ensure smooth video playback. However, the wireless channel that lies between the server and client operates at a variable rate due to factors such as channel fading, interference and congestion. The use of transmit and receive buffers aims to isolate the variable bit-rate characteristics of the wireless channel from the constant (or bounded) bit-rates of the source and destination links. These buffers are critical to ensure high quality video; however, their presence increases the end-to-end delay. There is clearly a trade-off to be made between maintaining video quality and minimising end-to-end delay.

To ensure correct operation, the server and client buffers should be of identical length (Figure 11). When the video link is formed, video playback at the client should be delayed by a period equal to the buffer length. Ideally, the client buffer remains full and the server buffer remains empty. When the channel is poor, the server writes to the server buffer and the client reads from the client buffer. Successful operation continues until the server buffer is full and the client buffer is empty. At this point, packets are dropped at the server and any packets currently in transit will arrive too late for decoding. Hence the server buffer will remain effective so long as the average throughput of the wireless channel is higher than the source link rate, and fluctuations in the wireless channel throughput are not severe enough to cause the transmit buffer to overflow. When overflow does occur, the incoming data will either be lost (dropped), if the server cannot rate control the source data, or be delayed, if the server can postpone the sending of future data. Most video streaming applications take the second option, i.e. they delay the video stream under poor conditions. This ensures that all packets are eventually received correctly at the client and thus allows video playback, albeit at a slower rate. For video streaming it is common to use the TCP protocol to ensure the correct delivery of packets. When TCP is used, long delays can arise due to the end-to-end transport level use of ARQ. It is common for delays of 10 seconds or more to be absorbed with TCP, and this is the recommended buffer size for the TCP link trials. For real-time video surveillance, however, the use of long buffers is unacceptable. As a result, UDP links must be used with shorter server buffers. Based on the data collected, a server buffer length in the region of 200 ms is recommended.

[Figure 10 block diagram: Video Source -> Transmit Buffer -> (source link) -> Wireless Channel -> (destination link) -> Receive Buffer -> Video Decoder]

Figure 10: System model of the source/destination buffers.

[Figure 11 diagram: constant-rate incoming data on the source link -> wireless channel with changing buffer occupancy -> constant-rate outgoing data on the destination link]

Figure 11 : Buffer operation in a variable channel.
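The buffer trade-off can be explored with a toy simulation of the model in Figures 10 and 11: a CBR source feeds the server buffer, a time-varying channel drains it into the client buffer (pre-filled during the start-up delay), and the client stalls once its buffer runs dry. The rates, tick size and throughput trace below are assumed values for illustration, not trial data.

```python
def simulate_buffers(channel_kbps, source_kbps=2000, buffer_ms=200, dt_ms=10):
    """Toy CBR-source / variable-channel simulation of the paired buffers."""
    dt = dt_ms / 1000.0
    buf_limit = source_kbps * buffer_ms / 1000.0   # kbits held by each buffer
    server, client = 0.0, buf_limit                # client pre-filled at start-up
    dropped_kbits, stall_ticks = 0.0, 0
    for thr in channel_kbps:                       # one throughput sample per tick
        server += source_kbps * dt                 # source pushes CBR data
        sent = min(server, thr * dt)               # channel drains the server
        server -= sent
        client = min(client + sent, buf_limit)
        if client >= source_kbps * dt:             # client pulls CBR data
            client -= source_kbps * dt
        else:
            stall_ticks += 1                       # playback interruption
        if server > buf_limit:                     # server buffer overflows
            dropped_kbits += server - buf_limit
            server = buf_limit
    return dropped_kbits, stall_ticks

# e.g. a 3 s channel outage inside a 10 s trace sampled every 10 ms:
trace = [2500] * 200 + [0] * 300 + [2500] * 500
print(simulate_buffers(trace))   # (kbits dropped at the server, stalled ticks)
```

With a 200 ms buffer the simulated outage causes both drops and stalls, which is consistent with the point above: short buffers suit low latency UDP links only when channel fluctuations are brief.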

4. ERROR RESILIENCE AND PACKETISATION

So far in this work we have investigated the following error resilience options, which are available to the encoder and which we plan to use during the actual surveillance trials:

- Intra refresh in the form of regular IDR frames (one every 12 frames)
- Constrained intra prediction (no prediction of intra MBs from inter coded ones)
- Dispersed flexible macroblock ordering (FMO)
- Slices of a specific size in bytes

Use of dispersed FMO coupled with the use of slices can lead to better results in the presence of errors, assuming that intelligent concealment is employed at the decoder [12]. An example of this is shown in Figure 12, where the same packet errors (PER of 10%) have been applied to the same sequence coded at 1 Mbps with and without dispersed FMO. In both cases slices of 66 MBs were used and IDR frames were inserted every 90 frames. For each of the two decoded bitstreams, one frame with no concealment is depicted, showing the propagation of errors since the last IDR frame (11 frames before), together with the same frame after having been concealed with a fixed concealment method. Any difference in visual quality is due to the use of FMO.

Slices are defined in terms of number of bytes rather than number of macroblocks for two reasons. Firstly, this improves error resilience by equalising the probability of a slice being hit by a transmission error and by ensuring that more slices appear in the active regions of a picture, which can therefore be reconstructed with a higher probability [21]. Secondly, the (maximum) byte size of a slice is coupled with the packetisation process that precedes transmission, which affects not only error resilience but also the throughput of the network (sections 2.2.1 and 2.2.2, Figure 3) [22]. Clearly a compromise is needed that satisfies both the throughput requirements of the application and the error resilience attributes of the encoded stream. In this work one MAC frame (IEEE packet) carries one NAL unit only. Since erroneous packets that have not been correctly received after the maximum number of retries are discarded, there is no real gain in having more than one NAL unit in each packet compared to one NAL unit of larger size. The recommended NAL unit size depends on the bit rate of the encoded video, the number of unicast links over which the video is served (number of users), and the bit rate supported by the mode at which the network operates (Table 1). Assuming the scenario shown in Figure 1 (two mobile users and one static terminal acting as the AP) and a rate of 1 Mbps for the video, a slice size of 600 bytes is a good compromise in terms of throughput and error resilience.
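One way to frame the compromise numerically is to search for the smallest NAL unit size (best resilience) whose estimated link throughput still covers all the unicast video links, as in the sketch below. It reuses the illustrative `throughput()` model from the section 2.2.1 sketch, so the numbers are indicative only; the full trade-off would also weigh the PER penalty of longer packets.

```python
def pick_nal_size(video_kbps, n_links, candidates=(300, 600, 900, 1200, 1472)):
    """Smallest candidate NAL/packet size whose throughput covers all links."""
    required_bps = video_kbps * 1e3 * n_links
    for size in sorted(candidates):           # try small (resilient) sizes first
        if throughput(size) >= required_bps:
            return size
    return max(candidates)                    # fall back to the largest size

print(pick_nal_size(1000, 3))  # e.g. a 1 Mbit/s stream served over three links
```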

Figure 12 : Corrupted and concealed frame without (left) and with use of FMO (dispersed).

5. ADVANCED ERROR CONCEALMENT

Error concealment methods [23][21] estimate lost information by exploiting the correlation that exists between a missing macroblock (MB) and its adjacent macroblocks in the same or previous frame(s). Motion vector (MV) estimation (temporal concealment) is normally used in P and B frames and can lead to adequate concealment via motion compensated temporal replacement of the missing pixels. Spatial concealment is usually reserved for intra coded frames (IDR or I frames in H.264), which lack motion information. Although not normative, the reference software decoder (JM 8.0) implements both spatial and temporal error concealment for missing intra and inter coded macroblocks [24]. The spatial concealment (employed only for damaged MBs in IDR or I frames) is based on the method described in [25], which replaces missing pixels with weighted averages of boundary pixels in surrounding MBs. The weights used are inversely proportional to the distance between source and destination pixels. Temporal concealment (employed solely for damaged MBs in P or B frames) is implemented as specified in [26]. This is the boundary matching algorithm (BMA), which predicts one missing macroblock MV out of a list of candidate MVs coming from 4-neighbouring MBs or 8x8 blocks of these MBs. The average and median of the surrounding blocks are no longer used as candidates. Instead, only the actual MVs of these blocks, together with a zero MV, are tested, and the one that results in the smallest boundary matching error (BME) becomes the recovered MV, which is then used for motion compensated replacement of the missing pixels from a previously decoded frame. Additionally, if the motion activity in the current frame, as recorded by the average MV of all correctly received MBs, is below a certain threshold, the JM decoder uses simple temporal copying.

In the following sub-sections we describe advanced temporal error concealment methods that have been developed within the WCAM project for an implemented H.264 decoder, with the aim of providing improved concealment performance relative to that of the JM decoder. Spatial concealment remains similar to JM; however, a concealment mode decision is taken for every damaged MB of both P and IDR frames based on the method of [32], which chooses the concealment approach to be used based on spatial and temporal activity measures.

5.1 Temporal concealment for damaged P frames

A number of temporal concealment methods have been reported in the relevant literature for use with standard video decoders [24], [26]-[31]. In order to devise an improved temporal concealment approach, a model was constructed that describes the steps many of them follow. This model is shown in Figure 13. First, a list of MV candidates is formed for replacing the MV(s) missing from the damaged MB. Each of the candidates is then tested using a matching error measure that determines which candidate offers the best possible replacement for the damaged MB. Having selected the replacement MV, some methods proceed to what is described here as an enhancement step. This can be overlapped block motion compensation (OBMC), i.e. replacement of the damaged MB with pixels coming from more than one previous MB, and/or motion refinement, whereby the selected replacement MV (or other candidate MVs) is used as a starting point for a motion estimation process that looks for better MV replacements using the same matching measure.

[Figure 13 diagram: list of MV candidates (zero MV, MV of the collocated MB in the previous frame, MVs of 4- or 8-neighbouring MBs/blocks, average and median MVs of surrounding MBs) -> matching error measure -> estimated MV -> enhancements (OBMC, refinement)]

Figure 13: Typical temporal concealment steps.

[Figure 14 diagram: matching measures - boundary matching error (BME), external boundary matching error (EBME), weighted external boundary matching error (WEBME)]

Figure 14 : Matching measures.
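As an illustration of the candidate-selection step in Figure 13, the sketch below scores each candidate MV with a JM-style boundary matching error and conceals the MB with the best one. It is a simplified stand-alone version (single 16x16 MV per MB, luma only, bounds checks omitted), not the WCAM decoder code.

```python
import numpy as np

MB = 16  # macroblock size in luma pixels

def bme(ref, cur, x, y, mv):
    """Boundary matching error of the candidate replacement block at (x, y)."""
    rx, ry = x + mv[0], y + mv[1]            # assumes the patch stays in-frame
    patch = ref[ry:ry + MB, rx:rx + MB].astype(np.int32)
    err = 0
    if y > 0:                                 # top edge vs row above the hole
        err += np.abs(patch[0] - cur[y - 1, x:x + MB]).sum()
    if y + MB < cur.shape[0]:                 # bottom edge vs row below
        err += np.abs(patch[-1] - cur[y + MB, x:x + MB]).sum()
    if x > 0:                                 # left edge vs column to the left
        err += np.abs(patch[:, 0] - cur[y:y + MB, x - 1]).sum()
    if x + MB < cur.shape[1]:                 # right edge vs column to the right
        err += np.abs(patch[:, -1] - cur[y:y + MB, x + MB]).sum()
    return err

def conceal_mb(ref, cur, x, y, candidate_mvs):
    """Replace the damaged MB using the candidate MV with the smallest BME."""
    best = min(candidate_mvs, key=lambda mv: bme(ref, cur, x, y, mv))
    cur[y:y + MB, x:x + MB] = ref[y + best[1]:y + best[1] + MB,
                                  x + best[0]:x + best[0] + MB]
    return best
```

In use, the candidate list would hold the zero MV plus the neighbouring and collocated MVs described above; swapping `bme` for an external boundary match (EBME) only changes which pixels are compared.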

In this work we have considered all of the above steps through a performance analysis that demonstrates how each component affects concealment. For the list of MV candidates (Figure 13), apart from the zero MV and the 4-neighbours, we have included the MV of the collocated MB in the previous frame, the 8-neighbours, and the average and median of all surrounding MBs. As matching measures (Figure 14) we have tested the following: the boundary matching error (BME) used in the JM decoder (the sum of absolute differences - SAD - between pixels on the boundary of the replacement reference MB and the boundary of the MBs adjacent to the damaged one); the external boundary matching error (EBME) suggested in [27][30] (the SAD between the external two-pixel-wide boundary of the MBs adjacent to the damaged MB and the same external boundary of the MBs adjacent to the candidate replacement reference MB); and finally the weighted external boundary matching error (WEBME) [29], similar to EBME but with a boundary four pixels wide and with raised cosine weights used for calculating the distortion. In terms of enhancements, overlapping was implemented according to [29] (i.e. using four prediction signals and the raised cosine matrix as the weighting matrix) and motion refinement was implemented following a three-step search approach to avoid a large increase in complexity.

Three CIF sequences ("Foreman", "Bus", "Stefan") were coded with H.264 (JM 8.0) at 1 Mbit/s using 3 reference frames with I and P pictures only (one IDR frame every 3 seconds, i.e. every 90 frames). Each sequence was encoded using 4 different slice sizes equal to 1, 1.5, 2 and 2.5 rows of MBs (12 clips in total). To simulate transmission errors (packet erasures) - assuming that one packet includes one slice only - random packet errors were introduced for each of the 12 clips at rates of 0.1%, 0.5%, 1%, 2%, 4% and 10%. For each packet error rate and each clip there were 10 different error sequences. Note that errors were not introduced in IDR pictures, in order to avoid the use of spatial concealment.

The testing procedure was the following. For each of the three matching measures (BME, EBME, WEBME) the performance with different candidate MVs was measured in order to evaluate the contribution of each one of them. Then the performance of each matching measure was evaluated. Finally, the enhancements were assessed. The PSNR results presented (apart from motion refinement) are averaged over 10 clips per slice size, over 4 slice sizes and over the 3 test sequences (a total of 120 clips for each PER).

Figure 15 shows the difference in performance when the respective MVs are added to a candidate list which already includes the 4-neighbours of a missing MB. One can see from the graphs that the average and median MVs offer limited improvement in performance, while the previous frame MV (i.e. the MV of the collocated MB in the previous frame) and the 8-neighbours do improve the results. Including all the candidates gives a further small improvement compared to the best single-candidate performance. This is not, however, the case with BME, which suggests that its matching process does not always select the best possible candidate. Figure 16 shows how changing the matching measure affects the results when all candidates are used. It is clear that changing to an external boundary match improves the results significantly on average (close to 1 dB at high PERs). With no significant difference between EBME and WEBME, the former is preferred due to its smaller complexity. Overlapping was found to affect the concealment performance in a positive manner, especially at higher packet error rates. Motion refinement was tested with one sequence only (Foreman) and one slice size, and was found to offer very little gain while increasing complexity significantly. As a result it was decided not to use it as an enhancement for P frame concealment. The performance of the enhanced temporal concealment method resulting from this study and adopted for the WCAM H.264 decoder is shown in Figure 17 with and without FMO.

Average for all s equneces and all s lices - B M A 0.16

0.3

PSNR Relative to basic (dB)

0.12 0.1 0.08 0.06 0.04

8

0.4

Median Average Previous 8-Neighbors All Candidates

M edian Average Previous 8-Neighbors All Candidates

0.35 0.3 PSNR Relative to basic (dB)

0.14

PSNR Relative to basic (dB)

Average for all sequneces and all s lic es - W E B M A

2

0.35 M edian Average Previous 8-Neighbors All Candidates

0.25

0.2

0.15

0.1

0.02

0.25 0.2 0.15 0.1 0.05

0.05

0 -0.02

1

2

3

0.1% 0.5%

1%

4

5

2%

4%

0 -0.05

0

6

1

10%

2

3

4

5

0.1%PER0.5% 1%- 1% -2% ( 0.1% - 0.5% 2% - 4% - 4% 10%)

PER

1

6

2

0.1% 0.5%

10%

3

1%

4

2%

5

4%

6

10%

PER

PER

Figure 15 : MV candidate results. The difference in PSNR performance is shown for 3 matching measures when the respective MVs are added to a candidate list which already includes the 4-neighbours of a missing MB and the zero MV (note - scales are different). Comparison of Matching Measures - EBME & WEBME relative to BME 1

[Figure 16 plot: PSNR relative to BMA using all candidates (dB) versus PER (%) for EBME and WEBME]

Figure 16: Matching measure results.

[Figure 17 plot: PSNR (dB) versus PER (%) for the enhanced temporal concealment (EBMA) and the JM temporal concealment (BMA), each with dispersed FMO, interleaved FMO and no FMO]

Figure 17 : Temporal concealment results.

5.2 Temporal concealment for damaged IDR frames

Although IDR frames lack any motion information, and as a result errors in them tend to be concealed using spatial methods, it is clear that concealment performance could benefit significantly if temporal correlation (especially high in sequences with uniform motion) were exploited. Following exactly the same steps as in temporal concealment for P frames (Figure 13), the major problem in the case of IDR frames is how to form the motion vector candidate list. One obvious choice is the zero MV (temporal copying) [32]. We additionally employ the MV of the collocated MB in the previous frame (when not coded in intra mode) and its 8-neighbours, in a similar manner to the P-frame concealment case (only here the previous frame is employed). Previously concealed neighbouring MBs of the damaged IDR frame are also used. To illustrate the benefit brought by the use of such a temporal concealment method for concealing damaged MBs in an IDR frame, we give the following example (Figure 18) using 'Foreman' encoded at 1 Mbps, with one IDR frame every 30 frames, a slice size of 66 MBs and FMO dispersed mode on. Errors are introduced in both IDR and P frames. P frame errors are concealed using identical temporal concealment, while IDR frame errors are concealed either temporally, as described above, or spatially, as in the JM decoder. One can see the difference in visual quality in the depicted frames (IDR frame 30 is shown) and in PSNR performance in the plot below (IDR frames 30 and 90 were damaged). The average PSNR was 34.61 dB for the spatial concealment case and 36.51 dB for the temporal one.

In IDR frames, the shortage of good motion vector candidates (due to the lack of motion data) makes motion estimation/refinement a useful way of improving temporal concealment. Motion estimation here is effectively refinement of the zero MV. Refinement is applied to the selected replacement MV as described in the previous subsection. With IDR frames present less frequently in the bitstream, the increased complexity of motion refinement is justified when one considers the possible gains (Figure 19 - frame 90 is shown).
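A sketch of that refinement step follows, using the three-step search mentioned in section 5.1 around a starting MV (for IDR frames, effectively the zero MV) and re-scoring with the same matching function `bme` from the earlier concealment sketch; EBME would slot in identically.

```python
def refine_mv(ref, cur, x, y, start_mv, step=4):
    """Three-step search around start_mv; bounds checks omitted for brevity."""
    best_mv = tuple(start_mv)
    best_err = bme(ref, cur, x, y, best_mv)
    while step >= 1:
        centre = best_mv
        for dx in (-step, 0, step):           # examine the 8 surrounding points
            for dy in (-step, 0, step):
                mv = (centre[0] + dx, centre[1] + dy)
                err = bme(ref, cur, x, y, mv)
                if err < best_err:
                    best_mv, best_err = mv, err
        step //= 2                            # 4 -> 2 -> 1 pixel steps
    return best_mv
```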

[Figure 18 images: IDR frame 30 with no concealment, spatial concealment and temporal concealment; plot of PSNR (dB) versus frame number comparing temporal and spatial concealment for IDR frames with identical P-frame concealment]

Figure 18: Temporal concealment for IDR frames.

[Figure 19 images: IDR frame 90 with no concealment, without MV refinement and with MV refinement; plot of PSNR (dB) versus frame number for IDR temporal concealment with and without MV refinement]

Figure 19: Motion refinement for IDR frames.

6. SIMULATED TRANSMISSION RESULTS

The performance of the system under the measured conditions and using the proposed concealment methods is examined herein. Results for two scenarios are reported, one corresponding to static reception (Data11 in Table 5) and one to mobile reception (Data7 in Table 5). Two surveillance type sequences of CIF resolution were used: "Hall", a standard sequence captured with a static camera monitoring corridor activity, and "Queens", a sequence captured at the University of Bristol monitoring street activity (traffic and pedestrians). The sequences were encoded at 1 Mbit/s using the error resilience options mentioned in section 4, with two slice sizes of 600 and 1200 bytes. Both input sequences were repeated 10 times before coding. Errors (packet erasures) were introduced based on the error patterns collected during the two measurement sessions mentioned above. Results for the static measurements are shown below for the "Queens" sequence using packets (and NAL units) of 600 bytes. Results are presented as frame by frame PSNR plots for the error-free case (EF), the H.264 reference decoder case (AEC) and the enhanced error concealment case (EEC). Selected frames affected by packet erasures are shown in Figure 20 and Figure 21. It can be seen that the enhanced concealment method (EEC) outperforms AEC. The average PSNR of the error free sequence was 31.14 dB, with the minimum frame PSNR being 27.81 dB. The corresponding AEC results were 30.93 dB and 24.23 dB respectively, while for the EEC decoder the results were 31.02 dB and 26.17 dB. The average PER was ~0.2%.
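All quality figures below are PSNR values; for reference, a minimal sketch of the computation on 8-bit frames (our helper, not project code):

```python
import numpy as np

def psnr(ref, dec):
    """PSNR in dB between two 8-bit frames of identical size."""
    mse = np.mean((ref.astype(np.float64) - dec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```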


[Figure 20 plot: frame by frame PSNR (dB) for frames 250-350 of "Queens" (static, Data11, UDP, 600 bytes), with AEC, EEC and EF curves]

Figure 20: Static PSNR results for part of the “Queens” sequence. Frame 288 (IDR) (error free) is shown below.

Figure 21: Frame 288 (IDR): Corrupted (top), AEC concealed (middle), EEC concealed (bottom)

Results with the mobile measurement data are shown in Figure 22, Figure 23 and Figure 24 for the sequence "Hall", using packets (and NAL units) of 1200 bytes. Again, results are presented as frame by frame PSNR plots for the error-free case (EF), the H.264 reference decoder case (AEC) and the enhanced error concealment case (EEC). The average PSNR values for the frames included in the graph of Figure 23 were 40.15 dB for the error free case, 31.51 dB for the non-concealed case (not shown), 38.48 dB for the JM decoder case (AEC) and 39.18 dB for EEC. Note that frames 2384 and 2455 were lost entirely due to transmission errors and were concealed by copying the previous frame. The PSNR values for the reconstructed frame shown in Figure 22 were 41.33 dB, 34.9 dB and 37.33 dB for EF, AEC and EEC respectively. The PSNR values for the frame of Figure 24, in the same order, were 42.57 dB, 31.9 dB and 38.36 dB.

Figure 22: Frame 2365 (P) of sequence "Hall". Left: error free; middle left: corrupted; middle right: AEC (detail); right: EEC.


[Figure 23 plot: frame by frame PSNR (dB) for frames 2150-2500 of "Hall" (mobile, Data7, UDP, 1200 bytes), with AEC, EEC and EF curves]

Figure 23: Mobile PSNR results for part of the “Hall” sequence.

Figure 24: Detail from frame 2280 (IDR): error free (1st), corrupted (2nd), AEC concealed (3rd), EEC concealed (4th).

7. CONCLUSION

The video delivery aspects of the wireless video surveillance system considered in the (EU FP6) WCAM project have been discussed and analysed in this paper. Measurement data captured at the location of the planned trial have been presented and used to evaluate the expected performance of the system. Advanced concealment methods have been described for masking packet erasures that can take place during transmission. Use of such methods can lead to better video quality at the receiving end, as demonstrated in this paper. Current work in the WCAM project also considers further error resilience methods and region of interest coding for surveillance purposes.

ACKNOWLEDGEMENTS

This work was performed as part of the European Union FP6 WCAM project.

REFERENCES

[1] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan, "Performance Comparison of Video Coding Standards Using Lagrangian Coder Control," IEEE International Conference on Image Processing (ICIP), 2002.
[2] J. Peterson, "Understanding Surveillance Techniques", CRC Press, 2001.
[3] IEEE Std 802.11-1999, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications".
[4] Angela Doufexi, Simon Armour, Peter Karlsson, Michael Butler, Andrew Nix, David Bull, "A Comparison of the HIPERLAN/2 and IEEE 802.11a Wireless LAN Standards", IEEE Communications Magazine, May 2002.
[5] R. Van Nee, G. Awater, M. Morikura, H. Takanashi, M. Webster and K. Halford, "New High-Rate Wireless LAN Standards," IEEE Communications Magazine, Dec. 1999.
[6] Pierre Ferré, Angela Doufexi, Andrew Nix, David Bull, "Throughput Analysis of the IEEE 802.11 and IEEE 802.11e MAC", IEEE WCNC 2004, Atlanta.
[7] Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, "ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 - ISO/IEC 14496-10 AVC)", March 2003.
[8] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-577, 2003.
[9] Jörn Ostermann, Jan Bormans, Peter List, Detlev Marpe, Matthias Narroschke, Fernando Pereira, Thomas Stockhammer, and Thomas Wedi, "Video Coding with H.264/AVC: Tools, Performance, and Complexity", IEEE Circuits and Systems Magazine, 2004.
[10] Raj Talluri, "Error Resilient Video Coding in the ISO MPEG-4 Standard", IEEE Communications Magazine, June 1998.
[11] Stephan Wenger, Gerd Knorr, Jörg Ott, and Faouzi Kossentini, "Error Resilience Support in H.263+", IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, November 1998.
[12] S. Wenger, "H.264/AVC over IP," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, 2003.
[13] T. Stockhammer, M. M. Hannuksela, and T. Wiegand, "H.264/AVC in Wireless Environments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 657-673, 2003.
[14] J. T. H. Chung-How and D. R. Bull, "Loss Resilient H.263+ Video over the Internet," Signal Processing: Image Communication, vol. 16, pp. 891-908, 2001.
[15] S. Wenger, M. M. Hannuksela, T. Stockhammer, M. Westerlund, and D. Singer, "Internet Draft - RTP Payload Format for H.264," April 2004.
[16] IEEE Std 802.11a, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: High Speed Physical Layer in the 5 GHz Band", 1999.
[17] IEEE Std 802.11g, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Further Higher-Speed Physical Layer in the 2.4 GHz Band", d1.1, 2001.
[18] Angela Doufexi, Simon Armour, Beng-Sin Lee, Andrew Nix and David Bull, "An Evaluation of the Performance of IEEE 802.11a and 802.11g Wireless Local Area Networks in a Corporate Office Environment", ICC 2003.
[19] Angela Doufexi, Simon Armour, Michael Butler, Andrew Nix, David Bull, "A Study of the Performance of HIPERLAN/2 and IEEE 802.11a Physical Layers", IEEE VTC Spring 2001.
[20] T. K. Chiew, P. Ferre, D. Agrafiotis, A. Molina, A. R. Nix and D. R. Bull, "Cross-Layer WLAN Measurement and Link Analysis for Low Latency Error Resilient Wireless Video Transmission", ICCE, Las Vegas, January 2005.
[21] Y. Wang, S. Wenger, J. Wen, and A. K. Katsaggelos, "Error Resilient Video Coding Techniques: Real-Time Video Communications over Unreliable Networks," IEEE Signal Processing Magazine, vol. 17, pp. 61-82, 2000.
[22] Pierre Ferre, Angela Doufexi, Andrew Nix, David Bull, James Chung-How, "Packetisation Strategies for Enhanced Video Transmission over Wireless LANs", Packet Video 2004, Irvine, CA, December 2004.
[23] Y. Wang and Q.-F. Zhu, "Error Control and Concealment for Video Communication: A Review," Proceedings of the IEEE, vol. 86, pp. 974-997, 1998.
[24] Y.-K. Wang, M. M. Hannuksela, V. Varsa, A. Hourunranta, and M. Gabbouj, "The Error Concealment Feature in the H.26L Test Model," ICIP, Rochester, New York, USA, 2002.
[25] P. Salama, N. B. Shroff, and E. J. Delp, "Error Concealment in Encoded Video Streams," in Signal Recovery Techniques for Image and Video Compression and Transmission, A. K. Katsaggelos and N. P. Galatsanos, Eds., 1998.
[26] W.-M. Lam and A. R. Reibman, "Recovery of Lost or Erroneously Received Motion Vectors," ICASSP '93, USA, 1993.
[27] T. Chen, "Refined Boundary Matching Algorithm for Temporal Error Concealment," Packet Video, Pittsburgh, 2002.
[28] M.-J. Chen, L.-G. Chen, and R.-M. Weng, "Error Concealment of Lost Motion Vectors with Overlapped Motion Compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, pp. 560-563, 1997.
[29] T.-Y. Kuo and S.-C. Tsao, "Error Concealment Based on Overlapping," Visual Communications and Image Processing (VCIP), San Jose, CA, USA, 2002.
[30] S. Valente, C. Dufour, F. Groliere and D. Snook, "An Efficient Error Concealment Implementation for MPEG-4 Video Streams", IEEE Transactions on Consumer Electronics, vol. 47, no. 3, August 2001.
[31] M. C. Hong, H. Schwab, L. Kondi, and A. K. Katsaggelos, "Error Concealment Algorithms for Compressed Video," Signal Processing: Image Communication, vol. 14, pp. 473-492, 1999.
[32] Huifang Sun, Joel W. Zdepski, Wilson Kwok, D. Raychaudhuri, "Error Concealment Algorithms for Robust Decoding of MPEG Compressed Video", Signal Processing: Image Communication, vol. 10, pp. 249-268, 1997.
