Integrating Virtual Reality, Tele-Conferencing, and Entertainment into Multimedia Home Computers

Srinivas Ramanathan, P. Venkat Rangan, and Harrick M. Vin
Multimedia Laboratory, Department of Computer Science and Engineering
University of California, San Diego, La Jolla, CA 92093-0114

Abstract

Technological advances are revolutionizing computers and electronic devices to support digital multimedia, stimulating the development of a wide spectrum of applications, such as tele-conferencing and video entertainment, that can be offered to end-consumers. Such applications, when coupled with virtual reality, will result in a new generation of systems enabling effective tele-personal interactions between individuals via their multimedia home workstations. The integration of virtual reality techniques with multimedia tele-conferencing leads to tele-virtual conferencing systems that synthesize panoramic, life-like, three-dimensional video images and stereophonic audio. We investigate the architectural requirements of such systems and propose a high-level design of an Intelligent Multimedia Interface Unit (IMIU) capable of supporting tele-virtual conferencing in multimedia home computers.

1 Introduction

Technological advances are making electronic devices with video and audio digitizing capabilities pervasive. Coupled with advances in broadcast technologies and the emergence of high-bandwidth telecommunication networks, these advances are revolutionizing computers to support digital multimedia. In the near future, the power of computers unleashed on digital video and audio will trigger a wide spectrum of multimedia applications that can have a long-standing effect on day-to-day activities.

Multimedia computing workstations connected by high-speed networks will begin supporting consumer applications such as on-demand video viewing, entertainment, news distribution, and advertisement. For instance, it is envisaged that the videotape rental stores of today will make way for HDTV-on-demand storage servers that store digitized HDTV video, such as entertainment movies and educational documentaries, on large arrays of high-capacity storage devices, such as optical or magnetic disks. Consumers will be able to interactively retrieve the videos of their choice from such an HDTV-on-demand server to their home multimedia workstations over the high-speed metropolitan area networks (such as B-ISDN) that are expected to replace the low-speed telephone networks of today [6].

To see why this architectural vision of an HDTV-on-demand server is feasible within the next several years (rather than decades), consider the storage and transmission capacities required for an HDTV server. Assuming that HDTV video requires a data rate of about 2 MBytes/s [1], a 100-minute movie requires 12 GBytes, and storing 1000 such videos requires a capacity of 12 terabytes. In comparison, currently available disks hold about 10 GBytes, and capacities are expected to increase to 100 GBytes within a few years. Thus, with an array of 120 disks, an HDTV server can store 1000 popular movies simultaneously. As for transmission capacities, fiber-optic networks offering gigabyte bandwidths are already in place, and those offering terabyte bandwidths are conjectured to be only a few years away.
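This sizing argument, together with the cost amortization in the paragraph that follows, reduces to a few lines of arithmetic. The sketch below simply restates the paper's figures (2 MBytes/s HDTV rate, 100-minute movies, 100 GByte disks at $4,000 apiece, 1000 subscribers); none of the numbers are new:

```python
# Back-of-the-envelope sizing for the HDTV-on-demand server described
# above; all figures are taken from the text, not measured.

HDTV_RATE_MB_PER_S = 2        # assumed HDTV data rate [1]
MOVIE_MINUTES = 100           # representative movie length
NUM_MOVIES = 1000             # movies the server should hold
DISK_CAPACITY_GB = 100        # projected per-disk capacity
DISK_COST_USD = 4000          # assumed cost per disk
NUM_SUBSCRIBERS = 1000        # subscribers sharing the server

movie_gb = HDTV_RATE_MB_PER_S * MOVIE_MINUTES * 60 / 1000    # 12 GB/movie
total_tb = movie_gb * NUM_MOVIES / 1000                      # 12 TB in all
num_disks = round(movie_gb * NUM_MOVIES / DISK_CAPACITY_GB)  # 120 disks
usd_per_subscriber = num_disks * DISK_COST_USD / NUM_SUBSCRIBERS

print(f"{movie_gb:.0f} GB/movie, {total_tb:.0f} TB total, "
      f"{num_disks} disks, ${usd_per_subscriber:.0f}/subscriber")
```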

However, a key deciding factor for the feasibility of such an HDTV server is its economic viability. Assuming a cost of about $4,000 per disk, the total expected installation cost of an HDTV server is $480,000, which, when amortized over 1000 subscribers, is about $480 per subscriber, making the service a viable alternative to owning a VCR.

Possibly the most sophisticated class of multimedia applications that can be provided to end-consumers using the digital video and audio capabilities of computer systems are those that simulate a world of "virtual reality". In such applications, panoramic, three-dimensional, life-like video imagery and spatially distributed audio combine to present users with a virtual world that is fairly close to the real one. In the virtual world, individuals can interact with other individuals almost as effectively as in the real world. Users can carry out surrogate travel [3], in which an individual sitting in his or her living room can view scenes of distant sites and navigate through them almost as effectively as in real-world travel. Using sophisticated information services, users can create personalized newspapers tailored to their reading preferences [7], and carry out surrogate travel to the sites of the news. However, one of the most important potential uses of virtual reality is for carrying out multimedia tele-conferencing between geographically separated users. Virtual reality promises to make tele-conferencing a viable, effective, and possibly more flexible alternative to real-world meetings and conferences. We use the term tele-virtual conferencing to denote systems that can present users with the illusion of a real-life conference.

In this paper, we investigate how life-like imagery and stereophonic audio can be provided within multimedia workstations, so as to make them capable of synthesizing virtual environments that emulate real-world interactions. Supporting such architectural capabilities will enable tele-virtual conferencing to be offered as a service to end-consumers, in a manner similar to telephone and TV service.

2 Tele-Virtual Conferencing

Interactive exchange of media information between users generally takes the form of tele-conferences. Meetings, classrooms, examinations, and corporate negotiations are all examples of tele-conferences, and different types of conferences require different kinds of participation. For instance, a session at a convention usually involves a chairperson, a speaker, and attendees who may pose questions to the speaker if permitted to do so by the chairperson. A classroom is similar, except that the speaker is also the chairperson. An examination, on the other hand, involves a proctor and examinees, with communication restricted to that between the proctor and any of the examinees, but absolutely no communication among the examinees themselves. Negotiations among multiple corporate groups may involve both intra-group and inter-group communication, which may need to be separated for purposes of confidentiality. Recording the proceedings of a meeting, or retrieving the minutes of a previously recorded meeting, requires the establishment of a tele-conference with a storage server. Supporting such diverse tele-conferencing environments requires flexible control of the media connections among users and servers. Furthermore, a tele-conferencing system should permit a user to participate in multiple conferences simultaneously (using the same or different media). A tele-conference may also be an aggregation of other tele-conferences; that is, some of the participants may themselves be other tele-conferences rather than individual users. Such a multi-level tele-conference is a natural paradigm for expressing meetings between two or more sub-groups of users, such as corporate meetings in which managers participate together with their multi-level groups.

Most tele-conferencing systems available today are rudimentary and provide only a simple abstraction of a virtual meeting room in which each participant can view images of the other participants. They usually employ a bridging architecture in which a centralized multimedia bridge mixes the media streams received from the different participants in a conference; each participant then receives and displays a composite of the streams transmitted by all other participants. In the case of audio, mixing multiple streams involves digitally summing the audio samples and then normalizing the result. Mixing in the video domain is performed by reducing the individual video images to a fraction of the frame size and juxtaposing the fractions to form a composite frame. For example, in a tele-conference with four participants, the video frames from each participant may be reduced to the size of a quadrant, and the four quadrants juxtaposed to form a composite image. Our experience with such conferencing systems [5, 8] indicates that the absence of face-to-face interaction makes it difficult for them to be preferred over real-world meetings. To increase the effectiveness of information exchange between participants and to make computer-mediated conferencing a viable alternative to real-life meetings, conferencing systems must provide environments that closely emulate real-world scenarios.

Tele-virtual conferences enhance the effectiveness of interactions among a group of users by providing a virtual environment that closely emulates real-life scenarios. Different virtual environments are conducive to different types of real-life interaction. For instance, in a computer-mediated meeting between individuals, it is desirable for each participant to be presented with a view of a virtual meeting room that emulates a physical meeting in most respects (see Figure 1). In addition, directional and iconic information about the positions of participants in the virtual meeting room can be used to emulate the exchange of audio between the participants. Creating such virtual interaction environments requires that the system support stereoscopic video (which provides life-like imagery) and stereophonic audio [2].
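A minimal sketch of the bridge-style mixing described above, assuming 16-bit PCM audio and four equally sized grayscale frames held in NumPy arrays (the function names are ours, not from any existing system):

```python
import numpy as np

def mix_audio(streams):
    """Bridge-style audio mixing: digitally sum the samples of all
    participants' 16-bit PCM streams, then normalize to avoid clipping."""
    mixed = np.sum([s.astype(np.int32) for s in streams], axis=0)
    peak = np.abs(mixed).max()
    if peak > 32767:                       # normalize only if the sum clips
        mixed = mixed * 32767 // peak
    return mixed.astype(np.int16)

def compose_quadrants(frames):
    """Bridge-style video mixing for four participants: reduce each frame
    to quarter size and juxtapose the quadrants into one composite frame."""
    small = [f[::2, ::2] for f in frames]  # naive 2x2 subsampling
    top = np.hstack([small[0], small[1]])
    bottom = np.hstack([small[2], small[3]])
    return np.vstack([top, bottom])
```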

Figure 1: An example of a tele-virtual conference in progress

3 Multi-Dimensional Video and Audio

To create stereo images on computer systems, it is necessary to simulate what takes place naturally in the real world. Individuals view their surroundings with both eyes, each of which has its own, slightly different perspective, and the brain fuses the two images into one. To simulate this, left and right perspectives are generated and transmitted to a video monitor that displays them sequentially as alternate frames. A stereo liquid-crystal viewing device, synchronized with the monitor, channels the appropriate image to the appropriate eye [2].

The enhanced information exchange provided by stereo vision can be further supplemented by auditory systems that use directional and iconic information to let participants experience virtual sounds. The audio streams emanating from the participants must be combined in a manner that enables a listener to perceive the locations of the other participants in a tele-virtual conference. The time and phase differences that exist between a sound reaching one human ear and the other, and the amplitude of the sound in each ear, are important cues used by human listeners to accurately localize audio sources. It has also been demonstrated [4] that the complex folds in the outer ears of humans (the pinnae) are responsible for localization in elevation and for the ability to distinguish sounds emanating from the front and the back. To capture both the pinnae and the interaural difference cues, Wenzel et al. [9] propose a technique for measuring Head-Related Transfer Functions (HRTFs) in the ear canals of individual subjects. Such transfer functions can be used to compute factors with which audio mixers relatively scale the audio streams from all the participants to form a composite audio stream. Support for such spatial-location-based mixing is lacking in most conferencing systems in existence today.
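As an illustration of spatial-location-based mixing, the sketch below places each participant's mono stream in the stereo field using only the interaural amplitude cue, via constant-power panning by azimuth. This is a deliberately crude stand-in for true HRTF-based rendering [9], which would instead filter each stream with measured per-ear transfer functions; all names here are hypothetical:

```python
import numpy as np

def spatialize(samples, azimuth_deg):
    """Place a mono stream in the stereo field by azimuth (-90..+90 deg)
    using constant-power panning: amplitude cue only, no HRTF filtering."""
    pan = np.clip(azimuth_deg / 90.0, -1.0, 1.0)   # -1 = far left, +1 = far right
    theta = (pan + 1.0) * np.pi / 4.0              # map to 0..pi/2
    gain_l, gain_r = np.cos(theta), np.sin(theta)  # gains satisfy l^2 + r^2 = 1
    return np.stack([samples * gain_l, samples * gain_r], axis=1)

def mix_virtual_room(streams, azimuths):
    """Compose one stereo stream from all participants' streams, scaled
    according to their positions in the virtual meeting room."""
    return sum(spatialize(s, a) for s, a in zip(streams, azimuths))
```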

4 Architectural Support for Tele-Virtual Conferencing

To make desktop tele-virtual conferencing feasible, the multimedia stations of the future will need to be equipped not only with video and audio digitizers, but also with additional hardware: devices that map natural movements to digital information, and interface units that use this information to synthesize virtual environments (see Figure 2). Devices such as head-mounted displays and ear-phones provide participants with virtual images and sounds, respectively. Position sensors constantly track the position and orientation of each participant, and any movement of a participant causes the system to shift the displayed images in the opposite direction, thereby providing an illusion of actual movement in the tele-virtual conference. Devices of this type are referred to as Computerized Clothing.

Figure 2: A multimedia station for carrying out desktop tele-virtual conferencing (video cameras, an ear-phone, and a multimedia workstation engaged in a tele-virtual conference)

To perform media-dependent transformations and spatial-location-based mixing, we propose Intelligent Multimedia Interface Units (IMIUs), each of which comprises the following specialized modules (see Figure 3):

- Audio processing module: subjects audio streams to silence elimination prior to transmission over the network. Information provided by the position trackers about the spatial location and orientation of the participants is encoded and transmitted with the audio streams. On reception of audio units, the audio processing module computes scaling factors based on the spatial location and orientation of the participants, and uses them to compose stereophonic audio for playback at an audio device.

- Video processing module: Video streams are sequences of video images, which are the atomic units of generation and display. These images are complex compositions of individual video objects. To reduce bandwidth utilization, video images can be decomposed into video objects that are selectively transmitted over the network. On reception of the individual objects, the video processing module synthesizes a virtual image by performing transformations (such as transposition, rotation, and zoom) on some of the video objects. The stereo vision synthesizer generates left and right perspectives of the images and displays them in pairs at the display monitor, providing an illusion of stereo vision.

- Media synchronizer: Strict temporal coordination between video and audio playback is a prerequisite for effective simulation of the real world. The media synchronizer timestamps the video and audio streams transmitted to other participants. For synchronous playback, media units with the same timestamp must be played back simultaneously. Rate mismatches at the display devices may cause playback to drift out of synchrony; the media synchronizer implements mechanisms to steer video and audio playback back into synchrony (see the sketch below).

Figure 3: Structure of an Intelligent Multimedia Interface Unit (IMIU), comprising an audio processing module (silence eliminator and audio mixer), a video processing module (stereo image recognizer and analyzer; image rotation, transposition, and scaling module; stereo vision synthesizer), and a media synchronizer
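The media synchronizer's steering logic can be sketched as follows; the 40 ms tolerance (roughly one video frame time) and all names are illustrative assumptions, not details of the IMIU design:

```python
class MediaSynchronizer:
    """Timestamp-based resynchronization: media units carry generation
    timestamps, and units with the same timestamp should play together.
    If video drifts from audio beyond a tolerance, frames are held or
    dropped to steer playback back into synchrony."""

    TOLERANCE_MS = 40  # assumed tolerance, about one video frame time

    def __init__(self):
        self.audio_ts = 0  # timestamp (ms) of the last audio unit played

    def on_audio_played(self, ts_ms):
        self.audio_ts = ts_ms

    def video_action(self, ts_ms):
        """Decide what to do with the video unit stamped ts_ms."""
        drift = ts_ms - self.audio_ts
        if drift > self.TOLERANCE_MS:
            return "hold"     # video ahead of audio: hold the frame
        if drift < -self.TOLERANCE_MS:
            return "drop"     # video behind audio: skip the frame
        return "display"      # within tolerance: play normally
```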

5 Conclusion

The emergence of digital multimedia is transforming home computers from mere data processing units into versatile multimedia gateways between users and the rest of the world, thereby contributing to the rapid ushering in of a new age of tele-virtual interaction and entertainment. We have investigated the architectural requirements of tele-virtual systems and proposed a high-level design of an Intelligent Multimedia Interface Unit (IMIU) capable of supporting panoramic, life-like, three-dimensional video images and stereophonic audio, so as to synthesize tele-virtual conferencing environments.

References

[1] G. Y. Beakley. Channel Coding for Digital HDTV Terrestrial Broadcasting. IEEE Transactions on Broadcasting, 37(4):137-140, December 1991.

[2] Dave Holbrook. Stereo Viewing: Looking 'Into' Manufacturing. Manufacturing Systems, pages 30-31, January 1991.

[3] A. C. Luther. Digital Video in the PC Environment. Chapters 3-6, pages 45-99, McGraw-Hill, February 1989.

[4] G. Plenge. On the difference between localization and lateralization. Journal of the Acoustical Society of America, 56:944-951, 1974.

[5] P. Venkat Rangan and D. C. Swinehart. Software Architecture for Integration of Video Services in the Etherphone Environment. IEEE Journal on Selected Areas in Communications, 9(9):1395-1404, December 1991.

[6] P. Venkat Rangan, Harrick M. Vin, and Srinivas Ramanathan. Designing an On-Demand Multimedia Service. IEEE Communications Magazine, 30(7):56-65, July 1992.

[7] C. S. Skrzypczak. The Intelligent Home of 2010. IEEE Communications Magazine, pages 81-84, December 1987.

[8] Harrick M. Vin, P. T. Zellweger, D. C. Swinehart, and P. Venkat Rangan. Multimedia Conferencing in the Etherphone Environment. IEEE Computer, Special Issue on Multimedia Information Systems, 24(10):69-79, October 1991.

[9] E. M. Wenzel, F. L. Wightman, and D. J. Kistler. Localization with non-individualized virtual acoustic display cues. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pages 351-359, 1991.


Biography

Srinivas Ramanathan is a doctoral student in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests are in multimedia communication and high-speed networking. He received his B.Tech in Chemical Engineering from Anna University, Madras, India, in 1988 and his M.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras, India, in 1990.

Venkat Rangan directs the Multimedia Laboratory at the University of California, San Diego, where he has been an Assistant Professor of Computer Science since 1989. He serves on the editorial boards of the Journal of Interactive Multimedia, the Journal of Organizational Computing Systems, and IEEE Network, and as the Program Chair for the 1992 International Multimedia Workshop. Dr. Rangan earned a Ph.D. in the Computer Systems Research Group at the University of California, Berkeley, in 1989. He has received numerous awards, including the Powell Foundation Fellowship and the NCR Research Innovation Award for establishing the multimedia laboratory at UCSD, an IBM doctoral fellowship for outstanding Ph.D. research at U.C. Berkeley, and the "President of India Gold Medal" for the best undergraduate academic record at the Indian Institute of Technology, Madras, India, in 1984.

Harrick Vin is a doctoral candidate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests are in multimedia computer systems and ultra-high-speed networking, with emphasis on conferencing and storage architectures for digital video and audio. He is a recipient of an IBM Doctoral Fellowship and an NCR Innovation Award. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Bombay, India, in 1987 and his M.S. from Colorado State University in 1988.