A Telepresence System by Using Live Video Projection of a Wearable Camera onto a 3D Scene Model

Tomoaki Adachi 1, Takefumi Ogawa 1,2, Kiyoshi Kiyokawa 1,2, Haruo Takemura 1,2

1 Graduate School of Information Science and Technology, Osaka University
  1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
  [email protected]

2 Cybermedia Center, Osaka University
  1-32 Machikaneyama, Toyonaka, Osaka 560-0043, Japan
  {ogawa, kiyo, takemura}@ime.cmc.osaka-u.ac.jp

Abstract

In this paper, we propose a novel telepresence technique that has the advantages of both model-based and image-based approaches. In this technique, a virtual environment representing a remote place is presented by projection texture mapping of live video images, captured by a wearable camera, onto the 3D geometry of the remote place. The 3D geometry is acquired in advance. The virtual environment can be rendered from an arbitrary viewpoint, while its texture is dynamically updated according to the camera motion. The observer can thus explore the remote place in cooperation with the remote camera operator. A prototype system demonstrates the validity of our approach.

1 Introduction

In recent years, reproducing a remote scene for telepresence or surveillance has attracted much attention. A telepresence system provides a user with the sense of being present at a remote location, and is often used as a communication tool for a group of people at physically different locations. Most existing systems represent a remote scene by either a live video stream or a 3D scene model.

Image-based rendering (IBR) is a popular method for achieving visually rich telepresence. IBR constructs a panoramic image with high resolution and a large field of view from multiple images (McMillan & Bishop, 1995). For example, QuickTime VR (Chen, 1995) enables a user to pan and zoom inside an environment using a cylindrical image created from real photographs. Yamaguchi et al. proposed a method to generate arbitrarily directional binocular stereo images in real time from a sequence of omnidirectional images by combining captured rays (Yamaguchi et al., 2000). These techniques require neither laborious modeling nor expensive special-purpose rendering hardware. However, they require a tremendous number of images before the user can move the viewpoint freely in the virtual environment, because it is difficult to synthesize a view from a position where no image was taken.

Model-based approaches enable a user to move the viewpoint arbitrarily in a virtual environment. A stereo omnidirectional system (Tanahashi, Shimada, Yamamoto, & Niwa, 2001) obtains color images and depth maps in all directions by using sixty cameras, grouped into twenty stereo units of three cameras each. Such stereo matching methods can quickly acquire the 3D geometry around the observation point. However, they cannot provide high positional accuracy and, in most situations, require feature patterns in the real environment. Another approach is an active sensing method that uses laser range finders. This approach can acquire highly accurate 3D geometry of the real environment. However, it takes quite a long time to measure the entire surroundings, and the rendering cost is high due to the large number of polygons, so it is not suitable for dynamic environments.

In short, a live video stream provides a realistic image of the remote scene in real time, but does not allow a user to move his/her viewpoint. A 3D scene model can be viewed from an arbitrary viewpoint, but is difficult to acquire in real time. In this paper, we propose a novel telepresence technique that has the advantages of both the image-based and model-based approaches.

Figure 1: An overview of the proposed telepresence system.

The remainder of the paper is organized as follows. Section 2 provides an overview of the hybrid technique combining the model-based and image-based approaches, and Section 3 describes its implementation details. Section 4 presents the results of an experiment and evaluates the effectiveness of the prototype system. Section 5 describes a remote instruction application of our method. Finally, Section 6 concludes the paper and outlines possible future work.

2 Technique overview

Figure 1 illustrates an overview of the proposed telepresence system. At site A, a worker wearing a headset equipped with a camera, a tracker, and an HMD (head-mounted display) captures live video. At site B, an observer looks at the reproduced virtual environment. In our technique, the virtual environment representing the remote place is presented by projection texture mapping of the live video, captured in real time by the partner's wearable camera, onto a static 3D scene model of the remote place. The static 3D scene model is acquired in advance. The virtual environment can be rendered from an arbitrary viewpoint, while its texture is dynamically updated according to the camera motion.

For the sake of simplicity, we assume that the 3D geometry of the remote site does not change significantly. The 3D scene model is acquired in advance with a high-performance 3D scanner such as a laser range finder. To acquire the color and texture information of the remote place in real time, a telepresence partner at the remote place captures live video with the wearable camera. Each captured video frame is then projected onto the 3D scene model by projection texture mapping. Not only the current frame but also all previous frames are projected onto the model, so that a large portion of the model is gradually covered with texture as the camera moves around the scene.

Table 1 compares virtual spaces built by several telepresence techniques. The advantages of our technique are that the virtual space is constructed photorealistically in real time and that the viewpoint of the observer is independent of the viewpoint of the camera. The observer can freely control his/her viewpoint. In addition, an arbitrary number of observers can share the virtual scene even though the scene is captured by a single camera.

Table 1: A comparison of telepresence techniques.

                     Data                      Photorealistic   Real-time construction   Movement of viewpoint
    Image-based      Images                    High             OK                       Hard
    Model-based      3D geometry               Low              Hard                     OK
    Proposed method  Images and 3D geometry    High             OK                       OK

3 Implementation

In order to assign the color information of a video frame to the surface of the 3D scene model, it is necessary to calculate all intersection points between the model and each half-line from the camera viewpoint through each pixel of the frame. However, this intersection calculation for all pixels of a frame is time-consuming. Moreover, each polygon would need to be subdivided if there is no vertex on its surface at an intersection point, and this subdivision would increase the total number of polygons and slow down the rendering significantly. Instead, our technique assigns the texture information of a live video image to the 3D model very quickly by employing projective texture mapping (Segal et al., 1992). Supported by recent graphics libraries such as OpenGL and Direct3D, projective texture mapping can be rendered in real time.

In order to accumulate all past and present video frames, our technique maintains "scene textures", which are dedicated buffer areas. A scene texture is a rendered image of the texture-mapped scene model from a fixed viewpoint chosen so that as many polygons as possible are seen. Because many polygons fall outside the view of any single scene texture, our approach uses multiple scene textures projected from multiple directions; theoretically, at least two scene textures are necessary. Figure 2 shows a situation where three scene textures are projected onto a 3D model.

Figure 2: Projection of scene textures onto a 3D model (panels (a)-(d)).

Figure 3: The scene texture updating process.

The algorithm for reproducing the virtual environment proceeds in three steps, sketched in the code below:
• Scene texture selection
• Scene texture update
• Virtual environment representation
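As an illustration only, the following minimal sketch shows how these three steps could be organized as a per-frame loop on the observer's PC. All types and helper functions here (Pose, VideoFrame, SceneTexture, selectSceneTextures, updateSceneTexture, renderObserverView) are hypothetical placeholders of ours, not taken from the paper.

```cpp
#include <vector>

struct Pose         { float position[3]; float orientation[4]; };   // 6-DOF tracker sample
struct VideoFrame   { Pose pose; std::vector<unsigned char> jpeg; }; // frame + capture pose
struct SceneTexture { unsigned int glTextureId; Pose viewpoint; };   // one per projection direction

// Step 1: reject blurred frames and pick the scene textures covered by this frame.
bool selectSceneTextures(const Pose& current, const Pose& previous, double dt,
                         std::vector<int>& textureIds);
// Step 2: merge the frame into one scene texture by projective texture mapping.
void updateSceneTexture(SceneTexture& sceneTexture, const VideoFrame& frame);
// Step 3: render the model from the observer's viewpoint with all scene textures applied.
void renderObserverView(const Pose& observer, const std::vector<SceneTexture>& sceneTextures);

void processFrame(const VideoFrame& frame, const Pose& previousPose, double dt,
                  std::vector<SceneTexture>& sceneTextures, const Pose& observerViewpoint)
{
    std::vector<int> textureIds;

    // 1. Scene texture selection.
    if (!selectSceneTextures(frame.pose, previousPose, dt, textureIds))
        return;                                   // frame rejected; keep the old textures

    // 2. Scene texture update (only the affected scene textures are touched).
    for (int id : textureIds)
        updateSceneTexture(sceneTextures[id], frame);

    // 3. Virtual environment representation.
    renderObserverView(observerViewpoint, sceneTextures);
}
```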

3.1 Scene texture selection

Scene texture selection decides whether the system uses a video frame captured by the wearable camera for updating the scene textures, and if so, which scene textures are updated. In this step, the translational and angular velocities of the camera are calculated from the six-degrees-of-freedom tracker data recorded when the image was captured. If the camera is moving at high speed, the image is not used for updating the scene textures, because most images captured during fast camera motion are blurred. Next, the system decides which scene textures should be updated. In our system, each polygon is assigned an ID indicating the scene texture to which the polygon belongs. Using the selection function of OpenGL, the polygons that fall inside the view frustum of the current captured image are selected, and the IDs of the selected polygons indicate the corresponding scene textures to update (a sketch of this test is given below).
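Purely as an illustration, the sketch below shows one way the blur test and the texture-ID lookup could be implemented. The velocity thresholds and the helpers polygonIdsInFrustum and sceneTextureOfPolygon are assumptions of ours; the paper only states that the OpenGL selection mechanism is used and does not report threshold values.

```cpp
#include <algorithm>
#include <cmath>
#include <set>
#include <vector>

struct Pose { float position[3]; float orientation[4]; };   // position + unit quaternion

// Hypothetical thresholds; the actual values used by the prototype are not reported.
const double MAX_SPEED     = 0.5;   // m/s
const double MAX_ANG_SPEED = 0.8;   // rad/s

// Placeholder: polygon IDs inside the camera's view frustum. In the prototype this
// corresponds to OpenGL selection mode (glRenderMode(GL_SELECT) hit records).
std::vector<int> polygonIdsInFrustum(const Pose& camera);
// Placeholder: the scene texture each polygon was assigned to when the model was loaded.
int sceneTextureOfPolygon(int polygonId);

bool selectSceneTextures(const Pose& current, const Pose& previous, double dt,
                         std::vector<int>& textureIds)
{
    // Translational speed of the camera between the two tracker samples.
    double dx = current.position[0] - previous.position[0];
    double dy = current.position[1] - previous.position[1];
    double dz = current.position[2] - previous.position[2];
    double speed = std::sqrt(dx * dx + dy * dy + dz * dz) / dt;

    // Angular speed from the angle between the two orientation quaternions.
    double dot = 0.0;
    for (int i = 0; i < 4; ++i)
        dot += current.orientation[i] * previous.orientation[i];
    double angSpeed = 2.0 * std::acos(std::min(1.0, std::fabs(dot))) / dt;

    // Fast motion implies a blurred frame; do not use it for texture updates.
    if (speed > MAX_SPEED || angSpeed > MAX_ANG_SPEED)
        return false;

    // Each visible polygon votes for the scene texture it belongs to.
    std::set<int> ids;
    for (int polygonId : polygonIdsInFrustum(current))
        ids.insert(sceneTextureOfPolygon(polygonId));

    textureIds.assign(ids.begin(), ids.end());
    return !textureIds.empty();
}
```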

3.2 Scene texture update

In this section, we describe how a scene texture is updated in detail. Figure 3 shows the scene texture updating process. First, the rendering view frustum is set to that of the scene texture. With projection texture mapping, the color information of the current video frame is assigned to the 3D model based on the view frustum of the camera that captured the frame. Next, a stencil mask is generated in the stencil buffer, using a quadrangular pyramid model corresponding to the camera's view frustum as a shadow volume. The stencil mask is used to copy the previous scene texture onto the frame buffer, and the frame buffer then becomes the new scene texture (an OpenGL sketch of this update pass follows).

Sample images of the scene texture update are shown in Figures 4 to 10. Figure 4 shows a picture of the real environment to be reproduced as a virtual environment. Figure 5 shows the observer's view after some video frames have already been projected onto the model. When the observer's PC receives a new video image from the worker's PC, as shown in Figure 6, it renders the 3D model without color information from the viewpoint of the scene texture, as shown in Figure 7. Next, it projects the video image onto the 3D model using a frustum derived from the camera's position, orientation, and angle of view. Figure 8 shows the video image projected onto the 3D model; the quadrangular pyramid wire frame represents the camera's view frustum. Using a shadow volume technique, the stencil buffer is set to a certain value to generate the stencil mask, as shown in Figure 9. Our system regards the camera's view frustum as a shadow volume; the shaded portion of Figure 9 is where the stencil value is incremented. The updated scene texture is composed of the previous scene texture, the projected video image, and the mask. Figure 10 shows the observer's view after the new video frame has been projected onto the model.
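The following is a minimal fixed-function OpenGL sketch of one such update pass, under our reading of the description above. The Frustum struct, the drawing helpers, and the z-pass stencil counting (which assumes the scene-texture viewpoint lies outside the camera frustum volume) are our own assumptions; the authors' actual implementation may differ in detail.

```cpp
#include <GL/glu.h>   // gluPerspective, gluLookAt (GL/gl.h is pulled in as well)

// Hypothetical frustum description used only in this sketch.
struct Frustum { double fovY, aspect, zNear, zFar;
                 double eye[3], center[3], up[3]; };

void drawSceneModel();             // placeholder: the static 3D scene model (world coordinates)
void drawCameraFrustumVolume();    // placeholder: closed quadrangular pyramid of the camera frustum
void drawPreviousSceneTexture();   // placeholder: screen-aligned quad textured with the old scene texture

// One update pass of a single scene texture (texture object sceneTex, size texW x texH).
void updateSceneTexture(GLuint sceneTex, int texW, int texH,
                        GLuint videoTex, const Frustum& camera, const Frustum& sceneView)
{
    // 1. Set the view frustum to that of the scene texture and clear the buffers.
    glViewport(0, 0, texW, texH);
    glMatrixMode(GL_PROJECTION); glLoadIdentity();
    gluPerspective(sceneView.fovY, sceneView.aspect, sceneView.zNear, sceneView.zFar);
    glMatrixMode(GL_MODELVIEW); glLoadIdentity();
    gluLookAt(sceneView.eye[0], sceneView.eye[1], sceneView.eye[2],
              sceneView.center[0], sceneView.center[1], sceneView.center[2],
              sceneView.up[0], sceneView.up[1], sceneView.up[2]);
    glEnable(GL_DEPTH_TEST);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

    // 2. Project the current video frame onto the model. The texture matrix maps world
    //    coordinates through the camera's view and projection, then biases [-1,1] to [0,1];
    //    eye-linear texgen (specified while the scene-view modelview is current) supplies
    //    the world coordinates.
    glMatrixMode(GL_TEXTURE); glLoadIdentity();
    glTranslated(0.5, 0.5, 0.5); glScaled(0.5, 0.5, 0.5);        // bias
    gluPerspective(camera.fovY, camera.aspect, camera.zNear, camera.zFar);
    gluLookAt(camera.eye[0], camera.eye[1], camera.eye[2],
              camera.center[0], camera.center[1], camera.center[2],
              camera.up[0], camera.up[1], camera.up[2]);
    glMatrixMode(GL_MODELVIEW);

    const GLdouble sP[] = {1,0,0,0}, tP[] = {0,1,0,0}, rP[] = {0,0,1,0}, qP[] = {0,0,0,1};
    glTexGeni(GL_S, GL_TEXTURE_GEN_MODE, GL_EYE_LINEAR); glTexGendv(GL_S, GL_EYE_PLANE, sP);
    glTexGeni(GL_T, GL_TEXTURE_GEN_MODE, GL_EYE_LINEAR); glTexGendv(GL_T, GL_EYE_PLANE, tP);
    glTexGeni(GL_R, GL_TEXTURE_GEN_MODE, GL_EYE_LINEAR); glTexGendv(GL_R, GL_EYE_PLANE, rP);
    glTexGeni(GL_Q, GL_TEXTURE_GEN_MODE, GL_EYE_LINEAR); glTexGendv(GL_Q, GL_EYE_PLANE, qP);
    glEnable(GL_TEXTURE_GEN_S); glEnable(GL_TEXTURE_GEN_T);
    glEnable(GL_TEXTURE_GEN_R); glEnable(GL_TEXTURE_GEN_Q);
    glEnable(GL_TEXTURE_2D); glBindTexture(GL_TEXTURE_2D, videoTex);
    drawSceneModel();
    glDisable(GL_TEXTURE_GEN_S); glDisable(GL_TEXTURE_GEN_T);
    glDisable(GL_TEXTURE_GEN_R); glDisable(GL_TEXTURE_GEN_Q);
    glMatrixMode(GL_TEXTURE); glLoadIdentity(); glMatrixMode(GL_MODELVIEW);

    // 3. Mark the pixels whose surface point lies inside the camera frustum volume
    //    (z-pass stencil counting, treating the frustum like a shadow volume;
    //    assumes the scene-texture viewpoint is outside the volume).
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 0, ~0u);
    glEnable(GL_CULL_FACE);
    glCullFace(GL_BACK);  glStencilOp(GL_KEEP, GL_KEEP, GL_INCR); drawCameraFrustumVolume();
    glCullFace(GL_FRONT); glStencilOp(GL_KEEP, GL_KEEP, GL_DECR); drawCameraFrustumVolume();
    glDisable(GL_CULL_FACE);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    // 4. Where the stencil is still zero (not covered by the new frame), copy the previous
    //    scene texture back onto the frame buffer; the frame buffer then becomes the new
    //    scene texture.
    glStencilFunc(GL_EQUAL, 0, ~0u);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
    glDisable(GL_DEPTH_TEST);
    drawPreviousSceneTexture();                // binds the old contents of sceneTex itself
    glDisable(GL_STENCIL_TEST);

    glBindTexture(GL_TEXTURE_2D, sceneTex);
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texW, texH);
}
```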

Figure 4: A picture of the real environment.

Figure 5: An image of the observer’s current view.

Figure 6: A sample video frame.

Figure 7: The view frustum of the scene texture.

Figure 8: The video image projected onto the 3D model.

Figure 9: The stencil mask.

Figure 10: The observer’s view after the scene texture update.

3.3 Virtual environment representation

In this step, the reproduced scene is rendered from the observer’s viewpoint. First, the system sets the observer’s viewpoint and view frustum. Next, each scene texture is projected onto the 3D scene model from its own viewpoint, so that all of the textures together assign color information to the model (a sketch of this rendering pass follows). Figure 11 shows the result of projecting the updated scene texture.
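As an illustration, the sketch below shows one way this multi-pass rendering could be organized. It reuses the Frustum struct and the projective-texturing setup from the Section 3.2 sketch; the per-texture polygon grouping and all helper functions are our own assumptions.

```cpp
#include <vector>
#include <GL/glu.h>

// Per-scene-texture data assumed for this sketch; Frustum is the struct from the
// Section 3.2 sketch, and polygonIds is the ID assignment described in Section 3.1.
struct SceneTexture { GLuint texture; Frustum viewpoint; std::vector<int> polygonIds; };

void drawPolygons(const std::vector<int>& polygonIds);         // placeholder
void setupProjectiveTexturing(GLuint tex, const Frustum& f);   // texgen + texture matrix (Sec. 3.2)
void disableProjectiveTexturing();                             // placeholder

void renderObserverView(const Frustum& observer, const std::vector<SceneTexture>& sceneTextures)
{
    // Set the observer's viewpoint and view frustum.
    glMatrixMode(GL_PROJECTION); glLoadIdentity();
    gluPerspective(observer.fovY, observer.aspect, observer.zNear, observer.zFar);
    glMatrixMode(GL_MODELVIEW); glLoadIdentity();
    gluLookAt(observer.eye[0], observer.eye[1], observer.eye[2],
              observer.center[0], observer.center[1], observer.center[2],
              observer.up[0], observer.up[1], observer.up[2]);
    glEnable(GL_DEPTH_TEST);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // Each polygon belongs to exactly one scene texture (Section 3.1), so the model is
    // drawn group by group, with the matching scene texture projected from its viewpoint.
    for (const SceneTexture& st : sceneTextures) {
        setupProjectiveTexturing(st.texture, st.viewpoint);
        drawPolygons(st.polygonIds);
        disableProjectiveTexturing();
    }
}
```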

3.4 System configuration

We built a prototype system composed of two Windows® computers with identical specifications (Windows® XP SP2, Pentium 4 3.2 GHz, 1 GB RAM, NVIDIA GeForce 6800 GTO), an IEEE 1394 camera (Dragonfly, Point Grey Research, Inc.), a six-degrees-of-freedom sensor (HiBall-3100, 3rdTech Inc.), and an optical see-through HMD (Mediamask MW601, Olympus Corp.). Figure 11 shows a worker wearing a headset equipped with these devices. The resolutions of the video frames and the scene textures are 320 x 240 and 512 x 512 pixels, respectively. Each video frame is compressed into JPEG format and transferred to the observer’s PC (an example frame message layout is sketched below). The HMD displays the observer’s instructions using arrows, lines, and so on.
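The paper does not specify the wire format used for this transfer. Purely as an illustration, a frame message could pair the JPEG-compressed image with the 6-DOF pose it was captured at, for example:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical message sent from the worker's PC to the observer's PC for each
// usable video frame; this layout is an assumption of ours, not taken from the paper.
struct FramePacket {
    uint64_t timestampUs;       // capture time, microseconds
    float    position[3];       // camera position from the 6-DOF tracker
    float    orientation[4];    // camera orientation (quaternion)
    float    fovY, aspect;      // intrinsics needed to rebuild the projection frustum
    uint32_t jpegSize;          // number of bytes that follow
    std::vector<uint8_t> jpeg;  // 320 x 240 frame compressed to JPEG
};
```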

Figure 11: A user wearing a camera, tracker, and HMD.

4 Experiment

An experiment was conducted to evaluate the feasibility of the prototype system in terms of texture distortion and processing time. The experimental configuration is described below, followed by a discussion of the results.

4.1 Experimental setting

Figure 12 shows the real environment reproduced as a virtual environment in this experiment. In the real environment, there is a desk on which four cardboard boxes and two documents are placed, and there is a calendar on the wall. The 3D geometry of the desk and boxes was measured manually in advance. Figure 13 shows the environment and the worker shooting video with the wearable camera. An observer in a different room observed the texture-mapped 3D scene model over a 100BASE-T LAN. On the observer’s PC, the models of the desk, the boxes, and the wall were rendered. The video images shot by the worker were transferred from the worker’s PC, and the observer’s PC accumulated them as scene textures and projected them onto the 3D model. Figures 14 and 15 show how the model is gradually texture-mapped and reproduced.

4.2 Results and discussion

Through the experiment, it was confirmed that the quality of the texture-mapped 3D model was generally acceptable for grasping the atmosphere of the remote place. The results are similar to those of video mosaics (Barfield & Caudell, 2001); however, the proposed method differs from video mosaics in that it maintains a 3D model and projects the live video onto it as texture. In fact, our prototype system enabled the observer to select an arbitrary viewpoint. However, when the camera moves very quickly, the captured image becomes blurred and the registration error increases due to the slow shutter speed.

Figure 12: The real environment to be reproduced as a virtual environment.

Figure 13: A worker shooting video.

Figure 14: Texture is mapped onto the virtual object.

Figure 15: The reproduced virtual environment.

In Figure 15, the scene texture was not projected correctly on the left side of the white box: the texture image is coarse and the brightness is not continuous at the edges of the video frames. In order to resolve these issues, we need to increase the resolution of the scene texture and the video image, and use a camera that does not automatically adjust the white balance. These problems can be mitigated by tuning the hardware settings in this way. The update rate of the scene texture and the frame rate of the video camera were 15 frames per second and 25 frames per second, respectively. It takes about 60 milliseconds to render the virtual space on the observer’s PC. In this experiment, the virtual environment was displayed in real time, but the processing time will increase with the number of scene textures.

5 Applications of our method

Our method can be used to develop, for example, a disaster area assessment system or a collaborative surveillance system. When a disaster occurs, information must be gathered quickly. The 3D geometry of collapsed buildings can be measured using existing techniques (Cao, Oh, & Hall, 1999), and color information can then be assigned to this geometry using our method. Our method is also applicable to a large virtual environment in which many workers shoot live video simultaneously.

Moreover, the system enables an observer at a remote site to independently investigate a worker’s activity through the virtual environment. With appropriate display devices, the worker can receive the observer’s detailed surveillance instructions via annotation and gesture. There are many studies on remote instruction using an HMD (BT Development, 1993; Kuzuoka, 1992), an active laser pointer (Mann, 2000; Sakata et al., 2003), or a multimedia projector (Yamashita et al., 1999). By incorporating these instruction methods, we can construct a collaborative surveillance system that enables distant users to cooperate effectively. In our system, an observer at a remote site instructs the worker with the mouse through the virtual environment, and the worker confirms the observer’s instructions through the HMD without interrupting his/her movement or work. In addition, so that the worker can grasp the observer’s instructions intuitively, the system overlays the contents of the instruction directly onto the real objects, based on the 3D world coordinates corresponding to the position indicated with the mouse; this lets the worker grasp the relationship between the instruction and the indicated object. The worker can also perceive the contents of the instruction in three dimensions, because the system supports stereovision.

In order to overlay the contents of an instruction onto the exact position of the real object corresponding to the mouse position, it is necessary to convert the 2D screen coordinates indicated with the mouse in the virtual environment into 3D world coordinates, as shown in Figure 16. Our system performs this conversion using the selection function of OpenGL. The Z-buffer value at the pixel indicated with the mouse is obtained from the information of the primitive selected at that position. Based on this Z-buffer value and the perspective projection parameters used to render the virtual environment, the indicated screen coordinates are back-projected into 3D world coordinates (a sketch of this unprojection step follows the list below). The converted 3D position is transferred to the worker’s PC, where the contents of the instruction are overlaid onto the real object at this position through the HMD. The contents of the instruction are presented to the worker in the following two forms:

• Instruction by a 3D cursor object: the observer can point out an arbitrary position or target of the instruction, and convey nonverbal information by gesture.
• Instruction by line drawing: the observer can convey his/her intention by line drawings, pictures, and characters.
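The paper obtains the depth value via OpenGL selection; purely to illustrate the back-projection step, the sketch below uses the common glReadPixels/gluUnProject route instead, which performs the same 2D-to-3D conversion. The function name and parameters are our own.

```cpp
#include <GL/glu.h>   // gluUnProject

// Convert a mouse position in window coordinates into 3D world coordinates,
// using the depth buffer of the rendered virtual environment.
// Returns false if the mouse points at the background (far plane).
bool screenToWorld(int mouseX, int mouseY, double world[3])
{
    GLint    viewport[4];
    GLdouble modelview[16], projection[16];
    glGetIntegerv(GL_VIEWPORT, viewport);
    glGetDoublev(GL_MODELVIEW_MATRIX, modelview);    // observer's view of the scene
    glGetDoublev(GL_PROJECTION_MATRIX, projection);  // perspective used for rendering

    // Window coordinates have their origin at the lower left, the mouse at the upper left.
    const GLdouble winX = mouseX;
    const GLdouble winY = viewport[3] - mouseY - 1;

    // Depth of the indicated pixel (the paper reads this via OpenGL selection instead).
    GLfloat depth = 1.0f;
    glReadPixels((GLint)winX, (GLint)winY, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, &depth);
    if (depth >= 1.0f)
        return false;                                // nothing was hit

    // Back-project the screen coordinates into world coordinates.
    return gluUnProject(winX, winY, depth, modelview, projection, viewport,
                        &world[0], &world[1], &world[2]) == GL_TRUE;
}
```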

The instruction by a 3D cursor object is presented constantly at the position indicated with the mouse, as shown in Figure 16, and the instruction by line drawing is presented as the locus of the dragged mouse pointer, as shown in Figure 17. Figure 16(a) and Figure 17(a) show the images for the right-eye display of the HMD, although the worker actually sees these instructions in three dimensions by stereovision. The red line drawn along the shape of the real objects is a wire frame of the 3D geometry used in the virtual environment.

Figure 16: Instruction by a 3D cursor object: (a) the real environment, (b) the virtual environment. The annotations indicate the world coordinates (X, Y, Z) and the screen coordinates (U, V).

Figure 17: Instruction by line drawing: (a) the real environment, (b) the virtual environment.

6 Conclusion

We have proposed a telepresence technique that has the advantages of both the model-based and image-based approaches. By projecting live video as texture onto a static 3D scene model, our technique allows a user to view the texture-mapped virtual environment realistically from an arbitrary viewpoint and to grasp the atmosphere of the remote place. A prototype system confirmed the feasibility of our approach. Using our method, we constructed a collaborative surveillance system that enables distant users to cooperate effectively: an observer at a remote site gives instructions with the mouse through the virtual environment, and the worker confirms the observer’s instructions through the HMD. Future work includes the incorporation of a real-time 3D measuring system using a laser range finder or other devices, improvements in texture distortion and processing speed, and the development of practical applications for multi-user collaboration.

References

Barfield, W. & Caudell, T. (2001). Fundamentals of wearable computers and augmented reality. Lawrence Erlbaum Associates, 253-257.

BT Development (1993). Camnet videotape. Suffolk, Great Britain.

Cao, Z. L., Oh, S. J., & Hall, E. L. (1999). Dynamic omnidirectional vision for mobile robots. J. Robotic Systems, 3(1), 5-7.

Chen, S. E. (1995). QuickTime VR - An image-based approach to virtual environment navigation. Proc. of SIGGRAPH '95, 29-38.

Kuzuoka, H. (1992). Spatial workspace collaboration: A shared view video support system for remote collaboration capability. Proc. of CHI '92, 533-540.

McMillan, L. & Bishop, G. (1995). Plenoptic modeling: An image-based rendering system. Proc. of SIGGRAPH '95, 39-46.

Mann, S. (2000). Telepointer: Hands-free completely self contained wearable visual augmented reality without headwear and without any infrastructure reliance. Proc. of ISWC 2000, 177-178.

Sakata, N., Kurata, T., Kato, T., Kourogi, M., & Kuzuoka, H. (2003). WACL: Supporting telecommunications using wearable active camera with laser pointer. Proc. of ISWC 2003, 53-56.

Segal, M., et al. (1992). Fast shadows and lighting effects using texture mapping. Proc. of SIGGRAPH '92, 249-252.

Tanahashi, H., Shimada, D., Yamamoto, K., & Niwa, Y. (2001). Acquisition of three-dimensional information in a real environment by using the stereo omni-directional system (SOS). Proc. of Third International Conference on 3-D Digital Imaging and Modeling (3DIM '01), 365-371.

Yamaguchi, K., Yamazawa, K., Takemura, H., & Yokoya, N. (2000). Real-time generation and presentation of view-dependent binocular stereo images using a sequence of omnidirectional images. Proc. of 15th IAPR International Conference on Pattern Recognition (ICPR 2000), Vol. IV, 589-593.

Yamashita, J., Kuzuoka, H., et al. (1999). Agora: Supporting multi-participant telecollaboration. Proc. of HCI '99, 543-547.