A new parallel ray-tracing system based on object decomposition


Hyun-Joon Kim¹, Chong-Min Kyung²

¹ Information Technology Lab, LG Electronics Research Center, 16 Woomyeon-Dong, Seocho-Gu, Seoul 137-140, Korea; e-mail: [email protected]
² VLSI Systems Laboratory, Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1, Kusong-dong, Yusong-gu, Taejon 305-701, Korea; e-mail: [email protected]

We propose a new parallel ray-tracing hardware architecture in which processors are connected as a ring. Most parallel ray-tracing algorithms subdivide the whole object space into subregions; a processor handles only rays entering the subregion assigned to it. Here we assign each processor objects that are spread over the whole object space. The processors trace rays on their own objects. The respective partial results are combined to form the final image. This scheme is especially suitable for synthesizing animated sequences because objects need not be reallocated for every frame. Preliminary results show a speed-up factor almost linearly proportional to the number of processors.

Key words: Parallel ray tracing – Computer animation – Ring architecture

Correspondence to: H.-J. Kim


1 Introduction

Ray tracing (Whitted 1980) is a powerful technique widely used for synthesizing realistic images, as it can express a wide variety of geometric primitives, shadows, multiple reflections, transparency, etc. However, it is generally very time consuming due to the large number of ray/object intersection calculations. Some efficient methods have evolved to reduce the execution time of ray-tracing algorithms (Glassner 1989; Watt 1989). Most of them use a secondary data structure for grouping to reduce the number of objects to be checked for ray intersection. Some of these methods are based on the partitioning of 3D space by the so-called space-subdivision technique (Glassner 1984; Cleary and Wyvill 1988), while others partition the set of objects through the use of a tree of extents (Kay and Kajiya 1986; Goldsmith and Salmon 1987). The space-subdivision scheme adaptively or uniformly subdivides the object space into subspaces. When a ray passes through a subspace, it is tested only against objects in this subspace. The tree-of-extents scheme places a simple bounding volume around each object and constructs a hierarchical tree of those bounding volumes. Each node in the tree is associated with a bounding volume that contains all the bounding volumes of its children. Each ray then traverses down the hierarchical tree to find the intersecting object. If a ray does not intersect the bounding volume of a node, we conclude that the ray does not intersect any object in its descendant nodes. Despite the impressive advances made with these algorithmic approaches, the synthesis time still remains considerable. Whitted (1980), in his original paper, observes that the ray-tracing algorithm is suitable for implementation on multiprocessors because the intensity calculation in one pixel is independent of that in others. Many systems have subsequently been proposed to exploit this inherent parallelism in a variety of ways (Dew et al. 1989).
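The pruning rule of the tree-of-extents scheme can be sketched as follows. This is a minimal illustration of ours, not the cited systems' code: it uses axis-aligned bounding boxes, a slab-based ray/box test, and string object names, all of which are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]

def ray_hits_box(origin: Vec, direction: Vec, lo: Vec, hi: Vec) -> bool:
    """Slab test: does the ray (t >= 0) intersect the axis-aligned box [lo, hi]?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < 1e-12:
            if o < l or o > h:          # ray parallel to this slab and outside it
                return False
        else:
            t1, t2 = (l - o) / d, (h - o) / d
            if t1 > t2:
                t1, t2 = t2, t1
            t_near, t_far = max(t_near, t1), min(t_far, t2)
            if t_near > t_far:
                return False
    return True

@dataclass
class Node:
    lo: Vec
    hi: Vec
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None           # leaf nodes hold one object

def candidates(node: Node, origin: Vec, direction: Vec) -> List[str]:
    """Collect objects whose bounding volumes the ray may intersect;
    a missed bounding volume prunes the entire subtree below it."""
    if not ray_hits_box(origin, direction, node.lo, node.hi):
        return []
    if node.obj is not None:
        return [node.obj]
    out: List[str] = []
    for c in node.children:
        out.extend(candidates(c, origin, direction))
    return out
```

A ray that misses the root box tests no objects at all; a ray that hits only one leaf's box returns only that leaf's object, which is exactly the pruning the paper describes.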
For convenience, we classify the parallel ray-tracing systems into two schemes: the ray dataflow scheme, which is based on the movement of rays between processors, and the object dataflow scheme, which is based on the movement of objects between processors. The ray dataflow scheme is a combination of parallel processing and space subdivision. Each processor is assigned one or more subregions, and each subregion contains a part of the database.

The Visual Computer (1996) 12:244–253 © Springer-Verlag 1996

Neighboring processors contain objects in adjacent regions and communicate locally via ray messages. In this method, load imbalance can occur due to an uneven number and shape complexity of the objects in each region. To solve this problem, several methods have been proposed. One approach dynamically modifies the size of each processor's region (Dippé and Swensen 1984). Another approach is distributed allocation (Kobayashi et al. 1988), in which each processor is assigned subspaces that are spatially distant rather than adjacent. Yet another approach presamples the image to predict the computational load in each region and subdivides the space according to this prediction (Priol and Bouatouch 1989). Scherson and Caspary (1988) present a method, another form of the ray dataflow scheme, that accommodates the tree of extents as the secondary data structure. The tree is partitioned into an upper and a lower part by a particular level in the hierarchy. The upper tree, containing all the bounding volumes above this level, is distributed to all processors. The lower tree, containing the bounding volumes below the designated level, is distributed to the appropriate individual processor. A ray/object intersection calculation is then divided into an upper-level process and a lower-level process. The upper-level process is distributed to the processor with the least amount of work for the lower-level process. This is the key to the load-balancing feature of the algorithm. However, it is not obvious how to determine the level that partitions the tree into the upper and lower parts. In contrast, the object dataflow scheme takes advantage of so-called ray coherence, which means that rays with almost equal origins and directions are likely to intersect the same objects. Therefore, computing adjacent pixels generally requires knowledge of only a few objects (Green and Paddon 1989).
Such a solution has been proposed by Green and Paddon (1990). They implemented a shared virtual memory containing the whole database. The processor's local memory consists of a memory-resident part and a cache part. The memory-resident part stores frequently used data, and the cache part stores the most recently accessed data. Badouel et al. (1994) give experimental results and comparisons of the ray dataflow scheme and the object dataflow scheme.

The parallel systems mentioned have their own advantages over other systems, but they have their own problems, too. In the method of dynamically resizing the region allocated to each processor, the reallocation of objects in each region is itself an overhead. In the presampling scheme, a large number of presampling points is necessary to balance the load among the processors, and the preprocessing step becomes a significant overhead. In object dataflow systems, memory contention can occur when several processors simultaneously access the shared memory, and these systems require large local memories to alleviate this problem. Moreover, although ray tracing is used to generate high-quality animation, only a few parallel systems have considered the animation situation (Kobayashi et al. 1989; Horiguchi et al. 1993). Kobayashi et al. (1989) present the animation version of the system they invented (Kobayashi et al. 1988), in which the load is estimated from the previous frame, and the object assignment is adjusted to the load estimate at the start of each frame. Horiguchi et al. (1993) describe a parallel processing technique for incremental ray tracing (Hirota and Murakami 1990), which recalculates only the rays changed by the moving objects in successive frames. In animation, most ray dataflow systems, including those mentioned here, require an initialization process, such as object allocation at the start of each frame, and, in object dataflow systems, the cache is missed frequently at frame changes, which causes performance degradation. In this paper, we propose a new parallel ray-tracing system in which ray messages move from processor to processor, connected in a ring. The communication complexity is minimal due to the local communication in pipeline form. We obtained an almost linear speed-up, depending on the number of processors, with a dynamic load-balancing scheme.
In particular, in the proposed system, an animation sequence of images can be synthesized seamlessly without the object-reallocation step. The outline of this paper is as follows. In Sect. 2, we describe the proposed system architecture based on the object-decomposition scheme, followed by the load-balancing strategy in Sect. 3. In Sect. 4, we discuss how the proposed method can be applied to animation sequences. Finally, the


evaluation results of the performance of the proposed system are given in Sect. 5.

2 The system architecture

Some of the parallel ray-tracing systems explained in the previous section subdivide object space into subregions and assign one or more subregions to each processor. Each processor handles only the rays in its own region. In our system, each processor is assigned a subset of objects. Figure 1 shows a decomposition of objects into three object groups assigned to P1, P2, and P3. Each processor then executes the ray-tracing algorithm individually, considering only its own object group, and the results of this individual work are combined to obtain the final image. This scheme is similar to the image-composition architecture (Molnar et al. 1992) for real-time display systems. In this approach, no specific rule of assignment between processors and objects is needed, i.e., objects can be randomly distributed. However, redundant calculations can occur when multiple processors independently calculate the intersection of a ray with the objects of their responsibility. For example, in Fig. 2a, processor P1 and processor P2 simultaneously calculate the intersection of a ray R with their own objects. The intersection calculation in P2 is not necessary in this case, as the result of the ray intersection check with object A in P1 turns out to be 'true'. However, as shown in Fig. 2b, if P2 knows that ray R intersects P1's object A at t = 3.2, where t is the parameter in the parametric equation of a ray (Glassner 1989), P2 can skip the intersection calculations for its objects in the interval from t = 4.0 to t = 5.0. That is, if a processor is informed of the intersection results of other processors, it is possible to speed up the whole process by avoiding unnecessary calculations. In this paper, we propose the ring architecture shown in Fig. 3 to avoid unnecessary calculations based on the object-decomposition scheme. The operation is as follows.
Firstly, objects and pixels to be calculated are evenly distributed to the processors, which then construct a secondary data structure of the assigned objects for the acceleration of ray/object intersection calculations. The secondary data structure consists of axis-aligned 3D regions (voxels) in which the associated objects


are registered. Each processor receives a ray from the preceding processor and calculates the intersection between this ray and each of its assigned objects. A ray message (Table 1) is then sent to the next processor. The following processors calculate only the intersections of the ray with objects in voxels whose t intervals lie within, or include, tm, i.e., voxels for which t1 ≤ tm, where the t interval of the voxel is represented as [t1, t2]. If an intersection is found to be nearer than tm, then the t value at the intersection point is taken as the new tm value. For a certain ray, a processor may either calculate intersections with some of the objects or only traverse the voxels. For shadow rays, if an intersection is reported in any of the processors, the trip of the shadow ray around the ring is terminated immediately, because it has turned out that the ray origin is in shadow. Each processor manages its own ray message queue (Fig. 3) that stores ray messages from the preceding processor. If a processor's ray message queue becomes empty, the processor generates a new ray. The actual intersection point of a ray is determined by the final tm value after the ray finishes its trip around the ring. A ray that has completed a trip around the ring is called a complete ray in the following discussion. When a processor receives a complete ray, the processor calculates the local intensity or derives secondary rays from the complete ray, such as a shadow ray, reflection ray, and/or refraction ray. When all pixel intensities are calculated, the contents of the local buffer of each processor are transmitted to the host to form the final image. The internal processes of each processor (Fig. 3) are briefly described as follows.
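One processor's share of the round trip can be sketched as follows: voxels whose t interval starts beyond the best intersection found so far (tm) are skipped, and tm is updated whenever a nearer hit is found. The voxel records, sphere objects, and precomputed t intervals here are simplifications of ours, not the paper's data structures (which use a uniform grid traversal).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Sphere:
    center: Tuple[float, float, float]
    radius: float

@dataclass
class Voxel:
    t1: float                 # ray parameter where the ray enters this voxel
    t2: float                 # ray parameter where the ray leaves it
    objects: List[Sphere]

def hit_t(ray_o, ray_d, s: Sphere) -> Optional[float]:
    """Smallest positive ray parameter t at which the ray hits the sphere
    (assumes ray_d is unit length)."""
    oc = tuple(o - c for o, c in zip(ray_o, s.center))
    b = sum(d * o for d, o in zip(ray_d, oc))
    c = sum(o * o for o in oc) - s.radius ** 2
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0 else None

def process_ray(ray_o, ray_d, voxels: List[Voxel], t_m: float) -> float:
    """One processor's step of the trip: test only voxels with t1 <= t_m."""
    for v in voxels:
        if v.t1 > t_m:                # voxel lies entirely beyond the nearest hit
            continue
        for obj in v.objects:
            t = hit_t(ray_o, ray_d, obj)
            if t is not None and t < t_m:
                t_m = t               # nearer intersection: update t_m
    return t_m
```

Reproducing the paper's Fig. 2 example: if the first processor finds a hit at t = 3.2, a second processor whose voxel spans t = 4.0 to 5.0 skips its intersection tests entirely.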

1. The receiving process. The processor receives a ray message from the preceding processor.
2. The ray management process. The processor fetches a ray from the ray message queue and calls the intersection process if the ray is not complete. If the fetched ray is complete, the processor derives secondary rays from it. If the complete ray is a shadow ray, the shading process for calculating the intensity is invoked. If the ray message queue is empty, the processor generates a new ray message from the eye ray generated from an assigned pixel.

Fig. 1. Object decomposition scheme for a parallel ray-tracing architecture

3. The intersection process. The processor calculates intersections with its assigned objects.
4. The shading process. The processor calculates the illumination and adds the result to the intensity of the appropriate pixel.
5. The sending process. The processor updates ray messages, if necessary, and sends the ray message to the next processor.

Fig. 2. Example of unnecessary calculations: a shows a case where P2 does not know the processing result of P1, while b shows a case where P2 does know the processing result of P1

Fig. 3. A detailed view of the internal processes of each processor
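The five internal processes can be sketched as one dispatch loop per processor. The class below is a schematic of the control flow only (queues stand in for the ring links, a ray is complete when it returns to the processor that generated it); all names, the message layout, and the placeholder bodies of the intersection and shading steps are our own simplifications.

```python
from collections import deque

class RingProcessor:
    def __init__(self, pid, pixels):
        self.pid = pid
        self.queue = deque()            # ray message queue fed by the predecessor
        self.pixels = list(pixels)      # pixels assigned to this processor
        self.framebuffer = {}

    # 1. receiving process: a ray message arrives from the preceding processor
    def receive(self, msg):
        if msg["origin"] == self.pid:   # the ray has finished its round trip
            msg["complete"] = True
        self.queue.append(msg)

    # 2. ray management process: fetch a queued ray, or generate a new eye ray
    def step(self, send_to_next):
        if self.queue:
            msg = self.queue.popleft()
        elif self.pixels:
            msg = {"pixel": self.pixels.pop(0), "origin": self.pid,
                   "t_m": float("inf"), "complete": False}
        else:
            return False                # nothing to do this step
        if msg["complete"]:
            self.shade(msg)             # 4. shading process (complete ray)
        else:
            self.intersect(msg)         # 3. intersection process
            send_to_next(msg)           # 5. sending process
        return True

    def intersect(self, msg):           # placeholder for the ray/object tests
        pass

    def shade(self, msg):               # placeholder local illumination
        self.framebuffer[msg["pixel"]] = 1.0
```

Driving three such processors round-robin until all queues and pixel lists are empty makes each eye ray visit every processor once and get shaded back at its origin.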

3 Load balancing

An important advantage of the proposed method is that the communication complexity is minimal due to the local communication in pipeline form. However, if a processor has a predominantly larger computational load than the other processors, the system performance can be significantly degraded due to the ensuing bottleneck.



Table 1. Contents of a ray message

Rg     Ray geometries
Rd     Ray depth
Rk     Kind of ray, for example, eye, shadow, or secondary ray
Rc     Color parameters of the ray
PC     Pixel coordinates
PE_id  Identification of the processor element where the ray was born
tm     Ray parameter at the nearest intersection point of the ray as computed so far in the ring architecture
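Table 1 maps naturally onto a small message record. Field names follow the table; the concrete types, the enum of ray kinds, and the `update_hit` helper are additions of ours for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class RayKind(Enum):
    EYE = "eye"
    SHADOW = "shadow"
    SECONDARY = "secondary"

@dataclass
class RayMessage:
    Rg: Tuple[Tuple[float, float, float], Tuple[float, float, float]]  # ray geometry: origin, direction
    Rd: int                          # ray depth (recursion level)
    Rk: RayKind                      # kind of ray
    Rc: Tuple[float, float, float]   # color parameters carried by the ray
    PC: Tuple[int, int]              # pixel coordinates the ray contributes to
    PE_id: int                       # processor element where the ray was born
    t_m: float = float("inf")        # nearest intersection parameter found so far

    def update_hit(self, t: float) -> bool:
        """Record a nearer intersection; returns True if t_m was improved."""
        if t < self.t_m:
            self.t_m = t
            return True
        return False
```

As the message travels around the ring, each processor only ever shrinks `t_m`, so the value after a full trip is the global nearest hit.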

For load balancing among processors, an even spatial distribution of objects is desirable in the initial object-allocation stage. However, as it is impossible to predict accurately the size of the computational load before the ray tracing, load imbalance can occur and cause a bottleneck. Our load-balancing strategy is as follows. If a processor's ray queue is partially full (three-fourths of the queue size in our simulation), it is desirable to transfer some of its objects to the next processor. At this time, it is important to decide which objects are to be transferred. A simple rule is to choose any object in the voxel with the largest number of registered objects. The following method is used in our work. If n is the number of voxels and L(i) is the load of the ith voxel, then the total load of a processor can be expressed as Σ_{i=1}^{n} L(i), where L(i) can be approximately represented as follows:

L(i) = Nr(i) × No(i)

where Nr(i) is the number of rays that enter the ith voxel and No(i) is the number of objects in it. Here we assume for simplicity that the complexity of the intersection calculation is the same for all objects. This assumption seems sufficient for now, although, strictly speaking, L(i) should reflect the computational complexity of each object by using different weights. If it is necessary to send some objects from an overloaded processor to another, the sending processor first determines the voxel i with the largest L(i) value of all the voxels, and then sends an object selected arbitrarily from those in the ith voxel to the next processor. At this point, it is necessary to point out the possibility of some required intersection calculations being missed. For example, as shown in


Fig. 4, ray R starts its round trip at time step 0 in P2. When R is returned to P2 as a complete ray at time step 4, it is assumed that all the intersections between R and the objects have been calculated. However, if object g in P1 is sent to P2 at time step 3 due to an overload of P1, then the intersection between R and g has not been calculated. To solve this problem, P2 is not allowed to transfer g to the next processor until R is returned to P2, i.e., to its origin. Additional intersections between R and g are only calculated when R has returned. As a general rule, the processor at which a ray has started its journey does not transfer objects imported from the preceding processor until the ray has returned. To achieve this, each processor maintains an imported object table (IOT) that registers the objects sent from the preceding processor and the time of receipt. When a ray message arrives at the processor as a complete ray, the processor compares the object's time of receipt in the IOT with the time of generation of the returned ray message. If the object receipt time is greater, i.e., later than the generation time of the ray message, additional intersections between this object and the ray need to be calculated. Otherwise, no additional intersections for this ray are calculated, and the times of succeeding complete rays need not be compared, because the processor's ray message queue acts in a first-in, first-out (FIFO) fashion. Thus, this object is cleared from the IOT. Our load-balancing strategy is called dynamic object transfer (DOT) in the following discussion.
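Under the stated simplification, the transfer decision reduces to picking the voxel with the largest estimated load L(i) = Nr(i) × No(i) and exporting one of its objects, while the IOT check compares logical timestamps. The data layout, the integer logical clock, and the function names below are schematic assumptions of ours (the trigger, a three-fourths-full ray queue, stays outside this sketch):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VoxelStats:
    n_rays: int          # Nr(i): rays that have entered voxel i so far
    objects: List[str]   # objects currently registered in voxel i

def load(v: VoxelStats) -> int:
    """L(i) = Nr(i) x No(i), assuming equal intersection cost per object."""
    return v.n_rays * len(v.objects)

def pick_object_to_transfer(voxels: Dict[int, VoxelStats]) -> str:
    """Choose any object from the voxel with the largest load
    (only voxels that actually contain objects qualify)."""
    i = max((k for k in voxels if voxels[k].objects), key=lambda k: load(voxels[k]))
    return voxels[i].objects.pop()

# Imported object table: object name -> logical time of receipt. A returning
# complete ray needs extra intersections with an object that arrived after
# the ray was generated.
def needs_extra_intersection(iot: Dict[str, int], obj: str, ray_birth: int) -> bool:
    return obj in iot and iot[obj] > ray_birth
```

In the paper's Fig. 4 scenario, object g received at time step 3 would trigger `needs_extra_intersection` for a ray generated at time step 0.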

4 Animation

For animation, ray dataflow systems must reallocate objects at the start of each frame because the processors' objects change when objects move. In the object dataflow system, the cache is missed in most processors when the processing frame is changed, which causes performance degradation. In the proposed scheme, however, object allocation at the start of each frame is not necessary, because each processor is assigned objects that are not confined to a subregion, as in ray dataflow systems, but rather spread over the whole object space. Moreover, the secondary data structure must be reconstructed in a ray dataflow system with the tree of extents. That is not necessary here


because the proposed system uses uniform space subdivision as the secondary data structure. Each processor just updates the position of its moving objects, if any, at the start of each frame. For synthesizing the next frame, the processor that finishes the current frame updates the position of its moving objects, generates a ray message for the next frame, and then traces the ray. For updating the positions of moving objects, the information on the motion of the objects, including those that appear in the course of the animation, if


Fig. 4. Example of a ray/object intersection calculation with dynamic object transfer (DOT)

any, is allocated to each processor in the initial distribution stage. Load imbalance, which is caused by moving objects, can be solved by our load-balancing strategy, as explained in the previous section.
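Per-frame work in the proposed scheme thus amounts to each processor updating the positions of its own moving objects before tracing resumes; no cross-processor reallocation or grid rebuild is needed. A schematic of that update, with motion given as a per-object function of the frame index (an assumption of ours, standing in for whatever motion description is distributed initially):

```python
from typing import Callable, Dict, Tuple

Vec = Tuple[float, float, float]
Motion = Callable[[int], Vec]        # frame index -> object position

def advance_frame(positions: Dict[str, Vec],
                  motions: Dict[str, Motion],
                  frame: int) -> None:
    """Update only this processor's moving objects; static objects
    (those without a motion entry) keep their positions."""
    for name, motion in motions.items():
        positions[name] = motion(frame)

# Usage: a processor owning one static and one moving object.
positions = {"table": (0.0, 0.0, 0.0), "ball": (0.0, 1.0, 0.0)}
motions = {"ball": lambda f: (0.1 * f, 1.0, 0.0)}   # the ball slides along x
for frame in range(1, 4):
    advance_frame(positions, motions, frame)
    # ... re-register "ball" in the local uniform grid, then trace the frame
```

Only the moving object's grid registration changes between frames, which is why the scheme avoids the per-frame initialization cost of the ray dataflow systems.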

5 Simulation results

Most algorithms developed in the past were based on simulations rather than implementations on


Fig. 5. Char image
Fig. 6. Ducks image
Fig. 7. Balls image

Table 2. Complexities of the test images

Model   Number of polygons   Number of spheres   Number of voxels   Number of rays cast
Char    4 321                4                   42 300             881 594
Ducks   2 554                1                   42 336             1 438 054
Balls   0                    1 000               18 144             1 286 372

actual multiprocessors (Whitman 1992). The proposed hardware architecture, along with the algorithm, was simulated using the Parallel Virtual Machine (PVM) library (Geist et al. 1994) on a network of eight SPARCstation 20s, each having identical computing power and resources. The PVM manages all the required data conversion and provides routines for packing


and sending messages between processes on one processor or across network processors. There are several problems with this simulation model. Namely, the communication time depends on the network load, and the performance of the processors on the network is not identical, because the number of processes managed by each processor is not the same. These factors make our simulation system a heterogeneous parallel system. To avoid this situation, the simulation was performed under the condition that most of the computing resources in the network were not shared by other applications. However, if performance degradation of several processors occurs for some reason, the resulting bottleneck can be settled by DOT. Although these problems exist, the simulation model can reflect realities such as communication


time and connectivity issues. Three test scenes sized 256 × 225 were used to evaluate the algorithm. The simulated rendering images of these scenes are shown in Figs. 5, 6, and 7. Table 2 gives the details of each scene. The average bandwidth of the network is 55 kByte/s, and the size of the ray message queue is 128. We now consider the behavior of the algorithm when the number of processors is varied. Figure 8 shows the speed-up of the three test scenes as a function of the number of processors. In the proposed ring-connected architecture, the total number of intersection calculations actually increases as the number of processors increases. This increase is due to redundant calculations in multiple processors, which are mostly rejected in uniprocessor systems by the secondary data structure. Table 3 shows the percentage increase of the number of intersection calculations compared to the uniprocessor case. Table 4 shows the efficiency of the system with DOT. The efficiency e is defined as:

e = Tup / (Tmp × Np) × 100
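With Tup the uniprocessor time, Tmp the multiprocessor time, and Np the number of processors, the efficiency in percent is simply the speed-up divided by Np. A one-line check, with timings invented purely for illustration (not measurements from the paper):

```python
def efficiency(t_up: float, t_mp: float, n_p: int) -> float:
    """e = Tup / (Tmp * Np) * 100, in percent."""
    return t_up / (t_mp * n_p) * 100.0

# A hypothetical run: 800 s on one processor and 160 s on 8 processors
# is a speed-up of 5.0, i.e., an efficiency of 62.5%.
print(efficiency(800.0, 160.0, 8))   # 62.5
```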

Fig. 8. The speed-up factor of the proposed scheme versus the number of processors for the test images

Table 3. Percentage increase of the number of intersection calculations for various test images as a function of the number of processors

Number of processors   Char (%)   Ducks (%)   Balls (%)
1                      0.0        0.0         0.0
2                      7.9        4.9         4.5
4                      10.3       7.6         9.3
8                      15.2       12.5        11.2

Table 4. Processor efficiency as a function of the number of processors for various test images using dynamic object transfer (DOT)

Number of processors   Char (%)   Ducks (%)   Balls (%)
2                      82         85          89
4                      73         76          79
8                      63         68          75

where Tup is the processing time in the uniprocessor system, Tmp is the processing time in the multiprocessor system, and Np is the number of processors. The table shows that the efficiency decreases as the number of processors increases. This is mainly due to the increase in the number of intersection calculations.

6 Conclusion

We have described a parallel ray-tracing system with a moderate number of processors that uses an object-decomposition scheme. In the object-decomposition scheme, each processor is assigned objects not confined to a local subregion, but rather spread over the whole object space. Each processor individually executes a ray-tracing algorithm and obtains a partial result. The results are then combined to form the final image.


Processors organized in a ring architecture are useful not only for image composition, but also for preventing redundant calculations. If a processor has a predominantly larger computational load than the other processors, load imbalance can occur, causing a bottleneck. To solve this problem, we proposed the dynamic object transfer (DOT) scheme, in which objects of an overloaded processor are transferred to the next processor. The proposed architecture exhibits a speed-up factor almost proportional to the number of processors, according to the results obtained for several test images. The proposed scheme is very suitable for synthesizing animated sequences of images because objects need not be reallocated at the start of every new frame. However, the proposed system has several disadvantages. It is not suitable for large parallel systems, because parallel processing systems with a ring topology suffer from the latency of a trip around the ring, which restricts the scalability of the system. Moreover, if the queue of a processor is full when the preceding processor is ready to send a ray message, the preceding processor must remain idle until the queue has room. Ways must be found to overcome this last problem if system performance is to be improved. These ways are being investigated.

References

1. Badouel D, Bouatouch K, Priol T (1994) Distributing data and control for ray tracing in parallel. IEEE Comput Graph Appl 14:69-77
2. Cleary JG, Wyvill G (1988) Analysis of an algorithm for fast ray tracing using uniform space subdivision. Vis Comput 4:65-83
3. Dew PM, Earnshaw RA, Heywood TR (1989) Parallel processing for computer vision and display. Addison-Wesley, Reading, Mass
4. Dippé M, Swensen J (1984) An adaptive subdivision algorithm and parallel architecture for realistic image synthesis. Comput Graph 18:149-158
5. Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam V (1994) PVM: Parallel Virtual Machine; a user's guide and tutorial for networked parallel computing. MIT Press, Cambridge, Mass
6. Glassner AS (1984) Space subdivision for fast ray tracing. IEEE Comput Graph Appl 4:15-22
7. Glassner AS (1989) An introduction to ray tracing. Academic Press, San Diego, Calif
8. Goldsmith J, Salmon J (1987) Automatic creation of object hierarchies for ray tracing. IEEE Comput Graph Appl 7:14-20


9. Green S, Paddon D (1989) Exploiting coherence for multiprocessor ray tracing. IEEE Comput Graph Appl 9:12-26
10. Green S, Paddon D (1990) A highly flexible multiprocessor solution for ray tracing. Vis Comput 6:62-73
11. Hirota K, Murakami K (1990) Incremental ray tracing. Proceedings of the Eurographics Workshop on Photosimulation, Realism, and Physics in Computer Graphics, France, pp 15-29
12. Horiguchi S, Katahira M, Nakada T (1993) Parallel processing of incremental ray tracing on a shared-memory multiprocessor. Vis Comput 9:371-380
13. Kay TL, Kajiya JT (1986) Ray tracing complex scenes. Comput Graph 20:269-278
14. Kobayashi H, Nishimura S, Kubota H, Nakamura T, Shigei Y (1988) Load balancing strategies for a parallel ray-tracing system based on constant subdivision. Vis Comput 4:197-209
15. Kobayashi H, Kubota H, Horiguchi S, Nakamura T (1989) Effective parallel processing for synthesizing continuous images. Proceedings of Computer Graphics International '89, Springer-Verlag, pp 343-351
16. Lin T, Slater M (1991) Stochastic ray tracing using SIMD processor arrays. Vis Comput 7:187-199
17. Molnar S, Eyles J, Poulton J (1992) PixelFlow: high-speed rendering using image composition. Comput Graph 26:231-240
18. Priol T, Bouatouch K (1989) Static load balancing for a parallel ray tracing on a MIMD hypercube. Vis Comput 5:109-119
19. Scherson ID, Caspary E (1988) Multiprocessing for ray tracing: a hierarchical self-balancing approach. Vis Comput 4:188-196
20. Watt A (1989) Three-dimensional computer graphics. Addison-Wesley, Reading, Mass
21. Whitman S (1992) Multiprocessor methods for computer graphics rendering. Jones and Bartlett, Boston
22. Whitted T (1980) An improved illumination model for shaded display. Commun ACM 23:343-349

HYUN-JOON KIM received his BS in Electronics Engineering from Ajou University, Korea, in 1989, and his MS and PhD from the Department of Electrical Engineering at the Korea Advanced Institute of Science and Technology (KAIST) in 1991 and 1996, respectively. He is now a senior research engineer at the LG Electronics Research Center. His current research interests include rendering in multicomputer environments and motion control for computer animation.

CHONG-MIN KYUNG received his BS in Electronics Engineering from Seoul National University, Korea, in 1975, and his MS and PhD from the Department of Electrical Engineering at KAIST in 1977 and 1981, respectively. From 1981 to 1983 he worked at Bell Telephone Laboratories, USA, in the area of semiconductor devices and process modeling. He joined the Department of Electrical Engineering, KAIST, in 1983, where he is now a Professor. From 1988 to 1989, he was a visiting Professor at the University of Karlsruhe, Germany. His current research interests include CAD algorithms for VLSI, computer graphics, and VLSI architectures for microprocessors and DSPs.
