ParaGL: Real-Time Parallel Rendering of Complex Polygonal Data-Sets

Students:

Introduction

The exponential growth in the rendering rate of commodity PC graphics accelerators, has enabled the creation of high performance rendering platforms based on commodity clusters. When combined with a high bandwidth switching interconnect, it is theoretically possible to achieve real-time rendering at frame rates rivalling custom hardware rendering platforms. Our research focuses on building such a high performance parallel rendering platform to support the following three performance objectives:

System Architecture

We use a master-slave architecture whose characteristics fall under the sort first category of parallel rendering. In this approach, we treat the rendering pipeline (OpenGL) as a "black-box", which allows for future upgrades in hardware and/or software drivers without affecting the functionality of our system. There are 3 stages in our parallel rendering system:

Scene Partitioning

Our scene partitioning strategy relies heavily on the use of spatial data structures. Spatial data structures are an efficient means of organizing a scene database as well as provide a fast means of determining the visible geometry for a given frame. Our algorithm therefore uses an oct-tree spatial data structure to maximize the efficiency of our rendering pipeline. The oct-tree is very well suited for generic polygonal datasets and is easy to pre-calculate and store on disk. It is also well suited for out-of-core rendering (Section: Out-of-Core). In addition, the oct-tree representation provides a scalable path for sort-first parallel rendering. As the number of render nodes increases, the viewing volume per node decreases, thus allowing for more object level culling which greatly speeds the rendering stage of our system. It is also important to point out, that the oct-tree is shared amongst all render nodes. Our system currently uses a shared file system, as it is easier than managing multiple copies of the oct-tree across nodes. Storing the oct-tree is this fashion removes the need of our system to transmit sections of the database across nodes, however, for certain dynamic scenes, object state is communicated or broadcast amongst each render node.

Reconfigurable Rendering

Our current implementation uses a 32 node Intel Xeon cluster, with each node containing a 1.5GHz CPU, a 64 bit 66Mhz PCI bus and a GeForce2 Ultra chipset based graphics card on a 4X AGP bus. The nodes are interconnected over a switched Myrinet network, which provides 4Gbps of full-duplex bandwidth per node. The cluster is designed to drive a variety of display devices, including mono display monitors, stereo devices, head mounted displays, virtual workbenches and CAVE facilities.

To achieve this goal, our system partitions the viewing volume or viewing frustum into regions corresponding to the geographical position of the actual display devices in the system. These partitions are then assigned to master nodes attached to each display device. Slave nodes are then initially assigned in a round robin fashion to each master node in the system. All master nodes then partition their own viewing volume into regions whose size depends on the number of slave nodes assigned. This results in a smaller viewing volume for each slave node.

Rendering and Communication

For each render node in the system, there are three processors: Main CPU, GPU (graphics processing unit), and the LANai CPU (dedicated processor on the Myrinet network interface card). It is critical that the system keep each processor active thus minimizing stalls and flushes which can drastically decrease the performance of the system. Our solution to maximize processor activity is though the use of a two stage rendering pipeline with each stage running in its own thread to maximize concurrency. The first stage is the actual render stage. Each node renders the subset of the scene database corresponding to its viewing volume. The second stage is then responsible for reading the rendered image and sending it across the network. Note, that most of the data transmission in the second stage of the pipeline in over the PCI bus and network media, where as the first stage of the pipeline is initially over the AGP bus, and largely over the high bandwidth data bus on the graphics card. This property provides the needed bandwidth for each stage in the pipeline. The main CPU effectively acts as a controller for channeling data from the graphics card to the network and assuring that there is a constant flow of data from/to the graphics board's GPU to/from the network interface (myrinet device). In this 2-stage pipeline design, the GPU is actually processing the next frame while the current frame is transmitted across the network.

Composition

The composition stage of our parallel rendering system occurs on each master node. Slave nodes are only responsible for subsets of the scene and the results are therefore not displayed. The master nodes, however, are connected to an external display device and must display and entire frame. Our methods for scene composition are greatly simplified based on the properties of a sort first parallel rendering algorithm. There is no overlap in the resulting frame of each slave node (in image space) because its viewport is calculated and corresponds to its viewing volume (in object space). Based on the aforementioned property, each master node can thus operate entirely in image space allowing the master nodes to merely copy a slave node's sub-image to the framebuffer. Again, multi-threading is used to achieve maximum processor parallelism during the scene composition phase.

Dynamic Load Balancing

In our system, load-balancing occurs at two different levels. In the first level - called intra-master load balancing - , we are concerned with load-balancing between the slave nodes assinged to a single master node. In this scheme, slave nodes are initially allocated constant area view frustums in round-robin fashion. This is however, not suitable for continuous movement through a scene due to the fact that polygonal density is not constant. Our system therefore supports a dynamic load balancing scheme to adaptively reconfigure itself depending on scene complexity (polygonal density) during a given frame or sequence of frames. The process proceeds as follows. Every set number of frames (operation could be performed each frame) the master node queries its slave nodes for its average render time. Based on the timings obtained from each slave node, the master node then resizes their viewing volumes. The new size of a slave node's viewing volume is directly proportional to the time taken for it to render its frame with respect to the time taken to render the complete frame. The new viewing volume is then propagated from the master node to its slave nodes and the system then proceeds with rendering the next frame.

The second level of load balancing - called inter-master load balancing - occurs across multiple masters involved in a driving a multi-display device. In the initial allocation, slave nodes are split equally among all master nodes, which may not be optimal, given the varying polygon density in the view frustums of each master. To load balance this case, master nodes are queried at the end of each frame or set of frames to determine the average rendering time for each master. Slave nodes are then reassigned from lightly loaded masters to heavily loaded master nodes. This in turn results in an intra-master load balancing operation and the system then proceeds to the next frame.

Out-Of-Core Rendering

It is important for our system to support an out-of-core rendering strategy for scene databases too large to fit into main memory. This is most likely the case as our system is designed to support scenes with several million polygons. Our methods rely on a preprocessing technique which constructs an oct-tree from a scene database. Current file formats supported are the 3d Studio (.3ds) and Stanford (.ply) data files. The internal node structure is then loaded into the system at run time where as the leaf nodes which contain the actual polygonal data are read from a file when needed. A caching system is employed to minimize disk access times. At the present time, a simple look-ahead method of caching is performed based on the current location in the scene and direction traveled. Future work includes the use of more advanced visibility testing and out-of-core methods to achieve better performance and reduce disk access times.

Project Status (10/15/2002)