Time is a wonderful teacher. Unfortunately, it kills all of its students.
The 3D Studio MAX R2 Display Architecture
Don Brittain, Ph.D., for Yost Group, Inc.
The MAX R2 Display Pipeline: Rough Pass
2D Display Cards
Which is Fastest?
Optimal Features for an OpenGL Display Card Driver
OpenGL Buffer Region Extension
Direct3D Display Driver Optimizations
With 3D Studio MAX, overall productivity is largely tied to a user's ability to manipulate 3D data quickly and smoothly. Thus, almost all users are interested in what system configuration will lead to the most interactive "bang for the buck".
This article will attempt to describe the MAX R2 display architecture in such a way that users can figure out what type and level of display acceleration will best meet their needs.
Please note that the MAX R2 display pipeline is a significantly enhanced superset of the MAX R1.x pipeline. Thus, if you are using MAX R1.x (or one of the 1.x derivative products, such as 3D Studio VIZ or 3D Studio Apprentice), then the information here will not help. Rather, you should consult my earlier article Shedding Light on MAX Benchmarks: An Analysis of Interactive Performance With 3D Studio MAX.
The MAX R2 Display Pipeline: Rough Pass
In order to get 3D models to appear on a 2D computer screen, there are two major computational steps.
In the case of a shaded, textured object, the first step involves lighting, transforming, and clipping each geometric primitive (triangles, in the case of MAX). The second step involves filling in the projected triangles (with color ramps), interpolating depth values, and performing texture lookup.
Each of these steps can be computationally intensive, and either step (or both!) could be bottlenecks slowing down your use of MAX.
For future reference, Step 1 is referred to as geometry acceleration (and it includes transforming vertices, clipping them to the 3D view volume, and calculating the illumination at each vertex). Step 2 is referred to as rasterization. It is really a 2-and-a-half dimensional process, in that interpolating and filling in color, depth, and texture values are operations that happen per 2D pixel (as opposed to the Step 1 calculations which happen per 3D vertex).
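As a concrete (and highly simplified) sketch of the split, Step 1 can be thought of as per-vertex work and Step 2 as per-pixel work. The names and layout below are purely illustrative, not MAX's actual internals:

```c
/* A minimal sketch of the two pipeline steps (illustrative only). */

typedef struct { float x, y, z, w; } Vec4;

/* Step 1 (geometry): transform an object-space vertex by a 4x4
   row-major matrix. Clipping and per-vertex lighting belong to
   this stage as well; cost scales with the number of vertices. */
Vec4 transform(const float m[16], Vec4 v)
{
    Vec4 r;
    r.x = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
    r.y = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
    r.z = m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w;
    r.w = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
    return r;
}

/* Step 2 (rasterization) then runs once per covered pixel:
   interpolate color, depth, and texture values across the projected
   triangle. Its cost scales with pixel count, not vertex count. */
```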
Display cards handle both, one, or neither of these steps.
2D Display Cards
The simplest case is where the display card doesn't accelerate either step. Such cards are called "2D" cards, or "dumb" cards, or perhaps "Windows accelerators", because they do not offload any of the 3D processing from the main CPU. In this case, the MAX software handles all of the interactive rendering tasks except the actual display of the resulting image. An off-screen bitmap is created as the output of the rendering pipeline, and when MAX has finished filling in all the pixels, the image is blitted (copied) to the screen.
Since all display cards (even fancy, high-end 3D cards!) can act as dumb 2D cards, we made it a priority to make this process as efficient as possible. To this end, we use multiple threads of execution within MAX to largely uncouple step 1 from step 2, thereby allowing two processors (within multi-processing NT systems) to work in parallel. In effect, what this does is allow the second CPU to act as a rasterization processor, thereby allowing a 2-processor / dumb display card system to emulate a single processor system with a rasterization accelerator.
For completeness, it should be noted that there are other ways that this part of the pipeline is optimized, including early rejection of out-of-view data, lazy evaluation of lighting calculations, caching of shared computation results, and multiple execution threads to handle the geometric transformations of the scene database.
This leads us naturally to the next level of display acceleration: rasterization accelerators.
These cards handle all of Step 2 above. Once the 3D data is converted into 2D device space (with colors, and possibly depth values and texture coordinates), it is handed over to the rasterization processor on the display card. This processor then computes all of the intermediate color and depth values, which are recorded right into display card memory.
Most 3D cards available today are actually rasterization-only accelerators, though some of these accelerators can handle natural window coordinate data directly, whereas other (slower) cards require the data to be massaged into a special (and card-specific) format before the fast rasterization hardware takes over. (This extra formatting is called "triangle setup processing", and cards with hardware triangle setup naturally perform better than those without.)
In order to optimize MAX's use of all rasterization cards, we apply all the optimizations mentioned in the last paragraph of the 2D Display Cards section, and we also convert all possible data into triangle strips, which minimizes the communication and computational overhead of getting data to and through the rasterization processor. (Basically, if triangles are sent in strips, where each new vertex implicitly defines a whole new triangle by using it in combination with the previous 2 vertices, we can send data to the rasterization processor faster. This allows an optional second CPU to work more on the Step 1 part of the pipeline.)
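The arithmetic behind that saving is easy to check. These helpers are hypothetical, but the counts follow the strip rule just described (a strip of n triangles carries n + 2 vertices):

```c
/* Vertices submitted for n triangles sent as one strip:
   2 starting vertices, then 1 new vertex per triangle. */
long strip_vertices(long triangles) { return triangles + 2; }

/* Vertices submitted for the same n triangles sent individually. */
long list_vertices(long triangles)  { return 3 * triangles; }
```

For 50 triangles, a strip submits 52 vertices instead of 150, roughly a 3x reduction in traffic to the rasterization processor.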
It should be noted that, as with 2D display cards, Step 1 and Step 2 are decoupled by separate threads so that the system can perform automatic load balancing of all available processors (one or two CPUs plus the rasterization processor).
The most sophisticated (and, at this point, rarest) type of 3D display card is one with geometry acceleration built-in. Cards in this category are true 3D cards, in that they can work with data in 3D space: they transform it, clip it, light it, and rasterize it. In other words, they handle all of Step 1 and all of Step 2.
In order to take full advantage of geometry acceleration, MAX converts all appropriate scene data into a form that is easy and efficient to feed to the accelerator. This involves turning the 3D scene database into 3D triangle strips with surface normal vectors for each vertex. (To do this efficiently, the process of constructing the normals is handled by multiple threads, and the strip data is preserved by the MAX dataflow pipeline.)
As with the other two scenarios, multiple execution threads are used to decouple the various computational stages, so that automatic load balancing can occur. This is especially effective when MAX is generating new (procedural) data on each frame update: then each host CPU can work on creating the data, and the geometry and rasterization processors can work on displaying the data.
Which is Fastest?
The natural question now is: "Which of these setups provides the fastest display path for MAX?" And the truthful (and admittedly somewhat unhelpful) answer is "it depends".
There are two types of bottlenecks that can slow down the pipeline. One is computational overload, and the other is communication limitations.
For a very simple scene, such as a single big cube and one light, the overhead of step 1 is minimal, since it only involves transforming, clipping, and lighting a very small number of vertices or triangles. But if the scene is displayed in a large viewport, it could potentially involve filling in hundreds of thousands (or even millions) of pixels, so the rasterization process (step 2) is quite expensive.
Thus, in this case, a rasterization accelerator would result in much faster system throughput.
Also note that if that single cube were textured, the step 1 calculations would go up only slightly, whereas step 2 would be much more complex. (There would be additional texture vertices to interpolate, and the actual texel values would have to be read and/or computed and potentially masked against, or blended with, the interpolated color values.)
So, in this case, a rasterization accelerator that supports texturing would make an even bigger difference.
It should be noted that, despite the fact that a single cube is too simple to be considered a likely real-life scene, this case is very much representative of most 3D games: geometric complexity is kept to a minimum (i.e. there is a low vertex/triangle count) and the game is made more visually complex and interesting by extensive use of textures. Indeed, most 3D game cards are simply rasterization accelerators (as opposed to geometry accelerators), so they do, indeed, provide excellent acceleration for such scenes.
At the opposite end of the spectrum is a geometrically complex scene (say, about 1 million vertices and triangles), displayed in a quarter-screen viewport. In this case, step 1 is quite involved, whereas step 2 is quite simple. (Indeed, there is quite likely very little 2D interpolation to do, since a transformed triangle may be only one pixel in size! And besides, even if the viewport is about 500 x 400 pixels in size, that "only" amounts to 200,000 pixels, which is significantly less than the amount of data step 1 must deal with.)
Thus, for such a scene, rasterization acceleration is not particularly helpful, since rasterization time is not limiting the throughput. And, indeed, a geometrically-large scene takes about the same amount of time to display using the HEIDI software z-buffer as with a hardware rasterization card.
Note that this illustrates, full-circle, the symbiosis between geometrically simple / texturally complex games and existing 3D game display cards: the fact is, game cards provide no benefit at all to displaying geometrically complex scenes (and hence it is somewhat a self-fulfilling prophecy that 3D games use simple geometry).
Finally, a quick comment on 3D APIs vs. rasterization-only APIs with regard to efficiency of handling data. When dealing with a rasterization-only driver interface, MAX employs a very efficient transformation-and-lighting algorithm that results in each vertex of a mesh being transformed only once and lit at most once, independent of the structure of the mesh. (Vertices that are clipped or culled are never lit.) In contrast, 3D APIs have to transform each vertex passed to them, even if that vertex has been "seen" before.
To illustrate the effect of this, suppose we have a 10,000 face / 5,000 vertex model that consists of 200 triangle strips containing 50 triangles each. (This would be considered very efficient stripping.) Then, in the rasterization-only case, MAX would transform a total of 5,000 vertices, whereas the 3D API would have to transform 200 (strips) x 52 (vertices per strip) = 10,400 vertices. Thus, a geometry accelerator would have to be more than twice as fast as the internal MAX computations just to break even!
Let's look at the extreme cases: With best-case "perfect stripping" (1 strip with 10,000 triangles), the 3D API still has to deal with 10,002 vertices (more than 2X the rasterization case), and in the worst case, where each "strip" contains only one triangle, the 3D API must transform 10,000 (strips) x 3 (vertices per strip) = 30,000 vertices. This is 6 times more transformations than MAX makes when using a rasterization-only interface!
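The counts in both examples can be reproduced with a hypothetical helper modeling the 3D API path, where every vertex of every strip is transformed again:

```c
/* Transforms performed by a 3D API for a stripped mesh:
   each strip of t triangles submits t + 2 vertices, and the API
   transforms every submitted vertex, shared or not. */
long api_transforms(long strips, long tris_per_strip)
{
    return strips * (tris_per_strip + 2);
}

/* The rasterization-only path transforms each unique vertex exactly
   once, so its count is simply the mesh's vertex count (5,000 in the
   example above), regardless of how the mesh is stripped. */
```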
Since geometry accelerators are special-purpose devices, they are usually substantially more efficient at transformations than the CPU-implemented code inside MAX, but this does illustrate the uphill battle to attain high-speed 3D nirvana.
Communication with the display hardware is the next big performance-critical area, and that brings us naturally to the next section.
Let's again consider a million-vertex scene being displayed in a 500x400 pixel viewport. If we download the scene to a geometry accelerator, we need to transfer each vertex along with a surface normal. This amounts to 24 million bytes of information (4 bytes for each floating point x, y, and z vertex value, and 4 bytes for each normal vector component). Other data (such as light position and color, material descriptions, texture map images, world space locations, etc.) also have to be transferred to the accelerator card, but this additional information is, with the exception of texture images, relatively small. Except in the case of animating textures, we do not need to send texture maps to the accelerator for each viewport update, so they can also be factored out of this analysis.
OK, so we're about to send 24Meg of data to a geometry accelerator card. But how does it get there? Unfortunately, it usually has to be transferred across the system bus. Now, apart from mechanical devices (like floppy, CD, and hard disk drives), the bus is one of the slowest parts of the system. So this is bad news!
But how bad? Well, a "high-speed" PCI bus runs at 66MHz. In practice, a bus running at this speed is doing well if it can actually sustain a transfer rate of 20Meg per second. Thus, assuming no overhead in MAX or the display card, it will take us over 1 second just to send the scene data to the card! So, getting a 30 frames-per-second update rate on this scene ain't gonna happen! (Note: some of the latest PCI chipsets have significantly better throughput, but if you have the resources available, it is best to actually measure the sustained transfer rate, rather than trust the almost-always-overstated specs.)
Note that if MAX computes all of the scene lighting, clipping, rasterization, etc, then only 500x400x3=600,000 bytes must be transferred across the system bus to produce a truecolor image on a dumb display card. This is 40 times less data hitting the slow part of the system!
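The byte counts above can be verified with a little arithmetic. (The 24-bytes-per-vertex layout and 20Meg/sec sustained rate are the assumptions from the text, not measured values.)

```c
/* Bus traffic estimates for the two display strategies (sketch). */

/* Geometry path: xyz position + xyz normal, 4-byte floats each. */
long geometry_bytes(long vertices) { return vertices * 24; }

/* Dumb-card path: one finished truecolor pixel is 3 bytes. */
long image_bytes(long width, long height)
{
    return (long)width * height * 3;
}

/* Transfer time at a sustained rate, taking "Meg" as 10^6 bytes. */
double seconds_on_bus(long bytes, double meg_per_sec)
{
    return bytes / (meg_per_sec * 1000000.0);
}
```

For the million-vertex scene, the geometry path moves 24,000,000 bytes (1.2 seconds at 20Meg/sec), while the finished 500x400 image is only 600,000 bytes: the 40x difference cited above.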
But the fact remains that geometry accelerator cards do the lighting, clipping, rasterization, etc. much faster than the main CPU can, and while the display card is working on those problems the CPU is freed up to work on other parts of the scene display process. So there is still an overall win in using a geometry accelerator, but this shows why the performance may be somewhat limited.
Note that the amount of data sent across the bus can also limit the overall display speed for rasterization-only cards, since it is possible for geometric data to exceed pixel data in size under these conditions too.
With an understanding of the communication overhead, some ways that MAX can (and does!) make the whole display process more efficient become apparent. Here are a few:
To conclude this section, here's some good news: The Intel AGP (Accelerated Graphics Port) bus has been designed to help alleviate the PCI bus bottleneck inherent in all 3D graphics systems. This bus has been designed with the transfer of graphics-specific information in mind, and it should be "standard equipment" in most new systems by the end of 1997. Virtually all new 3D graphics chipsets have been designed with the AGP bus in mind, so overall 3D performance should improve dramatically in the near future.
There are several factors, other than the overall rate at which triangles can be displayed on the screen, that affect how responsive a 3D program feels to a user. Although these factors rarely show up in benchmarks (whether from card manufacturers, industry trade organizations, or magazines), they can have a great effect on how usable a 3D system is for a given task.
Here are a couple of examples to keep in mind.
As you might guess by now, the systems that allow MAX to do 1 efficiently tend to force 2 to be inefficient, and vice versa. There are, however, some low-level drivers that are clever enough to change how they operate on the fly so that they always work optimally with MAX.
Since updating damaged viewports can be expensive, MAX makes sure that pop-up menus and viewport tooltips handle the repainting efficiently independent of the underlying 3D driver. Unfortunately, moving "heavy weight" windows (like the material editor, trackview, and video post) will force MAX to do a full viewport repair via either a blit or scene re-traversal.
Another "quality of life" driver issue is how much support is provided to allow MAX to perform incremental scene updates. Recall from the previous section that the fastest way for MAX to update a scene is to render only the smallest amount of scene data: generally, only that part of the scene that has changed since the previous frame.
In order to accomplish this, the underlying driver and display hardware have to provide either an efficient way to store a partially-rendered scene (dual planes), or a way for MAX to paint only a rectangular subset of the 3D viewports (incremental viewport update). And just providing a functionally-correct way to do these things is not enough, in that moving any unnecessary data across the bus can cause unnerving "jumps" in interactive performance, which end up making MAX harder to use, even if the average frame rate actually goes up.
This will be discussed in more depth in the next section.
MAX R2 supports dynamically-loadable display drivers. These drivers are linked into MAX at runtime and allow MAX to be efficiently optimized for the underlying display hardware.
In order to optimize the overall system throughput, the driver provides the high-level MAX code with information on what sort of operations it supports. This allows MAX to either perform some of the interactive display calculations itself, or to hand off the calculations directly to the driver.
Some driver-level decisions are speed/quality tradeoffs (e.g. point-sampled vs. mipmapped texel lookup), or affect qualitative factors regarding how the viewport images appear (e.g. anti-aliased lines). The driver has the option of letting the user make decisions about these issues through a driver-specific configuration dialog box.
The configuration dialog box can also allow a user to choose driver options which allow MAX to "break some rules" in order to work optimally with the underlying hardware. This pertains, in particular, to the OpenGL driver, where not enough information about the underlying hardware is available through the API to have MAX make all the decisions itself.
MAX R2 comes with three built-in drivers: HEIDI, OpenGL, and Direct3D. Details about each of these drivers will be presented below. (Since the drivers are dynamically loaded, other drivers may be added over time.)
HEIDI is unique in the broad range of 3D support it provides: it supports primitives at the rasterization (device coordinate) level, at an abstract 2D coordinate level, and at the full 3D scene level. Moreover, it allows hardware display cards to accelerate the API at any of these levels.
The HEIDI driver in MAX R2 uses only the rasterization level API at this point.
Although HEIDI is, itself, customizable through dynamically-loaded, hardware-specific drivers, the only driver that ships with MAX R2 is the software z-buffer driver. This HEIDI driver is unique in that it is hardware-independent. It performs all rasterization operations using the main CPU and then the resulting image is blitted to the screen.
This driver has the following advantages:
But it also has the following disadvantages:
The Microsoft Direct3D API supports both rasterization and 3D scene level calls, although in D3D Version 5 (the only version supported by MAX at this time), the 3D calls can not be accelerated by any underlying display hardware. Thus, as with the HEIDI driver, we chose to use only the rasterization-level API calls.
This driver has the following advantages:
But it has these disadvantages:
The OpenGL API is quite large, but it works only at the 3D scene level. Thus, when running with this driver, MAX hands off all 3D primitives to the OpenGL driver, independent of the level of hardware acceleration actually provided by the display card.
Because of this higher level of abstraction, there are more variables that affect overall performance with this driver. A separate section will discuss how a display card's OpenGL display driver can provide the best support for MAX.
In general, the MAX OpenGL has the following advantages:
But it also has these disadvantages:
Optimal Features for an OpenGL Display Card Driver
Because the OpenGL API is both large and abstract, there are many ways that an OpenGL display card driver can be fully compliant with the OpenGL spec, yet not provide optimal support for MAX. This section discusses details regarding how such drivers can be optimized for best MAX performance.
Before going into the gory details, please note that MAX will run adequately with any OpenGL-compliant driver (version 1.1 or later). The details discussed here are ways to go above and beyond what the spec requires, so that MAX can provide the best possible throughput to the end user.
With that in mind, here are some implementation-specific points:
OpenGL Buffer Region Extension
The OpenGL extension described below, if present, will be used by MAX to implement dual planes under OpenGL. As with all OpenGL extensions under Windows NT, the functions are imported into MAX by calling wglGetProcAddress, and the functions themselves are implemented with the __stdcall calling convention. The presence of this extension is indicated by the keyword "GL_KTX_buffer_region" being present in the string returned by glGetString(GL_EXTENSIONS).
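A client would detect the extension roughly as follows. Only `glGetString` and the keyword come from the text; the parsing helper is a sketch. Note that a bare `strstr` is not safe here, since it could match a prefix of a longer extension name, so the helper checks word boundaries in the space-separated list:

```c
#include <string.h>

/* Return 1 if 'name' appears as a complete, space-delimited token
   in the extension list returned by glGetString(GL_EXTENSIONS). */
int has_gl_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == ext_list) || (p[-1] == ' ');
        int ends   = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends)
            return 1;
        p += len;   /* partial match; keep scanning */
    }
    return 0;
}
```

If `has_gl_extension(extensions, "GL_KTX_buffer_region")` succeeds, each entry point below would then be fetched via `wglGetProcAddress`.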
In an optimal implementation of this extension, the buffer regions are stored in video RAM so that buffer data transfers do not have to cross the system bus. Note that no data in the backing buffers is ever interpreted by MAX; it is just returned to the active image and/or Z buffers later to restore a partially rendered scene without having to actually perform any rendering. Thus, the buffered data should be kept in the native display card format without any translation.
GLuint glNewBufferRegion(GLenum type)
This function creates a new buffer region and returns a handle to it. The type parameter can be one of GL_KTX_FRONT_REGION, GL_KTX_BACK_REGION, GL_KTX_Z_REGION or GL_KTX_STENCIL_REGION. These symbols are defined in the MAX gfx.h header file, but they are simply mapped to 0 through 3 in the order given above. Note that the storage of this region data is implementation specific and the pixel data is not available to the client.
void glDeleteBufferRegion(GLuint region)
This function deletes a buffer region and any associated buffer data.
void glReadBufferRegion(GLuint region, GLint x, GLint y, GLsizei width, GLsizei height)
This function reads buffer data into a region specified by the given region handle. The type of data read depends on the type of the region handle being used. All coordinates are window-based (with the origin at the lower-left, as is common with OpenGL) and attempts to read areas that are clipped by the window bounds fail silently. In MAX, x and y are always 0.
void glDrawBufferRegion(GLuint region, GLint x, GLint y, GLsizei width, GLsizei height, GLint xDest, GLint yDest)
This copies a rectangular region of data back to a display buffer. In other words, it moves previously saved data from the specified region back to its originating buffer. The type of data drawn depends on the type of the region handle being used. The rectangle specified by x, y, width, and height will always lie completely within the rectangle specified by previous calls to glReadBufferRegion. This rectangle is to be placed back into the display buffer at the location specified by xDest and yDest. Attempts to draw sub-regions outside the area of the last buffer region read will fail (silently). In MAX, xDest and yDest are always equal to x and y, respectively.
GLuint glBufferRegionEnabled(void)
This routine returns 1 (TRUE) if MAX should use the buffer region extension, and 0 (FALSE) if MAX shouldn't. This call is here so that if a single display driver supports a family of display cards with varying functionality and onboard memory, the extension can be implemented yet only used if a given display card could benefit from its use. In particular, if a given display card does not have enough memory to efficiently support the buffer region extension, then this call should return FALSE. (Even for cards with lots of memory, whether or not to enable the extension could be left up to the end-user through a configuration option available through a manufacturer's addition to the Windows tabbed Display Properties dialog. Then, those users who like to have as much video memory available for textures as possible could disable the option, or other users who work with large scene databases but not lots of textures could explicitly enable the extension.)
Buffer region data is stored per window. Any context associated with the window can access the buffer regions for that window. Buffer regions are cleaned up on deletion of the window.
MAX uses the buffer region calls to squirrel away complete copies of each viewports image and Z buffers. Then, when a rectangular region of the screen must be updated because "foreground" objects have moved, that subregion is moved from "storage" back to the image and Z buffers used for scene display. MAX then renders the objects that have moved to complete the update of the viewport display.
Direct3D Display Driver Optimizations
Because most of the early applications designed to take advantage of the Direct3D API were games, most D3D drivers are currently optimized for getting very high-throughput from scenes found in typical 3D games. These scenes typically have a low polygon count and are rendered as shaded, textured polygons in a full screen display mode.
MAX, of course, is a Windows program that requires a standard Windows UI to be displayed under typical usage (Expert Mode notwithstanding!). Thus, MAX opens the primary DirectX display surface in "cooperative" mode. Moreover, since MAX does not permit overlapping 3D viewports, we can allocate backbuffer and Z-buffer resources in a somewhat unusual, but very efficient manner: we request a single backbuffer and a single z-buffer underlying the entire primary display surface, and then we manage the individual viewport drawing regions ourselves. (This differs from both the OpenGL and HEIDI drivers -- they treat each viewport as a totally separate 3D window having its own backbuffer and z-buffer.)
All viewport updates are done by blitting from the backbuffer to the primary display surface. Since MAX tracks the scene's damaged rectangles on a per-view basis, we only blit the smallest rectangular region of a viewport that has been changed. This allows for efficient incremental scene updates.
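A sketch of that damage tracking (the `Rect` type and helpers are hypothetical, not the driver's actual code): each change to a viewport is unioned into a per-view dirty rectangle, and only that rectangle is blitted to the primary surface.

```c
/* Half-open pixel rectangle: [x0,x1) x [y0,y1). */
typedef struct { int x0, y0, x1, y1; } Rect;

/* Grow the accumulated dirty region to cover a new damaged area. */
Rect rect_union(Rect a, Rect b)
{
    Rect r;
    r.x0 = a.x0 < b.x0 ? a.x0 : b.x0;
    r.y0 = a.y0 < b.y0 ? a.y0 : b.y0;
    r.x1 = a.x1 > b.x1 ? a.x1 : b.x1;
    r.y1 = a.y1 > b.y1 ? a.y1 : b.y1;
    return r;
}

/* Pixels that would be blitted; usually far fewer than the full view. */
long rect_pixels(Rect r) { return (long)(r.x1 - r.x0) * (r.y1 - r.y0); }
```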
The MAX Direct3D driver uses DrawPrimitive calls (as opposed to execute buffers) for all primitives.
MAX handles all 3D->2D transformations and clipping, and does all lighting using a lazy evaluation algorithm. This means that only primitives that actually need to get rendered will be handed off to the D3D driver. Thus, all primitives are rendered with D3D clipping turned off.
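The text does not show MAX's actual clip test, but a standard trivial-reject against the view volume in homogeneous clip space (one way only visible primitives end up lit and handed to D3D) would look something like this sketch:

```c
/* Vertex in homogeneous clip space, where the view volume is
   -w <= x,y,z <= w.  Purely illustrative; not MAX's code. */
typedef struct { float x, y, z, w; } ClipVert;

/* If all three vertices lie outside the same clip plane, the
   triangle cannot intersect the view volume, so it is rejected
   before any lighting or driver calls are made. */
int tri_trivially_rejected(ClipVert a, ClipVert b, ClipVert c)
{
    if (a.x >  a.w && b.x >  b.w && c.x >  c.w) return 1;  /* right  */
    if (a.x < -a.w && b.x < -b.w && c.x < -c.w) return 1;  /* left   */
    if (a.y >  a.w && b.y >  b.w && c.y >  c.w) return 1;  /* top    */
    if (a.y < -a.w && b.y < -b.w && c.y < -c.w) return 1;  /* bottom */
    if (a.z >  a.w && b.z >  b.w && c.z >  c.w) return 1;  /* far    */
    if (a.z < -a.w && b.z < -b.w && c.z < -c.w) return 1;  /* near   */
    return 0;
}
```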
In order to provide optimal support for MAX, a low-level D3D driver should provide efficient support for lines (both solid and dashed), textures (with both MODULATE and MODULATEMASK addressing modes), and blit-based screen updates (as opposed to page flipping). The driver should be "DrawPrimitive-aware", and texture formats should support at least a 1-bit alpha channel (which MAX uses for non-tiled texture display).
It is impossible to provide a finite set of benchmarks that can definitively rank display card performance for an application as complex as MAX. Moreover, because MAX is an interactive application, raw scene throughput numbers can not fully describe how efficient (or pleasant) it is to use MAX within a particular hardware environment. Finally, how reliable or stable a driver is or how easy it is to upgrade are factors that affect the overall end-user experience in a way that will never show up in frames-per-second benchmarks.
All that being true, there is no doubt that some display cards provide better throughput than other cards (for certain work styles, at least), and it is useful to have a set of test scenes for getting some rough idea of the "bang-per-buck" for a given hardware configuration. With that in mind, the following benchmark scenes (download size: 1 Megabyte) may come in handy. (If you are already a MAX user, and you are looking for the optimal display setup for accelerating the kind of work you do, I highly encourage you to make up your own set of benchmarks that are more representative of the type of work you do with MAX!)
Before starting MAX, if you add the line
to the [Performance] section of the 3dsmax.ini file, MAX will then display 5 indented fields in the status bar prompt area at the bottom of the MAX window. The numbers that appear in these boxes are the frames-per-second (FPS) update times for each of the four MAX viewports, together with the overall FPS number for the entire 3D viewing region. When playing an animation (typically with "real time update" set to OFF), the FPS numbers represent the average throughput for the entire animation. (The average is restarted every 1000 frames or so to filter out the effects of singular events.)
The FPS readout is a good indication of how fast a display card accelerates scene rendering. Alternatively, you can write a MAXScript program that will load MAX, play back a series of test scenes, and record the run times to a text log file (or you could get fancier and have the script automatically log the data to an Excel spreadsheet!).
Here is a brief description of the scenes:
Copyright 1997, D. L. Brittain.
Last revised: January 28, 1999.