Shedding Light on MAX Benchmarks
An Analysis of Interactive Performance With 3D Studio MAX
Don Brittain, Ph.D., for Yost Group, Inc.
Contents:
- The MAX Display Pipeline: Rough Pass
- 3D Accelerated Display Cards
- High Complexity Scenes
With 3D Studio MAX, overall productivity is largely tied to a user's ability to manipulate 3D data quickly and smoothly. Thus, almost all users are interested in what system configuration will lead to the most interactive "bang for the buck".
Adding more memory, faster hard drives, and a higher-speed (or second) CPU all help to speed up throughput, and the gains are fairly easily understood and explained. A better graphics card also speeds up interactive throughput, but figuring out what display card is "better" for a given configuration and interaction style is not always straightforward.
In particular, many users have been surprised at the effect a more expensive graphics card had on their interactive throughput: some things may have sped up more than they expected, whereas other things may not have gone noticeably faster.
Since I am one of the authors of 3D Studio MAX, and have worked extensively with the "graphics pipeline" part of the MAX code, I thought it would be helpful to explain some of the "unexpected" results so that users may make informed choices regarding what type of display acceleration will optimize their particular configurations.
But be forewarned: This article will (hopefully) explain the unexpected benchmark timings popular on the various MAX forums, but it will not provide a ranking of commercial display cards or provide any recommendations as to which cards are better than others. As you'll see, "better" is not well-defined, and the best card for the way you use MAX could easily be a terrible choice for the way someone else uses MAX.
Also, please note that the descriptions in this document apply to MAX Release 1.2. (Earlier versions of MAX worked in pretty much the same way, though some details may not be precisely correct for those versions.)
The MAX Display Pipeline: Rough Pass
In order to get 3D models to appear on a 2D computer screen, there are two major computational steps.
In the case of a shaded, textured object, the first step involves lighting, transforming, and clipping each geometric primitive (triangles, in the case of MAX). The second step involves filling in the projected triangles (with color ramps), interpolating depth values, and performing texture lookup.
Each of these steps can be computationally intensive, and either step (or both!) could be a bottleneck slowing down your use of MAX.
The first of these major steps is always handled by the main part of the MAX program. MAX has many optimizations to speed up this process (early back-facing triangle rejection, efficient clip checking, and lazy and shared vertex lighting, to name a few), but this part of the pipeline is very floating-point intensive, so eventually the speed of the CPU will be the limiting factor. This, of course, assumes that the entire scene database fits into memory and swapping is not an issue. But note that step 1 is completely independent of the display card in the system: it only involves raw data computations.
Step 2 is handled by the HEIDI driver, and thus is performed outside of the main part of the MAX program. Step 1 provides the driver with triangles having 2D screen coordinates, a normalized depth (or Z) value, colors at each vertex, and possibly floating-point texture coordinates. At this point, the driver must draw the triangles on the screen, interpolating colors, depth, and texture coordinate values as the triangles are rendered. (Of course, only those pixels that have a Z value closer than earlier-drawn pixels actually appear on the screen.)
These calculations can be done in floating or fixed point, but in either case they are computationally expensive due to the sheer quantity of data that must be computed. Remember, step 1 computations are generally done per vertex, whereas step 2 computations are done per pixel.
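To make that per-vertex versus per-pixel split concrete, here is a toy cost model comparing one viewport-filling triangle against a million sub-pixel triangles. The operation counts are illustrative assumptions, not MAX measurements:

```python
# Rough cost model: step 1 work scales with vertices, step 2 with covered
# pixels. The per-operation costs are invented for illustration only.

def step1_ops(num_triangles, ops_per_vertex=1):
    """Transform/light/clip cost: three vertices per triangle."""
    return num_triangles * 3 * ops_per_vertex

def step2_ops(pixels_covered, ops_per_pixel=1):
    """Rasterization cost: color/Z/texture interpolation per pixel."""
    return pixels_covered * ops_per_pixel

# One huge triangle filling a 640x480 viewport:
big_tri_step1 = step1_ops(1)               # 3 vertex operations
big_tri_step2 = step2_ops(640 * 480)       # 307,200 pixel operations

# A million tiny triangles averaging under one visible pixel each:
many_tri_step1 = step1_ops(1_000_000)      # 3,000,000 vertex operations
many_tri_step2 = step2_ops(786_432)        # at most one screen's worth of pixels

print(big_tri_step2 / big_tri_step1)   # step 2 dominates by ~100,000x
print(many_tri_step1 / many_tri_step2) # step 1 dominates by ~4x
```

The two extremes bracket real scenes: most workloads fall somewhere in between, which is exactly why "which step is the bottleneck?" has no single answer.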
3D Accelerated Display Cards
When running with the Software Z-Buffer HEIDI driver, every pixel drawn by MAX must be computed by way of the system CPU(s). This explains why SZB slows down when a viewport is maximized or the color depth is increased: in either case there is a lot more data for the CPU to handle.
In particular, even a single viewport-filling triangle on a 24-bit color system, rendered with a texture in a maximized viewport, will bring most systems to a crawl. Note that for this extreme example, the computational burden of step 1 is almost nil: only three vertices must be transformed, lit, etc. So, in this case the display part of the pipeline is the overwhelming bottleneck.
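Some quick arithmetic shows just how lopsided this extreme case is. The 1024x768 resolution and 16-bit Z-buffer below are assumptions for illustration; the text does not specify either:

```python
# How much per-pixel work one viewport-filling triangle generates.
# Resolution and Z-buffer depth are assumed values for illustration.
width, height = 1024, 768
pixels = width * height                 # 786,432 pixels to fill

color_bytes = pixels * 3                # 24-bit color: 3 bytes per pixel
z_bytes = pixels * 2                    # assuming a 16-bit Z-buffer
total_bytes = color_bytes + z_bytes

print(pixels)        # 786432 pixel fills for just 3 transformed vertices
print(total_bytes)   # 3932160: roughly 3.9 MB touched per redraw
```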
But for this case there is a better alternative: get a 3D rasterization accelerator. Most "3D" display cards on the market today only accelerate the computations in step 2 (which aren't really 3D, by the way!), which is why we designed MAX with the driver layer where it is.
With such a card, a dedicated processor optimized for triangle interpolation (of color, Z, and texture information) is handed the triangle data and takes over the burden of rendering the triangles onto the screen. Due to the highly optimized nature of such accelerators, it is not uncommon to see very little (if any) slowdown between quarter-screen renderings and full-screen renderings, or between untextured and textured renderings. (Indeed, turning on perspective texture correction slows down SZB even more, though it usually comes for free with dedicated accelerators.)
We now have enough information to understand one aspect of MAX performance benchmarks: hardware accelerated systems always perform relatively better than the Software Z-Buffer driver for full-screen, large-color depth displays. (To see this, try playing back an animation at both quarter screen and full screen, with both SZB and an accelerated display card.)
High Complexity Scenes
By looking at the simple extreme of one large triangle we ended up in the case where the computational load was entirely in the second, rasterization, step. At the opposite extreme is a scene with, say, a million triangles.
In this case, the main part of MAX must transform, clip, light, etc., a huge number of vertices, and the resulting 2D triangles are bound to be quite small. (Indeed, a 1024x768 pixel screen doesn't even have a million pixels, so if a million triangles were all visible, their visible portions would have to average less than a pixel in size!)
As you can imagine, this is computationally intensive, so step 1 really taxes the CPU. But, on the other hand, the resulting 2D triangles, being at most a few pixels in size, are "easy" to render, since there are not many (if any!) interpolations to perform. So, in this opposite extreme, the entire computational bottleneck lies in step 1, and system throughput is not significantly affected by whether display acceleration is present or not.
And this explains one of the most common "unexpected" benchmark results: After seeing noticeable improvement (over SZB) on a small or medium sized test scene, proud owners of expensive 3D rasterization cards then load in a huge scene, expecting to see the relative throughput go up, but instead it always goes down. In fact, for huge scenes it is not unusual to see the SZB times being virtually identical to the accelerated display card time. But now we see that this makes total sense: rasterization is an insignificant part of all the computations necessary to display a huge scene!
For dual-processor Pentium Pro systems, there is another factor that can affect benchmark results: how efficiently the various threads inside MAX are scheduled to push both processors to their maximum throughput.
The number of threads running inside MAX at any given time depends on what the user is currently doing. Some threads exist only to allow asynchronous interruptions or display updates, and their existence doesn't increase graphics pipeline performance (though they help make MAX more user-friendly from an interactive-response point of view). Other threads are assigned to the heavily computational parts of the display pipeline.
But having many threads, even on a multi-processor machine, is not a silver bullet. Certain calculations are not "parallelizable" and other calculations can't start until all the data they need are ready.
I will not attempt to describe the MAX thread load-balancing in detail, but I will describe a useful and typical case. The HEIDI driver runs in its own thread (actually, each viewport has its own HEIDI driver thread), so it is possible for the main part of MAX to be computing data in step 1 while the second processor starts running the rasterization code found in step 2. The result is much better throughput than is possible on a single processor system.
This is best seen with a medium complexity scene with shading and texturing turned on. As we know from above, this causes the loads between steps 1 and 2 to be approximately equal, and hence each processor in a dual processor system can be kept busy and overall throughput goes up.
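This overlap can be sketched with a simplifying assumption: if the two steps pipeline across the two CPUs, steady-state frame time is the larger of the two step times rather than their sum. Real thread scheduling is messier, but the shape holds:

```python
# Two-processor pipeline model: step 1 runs on one CPU while the HEIDI
# driver's step 2 runs on the other, so steady-state frame time is the
# *larger* of the two steps, not their sum. Times are arbitrary units.

def single_cpu_time(t_step1, t_step2):
    return t_step1 + t_step2          # steps run back to back

def dual_cpu_time(t_step1, t_step2):
    return max(t_step1, t_step2)      # steps overlap across CPUs

balanced = single_cpu_time(10, 10) / dual_cpu_time(10, 10)
lopsided = single_cpu_time(18, 2) / dual_cpu_time(18, 2)

print(balanced)  # 2.0: equal loads keep both processors fully busy
print(lopsided)  # ~1.1: an unbalanced pipeline gains almost nothing
```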
Note that if a rasterization accelerator is present then the "sweet spot" for where both processors are most efficiently used is different from the optimal case with SZB. This is because a rasterization accelerator is really a third processor, and the situation for optimally distributing the load three ways is understandably different than for a 2-way optimal distribution.
And this explains yet another unexpected benchmark result: After seeing a dramatic speed-up between a uniprocessor and a dual-processor system for a particular scene, a user then adds a rasterization accelerator and expects to see even more speed-up. This often doesn't happen because, for that particular scene, the second Pentium Pro is no longer being taxed since the display card is now handling most of its prior load, so the system is no longer in an optimal throughput state!
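A small overlap model (assuming steady-state frame time on a dual-CPU system equals the larger of the two step times, which is a simplification) shows why the card can fail to add anything on top of the second CPU:

```python
# Why adding a card to a well-balanced dual-CPU system can disappoint.
# All timing numbers below are illustrative, in arbitrary units.

def dual_cpu_time(t_step1, t_step2):
    return max(t_step1, t_step2)      # the two steps pipeline across CPUs

t_step1 = 10
t_step2_szb = 10                      # scene tuned so the loads were equal
t_step2_card = 1                      # the card now does nearly all of step 2

before = dual_cpu_time(t_step1, t_step2_szb)   # 10 units per frame
after = dual_cpu_time(t_step1, t_step2_card)   # still 10: step 1 is the wall

print(before / after)  # 1.0: no speedup; the second CPU now sits partly idle
```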
The last topic of this article deals with the issue of MAX's dual planes option.
The theory behind dual planes is that the fastest way to do something is to skip it altogether. In practice, what this means is that scene objects that aren't being moved or edited are rendered once and then saved off screen. When it comes time to update the scene display, the static objects are not rerendered; rather, the saved bitmap is blitted back onscreen. (This is complicated slightly by the fact that the image data as well as the state of the z-buffer must be saved and blitted, so that the moving objects can still go behind the static ones, for example.)
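A minimal sketch of the dual-planes idea, using a toy one-dimensional "screen" (the buffer layout and symbols are invented for illustration, not how MAX stores anything):

```python
# Dual-planes sketch: render static objects once, cache both the color and
# Z buffers, then each frame restore the cache and draw only the moving
# objects with a normal Z test.

WIDTH = 8
FAR = float("inf")

def clear():
    return ["."] * WIDTH, [FAR] * WIDTH

def draw_span(color_buf, z_buf, start, end, z, symbol):
    """Fill pixels [start, end) at depth z, keeping closer pixels."""
    for x in range(start, end):
        if z < z_buf[x]:
            z_buf[x] = z
            color_buf[x] = symbol

# Render the static object once and save the result off screen.
cached_color, cached_z = clear()
draw_span(cached_color, cached_z, 2, 6, z=5.0, symbol="S")

def frame(moving_x):
    # Blit the cache back instead of re-rendering the static object...
    color, z = list(cached_color), list(cached_z)
    # ...then draw the moving object; the saved Z values let it go behind.
    draw_span(color, z, moving_x, moving_x + 2, z=7.0, symbol="M")
    return "".join(color)

print(frame(0))  # MMSSSS..  moving object beside the static one
print(frame(4))  # ..SSSS..  moving object hidden behind the static one
```

Because the cached Z values are restored along with the colors, the moving object correctly disappears behind the static one without the static one ever being rerendered.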
Note that this optimization is a potentially big win: it avoids the overhead of both step 1 and step 2 for the static objects.
But, as might be expected at this point, this option affects SZB and rasterization-accelerated systems differently. Here's why: with SZB, the saving and restoring of the image and z-buffer bitmaps is a simple and fast memory-to-memory copy, whereas the same operations for an accelerated system must move data across the (relatively slow) system PCI bus.
In practice, this bus transfer slowdown manifests itself in a noticeable lag when changing selection sets or creating new objects with a rasterization accelerator present. Unfortunately, this lag makes the system feel (psychologically speaking) slower than it really is, in that most of the time the dual planes option speeds up both SZB and accelerated systems. (Note that dual planes do not speed up, or slow down, scene playback when every object, or the viewport camera, is moving.)
But the worst case for a graphics accelerated system is to have one large, but low triangle count, static object in the scene (along with some moving objects too, of course!). Then the dual plane blits, which are essentially free on SZB, actually slow down a rasterization accelerator since it could render the simple object faster than it could blit the large image data across the system bus.
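A back-of-the-envelope comparison shows how the blit can lose. The bus bandwidth and fill rate below are rough, assumed figures for hardware of this era, not measurements of any particular card:

```python
# When dual planes can hurt an accelerated card: restoring a big cached
# image over the PCI bus can cost more than just redrawing a simple object.
# Bandwidth and fill-rate figures are assumed values for illustration.

PCI_BANDWIDTH = 80e6      # usable bytes/sec across the bus (assumed)
CARD_FILL_RATE = 25e6     # pixels/sec the card can render (assumed)
BYTES_PER_PIXEL = 5       # 3 bytes of color + 2 bytes of Z to restore

def blit_time(pixels):
    return pixels * BYTES_PER_PIXEL / PCI_BANDWIDTH

def render_time(pixels):
    return pixels / CARD_FILL_RATE

# One large, low-triangle-count static object covering 400,000 pixels:
pixels = 400_000
print(blit_time(pixels) > render_time(pixels))  # True: rerendering wins
```

With these assumed figures the blit takes 25 ms against 16 ms to simply redraw, so caching the object actually costs time. On SZB both paths stay in system memory and the blit wins easily.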
Finally, I should note that if dual planes are turned off, MAX still doesn't usually render every object in the scene. Indeed, only those objects that intersect the screen rectangle of the changed part of the scene are forced to be rerendered. You can test this out by turning dual planes off (in the File | Preferences | Viewports page) and making a high-complexity sphere on the left of the viewport and a low-complexity sphere on the right of the viewport. Then, make another low-complexity sphere and drag it from the left side to the right side. If there is a sufficiently large gap between the two static spheres, you should notice a big change in frame-rate update as you move the third sphere from left to right and back.
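The rectangle-overlap test behind this behavior can be sketched as follows (the rectangle coordinates and helper names are invented for illustration):

```python
# Dirty-rectangle sketch: with dual planes off, only objects whose screen
# rectangle overlaps the changed region get rerendered.
# Rectangles are (left, top, right, bottom).

def rects_intersect(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# The two static spheres from the experiment in the text: a heavy one on
# the left of the viewport and a light one on the right.
heavy_sphere = (50, 200, 250, 400)
light_sphere = (700, 200, 900, 400)

def objects_to_rerender(dirty_rect):
    scene = {"heavy": heavy_sphere, "light": light_sphere}
    return [name for name, r in scene.items()
            if rects_intersect(r, dirty_rect)]

# Dragging the third sphere across the viewport:
print(objects_to_rerender((100, 250, 200, 350)))  # ['heavy'] - slow frames
print(objects_to_rerender((450, 250, 550, 350)))  # [] - fast frames in the gap
print(objects_to_rerender((750, 250, 850, 350)))  # ['light'] - faster frames
```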
Obviously, it would be quite tough to assign a single frames-per-second number to a scene like this! Thus, this adds yet another wrinkle to the benchmarking process.
Copyright 1997, D. L. Brittain.
Last revised: January 28, 1999.