Optimizing Dreamcast Microsoft Direct3D Performance

By Sebastian Wloch
Kalisto Entertainment
March 1999
Summary: This article provides guidelines for achieving high performance for Microsoft Windows CE-based game applications. Game developers share useful implementations for those who want to write an efficient 3-D engine, based on Microsoft Direct3D and the Windows CE operating system for the Dreamcast. The article discusses performance techniques, optimization methods, geometries and textures, and solutions to problems. (11 printed pages)
Contents
Introduction
Taking Advantage of the Power of the Dreamcast 3-D Chip
Improving Performance
Working with Geometry and Performance
Optimizing a Game
Summary

Introduction


While developing a Microsoft Windows CE–based game on the Sega
Dreamcast, we discovered several techniques that help to optimize
game code and make the best use of the Microsoft Direct3D API. This
article documents what we learned.
A game developer might think that Direct3D techniques would be
the same, whether you're developing your game for the PC or for the
Dreamcast. However, in reality, Microsoft optimized Direct3D
specifically for the Dreamcast hardware. Therefore, to obtain the
best performance, you need to pay attention to Dreamcast-specific
issues. In other words, you need to understand the Dreamcast
hardware and the Direct3D for Dreamcast implementation.
This article presents an overview of what we as game developers
consider useful for anyone who wants to write an efficient 3-D
engine, based on Direct3D and the Windows CE operating system for
the Dreamcast. First, we will cover features of the Dreamcast's 3-D
hardware. Then we will provide tips to help you implement the
following techniques, which can improve the overall performance of
your 3-D game engine. 

Send less geometry to Direct3D.
Choose the best way to send geometry to Direct3D.
Test different optimizations, and then view the results by
using the performance viewer tool of Direct3D.

Taking Advantage of the Power of the Dreamcast 3-D Chip
As triangles are sent to it, the Dreamcast hardware 3-D chip
does not render the triangles scan line by scan line. Instead, it
stores the triangles in video memory as they are sent. Once the
entire scene has been collected, the hardware sends all triangles
to the screen tile by tile, not triangle by triangle.
Every tile is 32 x 32 pixels. For each tile, the hardware
selects the triangles that intersect the tile and, for each pixel,
retrieves the triangle closest to the camera (viewport). The pixel
is then rendered to the screen by completing the interpolations,
reading the texel, and so on. Thus, every pixel on the screen is
rendered to the screen buffer only once. Other 3-D hardware
renders a pixel once for every triangle that covers it; the
Dreamcast hardware does not.
By using this method, the hardware is not limited by the fill
rate. No matter how many triangles cover a single pixel, that
single pixel is rendered only once. Therefore, with the Dreamcast
hardware, you don't need a Z-buffer, because only the closest
triangle is rendered.
In addition, with the Dreamcast hardware, you don't need to clip
the triangles to the screen viewport, so there is no need for
clipping tests and calculations. This is because the hardware
renders graphics tile by tile. As a result, you don't need to test
primitives, nor do you need to break up primitives into smaller
primitives that fit on the screen.
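The tile-by-tile scheme can be pictured with a small sketch. It is
only an illustration, not a description of the chip's internals: it
assumes a 640 x 480 screen (the article gives the 32 x 32 tile size
but no resolution), and the names tilesTouched and TileRange are
hypothetical. The function conservatively computes which tiles a
triangle's screen-space bounding box overlaps. A triangle that lies
entirely off the screen lands in no tile, which gives some intuition
for why the game does not have to clip against the viewport.

```cpp
#include <algorithm>

// Assumed values for illustration: the article states the tile size
// (32 x 32) but not the screen resolution.
constexpr int kTileSize = 32;
constexpr int kScreenW  = 640;
constexpr int kScreenH  = 480;
constexpr int kTilesX   = kScreenW / kTileSize;  // 20 tile columns
constexpr int kTilesY   = kScreenH / kTileSize;  // 15 tile rows

struct Point { float x, y; };

// Inclusive range of tile columns and rows; 'empty' marks "no tile".
struct TileRange { int x0, y0, x1, y1; bool empty; };

// Conservative binning: every tile overlapped by the bounding box
// of the triangle (a, b, c) given in screen-space pixels.
TileRange tilesTouched(Point a, Point b, Point c) {
    float minX = std::min({a.x, b.x, c.x});
    float maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y});
    float maxY = std::max({a.y, b.y, c.y});

    // A triangle entirely off the screen touches no tile at all.
    if (maxX < 0.0f || maxY < 0.0f || minX >= kScreenW || minY >= kScreenH)
        return {0, 0, -1, -1, true};

    // Clamp the box to the screen, then convert pixels to tile indices.
    int x0 = static_cast<int>(std::max(minX, 0.0f)) / kTileSize;
    int y0 = static_cast<int>(std::max(minY, 0.0f)) / kTileSize;
    int x1 = static_cast<int>(std::min(maxX, kScreenW - 1.0f)) / kTileSize;
    int y1 = static_cast<int>(std::min(maxY, kScreenH - 1.0f)) / kTileSize;
    return {x0, y0, x1, y1, false};
}

// Number of tiles in a range.
int tileCount(const TileRange& r) {
    return r.empty ? 0 : (r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1);
}
```

Under these assumptions the screen holds 20 x 15 tiles, and a
triangle that extends past every screen edge is still binned into
those tiles without being cut up first.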
The Dreamcast hardware does have to do several passes to render
transparency, which slows down the rendering process a little.
However, during that process, the hardware sorts the transparent
triangles automatically, so your game engine does not need to sort
them. Because your game engine doesn't have to do the memory
manipulations that come with sorting, it avoids disturbing (slowing
down) your 3-D pipeline. Even if the polygons intersect, there
won't be any artifacts because the translucency sorting is done for
each pixel by the hardware.
Not all transparent modes need several passes. The 5551 (Punch
Through) mode does not need to combine the most recently rendered
pixel with the pixel previously rendered to the screen buffer
because 1 bit of alpha channel does not allow any degree of
translucency. Such triangles are rendered with the same speed as
opaque triangles—in a single pass.
Another feature of the Dreamcast hardware is that it has SH4
native operations that are fully supported by a set of intrinsics.
The ones that we use the most are the dot product and the
reciprocal square root. One special function that computes the sine
and cosine of an angle is also very useful for character animation
and camera movement calculations.
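As a rough illustration of how the dot product and the reciprocal
square root combine, here is a portable sketch of vector
normalization. The functions below are plain C++ stand-ins with
hypothetical names; the article does not list the SH4 intrinsic
names, so none are used here.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Plain C++ stand-in for the hardware dot product.
inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Plain C++ stand-in for the hardware reciprocal square root.
inline float reciprocalSqrt(float v) {
    return 1.0f / std::sqrt(v);
}

// Normalize with one dot product and one reciprocal square root,
// then three multiplications instead of three divisions.
// Precondition: v is not the zero vector.
inline Vec3 normalize(const Vec3& v) {
    float inv = reciprocalSqrt(dot(v, v));
    return {v.x * inv, v.y * inv, v.z * inv};
}
```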
You can also apply the following Dreamcast hardware features to
each pixel the hardware renders to the screen: 

Use a special surface mode to perform realistic bump mapping.
Use a special texture mode (VQ compression) to complete texture
compression with an 8:1 compression ratio plus 2 KB of overhead for
the codebook.
Test the on-screen pixel with a set of volumes, and apply a
specific operation to the pixels inside or outside of the volume
(color modification, transparency, or the texture ID). This makes
shadows, lighting, and other special effects easy, and it doesn't
break up the 3-D geometry pipeline.
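The VQ figures above (8:1 plus a 2 KB codebook) translate into a
simple size estimate. The sketch below assumes 16-bit texels for the
uncompressed texture, which the article does not state, and the
function names are hypothetical.

```cpp
#include <cstddef>

// Codebook overhead quoted in the article: 2 KB per compressed texture.
constexpr std::size_t kCodebookBytes = 2 * 1024;

// Uncompressed size; the default of 2 bytes per texel is an assumption.
constexpr std::size_t rawTextureBytes(std::size_t width, std::size_t height,
                                      std::size_t bytesPerTexel = 2) {
    return width * height * bytesPerTexel;
}

// 8:1 compression of the texel data plus the fixed codebook.
constexpr std::size_t vqTextureBytes(std::size_t width, std::size_t height,
                                     std::size_t bytesPerTexel = 2) {
    return rawTextureBytes(width, height, bytesPerTexel) / 8 + kCodebookBytes;
}
```

Under these assumptions a 256 x 256 texture shrinks from 131,072
bytes to 18,432 bytes, while a 32 x 32 texture (2,048 bytes raw)
would grow to 2,304 bytes, so the fixed codebook makes the mode pay
off mainly for larger textures.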

Improving Performance
Usually in games, the complete scene is much larger than the
part a game user actually sees on the screen. Therefore, sending
every triangle of the scene to Direct3D would waste resources and
slow down performance. So, cull the triangles that are not
currently visible from the triangle set sent to Direct3D.
To eliminate the geometry that is outside the viewable area, you
need to build efficient tests that meet all of the following rules:


They are called as infrequently as possible.
They are as fast as possible.
They eliminate as many triangles as possible.

Tests are designed to eliminate the following three kinds of
geometry:


Triangles off the screen—To test for this
condition, apply view frustum elimination. That is, test every
triangle, primitive, or object against the viewing frustum pyramid,
and then eliminate the triangle, primitive, or object if it is
outside the viewing frustum pyramid. This test generally eliminates
a lot of triangles by using only a few tests.

Triangles not facing the screen—To test for this
condition, apply backface culling. That is, test every triangle or
group of triangles to see if it faces the screen, and eliminate the
geometry that is not facing the screen, such as the back of a
person's head. This test generally eliminates 10-50 percent of the
geometry, but the cost and overhead may be huge. The efficiency
depends on the geometry; the more strips you find, the better.

Triangles completely hidden by other objects—In
this case, create an advanced scene organization to determine
rapidly which triangles are hidden. This test generally eliminates
10-50 percent of the triangle geometry, but the performance depends
on the geometrical organization. This method is not discussed in
this article because it depends on the type of game. For example,
there is a big difference between exteriors and interiors.
To apply viewing frustum elimination, you need a test that
rapidly determines whether or not a triangle is in the viewing
frustum. The easiest way is to group the triangles into objects or
primitives, and then test all the triangles of an object or a
primitive together. Then you can easily compute a bounding sphere
that encloses all the triangles, and test whether the bounding
sphere touches the viewing frustum, is completely inside the
frustum, or is completely outside the frustum. The center of the
sphere can simply be the barycenter (centroid) of the triangles.
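A minimal sketch of such a bounding-sphere test follows. It assumes
the frustum is stored as six planes whose unit normals point into
the frustum; all type and function names are hypothetical. The
boxFrustum helper is only a stand-in that makes the code easy to
try, since building real frustum planes depends on the camera setup.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Plane n.p + d = 0, with the unit normal n pointing into the frustum.
struct Plane { Vec3 n; float d; };

struct Sphere { Vec3 center; float radius; };

enum class Visibility { Outside, Intersecting, Inside };

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Classify a bounding sphere against the six frustum planes.
// The test is conservative: a sphere just outside a frustum corner
// may be reported as Intersecting.
Visibility classify(const Sphere& s, const std::array<Plane, 6>& frustum) {
    bool fullyInside = true;
    for (const Plane& p : frustum) {
        float dist = signedDistance(p, s.center);
        if (dist < -s.radius) return Visibility::Outside;  // wholly behind one plane
        if (dist < s.radius) fullyInside = false;          // straddles this plane
    }
    return fullyInside ? Visibility::Inside : Visibility::Intersecting;
}

// Stand-in "frustum" for experiments: the cube [-h, h] on every axis.
std::array<Plane, 6> boxFrustum(float h) {
    return {{
        Plane{Vec3{ 1.0f,  0.0f,  0.0f}, h}, Plane{Vec3{-1.0f,  0.0f,  0.0f}, h},
        Plane{Vec3{ 0.0f,  1.0f,  0.0f}, h}, Plane{Vec3{ 0.0f, -1.0f,  0.0f}, h},
        Plane{Vec3{ 0.0f,  0.0f,  1.0f}, h}, Plane{Vec3{ 0.0f,  0.0f, -1.0f}, h},
    }};
}
```

An object classified as Outside is skipped, one classified as
Inside is sent without further tests, and only an Intersecting
object needs its primitives tested individually.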
It is also very efficient to group primitives together into
objects. Then you need only test the primitives if the object is on
the edge of the viewing frustum. If the object is completely inside
or outside the viewing frustum, you know that all the primitives
share their container object's property.
Direct3D already does backface culling very efficiently. In some
cases, we can also group triangles and treat them together. For a
series of connected triangles (a strip for example) that are
completely or almost on the same plane, you can: 

Calculate an average normal vector.
Compute the backface culling on the average vector.
Use a tolerance value to decide whether the whole set of
triangles is visible or not.

By using this process, instead of testing each triangle, you can
eliminate a strip of 10 triangles with a single test.
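The three steps can be sketched as follows. This is one possible
reading, not the authors' code: it assumes the per-triangle unit
normals are available, that toCamera is a unit vector from the strip
toward the camera, and that the tolerance is a cosine-like margin
chosen from how far the strip deviates from a plane. All names are
hypothetical.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Precondition: v is not the zero vector.
inline Vec3 normalized(const Vec3& v) {
    float inv = 1.0f / std::sqrt(dot(v, v));
    return {v.x * inv, v.y * inv, v.z * inv};
}

// Step 1: average the unit normals of the triangles in the strip.
// Precondition: the list is not empty and the normals do not cancel out.
Vec3 averageNormal(const std::vector<Vec3>& triangleNormals) {
    Vec3 sum{0.0f, 0.0f, 0.0f};
    for (const Vec3& n : triangleNormals) {
        sum.x += n.x;
        sum.y += n.y;
        sum.z += n.z;
    }
    return normalized(sum);
}

// Steps 2 and 3: one backface test for the whole strip. The strip is
// culled only when the average normal points away from the camera by
// more than the tolerance, so a nearly edge-on strip is kept.
bool stripIsBackFacing(const Vec3& avgNormal, const Vec3& toCamera,
                       float tolerance) {
    return dot(avgNormal, toCamera) < -tolerance;
}
```

The average normal can be computed once when the strip is built, so
the per-frame cost is a single dot product and a comparison.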
If an object is getting very big and contains a lot of
primitives or triangles, you may find it worthwhile to subdivide
the object into a hierarchy of smaller objects. Indeed, a large
object often does touch the viewing frustum even if only a small
piece of it really intersects the frustum. This results in sending
a large invisible piece to Direct3D for nothing. To solve this
problem, you can apply a subdivision technique such as an Octree or
a SEAD to test each piece of the large object. The idea is to
create subgroups of objects based on a regular (SEAD) or irregular
(Octree) subdivision. You could also use the logical hierarchy of
the scene, such as the hierarchy of a single character: if the arm
isn't on the screen, you don't need to check whether the hand is
on the screen.
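A hierarchical test of this kind might look like the sketch below.
It assumes each node carries a bounding sphere that encloses its own
triangles and all of its children; the Node layout, the CullStats
counters, and the plane convention (unit normals pointing into the
frustum) are assumptions made for the example.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };   // n.p + d >= 0 on the visible side
struct Sphere { Vec3 c; float r; };

enum class Vis { Outside, Intersecting, Inside };

Vis classify(const Sphere& s, const std::vector<Plane>& frustum) {
    bool inside = true;
    for (const Plane& p : frustum) {
        float dist = p.n.x * s.c.x + p.n.y * s.c.y + p.n.z * s.c.z + p.d;
        if (dist < -s.r) return Vis::Outside;
        if (dist < s.r) inside = false;
    }
    return inside ? Vis::Inside : Vis::Intersecting;
}

// Hypothetical scene node: its sphere bounds its own triangles and
// every child (for example arm -> hand).
struct Node {
    Sphere bound;
    int triangleCount;            // triangles stored directly in this node
    std::vector<Node> children;
};

struct CullStats { int tests = 0; int trianglesSent = 0; };

// Send a whole subtree without any further tests.
void submitAll(const Node& n, CullStats& st) {
    st.trianglesSent += n.triangleCount;
    for (const Node& c : n.children) submitAll(c, st);
}

// Test a node once; descend only when it straddles the frustum.
void cull(const Node& n, const std::vector<Plane>& frustum, CullStats& st) {
    ++st.tests;
    switch (classify(n.bound, frustum)) {
    case Vis::Outside:
        return;                                  // children are skipped untested
    case Vis::Inside:
        submitAll(n, st);                        // children are sent untested
        return;
    case Vis::Intersecting:
        st.trianglesSent += n.triangleCount;
        for (const Node& c : n.children) cull(c, frustum, st);
        return;
    }
}
```

In this arrangement a node that is fully inside or fully outside
costs one test no matter how many primitives hang below it.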
All these elimination techniques are based on grouping triangles
or primitives together. They are inefficient if applied to small
groups of triangles or, worse, to single triangles.
Summary

Do the fewest possible tests per triangle to eliminate it (one
bounding-sphere test for a 1000-triangle object costs 1/1000th as
much per triangle as one test for each triangle).
Create hierarchies to reduce the number of tests for each
object.
Subdivide objects that are too large into smaller hierarchies,
so that you don't end up with one DrawPrimitive call for 10,000
triangles when only 1000 of the triangles are actually in the
viewable area.

Working with Geometry and Performance
The way you store geometry and send it to Direct3D affects
performance.
In some games, you'll find that triangle lists provide better
performance. In others, you'll find that triangle strips provide
better performance. Test your situation to determine the best
approach to use.
Strips share vertices. A strip of n triangles needs only n + 2
vertices, whereas a list needs 3n, so in very large strips the
number of vertices tends toward the number of triangles. A large
strip therefore represents about one-third of the data that a
list of the same number of triangles does, and Direct3D
transforms, lights, and sends about one-third as much data to the
hardware. This is why strips are much faster than single
triangles.
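The vertex counts behind this comparison can be written down
directly. The helper names are hypothetical; the formulas are the
standard ones for an unindexed triangle list and a single strip.

```cpp
// Vertices sent for an unindexed list: three per triangle.
constexpr int listVertexCount(int triangles) {
    return 3 * triangles;
}

// Vertices sent for one strip: the first triangle needs three
// vertices, and every later triangle adds one.
constexpr int stripVertexCount(int triangles) {
    return triangles > 0 ? triangles + 2 : 0;
}

// How many times more vertices the list sends than the strip.
// Precondition: triangles > 0.
constexpr double listToStripRatio(int triangles) {
    return static_cast<double>(listVertexCount(triangles)) /
           stripVertexCount(triangles);
}
```

For 10 triangles the list sends 30 vertices and the strip 12; for
1000 triangles the counts are 3000 and 1002, so the ratio
approaches 3 only for long strips.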
One difficulty with strips is that triangles must share the same
state (texture and effects) and the adjacent vertices must be
identical (xyz, rgb, normal vector, and so on). Those constraints
are very important and the quality of the meshes directly
influences the size and number of strips that can be found. To get
the best results, you should ensure that meshes use as few
different textures as possible and that texture mapping is done so
that all adjacent vertices share the UV coordinates.
There are two different ways to send geometry to Direct3D. You
can use DrawPrimitive or DrawIndexedPrimitive. If you use the
DrawPrimitive function, you should send triangles in the
D3DPT_TRIANGLESTRIP mode, especially if you can do a simple
backface culling test for the whole strip. Avoid using the
D3DPT_TRIANGLELIST mode with the DrawPrimitive function.
If you simply want to send a list of triangles, use
DrawIndexedPrimitive instead. It is the best solution if you
can't do backface culling on large groups of triangles. With
DrawIndexedPrimitive, Direct3D automatically generates strips
from the triangle list wherever the list of indexes makes it
possible.
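An indexed triangle list of the kind an indexed draw call consumes
can be prepared offline by merging identical vertices. The sketch
below is an assumption-laden illustration: the Vertex layout
(position plus one UV pair), the 16-bit index type, and the function
name buildIndexed are all chosen for the example. It also shows why
shared UV coordinates matter, because two vertices merge only when
every field matches.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical vertex: position plus one pair of texture coordinates.
struct Vertex {
    float x, y, z, u, v;
    bool operator<(const Vertex& o) const {
        return std::tie(x, y, z, u, v) < std::tie(o.x, o.y, o.z, o.u, o.v);
    }
};

struct IndexedMesh {
    std::vector<Vertex> vertices;          // unique vertices
    std::vector<std::uint16_t> indices;    // three indices per triangle
};

// Merge bit-identical vertices of a plain triangle list.
// Assumes fewer than 65,536 unique vertices and no NaN values.
IndexedMesh buildIndexed(const std::vector<Vertex>& triangleList) {
    IndexedMesh mesh;
    std::map<Vertex, std::uint16_t> seen;
    for (const Vertex& v : triangleList) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            auto index = static_cast<std::uint16_t>(mesh.vertices.size());
            it = seen.emplace(v, index).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

A quad built from two triangles collapses from six vertices to
four; if one of the shared corners has a different UV in each
triangle, five vertices remain and the triangles no longer share
that corner.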
Regarding the type of vertex data sent, generally, D3D_LVERTEX
(lit by the game but transformed by Direct3D) is faster than
D3D_TLVERTEX (lit and transformed by the game) because Direct3D has
very efficient transformation code. But if you already have the
screen coordinates (for On Screen Display for example) or if you
can generate the geometry in the screen space (for Bezier patches
for example), then you might prefer D3D_TLVERTEX.
A problem may occur if you group several objects into a single
list and these objects are positioned differently (different limbs
of a character for example). In this case, the only way you can
have Direct3D carry out the transformations is to split the
triangle list into several smaller lists. This reduces performance
because Direct3D is faster with large lists. It may be impossible
to create some lists if several vertices of a triangle don't share
the same matrix, which happens when you are putting skin on
characters. In those cases, it is usually more efficient to do the
transformation in the game code (for example, with the animation)
and send the transformed vertices in larger lists by using the
D3D_TLVERTEX type.
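The idea of transforming in the game code so that one large list
can be sent is sketched below. It is deliberately simplified: each
vertex is moved by a single bone matrix (real skinning usually
blends several), and the projection to screen coordinates that a
fully transformed vertex also needs is left out. The Mat34 and
SkinnedVertex types are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 3x4 affine matrix: 3x3 rotation/scale, translation in column 3.
struct Mat34 { float m[3][4]; };

inline Vec3 transformPoint(const Mat34& a, const Vec3& p) {
    return {
        a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2] * p.z + a.m[0][3],
        a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2] * p.z + a.m[1][3],
        a.m[2][0] * p.x + a.m[2][1] * p.y + a.m[2][2] * p.z + a.m[2][3],
    };
}

// Hypothetical source vertex: model-space position and its bone index.
struct SkinnedVertex { Vec3 position; std::size_t bone; };

// Transform every vertex with its own matrix into one contiguous
// array, so the result can be submitted as a single large list
// instead of one small list per matrix.
std::vector<Vec3> transformAll(const std::vector<SkinnedVertex>& in,
                               const std::vector<Mat34>& bones) {
    std::vector<Vec3> out;
    out.reserve(in.size());
    for (const SkinnedVertex& v : in)
        out.push_back(transformPoint(bones.at(v.bone), v.position));
    return out;
}
```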
While the Dreamcast hardware does the viewport clipping,
Direct3D does the near plane clipping if the DONOTCLIP flag is not
set. The DONOTCLIP flag tells Direct3D not to do clipping
calculations. It is best to turn the DONOTCLIP flag on whenever
possible. Test each object to see if it touches the near plane. If
it does, send its triangles without the DONOTCLIP flag so that
Direct3D clips them; otherwise, set the flag.
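The per-object near-plane test can be as small as the sketch below.
It assumes a camera-space bounding sphere, a camera looking down +z,
and an object that has already passed the frustum test; the names
are hypothetical, and the result only says whether the DONOTCLIP
flag would be safe to set.

```cpp
// Camera-space bounding sphere (camera at the origin, looking down +z).
struct Sphere { float cx, cy, cz, radius; };

// True when no part of the sphere can reach the near plane z = nearZ,
// so near-plane clipping could be skipped for the whole object.
bool canSkipNearClip(const Sphere& s, float nearZ) {
    return s.cz - s.radius > nearZ;
}
```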
Our final issue with geometry involves data locality and
alignment. To be as efficient as possible, align all vertex data to
32 bytes. If the vertex data is misaligned, Direct3D has to copy
the data to another memory block that is aligned to 32 bytes. An
important thing to consider is that a block allocated with the
malloc function is only aligned to 4 bytes.
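One portable way to obtain a 32-byte-aligned block from malloc is
to over-allocate and round the pointer up. The AlignedBuffer class
below is a hypothetical sketch of that technique, not an API from
the Dreamcast SDK; a platform-specific aligned allocator, where one
exists, would serve the same purpose.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Owns a malloc'd block and exposes a pointer inside it that is
// rounded up to 'alignment' (which must be a power of two).
class AlignedBuffer {
public:
    explicit AlignedBuffer(std::size_t bytes, std::size_t alignment = 32)
        : raw_(std::malloc(bytes + alignment - 1)), aligned_(nullptr) {
        if (raw_) {
            std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw_);
            std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
            aligned_ = reinterpret_cast<void*>((addr + mask) & ~mask);
        }
    }
    ~AlignedBuffer() { std::free(raw_); }

    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;

    // Aligned start of the usable block; nullptr if allocation failed.
    void* data() const { return aligned_; }

private:
    void* raw_;      // pointer returned by malloc, needed for free
    void* aligned_;  // first suitably aligned address inside the block
};

inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The extra 31 bytes per allocation are the price of the alignment,
so it is cheaper to place many primitives in one large aligned
block than to allocate each one separately.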
Also, you should not generate primitives on the fly. It is much
faster to have everything ready in the final format. Then you can
simply call the DrawPrimitive function. You should use D3D_VERTEX
(transformed and lit by Direct3D) wherever possible.
Finally, don't store the primitives in a random order. Try to
group them in the same order that you're going to render them. This
will be faster due to better cache coherence.
Summary

Send as many vertices as possible in a single DrawPrimitive
call. This is the most important optimization you can do. Do
everything you can to keep from breaking up primitives.
Do the transformation yourself if letting Direct3D do it would
force you to break up primitives because their vertices use
different matrices.
Try to share all states for the triangles you send.
Group the triangles per state and matrix, but don't sort them
on the fly in real time. If you arrange them by matrix and state
beforehand, then rendering object by object is fine.
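Grouping by state and matrix ahead of time can be sketched as an
offline sort followed by a count of how many draw calls remain. The
Primitive record and the use of a texture identifier as the only
render state are simplifying assumptions for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical primitive record: the state and matrix it needs.
struct Primitive { int textureId; int matrixId; int triangleCount; };

// Offline step: order primitives so that equal (texture, matrix)
// pairs become adjacent. Not meant to run every frame.
void sortByStateAndMatrix(std::vector<Primitive>& prims) {
    std::stable_sort(prims.begin(), prims.end(),
                     [](const Primitive& a, const Primitive& b) {
                         return std::tie(a.textureId, a.matrixId) <
                                std::tie(b.textureId, b.matrixId);
                     });
}

// Draw calls needed if adjacent primitives that share the same
// state and matrix are merged into one call.
int countBatches(const std::vector<Primitive>& prims) {
    int batches = 0;
    for (std::size_t i = 0; i < prims.size(); ++i) {
        if (i == 0 ||
            prims[i].textureId != prims[i - 1].textureId ||
            prims[i].matrixId != prims[i - 1].matrixId)
            ++batches;
    }
    return batches;
}
```

Running countBatches before and after the sort gives a quick
estimate of how many calls the grouping would save for a given
data set.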

Optimizing a Game
The Windows CE performance viewer is an important tool that you
can use for all the optimization work on a game. To activate it,
select it in the Monitor drop-down menu in the Dreamcast Tool, but
only after you have launched the game.
When you activate the Windows CE performance viewer, you will
see three horizontal bars on the screen. The first bar (light blue)
represents the time the hardware takes to render the scene. The
second bar (gray with red, green, or blue vertical lines)
represents the time spent either in the application or in Direct3D.
The third bar (purple) represents the frame rate.
The three bars grow from left to right. The slower a part is,
the longer its bar will be. On the second bar, you can
differentiate between the time spent in the application (gray) and
in Direct3D (colored lines).
You can see the results of every optimization explained in this
article by looking at the bars displayed by the Windows CE
performance viewer.
An efficient elimination algorithm reduces the time spent in
Direct3D, so you'll see fewer colored lines and more gray. If the
gray part of the bar grows by more than the colored part shrinks,
then the game code took more time to eliminate the triangles than
it would have taken to render them, and the total time for each
frame has increased.
Because each DrawPrimitive and DrawIndexedPrimitive call is
represented by one colored line, if a geometry is rendered
triangle by triangle, a large part of the bar will consist of
interlaced gray and colored lines. If the geometry is rendered
with only one DrawIndexedPrimitive call, there will be one wide
colored line, but it will be much shorter than the previous
interlaced part. This shows that it takes less time to render the
same number of triangles if they are sent together in one large
list.
If a geometry can be automatically transformed into strips by
the DrawIndexedPrimitive call, the large colored block will
shrink, and the global performance will be better. This is because
the number of vertices will be reduced in the mesh and because the
size of the colored line depends directly on the number of vertices
sent.
With this tool, it is very easy to try out different modes and
flags and to measure the difference between them precisely. We
really appreciated the direct feedback this tool can deliver. You
can disconnect some functionality by pressing a key and immediately
see the bar shrink.
Examples from Optimization Process
The following screen shots come from our optimization process on
a technical demo game. At the bottom of each screen shot, notice
the bars of the performance monitor. Figure 1 illustrates the
performance monitor so that you can better understand the screen
shots in Figures 2 through 5.




Figure 1. Performance monitor

In the first screen shot in Figure 2, none of the optimizations
has been implemented. The game is sending a lot of small
primitives, as shown by every little red or blue line.




Figure 2. No optimizations implemented

In Figure 3, primitives are aligned to 32 bytes, lined up one
behind the other.




Figure 3. Primitives aligned

In Figure 4, triangles are grouped by render state to reduce the
number of primitives.




Figure 4. Triangles grouped by render state

In Figure 5, strips were generated to reduce the number of
vertices.




Figure 5. Strips generated to reduce vertices


Summary
By following the guidelines in this article, you will be able to
achieve very high performance for your Windows CE–based game
application with Direct3D.
When we first launched our PC application on the Dreamcast,
performance was worse than 10 frames per second. But after we
applied the techniques explained in this article, performance
improved significantly. Now the performance is close to 60 frames
per second, and we still have more optimizations to do. We plan to
increase the size of our primitives even further and use fewer
textures for our objects. We are confident that, with these
additional optimizations, we will be able to achieve a performance
of better than 60 frames per second.
The solutions discussed in this article don't all bring the same
performance improvement, but the basic idea remains the same. Try
to send as many triangles as possible using as few DrawPrimitive
or DrawIndexedPrimitive calls as possible. Once you've achieved
that, reduce the number of vertices sent by sharing the vertices
that you do send.
It is very important to choose the right method for each kind of
geometry (humans, animals, cars, and so on) and to train artists to
create clean geometries that use just a few different textures with
texture coordinates that can be shared by the vertices.
--------------------------------------------

This document is provided for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS
SUMMARY.


Microsoft, Direct3D and Windows are either registered trademarks
or trademarks of Microsoft Corporation in the United States and/or
other countries.


Other product or company names mentioned herein may be the
trademarks of their respective owners.









© 2004 Microsoft Corporation. All
rights reserved.