>Forget about the Verite...The I3D Voodoo will not be returned to EB
>From Bjorn's 3D-World:
>Rendition about VQuake and VHexen II. OK, all the other stuff is taken
>care of. Now it's time to take on the hot potato that the Stealth
>II has become... First, Stefan mailed me back an answer to the
>question about the "slow" (no performance gain) speed people have been
>getting in VQuake and VHexen II. This is what he said:
>I'm not sure how Golden Egg was able to get good frame rates on
>VQuake, either. All of our experiments had shown that VQuake wouldn't
>go much faster on the V2200. The reason is simple: the way the game
>draws is very CPU-dependent for its performance. You get much bigger
>performance increases with a faster CPU than you do with a faster
>renderer. (I can give you the relevant details of the graphics engine
>if you'd really like that.)
>So, yes, it's true that VHexenII goes the same speed (at least on my
>machine) on a V2K as on a V1K (about 51fps timedemo demo1 at
>320x240). On the other hand, running it on a P2-300 on a V2K (also
>with the benefit of the faster AGP bus, but I'm not sure how much that
>matters yet) I get about 70fps in the same test.
>This is why, by the way, we have another programmer working on Hexen2
>and Quake2 for the V2K, that will not be so CPU bound, and will, if
>things go as planned, give better performance.
>For now, in Quake 2, I do get much better numbers on the V2K than on
>the V1K, but that is using timerefresh, so I can't guarantee that
>those numbers will carry over once timedemo gets added to Q2. (BTW,
>the P2-300 gets much better Q2 timerefresh scores than my P166MMX.)
>So again, I don't know why Golden Egg got better performance then
>compared to now. Let me know if you'd like more details on why CPU
>power is so crucial to Quake-engine games.
Posted by Stefan Podell on October 30, 1997 at 04:34:56:
Okay, with my reputation tarnished by VQuake and VHexen2
not going any faster on the V2x00, I'm here to explain
why this is the case.
First, the VQuake/VHexen2 engine can be roughly broken
down into four parts (actually, most games are like
this):
* Software setup of geometry commands
* Software creation of texture maps
* Transfer of textures and commands to the renderer
* Drawing by the renderer
The first two parts are totally CPU dependent, the
third is bus dependent, and the fourth, in my case,
is Verite dependent.
Now, some history on the design of VQuake.
When we first started working with Id on accelerated
Quake, Id's design was much like their design for Quake
2: that there would be a "driver" part of the code,
somewhat separated from the "main" game. (In Quake's
case, though, since it is a DOS game, this would be
accomplished with different executables rather than the
more elegant DLL model employed by Quake 2.)
They had two 3D engine paths through the game. One was
a traditional triangle/polygon based engine, which is
what they predicted all 3D accelerators would use, and
one was a fairly elaborate "span sorting" scheme which
the software renderer used.
So we set out accelerating the game using the polygon
interface. It looked great with filtering, but it
wasn't terribly fast. Even after doing polygon sorting
on the world polygons and turning off the Z buffer for
those, the performance was not great. (You must
remember that the V1000 wasn't really designed to do
Z-buffering as its primary rendering style.) So Walt
Donovan, then with Rendition, and Michael Abrash, then
with Id, talked about using the software engine's span
sorting.
For those who are interested, there have
been a few articles published by Michael on the guts
of the engine. I think Dr Dobb's Journal had the best,
most detailed one. To the best of my understanding,
the engine sorts all world surfaces (floors, walls,
ceilings, and sky) on each scanline of the screen,
and keeps track of the edges of each span. When it
goes to draw a particular surface (polygon), it is
then guaranteed that every pixel it is drawing is the
only world pixel that will be drawn at those
coordinates. (This is hard to explain...) Suffice to
say that when the world surfaces are drawn, exactly
the minimum possible number of pixels is drawn (i.e.,
a depth complexity of 1). Meanwhile, the Z buffer
has been filled with the correct Z values, but Z
comparison is not necessary, since we know the depth
complexity is one. Next, when we start drawing objects
(monsters, weapons, etc), we turn on the Z comparison
so that the objects are properly hidden by the world.
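To make the "depth complexity of 1" idea concrete, here
is a minimal sketch of resolving one scanline. It is not
the Quake code: it assumes each surface covers a single
span with one constant depth, and the names
(visible_spans, the sample surfaces) are made up for
illustration. The real engine is considerably more
elaborate.

```python
from typing import List, Tuple

# (x_start, x_end, depth, surface_id); x_end is exclusive and a
# smaller depth means the surface is nearer to the viewer.
# Constant depth per span is a simplifying assumption.
Span = Tuple[int, int, float, str]


def visible_spans(spans: List[Span]) -> List[Tuple[int, int, str]]:
    """Resolve overlapping spans on one scanline so that every
    pixel belongs to exactly one surface (depth complexity 1)."""
    # Every span start or end is a point where the visible
    # surface may change.
    cuts = sorted({x for s in spans for x in (s[0], s[1])})
    out: List[Tuple[int, int, str]] = []
    for left, right in zip(cuts, cuts[1:]):
        # Surfaces that cover this whole interval.
        covering = [s for s in spans if s[0] <= left and s[1] >= right]
        if not covering:
            continue  # gap: nothing to draw here
        nearest = min(covering, key=lambda s: s[2])
        if out and out[-1][2] == nearest[3] and out[-1][1] == left:
            # Same surface continues: extend the previous span.
            out[-1] = (out[-1][0], right, nearest[3])
        else:
            out.append((left, right, nearest[3]))
    return out


if __name__ == "__main__":
    scanline = [
        (0, 320, 10.0, "far wall"),
        (100, 200, 4.0, "pillar"),
        (150, 320, 7.0, "side wall"),
    ]
    result = visible_spans(scanline)
    print(result)
    print("pixels submitted:", sum(s[1] - s[0] for s in scanline))
    print("pixels drawn:", sum(r - l for l, r, _ in result))
```

In this example 590 pixels of overlapping spans go in and
320 come out, one per screen pixel, so the renderer never
has to compare Z values for the world.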
The reasoning behind this was that the Pentium was
pretty slow at drawing pixels, but fast at floating
point operations. By doing more of what the Pentium
was good at, Quake was able to do less of what the
Pentium was bad at. Overall, a performance win, with
the side benefit of allowing much more interesting
Walt and Michael decided that since the Verite 1000
wasn't terribly good at Z-buffered pixels, if we let
the Pentium take care of this span sorting, we
could reduce the number of pixels the Verite would
draw. Furthermore, we'd be able to turn off the Z
compare function on the Verite. So now, rather
than having a depth complexity of around 1.5
(about 450000 pixels at 640x480) drawn at the
peak rate of about 10MP/s on the V1000, we
had a depth complexity of 1 (about 300000 pixels)
drawn at a peak rate of about 17MP/s.
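As a back-of-the-envelope check on those figures, the
sketch below works out how long the world pixels alone
would take at each peak rate. It only uses the numbers
quoted above; it ignores objects, setup and bus time, so
treat it as a best case rather than a measurement.

```python
WIDTH, HEIGHT = 640, 480


def fill_time_ms(depth_complexity: float, megapixels_per_sec: float) -> float:
    """Milliseconds to draw one frame of world pixels at peak rate."""
    pixels = WIDTH * HEIGHT * depth_complexity
    return pixels / (megapixels_per_sec * 1_000_000) * 1000.0


if __name__ == "__main__":
    z_buffered = fill_time_ms(1.5, 10.0)   # polygon path, Z compare on
    span_sorted = fill_time_ms(1.0, 17.0)  # span path, Z compare off
    print(f"Z-buffered polygons: {z_buffered:.1f} ms per frame")
    print(f"Span-sorted world:   {span_sorted:.1f} ms per frame")
    print(f"Ratio:               {z_buffered / span_sorted:.2f}x")
```

That is roughly 46 ms against 18 ms per frame for the
world, about a 2.5x reduction in renderer time, paid for
with extra CPU work.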
As with the software renderer, we could then turn on
the Z compare and draw all the more interesting
objects. Yes, these pixels would be as slow as always,
but there are generally far fewer of them.
So we set out writing new microcode to support Quake's
span data format. When we finally got this working,
sure enough, the performance was way better than the
original polygon-style engine.
Back to my four-part description of Quake's engine, we
traded more CPU work in stage one for less Verite work
in stage four, which ended up getting us a big win.
Note that when you increase the vertical resolution in
the game, the engine must sort more scanlines. And if
you increase the resolution in either direction, the
renderer must draw more pixels.
Also note that when Quake was written, P133's were
pretty top-notch, and the software frame rates were
low enough that the span sorting was adequate to
keep up. With the Verite rendering much faster than
the Pentium could, suddenly the span sorting was the
bottleneck. It wasn't until we got to P200's that
the Verite was busy most of the time.
So whether you're on a V1000 or V2x00, the CPU has
a lot of work to do.
Okay, that's part one.
Part two is texture maps. (This will be much shorter.)
The way the software renderer in Quake works is to take
a small texture tile (like a couple of bricks) and
duplicate it into a larger texture map for the world
surface while it is applying the light map (dynamic or
static). It then caches that texture map and draws
with it. When it needs to draw that surface again, it
checks to see if the lights have changed (like when
you fire a weapon). If they have, it must regenerate
the texture map and recache it.
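The caching behavior just described can be sketched
roughly as follows. This is an illustration only, not the
Quake code: it assumes a single brightness value per
surface instead of a real light map, and SurfaceCache and
build_surface are hypothetical names.

```python
from typing import Dict, List, Tuple

Tile = List[List[int]]  # rows of texel brightness values, 0..255


def build_surface(tile: Tile, repeat_x: int, repeat_y: int, light: float) -> Tile:
    """Repeat a small tile across a surface and bake the light into it.
    One scalar light level stands in for a real light map (assumption)."""
    rows: Tile = []
    for _ in range(repeat_y):
        for row in tile:
            lit = [min(255, int(texel * light)) for texel in row]
            rows.append(lit * repeat_x)
    return rows


class SurfaceCache:
    """Keeps the lit texture for each surface until its lighting changes."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[float, Tile]] = {}
        self.rebuilds = 0  # how often the CPU had to regenerate a texture

    def get(self, surface_id: str, tile: Tile,
            repeat_x: int, repeat_y: int, light: float) -> Tile:
        cached = self._entries.get(surface_id)
        if cached is not None and cached[0] == light:
            return cached[1]  # lights unchanged: reuse the cached map
        # Lights changed (or first use): regenerate and recache.
        self.rebuilds += 1
        texels = build_surface(tile, repeat_x, repeat_y, light)
        self._entries[surface_id] = (light, texels)
        return texels


if __name__ == "__main__":
    cache = SurfaceCache()
    bricks = [[100, 200], [50, 25]]
    cache.get("wall", bricks, 2, 1, 1.0)  # first use: built
    cache.get("wall", bricks, 2, 1, 1.0)  # same light: cache hit
    cache.get("wall", bricks, 2, 1, 0.5)  # light changed: rebuilt
    print("rebuilds:", cache.rebuilds)
```

Every rebuild is CPU time, which is why scenes with lots
of changing lights cost more than static ones.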
The VQuake engine does the same thing, with the extra
step of having to download the texture map to video
memory.
Quake also mipmaps these surfaces. The mipmap level
is chosen based on the size of the polygon (in pixels)
relative to the size of the texture map.
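One simple way to pick a mipmap level from that ratio is
sketched below. This is an assumed rule for illustration,
not Quake's actual selection code: it halves the texture
until it is no larger than the polygon on screen, and caps
the result at four levels.

```python
import math

MIP_LEVELS = 4  # assumed number of levels stored per texture


def mip_level(texture_size: int, screen_size: float) -> int:
    """Pick a mip level from texture size (texels) and the size the
    polygon covers on screen (pixels), both along one axis."""
    if screen_size <= 0:
        return MIP_LEVELS - 1  # nothing visible: smallest level will do
    ratio = texture_size / screen_size
    if ratio <= 1.0:
        return 0  # polygon is at least as big as the texture: full detail
    # Each level halves the texture, so the level is log2 of the ratio.
    return min(MIP_LEVELS - 1, int(math.log2(ratio)))


if __name__ == "__main__":
    for pixels in (128, 64, 32, 16, 4):
        print(f"64-texel texture over {pixels:3d} pixels -> level {mip_level(64, pixels)}")
```

Under a rule like this, raising the screen resolution
makes every polygon cover more pixels, which pushes the
choice toward the larger, more detailed levels.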
In VQuake, the texture cache is kept in video memory
along with the display buffers and Z buffer. The
quick equation for how much memory the display and Z
buffers take is Width * Height * 6 (3 buffers, each
16-bits deep). The rest is for texture maps (minus
about 128K for microcode).
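Plugging numbers into that equation shows how quickly the
texture space shrinks. The sketch below assumes a board
with 4MB of video memory; that total is an assumption for
illustration and is not stated above.

```python
MICROCODE_BYTES = 128 * 1024
BOARD_BYTES = 4 * 1024 * 1024  # assumed board size, not from the post


def buffer_bytes(width: int, height: int) -> int:
    """Display and Z buffers: 3 buffers, each 16 bits (2 bytes) deep."""
    return width * height * 6


def texture_bytes(width: int, height: int, board_bytes: int = BOARD_BYTES) -> int:
    """Video memory left over for the texture cache."""
    return board_bytes - buffer_bytes(width, height) - MICROCODE_BYTES


if __name__ == "__main__":
    for width, height in ((320, 200), (512, 384), (640, 480)):
        used = buffer_bytes(width, height)
        free = texture_bytes(width, height)
        print(f"{width}x{height}: buffers {used // 1024}K, textures {free // 1024}K")
```

With these assumptions, going from 320x200 to 640x480
cuts the texture cache from about 3.5MB to about 2.1MB.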
So when you increase the resolution, two things
happen that increase the demands on the CPU for
texture map generation. First, you have less texture
memory, so textures will fall out of the cache more
often, requiring regeneration. Second, higher
resolution mipmaps will be chosen, further straining
the texture cache.
The assembly code for generating the textures is
darn near as good as it can be. I certainly can't
think of any instructions to remove, and Michael
Abrash, who wrote it, is a genius at this stuff.
We considered doing two pass lighting on the Verite,
but after some experiments decided the CPU could do
So again, no matter the Verite chip, the CPU will be
very busy. (The texture mapping, by the way, is the
primary reason that timedemo works so much better as
a real benchmark than timerefresh. In the demo
sequences, there's lots of combat going on, which
pushes the system much harder.)
Alright. That's part two.
Part three is the bus. As you know, the currently
available Verites use the PCI bus, and are able to
use DMA asynchronously, which does not use the
CPU. The bus activity will steal some cycles from
the CPU, but not an appreciable amount.
Finally, the renderer. The V2x00 chips are *much*
faster at drawing than the V1000. The fastest the
V1000 could go was 25MP/s. The V2100 goes 40MP/s
and the V2200 goes 50MP/s. Adding features (Z,
alpha, fog, etc.) would slow the V1000 down a lot,
while having minimal impact on the V2x00 chips.
So I understand why people were expecting VQuake
and VHexen2 to go much faster on the V2x00. But
the fact of the matter is that a faster renderer
doesn't necessarily buy you much given the
architecture of this engine. And because of that,
we're working on a V2x00-specific version of the
engine, to take advantage of the extra pixel
power, while lightening the load on the CPU.
I must admit that I was beginning to wonder if
my beliefs about the engine's behavior were
really true. So I just did a weird hack on
VHexen2 to test something:
I put in a check at the time drawing commands
and texture maps are sent to the Verite to see
if the game was in "timedemo mode". If it was,
I just threw away the commands and continued.
Then, when timedemo was over, drawing would
kick back in and I would see the results. The
purpose of this was to simulate an *infinitely
fast* renderer and bus (something we'd all
like to have :-) (This is also known as a
"speed of light" test.)
Here are the results of running timedemo demo1
on my current VHexen2 build (beta 3 candidate).
I ran all the tests at 320x200, 512x384, and
640x480, with antialiasing set to 0 and to 7
at all resolutions.
The first three tests are with rendering turned
on, in other words, the numbers everyone has
been responding to so far. The last set of
numbers is my "infinitely fast" renderer.
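The "infinitely fast" renderer hack described above can
be sketched like this. It is only an illustration of the
idea, not the VHexen2 code; the class names and the
skip_in_timedemo flag are hypothetical.

```python
from typing import List, Tuple

Command = Tuple[str, int]


class Renderer:
    """Stand-in for the bus plus the Verite: records what it receives."""

    def __init__(self) -> None:
        self.received: List[Command] = []

    def draw(self, command: Command) -> None:
        self.received.append(command)


class Game:
    def __init__(self, renderer: Renderer) -> None:
        self.renderer = renderer
        self.in_timedemo = False
        self.skip_in_timedemo = False  # the "speed of light" switch

    def submit(self, command: Command) -> None:
        # During a timedemo, throw the work away at the last moment so
        # the bus and renderer cost nothing.
        if self.in_timedemo and self.skip_in_timedemo:
            return
        self.renderer.draw(command)

    def run_frame(self, n_commands: int) -> None:
        for i in range(n_commands):
            command = ("span", i)  # the CPU still builds every command
            self.submit(command)


if __name__ == "__main__":
    renderer = Renderer()
    game = Game(renderer)
    game.skip_in_timedemo = True
    game.in_timedemo = True
    game.run_frame(1000)   # timed, nothing reaches the renderer
    game.in_timedemo = False
    game.run_frame(10)     # drawing kicks back in
    print("commands drawn:", len(renderer.received))
```

The point of the design is that all CPU-side work still
happens, so the frame rate measured this way is the
ceiling that no faster chip or bus could exceed.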
(fps are antialias = 0/antialias = 7)
The first test was on my P166MMX (64MB RAM)
with my V1000 reference board, which runs at
the same speed as an Intergraph 3D 100.
(These seem slower to me than what I
remember, but I can't find where I wrote
down my old numbers, so I just re-ran them.)
320x200 (51.0 / 43.1)
512x384 (26.7 / 24.2)
640x480 (16.9 / 15.5)
Next, the same PC with a V2200. The first
thing you'll notice is that it's a little
slower at low resolutions. I think this is
because the span microcode has a little extra
overhead on the 2200. It also doesn't
currently interleave buffers. I'm looking
into fixing those things.
320x200 (48.1 / 41.9)
512x384 (32.9 / 28.5)
640x480 (22.0 / 20.1)
Next, a V2200 in a P2-300 (this computer has
a pretty sucky hard drive in it and 32MB RAM,
so there was more swapping going on than
on my 166MMX)
320x200 (71.0 / 63.4)
512x384 (43.3 / 36.1)
640x480 (27.9 / 23.5)
And finally, my P166MMX with my "infinitely
fast" renderer/bus test
320x200 (56.0 / 48.2)
512x384 (44.1 / 37.8)
640x480 (32.4 / 29.7)
So you can see that at low resolutions, a
faster CPU gets you much more by way of
performance than a faster renderer. And as
resolution increases, infinitely fast
rendering is a good thing :-)
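To put numbers on that conclusion, the sketch below
computes speedup ratios from the antialias = 0 results
listed above. The figures are copied from the tables; the
dictionary layout and the speedup helper are just for
illustration.

```python
# fps with antialias = 0, copied from the results above.
RESULTS = {
    "P166MMX + V1000": {"320x200": 51.0, "512x384": 26.7, "640x480": 16.9},
    "P166MMX + V2200": {"320x200": 48.1, "512x384": 32.9, "640x480": 22.0},
    "P2-300 + V2200": {"320x200": 71.0, "512x384": 43.3, "640x480": 27.9},
    "P166MMX + infinite": {"320x200": 56.0, "512x384": 44.1, "640x480": 32.4},
}


def speedup(config: str, baseline: str, resolution: str) -> float:
    """How many times faster config is than baseline at a resolution."""
    return RESULTS[config][resolution] / RESULTS[baseline][resolution]


if __name__ == "__main__":
    base = "P166MMX + V2200"
    for res in ("320x200", "512x384", "640x480"):
        cpu = speedup("P2-300 + V2200", base, res)
        inf = speedup("P166MMX + infinite", base, res)
        print(f"{res}: faster CPU {cpu:.2f}x, infinite renderer {inf:.2f}x")
```

At 320x200 the faster CPU gives about 1.48x while even an
infinitely fast renderer gives about 1.16x; at 640x480
the order flips, with about 1.27x from the CPU and about
1.47x from the renderer.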
Again, all this adds up to us working on
a V2x00 version of this engine. We'll tell
you more as we know more about its progress.
This is not to say that there's no room for
improvement in my current version of the
game. So if I figure out any cool way to make
this go faster on a V2x00, I'll definitely
put it in. There is one thing Quake 2
does that I want to try in VHexen 2. I'll
let you know.
Stefan (maybe I should use a .plan file) Podell