On 8/7/2022 6:02 AM, Andy wrote:
> On 6/08/22 18:46, BGB wrote:
>> On 8/5/2022 8:29 PM, Andy wrote:
>>> I'm assuming you've implemented some kind of deferred shading tile
>>> renderer?, since block-rams would seem to be the perfect fit for tile
>>> buffers if they aren't too oddly sized.
>> No, errm, TKRA-GL is a software renderer implementing the OpenGL API
>> on the front end...
> So, pretty much like the standard Z-buffer software scanline renderer
> one would write on a PC then?
The same code can build and run on a PC as well, albeit with a slight
disadvantage there due to the PC lacking a few special features I have
in my ISA. Despite my PC having 74x the clock speed, it only seems to
pull off around 20x the fill rate (takes roughly 4x as many
clock-cycles per pixel).
> They invented the PowerVR style tile renderer for a reason, so may as
> well steal from the best. :-)
> I'm guessing it might help boost your frame-rates for pretty much the
> same reasons, -- the small internal tile buffer negates the need to
> read/write to dram for every buffer lookup/update.
It would mostly help if one is throwing multiple cores at the problem.
Some of my past experiments with multi-threaded renderers had split the
screen into halves or quarters, with one thread working on each part of
the screen.
Though, my current rasterizer is single threaded.
Originally, it was intended to be dual threaded, but this ran into a
problem when I started exceeding the resource budget needed for dual
cores on the FPGA I am using, and the emphasis shifted to trying to
make a single core run fast.
L1/L2 misses from the raster drawing part aren't too bad IME.
Texture-related misses were a bigger issue, which is part of why I am
using Morton ordering when possible.
A 256x256 texture has more texels than a 320x200 framebuffer has pixels.
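As a rough sketch (not the actual TKRA-GL code), Morton ordering
interleaves the bits of the X and Y texel coordinates, so texels that
are close together in 2D also land close together in memory:

  /* Interleave the low 8 bits of x and y into a 16-bit Morton index;
     a 256x256 texture is then addressed as tex[morton16(x, y)]. */
  static uint16_t morton16(uint8_t x, uint8_t y)
  {
      uint32_t mx = x, my = y;
      mx = (mx | (mx << 4)) & 0x0F0F;
      mx = (mx | (mx << 2)) & 0x3333;
      mx = (mx | (mx << 1)) & 0x5555;
      my = (my | (my << 4)) & 0x0F0F;
      my = (my | (my << 2)) & 0x3333;
      my = (my | (my << 1)) & 0x5555;
      return (uint16_t)(mx | (my << 1));
  }

This way a 4x4 block of 16-bit texels lands in one contiguous 32-byte
run, rather than being spread across 4 separate rows.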
Splitting up geometry and then drawing each tile sequentially is not as
likely to be helpful.
There is a possibility of a slight advantage to drawing geometry
Z-buffer-only first, and then going back and drawing surfaces.
This would matter a lot more for doing a lot of blending or possibly if
running shaders, since in this case per pixel cost is a bigger issue.
I already have some special cases where geometry hidden behind
previously drawn geometry will be culled.
Some of the span drawing loops have "Z-fail sub-loop" special cases:
If the first pixel would be Z-Fail, go into a Z-fail loop;
If we hit a pixel that is Z-Pass, branch back into Z-Pass loop.
The Z-Fail sub-loop simply updates the state variables and checks for
a pixel that passes the Z test.
The Z-Pass loops generally use predication for the Z test, though
another possible (valid) design would be to have the Z-Pass loop branch
back into the Z-Fail loop on Z-Fail.
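As a minimal sketch of this structure (assuming a 16-bit Z-buffer; the
variable names and the texel-fetch step are hypothetical stand-ins,
not the actual span functions):

  /* Z-Pass loop with a Z-Fail sub-loop over one span. */
  while (x < xe)
  {
      if (z < zbuf[x])
      {   /* Z-Pass: store and step (predicated Z test in practice) */
          zbuf[x] = (uint16_t)z;
          dst[x] = FetchTexel(s, t);   /* hypothetical texel fetch */
          x++; z += dz; s += ds; t += dt;
      }
      else
      {   /* Z-Fail sub-loop: only step the interpolators */
          do { x++; z += dz; s += ds; t += dt; }
          while ((x < xe) && (z >= zbuf[x]));
      }
  }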
> And two or more scans over the tile buffer lets you figure out exactly
> which pixels of the polygons that are visible need to be shaded and
> textured, versus the possibly not insignificant overdraw a z-buffer
> renderer might waste time and bandwidth on.
> Of course ripping up and changing your existing core probably isn't
> something you'd happily contemplate, so take my suggestion with a large
> grain of salt. ;-)
Also, it isn't likely to offer a huge advantage in this case (with a
single-threaded rasterizer).
Also, possibly counter-intuitively, more time is currently going into
the transform stages than into the raster-drawing parts.
So, for GLQuake, the time budget seems to be, roughly:
~ 50%: Quake itself;
~ 38%: transform stages;
~  4%: Edge Walking;
~ 12%: Span Drawing.
Part of the reason for the transform-stage cost is that GLQuake draws a
fair number of primitives that end up being culled.
Say, for example, Quake tries to draw 1000 primitives in a frame, with
fragmentation adding another 300; of the resulting 1300, 900 are culled
and 400 are drawn.
Typically, the majority are culled due to frustum culling, and also a
fair number due to backface and Z-occlusion checks.
>> For a triangle, it walks from the top vertex to the middle vertex,
>> then recalculates the step values and goes from the middle to the bottom.
>> For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
>> In earlier stages, triangles and quads are treated as two different
>> primitive types.
>> Trying to draw a polygon primitive results in it being decomposed into
>> quads and triangles.
> So why'd you deprecate the quads as primitives?
Originally it only did triangles internally, but then I added quads as a
special case in the transform stages:
Projecting 4 vertices is cheaper than projecting 6;
They tessellate more efficiently;
This stops when it gets to the final rasterization stages, mostly
because the general-case logic for walking the edges of a quad is more
complicated than for a triangle (so didn't bother implementing it).
So, the "WalkEdgesQuad" function makes two calls to the
There are a few special cases that could probably be handled without too
much complexity, but they don't really happen that often in 3D scenery.
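Something like the following (a sketch; the signatures and the
"WalkEdgesTri" name are hypothetical stand-ins, not the TKRA-GL
source):

  /* The quad case falls back to two triangle calls, matching the
     (0,1,2; 0,2,3) split mentioned earlier. */
  void WalkEdgesQuad(vertex *v0, vertex *v1, vertex *v2, vertex *v3)
  {
      WalkEdgesTri(v0, v1, v2);
      WalkEdgesTri(v0, v2, v3);
  }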
> I've been looking through the source code of Core Design's
> un-released/developed game 'TombRaider Anniversary Edition' (not to be
> confused with the Crystal Dynamics game which did get released),
> there's plenty of quad polygon handling functions to be seen, which
> backs up my intuition that tri's and quads should be handled equally
> well if at all possible.
Wasn't aware of any of the code for any of the Tomb Raider games having
been released, but then again I wasn't really much into Tomb Raider.
Back when I was much younger (middle school), my brother had a
PlayStation. Among the games I remember on it: Final Fantasy VII.
There was also a demo CD with demo versions of games like Spyro the
Dragon and Tomb Raider and similar.
By high-school, he had a Dreamcast and the Sonic Adventure games and
similar, along with a PlayStation 2 with games like Grand Theft Auto 3.
I had been mostly into PC stuff at the time (Quake 1/2, etc).
I had preferred Quake 1 over Quake 2, as while Quake 2 had a few things
going for it (a more hub-world-like structure), Quake 1 was more
interesting. In high-school it was mostly Half-Life. I think by the time
I was taking college classes, I was mostly poking around in Half-Life 2
and Garry's Mod, then Portal came out, ...
Well, that was until I started messing around with Minecraft; I have
not really done much else in gaming past this point.
Contrary to some people, I suspect the HL/HL2 era is when graphics got
"good enough": despite newer advances in rendering technology, the
"better graphics" don't really improve the gameplay experience all
that much.
One of the more interesting recent developments is real-time ray-tracing.
However, it is not so easy to write a ray-tracer that is competitive
with a raster renderer. I had experimented (not too long ago) with
implementing a software ray-tracer in the Quake engine, but it fell well
short of usable performance.
This was based on a modified version of Quake's normal line-tracing
code, which also had an issue that the world (as seen by line-tracing)
is not exactly the same as what is seen when drawing it as geometry
(there are a lot of "overhangs" in the BSP where the line-trace thinks
it has hit something solid, but there is no actual geometry there).
Also, line tracing the BSP is a lot slower than one might expect...
Had at one point also tried using line-traces to try to further cull
down the geometry in the PVS, but this turned out to be slower than just
drawing the geometry directly and letting GL sort it out.
IME, line-tracing over a regular grid structure (or an octree) tends to
be more efficient than doing so via a recursive BSP walk (an octree
based engine likely being more efficient if one wants to implement a
ray-cast or ray-tracing renderer).
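For example, a line-trace over a regular grid can be done with a
DDA-style walk (an Amanatides-Woo style traversal). This sketch is not
from the Quake or BJX2 code; all names are hypothetical, and it assumes
<math.h>:

  /* Walk a ray through a grid of cells; returns 1 and the hit cell if
     a solid cell is crossed, 0 if the ray leaves the grid. */
  int GridTrace(const uint8_t *solid, int nx, int ny, int nz,
                float ox, float oy, float oz,    /* origin, cell units */
                float dx, float dy, float dz,    /* direction */
                int *hx, int *hy, int *hz)
  {
      int x = (int)floorf(ox), y = (int)floorf(oy), z = (int)floorf(oz);
      int sx = (dx > 0) ? 1 : -1, sy = (dy > 0) ? 1 : -1,
          sz = (dz > 0) ? 1 : -1;
      /* ray distance between successive cell boundaries, per axis */
      float tdx = (dx != 0) ? fabsf(1.0f / dx) : INFINITY;
      float tdy = (dy != 0) ? fabsf(1.0f / dy) : INFINITY;
      float tdz = (dz != 0) ? fabsf(1.0f / dz) : INFINITY;
      /* ray distance to the first boundary crossing, per axis */
      float tx = (dx != 0) ?
          ((dx > 0) ? ((x + 1) - ox) : (ox - x)) * tdx : INFINITY;
      float ty = (dy != 0) ?
          ((dy > 0) ? ((y + 1) - oy) : (oy - y)) * tdy : INFINITY;
      float tz = (dz != 0) ?
          ((dz > 0) ? ((z + 1) - oz) : (oz - z)) * tdz : INFINITY;

      while ((x >= 0) && (x < nx) && (y >= 0) && (y < ny) &&
             (z >= 0) && (z < nz))
      {
          if (solid[(z * ny + y) * nx + x])
              { *hx = x; *hy = y; *hz = z; return 1; }
          /* step into the next cell across the nearest boundary */
          if ((tx <= ty) && (tx <= tz)) { x += sx; tx += tdx; }
          else if (ty <= tz)            { y += sy; ty += tdy; }
          else                          { z += sz; tz += tdz; }
      }
      return 0;
  }

The main win over a recursive BSP walk is that each step is a couple
of compares and adds, with no pointer chasing or recursion.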
But, OTOH, doing a modified version of Quake where I rebuild all of the
maps from the map source (with custom tools) using an octree and similar
rather than a BSP, is probably "not really worth it".
> <more snip>
>> I could make stuff look a lot better, but this would require:
>> Using RGBA32 buffers and textures;
> 16-bit textures shaded into 24-bit frame buffers was the standard for
> a while when memories were smaller, weren't they? Seemed to be
> acceptable for the time, and probably not too shabby for the
> retro-gaming inclined today either.
Probably. I am using RGB555A, which can sorta mimic RGBA32 and (on
average) looks better than RGBA4444 or similar.
RGBA32 for a framebuffer can look better, but using it would be kind of
a waste when the output framebuffer is using RGB555. And, on the FPGA
board I am using, the VGA output only has 4 bits per component, so even
the RGB555 output is effectively using a Bayer dither in this case.
Though, I did come up with a trick to mostly hide the Bayer dither by
rotating the dither mask for each VGA refresh.
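As a sketch of the idea (a C model of what would presumably live in
the VGA output logic; the exact rotation scheme here is a guess):

  /* Dither a 5-bit component down to 4 bits with a 4x4 Bayer mask,
     sliding the mask position each VGA refresh so the pattern
     averages out over successive frames. */
  static const uint8_t bayer4[4][4] = {
      {  0,  8,  2, 10 },
      { 12,  4, 14,  6 },
      {  3, 11,  1,  9 },
      { 15,  7, 13,  5 }
  };

  uint8_t Dither5To4(int x, int y, int frame, uint8_t c5)
  {
      int u = (x + frame) & 3, v = (y + (frame >> 1)) & 3;
      int up = (c5 & 1) && (bayer4[v][u] < 8); /* round odd half-step */
      int c4 = (c5 >> 1) + up;
      return (uint8_t)((c4 > 15) ? 15 : c4);
  }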
Granted, it is possible that RGB888 could still offer some level of
benefit over RGB555 here.
Had considered going over to Z24.S8 buffers, but this would have meant
rewriting a lot of my span-drawing functions (all of which were written
assuming a 16-bit Z-buffer), and a 32-bit Z+Stencil buffer would be
kind of a waste if stencil is used infrequently (the Quake games don't
use any stencil effects).
Had also looked into Z12.S4, but this caused unacceptable levels of
Z-fighting in my tests. This led to the stencil cases using a separate
stencil buffer (keeping the Z-buffer itself at 16 bits).
>> Not doing as much corner cutting;
>> Generally spending a lot more clock cycles on the 3D rendering;
>> It is pretty hard to try to get playable GLQuake on something running
>> at 50MHz without a GPU.
>> Like, if there is one big advantage that the PlayStation had, it was
>> that it had a GPU.
> Maybe there's a hint to be had there...
> Something like a big-little multi core design,
> your large but leaner WEX core handling all the game input, camera &
> object updates, then vector style churning though all the floating point
> geometry to leave an array of Z-sorted integer polygons that can be fed
> to a number of tiny 16/24bit risc/misc like cores to render into however
> many spare tile buffers you can fit into your fpga.
> And by tiny, I mean like a 16bit 6502 with half the instruction set
> thrown out, if an instruction doesn't aid in placing a texture sampled
> pixel into the tile buffer --- axe it!
It is possible, though if I were to try to fit TKRA-GL to it, it would
likely mean cores that were more like:
2-wide with 64-bit Packed-Integer SIMD;
Probably still needing a 32-bit address space.
Being able to deal with transforms would still likely require FP-SIMD,
but this could be reduced to the S.E8.F16 form (possibly with Lanes 1
and 2 operating independently). Could potentially omit support for
Binary64 FPU ops.
Some of my previously considered GPU related features, I had back-ported
to BJX2 as "proof of concept".
It is possible that the "GPU" could be running a more restricted BJX2
variant.
Not too long ago, I had considered another possibility:
BJX2 core is used as the GPU;
I add a secondary "CPU" core, mostly running RV64 or similar.
Say, the GLQuake engine-side logic is run on a RISC-V core.
Though, my attempts at RISC-V cores have thus far ended up more
expensive than ideal (in my attempts, a full RV64G core would end up
costing *more* than another BJX2 core), and even making it single-issue
isn't really enough to compensate for this.
> I'm thinking a small quantity of tiny independent cores working
> simultaneously might work better over-all than one big complex core
> trying to do it all. YMMV
Only if the work can be split up to use the cores efficiently.
With the current balance, unless these cores could also handle vertex
transform and similar, they won't save much.
A bigger amount of savings would likely be possible with a redesigned
API, possibly:
Rendering state configuration is folded into "objects";
Front-end interface mostly uses fixed-point or similar (see the sketch
below).
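As a hypothetical sketch of what such a front end might look like
(none of these names or fields exist in TKRA-GL as-is):

  /* All rendering state for a batch lives in one object; vertex data
     is fixed-point rather than float. */
  typedef struct {
      uint32_t  tex;     /* texture handle */
      uint32_t  state;   /* blend/depth/cull mode bits */
      int       nvtx;
      int16_t  *xyz;     /* vertex positions, e.g. 8.8 fixed point */
      int16_t  *st;      /* texture coords, e.g. 2.14 fixed point */
      uint16_t *rgb;     /* per-vertex color, RGB555 */
  } DrawObject;

  void DrawObjectSubmit(const DrawObject *obj);  /* one call per batch */

This would cut down on per-call state juggling and on float-to-fixed
conversion work in the transform stages.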
>> Ironically, due to the affine texturing and similar, my GLQuake port
>> tends to look kinda like it was running on a PlayStation.
>> Ironically, I think I may not be too far off from something like the
>> Sega Saturn, given that most of the examples I had seen of Saturn
>> games had very simplistic geometry and lots of pop-in at relatively
>> short distances.
>> Well, contrast with something like Mega-Man Legends (on the
>> PlayStation), which had very simplistic geometry but often fairly
>> large draw distances (in comparison).
>> Like this game was like "Behold, this character has cubes for arms and
>> hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
>> roof overhangs or windows here!"
> I always end with the Crystal Dynamics TombRaider games for my nostalgia
> trips, Lara has plenty of polygons, in all the right places, they even,
> uhh jigg...
> Ummm, better not finish that last sentence, lest the WokePC brigade are
> watching! ;-)
Kinda curious that they did this back then, whereas many later games
(Half-Life 2, Doom 3, etc) didn't bother with these sorts of effects
(they probably could have if they wanted as special cases of ragdoll
within their skeletal animation systems).
Many newer games apparently also do things like soft-body physics and
cloth simulation and similar.
I guess I am in the category of having a certain level of sentimentalism
for characters like Tron Bonne and similar (from the Mega-Man Legends
series), though in part because of finding her character relatable.
>> Then, there is GLQuake, which despite having "simple looking"
>> geometry, a lot of this geometry is already cut into a fair number of
>> small pieces by the BSP algorithm.
>> FWIW: The dynamic tessellation actually only splits a relative
>> minority of primitives, mostly limited to those a short distance from
>> the camera.
> I've always been tempted to write a game engine called the REPYES engine
> - remember every polygon you've ever seen.
> Basically a giant view direction and player position dependent database
> that loads and frees polygons and each of their individual texture maps
> to Vram from main memory, so that older laptops and such with weakish
> GPUs can enjoy near maximum / lush visible poly counts as they work
> their way through a game level.
> But instead of using BSPs to figure it all out, I'd just brute force
> paint polygonIDs into the frame buffer then trace over the buffer and
> record exactly which polygons were visible, step view direction, step
> position, rince repeat over all player accessible regions of the game map.
> But it'll probably never happen, cause, urrr, it's possibly quite a
> stupid thing to do in practice I guess.
> Yeah, best forget I mentioned that. :-)
In one of my more recent experimental engines (with a Minecraft-style
block world), I basically had the camera cast out rays in every
direction and then build a list of visible blocks (recorded as their
world coordinates).
The renderer then draws all of the blocks on this list.
This approach is faster and saves memory on BJX2 when compared with the
"build and draw a vertex array for every chunk" approach.
However, it doesn't scale as well with draw distance, and on a PC with a
GPU, it is faster to use per-chunk vertex arrays and occlusion queries
(however, using ray-casts to build block lists still uses less RAM).
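A sketch of the gathering pass (the angle step, the world arrays, and
the VisSet container are all made up; GridTrace is the grid line-trace
sketched earlier):

  /* Cast a ray per direction sample, recording the first solid block
     each ray hits; VisSetAdd is a hypothetical de-duplicating insert. */
  void GatherVisibleBlocks(float cx, float cy, float cz, VisSet *vis)
  {
      float pitch, yaw;
      int hx, hy, hz;
      for (pitch = -90; pitch <= 90; pitch += 2)
      {
          for (yaw = 0; yaw < 360; yaw += 2)
          {
              float p = pitch * (float)(M_PI / 180);
              float w = yaw   * (float)(M_PI / 180);
              if (GridTrace(world, NX, NY, NZ, cx, cy, cz,
                            cosf(p) * cosf(w), cosf(p) * sinf(w),
                            sinf(p), &hx, &hy, &hz))
                  VisSetAdd(vis, hx, hy, hz);
          }
      }
  }

The real engine presumably distributes rays more cleverly than this
even-angle loop (which over-samples near the poles and can miss blocks
that fall between rays at a distance).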
Performance on my BJX2 core was comparable to Quake, but it can do
outdoors environments (nevermind the limited draw distance in this case).
Ironically, despite running on a 50MHz CPU core, performance still
somehow manages to be better than Minecraft with a similar draw distance
on a Vista era laptop.
Doing something vaguely similar, but with ray-casting over an octree,
also seems possible (where the octree would keep dividing geometry
until it reaches a certain maximum number of polygons).
Unlike Quake, by using a few Minecraft style tricks, it would also be
possible to extend it to arbitrarily large environments. Say, the
top-level world is split up into a grid of 256 meter cube-shaped
regions, with each cube divided into an octree (each region being
roughly the size of a Quake 2 map).
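In terms of data structures, that might look something like the
following (hypothetical, just to illustrate the layout):

  /* A 256-meter cube region, subdivided by an octree that splits
     until a leaf holds at most some maximum polygon count. */
  typedef struct OctNode_s OctNode;
  struct OctNode_s {
      OctNode *child[8];   /* NULL on leaves */
      Polygon *polys;      /* leaf-only geometry */
      int      npolys;     /* split when this exceeds the max */
  };

  typedef struct {
      int      rx, ry, rz; /* region coords, in 256m units */
      OctNode *root;
  } Region;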