On 1/13/2024 3:38 AM, Marcus wrote:
> On 2024-01-07 22:07, Brian G. Lucas wrote:
>> Several posters on comp.arch are running their cpu designs on FPGAs.
>
> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
> a soft processor for FPGA use. Furthermore I implement a kind of
> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>
> Here's a recent video of the computer running Quake:
>
>
> https://vimeo.com/901506667
>
> I use/target two different FPGA boards, but I mainly use one of them for
> development.
>
At least it is going fast...
If I run the emulator at 110 MHz:
  Software Quake gets ~ 10.4 fps.
  GLQuake gets 12.7 fps.
Though, GLQuake performance partly took a hit recently, as I have been
moving away from running the GL backend directly in the program and
instead running it via system calls.
I had partly integrated some features from the version that was stuck
onto the Quake engine into the other branch, which was modified to work
inside the TKGDI process (such as support for the rasterizer module).
Looking at the profiler output, it appears it is still doing a bit of the
GL rendering via the software span-drawing, though.
Though, in the process, it has gone from "hybrid poor man's perspective
correction" back to plain "affine texturing with dynamic subdivision",
with a comparatively finer subdivision.
So, proper perspective correction would involve:
  Divide ST coords by Z before rasterization;
  Interpolate as 1/Z;
  Dynamically calculate Z via "1/(interpolated 1/Z)";
  Scale ST coords by Z during rasterization.
Poor man's version:
  Divide ST coords by Z before rasterization;
  Interpolate as Z;
  Scale ST coords by Z during rasterization.
This version isn't as good as the proper one, and adds some issues of
its own vs affine.
Affine:
  Interpolate ST coords directly (no Z scaling).
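A rough per-pixel sketch of the three schemes in C (assumed for
illustration, not the engine's actual code; the inputs are whatever was
linearly interpolated across the span for the current pixel):

  /* Proper perspective correction: interpolate s/z, t/z and 1/z,
     recover z per pixel with a divide, then rescale. */
  static void texcoord_perspective(float s_over_z, float t_over_z,
                                   float inv_z, float *s, float *t)
  {
      float z = 1.0f / inv_z;
      *s = s_over_z * z;
      *t = t_over_z * z;
  }

  /* "Poor man's" version: interpolate s/z, t/z, and z itself (skipping
     the per-pixel 1/(1/z)), then rescale; cheaper, but less accurate. */
  static void texcoord_poor_mans(float s_over_z, float t_over_z,
                                 float z, float *s, float *t)
  {
      *s = s_over_z * z;
      *t = t_over_z * z;
  }

  /* Affine: interpolate s and t directly, no per-pixel scaling; larger
     primitives then need dynamic subdivision to hide the warping. */
  static void texcoord_affine(float s_lerp, float t_lerp,
                              float *s, float *t)
  {
      *s = s_lerp;
      *t = t_lerp;
  }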
However, larger primitives (*) with affine texturing need to be split
into smaller primitives during rendering, which adds cost in terms of
transform/projection; this seems to be a more significant part of the
overall cost when using the hardware rasterizer module.
Actual perspective correction could be better here, but the "quickly and
semi-accurately calculate 1/(1/Z)" part is a challenge.
*: At the moment, basically any triangle with a circumference larger
than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going too
much bigger makes the affine warping a lot more obvious.
The Software Quake in this case is a modified version of the Quake C
software renderer:
  It was modified early on to use 16-bit pixels rather than 8-bit pixels;
    Initially this was YUV655, but it then went to RGB555.
  A few functions were rewritten in ASM, though still basically all
  scalar code; the bulk of the renderer is still C.
There is still some weirdness in a few places where the math assumes YUV,
which leads to things like the menu-background blend being the wrong
color (never got around to fixing this), ...
(The video there seemed to show a dithered effect, which is a little
different from a color blend.)
Did gain some alpha-blended effects (such as a translucent console),
because these seemed cool at the time, and they aren't too hard to pull
off with RGB pixels.
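For example, a minimal sketch (assumed, not the port's actual code; the
RGB555 channel layout is also an assumption) of blending two RGB555
pixels by an alpha factor, roughly what a translucent console needs per
pixel:

  #include <stdint.h>

  /* Blend src over dst with alpha in 0..31 (31 = fully src).
     Assumes R in bits 10-14, G in bits 5-9, B in bits 0-4. */
  static uint16_t blend_rgb555(uint16_t src, uint16_t dst, int alpha)
  {
      int sr = (src >> 10) & 31, sg = (src >> 5) & 31, sb = src & 31;
      int dr = (dst >> 10) & 31, dg = (dst >> 5) & 31, db = dst & 31;
      int r = (sr * alpha + dr * (31 - alpha)) / 31;
      int g = (sg * alpha + dg * (31 - alpha)) / 31;
      int b = (sb * alpha + db * (31 - alpha)) / 31;
      return (uint16_t)((r << 10) | (g << 5) | b);
  }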
Note that my GLQuake port is still faster than the Quake
software-renderer, even with software-rasterized OpenGL.
Does sort of imply a faster software renderer could still be possible...
Though, in my Doom port, I did eventually go from color-blending (for
things like screen flashes) to integrating the color flash into the
active "colormap" table (*), which is used every time a span or column is
drawn in Doom (not so much in SW Quake, where texturing+lighting is
precalculated via a "surface cache").
*: It is computationally faster to RGB-blend the current version of the
colormap table than to RGB-blend the final screen image (with
menus/status-bar/etc. drawn via the unblended colormap).
Though, I did once experiment with eliminating the colormap table
entirely in Doom and using purely RGB modulation (like one might do in a
GL-style rasterizer), but this was slower than using the colormap table.
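The cost argument, as a rough sketch (assumed code, not the actual port;
the resolution, colormap size, and 50/50 packed blend are illustrative):

  #include <stdint.h>

  #define CMAP_SIZE  256
  #define SCREEN_W   320
  #define SCREEN_H   200

  /* cheap 50/50 blend of two RGB555 pixels without unpacking channels */
  static inline uint16_t avg555(uint16_t a, uint16_t b)
  {
      return (uint16_t)(((a & 0x7BDE) >> 1) + ((b & 0x7BDE) >> 1));
  }

  /* ~256 blends per frame; spans/columns drawn afterwards pick up the
     tint automatically via the table */
  void flash_via_colormap(uint16_t *cmap, uint16_t flash)
  {
      for (int i = 0; i < CMAP_SIZE; i++)
          cmap[i] = avg555(cmap[i], flash);
  }

  /* ~64000 blends per frame if applied to the final image instead */
  void flash_via_screen(uint16_t *fb, uint16_t flash)
  {
      for (int i = 0; i < SCREEN_W * SCREEN_H; i++)
          fb[i] = avg555(fb[i], flash);
  }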
At the moment, Doom at least mostly holds over 20 fps (at 50 MHz), having
gained a few fps on average from a recent experimental optimization:
Temporary variables which are used exclusively as function-call inputs
can have their expression evaluated directly into the register
corresponding to the function argument, rather than first going to a
callee-save register and then being MOV'ed into the final argument
register.
Effect seems to be:
  Makes the binary 3% smaller;
  Makes Doom roughly 9% faster;
  Drops "MOV Reg,Reg" from being ~ 16% of total ops to ~ 13%;
  Causes the number of bundled instructions to drop by 1% though;
  ...
Note that this only applies to temporaries, not to expressions performed
via local variables or similar, which still use callee-save registers.
Had sort of hoped it would save more, but it seems like many of the MOVs
for function arguments are coming from local variables rather than
temporaries (and, unlike a temporary, the contents of a local variable
still need to be intact after the function call).
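A hypothetical illustration of the temporary-vs-local distinction (the
pseudo-assembly in the comments is made up for explanation; the register
names are not the actual BJX2 ABI):

  int bar(int v);

  int call_with_temporary(int x)
  {
      /* "x + 1" is a temporary used only as a call input, so it can be
         evaluated directly into the first argument register:
           old: ADD Rtmp, Rx, 1 ; MOV Rarg0, Rtmp ; call bar
           new: ADD Rarg0, Rx, 1 ; call bar                      */
      return bar(x + 1);
  }

  int call_with_local(int x)
  {
      /* "y" is a local variable whose value is still needed after the
         call, so it stays in a callee-save register and still needs a
         MOV into the argument register before the call.          */
      int y = x + 1;
      return bar(y) + y;
  }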
After a fair bit of debugging (to get the built program to not be
entirely broken), this change has a more obvious effect on the
performance of ROTT (which gets around 70% faster and ~ 6% smaller).
(Though, there is some other unresolved, less-recent bug that seems to
cause MIDI playback in ROTT to sound like broken garbage.)
Though, for ROTT this wasn't isolated from a few other recent
optimizations:
  Eliminating the initial condition check for "for()" loops of the form
      for(i=M; i<N; i++)
    when M and N are both constant and M<N (sketched after this list);
  Reworking "*ptr++" in the RIL-IR stage to eliminate an extra "MOV";
    This also eliminates an extra temporary (which manifested as the
    "MOV"); it involved detecting and handling this as a single operation
    and generating the RIL3 stack-operations in a different order.
    Didn't bother detecting/handling preincrement cases yet though.
  Making expressions like "x=*ptr;" not use an extra temporary;
  ...
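The "for()" case, as a small assumed example (not from the actual
codebase):

  /* With M=0, N=16 both constant and M<N, the body is known to run at
     least once, so the compiler can skip the entry test and only test
     at the bottom of the loop, roughly as if it were written: */
  void clear16(int *arr)
  {
      int i = 0;
      do {
          arr[i] = 0;
          i++;
      } while (i < 16);
  }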
Well, and other changes:
  Making the size limit for inline "memcpy()" smaller, and adding a
  copy-slide and generated memcpy's for the intermediate cases.
    Was:
      < 128 bytes: generate inline;
      < 512 bytes: maybe special-case inline (if speed-optimized).
    Now:
      < 64 bytes: generate inline;
      < 512 bytes: call a generated unrolled copy-loop/slide.
  This is mostly because handling larger cases inline is bulky:
  it takes around 512 bytes of ".text" to copy 512 bytes inline...
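For reference, a minimal sketch of the copy-slide idea in C (assumed for
illustration; the generated code is presumably plain ASM, and the chunk
size and limits here are made up): an unrolled run of fixed-size copies
that can be entered part-way through, Duff's-device style, so one shared
blob covers a range of sizes (shown here only up to 64 bytes).

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Copies the first (n/8)*8 bytes by jumping into an unrolled run of
     8-byte copies; a real slide would handle the tail bytes as well. */
  static void copy_slide(void *dst, const void *src, size_t n)
  {
      uint8_t *d = (uint8_t *)dst;
      const uint8_t *s = (const uint8_t *)src;

      switch (n / 8) {   /* enter the slide at the right depth */
      case 8: memcpy(d + 56, s + 56, 8);  /* fall through */
      case 7: memcpy(d + 48, s + 48, 8);  /* fall through */
      case 6: memcpy(d + 40, s + 40, 8);  /* fall through */
      case 5: memcpy(d + 32, s + 32, 8);  /* fall through */
      case 4: memcpy(d + 24, s + 24, 8);  /* fall through */
      case 3: memcpy(d + 16, s + 16, 8);  /* fall through */
      case 2: memcpy(d +  8, s +  8, 8);  /* fall through */
      case 1: memcpy(d +  0, s +  0, 8);  /* fall through */
      case 0: break;
      }
  }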
Some of this is basically a case of going through some debug ASM and
looking for "stupid instruction sequences", and trying to figure out
what causes them and how to fix it.
However, "obvious cases that save lots of instructions" are becoming
much less common.
And some other optimizations, such as constant propagation, would be a
lot more difficult to pull off... where, say, the value of a constant is
seen via a variable rather than a "#define" or similar. (My compiler
already has the simpler optimization of replacing expressions like "2+3"
with "5".)
The big problem with constant propagation is that whether or not a
constant can be propagated depends on local visibility and control flow
(and it would likely be of very limited effectiveness if it could not
cross boundaries between basic blocks).
For example, if it could not cross a basic-block boundary, it would still
have been N/A for the previous "for() loop" optimization (which in this
case was handled via AST-level pattern matching).
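A hypothetical example of the basic-block issue: here "n" is effectively
a constant, but the branch splits the function into several basic blocks
between the assignment and the use.

  void fill(int *arr, int flag)
  {
      int n = 16;                  /* constant, but held in a variable */
      if (flag)
          arr[0] = -1;             /* branch: a new basic block starts */
      for (int i = 0; i < n; i++)  /* propagating n==16 here means
                                      tracking it across those blocks */
          arr[i] = i;
  }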
Some of the remaining inefficiencies cross multiple levels of the
compiler, which is annoying...
Then there are a lot of things that GCC does that I have little idea how
to pull off at the moment.
For example, it assigns local variables to registers in ways that seem to
be localized yet flow across basic-block boundaries; currently BGBCC does
nothing of the sort (the closest it can do is rank the most-used
variables and statically assign them to registers for the scope of the
whole function, with anything else using spill-and-fill via the stack
frame).
Sadly, despite having 64 GPRs, I still have not entirely eliminated the
use of spill and fill. The mechanism that can eliminate spill-and-fill at
a function scale (by assigning everything to registers) is basically
defeated as soon as anything takes the address of a local variable or
similar (the whole function falls back to the generic strategy, and the
local variable in question goes over to not caching its value in a
register at all, instead using spill/fill every time that variable is
accessed, anywhere in the function...).
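A small made-up illustration of that failure case (the get_extra()
helper is hypothetical):

  void get_extra(int *out);        /* hypothetical helper */

  int sum_plus_extra(int *vals, int count)
  {
      int acc = 0;                 /* could otherwise live in a register */
      int tmp;
      for (int i = 0; i < count; i++)
          acc += vals[i];
      get_extra(&tmp);             /* &tmp taken: tmp is spilled/filled on
                                      every access, and the whole function
                                      drops back to the generic strategy */
      return acc + tmp;
  }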
...
>> I have several questions:
>> 1. Which particular FPGA chip? (not just family but the particular SKU)
>
> a) Intel Cyclone-V 5CEBA4F23C7N (my main development FPGA)
>
> b) Intel MAX 10 10M50DAF484 (this is the smaller one of the two)
>
Mostly still XC7A100T and XC7A200T.
The advantage of the latter in this case is that I can fit multiple
cores, where the single-core config uses around 70% of an XC7A100T.
With some limitations, I can sorta shoehorn it into an XC7S50, though not
with the entire feature set.
>> 2. On what development board?
>
> a) Terasic DE0-CV
>
> b) Terasic DE10-Lite
>
Had once looked into these, but didn't get them, as they weren't super
cheap and were different enough to require some porting effort.
Did at one point synthesize the BJX2 core in Quartus though...
>> 3. Using what tools?
>
> Development: Sublime Text + VS Code + GHDL + gtkwave (all free).
>
> Programming: Intel Quartus Prime Lite Edition, v19.1.0 (it's free).
>
All Verilog here...
It seems the dialect I am using is some sort of intermediate between
Verilog and SystemVerilog: Vivado accepts it as Verilog, but for Quartus
I needed to tell it that it was SystemVerilog.
Otherwise, it seems to work fine in Verilator and similar as well.
>>
>> Thanks,
>> brian
>
>