
Whither the Mill?


Stephen Fuld

Dec 13, 2023, 11:25:46 AM
When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

Inquiring minds want to know.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Scott Lurndal

Dec 13, 2023, 12:32:58 PM
Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>When we last heard from the merry band of Millers, they were looking for
>substantial funding from a VC or similar. I suppose that if they had
>gotten it, we would have heard, so I guess they haven't.
>
>But I think there are things they could do to move forward even without
>a large investment. For example, they could develop an FPGA based
>system, even if it required multiple FPGAs on a custom circuit board for
>not huge amounts of money. Whether this is worthwhile, I cannot say.
>

There might even be some way of renting time on a real
emulator from Cadence (Palladium) or Synopsys (ZeBu).

Although in my experience those who have them use them
24x7.

BGB

Dec 14, 2023, 2:25:50 AM
Yeah, don't think I have seen anything from Ivan on here in a while...


In my case, I was doing everything on Spartan-7 and Artix-7 boards, and
had OK results (within the limits of what is possible on an FPGA).

Kinda wish it could be faster, but alas.

Sadly, anything much bigger (or faster) than the XC7A200T actually
requires paying money for the non-free version of Vivado...



Ironically, this makes the XC7A200T more valuable in a way than the
XC7K325T, as while technically smaller and weaker, it is basically the
biggest FPGA one can get before needing to hand over absurd amounts of
money to AMD/Xilinx.


Well, there is also the XC7K70T, which is technically faster, but has
far fewer LUTs.

And the XC7K160T, which is faster and only slightly smaller, but
significantly more expensive.



But, if one wants an FPGA cheap enough that one could put it in a
product and potentially have someone be willing to buy it, one would
need to aim a little lower here. One can't put all that fancy of a
soft-processor in an XC7S50 or XC7A35T, but these would be more
reasonable to put into consumer-electronics devices.


Though, sadly, a soft-processor can't really match something like an ARM
chip in terms of performance per dollar. It would be nice if *someone*
could dethrone ARM in terms of perf/$ (RISC-V holds promise, but only
really if someone releases a chip that is both cheap and fast).

And, custom ASICs are only really an option if one has a huge amount of
money up-front.


Printable electronics with semi-conductive ink seem promising, but even
here, the ink is stupid expensive and one would still need to build a
special-purpose printer to be able to make use of it (and it is not
particularly high-density, so the result would probably be physically
much larger and slower than a design running on an FPGA).

Though, not really sure what the densities or clock speeds of printed
electronics are like.

...


Though, at least in my case, it is all mostly a hobby project.
Unless "someone with a lot of money" thinks it is cool.

...


Though, one possible upside of my project being FPGA based is that I
could theoretically get one of those Game Boy-like FPGA-based emulator
handhelds and port my stuff to it, though most of these devices seem to
be based around the Cyclone V for whatever reason, ...


George Neuner

Dec 15, 2023, 12:48:08 PM
There was a post, ostensibly from Ivan, in their web forum just a few
days ago. No news though - just an acknowledgement of another user's
post.


Last I heard, the next (current?) round of financing was - at least in
part - to be used for FPGA "proof of concept" implementations.

Problem is the Mill really is a SoC, and (to me at least) the design
appears to be so complex that it would require a large, top-of-line
(read "expensive") FPGA to fit all the functionality.

Then there is their idea that everything - from VHDL to software build
toolchain to system software - be automatically generated from a
simple functional specification. Getting THAT right is likely proving
far more difficult than simply implementing a fixed design in an FPGA.

YMMV,
George

BGB

Dec 15, 2023, 2:05:57 PM
On 12/15/2023 11:48 AM, George Neuner wrote:
> On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
> <sf...@alumni.cmu.edu.invalid> wrote:
>
>> When we last heard from the merry band of Millers, they were looking for
>> substantial funding from a VC or similar. I suppose that if they had
>> gotten it, we would have heard, so I guess they haven't.
>>
>> But I think there are things they could do to move forward even without
>> a large investment. For example, they could develop an FPGA based
>> system, even if it required multiple FPGAs on a custom circuit board for
>> not huge amounts of money. Whether this is worthwhile, I cannot say.
>>
>> Anyway, has all development stopped? Or is their "sweat equity" model
>> still going on?
>>
>> Inquiring minds want to know.
>
> There was a post, ostensibly from Ivan, in their web forum just a few
> days ago. No news though - just an acknowledgement of another user's
> post.
>
>
> Last I heard, the next (current?) round of financing was - at least in
> part - to be used for FPGA "proof of concept" implementations.
>
> Problem is the Mill really is a SoC, and (to me at least) the design
> appears to be so complex that it would require a large, top-of-line
> (read "expensive") FPGA to fit all the functionality.
>

Yeah. The lower end isn't cheap; the upper end is absurd...

For FPGA's over $1k, almost makes more sense to ignore that they exist
(also this appears to be around the cutoff point for the free version of
Vivado as well; but one would have thought Xilinx would have already
gotten their money by someone having bought the FPGA?...).


> Then there is their idea that everything - from VHDL to software build
> toolchain to system software - be automatically generated from a
> simple functional specification. Getting THAT right is likely proving
> far more difficult than simply implementing a fixed design in an FPGA.
>

Yeah.

Long ago, I watched another project (FoNC, led by Alan Kay) that was
also trying to go this route. I think the idea was that they wanted to
try to find a way to describe the entire software stack (from OS to
applications) in under 20k lines.

Practically, it seemed to mostly end up going nowhere best I can tell, a
lot of "design", nothing that someone could actually use.



Though, if one sets the limits a little higher, there is a lot one can do:
One can at least, surely, make a usable compiler tool chain in under 1
million lines of code (at present, BGBCC weighs in at around 250 kLOC,
could be smaller; but, fitting a "basically functional" C compiler into
30k lines, or around the size of the Doom engine, seems a little harder).

Though, an intermediate option, would be trying to pull off a "semi
decent" compiler in under 100K lines.



If the compiler is kept smaller, it is faster to recompile from source.

Also, it would be nice to have a basically usable OS and core software
stack in under 1M lines.


Say, by not trying to be everything to everyone, and limiting how much
is allowed in the core OS (or is allowed within the build process for
the core OS).

Though, within moderate limits, 1M lines would basically be enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).
A (moderate sized) C compiler;
(but not GCC, which is also well over this size limit).
A shell+utils comparable to BusyBox;
Various core OS libraries and similar, etc.

For this, will assume an at least nominally POSIX like environment.


Programs that run on the OS would not be counted in the line-count budget.

How to deal with multi-platform portability would be more of an open
question, as this sort of thing tends to be a big source of code
expansion (or, for an OS kernel, the matter of hardware drivers, ...).

But, as can be noted, pretty much any project that gains mainstream
popularity seems to spiral out of control regarding code-size.


> YMMV,
> George

MitchAlsup

Dec 15, 2023, 4:01:52 PM
BGB wrote:

> On 12/15/2023 11:48 AM, George Neuner wrote:
>> On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
>> <sf...@alumni.cmu.edu.invalid> wrote:
>>
>>> When we last heard from the merry band of Millers, they were looking for
>>> substantial funding from a VC or similar. I suppose that if they had
>>> gotten it, we would have heard, so I guess they haven't.
>>>
>>> But I think there are things they could do to move forward even without
>>> a large investment. For example, they could develop an FPGA based
>>> system, even if it required multiple FPGAs on a custom circuit board for
>>> not huge amounts of money. Whether this is worthwhile, I cannot say.
>>>
>>> Anyway, has all development stopped? Or is their "sweat equity" model
>>> still going on?
>>>
>>> Inquiring minds want to know.
>>
>> There was a post, ostensibly from Ivan, in their web forum just a few
>> days ago. No news though - just an acknowledgement of another user's
>> post.
>>
>>
>> Last I heard, the next (current?) round of financing was - at least in
>> part - to be used for FPGA "proof of concept" implementations.
>>
>> Problem is the Mill really is a SoC, and (to me at least) the design
>> appears to be so complex that it would require a large, top-of-line
>> (read "expensive") FPGA to fit all the functionality.
>>

> Yeah. The lower end isn't cheap; the upper end is absurd...

Look into the cost of making a mask-set at 7nm or at 3nm. Then we can
have a discussion on how high the number has to be to rate absurd.

> For FPGA's over $1k, almost makes more sense to ignore that they exist
> (also this appears to be around the cutoff point for the free version of
> Vivado as well; but one would have thought Xilinx would have already
> gotten their money by someone having bought the FPGA?...).


>> Then there is their idea that everything - from VHDL to software build
>> toolchain to system software - be automatically generated from a
>> simple functional specification. Getting THAT right is likely proving
>> far more difficult than simply implementing a fixed design in an FPGA.
>>

> Yeah.

> Long ago, I watched another project (FoNC, led by Alan Kay) that was
> also trying to go this route. I think the idea was that they wanted to
> try to find a way to describe the entire software stack (from OS to
> applications) in under 20k lines.

Was the language of choice APL-like ??

> Practically, it seemed to mostly end up going nowhere best I can tell, a
> lot of "design", nothing that someone could actually use.



> Though, if one sets the limits a little higher, there is a lot one can do:
> One can at least, surely, make a usable compiler tool chain in under 1
> million lines of code (at present, BGBCC weighs in at around 250 kLOC,
> could be smaller; but, fitting a "basically functional" C compiler into
> 30k lines, or around the size of the Doom engine, seems a little harder).

> Though, an intermediate option, would be trying to pull off a "semi
> decent" compiler in under 100K lines.



> If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

We now have compilers struggling to achieve 10,000 lines per second per CPU
with machines of 0.2ns cycle time -- 750× faster {times the number of CPUs
thrown at the problem.}

> Also, it would be nice to have a basically usable OS and core software
> stack in under 1M lines.

There is no salable market for an OS that sheds features for compactness.

> Say, by not trying to be everything to everyone, and limiting how much
> is allowed in the core OS (or is allowed within the build process for
> the core OS).

> Though, within moderate limits, 1M lines would basically be enough to fit:
> A basic kernel;
> (this excludes the Linux kernel, which is well over the size limit).

If there were an efficient way to run the device driver stack in user-mode
without privilege and only the MMIO pages this driver can touch mapped
into his VAS. Poof none of the driver stack is in the kernel. --IF--

> A (moderate sized) C compiler;
> (but not GCC, which is also well over this size limit).

In 1990 C was a small language, In 2023 that statement is no longer true.
In 1990 the C compiler had 2 or 3 passes, in 2023 the LLVM compiler has
<what> 35 passes (some of them duplicates as one pass converts into some-
thing a future pass will convert into something some other pass can
optimize.)
In 1990 your C compiler ran natively on your machine.
In 2023 your LLVM compiler compiles 6+ front end languages and compiles
to 20+ target ISAs and has to produce good code on all of them.

> A shell+utils comparable to BusyBox;

Until someone prevents someone else from writing new shells, filters,
and utilities, there is no way to moderate the growth in Shell+utils.

> Various core OS libraries and similar, etc.

> For this, will assume an at least nominally POSIX like environment.


> Programs that run on the OS would not be counted in the line-count budget.

> How to deal with multi-platform portability would be more of an open
> question, as this sort of thing tends to be a big source of code
> expansion (or, for an OS kernel, the matter of hardware drivers, ...).

> But, as can be noted, pretty much any project that gains mainstream
> popularity seems to spiral out of control regarding code-size.

With 20TB disk drives, 32 GB main memory sizes, Fiber internet;
what is the reason for worrying about something you can do almost
nothing about.


>> YMMV,

Indeed.

>> George

EricP

Dec 15, 2023, 4:18:07 PM
Found a recent article that says Xilinx prices run from $8 to $100,
low-end Intel FPGAs start at $3, but the high-end Stratix models
go from $10,000 to $100,000.



Scott Lurndal

Dec 15, 2023, 5:39:42 PM
mitch...@aol.com (MitchAlsup) writes:
>BGB wrote:

>> For FPGA's over $1k, almost makes more sense to ignore that they exist
>> (also this appears to be around the cutoff point for the free version of
>> Vivado as well; but one would have thought Xilinx would have already
>> gotten their money by someone having bought the FPGA?...).

For anyone serious, a verification engineer can cost $500-1000/day. The
FPGA cost is in the noise.

For a hobby? Well...


>> If the compiler is kept smaller, it is faster to recompile from source.
>
>In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>10,000 lines of code per second for an IBM-like minicomputer (less decimal
>and string) and did a pretty good job of spitting out high performance
>code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code for than languages like Modula, Pascal, C, et alia.

>> Though, within moderate limits, 1M lines would basically be enough to fit:
>> A basic kernel;
>> (this excludes the Linux kernel, which is well over the size limit).
>
>If there were an efficient way to run the device driver stack in user-mode
>without privilege and only the MMIO pages this driver can touch mapped
>into his VAS. Poof none of the driver stack is in the kernel. --IF--

That's actually quite common, and one of the raisons d'etre of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user-mode virtual address space), the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.

An IOMMU provides memory protection for DMA operations initiated
by the virtual function, ensuring it only accesses the application's
virtual address space.
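
Not exactly the virtio arrangement described above, but as a minimal
sketch of the general idea (map the device's MMIO into user space and
wait for interrupts from there), here is roughly what it looks like with
the Linux UIO framework; the device path, region size, and register
offsets are made up for illustration:

/* Minimal sketch: map a device BAR into user space via Linux UIO
   and wait for an interrupt. Assumes the device (or VF) has been
   bound to a uio driver; /dev/uio0, the 4096-byte region size, and
   the register offsets here are assumptions, not a real device. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map region 0 (the first BAR) into our virtual address space. */
    volatile uint32_t *regs =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    regs[0] = 1;            /* poke a (hypothetical) control register */

    uint32_t irq_count;     /* read() blocks until the next interrupt */
    read(fd, &irq_count, sizeof(irq_count));
    printf("interrupt #%u, status=%08x\n", irq_count, regs[0x10 / 4]);
    return 0;
}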

MitchAlsup

Dec 15, 2023, 6:06:48 PM
Why should a device be able to access user VAS outside of the buffer the
user provided, OH so long ago ??

BGB-Alt

Dec 15, 2023, 6:20:24 PM
This sort of thing is only really within reach of big companies...

The Spartan and Artix boards are within reach of hobbyists.
Kintex is, sorta, if a person has a lot of money to burn on it.


>> For FPGA's over $1k, almost makes more sense to ignore that they exist
>> (also this appears to be around the cutoff point for the free version
>> of Vivado as well; but one would have thought Xilinx would have
>> already gotten their money by someone having bought the FPGA?...).
>
>
>>> Then there is their idea that everything - from VHDL to software build
>>> toolchain to system software - be automatically generated from a
>>> simple functional specification.  Getting THAT right is likely proving
>>> far more difficult than simply implementing a fixed design in an FPGA.
>>>
>
>> Yeah.
>
>> Long ago, I watched another project (FoNC, led by Alan Kay) that was
>> also trying to go this route. I think the idea was that they wanted to
>> try to find a way to describe the entire software stack (from OS to
>> applications) in under 20k lines.
>
> Was the language of choice APL-like ??
>

Alan Kay was known for Smalltalk, and the languages they were using were
using a Smalltalk like syntax IIRC.

I never really got much into Smalltalk though as it tended to be
difficult to make sense of.

But, I guess, they didn't achieve the goals of either keeping it under
the size limit, or of making something usable.


>> Practically, it seemed to mostly end up going nowhere best I can tell,
>> a lot of "design", nothing that someone could actually use.
>
>
>
>> Though, if one sets the limits a little higher, there is a lot one can
>> do:
>> One can at least, surely, make a usable compiler tool chain in under 1
>> million lines of code (at present, BGBCC weighs in at around 250 kLOC,
>> could be smaller; but, fitting a "basically functional" C compiler
>> into 30k lines, or around the size of the Doom engine, seems a little
>> harder).
>
>> Though, an intermediate option, would be trying to pull off a "semi
>> decent" compiler in under 100K lines.
>
>
>
>> If the compiler is kept smaller, it is faster to recompile from source.
>
> In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
> 10,000 lines of code per second for an IBM-like minicomputer (less
> decimal and string) and did a pretty good job of spitting out high
> performance
> code; on a machine with a 150ns cycle time.
>
> We now have compilers struggling to achieve 10,000 lines per second per CPU
> with machines of 0.2ns cycle time -- 750× faster {times the number of CPUs
> thrown at the problem.}
>

If the compiler is 250k lines of C, it can still compile in a few
seconds on a modern PC.

If it is several million lines with a bunch of C++ thrown in (or
entirely in C++), then it takes a bit longer.


Recompiling LLVM and Clang is a bit much even with a fairly beefy PC.


>> Also, it would be nice to have a basically usable OS and core software
>> stack in under 1M lines.
>
> There is no salable market for an OS that sheds features for compactness.
>

Could be easier to port to new targets, less RAM and space needed.
If the footprint is small enough to fit on a moderately cheap SPI Flash,
one can use a moderately cheap SPI Flash.


Though, for end-user use, one is probably going to need things like a
web-browser and similar, and "small but actually useful" web browser
probably isn't going to happen (IOW: people aren't going to use
something that can't do much more than a basic subset of static HTML).


>> Say, by not trying to be everything to everyone, and limiting how much
>> is allowed in the core OS (or is allowed within the build process for
>> the core OS).
>
>> Though, within moderate limits, 1M lines would basically be enough to
>> fit:
>>    A basic kernel;
>>      (this excludes the Linux kernel, which is well over the size limit).
>
> If there were an efficient way to run the device driver stack in user-mode
> without privilege and only the MMIO pages this driver can touch mapped
> into his VAS. Poof none of the driver stack is in the kernel.  --IF--
>

Yeah, or "superusermode" drivers (in my scheme).

Where the drivers aren't technically in the kernel, but still have
access to hardware MMIO and similar.


Though, absent some design changes, superusermode can easily bypass my
existing memory protection scheme if it so chooses. Would need to come
up with a way to allow actual usermode tasks to be able to have
selective access to MMIO to be able to have any hope of protecting the
OS from malicious drivers.


Though, if it needs to run on x86 or ARM, this is more of a problem, and
there is likely little practical alternative other than:
Run drivers in bare kernel space;
Run drivers in logical processes with a bunch of extra overhead.
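
As a very rough sketch of what the "selective MMIO access for user
tasks" idea above could look like (everything here is hypothetical; the
names do not exist in any real kernel):

/* Hypothetical sketch only: grant one task access to a single
   device's MMIO window by mapping it uncached into that task's
   address space. PAGE_SIZE, task_map_page(), and the PTE flags
   are stand-ins invented for illustration. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE    4096u
#define PTE_PRESENT  (1u << 0)
#define PTE_WRITE    (1u << 1)
#define PTE_USER     (1u << 2)
#define PTE_NOCACHE  (1u << 3)

struct task;   /* per-task page table lives in here */
extern int task_map_page(struct task *tsk, uintptr_t va,
                         uintptr_t pa, unsigned flags);

int map_mmio_into_task(struct task *tsk,
                       uintptr_t mmio_phys, size_t len,
                       uintptr_t user_va)
{
    /* Round the window out to whole pages. */
    size_t npages = (len + PAGE_SIZE - 1) / PAGE_SIZE;

    for (size_t i = 0; i < npages; i++) {
        int rc = task_map_page(tsk,
                               user_va   + i * PAGE_SIZE,
                               mmio_phys + i * PAGE_SIZE,
                               PTE_PRESENT | PTE_WRITE |
                               PTE_USER | PTE_NOCACHE);
        if (rc != 0)
            return rc;   /* real code would unmap what was mapped */
    }
    return 0;
}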


>>    A (moderate sized) C compiler;
>>      (but not GCC, which is also well over this size limit).
>
> In 1990 C was a small language, In 2023 that statement is no longer true.
> In 1990 the C compiler had 2 or 3 passes, in 2023 the LLVM compiler has
> <what> 35 passes (some of them duplicates as one pass converts into some-
> thing a future pass will convert into something some other pass can
> optimize.)
> In 1990 your C compiler ran natively on your machine.
> In 2023 your LLVM compiler compiles 6+ front end languages and compiles
> to 20+ target ISAs and has to produce good code on all of them.
>

C proper hasn't changed *that* much.
C++ kinda wrecks this.


>>    A shell+utils comparable to BusyBox;
>
> Until someone prevents someone else from writing new shells, filters,
> and utilities, there is no way to moderate the growth in Shell+utils.
>

Yeah...

If you want something like Bash + GNU CoreUtils, it is going to be a lot
bigger than something along the lines of Ash + BusyBox.


I was considering possibly reworking how the shell works in my case;
currently the shell is in the kernel (though it now splits off into
separate tasks for each shell instance), but a design more akin to
BusyBox could make more sense.

But, not entirely a fan of GPL (which BusyBox uses), and while ToyBox
has a better license, I am admittedly less of a fan of the main author
(in past interactions he had acted like a condescending jerk, this isn't
really a win for me even if the design and license seems good in other
areas).



>>    Various core OS libraries and similar, etc.
>
>> For this, will assume an at least nominally POSIX like environment.
>
>
>> Programs that run on the OS would not be counted in the line-count
>> budget.
>
>> How to deal with multi-platform portability would be more of an open
>> question, as this sort of thing tends to be a big source of code
>> expansion (or, for an OS kernel, the matter of hardware drivers, ...).
>
>> But, as can be noted, pretty much any project that gains mainstream
>> popularity seems to spiral out of control regarding code-size.
>
> With 20TB disk drives, 32 GB main memory sizes, Fiber internet;
> what is the reason for worrying about something you can do almost
> nothing about.
>

The issue is more about porting effort and compile times and similar,
than storage or downloading...


>
>>> YMMV,
>
> Indeed.
>
>>> George

Scott Lurndal

Dec 15, 2023, 7:04:15 PM
Because the device wants to do DMA directly into or from the users
virtual address space. Bulk transfer, not MMIO accesses.

Think network controller fetching packets from userspace.

Niklas Holsti

Dec 16, 2023, 2:22:38 AM
On 2023-12-16 0:39, Scott Lurndal wrote:
> mitch...@aol.com (MitchAlsup) writes:

[snip]

>> In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>> 10,000 lines of code per second for an IBM-like minicomputer (less decimal
>> and string) and did a pretty good job of spitting out high performance
>> code; on a machine with a 150ns cycle time.
>
> As did our COBOL compiler (which ran in 50KB).


Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.

Anton Ertl

Dec 16, 2023, 7:16:19 AM
Especially given that 10Klines/s is probably around 500KB/s which has
to be read from disk and probably a similar amount that has to be
written to disk. What were the I/O throughputs available at the time?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

Dec 16, 2023, 7:30:51 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Niklas Holsti <niklas...@tidorum.invalid> writes:
>>On 2023-12-16 0:39, Scott Lurndal wrote:
>>> mitch...@aol.com (MitchAlsup) writes:
>>
>> [snip]
>>
>>>> In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>>>> 10,000 lines of code per second for an IBM-like minicomputer (less decimal
>>>> and string) and did a pretty good job of spitting out high performance
>>>> code; on a machine with a 150ns cycle time.
>>>
>>> As did our COBOL compiler (which ran in 50KB).
>>
>>
>>Are you both sure that those numbers are really lines per *second*? They
>>seem improbably high, and compilation speeds in those years used to be
>>stated in lines per *minute*.
>
> Especially given that 10Klines/s is probably around 500KB/s which has
> to be read from disk and probably a similar amount that has to be
> written to disk. What were the I/O throughputs available at the time?

It depends a bit on how the Fortran and Cobol statements were stored.
If they were stored in punched card format, 80 characters per line,
then it would be 800000 characters per second read. Object code,
probably much less, but the total could still come to around
1 MB/s.

The IBM 3350 (introduced in 1975) is probably fairly representative
of the high end of that era, it had a data transfer speed of 1198
kB/second, and a seek time of 25 milliseconds.

So, 10000 lines/s would almost definitely have been I/O bound at the
time.

Scott Lurndal

Dec 16, 2023, 10:11:08 AM
Yes, lines per minute is the proper metric. Note that for many
years, the compilation rate was bounded by the speed of the card
reader (300 to 600 cards per minute).

moi

Dec 16, 2023, 1:04:53 PM
Almost certainly per minute.
I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
It achieved 20K cards per minute and was considered to be very fast.

--
Bill F.

BGB

Dec 16, 2023, 1:45:39 PM
Lines per minute seems to make sense.


Modern PC's are orders of magnitude faster, but still don't have
"instant" compile times by any means.

Could be faster though, but would likely need languages other than C or
(especially) C++.

For both languages, one has the overheads of needing to read in a whole
lot of header code (often expanding out to 100s of kB or sometimes a few
MB) often for only 5-20kB of actual source code.

C++ then ruins compiler speed with things like templates.

Though, final code generation often does take some extra time.

For example, in BGBCC, a lot of time tends to be spent in the
"WEXifier", which mostly tries to shuffle instructions around and bundle
them in parallel (a lot of this being in terms of the code for figuring
out whether instructions can swap places, be run in parallel, and for
comparing relative costs).

...
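
To give a rough flavor of the kind of check involved (a generic sketch,
not BGBCC's actual code or data structures; all names are invented):

/* Generic sketch of an "is it safe to swap these two instructions"
   check, based on register read/write sets and memory access flags. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t regs_read;         /* bitmask of registers read    */
    uint64_t regs_written;      /* bitmask of registers written */
    bool     reads_mem;
    bool     writes_mem;
    bool     has_side_effects;  /* branches, system ops, ...    */
} instr_info;

static bool can_swap(const instr_info *a, const instr_info *b)
{
    if (a->has_side_effects || b->has_side_effects)
        return false;

    /* RAW, WAR, and WAW hazards between the two instructions. */
    if (a->regs_written & (b->regs_read | b->regs_written))
        return false;
    if (b->regs_written & a->regs_read)
        return false;

    /* Be conservative about memory: two loads may reorder, but
       never reorder across a store without alias analysis. */
    if ((a->writes_mem && (b->reads_mem || b->writes_mem)) ||
        (b->writes_mem && a->reads_mem))
        return false;

    return true;
}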

MitchAlsup

Dec 16, 2023, 2:01:10 PM
OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

The way you write, you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!

You _could_ give device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write small area inside his VaS.

OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.

EricP

Dec 16, 2023, 2:25:39 PM
The OS can't remove the page RW access for a user mode page while an
IO device is DMA writing the page, if that's what you meant,
as the DMA-in may be writing to a smaller buffer within a larger page.
It is perfectly normal for a thread to continue to work in buffer
bytes adjacent to the one currently involved in an async IO.

Scott Lurndal

Dec 16, 2023, 4:43:03 PM
It doesn't, necessarily. The IOMMU translation table is a
proper subset of the user's virtual address space. The
application tells the kernel which portions of the address
space are valid DMA regions for the device to access.

BGB-Alt

Dec 16, 2023, 5:49:23 PM
One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

Like, say, for a filesystem, it is presumably:
read syscall from user to OS;
route this to the corresponding VFS driver;
Requests spanning multiple blocks being broken up into parts;
VFS driver checks the block-cache / buffer-cache;
If found, copy from cache into user-space;
If not found, send request to the underlying block device;
Wait for response (and/or reschedule task for later);
Copy result back into userland.

Though, it may make sense that if a request isn't available immediately,
and there is some sort of DMA mechanism, the OS could block the task and
then resume it once the data becomes available. For polling IO, doesn't
likely make much difference as the CPU is basically stuck in a busy loop
either way until the IO finishes.
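
In rough C terms, the cached part of that path might look something like
this (a sketch with made-up structures and helpers, not any particular
kernel's code):

/* Sketch of a buffered block read: check the block cache first,
   fall back to the block device, then copy to the caller.
   All structures and helper functions here are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct block_dev;
extern int      blkdev_read(struct block_dev *bd, uint64_t lba, void *buf);
extern uint8_t *bcache_lookup(struct block_dev *bd, uint64_t lba);
extern uint8_t *bcache_insert(struct block_dev *bd, uint64_t lba);

/* Read a slice of one block into the caller's buffer. */
int vfs_read_block(struct block_dev *bd, uint64_t lba,
                   void *user_buf, size_t off, size_t len)
{
    uint8_t *blk = bcache_lookup(bd, lba);
    if (!blk) {
        /* Miss: allocate a cache slot and fill it from the device
           (this is where the task could block / be rescheduled). */
        blk = bcache_insert(bd, lba);
        int rc = blkdev_read(bd, lba, blk);
        if (rc != 0)
            return rc;
    }
    /* Hit (or freshly filled): copy the requested slice out. */
    memcpy(user_buf, blk + off, len);
    return 0;
}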


Though, could make sense for hardware accelerating pixel-copying
operations for a GUI.

For GUI, there would be multiple stages of copying, say:
Copying from user buffer to window buffer;
Copying from window buffer to screen buffer;
Copying from screen buffer to VRAM.

For video playback or GL, there may be an additional stage of copying
from GL's buffer to a user's buffer, then from the user's buffer to the
window buffer. Though, considering possibly adding a shortcut path where
GL and video codecs copy more directly into the window buffer (bypassing
needing to pass the frame data through the userland program).

It could also be possible to have GL render directly into the window
buffer, which could be possible if they have the same format/resolution,
and the window buffer is physically mapped (say, for my current hardware
rasterizer module).

If running a program full-screen, it is possible to copy more directly
from the user buffer into VRAM, saving some time here.

Some time could be saved here if one had hardware support for these
sorts of "copy pixel buffers around and convert between formats" tasks,
but to be useful, this would need to be able to work with virtual
memory, which adds some complexity (would either need to be CPU-like
and/or have a page-walker; neither is particularly cheap).
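
For reference, the sort of "copy pixels and convert formats" loop in
question is roughly the following (a sketch; RGB555-to-XRGB8888 is
chosen as an arbitrary example, not a statement about my actual formats):

/* Sketch of a software convert-and-copy blit, RGB555 -> XRGB8888.
   This is the kind of per-pixel loop one would want to offload. */
#include <stddef.h>
#include <stdint.h>

void blit_rgb555_to_xrgb8888(uint32_t *dst, size_t dst_stride,  /* pixels */
                             const uint16_t *src, size_t src_stride,
                             size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            uint16_t p = src[y * src_stride + x];
            uint32_t r = (p >> 10) & 0x1F;
            uint32_t g = (p >>  5) & 0x1F;
            uint32_t b =  p        & 0x1F;
            /* Expand 5-bit channels to 8 bits. */
            r = (r << 3) | (r >> 2);
            g = (g << 3) | (g >> 2);
            b = (b << 3) | (b >> 2);
            dst[y * dst_stride + x] = (r << 16) | (g << 8) | b;
        }
    }
}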


Could maybe offload the task to the rasterizer module, but would need to
add a page-walker to the rasterizer... Though, trying to deal with some
scenarios (such as the final conversion/copy to VRAM) would add a lot of
extra complexity. For now, its framebuffer/zbuffer/textures need to be
in physically-mapped addresses (also with a 128-bit buffer alignment).


Though, cheaper could be to make use of the second CPU core, but then
schedule things like pixel copy operations to it (maybe also things like
vertex transform and similar for OpenGL). Currently, if enabled, the
second core hasn't seen a lot of use thus far in my case.

...

Thomas Koenig

Dec 16, 2023, 5:56:14 PM
BGB <cr8...@gmail.com> schrieb:
> On 12/16/2023 12:04 PM, moi wrote:
>> On 16/12/2023 07:22, Niklas Holsti wrote:
>>> On 2023-12-16 0:39, Scott Lurndal wrote:
>>>> mitch...@aol.com (MitchAlsup) writes:
>>>
>>>     [snip]
>>>
>>>>> In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>>>>> 10,000 lines of code per second for an IBM-like minicomputer (less
>>>>> decimal
>>>>> and string) and did a pretty good job of spitting out high performance
>>>>> code; on a machine with a 150ns cycle time.
>>>>
>>>> As did our COBOL compiler (which ran in 50KB).
>>>
>>>
>>> Are you both sure that those numbers are really lines per *second*?
>>> They seem improbably high, and compilation speeds in those years used
>>> to be stated in lines per *minute*.
>>>
>>
>> Almost certainly per minute.
>> I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
>> It achieved 20K cards per minute and was considered to be very fast.
>>
>
> Lines per minute seems to make sense.
>
>
> Modern PC's are orders of magnitude faster, but still don't have
> "instant" compile times by any means.
>
> Could be faster though, but would likely need languages other than C or
> (especially) C++.

I assume you never worked with Turbo Pascal.

That was amazing. It compiled code so fast that it was never a
bother to wait for it, even on an 8088 IBM PC running at 4.77 MHz.
The first version I ever used, 3.0 (?) compiled from memory to
memory, so even slow I/O (to floppy disc, at the time) was not
an issue.

This was made possible by using a streamlined one-pass compiler. It
didn't do much optimization, but when the alternative was BASIC, the
generated code was still extremely fast by comparison.

There were a few drawbacks. The biggest one was that programming errors
tended to freeze the machine. Another (not so important) was that,
if you were one of the lucky people to have an 80x87 coprocessor, the
generated code did not check for overflow of the coprocessor stack.

MitchAlsup

Dec 16, 2023, 6:01:20 PM
Which is my point !! You only want the device to see that <small> subset
of the requesting application--not the whole address space. Done right,
the device can still use the application's virtual addresses, but the device
is not allowed to access stuff not associated with the request at hand
right now.

For example, you are a large entity, and Chinese disk drives are way
less expensive than non-Chinese; so you buy some. Would you let those
disk drives access anything in some requestors address space--no, you
would only allow that device to access the user supplied buffer and
whatever page rounding up that transpires.

Principle of least Privilege works in the I/O space too.

MitchAlsup

Dec 16, 2023, 6:06:02 PM
BGB-Alt wrote:

Why did you acquire an alt ?? Ego perhaps ??

MitchAlsup

Dec 16, 2023, 6:12:13 PM
BGB-Alt wrote:

> On 12/16/2023 1:25 PM, EricP wrote:
>> MitchAlsup wrote:

> One thing I don't get here is why there would be direct DMA between
> userland and the device (at least for filesystem and similar).

> Like, say, for a filesystem, it is presumably:
> read syscall from user to OS;
> route this to the corresponding VFS driver;
> Requests spanning multiple blocks being broken up into parts;
> VFS driver checks the block-cache / buffer-cache;
> If found, copy from cache into user-space;
> If not found, send request to the underlying block device;
> Wait for response (and/or reschedule task for later);
> Copy result back into userland.

This is correct enough for a file system buffered by a disk cache.

Are ALL file systems buffered in a disk cache ??
I have MM (memory to memory move:: memmove() if you will) that transmits
up to 1 page of data as if atomically (single "bus" transaction.)

> Could maybe offload the task to the rasterizer module, but would need to
> add a page-walker to the rasterizer... Though, trying to deal with some
> scenarios (such as the final conversion/copy to VRAM) would add a lot of
> extra complexity. For now, its framebuffer/zbuffer/textures need to be
> in physically-mapped addresses (also with a 128-bit buffer alignment).


> Though, cheaper could be to make use of the second CPU core, but then
> schedule things like pixel copy operations to it (maybe also things like
> vertex transform and similar for OpenGL). Currently, if enabled, the
> second core hasn't seen a lot of use thus far in my case.

> ....

BGB-Alt

Dec 16, 2023, 6:17:24 PM
On 12/16/2023 5:01 PM, MitchAlsup wrote:
> BGB-Alt wrote:
>
> Why did you acquire an alt ?? Ego perhaps ??

This account is for when I am posting from my machine shop...
It is registered to a different email address, is a different account, ...


Quadibloc

Dec 16, 2023, 6:40:59 PM
On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld wrote:

> Anyway, has all development stopped? Or is their "sweat equity" model
> still going on?

I've checked the Mill web site, and Ivan Godard last posted to the forums
there just five days ago. So I can only assume that all is well, but
perhaps he has entered a phase of work on the Mill that is keeping him
busy. Which would seem to be good news.

John Savard

Scott Lurndal

Dec 16, 2023, 7:17:41 PM
I thought I made that clear from the start.

>
>For example, you are a large entity and and Chinese disk drives are way
>less expensive than non-Chinese; so you buy some. Would you let those
>disk drives access anything in some requestors address space--no, you
>would only allow that device to access the user supplied buffer and
>whatever page rounding up that transpires.

So far as I know there are no Chinese disk drives that support
SR-IOV.

Scott Lurndal

Dec 16, 2023, 7:25:03 PM
https://www.dpdk.org/
https://opendataplane.org/

Are two very common use cases for usermode drivers.

>
>Like, say, for a filesystem, it is presumably:
> read syscall from user to OS;
> route this to the corresponding VFS driver;
> Requests spanning multiple blocks being broken up into parts;
> VFS driver checks the block-cache / buffer-cache;
> If found, copy from cache into user-space;
> If not found, send request to the underlying block device;
> Wait for response (and/or reschedule task for later);
> Copy result back into userland.

No, it would be for the user mode application to access
disk/ssd/nvme blocks directly and impose whatever structure on those
blocks that it wishes. No OS intervention at all, DMA directly
into userspace instead of bouncing through kernel.

The NVMe controllers use a command ring, and when virtualized,
each VF provides a command ring directly to the user mode
application - the application can insert commands (read, write,
erase, etc.) into the ring, write to the doorbell register,
and wait for completion by polling or waiting for a virtio
interrupt.

Again the application is just reading blocks and interpreting
them any way it wishes (e.g. for a database application
which doesn't need a filesystem).
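
As a rough sketch of the submission side of such a command ring (heavily
simplified; the structures, field names, and opcode encoding here are
made-up stand-ins, not the real NVMe queue-entry layout):

/* Sketch of posting a read command to a device-owned command ring
   and ringing a doorbell from user space. */
#include <stdint.h>

struct cmd {
    uint8_t  opcode;        /* e.g. 0x02 = "read" (made-up encoding) */
    uint8_t  pad[7];
    uint64_t dma_addr;      /* where the device should DMA the data  */
    uint64_t lba;
    uint32_t nblocks;
    uint32_t tag;
};

struct ring {
    struct cmd        *entries;      /* shared with the device */
    uint32_t           nentries;
    uint32_t           tail;         /* next free slot         */
    volatile uint32_t *doorbell;     /* MMIO doorbell register */
};

void submit_read(struct ring *r, uint64_t lba, uint32_t nblocks,
                 uint64_t dma_addr, uint32_t tag)
{
    struct cmd *c = &r->entries[r->tail];
    c->opcode   = 0x02;
    c->dma_addr = dma_addr;
    c->lba      = lba;
    c->nblocks  = nblocks;
    c->tag      = tag;

    r->tail = (r->tail + 1) % r->nentries;
    /* A store fence belongs here on weakly ordered machines. */
    *r->doorbell = r->tail;   /* tell the device new work is queued */
}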

BGB

Dec 16, 2023, 8:18:26 PM
Yeah, I mostly missed out on that era.

Didn't get much into computers until I was in the "late single digits"
age range, and by this point the world was mostly 386 and 486 PC's
running Windows 3.x and similar.

Seemingly, Pascal was already "mostly dead" by this point.


When I started messing with programming in elementary school:
First was QBasic, but other than this I was also messing around with
TurboC. Not long after (when the world migrated to Win95) had jumped
over to Cygwin.

During middle and high-school, mostly during the Win98 era, mostly used
Cygwin and MinGW. Though, I was weird, and mostly ended up running
WinNT4 and Win2K (and dual booting with Linux) rather than Win9X.


Then later jumped over to MSVC / Visual Studio for native windows
programs while taking college classes.

Though, part of the jump was because, at this point, Visual Studio had
become basically freeware; and Visual Studio had a much better debugger
(gdb kinda sucks...).


Still, much time has passed for me, and in a fairly short time I will
cross over into having existed for 4 decades.



> This was made possible by using a streamlined one-pass compiler. It
> didn't do much optimization, but when the alternative was BASIC, the
> generated code was still extremely fast by comparison.
>

I remember QBasic.

Didn't take long to start to see the limitations...


> There were a few drawbacks. The biggest one was that programming errors
> tended to freeze the machine. Another (not so important) was that,
> if you were one of the lucky people to have an 80x87 coprocessor, the
> generated code did not check for overflow of the coprocessor stack.

OK.

For most of my life, x87 had been built into the CPU.


John Levine

Dec 16, 2023, 8:33:02 PM
According to Thomas Koenig <tko...@netcologne.de>:
>> Modern PC's are orders of magnitude faster, but still don't have
>> "instant" compile times by any means.
>>
>> Could be faster though, but would likely need languages other than C or
>> (especially) C++.
>
>I assume you never worked with Turbo Pascal.
>
>That was amazing. It compiled code so fast that it was never a
>bother to wait for it, even on an 8088 IBM PC running at 4.77 MHz.

Back around 1970 the Dartmouth Time-Sharing System (DTSS) ran on a GE
635, which was about the same performance as the original PDP-10 and a
front end DAtanet 30 which had about the compute power of a modern
toaster. By clever system design they made it support 100 users, and
the response time was really good. The time from when you typed RUN to
when your program was compiled and started running was too fast to
notice.

It was a real time-sharing system that supported multiple languages,
not just BASIC, and the languages were all compiled, not interpreted.
The compilers were so fast that for years they never bothered to write
a linker, since you could just compile all the source code for your
routines together. (They finally wrote a linker when they added PL/I.)

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Andreas Eder

Dec 17, 2023, 6:25:07 AM
On Fr 15 Dez 2023 at 13:05, BGB <cr8...@gmail.com> wrote:

> Also, it would be nice to have a basically usable OS and core software
> stack in under 1M lines.
>
> Say, by not trying to be everything to everyone, and limiting how much
> is allowed in the core OS (or is allowed within the build process for
> the core OS).
>
> Though, within moderate limits, 1M lines would basically be enough to fit:
> A basic kernel;
> (this excludes the Linux kernel, which is well over the size limit).
> A (moderate sized) C compiler;
> (but not GCC, which is also well over this size limit).
> A shell+utils comparable to BusyBox;
> Various core OS libraries and similar, etc.
>
> For this, will assume an at least nominally POSIX like environment.
>
> Programs that run on the OS would not be counted in the line-count budget.

Have you had a look at plan9 yet?

'Andreas

EricP

Dec 17, 2023, 1:13:43 PM
Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions: buffers must match the file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).
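
For concreteness, a minimal Win32 sketch of the direct-to-user-buffer
file case (FILE_FLAG_NO_BUFFERING; the file name and a sector size no
larger than the page size are assumptions for the example):

/* Minimal sketch: unbuffered ("direct") file read on Win32.
   FILE_FLAG_NO_BUFFERING requires the buffer, offset, and length
   to be sector-size aligned. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("test.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    /* VirtualAlloc returns page-aligned memory, which satisfies
       the sector alignment requirement. */
    void *buf = VirtualAlloc(NULL, 65536, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);

    DWORD got = 0;
    if (ReadFile(h, buf, 65536, &got, NULL))
        printf("read %lu bytes\n", (unsigned long)got);

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}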

There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO. A single virtual buffer becomes a list of
physical fragments, so a scatter-gather list becomes a list of lists
of physical byte buffer fragments, called a Memory Descriptor List (MDL)
in Windows.

And then SR-IOV adds virtual machines to the mix, where a guest OS
physical address becomes a hypervisor guest virtual address,
and not only are guest buffers in guest user space, but the guest OS
MDL's are themselves in hypervisor virtual space and require their own
hypervisor MDL's (lists of lists of lists of fragments).

>
> Like, say, for a filesystem, it is presumably:
> read syscall from user to OS;
> route this to the corresponding VFS driver;
> Requests spanning multiple blocks being broken up into parts;
> VFS driver checks the block-cache / buffer-cache;
> If found, copy from cache into user-space;
> If not found, send request to the underlying block device;
> Wait for response (and/or reschedule task for later);
> Copy result back into userland.

Yes, pretty much (there is page management, quota management).
Except if I request a direct IO it DMA's direct to/from the user buffer,
if hardware supports that.

> Though, it may make sense that if a request isn't available immediately,
> and there is some sort of DMA mechanism, the OS could block the task and
> then resume it once the data becomes available. For polling IO, doesn't
> likely make much difference as the CPU is basically stuck in a busy loop
> either way until the IO finishes.

Yes, that's DMA resource management. Basically each system has a certain
number of scatter-gather IO mappers, now implemented by the IOMMU page table.
Each IO queues a request for its mappers, and the DMA resource manager doles
out a set of IO mapping registers, which may be less that you requested
in which case you break up your IO into multiple requests.
Then you program the scatter-gather map using info from the IO's MDL,
pass the mapped IO space addresses to the device, and Bob's your uncle.
When the IO completes, your driver tears down its IO map and releases
the mapping registers to the next waiting IO.
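
A sketch of what "building the scatter-gather list from a (pinned)
virtual buffer" amounts to, in generic pseudo-kernel C; virt_to_phys()
and PAGE_SIZE here are hypothetical stand-ins, not a real API:

/* Turn a pinned virtual buffer into a list of physical fragments
   suitable for programming scatter-gather IO mappers. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct sg_frag { uint64_t phys; uint32_t len; };

extern uint64_t virt_to_phys(const void *va);   /* hypothetical */

/* Returns the number of fragments written into 'frags'. */
size_t build_sg_list(const void *buf, size_t len,
                     struct sg_frag *frags, size_t max_frags)
{
    uintptr_t va = (uintptr_t)buf;
    size_t    n  = 0;

    while (len > 0 && n < max_frags) {
        /* Each fragment stops at the next page boundary. */
        size_t in_page = PAGE_SIZE - (va & (PAGE_SIZE - 1));
        size_t chunk   = len < in_page ? len : in_page;

        frags[n].phys = virt_to_phys((const void *)va);
        frags[n].len  = (uint32_t)chunk;
        n++;

        va  += chunk;
        len -= chunk;
    }
    return n;
}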

> Though, could make sense for hardware accelerating pixel-copying
> operations for a GUI.

On Windows the Gui is managed completely differently.
I'm not familiar enough with the details to comment other than to say
it is executed as privileged subroutines by the calling thread but in
super mode, which allows it direct access to the calling virtual space.






Scott Lurndal

Dec 17, 2023, 2:24:13 PM
EricP <ThatWould...@thevillage.com> writes:
>BGB-Alt wrote:
>> On 12/16/2023 1:25 PM, EricP wrote:
>>> MitchAlsup wrote:
>>>> Scott Lurndal wrote:
>>>>
>>>>> mitch...@aol.com (MitchAlsup) writes:
>>>>>> Scott Lurndal wrote:
>>>>>>

>>
>> One thing I don't get here is why there would be direct DMA between
>> userland and the device (at least for filesystem and similar).
>
>Zero-copy IO. That has always been available on WinNT provided hardware
>supports it. General byte-buffer IO could always do zero-copy DMA,
>with HW support. For files one can do IO direct to a user buffer with
>certain restrictions, buffers must be file block size and alignment.
>I haven't checked but guessing that if the file block is already in file
>cache it gets copied, otherwise it DMA's directly to/from the user buffer.
>Normally one wants cached file blocks but there are times when one doesn't
>and wants the more optimal direct buffer IO (eg, a video player).
>
>There is also scatter-gather IO, intended for network cards,
>where the IO is a list of byte sized and aligned virtual buffers.
>
>This all interacts with DMA and page management because the physical
>page frames that contain the bytes must be pinned in memory for the
>duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.

> A single virtual buffer becomes a list of
>physical fragments, so a scatter-gather list becomes a list of lists
>of physical byte buffer fragments, called a Memory Descriptor List (MDL)
>in Windows.
>
>And then SR-IOV adds virtual machines to the mix,

Not necessarily just virtual machines - it's also used
to expose the virtual function to user mode code in
a bare metal (or virtualized) operating system.

Chris M. Thomasson

Dec 17, 2023, 3:56:39 PM
Fwiw, for some damn reason this makes me think about plan9 from some
posts way back on comp.programming.threads. I need to find some time to
find them: Here is one that mentioned it:

https://groups.google.com/g/comp.programming.threads/c/nyrEJDt8FvM/m/uZUcQcnWPLQJ


BGB

Dec 17, 2023, 8:05:43 PM
OK.

Nothing like this in my case, only buffered IO.

Currently, the buffering is managed by the filesystem driver rather than
the block-device.

So, say, reading/writing the SDcard is normally unbuffered, but the FAT
driver will keep a cache of previously accessed clusters and similar. It
might make sense to move this into a more general-purpose mechanism though.


For FAT though, there may be wonk in that (AFAIK) there is no strict
requirement that the start of the data area be aligned to the cluster
size (so, say, one could potentially have a volume with 32K clusters
aligned on a 2K boundary). Well, unless this is disallowed and I missed it.



If I were designing my own filesystem, I would probably have done some
things differently. Though, my ideas didn't really look like EXTn either.

Had previously considered something that would have looked like
something partway between EXT2 and a somewhat simplified NTFS, but not
done much here as it would make a lot of hassle on the Windows side of
things.

Mostly would want a few features that seem a bit lacking in FAT.


Though, did recently discover the existence of the "Projected
FileSystem" API in Windows, which allows the possibility of implementing
custom user-mode filesystems on Windows (sorta; it is a bit wonky).

This does open / re-open some possibilities.


> There is also scatter-gather IO, intended for network cards,
> where the IO is a list of byte sized and aligned virtual buffers.
>
> The all interacts with DMA and page management because the physical
> page frames that contain the bytes must be pinned in memory for the
> duration of the DMA IO. A single virtual buffer becomes a list of
> physical fragments, so a scatter-gather list becomes a list of lists
> of physical byte buffer fragments, called a Memory Descriptor List (MDL)
> in Windows.
>
> And then SR-IOV adds virtual machines to the mix, where a guest OS
> physical address becomes a hypervisor guest virtual address,
> and not only are guest buffers in guest user space, but the guest OS
> MDL's are themselves in hypervisor virtual space and require their own
> hypervisor MDL's (lists of lists of lists of fragments).
>

OK.

I can note that in my project, there is no DMA mechanism as of yet.
Pretty much everything is either MMIO mapped buffers or polling IO.


When I looked at a network card before (once, long ago), IIRC its design
was more like:
There were a pair of ring-buffers, for TX and RX;
One would write frames to the TX buffer, and update the pointers, and
the card would send them;
When a frame arrived, it would add it into the buffer, update the
pointers, and then raise an IRQ.

This design was used in ye olde RTL8139 and similar.
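
Roughly, a driver consumes such an RX ring along these lines (a generic
sketch; the descriptor layout, ownership bit, and register names are
invented for illustration, not the actual RTL8139 programming model):

/* Generic sketch of draining an RX descriptor ring after an IRQ. */
#include <stddef.h>
#include <stdint.h>

#define RX_RING_LEN       16
#define DESC_OWNED_BY_NIC 0x8000u

struct rx_desc {
    volatile uint16_t status;   /* ownership + error bits */
    volatile uint16_t len;      /* frame length in bytes  */
    uint8_t           data[2048];
};

extern void deliver_frame(const uint8_t *frame, size_t len);

void nic_rx_poll(struct rx_desc ring[RX_RING_LEN],
                 volatile uint32_t *rx_head_reg, uint32_t *rx_tail)
{
    /* The NIC clears the ownership bit once it has written a frame;
       consume frames until we catch up with the hardware. */
    while (!(ring[*rx_tail].status & DESC_OWNED_BY_NIC)) {
        deliver_frame(ring[*rx_tail].data, ring[*rx_tail].len);

        /* Hand the descriptor back to the NIC and advance. */
        ring[*rx_tail].status |= DESC_OWNED_BY_NIC;
        *rx_tail = (*rx_tail + 1) % RX_RING_LEN;
        *rx_head_reg = *rx_tail;   /* tell the NIC how far we've read */
    }
}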



Had looked at another Ethernet interface, and it differed in that it
only had a 2K buffer for a single frame:
When a frame arrived, it was written into the buffer, and an interrupt
was raised;
When set to transmit, the buffer contents were transmitted, and then an
interrupt would be raised.

Seemingly, this interface would be unable to receive a frame while
trying to transmit a frame. Nor could it deal with a new frame arriving
before the previous frame had been read by the driver.

Though, this latter one was on an FPGA based soft-processor.


If/when I get to it, had considered using the pair of ring buffers
design, each probably 8K or 16K (where 8K is enough for 4 full-sized
Ethernet frames, each limited typically to around 1500 bytes of payload;
16K could give more "slack" for the driver, at the expense of using more
BlockRAM).


>>
>> Like, say, for a filesystem, it is presumably:
>>   read syscall from user to OS;
>>   route this to the corresponding VFS driver;
>>   Requests spanning multiple blocks being broken up into parts;
>>   VFS driver checks the block-cache / buffer-cache;
>>   If found, copy from cache into user-space;
>>   If not found, send request to the underlying block device;
>>   Wait for response (and/or reschedule task for later);
>>   Copy result back into userland.
>
> Yes, pretty much (there is page mangement, quota management).
> Except if I request a direct IO it DMA's direct to/from the user buffer,
> if hardware supports that.
>

OK.
No equivalent in my case (slow polling IO only for now).


Though, I did go the route of allowing access to an SDcard with 8 bytes
per SPI transfer, which at least "kicked the can down the road slightly":
originally, with 1-byte SPI bursts, the overhead of the MMIO polling
interface was slower than an SDcard running on a 10MHz SPI interface.


The 8-byte transfers made it faster, but ended up mostly settling on
13MHz SPI (fastest speed where I could get reliable results on the
actual hardware I was using, *1).

*1: Where, seemingly the combination of micro-SD to full-size SD
extender cable + microSD to full-size SD adapter on the card, was
seemingly not ideal for signal integrity (but used mostly because
otherwise microSD cards are too small / easy to drop and not be able to
find again; whereas full-size SD cards are easier to handle). Wouldn't
have expected the attenuation to be *that* bad though.

Though, it works "well enough", since if it were that much faster, would
need to create a new interface.


>> Though, it may make sense that if a request isn't available
>> immediately, and there is some sort of DMA mechanism, the OS could
>> block the task and then resume it once the data becomes available. For
>> polling IO, doesn't likely make much difference as the CPU is
>> basically stuck in a busy loop either way until the IO finishes.
>
> Yes, that's DMA resource management. Basically each system has a certain
> number of scatter-gather IO mappers, now implemented by the IOMMU page
> table.
> Each IO queues a request for its mappers, and the DMA resource manager
> doles
> out a set of IO mapping registers, which may be less that you requested
> in which case you break up your IO into multiple requests.
> Then you program the scatter-gather map using info from the IO's MDL,
> pass the mapped IO space addresses to the device, and Bob's your uncle.
> When the IO completes, your driver tears down its IO map and releases
> the mapping registers to the next waiting IO.
>

OK.


>> Though, could make sense for hardware accelerating pixel-copying
>> operations for a GUI.
>
> On Windows the Gui is managed completely differently.
> I'm not familiar enough with the details to comment other than to say
> it is executed as privileged subroutines by the calling thread but in
> super mode, which allows it direct access to the calling virtual space.
>

I am not sure how the Windows GUI works.

In my case though, my experimental GUI had effectively worked by using
something resembling a COM object to communicate with a GUI system
running in a different task (it basically manages redrawing the window
stack and sending it out to the display and similar).


But, here, the whole process involves a bunch of pixel-buffer copying,
which isn't terribly fast (currently eats up a bigger chunk of time than
running Doom itself does).


By contrast, Win32 GDI seems to be more object-based, rather than built
on top of drawing into pixel buffers and copying them around.

However, IME, my way of using Win32 GDI was mostly to create a Bitmap
object, update it, and endlessly draw it into the window, which is
basically the native model in TKGDI.

Seemingly, X11 was a little different as well, with commands for drawing
stuff (like color-fills, lines, and text). Though, presumably all of
this would just end up going into a pixel buffer.

Because each drawing operation supplies a BITMAPINFOHEADER for the thing
to be drawn, there can also be format conversion in the mix.
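
For reference, the basic "draw a pixel buffer into a window" pattern in
Win32 GDI looks something like the following (StretchDIBits is the real
GDI call; the framebuffer variable and sizes are illustrative):

  #include <windows.h>
  #include <stdint.h>

  void BlitFrame(HDC hdc, const uint32_t *fb, int w, int h)
  {
      BITMAPINFO bmi = {0};
      bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
      bmi.bmiHeader.biWidth       = w;
      bmi.bmiHeader.biHeight      = -h;     /* negative height: top-down rows */
      bmi.bmiHeader.biPlanes      = 1;
      bmi.bmiHeader.biBitCount    = 32;     /* 32-bit BGRA pixels             */
      bmi.bmiHeader.biCompression = BI_RGB;

      StretchDIBits(hdc, 0, 0, w, h,        /* destination rectangle          */
                         0, 0, w, h,        /* source rectangle               */
                         fb, &bmi, DIB_RGB_COLORS, SRCCOPY);
  }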



So, say, running Doom or similar in the GUI mode doesn't give
particularly high framerates.

Had recently added tabs to the console, and had used it to launch Doom
and Quake at the same time (as a test). Both of them ran, showing that
the multi-tasking does in fact work (within the limits of still being
cooperative multitasking).

However, performance was so bad as to make both of them basically
unusable (all of this dropped frame rate to around 2 frames / second).

https://twitter.com/cr88192/status/1735233196562796800


BGB

unread,
Dec 17, 2023, 8:10:32 PM12/17/23
to
Have heard of Plan9 before, never really looked at the code nor looked
much into it.


Was also aware of Minix, but what little I looked into it made it seem
fairly limited in some areas (though it seems to have changed things a
fair bit in more recent versions). Seems to be using the BSD userland.


But, yeah, for my project, it might make sense to find some sort of
userland I can port and use on top of TestKern; I don't necessarily want
to write all of the userland myself.

Would have the functional limitation that I would need to be able to
build it with my compiler, which means basically "generic C only".

...


> 'Andreas

Paul A. Clayton

unread,
Dec 17, 2023, 9:56:15 PM12/17/23
to
On 12/17/23 2:24 PM, Scott Lurndal wrote:
> EricP <ThatWould...@thevillage.com> writes:
[snip zero-copy and scatter-gather I/O]
>> This all interacts with DMA and page management because the physical
>> page frames that contain the bytes must be pinned in memory for the
>> duration of the DMA IO.
>
> PCI express has an optional feature, PRI (Page Request Interface)
> that allows the hardware to request that a page be 'pinned' just
> for the duration of a DMA operation. The ARM64 server base system
> architecture document requires that the host support PRI. This
> works in conjunction with PCIe ATS (Address Translation Services)
> which allows the endpoint device to ask the host for translations
> and cache them in the endpoint so the endpoint can use physical
> addresses directly. This is usually implemented by the IOMMU
> on the host treating the endpoint as if it had a remote TLB cache.

Interesting. I had proposed some years ago that rather than
pinning a physical page for I/O a page be provided when needed
from a free list (including that the data could be cached/buffered
with a virtual address tag).

The Mill's backless memory is similar, deferring physical memory
allocation until cache eviction using a free list (which is refilled by
a thread activated at a low-water mark).

Thanks to search on Google Groups I found the message (dated Sep
28, 2010, 6:07:08 PM). I wrote:
-> Could a really smart IOTLB help with this? If the target of
-> a write is a virtual address, the IOTLB might translate it to
-> an IO Hub local memory address (and/or cache it, perhaps using
-> virtual tags). (It seems it might be useful to distinguish
-> between different purposes of non-cacheable storage. I would
-> guess that the DMA kind is primarily meant to avoid cache
-> pollution not ensure that side-effects occur.)
->
-> Along similar lines, I wondered if a smart IOTLB could be
-> used to make page-pinning only a 'kiss of cowpox' not a
-> 'Kiss of Death'. If an IOTLB could dynamically assign
-> pages from a free list, a huge number of virtual pages
-> could be 'locked'. (It would still be possible for a
-> write to page-fault--if the system software could not
-> provide pages to the free list fast enough to meet the
-> demand by the IOTLB--and read page-faults would be
-> possible; but software might be able to retry the IO
-> requests.)
->
-> (I also wonder if processor TLB COW support would be
-> worthwhile. Aside from COW, such might be used by a
-> user-level memory allocator to free and allocate pages.
-> A shared page free list might allow tighter memory
-> usage. [ISTR the BSD malloc tried to free pages back
-> to the OS. The above mechanism would simply put a
-> hardware managed buffer between the user memory
-> management and the OS.])
->
-> (Even further off-topic, could 'cache' pages be useful?
-> I.e., the software handles re-generation/fill and only
-> needs a low-overhead exception when the page has been
-> reclaimed for other uses. Rather than having system
-> software save and then restore the cache page, it
-> could just be dropped. Even if the cost of restoration
-> is greater than the cost of a save and restore, this
-> sort of caching could be a win if the probability of
-> reuse is low enough.)

The Google groups url:
https://groups.google.com/g/comp.arch/c/u7z9E-zvoPo/m/fmGM4_Ih7ywJ

Terje Mathisen

unread,
Dec 18, 2023, 6:11:33 AM12/18/23
to
Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
>> Modern PC's are orders of magnitude faster, but still don't have
>> "instant" compile times by any means.
>>
>> Could be faster though, but would likely need languages other than C or
>> (especially) C++.
>
> I assume you never worked with Turbo Pascal.

I was going to bring up TP but you beat me to it. :-)
>
> That was amazing. It compiled code so fast that it was never a
> bother, to wait for it, even on a 8088 IBM PC running at 4.7 MHz.
> The first version I ever used, 3.0 (?) compiled from memory to
> memory, so even slow I/O (to floppy disc, at the time) was not
> an issue.

TP1.0 was an executable which in ~37KB managed to fit an IDE,
compiler/linker/loader/debugger and RTL, and if you abstained from
getting human-readable error messages you could save about 1.5KB.
>
> This was made possible by using a streamlined one-pass compiler. It
> didn't do much optimization, but when the alternative was BASIC, the
> generated code was still extremely fast by comparison.

That compiler had zero optimization; it was a pure pattern match->emit
code engine that would reload the same variable from RAM on every
statement, but as you said, still far faster than the alternatives.

When speed was an actual issue I would switch to (inline) assembler,
even though that was initially just a way to embed machine code directly
so I had to assemble it in DEBUG.
>
> There were a few drawbacks. The biggest one was that programming errors
> tended to freeze the machine. Another (not so important) was that,
> if you were one of the lucky people to have an 80x87 coprocessor, the
> generated code did not check for overflow of the coprocessor stack.
>
The FP code generated by TP would never overflow the x87 stack AFAIR,
since it would do single operations and pop the results at once?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

unread,
Dec 18, 2023, 9:22:17 AM12/18/23
to
On 12/18/2023 5:11 AM, Terje Mathisen wrote:
> Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>>> Modern PC's are orders of magnitude faster, but still don't have
>>> "instant" compile times by any means.
>>>
>>> Could be faster though, but would likely need languages other than C or
>>> (especially) C++.
>>
>> I assume you never worked with Turbo Pascal.
>
> I was going to bring up TP but you beat me to it. :-)
>>
>> That was amazing.  It compiled code so fast that it was never a
>> bother, to wait for it, even on a 8088 IBM PC running at 4.7 MHz.
>> The first version I ever used, 3.0 (?) compiled from memory to
>> memory, so even slow I/O (to floppy disc, at the time) was not
>> an issue.
>
> TP1.0 was an executable which in ~37KB managed to fit an IDE,
> compiler/linker/loader/debugger and RTL, and if you abstained from
> getting human-readable error messages you could save about 1.5KB.

Yeah, in any case, a small compiler is possible.
And we don't necessarily need some 10+ MLOC monstrosity to do it...


>>
>> This was made possible by using a streamlined one-pass compiler.  It
>> didn't do much optimization, but when the alternative was BASIC, the
>> generated code was still extremely fast by comparison.
>
> That compiler had zero optimization; it was a pure pattern match->emit
> code engine that would reload the same variable from RAM on every
> statement, but as you said, still far faster than the alternatives.
>
> When speed was an actual issue I would switch to (inline) assembler,
> even though that was initially just a way to embed machine code directly
> so I had to assemble it in DEBUG.

Early in my SH/BJX1 project, BGBCC wasn't too far off:
Used R8..R14 for caching variables;
Would often move values into R4..R7 to operate on them.

The variable Load and Store operations would:
Do a MOV if the value was already in a register;
Load it from memory otherwise, putting the value into a register.

With all variables being flushed at the end of a basic block (with any
dirty variables being written back to memory).

In my case, I had used a similar model in my JIT compilers (generally on
x86).


So, an ADD operation might look like:
MOV R8, R4    //fetch Var1 from its cache register (R8) into scratch R4
MOV R9, R5    //fetch Var2 likewise into R5
ADD R5, R4    //R4 += R5
MOV R4, R10   //copy the result into Var3's register (R10)

I then switched to a model which was more like:
Get Var1 as a register for Read;
Get Var2 as a register for Read;
Get Var3 as a register for Write;
Do the operation;
Release Var1, Var2, Var3.

Which could avoid needing a bunch of extra MOV's and similar.
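
In rough C terms, that interface works out to something like this (the
names are hypothetical, not BGBCC's actual functions):

  /* hypothetical codegen helpers */
  extern int  CG_GetRegRead (void *ctx, int var);  /* reuse cached reg or emit a load */
  extern int  CG_GetRegWrite(void *ctx, int var);  /* pick a dest reg, mark it dirty  */
  extern void CG_EmitAdd    (void *ctx, int rd, int rs, int rt);
  extern void CG_ReleaseReg (void *ctx, int reg);

  void CG_CompileAdd(void *ctx, int var1, int var2, int var3)
  {
      int r1 = CG_GetRegRead (ctx, var1);
      int r2 = CG_GetRegRead (ctx, var2);
      int rd = CG_GetRegWrite(ctx, var3);
      CG_EmitAdd(ctx, rd, r1, r2);     /* ADD emitted directly, no extra MOVs */
      CG_ReleaseReg(ctx, r1);
      CG_ReleaseReg(ctx, r2);
      CG_ReleaseReg(ctx, rd);
  }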

The idea for the mostly stalled TKUCC effort would be to use a similar
model to the current form of BGBCC, just focusing more on minimalism,
and probably using separate compilation. Though, there are pros/cons to
"generate everything all at once", which requires more memory but has
more opportunity for optimizations, or at least for pruning stuff.

Though, have noted that GCC has a different mechanism for pruning stuff
with separate compilation (via "-ffunction-sections" and
"-fdata-sections"): every function and variable goes into its own
section in the object files, the linker then prunes sections based on
reachability, and the survivors are combined into a single section
during linking.
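
For reference, a typical invocation pairs those compile flags with the
linker's section garbage collection:

  gcc -ffunction-sections -fdata-sections -c foo.c
  gcc -Wl,--gc-sections -o prog foo.o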


Did recently notice in some fiddling that some things were invoking GCC
like:
echo ... | $CC -E -xc - | ...

Was kind of a pain, but added similar behavior to BGBCC, in an attempt
to better mimic GCC's command lines.
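
A concrete example of that style of invocation (the exact preprocessor
output varies a bit by GCC version, but it begins with linemarkers of
the '# lnum "fname"' form before the preprocessed text):

  echo 'int x;' | gcc -E -xc -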

Did need to have it omit line numbers in this case, as BGBCC had used a
different notation for encoding these:
BGBCC:
/*"fname"lnum*/ line
GCC:
# lnum "fname"
line
And the way the commands were doing text parsing was incompatible with
BGBCC's line-numbering scheme.




Also, early versions of my BJX2 core, in addition to having a slow
memory bus, did not have pipelined memory operations (memory access
operations used the same OPM/OK signaling scheme as the bus).

In this case, the cost of extra MOV instructions was considered minor
relative to the cost of the memory loads/stores.


Situation has at least improved since then.
Still very often fighting bugs though...

Scott Lurndal

unread,
Dec 18, 2023, 10:43:57 AM12/18/23
to
"Paul A. Clayton" <paaron...@gmail.com> writes:
>On 12/17/23 2:24 PM, Scott Lurndal wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>[snip zero-copy and scatter-gather I/O]
>>> This all interacts with DMA and page management because the physical
>>> page frames that contain the bytes must be pinned in memory for the
>>> duration of the DMA IO.
>>
>> PCI express has an optional feature, PRI (Page Request Interface)
>> that allows the hardware to request that a page be 'pinned' just
>> for the duration of a DMA operation. The ARM64 server base system
>> architecture document requires that the host support PRI. This
>> works in conjunction with PCIe ATS (Address Translation Services)
>> which allows the endpoint device to ask the host for translations
>> and cache them in the endpoint so the endpoint can use physical
>> addresses directly. This is usually implemented by the IOMMU
>> on the host treating the endpoint as if it had a remote TLB cache.
>
>Interesting. I had proposed some years ago that rather than
>pinning a physical page for I/O a page be provided when needed
>from a free list (including that the data could be cached/buffered
>with a virtual address tag).

In most usage cases, the page being DMA'd from/to has other
unrelated data in it, rather than being fully dedicated to
a single buffer or set of buffers.

The PRI is more about making sure the OS makes the page present
before the DMA operation begins and ensuring that it won't go
away before the DMA operation ends.


MitchAlsup

unread,
Dec 18, 2023, 12:42:03 PM12/18/23
to
Paul A. Clayton wrote:

> On 12/17/23 2:24 PM, Scott Lurndal wrote:
>> EricP <ThatWould...@thevillage.com> writes:
> [snip zero-copy and scatter-gather I/O]
>>> This all interacts with DMA and page management because the physical
>>> page frames that contain the bytes must be pinned in memory for the
>>> duration of the DMA IO.
>>
>> PCI express has an optional feature, PRI (Page Request Interface)
>> that allows the hardware to request that a page be 'pinned' just
>> for the duration of a DMA operation. The ARM64 server base system
>> architecture document requires that the host support PRI. This
>> works in conjunction with PCIe ATS (Address Translation Services)
>> which allows the endpoint device to ask the host for translations
>> and cache them in the endpoint so the endpoint can use physical
>> addresses directly. This is usually implemented by the IOMMU
>> on the host treating the endpoint as if it had a remote TLB cache.

> Interesting. I had proposed some years ago that rather than
> pinning a physical page for I/O a page be provided when needed
> from a free list (including that the data could be cached/buffered
> with a virtual address tag).

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

EricP

unread,
Dec 18, 2023, 2:00:10 PM12/18/23
to
I don't know how one would make use of that on Windows as it completely
separates the IO off so that the OS can switch to a different process
address space while the DMA takes place. The data structures to support
paging might not be easily accessible which would introduce long latency
in the middle of a DMA - which is exactly why it doesn't do this.
(I don't think Linux allows paging inside the OS or drivers either.)

On Windows you can have paging while managing a device if you put
the driver code in either a privileged user or super mode thread,
and then you deal with any timing issues.
The old floppy driver worked this way - as an OS thread.
But that was a very slow device and used programmed IO not DMA.



Paul A. Clayton

unread,
Dec 18, 2023, 8:29:11 PM12/18/23
to
On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]
> Guest OS can pin a guest physical page, but HyperVisor decides
> if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned? I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error. I would also guess that some I/O operations
could be merely retried, but some might just be lost. For a
virtualized I/O device, it would seem that the OS would be
confused if a (virtual) physical page was reported as having an
access error but perhaps there would be some generic transaction
failed indicator with information about retrying.

(Even with a pool of free pages and significant virtually tagged
caching, a page freeing thread could be "outrun" by I/O requesting
new pages. This presents denial of service attack potential as
well as ordinary danger of resource starvation. [For short DMAs,
caching-only might be practical with a main memory page never
being allocated. This would require unpinning/binding the page
after the data was copied; the copy could be "free" since the data
would be transferred to a processor cache anyway.])

Managing/avoiding oversubscription of resources is probably a week
or more of an OS design course. I sometimes wish I could spend a
few hundred years in a time bubble studying some of these things.

MitchAlsup

unread,
Dec 18, 2023, 10:41:12 PM12/18/23
to
Paul A. Clayton wrote:

> On 12/18/23 12:39 PM, MitchAlsup wrote:
> [snip page pinning for DMA]
>> Guest OS can pin a guest physical page, but HyperVisor decides
>> if the page is present or absent in memory.

> Out of curiosity, what happens when an I/O device tries to DMA to
> a page which the OS thinks is pinned. I would *guess* that a DMA
> operation that fails for an unvirtualized I/O device merely
> presents an error.

If the page fault occurs in the level 1 table, the Guest OS gets a
device page fault exception; if it happens in the level 2 table, the
HyperVisor gets a device page fault exception.

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the
I/O device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)

> I would also guess that some I/O operations
> could be merely retried, but some might just be lost. For a
> virtualized I/O device, it would seem that the OS would be
> confused if a (virtual) physical page was reported as having an
> access error but perhaps there would be some generic transaction
> failed indicator with information about retrying.

> (Even with a pool of free pages and significant virtually tagged
> caching, a page freeing thread could be "outrun" by I/O requesting
> new pages.

Let the "proper supervisor" sort it out. Keep HW out of the game.

Scott Lurndal

unread,
Dec 19, 2023, 9:28:15 AM12/19/23
to
"Paul A. Clayton" <paaron...@gmail.com> writes:
>On 12/18/23 12:39 PM, MitchAlsup wrote:
>[snip page pinning for DMA]
>> Guest OS can pin a guest physical page, but HyperVisor decides
>> if the page is present or absent in memory.
>
>Out of curiosity, what happens when an I/O device tries to DMA to
>a page which the OS thinks is pinned.

The I/O device simply pushes data to the physical address. It's
the responsibility of the operating software to ensure the
physical address given to the device (either via ATS where the
device hosts the "tlb" or via the IOMMU) is correct and legal.

If the IOMMU translation tables mark the page as absent, an error response
will be returned to the device. If ATS was used, and the
host didn't invalidate the translation at the host, the
device will DMA to the specified physical address regardless
of whether it is the correct page.


Paul A. Clayton

unread,
Jan 2, 2024, 1:28:15 PMJan 2
to
On 12/18/23 10:40 PM, MitchAlsup wrote:
> Paul A. Clayton wrote:
>
>> On 12/18/23 12:39 PM, MitchAlsup wrote:
>> [snip page pinning for DMA]
>>> Guest OS can pin a guest physical page, but HyperVisor decides
>>> if the page is present or absent in memory.
>
>> Out of curiosity, what happens when an I/O device tries to DMA to
>> a page which the OS thinks is pinned. I would *guess* that a DMA
>> operation that fails for an unvirtualized I/O device merely
>> presents an error.
>
> If the page fault occurs in the level 1 table, Guest OS gets a
> device page fault exception, if it happens in the level 2 table
> HyperVisor gets a device page fault exception.
>
> If the device can recover from page faults, the proper supervisor
> "does OS stuff" and then signals the device to proceed with the
> still pending device request. The "does OS stuff" does for the I/O
> device pretty much what the proper supervisor does with a
> CPU page fault--with all the nuances and idiosyncrasies (or more.)

If the HV encounters a device that cannot handle a page fault for
a page that it decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).

The HV could allocate the page for any such device, but that
requires the HV to be bloated with device driver specifics and to
check page allocation whenever an OS gave a DMA target to such a
device.

Am I missing something?


[snip previous confusion comment]
>> (Even with a pool of free pages and significant virtually tagged
>> caching, a page freeing thread could be "outrun" by I/O requesting
>> new pages.
>
> Let the "proper supervisor" sort it out. Keep HW out of the game.

Software cannot present the illusion of the page being pinned (for
devices that cannot handle page faults).

Hardware can easily cache data sent from a device at the "I/O Hub"
level using virtual addresses. "Shadow Memory" ("physical"
addresses outside the memory region that have an additional
translation layer using a TLB-like structure; "Increasing TLB
Reach Using Superpages Backed by Shadow Memory" (Mark Swanson et
al., 1998) proposed this concept) — extended with delayed
allocation support — would allow physically tagged processor caches
to cache DMA data that is not backed by actual main memory.

(Of course, elliptical orbits can be approximated to arbitrary
accuracy with epicycles. Accumulating fixes to increasingly rare
behavioral deviations will not lead to an elegant design — even if
it is common practice. On the other hand, just accepting the
coarse approximation of circular orbits, while more elegant, seems
flawed. Some applications will encounter the three body problem
and be forced into a sort of inelegance. For page pinning, I think
hardware page buffering may be worth the complexity, especially
since the functionality presents other opportunities. Since the
Mill provides the same with its Backless Memory, I am not the only
one who thinks the complexity is acceptable — this does not mean
that the complexity cost is not excessive and your experience
indicating it is excessive certainly urges more caution.)

By adding a small pool of free pages and a means to request more
when a low-water mark is reached, hardware/firmware could reduce
the frequency of HV/OS involvement and, except under extreme
utilization when the caches overflow (which could be rather
extreme if even 20% of last-level cache was usable) and the page
free pool is empty, allow "legacy" devices that demand pages be
present to operate as if they were present.
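
A purely conceptual sketch of that idea (this is not how any real IOMMU
behaves; all names are illustrative): an inbound DMA write to an
unbacked page grabs a frame from a hardware/firmware-managed free pool,
and the supervisor is asked to refill the pool at a low-water mark.

  #include <stdint.h>

  #define POOL_MAX  64
  #define POOL_LOW  8

  extern void     raise_refill_interrupt(void);  /* ask OS/HV for more pages */
  extern uint64_t iotlb_lookup(uint64_t iova);   /* 0 if no backing yet      */
  extern void     iotlb_insert(uint64_t iova, uint64_t pa);

  static uint64_t free_pool[POOL_MAX];
  static int      pool_count;

  int handle_dma_write(uint64_t iova)
  {
      if (iotlb_lookup(iova) != 0)
          return 0;                          /* page already backed          */
      if (pool_count == 0)
          return -1;                         /* pool outrun: fault or retry  */
      uint64_t pa = free_pool[--pool_count]; /* late-allocate a frame        */
      iotlb_insert(iova, pa);
      if (pool_count < POOL_LOW)
          raise_refill_interrupt();          /* refill below low-water mark  */
      return 0;
  }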

I realize that introducing a "bug" (really any behavioral variance
that violates expectations) that only manifests under extreme
circumstances is problematic. Such a variance would at least have
the documentation of an unexpected but sensible error notification
(possibly just the I/O device giving a page fault error when
hardware cannot keep up); since this is like a paravirtualization
feature the OS/HV would not be confused. (A HV using it could
mostly work with an OS that did not use the feature — rarely
encountering impossible I/O device page faults.)

The distinction between hardware, firmware, and paravirtualizing
hypervisor seems somewhat fuzzy, especially if the hypervisor is
provided by the hardware designer. If moving functionality from a
hypervisor to firmware/hardware can make useful new functionality
possible/practical (which appears to be the case with the above
proposal), then I think such an expansion of hardware
responsibility should be considered. (I fear the Mill will not
get even an FPGA implementation that would really test the limits
of Backless Memory, so this seems likely to be a mere "academic"
proposition.)

Scott Lurndal

unread,
Jan 2, 2024, 6:39:41 PMJan 2
to
"Paul A. Clayton" <paaron...@gmail.com> writes:
If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former
handles the translations and the latter requests that a page
be "pinned" for a subsequent DMA operation.

The HV controls the IOMMU which provides both the ATS and PRI interfaces
to the device. So the HV can invalidate a translation held in the
device (for ATS) or refuse to pin a page (or unpin a page).


MitchAlsup

unread,
Jan 3, 2024, 1:01:41 PMJan 3
to
Or have a HostBridge that provides translation services to
virtualized devices....

Scott Lurndal

unread,
Jan 3, 2024, 1:12:17 PMJan 3
to
All of the major operating systems fully support PCIe ATS and PRI
standards.

Leveraging that makes your processor viable, using a custom host
bridge doesn't.


MitchAlsup

unread,
Jan 3, 2024, 4:10:57 PMJan 3
to
But there are existing devices which do not.

> Leveraging that makes your processor viable, using a custom host
> bridge doesn't.

For devices that do not support it, the MY 66000 I/O MMU alleviates the
difference.