OK.
Nothing like this in my case, only buffered IO.
Currently, the buffering is managed by the filesystem driver rather than
the block-device.
So, say, reading/writing the SDcard is normally unbuffered, but the FAT
driver will keep a cache of previously accessed clusters and similar. It
might make sense to move this into a more general-purpose mechanism though.
For FAT though, there may be wonk in that (AFAIK) there is no strict
requirement that the start of the data area be aligned to the cluster
size (so, say, one could potentially have a volume with 32K clusters
aligned on a 2K boundary). Well, unless this is disallowed and I missed it.
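Roughly, the cluster-to-LBA mapping where that alignment issue shows up
looks something like this (a sketch; the field names are illustrative
rather than from my actual driver):

  #include <stdint.h>

  /* Hypothetical FAT volume info; names are illustrative only. */
  typedef struct {
      uint32_t part_lba;       /* LBA of the partition start */
      uint32_t rsvd_sects;     /* reserved sectors (BPB_RsvdSecCnt) */
      uint32_t num_fats;       /* number of FATs (BPB_NumFATs) */
      uint32_t fat_sects;      /* sectors per FAT */
      uint32_t root_sects;     /* root directory sectors (0 on FAT32) */
      uint32_t sects_per_clus; /* sectors per cluster */
  } FatVolInfo;

  /* First sector of the data area; nothing in the BPB forces this to
   * be a multiple of the cluster size, hence the possible "32K
   * clusters on a 2K boundary" case. */
  static uint32_t fat_DataStartLba(const FatVolInfo *vi)
  {
      return vi->part_lba + vi->rsvd_sects +
          (vi->num_fats * vi->fat_sects) + vi->root_sects;
  }

  /* Clusters are numbered starting at 2. */
  static uint32_t fat_ClusterToLba(const FatVolInfo *vi, uint32_t clus)
  {
      return fat_DataStartLba(vi) + (clus - 2) * vi->sects_per_clus;
  }

So a more general-purpose buffer cache that wanted to key on
cluster-size-aligned device offsets would have to deal with the data
area potentially starting at an arbitrary sector.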
If I were designing my own filesystem, I would probably have done some
things differently. Though, my ideas didn't really look like EXTn either.
Had previously considered something that would have looked partway
between EXT2 and a somewhat simplified NTFS, but hadn't done much here
as it would make for a lot of hassle on the Windows side of things.
Mostly would want a few features that seem a bit lacking in FAT.
Though, did recently discover the existence of the "Projected
FileSystem" API in Windows, which allows the possibility of implementing
custom user-mode filesystems on Windows (sorta; it is a bit wonky).
This does open / re-open some possibilities.
> There is also scatter-gather IO, intended for network cards,
> where the IO is a list of byte sized and aligned virtual buffers.
>
> This all interacts with DMA and page management because the physical
> page frames that contain the bytes must be pinned in memory for the
> duration of the DMA IO. A single virtual buffer becomes a list of
> physical fragments, so a scatter-gather list becomes a list of lists
> of physical byte buffer fragments, called a Memory Descriptor List (MDL)
> in Windows.
>
> And then SR-IOV adds virtual machines to the mix, where a guest OS
> physical address becomes a hypervisor guest virtual address,
> and not only are guest buffers in guest user space, but the guest OS
> MDL's are themselves in hypervisor virtual space and require their own
> hypervisor MDL's (lists of lists of lists of fragments).
>
OK.
I can note that in my project, there is no DMA mechanism as of yet.
Pretty much everything is either MMIO mapped buffers or polling IO.
When I looked at a network card before (once, long ago), IIRC its design
was more like:
There were a pair of ring-buffers, for TX and RX;
One would write frames to the TX buffer, and update the pointers, and
the card would send them;
When a frame arrived, it would add it into the buffer, update the
pointers, and then raise an IRQ.
This design was used in ye olde RTL8139 and similar.
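Roughly, the driver side of that scheme would look something like the
following (a generic sketch of the idea; the struct/register names are
made up, and the actual RTL8139 programming interface differs in the
details):

  #include <stdint.h>

  #define NIC_RING_SZ 8192  /* power of two, so offsets can be masked */

  typedef struct {
      volatile uint8_t  *buf;     /* ring memory (MMIO-mapped here) */
      volatile uint32_t *wr_ptr;  /* producer offset */
      volatile uint32_t *rd_ptr;  /* consumer offset */
  } NicRing;

  /* TX (driver is producer): copy the frame into the ring with a
   * 4-byte length prefix, then advance the write pointer so the card
   * knows there is something to send. */
  static int nic_TxFrame(NicRing *tx, const uint8_t *frame, uint32_t len)
  {
      uint32_t wr = *tx->wr_ptr, rd = *tx->rd_ptr;
      uint32_t space = (rd - wr - 1) & (NIC_RING_SZ - 1);
      uint32_t i;

      if ((len + 4) > space)
          return -1;                      /* ring full, retry later */

      for (i = 0; i < 4; i++)             /* little-endian length */
          tx->buf[(wr + i) & (NIC_RING_SZ - 1)] =
              (uint8_t)(len >> (8 * i));
      for (i = 0; i < len; i++)           /* frame bytes */
          tx->buf[(wr + 4 + i) & (NIC_RING_SZ - 1)] = frame[i];

      *tx->wr_ptr = (wr + 4 + len) & (NIC_RING_SZ - 1);
      return 0;
  }

  /* RX (card is producer): called from the IRQ handler; consume
   * frames until the read pointer catches up with the card's write
   * pointer. */
  static void nic_RxPoll(NicRing *rx,
      void (*deliver)(const uint8_t *frame, uint32_t len))
  {
      uint8_t frame[2048];
      uint32_t rd, len, i;
      while (*rx->rd_ptr != *rx->wr_ptr) {
          rd = *rx->rd_ptr; len = 0;
          for (i = 0; i < 4; i++)
              len |= (uint32_t)rx->buf[(rd + i) & (NIC_RING_SZ - 1)]
                  << (8 * i);
          if (len > sizeof(frame))
              len = sizeof(frame);  /* clamp; real code would resync */
          for (i = 0; i < len; i++)
              frame[i] = rx->buf[(rd + 4 + i) & (NIC_RING_SZ - 1)];
          deliver(frame, len);
          *rx->rd_ptr = (rd + 4 + len) & (NIC_RING_SZ - 1);
      }
  }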
Had looked at another Ethernet interface, and it differed in that it
only had a 2K buffer for a single frame:
When a frame arrived, it was written into the buffer, and an interrupt
was raised;
When set to transmit, the buffer contents were transmitted, and then an
interrupt would be raised.
Seemingly, this interface would be unable to receive a frame while
trying to transmit a frame. Nor could it deal with a new frame arriving
before the previous frame had been read by the driver.
Though, this latter one was on an FPGA-based soft-processor.
If/when I get to it, had considered using the paired ring-buffer
design, each ring probably 8K or 16K (where 8K is enough for 4
full-sized Ethernet frames, each typically limited to around 1500
bytes of payload; 16K could give more "slack" for the driver, at the
expense of using more BlockRAM).
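Rough sizing, assuming standard Ethernet II framing (no VLAN tag):

  1500 (payload) + 14 (header) + 4 (FCS) = 1518 bytes per full frame
  4 * 1518 = 6072 bytes, which fits in 8K (8192) with ~2K to spare
  16384 / 1518 ~= 10, so 16K holds around 10 full-sized frames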
>>
>> Like, say, for a filesystem, it is presumably:
>> read syscall from user to OS;
>> route this to the corresponding VFS driver;
>> Requests spanning multiple blocks being broken up into parts;
>> VFS driver checks the block-cache / buffer-cache;
>> If found, copy from cache into user-space;
>> If not found, send request to the underlying block device;
>> Wait for response (and/or reschedule task for later);
>> Copy result back into userland.
>
> Yes, pretty much (there is page management, quota management).
> Except if I request a direct IO it DMA's direct to/from the user buffer,
> if hardware supports that.
>
OK.
No equivalent in my case (slow polling IO only for now).
Though, did go the route of allowing the SDcard to be accessed with 8
bytes per SPI transfer, which at least "kicked the can down the road
slightly": originally, with 1-byte SPI bursts, the overhead of the MMIO
polling interface made things slower than an SDcard running on a 10MHz
SPI interface would otherwise allow. The 8-byte transfers made it
faster, but I ended up mostly settling on 13MHz SPI (the fastest speed
where I could get reliable results on the actual hardware I was
using, *1).
*1: Where, seemingly, the combination of a microSD to full-size SD
extender cable + a microSD to full-size SD adapter on the card was not
ideal for signal integrity (but used mostly because otherwise microSD
cards are too small / easy to drop and not be able to find again;
whereas full-size SD cards are easier to handle). Wouldn't have
expected the attenuation to be *that* bad though.
Though, it works "well enough", since if it were that much faster, would
need to create a new interface.
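For reference, the polled access is basically along these lines (a
minimal sketch; the MMIO addresses and register names are made up for
illustration and are not my actual hardware interface):

  #include <stdint.h>

  /* Hypothetical MMIO layout for a polled SPI interface. */
  #define SPI_CTRL  (*(volatile uint64_t *)0xF000C000u)
  #define SPI_DATA  (*(volatile uint64_t *)0xF000C008u)
  #define SPI_BUSY  1ull

  /* One 8-byte burst: stage 8 bytes, kick the transfer, spin until
   * done, read back the 8 bytes shifted in. With 1-byte transfers
   * this poll loop runs once per byte and the MMIO round trips
   * dominate; with 8-byte bursts the same overhead is paid once per
   * 8 bytes, so the SPI clock becomes the limit again. */
  static uint64_t spi_Xchg8(uint64_t out)
  {
      SPI_DATA = out;
      SPI_CTRL = SPI_BUSY;
      while (SPI_CTRL & SPI_BUSY)
          ;
      return SPI_DATA;
  }

  /* Read a 512-byte SD block payload as 64 such bursts, sending 0xFF
   * bytes while clocking data in. */
  static void sd_ReadBlockData(uint8_t *dst)
  {
      int i, j;
      uint64_t v;
      for (i = 0; i < 64; i++) {
          v = spi_Xchg8(~0ull);
          for (j = 0; j < 8; j++)
              dst[i * 8 + j] = (uint8_t)(v >> (8 * j));
      }
  }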
>> Though, it may make sense that if a request isn't available
>> immediately, and there is some sort of DMA mechanism, the OS could
>> block the task and then resume it once the data becomes available. For
>> polling IO, doesn't likely make much difference as the CPU is
>> basically stuck in a busy loop either way until the IO finishes.
>
> Yes, that's DMA resource management. Basically each system has a certain
> number of scatter-gather IO mappers, now implemented by the IOMMU page
> table.
> Each IO queues a request for its mappers, and the DMA resource manager
> doles
> out a set of IO mapping registers, which may be less than you requested
> in which case you break up your IO into multiple requests.
> Then you program the scatter-gather map using info from the IO's MDL,
> pass the mapped IO space addresses to the device, and Bob's your uncle.
> When the IO completes, your driver tears down its IO map and releases
> the mapping registers to the next waiting IO.
>
OK.
>> Though, could make sense for hardware accelerating pixel-copying
>> operations for a GUI.
>
> On Windows the Gui is managed completely differently.
> I'm not familiar enough with the details to comment other than to say
> it is executed as privileged subroutines by the calling thread but in
> super mode, which allows it direct access to the calling virtual space.
>
I am not sure how the Windows GUI works.
In my case though, my experimental GUI had effectively worked by using
something resembling a COM object to communicate with a GUI system
running in a different task (it basically manages redrawing the window
stack and sending it out to the display and similar).
But, here, the whole process involves a bunch of pixel-buffer copying,
which isn't terribly fast (currently eats up a bigger chunk of time than
running Doom itself does).
By contrast, Win32 GDI seems to be more object-based, rather than
built on top of drawing into pixel buffers and copying them around.
However, IME, my way of using Win32 GDI was mostly to create a Bitmap
object, update it, and endlessly draw it into the window, which is
basically the native model in TKGDI.
Seemingly, X11 was a little different as well, with commands for drawing
stuff (like color-fills, lines, and text). Though, presumably all of
this would just end up going into a pixel buffer.
Because each drawing operation in TKGDI supplies a BITMAPINFOHEADER
for the thing to be drawn, there can also be format conversion in the
mix.
So, say, running Doom or similar in the GUI mode doesn't give
particularly high framerates.
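As an idea of where the time goes, each redraw ends up doing something
like the following for each window (a simplified sketch; the struct
and function names are illustrative rather than the actual TKGDI
internals, and an RGB555 output buffer is assumed here):

  #include <stdint.h>

  /* Hypothetical source bitmap descriptor (stand-in for the
   * BITMAPINFOHEADER + bits handed to the drawing call). */
  typedef struct {
      int   width, height;
      int   bpp;    /* source bits per pixel; only 32 handled below */
      void *bits;
  } GuiBitmap;

  /* Convert 32-bit BGRA source pixels to RGB555 while copying them
   * into the window's backing buffer; this runs for every dirty
   * window on every redraw, and the composited result then gets
   * copied again out to the display. */
  static void gui_BlitConvert555(uint16_t *dst, int dst_stride,
      const GuiBitmap *src)
  {
      const uint32_t *sp = (const uint32_t *)src->bits;
      int x, y;
      uint32_t px;
      uint16_t r, g, b, *dp;
      for (y = 0; y < src->height; y++) {
          dp = dst + y * dst_stride;
          for (x = 0; x < src->width; x++) {
              px = sp[y * src->width + x];
              r = (px >> 19) & 0x1F;  /* top 5 bits of R */
              g = (px >> 11) & 0x1F;  /* top 5 bits of G */
              b = (px >>  3) & 0x1F;  /* top 5 bits of B */
              dp[x] = (r << 10) | (g << 5) | b;
          }
      }
  }

Multiply that by the number of visible windows, plus the final copy
out to the display, and it adds up quickly.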
Had recently added tabs to the console, and had used it to launch Doom
and Quake at the same time (as a test). Both of them ran, showing that
the multi-tasking does in fact work (within the limits of still being
cooperative multitasking).
However, performance was so bad as to make both of them basically
unusable (all of this dropped frame rate to around 2 frames / second).
https://twitter.com/cr88192/status/1735233196562796800