BareMetal File System - Initial specifications


Ian Seyler

Feb 17, 2012, 10:40:21 AM
to BareMetal OS
Here are the current specs for the new file system. These are the
initial specs so work is still being done. Once it is complete the
coding for the driver will be started.

https://github.com/ReturnInfinity/BareMetal-OS/blob/master/docs/BareMetal%20File%20System.md

Josh Vega

Feb 18, 2012, 1:04:22 AM
to bareme...@googlegroups.com
Interesting... I'm going to have to disagree with the directory record size. I'm guessing that a directory record is a file, right? So by record size do you mean "maximum file size" or "size of the record structure within the directory structure"?

Otherwise, it seems great! If this works out well, we won't have to do anything major to it until they start making drives larger than 16GiB!

/Josh

42Bastian

Feb 18, 2012, 12:35:27 PM
to bareme...@googlegroups.com

Why the artificial limitation to 64 directory entries (== files)?

There is no way to recover a file if its directory record is corrupted.

Actually, the whole filesystem can get damaged if even a single record has the wrong size.

I'd keep at least two copies of the bitmap and the directory record.

--
42Bastian

Ian Seyler

Feb 19, 2012, 2:42:08 PM
to BareMetal OS
The directory record holds the file records. In this spec the
directory record is 4096 bytes, so it can contain 64 file records. The
maximum file size is 18,446,744,073,709,551,615 bytes (2^64 - 1) since
it is a 64-bit value.

-Ian

Ian Seyler

Feb 19, 2012, 2:45:34 PM
to BareMetal OS
No real reason, but I do want it small enough to be cache-able in
memory.

Since files are all contiguous it would be fairly straightforward to
recover a file with a corrupted record.

As for keeping 2 copies I was thinking of duplicating the first two
blocks on the disk to the physical end of the disk.

-Ian



Ian Seyler

Jul 18, 2012, 10:09:50 AM
to bareme...@googlegroups.com
A BMFS volume would be bootable and the plan was to have PURE64.SYS and KERNEL64.SYS in the free space on Block 0. I don't see it as practical to waste two file records on these small files.

Your bitmap idea (or rather the lack of one) is very interesting! With only a fixed number of files it would be very easy to calculate where the free blocks are. I think that would be much more efficient than crawling through a bitmap. With the directory cached in memory (since it is only 4KiB) it would be pretty simple. Excellent idea!
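
Since BMFS files are contiguous runs of 2MiB blocks, the free list really can be derived from the 64 directory records alone. Here is a minimal sketch in Python (the language Ben mentions using for his VMDK tooling); the (start block, reserved blocks) record layout is a simplification for illustration, not the actual BMFS on-disk format:

```python
# Sketch: derive free extents from the directory alone (no bitmap).
# The (start_block, reserved_blocks) record layout is a simplification
# for illustration, not the actual BMFS on-disk format.

def free_extents(records, total_blocks):
    """Return (start, length) pairs describing free runs of 2MiB blocks.

    records      -- (start_block, reserved_blocks) pairs for in-use files
    total_blocks -- total number of 2MiB blocks on the volume
    """
    used = sorted(r for r in records if r[1] > 0)
    extents = []
    next_free = 1  # block 0 holds the MBR and directory, never allocatable
    for start, length in used:
        if start > next_free:
            extents.append((next_free, start - next_free))
        next_free = max(next_free, start + length)
    if next_free < total_blocks:
        extents.append((next_free, total_blocks - next_free))
    return extents
```

For example, with files occupying blocks 1-2 and block 5 on a ten-block volume, `free_extents([(1, 2), (5, 1)], 10)` yields `[(3, 2), (6, 4)]`.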

Yes, the remaining 8 bytes of each file record could be used for CRC32. That would take 4 bytes, correct?

The plan is to keep a copy of block 0 at block n-1 for redundancy. Similar to how GPT works.

Can you elaborate on the 256KiB erase blocks? I'll check my unpublished Pure64 code for BMFS work I recall doing. Also, my AHCI code has not been published yet.

-Ian


On Wednesday, July 18, 2012 7:28:35 AM UTC-4, Ben Dyer wrote:
Given the spec has an MBR, I assume BMFS volumes are intended be bootable — do you plan to store PURE64.SYS and KERNEL.SYS as ordinary files, or will some of the 1016KiB of free space in Block 0 be used for system files, perhaps with a secondary (smaller) directory structure for that section?

I'm wondering about the free space bitmap, and specifically whether it's worth storing that on disk given it can be trivially recreated from the directory at system startup. It seems to me that if it were stored, you'd have to be able to recreate it to cope with power failures etc, and it's probably faster to calculate than to load from disk. That sort of structure is also going to be suboptimal for SSDs, since it's likely to generate a lot of unaligned writes. 

In fact, with only 64 files the create syscall could work from the directory info, removing the need for a bitmap entirely.

Just a few other thoughts:
- Could some of the 8 free bytes in the Directory Record structure be used to store a version counter and a CRC32 of that record? The version counter would help in the case that there are redundant valid records on the disk, but the system died between updating them, and the CRC32 would ensure corrupted records were ignored.
- It would be worth versioning and checksumming the first 4KiB of Block 0, for the same reasons.
- It's good that both of these structures fit into 4KiB each since that's a common page size for SSDs, but it would probably be even better if the legacy MBR sector (which probably does not require frequent re-writing), the directory structure, and any further data structures likely to be updated independently be aligned on separate 256KiB erase blocks.

This FS would be a much nicer base than FAT16 for some of the stuff I'm working on, so I'm currently writing some Python tools to generate and fill VMDK images for it, as well as hacking some sort of support into Pure64.

Ben Dyer

Jul 18, 2012, 11:03:09 AM
to bareme...@googlegroups.com
Hi Ian,

On 19/07/2012, at 00:09 , Ian Seyler wrote:

> A BMFS volume would be bootable and the plan was to have PURE64.SYS and KERNEL64.SYS in the free space on Block 0. I don't see it as practical to waste two file records on these small files.

Makes sense.

> Yes, the remaining 8 bytes of each file record could be used for CRC32. That would take 4 bytes, correct?

Correct — I was thinking 4 bytes of that, plus one additional byte as a counter value. Every time the directory is updated the counter value would be incremented, and comparing the counters would show which of two redundant copies of a record is more recent.
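
A rough Python sketch of that scheme, assuming a 64-byte record whose last 5 bytes hold the counter and CRC32 (the offsets and layout are illustrative, not from the BMFS spec):

```python
# Sketch of the proposed integrity fields: 5 of the 8 spare bytes in a
# 64-byte record hold a 1-byte update counter and a CRC32.
# Offsets and layout are illustrative, not from the BMFS spec.
import zlib

RECORD_SIZE = 64

def seal(record_body, counter):
    """Append a counter byte and CRC32 to 59 bytes of record payload."""
    assert len(record_body) == RECORD_SIZE - 5
    data = record_body + bytes([counter & 0xFF])
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return data + crc.to_bytes(4, "little")

def is_valid(record):
    """A corrupted record fails its CRC check and is ignored."""
    data, stored = record[:-4], int.from_bytes(record[-4:], "little")
    return (zlib.crc32(data) & 0xFFFFFFFF) == stored

def newer(counter_a, counter_b):
    """True if a is the more recent copy, allowing the byte to wrap."""
    return ((counter_a - counter_b) & 0xFF) < 0x80
```

The wraparound comparison in `newer` means the counter can roll over from 255 back to 0 without the older copy suddenly appearing newer.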

> Can you elaborate on the 256KiB erase blocks?

It actually varies between flash technologies — for example, on consumer-grade Intel SSDs the page size is 8KiB and the (erase) block size is 2MiB — but most enterprise SLC has 2KiB or 4KiB pages and 128KiB or 256KiB erase blocks. There's a good overview here:
http://en.wikipedia.org/wiki/Write_amplification

And some practical numbers here:
http://www.qdpma.com/Storage/SSD.html

The relevance is that the SSD can't overwrite data at the NAND level; once a page has had any data written to it, if you want to rewrite it, the SSD controller has to:
* Read the entire block of 64+ pages to its internal cache (probably < 1 ms);
* Erase the whole block of 64+ pages (1–4ms depending on technology);
* Write all the individual pages back except for the one you wanted to overwrite, which has the new data written instead (probably a few ms).

Since that's really slow, in reality SSDs have a mapping layer that enables them to just write a never-before-written page, and remap that new page in place of the old one. That works great and keeps things fast until you've written to all of the pages on the SSD (which happens very quickly if you're doing writes that aren't aligned to the page size), but after that point every write has to involve some garbage collection, consolidation and rewriting.
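
To make the alignment point concrete, here is a toy calculation using the 256KiB erase-block figure from above (real device geometries vary):

```python
# Toy illustration of alignment cost: count how many erase blocks a
# single write touches. 4KiB pages and 256KiB erase blocks follow the
# figures discussed above; real device geometries vary.

PAGE = 4 * 1024
ERASE_BLOCK = 256 * 1024

def erase_blocks_touched(offset, length):
    first = offset // ERASE_BLOCK
    last = (offset + length - 1) // ERASE_BLOCK
    return last - first + 1

# An aligned 256KiB write stays within one erase block; the same write
# shifted by half a page straddles two, doubling the read/erase/rewrite
# work once the remapping pool is exhausted.
```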

There have been a few articles on this but I think the first chart on this page sums it up nicely:
http://www.xbitlabs.com/articles/storage/display/intel-ssd-520_4.html

Essentially, write performance after the disk has been filled twice over is about 1/5th of what it was when the drive was new.

And of course, once you've erased a block enough times, the risk of it failing increases — for enterprise SLC flash you get maybe 100,000 erases, whereas for consumer MLC you might get as few as 5,000.

So, on Linux there's a general recommendation that partitions be aligned to the erase block size of the device; the 2MiB granularity of file allocations in BMFS works nicely in that regard, and since there's space to keep things separated in Block 0 it might help a little, particularly with TRIM support in the OS/filesystem and SSD. It's nowhere near as important as alignment of writes to the page size though.

> I'll check my unpublished Pure64 code for BMFS work I recall doing. Also, my AHCI code has not been published yet.

Thanks, that would be great.

Regards,
Ben

Ben Dyer

Jul 18, 2012, 11:18:16 AM
to bareme...@googlegroups.com
One other thing:

On 19/07/2012, at 00:09 , Ian Seyler wrote:

> A BMFS volume would be bootable and the plan was to have PURE64.SYS and KERNEL64.SYS in the free space on Block 0. I don't see it as practical to waste two file records on these small files.

Unless PURE64.SYS is configured to chain-load KERNEL64.SYS with a certain maximum file size, I guess this would need a table for offsets/lengths of those two files (maybe more, as I seem to recall someone mentioning cli.asm might be separated from the kernel?), or the use of an object file format that contains enough information to load them separately.

An object file format would have the possible advantage of resolving issue #4 on GitHub [1] without requiring everyone pass -fno-zero-initialized-in-bss to gcc, although the format would probably have to be something gcc/nasm/ld can generate — and looking at ELF output from the standard tools [2] the overhead might not be worth it.


[1]: https://github.com/ReturnInfinity/BareMetal-OS/issues/4
[2]: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html

Ian Seyler

Jul 18, 2012, 1:27:37 PM
to bareme...@googlegroups.com
Hi Ben,

Thanks. I'll take a look at those links.

Here is the Pure64 SATA+BMFS code I mentioned (last modified in April): http://cl.ly/IA8f/pure64-sata-bmfs.zip
I recall that Pure64 was able to read the kernel (as a regular file) from a SATA drive.

-Ian

Ian Seyler

Jul 18, 2012, 1:36:50 PM
to bareme...@googlegroups.com
Those details would need to be worked out still. In the Pure64 code from April I did have the kernel in the filesystem. None of this is an issue with PXE booting.

Initially I would just hard-code it to load 64KiB from a fixed location on the disk if the kernel is not stored as a file. We are sitting at 16KiB now; the TCP/IP stack should bump that up a bit.

ELF isn't on my radar at the moment. That may change though when more serious programs are ported.

Re teensy: 4KiB for a program that just returns a number is a classic example of bloat :) I'll have to try compiling that for BareMetal as a flat binary and see what the result is.

-Ian

42Bastian

Jul 18, 2012, 4:39:11 PM
to bareme...@googlegroups.com
Hi

> Re teensy: 4KiB for a program that just returns a number is a classic
> example of bloat :) I'll have to try compiling that for BareMetal as
> a flat binary and see what the result is.

Well, compiled with a bare-metal gcc, the ELF without debug info is only 1088
bytes. A lot of this is of course the init code, which could be omitted if
a loader handled .data/.bss section initialization.

size t.elf
   text    data     bss     dec     hex filename
    325      24      32     381     17d t.elf

With some simple tuning (linking with -n and removing the .comment
section), the file size is 992 bytes, which isn't too bad. Especially
since the overhead will not grow much unless you start to use C++ ;-)

--
42Bastian

Ben Dyer

Jul 19, 2012, 11:44:39 PM
to bareme...@googlegroups.com
On 19/07/2012, at 03:36 , Ian Seyler wrote:

> Those details would need to be worked out still. In the Pure64 code from April I did have the kernel in the filesystem. None of this is an issue with PXE booting.
>
> Initially I would just hard code it to load 64KiB from the disk where the kernel is stored if not as a file. We are sitting at 16KiB now. The TCP/IP stack should bump that up a bit.

Yeah, the PXE code seems more finished at the moment. However, I'm trying to set up some unit tests to build a VMDK, run VirtualBox for a bit and then check the disk image for expected output — so while it'd be possible to have those scripts run a TFTP server and all the rest as well, it's simpler to stick with flat files.

I've written an MBR that loads 8KiB from (512-byte) sectors 16–23 and runs it; I'll allow 64KiB for the kernel right after it. Once the plan for the free space in Block 0 is pinned down, it'll be easy to switch that around to whatever it needs to be. The readsector routine has also been adapted to deal with 64-bit sector numbers in case the data needs to be pulled from Block N-1 instead.

It currently doesn't read the partition table, since I'm not sure how that fits in with your plans for BMFS volume information.

I merged your BMFS/SATA changes into the current Pure64 master branch; I'm pushing these changes to a branch of my fork [1] and will continue to pull in changes from upstream so this can all be merged back at some later stage. I might also try to come up with a makefile that allows building versions to boot from FAT16, BMFS, raw binary or PXE.


> ELF isn't on my radar at the moment. That may change though when more serious programs are ported.

It's an interesting question. I think ELF is probably overkill because of the facilities for shared libraries, relocatable code etc — there's no point supporting shared libraries when only a couple of processes will ever be installed on a system, and not really much point relocating if you're not supporting shared libraries or a native toolchain.

Older, simpler formats might be a better fit — a.out [2] perhaps?

I'm hoping to get a chance to do a toy port of one of our applications once the networking code is in, so I'll provide further feedback during that process.


[1]: https://github.com/bendyer/Pure64/tree/bendyer-bmfs-sata-integration
[2]: http://en.wikipedia.org/wiki/A.out

Simon Heath

Jul 20, 2012, 11:03:22 AM
to baremetal-os
Excerpts from Ben Dyer's message of 2012-07-19 20:44:39 -0700:
> > ELF isn't on my radar at the moment. That may change though when more serious programs are ported.
>
> It's an interesting question. I think ELF is probably overkill because of the facilities for shared libraries, relocatable code etc — there's no point supporting shared libraries when only a couple of processes will ever be installed on a system, and not really much point relocating if you're not supporting shared libraries or a native toolchain.

On the other hand, loading static ELF files really isn't particularly
difficult. I wrote an ELF loader in 250 lines or so for another OS
project, which essentially just checked the magic number and relocated
the sections to the right places in memory.

Though it's possible that having all that spare stuff in the header
and ignoring it because you'll never use it is exactly what you
want to avoid. :-P I'm relatively new to Baremetal, so I defer to the
experts.

Simon

Ian Seyler

Jul 20, 2012, 11:45:10 AM
to bareme...@googlegroups.com
I did put more effort into the PXE code since it is what BareMetal Node uses.

Awesome work! I look forward to pulling your changes into the official branch.

The PDP-7/PDP-11 a.out format looks pretty good. The BareMetal File System is loosely based on what RT-11 used on the PDP-11. PE is complicated as well: http://code.google.com/p/corkami/wiki/PE101

-Ian

Ian Seyler

Jul 20, 2012, 11:48:44 AM
to bareme...@googlegroups.com
> Though it's possible that having all that spare stuff in the header
> and ignoring it because you'll never use it is exactly what you
> want to avoid. :-P I'm relatively new to Baremetal, so I defer to the
> experts.

Exactly. If we aren't going to use the extra features then it makes no sense to implement them when flat binaries work well as-is. Again, depending on future needs, that may change.

-Ian 

Ben Dyer

Jul 21, 2012, 12:04:13 AM
to bareme...@googlegroups.com
Hi Ian,

Just wanted to run this by you before I get stuck into it — I'm setting up a makefile for Pure64 to simplify selection of different filesystems, hardware, and output disk image file formats.

The options I'm thinking of are:
• Disk interface type: legacy IDE or AHCI
• Filesystems: FAT16 or BMFS
• Output image formats: VMDK or raw
• Output image size
• Loader filename (for FAT16)
• Kernel filename (for FAT16)
• Chain-load target file path (will be concatenated onto pure64.sys with the appropriate conditional code included)

The src directory would be structured something like:

src/
    bootsectors/
        bmfs.asm
        fat16.asm
        pxestart.asm
    filesystems/
        bmfs.asm
        fat16.asm
    interfaces/
        ide.asm
        ahci.asm
    init/
        acpi.asm
        cpu.asm
        ioapic.asm
        isa.asm
        smp.asm
        smp_ap.asm
    interrupt.asm
    pci.asm
    pure64.asm
    syscalls.asm
    sysvar.asm

sysvar.asm would conditionally define variables for the relevant filesystem and hard disk controller options; the filesystem and relevant init files would be conditionally included, and would present a consistent interface to ensure orthogonality of filesystem and controller selection.

If the makefile is not used, running "nasm pure64.asm -o pure64.sys" would create a binary using legacy IDE support with FAT16, just as it does now.

Regards,
Ben

42Bastian

Jul 21, 2012, 1:43:45 AM
to bareme...@googlegroups.com
Hi Ben

> The options I'm thinking of are:
> • Disk interface type: legacy IDE or AHCI
> • Filesystems: FAT16 or BMFS
> • Output image formats: VMDK or raw
> • Output image size
> • Loader filename (for FAT16)
> • Kernel filename (for FAT16)
> • Chain-load target file path (will be concatenated onto pure64.sys
>   with the appropriate conditional code included)

Sounds great (also for beginners with BM).

--
42Bastian

Ian Seyler

Jul 22, 2012, 11:16:04 AM
to bareme...@googlegroups.com
Hi Ben,

This sounds great! Awesome idea to make things simpler.

Thanks,
-Ian

Ben Dyer

Jul 23, 2012, 10:55:09 AM
to bareme...@googlegroups.com
This is now available at https://github.com/bendyer/Pure64/tree/bendyer-bmfs-sata-integration

New functionality is described in the readme.md file.

I haven't yet submitted a PR to upstream as the following is still in progress:
• Setting up a QEMU environment for testing;
• Testing PXE netboot on VirtualBox;
• Finishing FAT16 disk image generation on Linux (using parted).

While testing, there are a couple of things I've noticed in BMOS itself:
• The HPET initialisation code causes my version of VMware Fusion (4.1.3) to freeze — the VM won't even respond to remote debugger connection attempts, and can't be quit via the UI. VirtualBox doesn't seem to be affected, since HPET appears to be disabled by default. The freeze happens between lines 221 and 229 of os/init_64.asm
• BMOS doesn't support AHCI or BMFS, so as soon as control is transferred from Pure64 things stop working properly. Since there's now a consistent interface to the AHCI and PIO code in Pure64 it'd be easy to copy that over to BMOS; adding the draft BMFS support in should be straightforward as well. Would you like me to sort that out?

Another question regarding disk/filesystem procedures: if the new network stack is using callbacks rather than a polling approach [1], should there be a callback-based interface for disk access? With the DMA support in the AHCI code, and BMFS being designed around larger read/write sizes, avoiding polling should significantly reduce CPU utilisation in I/O-heavy workloads.


[1]: https://github.com/ReturnInfinity/BareMetal-OS/issues/20

Ian Seyler

Jul 23, 2012, 8:11:55 PM
to bareme...@googlegroups.com
This is some stellar work Ben.

You can comment out the HPET code for now. Debugging can be done on it later.

If you want to merge the AHCI and BMFS code into BMOS as well that would be great!

Callbacks should be used for disk access as well. I have removed the ring buffer support in the 'networkcallback' branch of BareMetal OS. I have not implemented the network callback yet as I'm still figuring out the best way to implement it.

Awesome progress!

-Ian

Ben Dyer

Jul 24, 2012, 7:37:34 AM
to bareme...@googlegroups.com
On 24/07/2012, at 10:11 , Ian Seyler wrote:

> If you want to merge the AHCI and BMFS code into BMOS as well that would be great!

Will do.

> Callbacks should be used for disk access as well. I have removed the ring buffer support in the 'networkcallback' branch of BareMetal OS. I have not implemented the network callback yet as I'm still figuring out the best way to implement it.

Fair enough — I was going to try to copy the approach used by the network interface code as far as possible :)

I guess the main question is how I/O scheduling and concurrency will fit into the os_smp_* concurrency procedures; in the short term, I'll set everything up as blocking, and we can split it up as required once scheduling/concurrency is pinned down.

Ian Seyler

Aug 7, 2012, 9:40:59 AM
to bareme...@googlegroups.com
I have updated the BMFS spec based on your suggestion about removing the free block bitmap. Very clever idea! Is it OK if I include your name in the doc?

Still working on the callback layout...

-Ian

Ben Dyer

Aug 7, 2012, 9:49:48 AM
to bareme...@googlegroups.com
Hi Ian,

Sure. Sorry about the delay on the pull request, I've now given up trying to get qemu running properly on OS X 10.8 and am setting it up in a Windows VM instead.

Regards,
Ben

Ben Dyer

Aug 8, 2012, 12:09:57 AM
to bareme...@googlegroups.com
Hi Ian,

On 07/08/2012, at 23:40 , Ian Seyler <ise...@gmail.com> wrote:

> Still working on the callback layout...

Are you planning on running the callbacks from I/O interrupts, or will they be run at user level like the process itself?

I've been giving the issue some thought and if it's the latter, it seems to me that process suspend/resume around blocking I/O operations is probably going to be the easiest to work with. Relatively few changes to the existing process queue would be required, and it wouldn't complicate the concurrency model by introducing new concepts.

Regards,
Ben

Ian Seyler

Aug 8, 2012, 10:13:33 AM
to bareme...@googlegroups.com
I know that pain all too well. I do my QEMU work on Windows as well.

-Ian

Ian Seyler

Aug 8, 2012, 10:17:43 AM
to bareme...@googlegroups.com
Keep in mind that everything runs in ring 0 so there is no context switching like in other operating systems.

The problem with the process queue is that there is no way to pre-empt something. If all cores are busy then the new job in the queue will not execute.

I have completed some stack fudging features in the past for another project (Calling an OS function from an interrupt). Looking into that.

-Ian

Ben Dyer

Aug 8, 2012, 11:27:24 AM
to bareme...@googlegroups.com
On 09/08/2012, at 00:17 , Ian Seyler <ise...@gmail.com> wrote:

> The problem with the process queue is that there is no way to pre-empt something. If all cores are busy then the new job in the queue will not execute.

If the expected time required for a process to execute is small enough, that won't be an issue, but effectively that's just co-operative multitasking across the process set. For the jobs I have in mind, that's probably not a problem, but it'd make coding for HPC applications or long-running tasks unnecessarily complex.

> I have completed some stack fudging features in the past for another project (Calling an OS function from an interrupt). Looking into that.

So are you thinking of adding some sort of pre-emptively scheduled I/O tasks to the concurrency model, or would the user-developed/application code be expected to register I/O handlers that would be called from the ISRs?


Ian Seyler

Aug 8, 2012, 11:29:14 AM
to bareme...@googlegroups.com
I found my stack fudge code. My reasoning for going with this was to not have a crash within the interrupt call.

Current setup:
 Application is running
 Network interrupt is triggered, executes, and finishes
 Application execution is returned to where it left off

Putting the callback within the interrupt handler would cause issues if the callback is faulty, for instance if it never finishes.

Future setup:
  Application installs network callback
  Application is running
  Network interrupt is triggered, executes, adjusts the stack, and finishes
  Application callback is executed, and finishes
  Application execution is returned to where it left off
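
A toy simulation of that flow, in Python rather than BareMetal assembly: the interrupt handler only records that a callback is pending, the callback runs after the handler has finished, and only then does the application resume. The real mechanism adjusts the stack so the callback runs on return from the interrupt; the queue here just models the ordering.

```python
# Toy simulation (Python, not BareMetal assembly) of the future setup:
# the ISR only marks the callback as pending; the callback runs after
# the ISR has finished, and only then does the application resume.

log = []
pending = []
registered = None

def install_callback(fn):
    global registered
    registered = fn

def network_interrupt():
    """Runs in 'interrupt context': never calls the callback directly."""
    log.append("isr")
    if registered:
        pending.append(registered)  # the real code adjusts the stack instead

def run_application():
    log.append("app-start")
    network_interrupt()
    while pending:        # models 'after the interrupt returns'
        pending.pop(0)()
    log.append("app-resume")

install_callback(lambda: log.append("callback"))
run_application()
# log is now ["app-start", "isr", "callback", "app-resume"]
```

The point of the indirection is visible in the ordering: a faulty callback can hang the application, but never the interrupt path itself.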

-Ian

Ben Dyer

Aug 8, 2012, 11:45:10 AM
to bareme...@googlegroups.com
On 09/08/2012, at 01:29 , Ian Seyler <ise...@gmail.com> wrote:

> Future setup:
> Application installs network callback
> Application is running
> Network interrupt is triggered, executes, adjusts the stack, and finishes
> Application callback is executed, and finishes
> Application execution is returned to where it left off

Ah, that makes sense. So although the network interrupts are initially serviced by the OS, the application is ultimately responsible for handling everything.

Are callbacks going to be for raw packets, so applications link with/include the IP stack themselves, or will the callbacks be defined on higher-level events like TCP connect, retransmit etc?

Ian Seyler

Aug 8, 2012, 2:21:18 PM
to bareme...@googlegroups.com
Yes, but it depends on the type of packet that is received (or perhaps how the callback is installed, example: promiscuous mode).

The IP stack will live in the network interrupt handler. I would not want the application dealing with re-transmits, etc. For instance the network handler would deal with ARP and PING replies directly and just pass the data on to the application.

Since we don't have IP yet the network handler will just pass all received packets to the callback handler.
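
A hedged sketch of that dispatch: the in-kernel handler answers ARP itself and, for now, hands everything else to the application's callback. The EtherType constants are the standard ones; the handler structure is illustrative, not BareMetal's actual code.

```python
# Sketch of the described dispatch: ARP is answered inside the handler,
# and (until the IP stack lands) every other frame goes straight to the
# application's callback. Structure is illustrative only.

ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800

def handle_frame(ethertype, payload, app_callback):
    if ethertype == ETHERTYPE_ARP:
        return "handled-in-stack"   # the OS replies directly
    app_callback(payload)           # no IP stack yet: pass it all through
    return "delivered-to-app"
```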

-Ian

Ben Dyer

Aug 9, 2012, 2:12:33 AM
to bareme...@googlegroups.com
Sounds great — I've just been looking at some other IP stacks (uIP and lwIP) to get a sense of the various programming models that can be used, and the approach you're describing sounds like it'll have a good balance of architectural flexibility and power/performance.

Adrien Arculeo

Aug 9, 2012, 5:36:24 AM
to bareme...@googlegroups.com
Ian,

talking about the IP stack, you know I would love to work on an HTTP server for BMOS.

I still can't find the (initial) IP stack in the sources on GitHub. Is it there yet?

Adrien

From: "Ben Dyer" <ben_...@mac.com>
To: bareme...@googlegroups.com
Sent: Thursday, 9 August, 2012 8:12:33 AM
Subject: Re: Pure64 filesystem support

Ian Seyler

Aug 9, 2012, 10:35:52 AM
to bareme...@googlegroups.com
Hi Adrien,

The coder I hired to build the stack dropped out of the project halfway through, having implemented only ARP, ICMP, and IP; TCP and UDP were not completed. I ended up removing his code from GitHub since it wasn't complete, but you can find it in an older revision on GitHub.

As for current progress, I may have more to report next week. I was approached about a month ago by someone offering to write a stack for BareMetal OS. He has prior experience with this, and I was informed yesterday that he was a few days from completion. Once I know more I will announce it.

-Ian

Adrien Arculeo

Aug 9, 2012, 10:33:51 AM
to bareme...@googlegroups.com
Hoho!

I was right to ask! I silently follow everything that is happening on BareMetal OS, but I did not expect this one.

And honestly, I did not follow all of your discussion about the file system.

I am dying to see both.

Adrien


Ben Dyer

Aug 9, 2012, 10:43:43 AM
to bareme...@googlegroups.com
Adrien,

I'm interested in the same thing — I've been planning something similar to Python's Tornado framework, but with request handlers written in C and able to be deployed/updated on the fly, without restarting or interrupting requests.

The rationale behind building such a thing on BMOS, rather than a general-purpose OS, is that I think crash-only approaches [1] are going to be the only way to ensure high availability in increasingly complex distributed systems. I've had a lot of success with applications designed around this approach running under Linux, however after 9 months in production the operating system and related services are now the most common failure mode — obscure log files consuming all disk space, and various other misconfigurations that only crop up after considerable uptime. By using something which I understand from the ground up, I can ensure that there are no edge cases or components that behave unexpectedly, and can design everything to be terminated at any time and recover gracefully.

In addition, by stripping away unnecessary dependencies and complexity, it would be possible to do automated deployment and on-line software upgrades directly from version control, without the need for multiple layers of package management and automation solutions (for an application using the Tornado framework, an automated deploy solution would involve pip/easy_install, yum/apt-get, plus puppet/chef).

Anyway, if you're interested we should compare notes once the IP stack is in place, and see if our HTTP server needs are in alignment.

Regards,
Ben


[1]: http://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c22.pdf

Ian Seyler

Aug 9, 2012, 10:52:14 AM
to bareme...@googlegroups.com
In regards to uIP I have had some success with porting version 0.9 to BareMetal OS (It was responding to pings).

I had issues compiling the more advanced apps like the HTTP server and decided that it would be best to have the stack in the OS instead of the app.

-Ian

Adrien Arculeo

Aug 9, 2012, 11:04:03 AM
to bareme...@googlegroups.com
Ben,

I follow this wonderful project for a very similar reason. If you understand what you are doing from end to end, you have a chance to succeed; you are doomed otherwise.

What I like about BareMetal OS is that, even with a low level of x64 understanding (I am more of a RISC person), I can write a program (using C) without complex limitations. For instance, I abandoned a web server project that was well on its way just because you cannot transfer an open connection from one thread to another; you just cannot give the descriptor to something else. So I couldn't do what I wanted, i.e. on an N-core machine, have one listener put requests in a local queue and N-1 workers pull from the queue, process, and answer. Come on! That is the obvious design for maximising the throughput of a server, whatever you are serving.

The only real problem I have with BareMetal OS is that, not knowing ASM, I need to use C, and C's memory management is just disastrous for software quality. I need to shake the rust off there.

It is not a surprise I "meet" you on this mailing list. Where else? Not on the "Bloat OS" mailing-list.

Ian,

you guessed it. We are two guys waiting on this stack to be ready so we can work from it.

Adrien


Adrien Arculeo

Aug 9, 2012, 11:11:59 AM
to bareme...@googlegroups.com
Fully agreed.

In 2012, networking is a core service; there should be interrupts for it. It reminds me of a version of the Z80 from the early '00s called the eZ80, where four 8-bit registers for IP addresses were added.

http://en.wikipedia.org/wiki/Zilog_Z80#Derivatives

Again, not a lot of time available but happy to put it on this.

Adrien



Ben Dyer

Aug 9, 2012, 11:28:44 AM
to bareme...@googlegroups.com
On the topic of memory management, I found JPL's C standards [1] interesting.

They strictly prohibit dynamic memory allocation except at application startup, which of course gets rid of a huge number of potential errors. The high-performance web server nginx also uses a similar approach in that it permits only a certain number of buffers of a fixed size to be used by a request or response before it writes the data to a temporary file.
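
A small sketch of the fixed-pool idea in Python: every buffer is created up front, and exhaustion is an explicit condition the caller must handle rather than a trigger for more allocation. The class and sizes are illustrative, not from the JPL standard or nginx.

```python
# Sketch of the fixed-pool discipline: all buffers exist from startup,
# and running out is an explicit condition the caller handles instead
# of a reason to allocate more. Class and sizes are illustrative.

class BufferPool:
    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        """Return a buffer, or None when the pool is exhausted."""
        return self._free.pop() if self._free else None

    def release(self, buf):
        self._free.append(buf)

pool = BufferPool(count=2, size=4096)
a = pool.acquire()
b = pool.acquire()
assert pool.acquire() is None  # exhausted: no hidden dynamic allocation
pool.release(a)
```

In the nginx-style variant, the `None` branch is where a request would spill its data to a temporary file instead of failing outright.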

Another interesting restriction was "only one unbounded loop per process", i.e. you are permitted to use a main event loop, but all others must have a statically-verifiable upper bound. This prevents you from getting stuck in an infinite loop, similar to the crash-only principle of having timeouts for everything. Very easy to enforce when you have complete transparency in the OS, very difficult otherwise.

[1]: http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf
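The startup-only allocation rule is easy to demonstrate. Below is a minimal sketch in the spirit of those rules (all names and sizes are illustrative, not taken from the JPL standard or from BareMetal OS): a fixed pool of connection buffers reserved at compile time, with pool exhaustion as an ordinary return value, and only statically bounded loops.

```c
#include <stddef.h>

/* Fixed pool reserved at compile time -- no malloc/free afterwards.
   Sizes are design-time upper bounds (illustrative values). */
#define MAX_CONNS 64
#define BUF_SIZE  4096

static char conn_buf[MAX_CONNS][BUF_SIZE];
static int  conn_used[MAX_CONNS];

/* Returns a buffer index, or -1 when the pool is exhausted, so
   "out of memory" becomes an explicit, testable error path. */
static int conn_acquire(void)
{
    for (int i = 0; i < MAX_CONNS; i++) {  /* statically bounded loop */
        if (!conn_used[i]) {
            conn_used[i] = 1;
            return i;
        }
    }
    return -1;
}

static void conn_release(int i)
{
    conn_used[i] = 0;
}
```

Because the pool size is a compile-time constant, the worst-case memory footprint of the whole process is known before it ever runs.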


On 10/08/2012, at 01:04 , Adrien Arculeo <aarc...@1024degres.com> wrote:

> Ben,
>
> I follow this wonderful project for a very similar reason. If you understand what you are doing from end to end, you have a chance to succeed, doomed otherwise.
>
> What I like about BareMetal OS is that, even with a limited understanding of x64 (I am more of a RISC person), I can write a program (using C) without complex limitations. For instance, I abandoned a web server project that was well on its way just because you cannot transfer an open connection from one thread to another. You just cannot hand the descriptor to something else. So I couldn't do what I wanted, i.e. on an N-core machine, have one listener put requests in a local queue and N-1 workers extract from the queue, process, and answer. Come on! That is the obvious design for maximising the throughput of a server, whatever you are serving.
>
> The only real problem I have with BareMetal OS is that, not knowing ASM, I need to use C, and C's memory management is just disastrous for software quality. I need to knock the rust off there.
>
> It is no surprise I "meet" you on this mailing list. Where else? Not on the "Bloat OS" mailing list.
>
> Ian,
>
> you guessed it. We are two guys waiting on this stack to be ready to work from it.
>
> Adrien

Adrien Arculeo

unread,
Aug 9, 2012, 2:34:08 PM
to bareme...@googlegroups.com
OK, but this only works in a preemptive OS or in an environment where the compiler is patched. BareMetal OS is not the former, and I think it should not have the latter.

But taking your point: you suggest allocating memory in one block at startup and using it until it is freed automatically at the end.

That is, remove malloc, realloc and free and do this instead:

int main(void)
{
    /* static, not automatic: a 1 GiB array on the stack would
       overflow any realistic stack size */
    static char memory[1024 * 1024 * 1024];
    char *memory_pointer = memory;
    /* use 1 GiB of memory as you like */
    ...
}

Is that what you suggest? Why not: I like the idea of not dealing with malloc, but it seems that malloc and free, as bad as they can be, were invented for cases where this is not available. A virtue I can see is that you know beforehand whether you are going to fail for a memory reason.

You could set this as a good practice, but how could you make it a development rule under BareMetal OS?

Adrien

Ben Dyer

unread,
Aug 9, 2012, 11:01:31 PM
to bareme...@googlegroups.com
On 10/08/2012, at 04:34 , Adrien Arculeo <aarc...@1024degres.com> wrote:

> Is that what you suggest? Why not: I like the idea of not dealing with malloc, but it seems that malloc and free, as bad as they can be, were invented for cases where this is not available. A virtue I can see is that you know beforehand whether you are going to fail for a memory reason.

Not exactly — you'd use a bunch of statically-allocated buffers for various things, rather than just one huge one. And this would be on a per-request basis, so it'd mean you'd know exactly how many concurrent requests you could deal with. It'd also mitigate the impact of a whole class of DoS attacks since you're forced to set sane upper limits on the sizes of requests, headers, and bodies.

As far as security is concerned, fixed-size buffers are only a problem when using the C string library, which should be avoided anyway because it's too difficult to use safely.

And yes, it's a bit more difficult to write code like that but I suspect if you look at it in terms of [time taken to write code without malloc] vs [time taken to write code with malloc + time taken to debug memory leaks + time taken to deal with memory being accessed after being freed + time taken to deal with out-of-memory conditions + time taken to consider the impact of heap fragmentation] it's probably quicker to spend more time writing the initial code, and much less time debugging.

> You could set this as a good practice but how could you set this as a development rule under BareMetal OS?


I don't think avoidance of dynamic memory allocation makes sense as a general development rule, but for my specific interest — high-concurrency, fault-tolerant network services — it seems to be a common approach and I think would be worth enforcing within individual request handlers. I'd also like to use memory protection on a per-request basis to ensure they're completely isolated from each other.
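The fixed-buffer, explicit-length approach described above can be sketched as follows (the struct, names, and limits are hypothetical, chosen purely for illustration; nothing here is a BareMetal OS API). Rejecting input past a design-time limit is what caps per-request memory, and carrying explicit lengths sidesteps the C string functions entirely:

```c
#include <string.h>

#define MAX_HEADER 1024  /* illustrative design-time limit */

struct request {
    char   header[MAX_HEADER];
    size_t header_len;
};

/* Append bytes to the fixed header buffer; fail fast (rather than
   grow) when the limit is hit. This bounds memory per request and
   caps the impact of oversized-request DoS attempts. */
static int header_append(struct request *r, const char *data, size_t n)
{
    if (n > MAX_HEADER - r->header_len)
        return -1;                       /* over the limit */
    memcpy(r->header + r->header_len, data, n);
    r->header_len += n;
    return 0;
}
```

With one `struct request` per connection slot, total memory per request is fixed, so the maximum number of concurrent requests is known in advance.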




Adrien Arculeo

unread,
Aug 10, 2012, 11:36:54 AM
to bareme...@googlegroups.com
I think I understand better now.

What you suggest is more like: "think beforehand about the memory each variable needs, allocate it statically, and work within that constraint as if you were programming for an embedded system where the memory limit is very hard."

I started with embedded systems, so I feel comfortable with that, and yes, in a server you need to know facts like "n concurrent users × RAM per user < max memory of the server".
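That capacity constraint can even be checked at compile time. A small sketch with illustrative numbers (these are not actual BareMetal OS figures):

```c
/* Design-time capacity budget: with a fixed slab per connection,
   the concurrency limit falls out as a compile-time constant.
   All figures are illustrative. */
#define RAM_TOTAL    (512UL * 1024 * 1024)   /* 512 MiB for the service */
#define RAM_PER_USER (256UL * 1024)          /* 256 KiB per connection  */
#define MAX_USERS    (RAM_TOTAL / RAM_PER_USER)

/* Build fails if the per-user budget cannot support at least
   1024 concurrent users (C11 static assertion). */
_Static_assert(MAX_USERS >= 1024, "per-user RAM budget too large");
```

Here 512 MiB / 256 KiB gives 2048 concurrent users, known before the server ever accepts a connection.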

I am going to read the crash-only software document this weekend.

Adrien

Ian Seyler

unread,
Aug 12, 2012, 2:39:29 PM
to bareme...@googlegroups.com
Thanks for sharing this document! My thoughts were similar in regards to memory allocation at application startup. This also leads to a much more simplified memory manager at the OS level.

-Ian

Daniel Cegiełka

unread,
Nov 2, 2012, 6:31:45 AM
to bareme...@googlegroups.com
Hi,
Is there any news on the implementation of the IP stack in BMOS?

Best regards,
Daniel

Ian Seyler

unread,
Nov 6, 2012, 2:08:05 PM
to bareme...@googlegroups.com
The developer that we had lined up did not complete the project. Unfortunately the IP stack is stuck in limbo again.

-Ian