The good news is that despite all those crashes, Minix3 has not yet
managed to corrupt my database. I'll take a kernel panic over a silent DB
corruption any day, because a silent corruption is the worst nightmare,
one that no ordinary human can imagine.
This means that Minix3 properly honors disk write order with respect to
write(2) and fsync(2) requests. In the past couple of hours I
have seen PostgreSQL go through several minutes of crash recovery,
replaying megabytes of lost updates from the WAL. Every single time the
DB has come up in a consistent state.
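
To make "write order with respect to write(2) and fsync(2)" a bit more concrete, here is a minimal Python sketch of the pattern a database relies on. The function names and the two-file layout are made up for illustration; this is not PostgreSQL code. It only shows that the second write is not even issued until fsync() has returned for the first.

```python
import os

def durable_append(path: str, payload: bytes) -> None:
    """Append payload to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:                       # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # ask the OS to push the data to stable storage
    finally:
        os.close(fd)

def ordered_update(wal_path: str, data_path: str,
                   wal_record: bytes, new_block: bytes) -> None:
    """Hypothetical helper: log record first, data block second."""
    durable_append(wal_path, wal_record)  # step 1: the log record must be durable
    durable_append(data_path, new_block)  # step 2: only then touch the data file
```

If the OS or the disk acknowledges the fsync() in step 1 without the data actually being on stable storage, this ordering is worthless, which is exactly what the crashes are testing.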
For those who are interested (I know some here are), here is a brief
overview of how PostgreSQL does IO and recovers from a system crash:
Table and index blocks (8K each) are managed in a shared memory area,
the "buffer cache". All changes to blocks in the cache are also recorded
in a sort of diff format in the Write Ahead Log (WAL). The WAL is
managed as a ring buffer in another shared memory area.
There are two background processes associated with these shared memory
areas: the WAL Writer, which constantly writes the WAL data out to disk,
and the Background Writer, which constantly writes dirty blocks out to disk.
The catch here is that the Background Writer must NEVER get ahead of
the WAL Writer. That would spell disaster in the headline and a corrupted
database in the fine print. You will soon see why.
When a transaction modifies data in a block, it first adds the "diff"
information to the WAL (shared memory). It then changes the block data
(also in shared memory) and it records the WAL position of its "diff" in
the block header. The WAL Writer picks up the WAL data from its
shared memory (ring buffer) and at certain points it fsync()s the
file. After that it sets a variable (in yet another shared memory
area) to that new position. The Background Writer compares the WAL
position of any dirty block to that variable, and will not write the
block to disk if its position is beyond it, meaning its WAL record has
not been fsync()'d yet. This way no data block on disk can ever be
ahead of the WAL. PostgreSQL keeps changes to data files in memory until
their WAL records have not only been written, but fsync()'d to disk.
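
Here is a toy model of that rule in Python, in case it helps. It follows the simplified description above, not PostgreSQL's actual source, and every name in it (Block, Engine, flushed_lsn and so on) is invented for the sketch. WAL positions are just list indexes and "disk" is a dict.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    data: str = ""
    lsn: int = 0          # WAL position of the last "diff" applied to this block
    dirty: bool = False

@dataclass
class Engine:
    wal: list = field(default_factory=list)    # in-memory WAL; position = index + 1
    flushed_lsn: int = 0                       # highest WAL position known fsync()'d
    cache: dict = field(default_factory=dict)  # block id -> Block ("buffer cache")
    disk: dict = field(default_factory=dict)   # block id -> data in the data file

    def modify(self, block_id, new_data):
        self.wal.append((block_id, new_data))  # 1. add the diff to the WAL
        blk = self.cache.setdefault(block_id, Block())
        blk.data = new_data                    # 2. change the block in memory
        blk.lsn = len(self.wal)                # 3. stamp the WAL position in the header
        blk.dirty = True

    def wal_writer_pass(self):
        # Stand-in for write() + fsync() of the WAL file, then publishing
        # the new position for the Background Writer to read.
        self.flushed_lsn = len(self.wal)

    def background_writer_pass(self):
        written = []
        for block_id, blk in self.cache.items():
            # Only blocks whose WAL record is already durable may go to disk.
            if blk.dirty and blk.lsn <= self.flushed_lsn:
                self.disk[block_id] = blk.data
                blk.dirty = False
                written.append(block_id)
        return written
```

Calling modify() and then background_writer_pass() without a wal_writer_pass() in between writes nothing, which is the whole point: the dirty block stays in memory until its WAL position has been flushed.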
With that guarantee in place, only the WAL needs to make it to the disk
to be able to recover from a crash. Oh ... of course ... there is one
other thing ... checkpoints. A checkpoint is basically a WAL position.
You understand by now that the WAL is nothing more than a time line. A
checkpoint merely says "every change logged before this position is
already in the data files on disk". That checkpoint position is recorded
in the control file. WAL older than that can be discarded.
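
A checkpoint in the same toy terms might look like the sketch below. This is a simplification under stated assumptions: the function and its parameters are hypothetical, WAL records are (position, block id, data) tuples, plain dicts stand in for the buffer cache, the data files and the control file, and all WAL passed in is assumed to be fsync()'d already.

```python
def checkpoint(position, wal, dirty, disk, control):
    """Toy checkpoint.

    position: WAL position at the moment the checkpoint started
    wal:      list of (lsn, block_id, data), oldest first, assumed fsync()'d
    dirty:    block_id -> data still waiting in the buffer cache
    disk:     block_id -> data in the data files
    control:  dict standing in for the control file
    Returns the WAL records that still have to be kept.
    """
    disk.update(dirty)                    # 1. flush every dirty block
    dirty.clear()
    control["checkpoint_lsn"] = position  # 2. record the position
    # 3. everything at or before the position is now redundant
    return [rec for rec in wal if rec[0] > position]
```

Records logged while the checkpoint was running (positions greater than the one recorded) are kept, because the next crash recovery would still need them.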
So, when PostgreSQL recovers from a crash, it reads its own WAL from the
last checkpoint position and applies all the "diffs" that, according to
the block headers, didn't get written by the Background Writer.
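
Crash recovery as just described can be sketched in a few lines of Python. Again this is an illustration of the idea only: recover() and its argument shapes are made up, and a "diff" here simply replaces the whole block content.

```python
def recover(disk_blocks, wal_records, checkpoint_lsn):
    """Replay WAL after a crash.

    disk_blocks:    block_id -> (lsn_in_header, data) as found on disk
    wal_records:    list of (lsn, block_id, data), oldest first
    checkpoint_lsn: position read from the control file
    Returns (recovered blocks, list of WAL positions that were replayed).
    """
    blocks = dict(disk_blocks)
    replayed = []
    for lsn, block_id, data in wal_records:
        if lsn <= checkpoint_lsn:
            continue                      # covered by the checkpoint
        on_disk_lsn = blocks.get(block_id, (0, None))[0]
        if on_disk_lsn >= lsn:
            continue                      # header says this change made it to disk
        blocks[block_id] = (lsn, data)    # apply the "diff"
        replayed.append(lsn)
    return blocks, replayed
```

Note that the loop only works if every data block on disk has its WAL record on disk too. A block with a header position beyond the end of the surviving WAL is the corruption case.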
However, all of this works only if the OS and disk (firmware) never
lie about write ordering. Consumer grade disks are notorious for lying
when it comes to write caching. In the above scenario you can't have a
disk saying "I wrote that WAL record", then you write a data block, and
after a crash it has the new data block but says "you never gave me that
WAL record". Write caching on the disk controller level can lead to that.
Are you dizzy now? I was when I first tried to understand all this. It
certainly isn't simple, but PostgreSQL does get all this right with
hundreds of concurrent client connections and thousands of disk blocks
cached in shared memory.
Minix3 gets this right too. But I would like it to do so without a
kernel panic ....