Write zeros explicitly to the end of the file


Artyom Ivanov

Jul 25, 2025, 8:58:19 AM
to firebird-devel
Hi everyone,
I'm currently investigating why we write zeros to the end of the file (`PIO_init_data()`), and I don't quite understand why we're doing this.
The file extension is done via `fallocate()`, and this call ensures that the new space will be initialized with zeros (source1: https://man7.org/linux/man-pages/man2/fallocate.2.html, source2: https://www.linuxquestions.org/questions/linux-newbie-8/fallocate-does-it-fill-the-space-with-zeros-4175578213/). In other words, the operating system guarantees that when this space is read through the file system, we will see zeros (primarily for security reasons and because of how SSDs work), even though physically anything could be at that location on the disk (if you read it bypassing the file system, for example through `dd` on the raw device, there will most likely not be zeros there). If you really need to zero out the disk at the moment of allocating new space, there is the `FALLOC_FL_ZERO_RANGE` flag, but not all file systems support it, and we don't really need it; we just want to make sure there is no garbage when reading.
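As a minimal sketch of what is meant here (my own illustration, not Firebird code; file name and sizes are arbitrary), extending a file with plain `fallocate()` and checking that the new range reads back as zeros looks roughly like this:

```c
/* Sketch: extend a file with fallocate() and verify the new range
 * reads back as zeros through the file system.  Error handling is
 * reduced to the bare minimum. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const off_t old_size = 1 << 20;   /* 1 MB already "used"      */
    const off_t grow_by  = 1 << 20;   /* extend by another 1 MB   */

    int fd = open("test.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* mode == 0: allocate blocks and extend the file size; the new
     * range is guaranteed to read back as zeros via the FS. */
    if (fallocate(fd, 0, old_size, grow_by) != 0) {
        perror("fallocate");          /* e.g. EOPNOTSUPP on some FSes */
        return 1;
    }

    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), old_size);
    for (ssize_t i = 0; i < n; i++)
        if (buf[i] != 0) { puts("non-zero byte found!"); return 1; }

    puts("new range reads back as zeros");
    close(fd);
    return 0;
}
```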
There was also an idea that explicit zeroing could be done to avoid creating a sparse file, but a sparse file can occur when there are holes (pages filled with zeros) in the middle of the file, which is not our case.
Therefore, everything is moving towards getting rid of explicit zeroing. This gives restore a good boost, since with the zeroing writes we write roughly twice as many bytes. It is worth mentioning, though, that because FW is disabled during restore, the restore actually writes less than twice as much data to the disk. Here I am talking about the number of bytes written to disk, not about `pwrite` system calls: zeroing writes a large number of pages in a single system call, so simply counting the difference in `pwrite` calls does not give the whole picture.

Vlad Khorsun

Jul 27, 2025, 12:14:38 PM
to firebir...@googlegroups.com
25.07.2025 15:58, Artyom Ivanov:
> Hi everyone,
> I'm currently investigating why we write zeros to the end of the file (`PIO_init_data()`), and I don't quite understand why we're
> doing this.

If you want to investigate, check the related issues:

Improve performance of database file growth after CORE-1228 was fixed [CORE1469] #1886
Use fast file grow on those Linux systems which supports it [CORE4443] #4763

> The file extension is done via `fallocate()`, and this call ensures that the new space will be initialized with zeros (source1
> <https://man7.org/linux/man-pages/man2/fallocate.2.html>, source2
> <https://www.linuxquestions.org/questions/linux-newbie-8/fallocate-does-it-fill-the-space-with-zeros-4175578213/>). In other words, the operating system guarantees that when reading this
> space through the file system, we will see zeros (primarily for security reasons and because of how SSDs work), even though
> physically, anything could be in that location on the disk (if you try to read it by bypassing the file system, for example, through
> `dd`, then most likely, there will not be zeros there).

It is interesting to know how exactly the OS (or FS?) implements this guarantee.

For example, Windows maintains a "valid data" marker for every file (file stream),
and any attempt to read past this marker returns zeros. So far, so good. But any write
past the "valid data" marker will force the OS to fill the gap between the marker and the
write position with zeros. This is why we prefer to do it ourselves - in a predictable way
and using relatively big IO blocks for efficiency. At the same time, the size of the "init"
block is much smaller than the size of the file extension.
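To make the "relatively big IO blocks" point concrete, here is a hedged C sketch of the kind of zero-fill loop being described; the buffer size and function name are my own choices for illustration, not the actual `PIO_init_data()` code:

```c
/* Sketch of explicit zero-fill in large IO blocks.  ZERO_BUF_SIZE is
 * an arbitrary value for illustration, not the one used by Firebird. */
#include <unistd.h>

#define ZERO_BUF_SIZE (1024 * 1024)     /* write zeros 1 MB at a time */

static int zero_fill(int fd, off_t start, off_t length)
{
    static char zeros[ZERO_BUF_SIZE];   /* zero-initialized by the C runtime */
    off_t done = 0;

    while (done < length) {
        size_t chunk = (size_t)((length - done) < ZERO_BUF_SIZE
                                ? (length - done) : ZERO_BUF_SIZE);
        ssize_t written = pwrite(fd, zeros, chunk, start + done);
        if (written < 0)
            return -1;                  /* caller inspects errno */
        done += written;
    }
    return 0;
}
```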

> If you really need to zero out the disk at the moment of allocating new
> space, there is the `FALLOC_FL_ZERO_RANGE` flag, but not all file systems support it, and we don't really need it, we just want to
> make sure there is no garbage when reading.

Perhaps there were no advanced options such as FALLOC_FL_ZERO_RANGE when CORE4443 was
implemented; I don't remember such details.

> There was also an idea that explicit zeroing could be done to avoid creating a sparse file, but a sparse file can occur when there
> are holes (pages filled with zeros) in the middle of the file, which is not our case.
> Therefore, everything is moving towards getting rid of explicit zeroing. This gives the restore a good boost, since we're writing
> twice as many bytes due to the zeroing writes, but here it is worth mentioning that, due to FW being disabled, the restore writes
> less than twice as much data to the disk (here I am talking about the number of bytes written to disk, not `pwrite` system calls,
> since zeroing is done by writing a large number of pages in a single system call, i.e., simply counting the difference in `pwrite`
> calls does not give the whole picture).

Do you have real numbers, or is it just an estimation? I remember that the performance penalty
caused by filling the file with zeros was about 15-20%. Of course, that was on Windows/HDD at the time.

Regards,
Vlad

Artyom Ivanov

Jul 29, 2025, 8:00:38 AM
to firebird-devel
> It is interesting to know how exactly the OS (or FS?) implements this guarantee.

The FS is responsible for the implementation, while the OS only provides an interface (in this case, the `fallocate` function) for interacting with the FS. And as I understand it, the FS must implement this interface as described, i.e., with a zeroing mechanism (otherwise it makes no sense), or return `EOPNOTSUPP`.
Out of interest, I decided to see how this is implemented in ext4 (considered one of the simplest file systems, which is why I chose it). I had no previous experience analyzing Linux code, so there may be errors in my conclusions:
When `fallocate` is called, new blocks are allocated that form an extent (a set of sequential blocks). This extent is marked as `unwritten`, but physically the extent may still contain data from a previous file. When this extent is read, a code is returned internally that is interpreted as an `unwritten` extent, and zeros are returned instead of the on-disk contents. When a block is written into this extent, it is split into three extents, one `written` and two `unwritten` (the granularity of an extent is one block, usually 4KB, so the `written` extent contains only that single block). I have omitted many specific cases where the behavior differs, but this is the main idea.
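As a hedged illustration of the `unwritten` state, the extent flags can be inspected from user space with the FIEMAP ioctl (my own example, nothing Firebird does): on ext4, a range allocated by `fallocate()` but never written shows up with the `FIEMAP_EXTENT_UNWRITTEN` flag.

```c
/* Sketch: dump extent flags of a file with the FIEMAP ioctl, to see
 * which extents ext4 reports as "unwritten" after fallocate(). */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    const unsigned count = 64;           /* enough extents for a demo */
    struct fiemap *fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
    fm->fm_start        = 0;
    fm->fm_length       = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm->fm_extent_count = count;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) { perror("FS_IOC_FIEMAP"); return 1; }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
        const struct fiemap_extent *fe = &fm->fm_extents[i];
        printf("logical %llu len %llu %s\n",
               (unsigned long long)fe->fe_logical,
               (unsigned long long)fe->fe_length,
               (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? "UNWRITTEN" : "written");
    }
    free(fm);
    close(fd);
    return 0;
}
```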


> > If you really need to zero out the disk at the moment of allocating new space, there is the FALLOC_FL_ZERO_RANGE flag, but not all file systems support it, and we don't really need it, we just want to make sure there is no garbage when reading.

> Perhaps there were no advanced options such as FALLOC_FL_ZERO_RANGE when CORE4443 was implemented; I don't remember such details.

I misinterpreted the meaning of the `FALLOC_FL_ZERO_RANGE` flag: it does not write anything to the device. Essentially, this flag allows you to logically zero out the selected range. In the ext4 implementation, the affected extents simply become `unwritten` again, meaning that the old information remains on the device.
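For completeness, a small hedged sketch of how one might request logical zeroing of a range and fall back when the file system does not support the flag (again, just an illustration, not a proposal for Firebird's code):

```c
/* Sketch: try FALLOC_FL_ZERO_RANGE, which only marks the range as
 * logically zeroed (on ext4 the extents become "unwritten" again);
 * no data is written to the device. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>     /* FALLOC_FL_ZERO_RANGE */

static int zero_range(int fd, off_t offset, off_t length)
{
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) == 0)
        return 0;             /* range now reads back as zeros */
    if (errno == EOPNOTSUPP)
        return 1;             /* FS does not support it; caller decides */
    return -1;                /* real IO error */
}
```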


> Do you have real numbers, or is it just an estimation? I remember that the performance penalty caused by filling the file with zeros was about 15-20%. Of course, that was on Windows/HDD at the time.

I ran two `gbak` restore tests, each with a database containing a single table:
1. `CREATE TABLE TAB1(V1 VARCHAR(3), V2 VARCHAR(9), V3 INT, V4 INT, V5 INT, V6 INT);` - 14-15% boost.
2. `CREATE TABLE TAB1(V1 VARCHAR(25), V2 VARCHAR(25), V3 VARCHAR(25), V4 VARCHAR(25));` - 19-20% boost.
Both tables were ~15GB; I ran `gbak` with the `-par 12` flag, on Linux + NVMe SSD.
The only thing I did was make `PIO_init_data()` a no-op to see this performance boost. This was only for the performance test: making `PIO_init_data()` a no-op introduces some issues in edge cases (not enough space, for example), but these can be fixed by changing the underlying logic.