
file born 30 seconds after its creation on ext4 - bug?


Vincent Lefevre

Apr 26, 2022, 9:50:05 AM
On an ext4 filesystem, I got a file born 30 seconds after its
actual creation. Is this a bug?

I know that such issues can be observed with NFS, but here this
is just a local ext4 filesystem.

Here are the details.

I started a shell script:

cventin:~> ps -p 667828 -o lstart,cmd
STARTED CMD
Tue Apr 26 14:43:15 2022 /bin/sh /home/vlefevre/wd/mpfr/tests/mpfrtests.sh

This script creates a file mpfrtests.cventin.lip.ens-lyon.fr.out
very early. But the first attempts to look at this file failed:

cventin:~/software/mpfr> tail -n 30 mpfrtests.*.out; ll mpfrtests.*.out
zsh: no match
zsh: no match
cventin:~/software/mpfr[1]> tail -n 30 mpfrtests.*.out; ll mpfrtests.*.out
zsh: no match
zsh: no match
cventin:~/software/mpfr[1]> lt|head <14:43:42
total 7016
-rw-r--r-- 1 188644 2022-04-26 14:43:42 config.log
-rw-r--r-- 1 2861 2022-04-26 14:43:42 conftest.c
-rw-r--r-- 1 0 2022-04-26 14:43:42 conftest.err
-rw-r--r-- 1 1907 2022-04-26 14:43:42 confdefs.h
-rwxr-xr-x 1 632161 2022-04-26 14:43:16 configure.lineno*
drwxr-xr-x 2 4096 2022-04-26 14:43:11 doc/
drwxr-xr-x 3 4096 2022-04-26 14:43:11 tune/
-rwxr-xr-x 1 23568 2022-04-26 14:43:11 depcomp*
drwxr-xr-x 5 36864 2022-04-26 14:43:11 tests/
cventin:~/software/mpfr> lt|head <14:43:47
total 6416
-rw-r--r-- 1 19436 2022-04-26 14:43:47 config.log
-rw-r--r-- 1 561 2022-04-26 14:43:47 conftest.c
-rw-r--r-- 1 0 2022-04-26 14:43:47 conftest.err
-rw-r--r-- 1 4138 2022-04-26 14:43:47 mpfrtests.cfgout
-rw-r--r-- 1 500 2022-04-26 14:43:47 confdefs.h
-rwxr-xr-x 1 632161 2022-04-26 14:43:45 configure.lineno*
-rw-r--r-- 1 878 2022-04-26 14:43:45 mpfrtests.cventin.lip.ens-lyon.fr.out
drwxr-xr-x 3 4096 2022-04-26 14:43:44 tune/
drwxr-xr-x 4 36864 2022-04-26 14:43:44 tests/

According to /usr/bin/stat, the file birth is

Birth: 2022-04-26 14:43:45.537241731 +0200

thus 30 seconds after the script started!

Note that the configure.lineno file is created *after*
mpfrtests.cventin.lip.ens-lyon.fr.out, and one can see that
at 14:43:16, configure.lineno was already created.

This is a 12-core Debian/unstable machine with

Linux cventin 5.17.0-1-amd64 #1 SMP PREEMPT Debian 5.17.3-1 (2022-04-18) x86_64 GNU/Linux

--
Vincent Lefèvre <vin...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thomas Schmitt

Apr 26, 2022, 1:10:05 PM
Hi,

Vincent Lefevre wrote:
> This script creates a file mpfrtests.cventin.lip.ens-lyon.fr.out
> very early. But the first attempts to look at this file failed:

It looks as if the file indeed does not exist when you inquire about it.
So if it got created by the script, maybe it was deleted or renamed
shortly afterwards and created again 30 seconds later?

Consider inquiring, in the script, about the file's birth time and inode
number immediately after it is created. Compare these properties with the
properties of the file which you see 30 seconds later.
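Concretely, the suggestion amounts to a couple of lines added right after the file is created. This is only a sketch: the paths are placeholders, and stat's %w prints "-" on filesystems that do not record birth times.

```shell
# Snapshot inode number and birth time immediately after creating the file,
# so the values can be compared with what stat(1) reports 30 seconds later.
out=$(mktemp)                       # stand-in for the script's "$out"
echo "first line" > "$out"
stat --format 'inode=%i birth=%w' "$out" > /tmp/stat_out_after_creation
cat /tmp/stat_out_after_creation
```

If the inode number seen later differs from the one recorded here, the name was re-created rather than kept.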


Have a nice day :)

Thomas

Nicholas Geovanis

Apr 26, 2022, 1:40:05 PM
On Tue, Apr 26, 2022 at 8:45 AM Vincent Lefevre <vin...@vinc17.net> wrote:
On an ext4 filesystem, I got a file born 30 seconds after its
actual creation. Is this a bug?

Only experimentation can really back me up on this, but consider the following:

Every time you use the "|" operator or the ";" separator on a command line,
new processes are spawned, which then wait to be dispatched on a core.
But you are not serializing the dispatch of those processes, and especially with
16 fast cores, you can't predict their order of execution.

Nicholas Geovanis

Apr 26, 2022, 1:50:05 PM
On Tue, Apr 26, 2022 at 12:37 PM Nicholas Geovanis <nickge...@gmail.com> wrote:
On Tue, Apr 26, 2022 at 8:45 AM Vincent Lefevre <vin...@vinc17.net> wrote:
On an ext4 filesystem, I got a file born 30 seconds after its
actual creation. Is this a bug?

Only experimentation can really back me up on this, but consider the following:

Every time you use the "|" operator or the ";" separator on a command-line,
new processes are being spawned. Which wait to be dispatched on a core.
But you are not serializing the dispatch of those processes, and especially with
16 fast cores, you can't predict their order of execution.

A couple more observations:
(1) It looks like you're trying to observe behavior in the very same filesystem
from which the running executable is loaded and to which its log files are
being written. That's a lot of variables in motion at once.

(2) Yes, the "|" is in a sense serializing I/O in "lt|head". But the filesystem
is syncing buffered and disk-based content separately from that.

Stefan Monnier

Apr 26, 2022, 2:20:05 PM
> On an ext4 filesystem, I got a file born 30 seconds after its
> actual creation. Is this a bug?

I doubt it.
Note that a file's atime/mtime/ctime is a property of the file itself,
whereas "appearing" is defined by when the name you're checking
becomes a link to it.

IOW, most likely the file was first created under a different name and
then renamed. This is very standard practice to make sure the final
file name never refers to an incomplete file.
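The pattern Stefan describes can be demonstrated in a few lines (file names here are invented for the example): the data carries the timestamps from when it was written under the temporary name, not from when the final name appeared.

```shell
# Write under a temporary name, then atomically rename to the final name.
cd "$(mktemp -d)"                    # scratch directory
tmp=$(mktemp ./out.XXXXXX)
echo "complete contents" > "$tmp"    # data written now, under the temp name
sleep 2                              # simulate a long-running producer
mv "$tmp" final.out                  # rename(2): the final name appears only now
# The mtime predates the moment final.out became visible by ~2 seconds:
stat --format '%y' final.out
```

Readers polling for `final.out` would report "no such file" during the sleep, then find a file whose timestamps are older than its apparent appearance.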


Stefan

Vincent Lefevre

Apr 26, 2022, 10:40:05 PM
On 2022-04-26 19:01:26 +0200, Thomas Schmitt wrote:
> It looks as if the file indeed does not exist when you inquire it.
> So if it got created by the script, maybe it was deleted or renamed
> shortly afterwards and created again 30 seconds later ?

The script doesn't do that. The file is created with

echo "* $fqdn ($(${1:-.}/config.guess) / ${line#PROC:})" > "$out"

and every other line that writes to the file uses >> "$out".

Note that if the file were created again, the first line with the FQDN
would no longer be in the file, but the line is still there.
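The argument can be checked directly with a small sketch (the paths and contents are invented): ">" truncates on creation, ">>" appends, so if the file had been re-created with ">", only the most recent header would survive.

```shell
# Create with ">" once, then append with ">>", as the script does.
out=$(mktemp)
echo "* header line" > "$out"    # creation, truncating write
echo "more output" >> "$out"     # later writes append
head -n 1 "$out"                 # the original first line is still there
```

Since the FQDN line is still the first line of the observed file, the observed inode was never truncated by a second ">" creation.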

Vincent Lefevre

Apr 26, 2022, 10:50:05 PM
On 2022-04-26 12:47:53 -0500, Nicholas Geovanis wrote:
> On Tue, Apr 26, 2022 at 12:37 PM Nicholas Geovanis <nickge...@gmail.com>
> wrote:
>
> > On Tue, Apr 26, 2022 at 8:45 AM Vincent Lefevre <vin...@vinc17.net>
> > wrote:
> >
> >> On an ext4 filesystem, I got a file born 30 seconds after its
> >> actual creation. Is this a bug?
> >>
> >
> > Only experimentation can really back me up on this, but consider the
> > following:
> >
> > Every time you use the "|" operator or the ";" separator on a command-line,
> > new processes are being spawned. Which wait to be dispatched on a core.
> > But you are not serializing the dispatch of those processes, and
> > especially with
> > 16 fast cores, you can't predict their order of execution.
> >
>
> A couple more observations:
> (1) It looks like you're trying to observe behavior in the very same
> filesystem in which
> the running executable is loaded from and its log files are being
> written-to. That's
> alot of variables in motion at once.
>
> (2) Yes, the "|" is in a sense serializing I/O in "lt|head". But the
> filesystem is syncing
> buffered and disk-based content separately from that.

There are no parallel writes to the file, i.e. everything is
serialized from this point of view.

Vincent Lefevre

Apr 26, 2022, 11:10:06 PM
You mean renamed by the Linux kernel??? (The script doesn't rename it.)

FYI, the script is the following one:

https://gitlab.inria.fr/mpfr/misc/-/blob/fed7770cf5f712871bd116ef80d93ea5885fc3f7/vl-tests/mpfrtests.sh

and the file in question is what appears as "$out".

And what I did was running from a MPFR working tree

/path/to/mpfrtests.sh < /path/to/mpfrtests.data

where the mpfrtests.data file is the following one:

https://gitlab.inria.fr/mpfr/misc/-/blob/e0204b3423b9bc25c31548d2acc5b8e19a73f48d/vl-tests/mpfrtests.data

Note that I had never had any issue until now. This was the first
time I got errors saying that the file was missing.

Thomas Schmitt

Apr 27, 2022, 3:10:06 AM
Hi,

i wrote:
> > So if it got created by the script, maybe it was deleted or renamed
> > shortly afterwards and created again 30 seconds later ?

Vincent Lefevre wrote:
> The script doesn't do that. The file is created with
> echo "* $fqdn ($(${1:-.}/config.guess) / ${line#PROC:})" > "$out"
> [...]
> Note that if the file were created again, the first line with the FQDN
> would no longer be in the file, but the line is still there.

This is indeed a riddle. But is there a hard reason not to insert

stat "$out" >/tmp/stat_out_after_creation 2>&1

after that line and to compare /tmp/stat_out_after_creation with the
stat(1) result when the expected file appears 30 seconds later?


> https://gitlab.inria.fr/mpfr/misc/-/blob/fed7770cf5f712871bd116ef80d93ea5885fc3f7/vl-tests/mpfrtests.sh
> # Written in 2011-2021 by Vincent Lefevre <vin...@vinc17.net>.

So you are supposed to really know what it does. :))

Whatever, it is hardly plausible that the file would not show up in stat(1)
or ls(1) while the script can happily do
... >> "$out"
Each of these lines is supposed to do a name lookup of the file.

Maybe another stat(1) should be placed between
printf "OK, output in %s\n" "$out"
and
exit 0

Vincent Lefevre

Apr 27, 2022, 4:30:05 AM
Hi,

On 2022-04-27 09:07:37 +0200, Thomas Schmitt wrote:
> i wrote:
> > > So if it got created by the script, maybe it was deleted or renamed
> > > shortly afterwards and created again 30 seconds later ?
>
> Vincent Lefevre wrote:
> > The script doesn't do that. The file is created with
> > echo "* $fqdn ($(${1:-.}/config.guess) / ${line#PROC:})" > "$out"
> > [...]
> > Note that if the file were created again, the first line with the FQDN
> > would no longer be in the file, but the line is still there.
>
> This is indeed a riddle. But is there a hard reason not to insert
>
> stat "$out" >/tmp/stat_out_after_creation 2>&1
>
> after that line and to compare /tmp/stat_out_after_creation with the
> stat(1) result when the expected file appears 30 seconds later ?

Because the issue was unexpected (this was the first time it occurred
after 11 years of testing: I wrote the first version of the script in
2011, and I typically look at the output file just after starting the
script) and is probably unlikely to occur again.

Moreover, I don't see why this behavior would be allowed with a local
filesystem. If a script run with "set -e" (to detect errors, like in
my script) contains

something > out
something else

and if in an interactive shell, one can see the side effect of
"something else", then one should expect file "out" to be visible
in the directory. (This is/was not necessarily the case with NFS
due to caching, even on the same machine.)
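The expectation stated here can be shown with a tiny demo (paths invented): a file created by one process is immediately visible to a process started afterwards on a local filesystem.

```shell
# One process creates the file; a subsequently started process must see it.
d=$(mktemp -d)
( echo data > "$d/out" )   # "something > out", done in a subshell/process
ls "$d"                    # a later process lists the directory and sees "out"
```

This is the ordinary POSIX visibility guarantee that the observed 30-second gap appears to violate.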

> > https://gitlab.inria.fr/mpfr/misc/-/blob/fed7770cf5f712871bd116ef80d93ea5885fc3f7/vl-tests/mpfrtests.sh
> > # Written in 2011-2021 by Vincent Lefevre <vin...@vinc17.net>.
>
> So you are supposed to really know what it does. :))

Yes, I'm the only author of the script, and I've been working on
it for 11 years. It was much simpler 11 years ago, but there was
already the same kind of code concerning the output file.

> Whatever, it is fewly plausible that the file would not show up in stat(1)
> or ls(1) but the script can happily do
> ... >> "$out"
> Each of these lines is supposed to do a name lookup of the file.

unless there is some caching bug. The script is likely to run on
the same CPU core, so that the file would remain visible throughout
the script, possibly via a cache. But some process running on a
different core might not see it for some time.

However, there's also the fact that the birth time was 30 seconds
later than the actual file creation, while there was no lockup.

Thomas Schmitt

Apr 27, 2022, 5:40:05 AM
Hi,

Vincent Lefevre wrote:
> the issue [...] is probably unlikely to occur again.

In this case we will hardly be able to find an explanation.


> However, there's also the fact that the birth time was 30 seconds
> ahead of the actual file creation, while there was no lockup.

Yes, your observations are not yet consistently explainable.
So some of the normal assumptions about your situation must be wrong.
The question is which ones.

The file times which you showed are consistent with a file that was indeed
created and written 30 seconds after your script was supposed to have
created it and to have written to it with repeated name lookups.

> > -rw-r--r-- 1 878 2022-04-26 14:43:45 mpfrtests.cventin.lip.ens-lyon.fr.out
> > Birth: 2022-04-26 14:43:45.537241731 +0200

On the other hand, the content looks like the normal work result of your
script, which had already ended half a minute ago.

It is normal that data get onto the physical storage medium only quite a
long time after a program wrote them. But this is supposed to be kept
consistent by the VFS and virtual memory of the Linux kernel.


> The script is likely to run on
> the same CPU core, so that the file would still be visible along
> the script, possibly via a cache.

A connection to the CPU cache would be a strange low level problem of
kernel or hardware.
I understand that the filesystem driver writes to memory pages which
are associated to storage device memory. The pages and their association
are managed by the virtual memory facility of the kernel.
https://www.kernel.org/doc/html/latest/filesystems/vfs.html#the-address-space-object
Any attempt to access the storage-device memory associated with a not yet
written page is supposed to be directed to the cached page in RAM.

If it has indeed to do with the CPU cache then a particular cache would have
delayed its writing to RAM for 30 seconds but would have served its own CPU
with the full results of file system driver and virtual memory activities
around the new file. No inconsistent partial results would have been written
to page cache in RAM which would have caused protests during your attempts
to see the file.

But I deem it unlikely that the kernel threads which operated the filesystem
and virtual memory were (nearly) always running on the same CPU core, with a
cache that is not shared with the other CPU cores.
Further, if the memory operations were just pending in some secluded cache,
why would the inode then bear the time when that cache finally released
its content to the widely accessible RAM?


If I were in your situation, I'd add diagnostic messages to the script in the
hope (or fear) that the glitch happens again.
Especially the inode numbers during and after the script run would be
interesting.

Vincent Lefevre

Apr 27, 2022, 9:20:05 AM
On 2022-04-27 11:39:17 +0200, Thomas Schmitt wrote:
> If it has indeed to do with the CPU cache then a particular cache would have
> delayed its writing to RAM for 30 seconds but would have served its own CPU
> with the full results of file system driver and virtual memory activities
> around the new file. No inconsistent partial results would have been written
> to page cache in RAM which would have caused protests during your attempts
> to see the file.
>
> But i deem it unlikely that the kernel threads which operated filesystem and
> virtual memory are (nearly) always running on the same CPU cache which is
> not shared with all other CPU cores.
> Further, if the memory operations were just pending in some secluded cache
> why does the inode then bear the time when that cache would finally have
> released its content to the wider accessible RAM.

I was thinking of a potentially specific FS cache. But perhaps this
is not related to the CPU at all. It could be that a potential bug
(either in the kernel code or in the compiler) would affect some
system calls but not others. So, the ">> $out" done in the script
would work, but not other kinds of operations on $out.

In the strace output of the script, I get

openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

when the file is created, then

openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3

for the commands that append to this file, and one with

openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
<... openat resumed>) = 3

about 30 seconds later.

So, perhaps there were no issues with openat, but when reading
the directory, the file could not be found because some internal
structures might have been incomplete.

This doesn't explain why the birth time of the file was 30 seconds
late. But according to fs/ext4/inode.c, inode handling seems a bit
complex.

To the question why this doesn't occur every time or at least more
often: there might have been an interrupt at the wrong place, or
something like that.

> If i was in your situation, i'd add diagnostic messages to the script in the
> hope (or fear) that the glitch happens again.

I think that a better approach would be to write a specific test script
doing file creations and writes in append mode, on many files,
and to see at the end whether some birth times were unexpectedly late.
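Such a test script might look like the following sketch. The file count, the 1-second tolerance, and the paths are arbitrary choices for illustration; stat's %W prints 0 when the filesystem records no birth time, and those files are skipped.

```shell
# Create many files the way the script does (">" then ">>"), record the
# wall-clock time beforehand, and flag any file whose birth time is late.
dir=$(mktemp -d)
late=0
for i in $(seq 1 100); do
    f="$dir/test$i.out"
    before=$(date +%s)
    echo "header $i" > "$f"          # creation with ">", as in mpfrtests.sh
    echo "more data" >> "$f"         # append with ">>", as in mpfrtests.sh
    birth=$(stat --format %W "$f")   # birth time, seconds since the epoch
    if [ "$birth" -gt 0 ] && [ "$birth" -gt $((before + 1)) ]; then
        late=$((late + 1))
        echo "LATE BIRTH: $f before=$before birth=$birth"
    fi
done
echo "checked 100 files, $late with a late birth time"
```

Run in a loop, this keeps the system-call pattern close to the original script while still detecting the anomaly after the fact.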

> Especially the inode numbers during and after the script run would be
> interesting.

But note that getting other information about the file during the
script could void the effects of a potential bug.

Greg Wooledge

Apr 27, 2022, 9:30:05 AM
On Wed, Apr 27, 2022 at 03:11:22PM +0200, Vincent Lefevre wrote:
> So, perhaps there were no issues with openat, but when reading
> the directory, the file could not be found because some internal
> structures might have been incomplete.

If so, this is a bug at the kernel level, perhaps in the VFS code
or a file system driver. I'm no expert on such things.

> To the question why this doesn't occur every time or at least more
> often: there might have been an interrupt at the wrong place, or
> something like that.

It's going to be extremely hard to make any progress on your issue if
you can't reliably reproduce the outcome. Frustrating as that may be.

Are you able to reproduce the issue at all, even like 1% of the time?
If so, you could set up a script that sometimes reproduces, and run it
in a loop until you get the erroneous behavior. Of course, the script
would have to generate enough debugging output to be useful for
diagnosing the underlying problem, and you'd probably have a second
script controlling the loop which runs the first script, and then looks
for the anomalous output in the logs, to stop the loop.
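The two-script setup could be sketched as below. Here "inner.sh" is a stand-in that never misbehaves, and "LATE BIRTH" is a hypothetical anomaly marker the real test script would log; both are assumptions for illustration.

```shell
# Outer controlling loop: run the inner test script repeatedly and stop
# as soon as its log shows the anomalous output.
work=$(mktemp -d)
cd "$work"
cat > inner.sh <<'EOF'
echo "run completed, no anomaly"
EOF
i=0
while [ "$i" -lt 5 ]; do            # bounded for the sketch; loop longer in practice
    sh ./inner.sh > run.log 2>&1
    if grep -q 'LATE BIRTH' run.log; then
        echo "anomaly reproduced on iteration $i"
        break
    fi
    i=$((i + 1))
done
echo "finished after $i runs"
```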

Vincent Lefevre

Apr 27, 2022, 10:10:05 AM
On 2022-04-27 09:26:50 -0400, Greg Wooledge wrote:
> Are you able to reproduce the issue at all, even like 1% of the time?

In the past 11 years, I may have run the script (and looked at the
output just after that) hundreds or thousands of times; the commands
to look at the output have been in my shell history for 6 years.

This was the first time I noticed such an issue. But if this is a
kernel bug, it may be a new bug.

> If so, you could set up a script that sometimes reproduces, and run it
> in a loop until you get the erroneous behavior.

I'll try to do that.

> Of course, the script would have to generate enough debugging output
> to be useful for diagnosing the underlying problem, and you'd
> probably have a second script controlling the loop which runs the
> first script, and then looks for the anomalous output in the logs,
> to stop the loop.

The first thing would be to try to reproduce the issue without
much debug info (in order to get the same system calls as the ones
generated by my normal script).

Stefan Monnier

Apr 27, 2022, 10:50:05 PM
Vincent Lefevre [2022-04-27 15:59:57] wrote:
> On 2022-04-27 09:26:50 -0400, Greg Wooledge wrote:
>> Are you able to reproduce the issue at all, even like 1% of the time?
> In the past 11 years, I may have ran the script (and looked at the
> output just after that) hundreds or thousands of times; the commands
> to look at the output have been in my shell history for 6 years.
>
> This was the first time I noticed such an issue. But if this is a
> kernel bug, it may be a new bug.

Another option might be that your system's time was "reset".
This shouldn't happen, but it can happen if your NTP was down, the
machine got out-of-sync over time and you restart the NTP server at
which point it may(!) decide to jump the clock if the difference is
large enough (i.e. too large to catch up gradually).
Can't remember how large is "large enough".

Anyway, just throwing this out there in case it rings a bell,


Stefan

Greg Wooledge

Apr 27, 2022, 11:10:05 PM
On Wed, Apr 27, 2022 at 10:45:09PM -0400, Stefan Monnier wrote:
> Another option might be that your system's time was "reset".
> This shouldn't happen, but it can happen if your NTP was down, the
> machine got out-of-sync over time and you restart the NTP server at
> which point it may(!) decide to jump the clock if the difference is
> large enough (i.e. too large to catch up gradually).
> Can't remember how large is "large enough".

It would depend on which NTP implementation is in use. The traditional
"ntp" package should not do a large jump like that, except at boot. It
should just make the clock drift toward the correct time.

Other implementations may be more aggressive about it. Debian 11 uses
systemd-timesyncd by default, but I don't know how it behaves. I've
not used chrony either. I did experiment with openntpd for a while, but
it was many years ago when it was pretty new, so it might have changed
a lot.

In any case, that's a clever theory. The OP could look for evidence of
an NTP time jump in whatever logs survive from the offending time
period.

Vincent Lefevre

Apr 28, 2022, 4:40:05 AM
On 2022-04-27 22:45:09 -0400, Stefan Monnier wrote:
> Another option might be that your system's time was "reset".
> This shouldn't happen, but it can happen if your NTP was down, the
> machine got out-of-sync over time and you restart the NTP server at
> which point it may(!) decide to jump the clock if the difference is
> large enough (i.e. too large to catch up gradually).
> Can't remember how large is "large enough".

I don't think that systemd does that. Anyway, even if this were
possible, that would make the output inconsistent. I recall:

cventin:~/software/mpfr[1]> lt|head <14:43:42
total 7016
-rw-r--r-- 1 188644 2022-04-26 14:43:42 config.log
-rw-r--r-- 1 2861 2022-04-26 14:43:42 conftest.c
-rw-r--r-- 1 0 2022-04-26 14:43:42 conftest.err
-rw-r--r-- 1 1907 2022-04-26 14:43:42 confdefs.h
-rwxr-xr-x 1 632161 2022-04-26 14:43:16 configure.lineno*
drwxr-xr-x 2 4096 2022-04-26 14:43:11 doc/
drwxr-xr-x 3 4096 2022-04-26 14:43:11 tune/
-rwxr-xr-x 1 23568 2022-04-26 14:43:11 depcomp*
drwxr-xr-x 5 36864 2022-04-26 14:43:11 tests/
cventin:~/software/mpfr> lt|head <14:43:47
total 6416
-rw-r--r-- 1 19436 2022-04-26 14:43:47 config.log
-rw-r--r-- 1 561 2022-04-26 14:43:47 conftest.c
-rw-r--r-- 1 0 2022-04-26 14:43:47 conftest.err
-rw-r--r-- 1 4138 2022-04-26 14:43:47 mpfrtests.cfgout
-rw-r--r-- 1 500 2022-04-26 14:43:47 confdefs.h
-rwxr-xr-x 1 632161 2022-04-26 14:43:45 configure.lineno*
-rw-r--r-- 1 878 2022-04-26 14:43:45 mpfrtests.cventin.lip.ens-lyon.fr.out
drwxr-xr-x 3 4096 2022-04-26 14:43:44 tune/
drwxr-xr-x 4 36864 2022-04-26 14:43:44 tests/

where mpfrtests.cventin.lip.ens-lyon.fr.out was actually created
a fraction of a second before configure.lineno in the script.

"14:43:42" is the time I ran the first "lt|head" and "14:43:47"
is the time I ran the second "lt|head", where "lt" is ls with
various options, including "-t" to sort in decreasing date order.

With time jumps, this is theoretically possible, but this would
mean that the time jumped at least 3 times:
1. To make the birth time 14:43:45.
2. To make the last modified time earlier than 14:43:11 (so that
mpfrtests.cventin.lip.ens-lyon.fr.out doesn't appear in the
first "lt|head" output).
3. To go back after 14:43:42.

And this would not explain the

cventin:~/software/mpfr> tail -n 30 mpfrtests.*.out; ll mpfrtests.*.out
zsh: no match
zsh: no match
cventin:~/software/mpfr[1]> tail -n 30 mpfrtests.*.out; ll mpfrtests.*.out
zsh: no match
zsh: no match

which I did a few seconds before 14:43:42.

If the time did not jump, then the birth time 14:43:45.537241731
matches the behavior I've observed in the above commands, i.e. as if
this file were created at this time.

Thomas Schmitt

Apr 28, 2022, 5:30:05 AM
Hi,

Vincent Lefevre wrote:
> and one with
> openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
> <... openat resumed>) = 3
> about 30 seconds later.

Oh. So the script was still running when the file finally appeared to lt,
tail, and ll?

Is the text snippet "<unfinished ...> <... openat resumed>" literally in
the output of strace? (Or does it stand for some other text?)

Did you check the kernel logs for unusual events around 2022-04-26 14:43?


> This doesn't explain why the birth time of the file was 30 seconds
> late.

I developed the theory that it might be an effect of journaling. Like:
Inode creation fails on storage device level, journal records the ongoing
write requests and virtual memory serves the read requests, 30 seconds
later journal creates inode at its final place on the storage device.
But as it looks in
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Journal_.28jbd2.29
the journal is a mere data cache with no means of its own to create inodes
and their timestamps.

Still, I deem it the most plausible theory that the inode to which the script
wrote in its first second is not the same inode to which it later wrote and
which finally shows up with shell tools.
But I have no idea how this could happen as a rare and unexpected event.

Vincent Lefevre

Apr 28, 2022, 5:40:05 AM
On 2022-04-27 11:39:17 +0200, Thomas Schmitt wrote:
> It is normal that data get onto the physical storage medium only quite a
> long time after a program wrote them. But this is supposed to be kept
> consistent by the VFS and virtual memory of the Linux kernel.
[...]
> I understand that the filesystem driver writes to memory pages which
> are associated to storage device memory. The pages and their association
> are managed by the virtual memory facility of the kernel.
> https://www.kernel.org/doc/html/latest/filesystems/vfs.html#the-address-space-object
> Any attempt to access the associated to storage device memory of a not
> yet written page is supposed to be directed to the cached page in RAM.

If I understand correctly, the VFS does not just have cached pages,
but also its own structures, like inodes. So, I'm wondering whether
the following could be possible:

* openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_TRUNC, 0666) creates
a file in the VFS, which is not written back to the actual FS.

* The subsequent openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666)
calls append data to this file in the VFS, still not written back to the
actual FS.

When I did

tail -n 30 mpfrtests.*.out; ll mpfrtests.*.out

this had the effect of looking at the entries in the current directory.
For some reason (a bug occurring under some particular conditions?),
the dirty state due to the data written above to the VFS was ignored,
so that the file was not found. Ditto for the first "lt|head".

Between the first "lt|head" and the second one, the data were written
back to the actual FS.

This explanation is possible only if the birth time is the time at
which the VFS inode was written as the inode of the actual FS, not
the time at which the file was created in the VFS. This may be the
case, as the VFS does not seem to have the concept of birth time
(fs/inode.c has things like atime, ctime and mtime, but that's all).

Vincent Lefevre

Apr 28, 2022, 6:20:05 AM
Hi,

On 2022-04-28 11:26:36 +0200, Thomas Schmitt wrote:
> Vincent Lefevre wrote:
> > and one with
> > openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
> > <... openat resumed>) = 3
> > about 30 seconds later.
>
> Oh. So the script was still running when the file finally appeared to lt,
> tail, and ll ?

Yes, the script takes several dozen minutes to complete
on this machine.

> Is the text snippet "<unfinished ...> <... openat resumed>" literally in
> the output of strace ? (Or does stand for some other text ?)

Yes, this is just from raw copy-paste, no editing. I suppose
that the scheduler interrupted the system call.

> Did you check the kernel logs for unusual events around 2022-04-26 14:43 ?

Yes, actually the systemd journal (which gives additional information).
Nothing at this time:

[...]
Apr 26 14:42:16 cventin su[662519]: pam_unix(su:session): session closed for user root
Apr 26 14:45:01 cventin CRON[768983]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
[...]

When I looked at it, I found I/O errors with sr0 and sr1 that occurred
a bit later. I initially thought of a possible hardware problem, but
they were common and unrelated. They are triggered by Wine, which is
executed by my script (as I also test MPFR under Wine).

My bug report:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1010209

(either the kernel is really doing silly things, or this should just
be debug information that should not be in the logs by default).

> > This doesn't explain why the birth time of the file was 30 seconds
> > late.
>
> I developed the theory that it might be an effect of journaling. Like:
> Inode creation fails on storage device level, journal records the ongoing
> write requests and virtual memory serves the read requests, 30 seconds
> later journal creates inode at its final place on the storage device.
> But as it looks in
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Journal_.28jbd2.29
> the journal is a mere data cache with no own means to create inodes and their
> timestamps.

I initially thought about journaling too.

> Still i deem it the most plausible theory that the inode to which the script
> wrote in its first second is not the same inode to which it later wrote and
> which finally shows up with shell tools.

I don't see how this is possible, except if you mean VFS inode
(with no concept of birth time) and ext4 inode. See my other mail
about that.

duh

Apr 29, 2022, 10:20:05 AM
Having skimmed over a number of the replies, and really not being qualified,
may I just toss out a probably useless idea: use the "sync" command. Looking
at 'man sync' shows at the bottom several variants of sync. Just a thought,
since when does the data get transferred to the disk versus just being held
in memory?

This is probably just a useless tangent based on my ignorance, but once in
a while it is possible to discover something when one falls into a hole.

Marc Auslander

Apr 29, 2022, 2:10:05 PM
On 4/29/2022 10:20 AM, duh wrote:
>
> On 4/27/22 11:05 PM, Greg Wooledge wrote:

....
> Having skimmed over a number of the replies, and really not being
> qualified, may I just
>
> toss out a probably useless ideas to use the "sync" command. Looking at
> the 'man sync'
>
> shows at the bottom several variants or whatever to sync. Just a thought
> since when does
>
> the data get transferred to the disk versus just being held in memory or
> whatever?
>
> This is probably just a useless tangent based on my ignorance, but once
> in awhile it is possible
>
> to discover something when falls into a hole.

sync isn't about this. Linux caches file system pages in memory - both
content and metadata. sync is about forcing the changed pages back to
disk, for example before shutting down. It's done automatically - maybe
every 30 seconds (I'm not sure about Linux on this). But sync does not
change what programs see unless they use a direct-to-disk read, which is
certainly not what's going on here.
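This point can be illustrated in a few lines (paths invented): a write goes to the page cache and is immediately visible to readers, and sync merely flushes the cache to disk without changing what readers see. (Passing a file argument to sync requires a reasonably recent coreutils.)

```shell
# Writes through the page cache are visible at once; sync changes nothing
# from a reader's point of view.
f=$(mktemp)
echo "cached data" > "$f"   # the write lands in the page cache
cat "$f"                    # a reader sees it immediately, before any flush
sync "$f"                   # forces the page to disk; visibility unchanged
cat "$f"                    # same content as before the flush
```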

sp...@caiway.net

Apr 29, 2022, 4:10:04 PM
On Fri, 29 Apr 2022 13:45:15 -0400

> linux caches file system pages in memory

how does one force a refresh of this memory with a command?

Thanks!
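For what it's worth, the closest thing to a "refresh" command is dropping the page cache via procfs; this discards clean cached pages rather than refreshing anything, changes nothing about what programs see, and needs root. A hedged sketch (assuming Linux with /proc mounted):

```shell
sync    # write dirty pages out first, so only clean pages get discarded
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches   # drop pagecache, dentries and inodes
    echo 'caches dropped'
else
    echo 'need root to drop caches'
fi
```

This is a debugging/benchmarking knob, not a consistency tool; the next read simply repopulates the cache from disk.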


to...@tuxteam.de

Apr 30, 2022, 1:40:05 AM
What do you mean by "refresh"? What's in the cache is always
the "freshest" version: nobody's writing to the disk behind
the operating system's back (unless they insist on getting
into trouble).

Cheers
--
t

Thomas Schmitt

Apr 30, 2022, 4:40:06 AM
Hi,

sp...@caiway.net wrote:
> how does one force to refresh this memory with a command?

to...@tuxteam.de wrote:
> What do you mean by "refresh"? What's in the cache is always
> the "freshest" version:

Indeed. With normal filesystem operations there should be no need to call
something like sync(2) in order to get a consistent representation of the
current filesystem state.
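A minimal illustration of that expectation (assuming a POSIX shell): a file created by one process is immediately visible, with content, to a separate process, no sync required.

```shell
f=$(mktemp)                                   # hypothetical scratch file
printf 'written early\n' > "$f"
# a *separate* process sees the file right away:
sh -c 'test -s "$0" && echo visible' "$f"     # prints "visible"
rm -f "$f"
```

That immediate visibility is exactly what the original report seems to contradict.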

That's actually the riddle of this thread:
The filesystem behavior outside the script and the time stamps look like
the file was created 30 seconds after the script began to open, write, and
close the file.
The file content indicates that it was written shortly after the script
began.

I understand that Vincent Lefevre suspects these discrepancies to be a bug
in the ext4 driver. I rather suspect that ext4 is ok and that we observe
the effects of some other glitch which caused the ext4 driver to create
the finally observed file again, 30 seconds later.


> nobody's writing to the disk behind the operating system's back
> (unless they insist into getting into trouble).

Burn programs do, because the kernel insists on being incapable of
burning CD-R and other sequential media types or states.
ioctl(SG_IO) performs SCSI transactions on the storage device via the
kernel, but without the higher-level drivers being aware. E.g. you don't
see the traffic counted in files of /proc, and the block device driver
will continue to show the medium state from before the burn run.
The remedy is to eject and reload the optical medium. On hard disks there
would be ioctl(BLKRRPART). But for CDROM there is nothing comparable in
Linux.

to...@tuxteam.de

Apr 30, 2022, 8:10:05 AM
On Sat, Apr 30, 2022 at 10:33:34AM +0200, Thomas Schmitt wrote:
> Hi,
>
> sp...@caiway.net wrote:
> > how does one force to refresh this memory with a command?
>
> to...@tuxteam.de wrote:
> > What do you mean by "refresh"? What's in the cache is always
> > the "freshest" version:
>
> Indeed. With normal filesystem operations there should be no need to call
> something like sync(2) in order to get a consistent representation of the
> current filesystem state.

Exactly: what we "see", i.e. what we interact with via the operating
system facilities is the cache's view. The disk's view will become
consistent with that at some time in the future (or so we hope).

[...]

> I understand that Vincent Lefevre suspects these discrepancies to be a bug
> in the ext4 driver. I rather suspect that ext4 is ok and that we observe
> the effects of some other glitch which caused the ext4 driver to create
> the finally observed file again, 30 seconds later.

It sure looks like a bug. But it would be a bug at a spot where one would
expect it to have bitten oodles of other people by now, so the feeling
that we must be missing something is understandable.

Cheers
--
t

Curt

Apr 30, 2022, 8:40:05 AM
On 2022-04-30, Thomas Schmitt <scdb...@gmx.net> wrote:
>
> Indeed. With normal filesystem operations there should be no need to call
> something like sync(2) in order to get a consistent representation of the
> current filesystem state.
>

What does the following mean, then, in that light:

Because of delayed allocation and other performance optimizations, ext4's
behavior of writing files to disk is different from ext3. In ext4, when a
program writes to the file system, it is not guaranteed to be on-disk unless
the program issues an fsync() call afterwards.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-ext4

Or this:

So on ext4 'data=ordered' and 'data=writeback' are somehow similar when a file
is enlarged, which is quite confusing and not clearly stated in the man pages.
And the delayed allocation will actually commit the data to disk only after
30-150 seconds (it is not very clear on this exact window of data loss) even
when 'commit=5' is supposed (cf. 'man mount') to do it after 5 seconds. In
conclusion, ext4 with default options guarantees after a crash only the
atomicity and consistency of the filesystem changes (all metadata) with a
maximum loss of 5 seconds on metadata changes. The data changes may suffer a
loss of 30-150 seconds and in the majority of cases all changed files in this
window will be completely wiped with zero bytes! The atomicity of file changes
is not working anymore with the rename-idiom.

https://www.pointsoftware.ch/2014/02/05/linux-filesystems-part-4-ext4-vs-ext3-and-why-delayed-allocation-is-bad/

Or maybe none of this is pertinent.
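The fsync() guarantee mentioned in the Red Hat passage can be requested from the shell as well; a sketch assuming GNU coreutils sync (>= 8.24), whose -d option issues fdatasync(2) on the named file:

```shell
f=$(mktemp)                             # hypothetical scratch file
printf 'important\n' > "$f"
sync -d "$f" && echo 'data committed'   # returns only after the data has been handed to the device
rm -f "$f"
```

Note this governs durability across a crash, not visibility to other processes, which is the distinction the replies below make.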

to...@tuxteam.de

Apr 30, 2022, 9:10:05 AM
On Sat, Apr 30, 2022 at 12:31:07PM -0000, Curt wrote:
> On 2022-04-30, Thomas Schmitt <scdb...@gmx.net> wrote:
> >
> > Indeed. With normal filesystem operations there should be no need to call
> > something like sync(2) in order to get a consistent representation of the
> > current filesystem state.
> >
>
> What does the following mean, then, in that light:
>
> Because of delayed allocation and other performance optimizations, ext4's
> behavior of writing files to disk is different from ext3. In ext4, when a
> program writes to the file system, it is not guaranteed to be on-disk unless
> the program issues an fsync() call afterwards.

You still can't observe that during a normally running system. You'll
see the same file system behaviour regardless of whether the data have
touched the disk or are still in cache.

The only way to actually observe this in action is to "pull the plug"
before data has a chance to reach the disk. Then, of course, some
files you thought were there will have magically disappeared.

Cheers
--
t

Thomas Schmitt

Apr 30, 2022, 9:20:05 AM
Hi,

Curt wrote:
> What does the following mean, then, in that light:
> Because of delayed allocation and other performance optimizations, ext4's
> behavior of writing files to disk is different from ext3. In ext4, when a
> program writes to the file system, it is not guaranteed to be on-disk unless
> the program issues an fsync() call afterwards.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-ext4

Well, the wording "delayed allocation and other performance optimizations"
could mean a lot of weird things.
But the subsequent paragraphs clearly concern the RAM-to-disk migration of
memory pages which are associated with the filesystem's disk storage.


> And the delayed allocation will actually commit the data to disk only after
> 30-150 seconds (it is not very clear on this exact window of data loss)

I understand this, and the quote above, to mean that the data content
already exists in RAM (i.e. the written inode data with the birth
timestamp, plus the file content as written by the running script) but
gets onto the physical storage medium only later.
The motivation for these discussions is possible data loss if the system
suddenly stops working. Inconsistent filesystem behavior is not mentioned
(but to be expected when the next run of the system encounters the dirty
filesystem).


An explanation of the observed problem would need:
- a mechanism which delayed the content production of the inode while it
was already in use for open and write,
- or a mechanism which caused ext4 to hide the inode to other processes
and to write a wrong birth timestamp,
- or a mechanism which deleted the file shortly after it was created and
re-created it 30 seconds later with its full expected content,
- or something of which I cannot think yet.

(The birth timestamp happens to match roughly the time when the file finally
became visible to other processes. It matches the modification time of the
finally visible file.)

--------------------------------------------------------------------------

The strange strace report around the time when the file finally appeared

openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
<... openat resumed>) = 3

does not indicate any disturbance, but only that strace had to deal with
more than one thread or process at the time. Between the above two lines
there is supposed to have been another line with a system call, though.
man 1 strace:

If a system call is being executed and meanwhile another one is being
called from a different thread/process then strace will try to preserve
the order of those events and mark the ongoing call as being unfin‐
ished. When the call returns it will be marked as resumed.

  [pid 28772] select(4, [3], NULL, NULL, NULL <unfinished ...>
  [pid 28779] clock_gettime(CLOCK_REALTIME, {1130322148, 939977000}) = 0
  [pid 28772] <... select resumed> ) = 1 (in [3])

Vincent Lefevre

May 1, 2022, 7:20:05 PM
Not necessarily. The bug might be new. It might be difficult to
reproduce, in particular if there is a race condition behind. It
might have occurred on other machines, but remained unnoticed,
because to notice it, one needs a specific sequence of operations
performed in a timely manner.

Bugs may exist, but may remain unnoticed for years. See for instance
this one I had found in 2006:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356083#15

"Good catch. Wow, that bug has been in GNU cmp ever since it was
entered into CVS in 1991."

Vincent Lefevre

May 1, 2022, 7:40:05 PM
On 2022-04-30 15:19:21 +0200, Thomas Schmitt wrote:
> An explanation of the observed problem would need:
> - a mechanism which delayed the content production of the inode while it
> was already in use for open and write,
> - or a mechanism which caused ext4 to hide the inode to other processes
> and to write a wrong birth timestamp,
> - or a mechanism which deleted the file shortly after it was created and
> re-created it 30 seconds later with its full expected content,
> - or something of which i cannot think yet.
>
> (The birth timestamp happens to match roughly the time when the file finally
> became visible to other processes. It matches the modification time of the
> finally visible file.)

I'm wondering whether the data are transferred from the VFS to ext4
necessarily within the same openat system call or could just be kept
in the VFS as long as they are not needed elsewhere, i.e. the VFS
behaving like a cache. In the latter case, since the VFS doesn't
have a notion of birth timestamp (from the code I've read), a bug
in the VFS code could explain the behavior I had observed. This is
the only explanation I can think of.
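For reference, the birth timestamp under discussion can be queried from userspace; a sketch assuming GNU coreutils stat (which obtains it via statx(2); %w prints '-' where the filesystem records no birth time):

```shell
f=$(mktemp)                 # hypothetical scratch file
stat -c 'birth: %w' "$f"    # '-' on filesystems without btime support
stat -c 'mtime: %y' "$f"
rm -f "$f"
```

In the reported case the birth time printed this way trailed the first write by about 30 seconds.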

> --------------------------------------------------------------------------
>
> The strange strace report around the time when the file finally appeared
>
> openat(AT_FDCWD, "….out", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
> <... openat resumed>) = 3
>
> does not mean any disturbance, but only that strace had to deal with more
> than one thread or process at the time. Between above two lines there is
> supposed to have been another line with a system call, though.
> man 1 strace:
>
> If a system call is being executed and meanwhile another one is being
> called from a different thread/process then strace will try to preserve
> the order of those events and mark the ongoing call as being unfin‐
> ished. When the call returns it will be marked as resumed.
>
> [pid 28772] select(4, [3], NULL, NULL, NULL <unfinished ...>
> [pid 28779] clock_gettime(CLOCK_REALTIME, {1130322148, 939977000}) = 0
> [pid 28772] <... select resumed> ) = 1 (in [3])

Thanks for the information. I'd say that "resumed" is misleading, as
it makes me think that "select" was interrupted. IMHO, "continued"
would be much better.

Greg Wooledge

May 1, 2022, 7:40:05 PM
On Mon, May 02, 2022 at 01:30:08AM +0200, Vincent Lefevre wrote:
> I'm wondering whether the data are transferred from the VFS to ext4
> necessarily within the same openat system call or could just be kept
> in the VFS as long as they are not needed elsewhere, i.e. the VFS
> behaving like a cache. In the latter case, since the VFS doesn't
> have a notion of birth timestamp (from the code I've read), a bug
> in the VFS code could explain the behavior I had observed. This is
> the only explanation I could have.

At this point, you should seriously consider asking the Linux Kernel
Mailing List, or some other place where people might know the answers
to these questions.

to...@tuxteam.de

May 2, 2022, 1:30:05 AM
On Mon, May 02, 2022 at 01:10:04AM +0200, Vincent Lefevre wrote:
> On 2022-04-30 14:06:53 +0200, to...@tuxteam.de wrote:

[...]

> > It sure looks like a bug. But it would be a bug at a spot where one would
> > expect that it should have bitten oodles of other people by now, so that
> > feeling that we must be missing something is as understandable.
>
> Not necessarily. The bug might be new. It might be difficult to
> reproduce, in particular if there is a race condition behind. It
> might have occurred on other machines, but remained unnoticed,
> because to notice it, one needs a specific sequence of operations
> performed in a timely manner.

[...]

Yes, of course. I didn't say it can't be a bug. It's just that it may
be hard to convince people. As it is sporadic, it may be hard to come
up with a somewhat reliable reproducer.

Cheers
--
t

David Wright

May 2, 2022, 11:10:06 AM
On Mon 02 May 2022 at 01:30:08 (+0200), Vincent Lefevre wrote:
> On 2022-04-30 15:19:21 +0200, Thomas Schmitt wrote:
> > An explanation of the observed problem would need:
> > - a mechanism which delayed the content production of the inode while it
> > was already in use for open and write,
> > - or a mechanism which caused ext4 to hide the inode to other processes
> > and to write a wrong birth timestamp,
> > - or a mechanism which deleted the file shortly after it was created and
> > re-created it 30 seconds later with its full expected content,
> > - or something of which i cannot think yet.
> >
> > (The birth timestamp happens to match roughly the time when the file finally
> > became visible to other processes. It matches the modification time of the
> > finally visible file.)
>
> I'm wondering whether the data are transferred from the VFS to ext4
> necessarily within the same openat system call or could just be kept
> in the VFS as long as they are not needed elsewhere, i.e. the VFS
> behaving like a cache. In the latter case, since the VFS doesn't
> have a notion of birth timestamp (from the code I've read), a bug
> in the VFS code could explain the behavior I had observed. This is
> the only explanation I could have.

I'm not very familiar with files' birth as it's a relatively new
addition to filesystems, particularly how to display it even when
present. So I looked it up, and the ext4 wiki says it's the time
at which the inode is created. I also read in the ext4 Wikipedia article:

Multiblock allocator

When ext3 appends to a file, it calls the block allocator, once
for each block. Consequently, if there are multiple concurrent
writers, files can easily become fragmented on disk. However, ext4
uses delayed allocation, which allows it to buffer data and
allocate groups of blocks. Consequently, the multiblock allocator
can make better choices about allocating files contiguously on
disk. The multiblock allocator can also be used when files are
opened in O_DIRECT mode. This feature does not affect the disk
format.

So I wondered whether a delayed birth time could be caused by the
filesystem waiting a while before it actually starts creating any
allocation for the file on the disk. Having amassed some data,
depending on its size, it decides where it's going to write it,
ie which block group. Only then does it create the file's inode,
so that it can keep file contents and inode close together.
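Delayed allocation itself is easy to observe from the shell (assumption: GNU stat; %b counts 512-byte blocks): on ext4 a freshly written file often shows zero allocated blocks until the data is flushed.

```shell
f=$(mktemp)                             # hypothetical scratch file
dd if=/dev/zero of="$f" bs=4096 count=4 2>/dev/null
stat -c 'blocks before sync: %b' "$f"   # often 0 on ext4 while allocation is delayed
sync "$f"
stat -c 'blocks after sync: %b' "$f"    # non-zero once blocks are allocated
rm -f "$f"
```

Whether the inode's birth time is tied to that late allocation step, as hypothesized above, is the open question.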

Cheers,
David.

to...@tuxteam.de

May 2, 2022, 12:20:06 PM
On Mon, May 02, 2022 at 10:04:00AM -0500, David Wright wrote:

> I'm not very familiar with files' birth as it's a relatively new
> addition to filesystems, particularly how to display it even when
> present. So I looked it up, and the ext4 wiki says it's the time
> at which the inode is created. I also read in ext4 wikipedia:

[...]

> So I wondered whether a delayed birth time could be caused by the
> filesystem waiting a while before it actually starts creating any
> allocation for the file on the disk. Having amassed some data,
> depending on its size, it decides where it's going to write it,
> ie which block group. Only then does it create the file's inode,
> so that it can keep file contents and inode close together.

That actually makes sense. It would be surprising behaviour,
but at least within reach; whether it's intentional or may be
considered a bug is, of course, another thing.

Thanks for diving into details, cheers
--
t