Nanosecond fs timestamp support: sad

Matt Mackall

Jul 21, 2011, 2:10:02 PM

So it turns out that the resolution on filesystem timestamps is tied to
HZ rather than gettimeofday or similar, which means the resolution
improvement over seconds is.. not much. And not nearly as much as
advertised!

This means I can touch a file something like 70k times per second and
get only 300 distinct timestamps on my laptop. And only 100 distinct
timestamps on a typical distro server kernel.

Meanwhile, I can call gettimeofday 35M times per second and get ~1M
distinct responses.
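
A rough userspace sketch of this kind of measurement, in Python. This is not the script behind the numbers above; the function names are made up for illustration, and the counts will vary with kernel version, HZ, and filesystem.

```python
import os
import tempfile
import time


def count_distinct_mtimes(path, n):
    """Bump the mtime of `path` n times; return how many distinct values appear."""
    seen = set()
    for _ in range(n):
        os.utime(path)  # times=None: the kernel stamps the file with "now"
        seen.add(os.stat(path).st_mtime_ns)
    return len(seen)


def count_distinct_clock_reads(n):
    """Read the wall clock n times; return how many distinct values appear."""
    return len({time.time_ns() for _ in range(n)})


if __name__ == "__main__":
    n = 20000
    with tempfile.NamedTemporaryFile() as tmp:
        start = time.monotonic()
        distinct_files = count_distinct_mtimes(tmp.name, n)
        elapsed = time.monotonic() - start
    print(f"{n} touches in {elapsed:.2f}s -> {distinct_files} distinct mtimes")
    print(f"{n} clock reads -> {count_distinct_clock_reads(n)} distinct values")
```

Comparing the two counts on the same machine gives a feel for the gap between filesystem timestamp resolution and clock resolution.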

Given that we can do gettimeofday three orders of magnitude faster than
we can do file transactions and it has four orders of magnitude better
resolution, shouldn't we be using it for filesystem time when
sb->s_time_gran is less than 1/HZ?

--
Mathematics is the supreme nostalgia of our time.


Andi Kleen

Jul 22, 2011, 2:10:01 AM

Matt Mackall <m...@selenic.com> writes:

> This means I can touch a file something like 70k times per second and
> get only 300 distinct timestamps on my laptop. And only 100 distinct
> timestamps on a typical distro server kernel.

You should use the inode generation number if you really want
to see every update.

> Meanwhile, I can call gettimeofday 35M times per second and get ~1M
> distinct responses.

The key word here is "I".

> Given that we can do gettimeofday three orders of magnitude faster than
> we can do file transactions and it has four orders of magnitude better
> resolution, shouldn't we be using it for filesystem time when
> sb->s_time_gran is less than 1/HZ?

Some systems have a quite slow gettimeofday().
That was the primary motivation for using jiffies.

Also, adding more granularity makes it more expensive,
because there's additional work every time it changes.
Even jiffies already caused regressions.

-Andi
--
a...@linux.intel.com -- Speaking for myself only

NeilBrown

Jul 22, 2011, 2:40:02 AM

On Thu, 21 Jul 2011 23:01:24 -0700 Andi Kleen <an...@firstfloor.org> wrote:

> Matt Mackall <m...@selenic.com> writes:
>
>
> > This means I can touch a file something like 70k times per second and
> > get only 300 distinct timestamps on my laptop. And only 100 distinct
> > timestamps on a typical distro server kernel.
>
> You should use the inode generation number if you really want
> to see every update.

I assume you mean i_version which gets incremented (under a spinlock) if the
filesystem asks for it.

This doesn't let you compare the ages of two files. I wonder if that is
important. Is it important to you Matt?

>
> > Meanwhile, I can call gettimeofday 35M times per second and get ~1M
> > distinct responses.
>
> The key word here is "I".
>
> > Given that we can do gettimeofday three orders of magnitude faster than
> > we can do file transactions and it has four orders of magnitude better
> > resolution, shouldn't we be using it for filesystem time when
> > sb->s_time_gran is less than 1/HZ?
>
> Some systems have a quite slow gettimeofday().
> That was the primary motivation for using jiffies.
>
> Also, adding more granularity makes it more expensive,
> because there's additional work every time it changes.
> Even jiffies already caused regressions.
>
> -Andi

I imagine a scheme where 'stat' would set a flag if it wasn't set, and
file_update_time would:
- if the flag is set, use gettimeofday and clear the flag
- if the flag is not set, use jiffies

so if you are looking, you will see i_mtime changing precisely but if not,
you don't pay the price.
This wouldn't allow precise ordering of distinct files either of course.
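
A toy model of that scheme, in Python rather than kernel C. The class and function names are hypothetical, and the clock is passed in as a plain nanosecond integer so the behaviour is easy to follow. One addition that is not part of the proposal as stated: the coarse path clamps with max() so a file's mtime never steps backwards after a precise stamp in the same jiffy.

```python
NS_PER_SEC = 1_000_000_000


class Inode:
    """Minimal stand-in for an inode: an mtime plus the 'someone looked' flag."""

    def __init__(self):
        self.mtime_ns = 0
        self.observed = False


def truncate_to_jiffy(now_ns, hz=100):
    """Round a nanosecond time down to the start of its jiffy."""
    ns_per_jiffy = NS_PER_SEC // hz
    return now_ns - now_ns % ns_per_jiffy


def stat_inode(inode):
    """Model of stat: report the mtime and remember that somebody looked."""
    inode.observed = True
    return inode.mtime_ns


def file_update_time(inode, now_ns, hz=100):
    """Model of file_update_time under the flag scheme."""
    if inode.observed:
        # Somebody is watching: pay for the precise clock once, clear the flag.
        inode.mtime_ns = now_ns
        inode.observed = False
    else:
        # Nobody is watching: the cheap jiffy-granular time is enough.
        # The max() clamp is an assumption added in this sketch.
        inode.mtime_ns = max(inode.mtime_ns, truncate_to_jiffy(now_ns, hz))
```

Without the clamp, a coarse update landing in the same jiffy as an earlier precise update would move the file's own mtime backwards.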

NeilBrown

Matt Mackall

Jul 22, 2011, 3:40:02 PM

On Fri, 2011-07-22 at 16:33 +1000, NeilBrown wrote:
> On Thu, 21 Jul 2011 23:01:24 -0700 Andi Kleen <an...@firstfloor.org> wrote:
>
> > Matt Mackall <m...@selenic.com> writes:
> >
> >
> > > This means I can touch a file something like 70k times per second and
> > > get only 300 distinct timestamps on my laptop. And only 100 distinct
> > > timestamps on a typical distro server kernel.
> >
> > You should use the inode generation number if you really want
> > to see every update.
>
> I assume you mean i_version which gets incremented (under a spinlock) if the
> filesystem asks for it.

Indeed. Only usefully exists on ext4 and requires extra system calls.

> This doesn't let you compare the ages of two files. I wonder if that is
> important. Is it important to you Matt?

Sort of. We track a 'latest seen timestamp' so we can consider files
before that time unchanged, and we need only concern ourselves with
looking for invisible changes that occur inside that quantum.
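
A minimal sketch of that bookkeeping in Python. The function name, the three labels, and the idea of passing in a previously recorded mtime are illustrative assumptions, not a description of any particular tool's implementation.

```python
import os
import tempfile


def classify(path, recorded_mtime_ns, last_seen_ns):
    """Decide how much work is needed to know whether `path` changed.

    recorded_mtime_ns: the mtime stored when the file was last examined.
    last_seen_ns: the latest timestamp observed during that examination.
    """
    current = os.stat(path).st_mtime_ns
    if current != recorded_mtime_ns:
        return "changed"      # the timestamp moved, so the file was touched
    if recorded_mtime_ns < last_seen_ns:
        return "unchanged"    # any later write would have bumped the mtime
    # Same quantum as the last look: a write could hide behind an identical
    # timestamp, so the contents have to be compared.
    return "ambiguous"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        mtime = os.stat(tmp.name).st_mtime_ns
        print(classify(tmp.name, mtime, mtime))        # ambiguous
        print(classify(tmp.name, mtime, mtime + 1))    # unchanged
```

The coarser the filesystem timestamps, the more files land in the "ambiguous" bucket and need a content comparison.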

> I imagine a scheme where 'stat' would set a flag if it wasn't set, and
> file_update_time would:
> - if the flag is set, use gettimeofday and clear the flag
> - if the flag is not set, use jiffies
>
> so if you are looking, you will see i_mtime changing precisely but if not,
> you don't pay the price.

Hmm, interesting.

> This wouldn't allow precise ordering of distinct files either of course.

Yeah, I don't think we want to introduce observable non-causality in
filesystem time. There might be something clever we can do here, but it
would require some Deep Thought. But if successful, we could mitigate
some of the repeated inode dirtying caused by jiffies-resolution
timestamping.

--
Mathematics is the supreme nostalgia of our time.

Andi Kleen

Jul 22, 2011, 5:00:01 PM

> Indeed. Only usefully exists on ext4 and requires extra system calls.

Not sure what you mean? It's in stat(2), just like the timestamps.

As for XFS, btrfs etc. I guess it could be added there.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Matt Mackall

Jul 22, 2011, 5:20:02 PM

On Fri, 2011-07-22 at 22:59 +0200, Andi Kleen wrote:
> > Indeed. Only usefully exists on ext4 and requires extra system calls.
>
> Not sure what you mean? It's in stat(2), just like the timestamps.

I don't see anything that looks like a version or generation number in
either the man pages, the asm-generic/stat.h, or glibc's asm/stat.h.
Pointer?

The only interface I'm aware of is the EXT?_IOC_GETVERSION interface.
Looks like that is supported by BTRFS.

--
Mathematics is the supreme nostalgia of our time.

Andi Kleen

Jul 22, 2011, 5:50:02 PM

On Fri, Jul 22, 2011 at 04:11:42PM -0500, Matt Mackall wrote:
> On Fri, 2011-07-22 at 22:59 +0200, Andi Kleen wrote:
> > > Indeed. Only usefully exists on ext4 and requires extra system calls.
> >
> > Not sure what you mean? It's in stat(2), just like the timestamps.
>
> I don't see anything that looks like a version or generation number in
> either the man pages, the asm-generic/stat.h, or glibc's asm/stat.h.
> Pointer?

Hmm you're right. I thought it was in there, but apparently not.
I think it should be added there though. We still have some unused
fields.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

J. Bruce Fields

Jul 22, 2011, 6:20:02 PM

On Fri, Jul 22, 2011 at 11:47:32PM +0200, Andi Kleen wrote:
> On Fri, Jul 22, 2011 at 04:11:42PM -0500, Matt Mackall wrote:
> > On Fri, 2011-07-22 at 22:59 +0200, Andi Kleen wrote:
> > > > Indeed. Only usefully exists on ext4 and requires extra system calls.
> > >
> > > Not sure what you mean? It's in stat(2), just like the timestamps.
> >
> > I don't see anything that looks like a version or generation number in
> > either the man pages, the asm-generic/stat.h, or glibc's asm/stat.h.
> > Pointer?
>
> Hmm you're right. I thought it was in there, but apparently not.
> I think it should be added there though. We still have some unused
> fields.

But last I checked I thought it was only ext4 that actually incremented
the i_version on IO, and even then only when given a (non-default) mount
option.

My notes on what needs to be done there:

- collect data to determine whether turning on i_version causes
  any significant performance regressions.
    - Last I talked to him, Ted Tso recommended running
      Bonnie on a local disk, since it does a lot of little
      writes, which is somewhat of a worst case, as it will
      generate extra metadata updates for each write.
      Compare total wall-clock time, number of iops, and
      number of bytes (using some kind of block tracing).
- If there aren't any problems, turn it on by default, and we're
  done. If there are unfixable problems, consider something
  more complicated (like turning on i_version automatically when
  someone asks for it).
- We need to check that i_version is also doing something
  sensible on directory as well as on file inodes.
- We also need to think about what it does after reboots. (E.g.
  what is an nfs server to do if clients see the i_version go
  backwards (and hence possibly repeat old values) after a
  reboot?)
- Double-check the order that data updates and i_version updates
  are done in. (Ideal would be if they were atomic, but for
  nfsd's purposes at least it should be adequate if the
  i_version comes after, and no later than the next commit.)

--b.

J. Bruce Fields

Jul 22, 2011, 6:40:02 PM

(Well, and talk the other filesystem implementors into doing it.)

--b.

NeilBrown

Jul 22, 2011, 7:00:02 PM

On Fri, 22 Jul 2011 18:31:58 -0400 "J. Bruce Fields" <bfi...@fieldses.org>
wrote:

But does anyone apart from NFSv4 actually *want* i_version as opposed to the
more-generally-useful precise timestamps?

If not, we probably should tell NFSv4 to use timestamps and focus on making
them work well.
??

The timestamp used doesn't need to update every nanosecond. I think if it
were just updated on every userspace->kernel transition (or effective
equivalents inside kernel threads) that would be enough to capture all
causality. I wonder how that would be achieved... I wonder if RCU machinery
could help - doesn't it keep track of when threads schedule ... or something?

NeilBrown

J. Bruce Fields

Jul 22, 2011, 7:10:01 PM

It *seems* like a generally useful idea, but I don't know of any other
users.

> If not, we probably should tell NFSv4 to use timestamps and focus on making
> them work well.
> ??

Well, sure, I couldn't complain about that if that proved possible.

--b.

J. Bruce Fields

Jul 22, 2011, 7:50:01 PM

On Fri, Jul 22, 2011 at 07:06:12PM -0400, J. Bruce Fields wrote:
> On Sat, Jul 23, 2011 at 08:59:15AM +1000, NeilBrown wrote:
> > But does anyone apart from NFSv4 actually *want* i_version as opposed to the
> > more-generally-useful precise timestamps?
>
> It *seems* like a generally useful idea, but I don't know of any other
> users.

(Out of curiosity: what actually *needs* real timestamps?:
  - They're generally useful to people, of course; ("what did I
    change last tuesday?")
  - Make uses them, though in theory perhaps it could do the same
    job by caching records like "object X was built from
    versions a, b, and c of objects A, B, and C respectively".

But a lot of uses are probably just to answer the question "did this
file change since the last time I looked at it"?

Of course, however theoretically useful, there's always the argument
that linux-specific interfaces are unlikely to be used by anyone except
Lennart Poettering.)

--b.

NeilBrown

Jul 22, 2011, 8:10:01 PM

On Fri, 22 Jul 2011 19:49:21 -0400 "J. Bruce Fields" <bfi...@fieldses.org>
wrote:

> On Fri, Jul 22, 2011 at 07:06:12PM -0400, J. Bruce Fields wrote:

> > On Sat, Jul 23, 2011 at 08:59:15AM +1000, NeilBrown wrote:
> > > But does anyone apart from NFSv4 actually *want* i_version as opposed to the
> > > more-generally-useful precise timestamps?
> >
> > It *seems* like a generally useful idea, but I don't know of any other
> > users.
>
> (Out of curiosity: what actually *needs* real timestamps?:
> - They're generally useful to people, of course; ("what did I
> change last tuesday?")

In the same vein they are useful for archiving. "what has changed since I
last started an archive?"

NFSv3 caching obviously uses them too.

> - Make uses them, though in theory perhaps it could do the same
> job by caching records like "object X was built from
> versions a, b, and c of objects A, B, and C respectively".

In theory....

>
> But a lot of uses are probably just to answer the question "did this
> file change since the last time I looked at it"?

I think everything could fall into one of two categories.
a/ did this file change since the last time I looked at it?
b/ did this file change since the last time that file changed?

The former can be achieved with versions or timestamps.
The latter requires globally coherent high precision timestamps... or
something like dependency tracking which would probably be even more
expensive and - as you say - non-standard.

NeilBrown

Matt Mackall

Jul 22, 2011, 8:10:01 PM

In theory, a microsecond timestamp (ie gtod) may already not be good
enough for all applications. But i_version also doesn't allow comparing
across files.

> If not, we probably should tell NFSv4 to use timestamps and focus on making
> them work well.
> ??
>
> The timestamp used doesn't need to update every nanosecond. I think if it
> were just updated on every userspace->kernel transition (or effective
> equivalents inside kernel threads) that would be enough to capture all
> causality. I wonder how that would be achieved... I wonder if RCU machinery
> could help - doesn't it keep track of when threads schedule ... or something?

Sort of.

Some observations:

- we only need to go to higher resolution when two events happen in the
same time quantum
- this applies at both the level of seconds and jiffies
- if the only file touched in a given quantum gets touched ago, we don't
need to update its timestamp if stat wasn't also called on it in this
quantum
- we never need to use a higher resolution than the global
min(s_time_gran)

For instance, if a machine is idle, except for writing to a single file
once a second, 1s resolution suffices.

If a machine is idle, except for writing to the same file 1000 times per
second, and no one is watching it, 1s still suffices (inode is dirtied
once per second).

Any time two files are touched in the same second, the second one (and
later files) needs jiffies resolution. Similarly, any time two files are
touched in the same jiffy, the second one should use gtod().

The global status bits needed to track this could be managed fairly
efficiently with cmpxchg.

(Arguably, we should supply > 1s resolution whether they're strictly
needed or not on filesystems with nanosecond support, so that people
casually inspecting timestamps don't wonder where their nanoseconds
went.)
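
The escalation rules above can be modelled in a few lines of Python. This is a single-threaded toy under stated assumptions: the class name and its fields are invented for the sketch, time is passed in as a nanosecond integer standing in for gtod(), and the state is kept in plain attributes where a kernel would need atomic updates (the cmpxchg mentioned above).

```python
NS_PER_SEC = 1_000_000_000


class EscalatingClock:
    """Hand out the coarsest timestamp that still separates distinct files."""

    def __init__(self, hz=100):
        self.ns_per_jiffy = NS_PER_SEC // hz
        self.cur_sec = None        # start of the second last stamped
        self.sec_owner = None      # first file touched in that second
        self.sec_shared = False    # has a second file shown up this second?
        self.cur_jiffy = None      # same three fields, one level finer
        self.jiffy_owner = None
        self.jiffy_shared = False

    def stamp(self, file_id, now_ns):
        sec = now_ns - now_ns % NS_PER_SEC
        jiffy = now_ns - now_ns % self.ns_per_jiffy
        if sec != self.cur_sec:
            # First touch in a new second: 1s resolution suffices.
            self.cur_sec, self.sec_owner, self.sec_shared = sec, file_id, False
            self.cur_jiffy = None
            return sec
        if file_id == self.sec_owner and not self.sec_shared:
            return sec  # still the only file touched this second
        self.sec_shared = True
        if jiffy != self.cur_jiffy:
            # First contended touch in this jiffy: jiffies resolution.
            self.cur_jiffy, self.jiffy_owner, self.jiffy_shared = jiffy, file_id, False
            return jiffy
        if file_id == self.jiffy_owner and not self.jiffy_shared:
            return jiffy
        # Two files in the same jiffy: fall back to the full-resolution clock.
        self.jiffy_shared = True
        return now_ns
```

In this model the stamps handed out never decrease, but two files can still receive equal stamps (for example when the second file arrives in the first jiffy of a second), so it illustrates the escalation idea rather than a strict ordering guarantee.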

--
Mathematics is the supreme nostalgia of our time.

Andreas Dilger

Jul 22, 2011, 9:30:01 PM

As an FYI, Lustre uses i_version to store the transaction number in which a file changed. It sets the i_version itself. If NFSv4 were to set i_version when it needs to transition the state of a file then it wouldn't cause overhead on filesystems that are not being used for NFS export.

I don't think timestamps can ever be completely safe for distributed state management, unless the kernel bends the rules on what a timestamp IS, e.g. by never reverting the ctime when the clock moves backward and such.

Cheers, Andreas


J. Bruce Fields

Jul 22, 2011, 9:40:02 PM

Right, so there was a rough algorithm hashed out somewhere around here:

http://thread.gmane.org/gmane.linux.kernel/1022866/focus=1024624

that depended on those observations.

NFS presents a worst-case as the standard NFSv3 read and write
operations include timestamps in the result. So every single IO comes
with a stat. So either you have a clock good enough to give a distinct
timestamp for all of those, or you fall back on a global counter that
ends up serializing all IO. I think. I admit I'm not sure I understand
your proposal below.

--b.

Trond Myklebust

Jul 22, 2011, 10:40:01 PM

...or you admit that NFSv3 is no longer able to keep up with modern
processing speeds and storage, and you ditch it in favour of NFSv4.

Time-stamps are _not_ the optimal way to label changes in a clustered
environment (or even a multi-cpu/multi-core environment): aside from the
various issues involving absolute time vs. wall clock time, you also
have to deal with clock synchronisation across those nodes/cpus/cores at
the < microsecond resolution level. Have fun doing that...

Trond

Andi Kleen

Jul 23, 2011, 10:00:02 PM

> with a stat. So either you have a clock good enough to give a distinct
> timestamp for all of those, or you fall back on a global counter that
> ends up serializing all IO. I think. I admit I'm not sure I understand

Not global counter, but per inode. That's very reasonable because there's
already locking on the inode level.
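
A small Python model of a per-inode change counter, to make the contrast with a global counter concrete. The class is hypothetical; a threading.Lock stands in for whatever inode-level locking the kernel already takes on the write path.

```python
import threading


class VersionedInode:
    """Toy inode whose version is bumped under its own lock on every write."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.data = b""

    def write(self, data):
        # The counter update piggybacks on the lock the write already holds,
        # so writers to different inodes never contend with each other.
        with self._lock:
            self.data = data
            self.version += 1
            return self.version
```

Each inode counts independently, so the numbers say whether one file changed but cannot be compared across files.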

-Andi

Paul E. McKenney

Jul 25, 2011, 11:10:02 AM

On Sat, Jul 23, 2011 at 08:59:15AM +1000, NeilBrown wrote:

RCU does track thread scheduling, but currently only pays attention to
it if there is an RCU grace period in progress. It would be easy to
make it track more precisely, though, if that would help something.

That said, I suspect that Peter Zijlstra would be extremely unhappy with
any proposed change that (say) acquired a global lock on every thread
schedule. And I don't believe that he would be all that happy even with a
change that added a non-global lock acquisition to each context switch...

Thanx, Paul

Pavel Machek

Jul 29, 2011, 3:50:01 PM

Hi!

> > If not, we probably should tell NFSv4 to use timestamps and focus on making
> > them work well.
> > ??
> >
> > The timestamp used doesn't need to update every nanosecond. I think if it
> > were just updated on every userspace->kernel transition (or effective
> > equivalents inside kernel threads) that would be enough to capture all
> > causality. I wonder how that would be achieved... I wonder if RCU machinery
> > could help - doesn't it keep track of when threads schedule ... or something?
>
> Sort of.
>
> Some observations:
>
> - we only need to go to higher resolution when two events happen in the
> same time quantum
> - this applies at both the level of seconds and jiffies
> - if the only file touched in a given quantum gets touched ago, we don't
> need to update its timestamp if stat wasn't also called on it in this
> quantum

parse error around 'ago'.

> - we never need to use a higher resolution than the global
> min(s_time_gran)
>
>
> For instance, if a machine is idle, except for writing to a single file
> once a second, 1s resolution suffices.

Are you sure? As soon as you get network communication...

> Any time two files are touched in the same second, the second one (and
> later files) needs jiffies resolution. Similarly, any time two files are
> touched in the same jiffy, the second one should use gtod().

For make. I don't see how this is globally true.

I do

( date; > stamp; date ) | ( sleep 5; cat > counterexample )

I know timestamp should be between two dates, but it is not.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Matt Mackall

Jul 29, 2011, 5:40:02 PM

On Fri, 2011-07-29 at 21:49 +0200, Pavel Machek wrote:
> Hi!
>
> > > If not, we probably should tell NFSv4 to use timestamps and focus on making
> > > them work well.
> > > ??
> > >
> > > The timestamp used doesn't need to update every nanosecond. I think if it
> > > were just updated on every userspace->kernel transition (or effective
> > > equivalents inside kernel threads) that would be enough to capture all
> > > causality. I wonder how that would be achieved... I wonder if RCU machinery
> > > could help - doesn't it keep track of when threads schedule ... or something?
> >
> > Sort of.
> >
> > Some observations:
> >
> > - we only need to go to higher resolution when two events happen in the
> > same time quantum
> > - this applies at both the level of seconds and jiffies
> > - if the only file touched in a given quantum gets touched ago, we don't
> > need to update its timestamp if stat wasn't also called on it in this
> > quantum
>
> parse error around 'ago'.

This should read:

- if only one file is touched in a given quantum, we don't need to
update its timestamp if stat wasn't called on it in the same quantum

> > - we never need to use a higher resolution than the global
> > min(s_time_gran)
> >
> >
> > For instance, if a machine is idle, except for writing to a single file
> > once a second, 1s resolution suffices.
>
> Are you sure? As soon as you get network communication...

I don't think you can generally compare filesystem timestamps to other
time sources reliably. For instance, network filesystems might have
their own notions of current time.

> > Any time two files are touched in the same second, the second one (and
> > later files) needs jiffies resolution. Similarly, any time two files are
> > touched in the same jiffy, the second one should use gtod().
>
> For make. I don't see how this is globally true.
>
> I do
>
> ( date; > stamp; date ) | ( sleep 5; cat > counterexample )
>
> I know timestamp should be between two dates, but it is not.

You're claiming the timestamp on 'stamp' should be strictly between the
two dates reported?

This is true today if and only if you measure in seconds (and your
filesystem's clock is synced with your local clock). If you measure at a
resolution finer than the filesystem's (currently limited to jiffies),
it will be wrong even on a local filesystem.
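
The same experiment can be run with nanosecond readings instead of date(1). A Python sketch follows; the function name is made up, and the outcome depends on the kernel and filesystem in use.

```python
import os
import tempfile
import time


def stamp_between_clock_reads(directory):
    """Create a file between two clock reads; return (before, mtime, after) in ns."""
    path = os.path.join(directory, "stamp")
    before = time.time_ns()
    with open(path, "w"):
        pass  # creating the file is what sets its mtime
    after = time.time_ns()
    mtime = os.stat(path).st_mtime_ns
    return before, mtime, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        before, mtime, after = stamp_between_clock_reads(d)
    print("mtime - before:", mtime - before, "ns")
    print("after - mtime: ", after - mtime, "ns")
    print("strictly between:", before < mtime < after)
```

A negative first difference means the file appears to have been created before the first clock read, which is the effect being discussed.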

--
Mathematics is the supreme nostalgia of our time.