Faster System.nanoTime()?

Carl Mastrangelo

Apr 29, 2019, 5:36:34 PM
to mechanical-sympathy
This may be a dumb question, but why (on Linux) is System.nanoTime() a call out to clock_gettime? It seems like it could be inlined by the JVM and stripped down to the rdtsc instruction. From my reading of the vDSO source for x86, the implementation is not that complex and could be copied into Java.
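For reference, the fast path boils down to roughly this shape (a simplified sketch in C, not the actual kernel code; the cycle_last/mult/shift/base_ns values stand in for calibration data the kernel refreshes on every tick):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Illustrative sketch of the vDSO-style TSC clock (not the real
 * kernel code). The kernel maintains these fields and refreshes
 * them on every timer tick; the values are per-boot calibration
 * data, so they are placeholders here. */
struct tsc_clock {
    uint64_t cycle_last;  /* TSC value at the last kernel update */
    uint64_t base_ns;     /* CLOCK_MONOTONIC nanoseconds at cycle_last */
    uint32_t mult;        /* cycles -> ns fixed-point multiplier */
    uint32_t shift;       /* fixed-point shift paired with mult */
};

static uint64_t tsc_clock_nanos(const struct tsc_clock *c) {
    uint64_t delta = __rdtsc() - c->cycle_last;
    /* the kernel also guards against delta * mult overflowing 64 bits */
    return c->base_ns + ((delta * c->mult) >> c->shift);
}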

Ngo The Trung

Apr 30, 2019, 1:49:23 AM
to mechanica...@googlegroups.com
As I understand it, rdtsc can be privileged.

Christian A. Steiner

Apr 30, 2019, 1:49:42 AM
to mechanica...@googlegroups.com

One of the problems that I remember is that every core has its own counter; if you want to get reliable measurements, you would need to pin the thread to a core between measurements (calls to RDTSC). At least that is what I remember off the top of my head. I guess that's why it's not an option for the JVM. Although they could provide an Unsafe function ;)
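The pinning part would look something like this on Linux (a sketch; it only makes consecutive reads comparable, it doesn't make the TSC itself trustworthy):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>

/* Pin the calling thread to one CPU so that consecutive RDTSC reads
 * come from the same core's counter. pid 0 = the calling thread. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Usage:
 *   pin_to_cpu(3);
 *   uint64_t t0 = __rdtsc();
 *   ... measured work ...
 *   uint64_t t1 = __rdtsc();   // t1 - t0 is ticks from a single core */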

dor laor

Apr 30, 2019, 1:50:05 AM
to mechanica...@googlegroups.com
It might be because, in the past, many systems did not have a stable rdtsc, so if the instruction is executed on different sockets it can produce wrong answers and negative time. Today most systems do have a stable TSC, and you can verify it from userspace/Java too.
I bet it's easy to google the reason.
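For instance, on Linux a userspace check can be as simple as reading the flags the kernel exports, plus which clocksource the kernel itself picked (a sketch):

#include <stdio.h>
#include <string.h>

/* Rough userspace check for a usable TSC on Linux: look for the
 * constant_tsc/nonstop_tsc CPU flags, and check which clocksource
 * the kernel itself selected. */
int main(void) {
    char line[4096];
    int constant = 0, nonstop = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    while (f && fgets(line, sizeof line, f)) {
        if (strstr(line, "constant_tsc")) constant = 1;
        if (strstr(line, "nonstop_tsc"))  nonstop  = 1;
    }
    if (f) fclose(f);

    char src[64] = "?";
    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (f && fgets(src, sizeof src, f))
        src[strcspn(src, "\n")] = '\0';
    if (f) fclose(f);

    printf("constant_tsc=%d nonstop_tsc=%d kernel clocksource=%s\n",
           constant, nonstop, src);
    return 0;
}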

Nitsan Wakart

Apr 30, 2019, 5:52:18 AM
to mechanica...@googlegroups.com
Zing has the option to do just that on systems which reliably support it (-XX:+UseRdtsc, IIRC). So yes, it can be done, and is sometimes even the right thing to do.

Ben Evans

Apr 30, 2019, 6:07:11 AM
to mechanica...@googlegroups.com
I'd assumed that the monotonicity of System.nanoTime() on modern systems was due to the OS compensating, rather than to any changes at the hardware level. Is that not the case?

In particular, Rust definitely still seems to think that their SystemTime (which appears to be backed directly by RDTSC) can be non-monotonic: https://doc.rust-lang.org/std/time/struct.SystemTime.html

Nitsan Wakart

Apr 30, 2019, 9:27:05 AM
to mechanica...@googlegroups.com
The code has a nice explanation of the workaround they have to resort to in order to ensure a monotonic time source.

https://doc.rust-lang.org/src/std/time.rs.html#157

        // And here we come upon a sad state of affairs. The whole point of
        // `Instant` is that it's monotonically increasing. We've found in the
        // wild, however, that it's not actually monotonically increasing for
        // one reason or another. These appear to be OS and hardware level bugs,
        // and there's not really a whole lot we can do about them. Here's a
        // taste of what we've found:
        //
        // * #48514 - OpenBSD, x86_64
        // * #49281 - linux arm64 and s390x
        // * #51648 - windows, x86
        // * #56560 - windows, x86_64, AWS
        // * #56612 - windows, x86, vm (?)
        // * #56940 - linux, arm64
        // * https://bugzilla.mozilla.org/show_bug.cgi?id=1487778 - a similar
        //   Firefox bug
        //
        // It simply seems that this it just happens so that a lot in the wild
        // we're seeing panics across various platforms where consecutive calls
        // to `Instant::now`, such as via the `elapsed` function, are panicking
        // as they're going backwards. Placed here is a last-ditch effort to try
        // to fix things up. We keep a global "latest now" instance which is
        // returned instead of what the OS says if the OS goes backwards.
        //
        // To hopefully mitigate the impact of this though a few platforms are
        // whitelisted as "these at least haven't gone backwards yet".

In other words, 60% of the time it works every time.

Carl Mastrangelo

Apr 30, 2019, 2:46:09 PM
to mechanical-sympathy
If the kernel implementation of clock_gettime is correct, copying the implementation in the vDSO into the JVM should be okay too, right? If rdtsc were not stable, it would have the same problems when invoked from either place?

Carl Mastrangelo

Apr 30, 2019, 2:47:58 PM
to mechanical-sympathy
JFR intrinsics have access to it as well, but nanoTime() doesn't use it, which seemed odd to me.

Out of curiosity, how long do nanoTime() calls on Zing take?
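For a ballpark outside the JVM, a crude harness like this compares the vDSO clock_gettime path to a raw RDTSC on a given box (a sketch in C; on the Java side a JMH benchmark of System.nanoTime() would be the proper tool):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* vDSO fast path on Linux */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    enum { N = 10 * 1000 * 1000 };
    volatile uint64_t sink = 0;   /* keeps the loops from being elided */

    uint64_t t0 = now_ns();
    for (int i = 0; i < N; i++) sink += now_ns();   /* clock_gettime cost */
    uint64_t t1 = now_ns();
    for (int i = 0; i < N; i++) sink += __rdtsc();  /* raw TSC read cost */
    uint64_t t2 = now_ns();

    printf("clock_gettime: %.1f ns/call, rdtsc: %.1f ns/call\n",
           (double)(t1 - t0) / N, (double)(t2 - t1) / N);
    return (int)(sink & 1);
}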



Avi Kivity

Apr 30, 2019, 2:56:54 PM
to mechanica...@googlegroups.com

The implementation in the vDSO changes with the running kernel and the hardware it runs on. You can't just copy it.

Gil Tene

May 1, 2019, 12:58:35 PM
to mechanical-sympathy
There are many ways for RDTSC to be made "wrong" (as in non-monotonic within a software thread, process, system, etc.), but AFAIK "most" modern x86-64 bare-metal systems can be set up for good, clean, monotonic system-wide TSC-ness. The hardware certainly has the ability to keep those TSCs in sync (enough to not have detectable non-sync effects) both within a socket and across multi-socket systems (when the hardware is "built right"). The TSCs all get reset together and move together unless interfered with...

Two ways I've seen this go wrong even on modern hardware include:

A) Some BIOSes resetting the TSC on a single core or hyperthread on each socket (usually thread 0 of core 0) for some strange reason during the boot sequence. [I've conclusively shown this on some 4-socket Sandy Bridge systems.] This leaves different vcores with vastly differing TSC values, a gap which grows with every non-power-cycling reboot, with obvious negative effects and screams from anyone relying on TSC consistency for virtually any purpose.

B) Hypervisors virtualizing the TSC. Some hypervisors (notably at least some versions of VMware) will virtualize the TSC and "slew" the virtualized value to avoid presenting guest OSs with huge jumps in TSC values when a core was taken away for a "long" (i.e. many-msec) period of time. Instead, the virtualized TSC will incrementally move forward in small jumps until it catches up. The purpose of this appears to be to avoid triggering guest OS panics in code that watches the TSC for panic-timeouts and other sanity checks (e.g. code in OS spinlocks). The effect of this "slewing" is obvious: TSC values can easily jump backward, even within a single software thread.

The bottom line is that the TSC can be relied on on bare metal (where there is no hypervisor scheduling of guest OS cores) if the system is set up right, but can do very wrong things otherwise. People who really care about low-cost time measurement (like System.nanoTime()) can control their systems to make this work and elect to rely on it (that's exactly what Zing's -XX:+UseRdtsc flag is for), but it can be dangerous to rely on it by default.
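One cheap way to gather evidence before electing to rely on it is a cross-core sanity check along these lines (a sketch, Linux/x86-64 only; passing it doesn't prove the TSCs are in sync, but failing it proves they aren't):

#define _GNU_SOURCE
#include <inttypes.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <x86intrin.h>

/* Crude cross-core TSC sanity check: hop the thread across every
 * online CPU and flag any backwards step in the TSC readings. */
int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    uint64_t prev = 0;
    int bad = 0;
    for (int round = 0; round < 1000; round++) {
        for (long cpu = 0; cpu < ncpu; cpu++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                continue;   /* CPU may be offline or disallowed */
            uint64_t t = __rdtsc();
            if (t < prev) {
                printf("TSC went backwards on cpu %ld: %" PRIu64 " -> %" PRIu64 "\n",
                       cpu, prev, t);
                bad = 1;
            }
            prev = t;
        }
    }
    puts(bad ? "out-of-sync TSCs observed" : "no backwards steps observed");
    return bad;
}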

dor laor

May 1, 2019, 4:38:33 PM
to mechanica...@googlegroups.com
On Wed, May 1, 2019 at 9:58 AM Gil Tene <g...@azul.com> wrote:
> B) Hypervisors virtualizing the TSC. Some hypervisors (notably at least some versions of VMware) will virtualize the TSC and "slew" the virtualized value to avoid presenting guest OSs with huge jumps in TSC values when a core was taken away for a "long" (i.e. many-msec) period of time. [...]

A hypervisor wouldn't take the TSC backwards; it can slow the TSC but not take it backward, unless they virtualize the CPU bits for stable TSC differently, which happens, but I doubt VMware (and better hypervisors) take the TSC back.

Greg Young

May 1, 2019, 6:03:32 PM
to mechanica...@googlegroups.com
"A) Some BIOSes resetting TSC on a single core or hyperthread on each socket (usually thread 0 of core 0) for some strange reason during the boot sequence. [I've conclusively shown this on some 4 socket Sandy Bridge systems.] This leads different vcores to have vastly differing TSC values, which gets bigger with every non-power-cycling reboot, with obvious negative effects and screams from anyone relying on TSC consistency for virtually any purpose."


Hmm... I use RDTSC in a profiler I wrote and have seen some oddities related to this. I think you might have just cost me research time... and now I owe you a pint next time we run into each other. Thanks!

--
Studying for the Turing test

Gil Tene

May 2, 2019, 10:14:51 AM
to mechanica...@googlegroups.com



On May 1, 2019, at 1:38 PM, dor laor <dor....@gmail.com> wrote:
> A hypervisor wouldn't take the TSC backwards; it can slow the TSC but not take it backward, unless they virtualize the CPU bits for stable TSC differently, which happens, but I doubt VMware (and better hypervisors) take the TSC back.

A hypervisor wouldn't take the TSC backwards within one vcore.

But vcores are scheduled individually, which means that any slewing done to hide a long jump forward in the physical TSC, in situations where a vcore was not actually running on a physical core for a "long enough" period of time, is done individually within each vcore and its virtualized TSC. (Synchronizing the virtualized TSC slewing across vcores would require either synchronizing their scheduling such that the entire VM would be either "on" or "off" cores at the same time, or making the virtualized TSC only tick forward in large quanta, or only when all vcores are actively running on physical cores, all of which would cause some other dramatic strangeness.)

Multiple vcores belonging to the same guest OS can (and usually will) end up running simultaneously on multiple real cores, which obviously means that during slewing periods they will be showing vastly differing virtualized TSC values (with gaps of 10s of msec) until the "slewing" is done. All it takes is a "lucky timing" context switch within the guest OS, moving a thread from one vcore to another (for whichever of the many reasons the guest OS might decide to do that), for *your* program to observe the TSC "jumping backwards" by 10s of msec between one RDTSC execution and another.

dor laor

May 2, 2019, 5:53:57 PM
to mechanica...@googlegroups.com
On Thu, May 2, 2019 at 7:14 AM Gil Tene <g...@azul.com> wrote:
> Multiple vcores belonging to the same guest OS can (and usually will) end up running simultaneously on multiple real cores, which obviously means that during slewing periods they will be showing vastly differing virtualized TSC values (with gaps of 10s of msec) until the "slewing" is done. [...]

It's the same issue as a physical machine with multiple sockets: the TSC isn't synced across those different sockets.
The hypervisor keeps an offset per unscheduled vcore and makes sure it is monotonic. Although we at KVM considered slewing/speeding the TSC on vcores, primarily for live migration, we didn't do it in practice. One of my old team members wrote this pretty good write-up (in 2011 but still relevant):

Gil Tene

May 2, 2019, 8:20:03 PM
to mechanica...@googlegroups.com

On May 2, 2019, at 2:53 PM, dor laor <dor....@gmail.com> wrote:

> It's the same issue as a physical machine with multiple sockets: the TSC isn't synced across those different sockets.

Except that since ~Nehalem, the hardware (if built to recommended specs) does keep the TSC on all cores synced across all sockets. The only non-perfectly-synced TSCs I've ever seen on any modern Intel hardware were due to the BIOS messing with them after the hardware reset that placed them all in sync had already happened. [On those platforms where I observed out-of-sync TSCs, all hyperthreads except for one per socket were perfectly in sync, and the two hyperthreads on core 0 were out of sync with each other.]

> The hypervisor keeps an offset per unscheduled vcore and makes sure it is monotonic. Although we at KVM considered slewing/speeding the TSC on vcores, primarily for live migration, we didn't do it in practice.

If KVM doesn't slew, KVM guests may be fine on modern hardware...

How did you deal with guest OSs panicking if they observed large intra-core TSC skips in critical code (when scheduled out for a long period of time)?

> One of my old team members wrote this pretty good write-up (in 2011 but still relevant):

That's a good write-up, with lots of good detail. But since then (2011):
- the statement "...multi-socket systems are likely to have individual clocksources rather than a single, universally distributed clock" has changed, and most multi-socket systems now have a universal synced clock.
- TSCs that are invariant to P-state and C-state are the norm on modern hardware.

Which means that the TSC can actually be relied on in most bare-metal cases.
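For what it's worth, the invariance property is directly queryable: CPUID leaf 0x80000007, EDX bit 8 is the "invariant TSC" bit (a sketch for GCC/Clang on x86):

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang __get_cpuid() */

/* Query the "invariant TSC" bit (CPUID leaf 0x80000007, EDX bit 8).
 * When set, the TSC ticks at a constant rate across P- and C-states. */
int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0x80000007 not supported");
        return 1;
    }
    printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}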