One of the problems that I remember is that every core has its own counter; if you want reliable measurements you would need to pin the thread to a core between measurements (calls to RDTSC). At least that is what I remember off the top of my head. I guess that's why it's not an option for the JVM. Although, they could provide an Unsafe function ;)
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
This may be a dumb question, but why (on Linux) is System.nanoTime() a call out to clock_gettime? It seems like it could be inlined by the JVM and stripped down to the RDTSC instruction. From my reading of the vDSO source for x86, the implementation is not that complex, and could be copied into Java.
Zing has the option to do just that on systems which reliably support it (-XX:+UseRdtsc IIRC). So yes it can be done, and is sometimes even the right thing to do.
On Tue, Apr 30, 2019 at 7:50 AM dor laor <dor...@gmail.com> wrote:
It might be because in the past many systems did not have a stable RDTSC, and thus if the instruction is executed on different sockets it can result in wrong answers and negative time. Today most systems do have a stable TSC, and you can verify it from userspace/Java too. I bet it's easy to google the reason.
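As a concrete illustration of "you can verify it from userspace": on Linux, the kernel exports the CPU feature flags through /proc/cpuinfo, and the two flags that matter for a usable TSC are `constant_tsc` and `nonstop_tsc`. A minimal Java sketch (Linux-specific by assumption, and only a heuristic — it checks flags, not cross-socket sync):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TscFlags {
    public static void main(String[] args) throws IOException {
        // "constant_tsc" means the TSC ticks at a fixed rate regardless of
        // frequency scaling; "nonstop_tsc" means it keeps ticking in deep
        // C-states. Both are needed for the TSC to be a usable clock source.
        try (Stream<String> lines = Files.lines(Paths.get("/proc/cpuinfo"))) {
            String flags = lines.filter(l -> l.startsWith("flags"))
                                .findFirst()
                                .orElse("");
            System.out.println("constant_tsc: " + flags.contains("constant_tsc"));
            System.out.println("nonstop_tsc:  " + flags.contains("nonstop_tsc"));
        }
    }
}
```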
The implementation in the vDSO changes with the running kernel and the hardware it runs on. You can't just copy it.
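Even through the vDSO, the cost of the call is easy to estimate from plain Java. This is a crude loop, not a proper JMH benchmark, so treat the printed number as an order-of-magnitude estimate only:

```java
public class NanoTimeCost {
    public static void main(String[] args) {
        final int iters = 5_000_000;
        // Warm up so the JIT compiles the loop before we measure.
        long sink = 0;
        for (int i = 0; i < iters; i++) sink += System.nanoTime();

        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) sink += System.nanoTime();
        long elapsed = System.nanoTime() - start;

        // Via the vDSO, clock_gettime(CLOCK_MONOTONIC) avoids a full
        // syscall, so each call typically costs tens of nanoseconds.
        System.out.println("avg ns/call: " + (double) elapsed / iters);
        if (sink == 42) System.out.println(sink); // keep the sink live
    }
}
```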
I'd assumed that the monotonicity of System.nanoTime() on modern
systems was due to the OS compensating, rather than any changes at the
hardware level. Is that not the case?
In particular, Rust definitely still seems to think that their
SystemTime (which looks to back directly on to a RDTSC) can be
non-monotonic: https://doc.rust-lang.org/std/time/struct.SystemTime.html
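The single-thread monotonicity of System.nanoTime() is at least easy to probe empirically. This simple check should report zero backward steps on a conforming JVM over CLOCK_MONOTONIC, whatever the OS is doing underneath to compensate:

```java
public class MonotonicCheck {
    public static void main(String[] args) {
        // Within a single thread, successive System.nanoTime() calls
        // should never go backwards: on Linux the value comes from
        // CLOCK_MONOTONIC, which compensates for TSC quirks.
        long prev = System.nanoTime();
        long regressions = 0;
        for (int i = 0; i < 10_000_000; i++) {
            long now = System.nanoTime();
            if (now < prev) regressions++;
            prev = now;
        }
        System.out.println("backward steps observed: " + regressions);
    }
}
```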
There are many ways for RDTSC to be made "wrong" (as in non-monotonic within a software thread, process, system, etc.), but AFAIK "most" modern x86-64 bare metal systems can be set up for good, clean, monotonic system-wide TSC behavior. The hardware certainly has the ability to keep those TSCs in sync (enough to not have detectable non-sync effects) both within a socket and across multi-socket systems (when the hardware is "built right"). The TSCs all get reset together and move together unless interfered with...
Two ways I've seen this go wrong even on modern hardware include:
A) Some BIOSes resetting the TSC on a single core or hyperthread on each socket (usually thread 0 of core 0) for some strange reason during the boot sequence. [I've conclusively shown this on some 4-socket Sandy Bridge systems.] This leads different vcores to have vastly differing TSC values, a gap which grows with every non-power-cycling reboot, with obvious negative effects and screams from anyone relying on TSC consistency for virtually any purpose.
B) Hypervisors virtualizing the TSC. Some hypervisors (notably at least some versions of VMware) will virtualize the TSC and "slew" the virtualized value to avoid presenting guest OSs with huge jumps in TSC values when a core was taken away for a "long" (i.e. many-msec) period of time. Instead, the virtualized TSC will incrementally move forward in small jumps until it catches up. The purpose of this appears to be to avoid triggering guest OS panics in code that watches the TSC for panic timeouts and other sanity checks (e.g. code in OS spinlocks). The effect of this "slewing" is obvious: TSC values can easily jump backward, even within a single software thread.
The bottom line is that the TSC can be relied on on bare metal (where there is no hypervisor scheduling of guest OS cores) if the system is set up right, but can do very wrong things otherwise. People who really care about low-cost time measurement (like System.nanoTime()) can control their systems to make this work and elect to rely on it (that's exactly what Zing's -XX:+UseRdtsc flag is for), but it can be dangerous to rely on it by default.
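For code that cannot control its deployment environment, one defensive pattern is to clamp measured intervals so a misbehaving clock never yields a negative duration. A hypothetical sketch (the Stopwatch class is illustrative, not an existing API):

```java
public final class Stopwatch {
    private long startNanos;

    public void start() {
        startNanos = System.nanoTime();
    }

    // Clamp to zero so a (rare) backward step in the underlying clock
    // surfaces as a zero-length interval rather than a negative duration.
    public long elapsedNanos() {
        return Math.max(0L, System.nanoTime() - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        Stopwatch sw = new Stopwatch();
        sw.start();
        Thread.sleep(10);
        System.out.println("elapsed >= 0: " + (sw.elapsedNanos() >= 0));
    }
}
```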
A hypervisor wouldn't take the TSC backwards; it can slow the TSC but not take it backward, unless they virtualize the CPU bits for stable TSC differently, which happens, but I doubt VMware (and better hypervisors) take the TSC back.
A hypervisor wouldn't take the TSC backwards within one vcore.
But vcores are scheduled individually, which means that any slewing done to hide a long jump forward in the physical TSC (in situations where a vcore was not actually running on a physical core for a "long enough" period of time) is done individually within each vcore and its virtualized TSC. (Synchronizing the virtualized TSC slewing across vcores would require either synchronizing their scheduling such that the entire VM would be either "on" or "off" cores at the same time, or making the virtualized TSC only tick forward in large quanta, or only when all vcores are actively running on physical cores, all of which would cause some other dramatic strangeness.)
Multiple vcores belonging to the same guest OS can (and usually will) end up running simultaneously on multiple real cores, which obviously means that during slewing periods they will be showing vastly differing virtualized TSC values (with gaps of 10s of msec) until the "slewing" is done. All it takes is a "lucky timing" context switch within the guest OS, moving a thread from one vcore to another (for whichever of the many reasons the guest OS might decide to do that), for *your* program to observe the TSC "jumping backwards" by 10s of msec between one RDTSC execution and another.
It's the same issue as a physical machine with multiple sockets: the TSC isn't synced across those different sockets.
The hypervisor keeps an offset per unscheduled vcore and makes sure it is monotonic. Although we at KVM considered slewing/speeding the TSC on vcores, primarily for live migration, we didn't do it in practice.
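The per-vcore offset idea can be shown with a toy model. This is a deliberate simplification of the mechanism described above (a hypervisor does this in hardware-assisted virtualization state, not in guest-visible code), but it captures how an offset keeps the guest-visible TSC monotonic across descheduling:

```java
public class VirtualTsc {
    // Toy model: guest_tsc = host_tsc + offset. When a vcore is
    // rescheduled after being off-CPU, the offset is adjusted so the
    // guest-visible TSC continues from where it left off instead of
    // jumping forward, and it never moves backwards.
    private long offset;
    private long lastGuestTsc;

    long read(long hostTsc) {
        long guest = hostTsc + offset;
        lastGuestTsc = Math.max(lastGuestTsc, guest);
        return lastGuestTsc;
    }

    void onReschedule(long hostTscNow) {
        // Hide the gap the vcore was descheduled for.
        offset = lastGuestTsc - hostTscNow;
    }

    public static void main(String[] args) {
        VirtualTsc vtsc = new VirtualTsc();
        long a = vtsc.read(1_000);   // running: guest sees 1000
        // vcore descheduled for a long time; host TSC reaches 50,000
        vtsc.onReschedule(50_000);
        long b = vtsc.read(50_010);  // guest sees 1010, not 50010
        System.out.println(a + " " + b + " monotonic=" + (b >= a));
    }
}
```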
One of my old team members wrote this pretty good write-up (in 2011 but still relevant):