
rdtsc instruction


Bartc

Oct 15, 2015, 12:14:06 PM

This reads the time stamp counter.

But why is it so slow?

I had tried at one time to help profile a byte-code interpreter, but
most of the time was spent executing rdtsc! It slowed everything down by
about 4 times.

I tried it again today, and it was 3 times as slow as Windows'
GetTickCount! (A loop normally taking 1.2 seconds to execute, was 1.8
seconds calling GetTickCount, and 3.0 seconds with rdtsc.)

(rdtsc is used in assembly code, or as an inline assembly instruction.)

--
Bartc

Bernhard Schornak

Oct 15, 2015, 3:19:01 PM

Execution time for RDTSC(P) depends on the processor design as well
as on traffic between different cores and internal storage like the
timestamp counter(s). RDTSCP needs about 70...90 clocks to execute,
because it has to wait for pending instructions to execute and also
has to wait for the internal data transfer bus ('HyperTransport' on
AMD machines) - only one core can access this bus at any time. Same
applies to RDTSC. It should be a little bit faster, though, because
it does not wait for pending instructions.

If you want to measure latencies, it is not a good idea to issue RDTSC
after each iteration. Better to use one inner loop running your testee
N times, 'embedded' in test logic that begins and ends with RDTSC.
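
A minimal sketch of that approach, assuming GCC or Clang on x86, where __rdtsc() comes from <x86intrin.h>. The names read_ticks, testee and ticks_per_call are made up for illustration, and the clock_gettime fallback is only there so the sketch still builds on non-x86 machines. The counter is read twice in total, so its cost is spread over all N calls:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds: nanoseconds from a monotonic clock. */
static uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical code under test; the volatile sink stops the
   compiler from optimizing the work away. */
static volatile uint32_t sink;
static void testee(void) { sink = sink * 1664525u + 1013904223u; }

/* Average ticks per call: two counter reads bracket `iterations` calls. */
static double ticks_per_call(void (*fn)(void), uint64_t iterations) {
    assert(iterations > 0);
    uint64_t start = read_ticks();
    for (uint64_t i = 0; i < iterations; i++)
        fn();
    uint64_t stop = read_ticks();
    return (double)(stop - start) / (double)iterations;
}
```

Calling ticks_per_call(testee, 1000000) gives an average that includes loop and call overhead, so it is best read as an upper bound rather than an exact latency.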


Greetings from Augsburg

Bernhard Schornak

Rick C. Hodgin

Oct 15, 2015, 9:19:24 PM

What machine is it being run on? It may not exist and is being trapped
to a #UD undefined opcode exception, and an OS handler is simulating
the value.

There is also advice that you use it only when it's guaranteed your
thread has affinity for a single core:

https://en.wikipedia.org/wiki/Time_Stamp_Counter

And there's a warning for Windows users:

"Under Windows platforms, Microsoft strongly discourages using
the TSC for high-resolution timing for exactly these reasons,
providing instead the Windows APIs QueryPerformanceCounter and
QueryPerformanceFrequency."

Best regards,
Rick C. Hodgin

Bartc

Oct 16, 2015, 7:08:28 AM

On 15/10/2015 19:49, Rick C. Hodgin wrote:
> On Thursday, October 15, 2015 at 12:14:06 PM UTC-4, Bartc wrote:
>> This reads the time stamp counter.
>>
>> But why is it so slow?

> What machine is it being run on? It may not exist and is being trapped
> to a #UD undefined opcode exception, and an OS handler is simulating
> the value.

Your link says it's been present since the Pentium. I'm running on some
AMD device; not sure which, but it's not a 486!

> There is also advice that you use it only when it's guaranteed your
> thread has affinity for a single core:
>
> https://en.wikipedia.org/wiki/Time_Stamp_Counter
>
> And there's a warning for Windows users:
>
> "Under Windows platforms, Microsoft strongly discourages using
> the TSC for high-resolution timing for exactly these reasons,
> providing instead the Windows APIs QueryPerformanceCounter and
> QueryPerformanceFrequency."

Yes, I get that there might be problems in using it as an accurate
timer. But I only need a rough idea of elapsed time.

(I've just discovered that if you do Windows programming, then if you
don't check the message queue for five seconds, your windows get cleared
and marked 'unresponsive'. So if busy executing byte-code for example,
then every N byte-codes it might be necessary to check the elapsed time
so that when 1 second has passed, I can call PeekMessage to reset the
time-out.

Using GetTickCount is one slow way of doing that. Compared with less
than 10ns per byte-code, I expected rdtsc to be super-fast in comparison.)
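
One way to keep the timer cost down, whichever timer is used, is to read it only every N byte-codes. A rough sketch in portable C follows. CHECK_INTERVAL, run_bytecodes and pump_messages are hypothetical names, and pump_messages is only a stand-in for a real PeekMessage loop. The standard clock() is used here so the sketch builds anywhere; GetTickCount or rdtsc could be dropped in at the same place:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define CHECK_INTERVAL 100000u   /* assumed: byte-codes between clock reads */

static unsigned pump_calls;
/* Placeholder for the real message pump (e.g. a PeekMessage loop). */
static void pump_messages(void) { pump_calls++; }

/* Runs `count` byte-codes and returns how often the clock was read. */
static uint64_t run_bytecodes(uint64_t count, double pump_period_s) {
    assert(pump_period_s >= 0.0);
    uint64_t clock_reads = 0;
    uint32_t countdown = CHECK_INTERVAL;
    clock_t last_pump = clock();   /* CPU time; close to wall time in a busy loop */

    for (uint64_t pc = 0; pc < count; pc++) {
        /* ... dispatch one byte-code here ... */
        if (--countdown == 0) {    /* cheap test on every byte-code */
            countdown = CHECK_INTERVAL;
            clock_t now = clock(); /* expensive read, once per interval */
            clock_reads++;
            if ((double)(now - last_pump) / CLOCKS_PER_SEC >= pump_period_s) {
                pump_messages();
                last_pump = now;
            }
        }
    }
    return clock_reads;
}
```

With these numbers, one million byte-codes cost ten clock reads instead of one million, so even a slow timer disappears into the noise.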

--
Bartc

Rick C. Hodgin

Oct 16, 2015, 11:24:37 AM

On Friday, October 16, 2015 at 7:08:28 AM UTC-4, Bartc wrote:
> On 15/10/2015 19:49, Rick C. Hodgin wrote:
> > On Thursday, October 15, 2015 at 12:14:06 PM UTC-4, Bartc wrote:
> >> This reads the time stamp counter.
> >>
> >> But why is it so slow?
>
> > What machine is it being run on? It may not exist and is being trapped
> > to a #UD undefined opcode exception, and an OS handler is simulating
> > the value.
>
> Your link says it's been present since the Pentium. I'm running on some
> AMD device; not sure which, but it's not a 486!

:-)

Hey, Bart. I know you. I know how much you hate bloated tools and
wasted hard disk space. I had to be sure you weren't running MS-DOS
6.22 on a 486, right?

:-)

> > There is also advice that you use it only when it's guaranteed your
> > thread has affinity for a single core:
> >
> > https://en.wikipedia.org/wiki/Time_Stamp_Counter
> >
> > And there's a warning for Windows users:
> >
> > "Under Windows platforms, Microsoft strongly discourages using
> > the TSC for high-resolution timing for exactly these reasons,
> > providing instead the Windows APIs QueryPerformanceCounter and
> > QueryPerformanceFrequency."
>
> Yes, I get that there might be problems in using it as an accurate
> timer. But I only need a rough idea of elapsed time.

Yup. And there's a Win32 function which allows you to set thread affinity
so you're always running on the same core:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx

> (I've just discovered that if you do Windows programming, then if you
> don't check the message queue for five seconds, your windows get cleared
> and marked 'unresponsive'. So if busy executing byte-code for example,
> then every N byte-codes it might be necessary to check the elapsed time
> so that when 1 second has passed, I can call PeekMessage to reset the
> time-out.

A common thing to do is to run the other code you have in a separate
thread, and keep your main message queue running in the startup
thread. Note also, however, that Windows will only dispatch messages
to threads which inquire on HWND values from the same thread they
were created, unless you re-direct them with SetWindowLong() and
GWL_WNDPROC.

> Using GetTickCount is one slow way of doing that, compared with less
> than 10ns per byte-code, I expected rdtsc to be super-fast in comparison.)

In my prior experience (mostly 2011 and earlier) it was very fast.
I'm wondering why it's changed so drastically. Seems silly. I would
think there would just need to be a simple synchronization as to clock
cycles, and that's just a constantly incrementing count. Perhaps it
hints at the fact that the internals of modern CPUs are moving away
from synchronous, and becoming asynchronous. Even so, I don't see how
they couldn't keep a simple incrementing register per tick in sync
across all cores. Surely the clock generator could always send out a
high speed signal to that unit, even if the clock is being slowed down
to conserve power / heat.

Melzzzzz

Oct 16, 2015, 1:27:08 PM

On Thu, 15 Oct 2015 21:05:12 +0200
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Bartc wrote:
>
>
> > This reads the time stamp counter.
> >
> > But why is it so slow?
> >
> > I had tried at one time to help profile a byte-code interpreter,
> > but most of the time was spent executing rdtsc! It slowed
> > everything down by about 4 times.
> >
> > I tried it again today, and it was 3 times as slow as Windows'
> > GetTickCount! (A loop normally taking 1.2 seconds to execute, was
> > 1.8 seconds calling GetTickCount, and 3.0 seconds with rdtsc.)
> >
> > (rdtsc is used in assembly code, or as an inline assembly
> > instruction.)
>
>
> Execution time for RDTSC(P) depends on the processor design as well
> as on traffic between different cores and internal storage like the
> timestamp counter(s). RDTSCP needs about 70...90 clocks to execute,
> because it has to wait for pending instructions to execute and also
> has to wait for the internal data transfer bus ('HyperTransport' on
> AMD machines) - only one core can access this bus at any time. Same
> applies to RDTSC. It should be a little bit faster, though, because
> it does not wait for pending instructions.

On Haswell, RDTSC is about 50% faster than RDTSCP.


Melzzzzz

Oct 16, 2015, 1:27:22 PM

On Thu, 15 Oct 2015 17:08:08 +0100
Bartc <b...@nospicedham.freeuk.com> wrote:

> This reads the time stamp counter.
>
> But why is it so slow?

Dunno. RDTSCP on Haswell takes about the same time as multiplying two
4x4 (float64) matrices (AVX) ;)


wolfgang kern

Oct 16, 2015, 2:58:51 PM

Bartc asked:

> This reads the time stamp counter.

haven't seen it

> But why is it so slow?

For some years now the TSC has been part of the north bridge... and even
though this means a CPU-near path, it became a PCI device after all.

My AMD K7 shows ~19 cycles latency for RDTSC,
while all later CPUs show 65..99 cycles for RDTSC.

> I had tried at one time to help profile a byte-code interpreter, but
> most of the time was spent executing rdtsc! It slowed everything down by
> about 4 times.

> I tried it again today, and it was 3 times as slow as Windows'
> GetTickCount! (A loop normally taking 1.2 seconds to execute, was 1.8
> seconds calling GetTickCount, and 3.0 seconds with rdtsc.)
>
> (rdtsc is used in assembly code, or as an inline assembly instruction.)

RDTSC is still a valuable instruction, and once you have checked its own
latency it's also pretty usable for determining single-instruction timing.
But keep in mind that a multi-core, IRQ-driven OS will spoil your
measuring attempts ....
So a large time window with a count of iterations is for sure the
wrong way to benchmark a piece of code in such environments.
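
Checking RDTSC's own latency could look roughly like this (GCC or Clang on x86 assumed; read_ticks and min_read_overhead are hypothetical names, and the clock_gettime branch exists only so the sketch builds elsewhere). Taking the minimum over many back-to-back reads is one way to discard samples that an interrupt or task switch has spoiled:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds: nanoseconds from a monotonic clock. */
static uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Smallest gap seen between two back-to-back counter reads. */
static uint64_t min_read_overhead(unsigned samples) {
    assert(samples > 0);
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < samples; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;   /* keep the least-disturbed sample */
    }
    return best;
}
```

The result can then be subtracted from a short measurement taken with the same pair of reads.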

__
wolfgang

Bernhard Schornak

Oct 17, 2015, 9:53:05 AM

Haswell probably has a very deep prefetcher, so RDTSCP has to wait for
a lot of instructions to be executed. On the other hand, RDTSC is
not as accurate as RDTSCP, because pending instructions influence
the test result. That's the reason why one should issue one CPUID
before the RDTSC. Doing that probably is slower than RDTSCP - two
instructions doing one part of the work, each, should use up some
more execution time than one instruction performing both tasks in
one gulp...
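
For comparison, the two variants might be written like this with GCC or Clang on x86. This is only a sketch: has_rdtscp, tsc_cpuid_rdtsc and tsc_rdtscp are names invented here, and CPUID is issued through inline assembly so that the compiler cannot drop it when its results are unused. RDTSCP support is reported in CPUID leaf 0x80000001, EDX bit 27, and should be checked before the instruction is used:

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>       /* __get_cpuid() on GCC and Clang */
#include <x86intrin.h>   /* __rdtsc(), __rdtscp() */

/* CPUID leaf 0x80000001, EDX bit 27 reports RDTSCP support. */
static int has_rdtscp(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 27) & 1u);
}

/* Variant 1: CPUID is a serializing instruction, so earlier
   instructions complete before RDTSC reads the counter. */
static uint64_t tsc_cpuid_rdtsc(void) {
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    return __rdtsc();
}

/* Variant 2: RDTSCP waits for earlier instructions by itself;
   aux receives the IA32_TSC_AUX value and is ignored here. */
static uint64_t tsc_rdtscp(void) {
    unsigned aux;
    assert(has_rdtscp());
    return __rdtscp(&aux);
}
#endif
```

Timing each variant in a loop on the machine in question would show which one is cheaper there; the figures quoted in this thread suggest the answer differs between processors.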


Have a nice weekend!

Bernhard Schornak

Melzzzzz

Oct 17, 2015, 12:08:22 PM

On Sat, 17 Oct 2015 15:46:55 +0200
Heh, CPUID (EAX=0) alone on Haswell is 3 times slower than RDTSCP ;)

>
>
> Have a nice weekend!
>
> Bernhard Schornak

All the best!
Branimir Maksimovic.
