I need to be able to time very short loops accurately (as in "see how long
it takes to execute"; I DON'T want a short-interval timer. There's a big
difference). This is important in my code, because I'm writing a DLL
that's a plug-in to a real-time multitrack HD audio recorder. The plug-in
provides audio processing functions to the main program. A single file
sometimes reaches 50MB, and this program works with multiple such files.
You can see how a slow loop that operates on each sample can severely
bring down the system.
Anyway, I tried various ways to time the loops accurately, but haven't found
anything accurate enough. The best I found is to use
QueryPerformanceCounter(), which gives me a timing resolution of 838ns,
quite small steps. The problem is not with the timer's resolution, but with
the fact that every time I do the timing, I get different results. This
varies from day to day, and appears to be connected with what mood the OS is
in on that day, and also the position of the moon.
I try to make the timing intervals small, so that my code isn't pre-empted
before the timing period is over. I also tried putting a Sleep(0), or
Sleep(10), call just before I start timing, to make sure the likelihood of
my code being pre-empted is small. I also boost the PriorityClass and
ThreadPriority to their max levels just before the timing starts, and
restore them just after. These techniques do help to some extent, but I
still get
results that vary as much as 35% from one to the next run, and from day to
day. Obviously I can't trust this to see if one version of my code is 10%
faster than another or not.
It appears to me as if there are some low-level hardware interrupts
happening that I have no control over.
I also looked at the "Zen Timer", but it appears as if that's for DOS only.
So, my simple question is: how can I actually know how many cycles MY code
takes to complete? If I knew that, it wouldn't matter if my code was
interrupted, because the cycle count would be for my code only.
Imagine this feature in VC++:
You compile your code with debug info on, but the release build. Then you
start debugging, and place a breakpoint just before the code you want to
time. When you start single-stepping, you open a debug window (just like
the Watch, Memory, etc), and there you have some options as to what type of
CPU you want to simulate, and you can also reset the cycle counter. Then,
as you single-step (or run up to the next breakpoint), the debugger adds
cycles to the counter, depending on the selected CPU. I see no
reason why it can't do this. It already knows the assembly instructions, it
just needs to be taught how long each one takes, as well as about pairing,
overlapped instructions, etc etc. It would then be SO easy to time critical
code, because you could know EXACTLY how many cycles a piece of code will
take to execute, as well as simulating running it on different CPUs.
When can we expect such a cool feature in VC++? Anyone have any idea? And
why would it NOT be possible to do it? Would it be possible to add a
third-party plug-in to VC++ to do this, and are there any available at this
time?
Any insight, advice, comments welcome.
> Then, as you single-step [...], the debugger adds
> cycles to the counter
You have correctly summarized the problems inherent in timing small
sections of code, but I don't think your fix will work.
The time to access a *single* aligned DWORD may vary from an apparent
zero cycles (if run in parallel to, say, an FP instruction, with a
memory barrier following) to some 10 million (!) cycles (cache miss, TLB
miss, PTE faulted in from disk, data page faulted in from disk). Unless
you run this on a box with no VM, you are out of luck. Oh, and I hope
you unplugged the network card.
Even if we ignore all this, the debugger won't be able to accurately
keep track, as a single-step or other sort of interrupted flow of
execution means that the instruction queue is empty when your code is
resumed.
On the other hand, in real life, your DLL will also run in a real, busy,
system ... so maybe it would make more sense to run a test series under
heavy load and _then_ measure the total execution time, to arrive at the
slack time left on that config under such and such a load. In that case,
just run the code long enough to average out the asynchronous nature of
interrupts and the like.
If you post a reply, kindly refrain from emailing it, too.
Note to spammers: fel...@mvps.org is my real email address.
No anti-spam address here. Just one comment: IN YOUR FACE!
>I need to be able to time very short loops accurately
Did you try GetThreadTimes? This should give you the amount of time that the
thread spent executing, no matter how many times it was preempted by other
threads. Now, I'm not sure who gets charged by NT for time spent in hardware
interrupts, or exactly how and when NT charges the time (it should be doing
it either at the end of a quantum, or when the thread goes to sleep/wait),
but I think you ought to try this function out and see if the results are
usable.
Hope that helps.
dsta...@unibrain.com (Business mail only please),
dsta...@softlab.ece.ntua.gr (Personal mail), ICQ UIN 169876
Software Systems & Applications Chief Engineer, UniBrain, www.unibrain.com
Any sufficiently advanced bug is indistinguishable from a feature.
Some make it happen, some watch it happen, and some say, "what happened?"
How can we remember our ignorance, which our progress requires,
when we are using our knowledge all the time?
How a man plays a game shows something of his character,
how he loses shows all of it.
Yes, I understand all this, but the idea is to simulate a "perfect" machine
so that what you see is the best possible performance of your code under the
best possible situation. This way, you can concentrate on your code to get
ITS cycle count as low as possible. There's nothing you can do in your
code to guard against a noisy system, but if you can get YOUR OWN code
optimized as well as possible, then you've done all you can.
BTW, I have a Beta copy of Intel's VTune 3.0, and it actually has a similar
feature, where it will show you in detail a selected range of code's cycle
time, pairing, penalties, etc, etc. It makes some basic assumptions, as in
that the data is already in the cache, etc, etc, and then does a simulation
on the CPU of your choice. Unfortunately it's a little buggy, and it's
REALLY slow. And it's a pain to switch between VC++ and VTune for every
change you make in your code. (About 10 minutes turnaround on my computer).
The fact that it's able to give me this kind of info in the first place
tells me it's definitely possible.
Also, you can give a listing of your disassembled code to an assembly
programming guru, and in a short time he can tell you: "Under perfect
conditions, this code will take xxx cycles to execute". Why can't the
debugger do the same for me?
Excellent suggestion! While this is definitely a good way to go,
unfortunately, I'm using W98, and, of course, it's not available under W98.
But thanks for telling me about this function. I was unaware of it.
Sometime in the near future, I should switch to NT, and then I can start
playing with it. Until then, I guess I'm stuck in W98-land...
> and in a short time he can tell you: "Under perfect
> conditions, this code will take xxx cycles to execute".
> Why can't the debugger do the same for me?
The debugger can't do it because there are not enough people asking for
just that. And while I do not claim any honorific (except "slob",
maybe), I can tell you that I hate cycle-counting. Passionately. I'd
rather have a real result, with imprecision, from a profiler than a
manual count from me. :-\
> Now, I'm not sure who gets charged by NT for time spent in hardware
I fear this won't help, as the time spent in ISRs and such is charged to
the thread currently running. (Not 100% sure, though.)
For a definitive answer, we lean back and wait to see if Jamie H. notices us.
If you are using a Pentium or later chip, then you may want to use
an opcode which does just what you need: it gets a cycle count
from the chip. Look up RDTSC in a Pentium assembly coding reference.
Here is some code which emits the correct assembly code.
It compiles with VC 5.0 but i am sure you can modify it to
work with any other compiler.
/* cl -W3 -O2 -Ox rdtsc1.c */
#include <stdio.h>
#define RDTSC __asm _emit 0x0f __asm _emit 0x31

static __inline unsigned __int64 get_clock (void)
{
    unsigned long lo, hi;
    RDTSC                       /* result lands in edx:eax */
    __asm mov lo, eax
    __asm mov hi, edx
    return (((unsigned __int64) hi) << 32) + lo;
}

int main (void)
{
    unsigned __int64 t0, t1;
    t0 = get_clock ();
    /* ... code to be timed goes here ... */
    t1 = get_clock ();
    printf ("Cycles elapsed = %I64d\n", t1 - t0);
    return 0;
}
Steven Schulze wrote in message ...
So do I, which is why I'd love such a feature.
>rather have a real result, with imprecision, from a profiler than a
>manual count from me. :-\
Yes, but while it might be ok for some people, other people might need more
precise info, since their projects might require it.
BTW, I was thinking - the compiler DEFINITELY already knows this info, since
it needs it to do its optimizations. Why can't we be privy to this info as
well, if we need it?
Thanks, I'll DEFINITELY look into it.
BTW, does the RDTSC use the same clock as QueryPerformanceCounter()?
I believe you will still have some accuracy problems using this (RDTSC)
instruction due to the context switches that can occur while you are
timing code.
I'm not sure how much this helps but I recall reading in one of the
programming journals (WDJ, MSJ or WinTech) about a VxD someone wrote to
monitor context switches so that you could account for the number of CPU
cycles executed out of context from your timed code. The VxD allowed you
to register your RDTSC counter variable and thread ID with it. The VxD
would then subtract from your counter variable, the number of CPU cycles
executed outside of your threads context. Perhaps someone else remembers
this article more specifically.
You might also want to take a look at
for other issues that can affect accuracy of this instruction.
Hope this helps,
-- Ian MacDonald
I have somewhat of an answer to this problem. What I did (using
QueryPerformanceCounter()), was that I wrote a class that has functions
called Reset(), Start(), Stop(), and Show(). First you call Reset(), which
resets all members, then you run your code multiple times (I run it up to
1000 times), and every time you first call Start(), then do the code, then
call Stop(). The class then adds this time to the total time, as well as
keeping track of the fastest as well as the slowest run.
When you then call Show(), it shows the fastest, slowest and average times.
This gives pretty good results, because there's GOT to be at least one run
where the code wasn't pre-empted. Also, I try to keep the segments of code
I time pretty short, so that it's possible to get through them without
being pre-empted at least once.
The problem is simply that I think QueryPerformanceCounter is flaky.
Sometimes I get a time of 0 (even WITH my code in-between Start() and
Stop()), which should be impossible, given that the resolution is so small
(838ns). It's almost as if the counter itself is updated from software that
needs to be pre-empted first, although the info I have on it suggests it's a
hardware counter.
But by combining the RDTSC with the method I describe above, I might get
better results.
Note that perfect conditions are:
1. No other code running while the test code is running.
2. No caches enabled.
3. No other hardware process like DMA or refresh working.
4. No weird programming tricks being done in interrupt service routines.
In other words, it's not a real number, but just a guess. The only way
to make these decisions is to start with the cycle counts, program up a
test case, and TEST IT! This will still be only an approximation of what
happens when it gets out in the field.
RDTSC uses the clock speed of the processor, so if you are
using a 400MHZ processor each cycle would be 2.5 nanosecs.
What I use this most for is measuring cycles of assembly code.
Also, you may want to subtract the overhead of the RDTSC call itself
so that it does not affect your measurements. To do this, call RDTSC
twice in a row and make a note of the cycles elapsed; this would be
the overhead.
>I believe you will still have some accuracy problems using this (RDTSC)
>instruction due to the context switches that can occur while you are
To avoid context switches as much as possible, do a yield before
you start timing an assembly code section. This ensures that you
have a fresh timeslice from the OS when you wake up.
I think my original point is being lost here...
Here's my main point.
If the Cycle Guru can tell me "This code of yours will execute in 755
cycles under perfect conditions", I can then re-write my code and ask him
again, and he might say "Now your code will execute in 621 cycles under
perfect conditions".
Now, that kind of info can help me immensely to speed up my code. See, I
don't need to worry about DMA interrupts, etc, etc, because if I can get my
code to perform as fast as possible under "perfect" conditions, I know it'll
also perform better than my original code under a noisy system.
The problem with "TEST IT!" as you put it, is that I can't get an accurate
reading of how long MY code takes to execute, because the results vary too
much. I can't make informed decisions as to whether version A or version B
of my code is faster for that very reason.
BTW, Intel has an "Optimizing Tutor" that shows you how to optimize code,
and shows how many cycles a certain version of code will take to execute vs
another version. Obviously this is a legitimate way to analyze code, but
you insist that it is not. I wish I had a way to see my code in the same
way as Intel shows in its examples. That's what I'm saying.
Also, tell me, how do the people who hand-optimize code figure out which
version of code will run faster?
> Also, tell me, how do the people who hand-optimize code figure out which
> version of code will run faster?
Steve, there is an ancient rule of thumb that suggests that 10% of an
application's code will take 90% of the CPU time. My rule for optimization is
to wait until the end of the development, then if the code executes too slowly
for comfortable use, instrument the code with some performance tool; find the
10%; hand optimize it; repeat the process until the code executes "fast
enough". If there is no 10% (or 20%) then there may be some basic problem in
the design.
I'm afraid that the thread is simply saying you can't get exactly what you want
in an absolute sense. You can get what you want in a relative sense.
I know EXACTLY where time is spent in my code. It's the loop that executes
50 million times for a 50 million byte file. I can pinpoint it down to 7
lines in my code.
Also, FOR THIS SPECIFIC application, there's no "fast enough", there's only
"faster is better". It's an application that processes multiple audio files
(up to 44), each with a size of up to about 50MB for 5 minutes of audio
(that's the extreme case, though). It's a program that tries to do stuff in
real-time. Anytime you have the slightest performance bottleneck, you could
lose the ability to process one or two more of those 44 files in real-time.
So, this isn't a simple case of having the spell-checker take 5 instead of 6
seconds to finish; this is a REAL case where speed makes a difference in how
the program can be applied.
>I'm afraid that the thread is simply saying you can't get exactly what you
>want in an absolute sense. You can get what you want in a relative sense.
BTW, as I asked before, how does the compiler make its decisions as to
which version of your code will execute faster when it does its
optimizations, since it's not really executing your code to make a
"relative" decision, as you say? How on earth does it do it, since any
cycle counting on assembly code is being shot down as being unrealistic?
Why should I not be able to look at my code and make conclusions based on
information about cycle times, just as the compiler does (yes, I know about
pairing, penalties, etc, etc)? Or are we as programmers simply not able to
work down to this level anymore? I guess so.
Why don't you share those 7 lines of code with us? We can trade messages
forever, but your code won't get any faster this way!
If you are shy, or those seven lines are very dear, try the book "Zen of
Code Optimization" by Michael Abrash.
Consider this an FYI which I offer in case any of this is new to you. If
not, please ignore it, I'm not trying to heighten your exasperation.
You can get a listing of the machine instructions generated by the compiler
with the /FA switch. There used to be a time when this listing didn't take
into account optimizations, I don't know if that is still true. If so, you
should be able to look at the machine instructions with a debugger. Intel's
processor manual lists the number of clock cycles an instruction takes. MS
reprinted that info with the copy of MASM I bought several years ago. As it
is now fashionable to exclude printed docs under the guise of saving forests
I'm not sure that the assembler includes the info any more.
The trouble with the info is that it is not all simple. The number of clock
cycles an instruction takes depends on the processor, the addressing mode
and perhaps even the value of the arguments. For example, if I am reading
the table correctly, the integer multiply instruction (IMUL) takes from 13
to 42 clocks on a 486 using double word operands.
Also, I'm sure later versions of MASM can generate the processor
timing information on a listing, so I guess that by assembling the
compiler generated assembler output with MASM, you could get the
timing information on a listing. It's a bit long winded, so it's not
something you'd want to do very often.
Address is altered to discourage junk mail.
Please post responses to the newsgroup thread,
there's no need for follow up email copies.
IMHO it is not true. If you get your code to run 100 cycles less, it will
not be noticeable in a system where all the other threads may take many
thousands of cycles to execute. It is just a drop in the ocean.
I think you had better develop your code in a way where no other thread
will be permitted to run (in case you do it under win95). This will save
Actually, that was a generalization. I have about 40 - 50 such small
functions, varying from about 4 to 100 lines of code.
BTW, I've been able to write a class now that uses the RDTSC to measure
elapsed cycles. While it's not perfect, it's pretty good. I can get
repeatable results (which is what I need) down to less than 0.1% (after
doing multiple runs and taking the minimum result). Not bad.
No, it's not. If the LOOP takes 700 cycles for one version of the code, and
the loop takes 600 cycles for a different version of code (but doing the
same thing), then that's a 17% improvement for that specific loop. Now, if
my program spends an awful lot of time in that loop (as my code does,
processing a large file of audio data), then it's a SIGNIFICANT improvement.
I'm pretty aware of the fact that trying to optimize the WHOLE program is
fruitless, but since my program does what it does, it can benefit a lot from
optimizing the small sections of code that the program probably spends 95%
or more of its time in (while processing).
Ian MacDonald wrote:
> In article <#U7t9OGt...@uppssnewspub05.moswest.msn.net>,
> I believe you will still have some accuracy problems using this (RDTSC)
> instruction due to the context switches that can occur while you are
> timing code.
May '95 !?
Holy smokes. It seems like just yesterday.
I guess it just goes to show that you should never throw away any of your
old magazines.
Thanks for remembering.