Measuring execution speed -- profiling in Forth?

Assad

unread,

Mar 15, 2014, 9:48:00 AM3/15/14

to

Background:

A while back I'd asked a question about efficient implementations of
algorithms in Forth, specifically the trade-offs between fewer memory
accesses (presumed always slow) vs. more ordinary operations (presumed
to be fast).

The answers quite helpfully pointed out that *theoretical* predictions
of an algorithm's efficiency can be quite different from *empirical*
evidence of running an algorithm, since the actual speed of memory
accesses in the algorithm depend significantly on the design of a
particular machine as well as how the program gets compiled.

As an example, rigidly minimising memory accesses on the principle
that more accesses are *always* slower is apparently not always the
case with modern cache designs.

Question:

So a practical question:

How to measure the execution speed of an algorithm in Forth?

In other languages, I would use a profiler which would break out the
execution speed taken by various parts of the code, which is quite
helpful.

How might one do this in Forth?

If I were to build it myself, I would essentially fetch system time
(in milliseconds) and then have it accumulate the time spent within a
word into a variable.

Although I believe someone mentioned that GForth has a profiler...?

Thanks,
Assad

Albert van der Horst

unread,

Mar 15, 2014, 10:40:24 AM3/15/14

to

In article <qdm8i912ie2rkiq7v...@4ax.com>,

On Pentiums practical succes can be had with the real time counter,
allowing for nanosecond resolution.
You need a single code word in assembler.
This depends on the Forth. The following code from ciforth can help:

0 ( TICKS PAST? ) CF: ?64 \ AvdH A2oct21
1 \ We can't use the assembler in 64 bits.
2 WANT ASSEMBLER HEX \ Just for using CODE
3
4 CODE (TICKS)
5 0F C, 31 C, \ RDTSC,
6 50 C, 52 C, \ PUSH|X, AX| PUSH|X, DX|
7 48 C, AD C, FF C, 20 C, \ NEXT,
8 END-CODE \ Code now in 2 32 bit things.
9 \ Leave a DOUBLE value.
10 : TICKS (TICKS) 20 LSHIFT OR 0 ;
11
12 \ For a TIME in ticks: it IS in the past.
13 : PAST? DNEGATE TICKS D+ SWAP DROP 0< 0= ;
14 DECIMAL
15

You see that the code is hex, such that no assembler is needed.
However you must figure out whether the data stack is implemented
as the Pentium SP register. Furthermore NEXT, is even more
system dependant.

Despite rumours, in my experience the results are quite reproducible.
The timing for the tingle tangle (with real time decisions about
when to play and 60 mS activation time for relays) relies on it.

The 32 bit code checks whether the instruction is available.
For 64 bits processors in practice one can forego that.

WANT MEASURE-PRIME
TASK : ISN'T UNIQUE

THE ISO BYTE BENCHMARK LASTED 1.431mS
FORGET ``MEASURE-PRIME'' Y/N

(This is one iteration of the Gillbreath sieve).
Typical deviations stay within 5%.

>
>Thanks,
>Assad
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Elizabeth D. Rather

unread,

Mar 15, 2014, 5:40:20 PM3/15/14

to

On 3/15/14 3:48 AM, Assad wrote:
...

> Question:
>
> So a practical question:
>
> How to measure the execution speed of an algorithm in Forth?
>
> In other languages, I would use a profiler which would break out the
> execution speed taken by various parts of the code, which is quite
> helpful.
>
> How might one do this in Forth?
>
> If I were to build it myself, I would essentially fetch system time
> (in milliseconds) and then have it accumulate the time spent within a
> word into a variable.
>
> Although I believe someone mentioned that GForth has a profiler...?

Different Forths have different instrumentation features. FORTH, Inc.
systems (SwiftForth and SwiftX) both have clocks whose resolution
depends on the facilities of the individual platform, but in all cases
run a single-cell circular timer that's used for timing intervals.
COUNTER returns the current value of the timer, so two calls to COUNTER
will give two measures that can be subtracted to get the interval. The
word TIMER calls COUNTER, does the subtraction, and displays the
interval to milliseconds. So, to get the time occupied by FOO, you'd run
something like this:

: TEST ( -- ) COUNTER 10000 0 DO FOO LOOP TIMER ;

You can run this with nothing inside the loop to get the cost of the
actual loop, if you wish. You can adjust the number of times through the
loop as appropriate to get sufficient resolution.

SwiftX (our cross-compiler) would run this entirely on the target, using
its clock, so host issues wouldn't affect the timing.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

foxaudio...@gmail.com

unread,

Mar 16, 2014, 6:50:13 PM3/16/14

to

This goes way back and I can't put my fingers on the code right now, but I recall an article in Forth Dimensions (maybe?) where ':' was re-defined to add profiling to every word after the re-defined version was in the system.

Does anyone else remember that or am I delusional?

BF

Rod Pemberton

unread,

Mar 16, 2014, 8:24:33 PM3/16/14

to

On Sun, 16 Mar 2014 18:50:13 -0400, <foxaudio...@gmail.com> wrote:
> On Saturday, March 15, 2014 9:48:00 AM UTC-4, Assad Ebrahim wrote:

>> [...]

>> Although I believe someone mentioned that GForth has a profiler...?
>

> This goes way back and I can't put my fingers on the code right now, but
> I recall an article in Forth Dimensions (maybe?) where ':' was
> re-defined to add profiling to every word after the re-defined version
> was in the system.
>
> Does anyone else remember that or am I delusional?
>

There couple articles in FD and it was discussed on c.l.f. in the past too.
So, you're not delusional ... ;-)

These two are likely the articles you remember:
V05N2 "Tracer for Colon Definitions" by Rieks Joosten
V06N5 "An Augmented TRACE" by Andreas Goppold

This article has a tracer too:
V11N5 "Developing a Step Trace" by Chester H. Page

Forth Dimensions
http://www.forth.org/fd/contents.html

Rod Pemberton

Bernd Paysan

unread,

Mar 16, 2014, 8:32:42 PM3/16/14

to

foxaudio...@gmail.com wrote:

> This goes way back and I can't put my fingers on the code right now, but I
> recall an article in Forth Dimensions (maybe?) where ':' was re-defined to
> add profiling to every word after the re-defined version was in the
> system.
>
> Does anyone else remember that or am I delusional?

That's one possibility; you'll get a count, how often each word is called.

Gforth git has a word TIMER: (in assert.fs), you cam define several times to
time different parts of the code. My convention is that each timer starts
with a +, and you define a generic timer +REST. If you want to check how
much a particular word contributes to the overall run-time, you define this
word (let's call it FOO) as follows:

timer: +rest
timer: +foo

: foo ( in -- out ) +rest \ does something which needs to be timed
bla bla +foo ;

.TIMES will print out all timers (in reverse order of their definition).
Whether the timing actually is measured depends on the switch PROFILE(,
which you can turn on with +DB PROFILE( and off with -DB PROFILE( (it's
turned on by default). It is supposed that you say PROFILE( .TIMES ) so
that you get the timers output only when you are in profiling mode. This is
the actual run-time in nanoseconds plus overhead of the measurement...

This code is still experimental, comments welcome.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/

Coos Haak

unread,

Mar 16, 2014, 8:32:54 PM3/16/14

to

Op Sat, 15 Mar 2014 13:48:00 +0000 schreef Assad:

If you can read Dutch (the commented source is given in English) you could
try the profiler of Marcel Hendrix in the Vijgeblad (figleaf) archive. You
can find it here:
http://www.forth.hcc.nl/vijgebladarchief
Issue 42, pages 18-34.

--
Coos

CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html

David Kuehling

unread,

Mar 17, 2014, 3:00:24 AM3/17/14

to

>>>>> "Albert" == Albert van der Horst <alb...@spenarnc.xs4all.nl> writes:

> In article <qdm8i912ie2rkiq7v...@4ax.com>,
> Assad <assad....@alum.swarthmore.edu> wrote:

[..]

>> How to measure the execution speed of an algorithm in Forth?
>>
>> In other languages, I would use a profiler which would break out the
>> execution speed taken by various parts of the code, which is quite
>> helpful.
>>
>> How might one do this in Forth?
>>
>> If I were to build it myself, I would essentially fetch system time
>> (in milliseconds) and then have it accumulate the time spent within a
>> word into a variable.
>>
>> Although I believe someone mentioned that GForth has a profiler...?

> On Pentiums practical succes can be had with the real time counter,
> allowing for nanosecond resolution. You need a single code word in
> assembler. This depends on the Forth. The following code from ciforth
> can help:

Here is code for Gforth on AMD64 (x86-64)-systems that reads the cycle
counter. This should work on any 64-bit x86 linux:

abi-code tsc@ ( -- u )
rdtsc
.d ax -8 di d) mov
.d dx -4 di d) mov
-8 di d) ax lea \ compute new sp in result reg
ret
end-code

You can then go on and redefine : and ; to add collect execution count
and time information for all forth routines defined in your code. You
can see a simple profiler based on this concept is here:

http://mosquito.dyndns.tv/opensvn/free/branches/brainless-abi/profiler.fs

A more portable version without using the x86 cycle-counter is here:

http://sourceforge.net/p/forth-brainless/code/HEAD/tree/trunk/profiler.fs

It uses Gforth's CPUTIME word instead (less accurate and more overhead).

Of course the overhead added to all definitions is very noticeable and
collected timing data will therefore not be representative (unless you
are more careful to only enable the profiler on definitions that take
about an order of magnitude more execution time than the profiling
instrumentation code takes).

cheers,

David
--
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk2.gpg
Fingerprint: B63B 6AF2 4EEB F033 46F7 7F1D 935E 6F08 E457 205F

Rod Pemberton

unread,

Mar 17, 2014, 3:33:13 AM3/17/14

to

On Sat, 15 Mar 2014 09:48:00 -0400, Assad
<assad....@alum.swarthmore.edu> wrote:

> So a practical question:
>
> How to measure the execution speed of an algorithm in Forth?
>

A few people mentioned the RDTSC instruction for x86 processors.
As mentioned in the past here, RDTSC is only for single-core
processors, Pentium or later. RDTSC doesn't work correctly on
multiple core processors. RDTSCP is for multiple core processors.
You're also supposed to issue a serializing instruction prior to
executing RDTSC or RDTSCP. One such instruction is CPUID which
can be used to determine if the RDTSC and/or RDTSCP instructions
are available. Even when using RDTSC and/or RDTSCP some processors:
1) fail to update the TSC (Time Stamp Counter) using the actual
clock speed of the processor
2) suffer from TSC drifts
3) have TSCs which are affected by power management events

x86 PCs have other timers you can use, depending on the generation
of the PC and availability of access to the timers via the operating
system, such as: DRAM refresh, RTC, PIT, LAPIC, ACPI PMT, and HPET.

DRAM - Dynamic Random Access Memory refresh
RTC - Real Time Clock
PIT - Programmable Interval Timer
LAPIC - Local Advanced Programmable Interrupt Controller
ACPI PMT - Advanced Configuration and Power Interface Power Management
Timer
HPET - High Precision Event Timer

Rod Pemberton

m.a.m....@tue.nl

unread,

Mar 17, 2014, 4:07:40 AM3/17/14

to

On Monday, March 17, 2014 1:32:54 AM UTC+1, Coos Haak wrote:
> Op Sat, 15 Mar 2014 13:48:00 +0000 schreef Assad:

[..]

> http://www.forth.hcc.nl/vijgebladarchief
> Issue 42, pages 18-34.

[..]

There is an *extensive* CLF thread "lprofiler Forth tool" from 1999,
which also contains the URL to my personal page where the source code
can be downloaded. (A better source might be in the old iForth 4.0.7 eval).

If I have time I will revive this subject, the CLF thread generated a lot of interesting ideas.

-marcel

Assad

unread,

Mar 17, 2014, 4:23:34 AM3/17/14

to

On Sat, 15 Mar 2014 13:48:00 +0000, Assad
<assad....@alum.swarthmore.edu> wrote:

>
>How to measure the execution speed of an algorithm in Forth?
>

>How might one do this [profile code] in Forth?
>
>If I were to build it myself, I would..

>
>Although I believe someone mentioned that GForth has a profiler...?
>

Thanks all for the many suggestions. Great stuff -- there's lots to
try!

- Assad

Albert van der Horst

unread,

Mar 20, 2014, 4:43:10 PM3/20/14

to

In article <op.xcuu9nka6zenlw@localhost>,

Rod Pemberton <dont_us...@xnothavet.cqm> wrote:
>On Sat, 15 Mar 2014 09:48:00 -0400, Assad
><assad....@alum.swarthmore.edu> wrote:
>
>> So a practical question:
>>
>> How to measure the execution speed of an algorithm in Forth?
>>
>
>A few people mentioned the RDTSC instruction for x86 processors.
>As mentioned in the past here, RDTSC is only for single-core
>processors, Pentium or later. RDTSC doesn't work correctly on

Party line/folklore/urban legend.
Read my post. We Forthers care about whether it actually works.

Just try it.
TICKS 1000 MS TICKS

>multiple core processors. RDTSCP is for multiple core processors.
>You're also supposed to issue a serializing instruction prior to
>executing RDTSC or RDTSCP. One such instruction is CPUID which
>can be used to determine if the RDTSC and/or RDTSCP instructions
>are available. Even when using RDTSC and/or RDTSCP some processors:
> 1) fail to update the TSC (Time Stamp Counter) using the actual
> clock speed of the processor
> 2) suffer from TSC drifts
> 3) have TSCs which are affected by power management events

Have you actually tried any of this out.

Can you point to anywhere the obscurest corner of the Internet
to find a story about a multicore processor that fails to have
the RDTSC?

<SNIP>

>
>Rod Pemberton

Groetjes Albert

Rod Pemberton

unread,

Mar 20, 2014, 6:54:42 PM3/20/14

to

On Thu, 20 Mar 2014 16:43:10 -0400, Albert van der Horst
<alb...@spenarnc.xs4all.nl> wrote:
> In article <op.xcuu9nka6zenlw@localhost>,
> Rod Pemberton <dont_us...@xnothavet.cqm> wrote:
>> On Sat, 15 Mar 2014 09:48:00 -0400, Assad
>> <assad....@alum.swarthmore.edu> wrote:

>>> So a practical question:
>>>
>>> How to measure the execution speed of an algorithm in Forth?
>>>
>>
>> A few people mentioned the RDTSC instruction for x86 processors.
>> As mentioned in the past here, RDTSC is only for single-core
>> processors, Pentium or later. RDTSC doesn't work correctly on
>
> Party line/folklore/urban legend.

Really? ...

> Read my post. We Forthers care about whether it actually works.
>
> Just try it.
> TICKS 1000 MS TICKS
>

...

>> multiple core processors. RDTSCP is for multiple core processors.
>> You're also supposed to issue a serializing instruction prior to
>> executing RDTSC or RDTSCP. One such instruction is CPUID which
>> can be used to determine if the RDTSC and/or RDTSCP instructions
>> are available. Even when using RDTSC and/or RDTSCP some processors:
>> 1) fail to update the TSC (Time Stamp Counter) using the actual
>> clock speed of the processor
>> 2) suffer from TSC drifts
>> 3) have TSCs which are affected by power management events
>
> Have you actually tried any of this out.
>

I do have sample C code in DOS for calling RDTSC, but not RDTSCP.
Except for the sample, I don't have any code that uses either.
I do need to implement it for my Forth and my stalled OS project.

> Can you point to anywhere the obscurest corner of the Internet
> to find a story about a multicore processor that fails to have
> the RDTSC?
>

Fails to *have* an RDTSC instruction, or fails to have a correctly
*working* RDTSC instruction?

Every x86 processor since the Pentium _should_ have the RDTSC instruction.

As for pointing to stories about multicore processors that fail to
have a correctly working RDTSC and/or RDTSCP anywhere in the obscurest
corners of the Internet, yes, I can point to a few dark corners:

"TSC drift can occur on K8 AMD multi-processor platforms and
single-processor dual-core platforms as they do not provide frequency
independent TSC. This drift does not occur on single-processor single-core
platforms for obvious reasons."
http://web.archive.org/web/20101129154633/http://developer.amd.com/documentation/articles/Pages/1214200692.aspx

"...efficeon updates RDTSC independently of the actual CPU clock speed and
at a rate that corresponds to the maximum CPU frequency."
http://www.vanshardware.com/articles/2004/05/040517_efficeonFreeze/040517_efficeonFreeze.htm

"The behavior of the RDTSCP instruction is implementation dependent. The
TSC counts at a constant rate, but may be affected by power management
events (such as frequency changes), depending on the processor
implementation.
If CPUID 8000_0007.edx[8] = 1, then the TSC rate is ensured to be invariant
across all P-States, C-States, and stop-grant transitions (such as STPCLK
Throttling); therefore, the TSC is suitable for use as a source of time."
-from AMD64 Architecture Programmer's Manual Volume 3: General-Purpose
and System Instructions, AMD64 24594 r3.14

The following .pdf says that out-of-order-execution CPU's (PII,
PPro) need to use CPUID and RDTSC:
1) together
2) together three or four times before using them to time code

"Using the RDTSC Instruction for Performance Monitoring"
http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

FYI, list of serializing instructions via Sandpile.org
http://www.sandpile.org/x86/coherent.htm

Most of that was harvested from other posts of mine to comp.lang.asm.x86 in
'06 and '08. I updated one link to WayBack archive's cache of an AMD page.

Rod Pemberton

Assad Ebrahim

unread,

Apr 1, 2014, 7:18:13 AM4/1/14

to

On 15 Mar 2014 14:40:24 GMT, alb...@spenarnc.xs4all.nl (Albert van der

Horst) wrote:

>In article <qdm8i912ie2rkiq7v...@4ax.com>,
>Assad <assad....@alum.swarthmore.edu> wrote:
>>
>>How to measure the execution speed of an algorithm in Forth?
>>

>>If I were to build it myself, I would essentially fetch system time
>>(in milliseconds) and then have it accumulate the time spent within a
>>word into a variable.
>>
>

>On Pentiums practical succes can be had with the real time counter,
>allowing for nanosecond resolution.
>You need a single code word in assembler.
>This depends on the Forth. The following code from ciforth can help:
>
> 0 ( TICKS PAST? ) CF: ?64 \ AvdH A2oct21
> 1 \ We can't use the assembler in 64 bits.
> 2 WANT ASSEMBLER HEX \ Just for using CODE
> 3
> 4 CODE (TICKS)
> 5 0F C, 31 C, \ RDTSC,
> 6 50 C, 52 C, \ PUSH|X, AX| PUSH|X, DX|
> 7 48 C, AD C, FF C, 20 C, \ NEXT,
> 8 END-CODE \ Code now in 2 32 bit things.
> 9 \ Leave a DOUBLE value.
>10 : TICKS (TICKS) 20 LSHIFT OR 0 ;
>11
>12 \ For a TIME in ticks: it IS in the past.
>13 : PAST? DNEGATE TICKS D+ SWAP DROP 0< 0= ;
>14 DECIMAL
>15
>
>

>The 32 bit code checks whether the instruction is available.
>For 64 bits processors in practice one can forego that.
>

Using rdtsc worked great.

I modified the code to work on Win32Forth using its assembler:

\ modified for W32F
\ thanks to code snippet from AvdH A2oct21
code ticks ( -- ts64_hi ts64_lo ) \ returns timestamp (64 bit
double value). Units: CPU cycles for a single core
push ebx \ preserve TOS
mov ecx, edx \ protect edx since it will be clobbered
rdtsc \ read time-stamp counter into edx:eax
push edx \ to NOS high order 32-bits
mov ebx, eax \ to TOS low order 32-bits
mov edx, ecx \ restore edx
next,
end-code

\ using Bernd's benchmark method... now with ticks
: benchmark ( [xt] -- ) ticks ' execute ticks 2swap d- swap . . ; \ provide time in cpu cycles for executing xt (can ignore the first value if timing <1s)

>You see that the code is hex, such that no assembler is needed.
>However you must figure out whether the data stack is implemented
>as the Pentium SP register. Furthermore NEXT, is even more
>system dependant.

Where would I start to figure out how the internals are implemented in
GForth 0.7?

I took a peek into gforth\arch\386\machine.h, but didn't get very far.

Would be nice to be able to write something like the above in GForth
using the GForth assembler for 386 and above.

Any help to get started here would be appreciated!

(If someone has a simple code snippet for GForth assembler that works
on Windows, that would get me started and I could figure out the rest
from there... ;)

>
>Despite rumours, in my experience the results are quite reproducible.
>The timing for the tingle tangle (with real time decisions about
>when to play and 60 mS activation time for relays) relies on it.

Yes -- seems so.

Running your sleep tests using

1000 ms

and varying the length of the sleep showed acceptable precision sub
second with the low 32 bits.

For elapsed times greater than 1 second need to use the full 64 bits
returned from ticks.

If low precision is fine, then time&date gives accuracy +/- 1s.