Intel I9's out there that need a workout?

Albert van der Horst

Jun 16, 2018, 7:43:34 PM

In
https://github.com/albertvanderhorst/primecounting/blob/master/dynamic

you find the parallel prime counting program r10par.frt and the
Forth compiler lina8G to be run on a 64 bit linux system.
It uses dynamic programming instead of the usual sieve systems.

Download both and make them executable.
The option -p gives the number of parallel processes, default 2.

A typical run on AMD FX8370 (8 core)

time r10par.frt -p 7 1,000,000,000,000
#S : ISN'T UNIQUE
Under 1000000000000: 37607912018 primes

real 0m16.044s
user 0m14.825s
sys 0m0.028s

The i7s give dramatically better results than the AMDs.
(That is, between machines that gamers rate similarly,
factors of 3 to 5 are observed.)

I'm interested in results on the newer i9 intel processors,
especially the speedup compared to no parallelism.

The example above requires sqrt(10^12) * 2 CELLS, i.e. 16 Mbyte.
The 8G config should allow you to get to 10^18, but if you try
that you probably should use the -g option, to be sure.
Do not use (as opposed to allocate) more memory than
you have physically available. You can use an 8G lina as long
as your swap space is greater than 8G.
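
As a rough check of the memory figures above, here is a small sketch. It assumes 8-byte cells (a 64-bit Forth) and takes the 2 * sqrt(N) cell count from the post; the function name is made up for illustration.

```python
from math import isqrt

CELL_BYTES = 8  # assumption: one cell is 8 bytes on a 64-bit Forth


def buffer_bytes(n):
    """Bytes needed for 2 * sqrt(n) cells, per the estimate in the post."""
    return 2 * isqrt(n) * CELL_BYTES


for exponent in (12, 18):
    megabytes = buffer_bytes(10 ** exponent) / 1e6
    print(f"10^{exponent}: {megabytes:.0f} MB")
```

Under these assumptions 10^12 needs 16 MB and 10^18 needs 16 GB for the buffer alone.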

Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Paul Rubin

Jun 16, 2018, 9:04:20 PM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
> The i7's gives dramatically better results than the AMD's....

> I'm interested in results on the newer i9 intel processors

The i9s have the same core architectures as i7, just more cores and
other things to make the product higher-end.

You can try on a multicore xeon at low hourly cost at hetzner.de/cloud .

Albert van der Horst

Jun 17, 2018, 5:58:19 AM

In article <87in6iu...@nightsong.com>,

Paul Rubin <no.e...@nospam.invalid> wrote:
>alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>> The i7's gives dramatically better results than the AMD's....
>> I'm interested in results on the newer i9 intel processors
>
>The i9s have the same core architectures as i7, just more cores and
>other things to make the product higher-end.

More cores, interesting. Other things, curious.

>
>You can try on a multicore xeon at low hourly cost at hetzner.de/cloud .

Isn't that more for people who want to get things done, instead
of benchmarking?

Anton Ertl

Jun 17, 2018, 6:41:17 AM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:

>In
>https://github.com/albertvanderhorst/primecounting/blob/master/dynamic
>
>you find the parallel prime counting program r10par.frt and the
>Forth compiler lina8G to be run on a 64 bit linux system.
>It uses dynamic programming instead of the usual sieve systems.
>
>Download both and make them executable.
>The option -p gives the number of parallel processes, default 2.
>
>A typical run on AMD FX8370 (8 core)
>
> time r10par.frt -p 7 1,000,000,000,000
> #S : ISN'T UNIQUE
> Under 1000000000000: 37607912018 primes
>
> real 0m16.044s
> user 0m14.825s
> sys 0m0.028s
>
>The i7's gives dramatically better results than the AMD's.
>(This means machines that are rated similarly by gamers,
>factors 3 to 5 are observed.)

i7 is a marketing name and has been used for many different CPUs, with
quite different performance (but always a high price:-). E.g., for
our LaTeX benchmark (numbers are times in seconds):

- Xeon X3460 (Lynnfield (Nehalem)) 2800MHz, Debian Lenny (64-bit) 0.484
- Core i7-6700K, 4200MHz (Turbo), 8MB L3, Debian Jessie (64-bit) 0.200

The Xeon X3460 is the same processor as a Core i7 860, except that it
tells the chipset that it is a Xeon, and the chipset then does not
disable ECC functionality.

Let's see the performance of a few AMD processors from the
Bulldozer/Piledriver/Steamroller/Excavator line that your FX8370 is
from, on this benchmark:

- AMD A8-5600K, 3600MHz, 2*2048KB L2, Debian Wheezy (64-bit) 0.424
- Athlon X4 845, 3500MHz, Ubuntu 16.04 0.380

A little faster than one i7, about twice as slow as another. The
A8-5600K has the same Piledriver cores as your FX8370.

Concerning Forth code, here are some Gforth results:

sieve bubble matrix fib fft release; CPU; gcc
0.118 0.171 0.061 0.166 0.050 2015-02-01; Intel Core i7-3517U 3.0GHz; gcc-4.9.0 (SUSE Linux)
0.076 0.104 0.040 0.076 0.032 2016-05-03; Intel Core i7-4790K 4.4GHz; gcc-4.9.2 (Debian 8)
0.076 0.112 0.040 0.080 0.028 2015-12-26; Intel Core i7-6700K 4.0GHz; gcc-4.9.2 (Debian 8)
0.120 0.168 0.064 0.160 0.060 2016-05-03; AMD Phenom II X2 560 3.3GHz; gcc-4.9.2 (Debian 8)
0.132 0.136 0.056 0.124 0.048 2017-05-25; AMD Athlon X4 845 (Carrizo/Excavator) 3.5GHz; gcc-4.9
0.093 0.099 0.042 0.104 0.030 2017-07-05; AMD Ryzen 1600X 4GHz; gcc-6.3

Your code "requires a library that in /usr/lib which will be present
if you installed a regular lina version 5.3 or higher." I don't have
that (and I certainly won't install it on several machines), so it
won't run.

>I'm interested in results on the newer i9 intel processors,

i9 is just a marketing name, too, indicating even higher price.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2018: http://www.euroforth.org/ef18/

peter....@gmail.com

Jun 17, 2018, 9:44:36 AM

On Sunday, 17 June 2018 01:43:34 UTC+2, Albert van der Horst wrote:
> In
> https://github.com/albertvanderhorst/primecounting/blob/master/dynamic
>
> you find the parallel prime counting program r10par.frt and the
> Forth compiler lina8G to be run on a 64 bit linux system.
> It uses dynamic programming instead of the usual sieve systems.
>
> Download both and make them executable.
> The option -p gives the number of parallel processes, default 2.
>
> A typical run on AMD FX8370 (8 core)
>
> time r10par.frt -p 7 1,000,000,000,000
> #S : ISN'T UNIQUE
> Under 1000000000000: 37607912018 primes
>
> real 0m16.044s
> user 0m14.825s
> sys 0m0.028s
>

I have tried it on my system. 2xE5-2670v1 2.6/3.0 GHz 16/32 cores/threads
128G Ram

With different values for p I get

p user
1 17.828
2 12.047
3 9.922
5 8.750
7 8.422
8 8.156
10 9.844
15 12.141
20 11.297
24 10.656
28 11.406
32 12.547
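
To make the scaling in this table easier to read, the following sketch turns the reported times into speedup and per-process efficiency relative to p = 1. It takes the numbers at face value as posted; the variable names are only illustrative.

```python
# (p, reported user seconds) copied from the table above
timings = [(1, 17.828), (2, 12.047), (3, 9.922), (5, 8.750), (7, 8.422),
           (8, 8.156), (10, 9.844), (15, 12.141), (20, 11.297),
           (24, 10.656), (28, 11.406), (32, 12.547)]

baseline = timings[0][1]
# Speedup relative to the single-process run
speedups = {p: baseline / t for p, t in timings}
best_p = max(speedups, key=speedups.get)

for p, s in speedups.items():
    print(f"p={p:2d}  speedup={s:.2f}  efficiency={s / p:.2f}")
print("best p:", best_p)
```

The best time is at p = 8, with a speedup a little over 2, and larger p values are slower again.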

It was run on Windows 10 1803 using the WSL

How is the threading system implemented in Lina64?
Are threads created once and reused or killed and recreated?

Peter

Albert van der Horst

Jun 17, 2018, 9:49:22 AM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>>In
>>https://github.com/albertvanderhorst/primecounting/blob/master/dynamic
>>
>>you find the parallel prime counting program r10par.frt and the
>>Forth compiler lina8G to be run on a 64 bit linux system.
>>It uses dynamic programming instead of the usual sieve systems.
>>
>>Download both and make them executable.
>>The option -p gives the number of parallel processes, default 2.
>>
>>A typical run on AMD FX8370 (8 core)
>>
>> time r10par.frt -p 7 1,000,000,000,000
>> #S : ISN'T UNIQUE
>> Under 1000000000000: 37607912018 primes
>>
>> real 0m16.044s
>> user 0m14.825s
>> sys 0m0.028s
>>
>>The i7's gives dramatically better results than the AMD's.
>>(This means machines that are rated similarly by gamers,
>>factors 3 to 5 are observed.)
>
>i7 is a marketing name and has been used for many different CPUs, with
>quite different performance (but always a high price:-). E.g., for
>our LaTeX benchmark (numbers are times in seconds):

You snipped the part where I said that my buddy buys some i7
and I buy an AMD at about the same time for about the same price.

<SNIP>
Sorry I'm only interested in parallel benchmarks in this context.

>
>>I'm interested in results on the newer i9 intel processors,
>
>i9 is just a marketing name, too, indicating even higher price.

No it is not just that. It advertises considerably more cores/threads
and that is exactly what I want to find out with my parallel
benchmark. Whether it is worth the price.

You're probably not the only one shying away from installing
ciforth.
You can try it out by unpacking
https://github.com/albertvanderhorst/ciforth/releases/download/CVS_REL-5-3-0/lina-5.3.0.tar.gz
in /tmp and then replace
/usr/bin/lina64
by
./lina64
in a script like r10par.frt that you want to run.

I will add a plain executable that can be run as is on Linux machines.

>
>- anton

Anton Ertl

Jun 17, 2018, 11:06:37 AM

No, I did not. There is no such part in the posting I cited.

Looking at <https://geizhals.at/?phist=1155953&age=9999>, I see that
the AMD FX8370 cost EUR 216 at its high point. Looking at the
lowest-end desktop Core i7 that was current at the time, it cost EUR 246 at
its lowest point <https://geizhals.eu/?phist=930986&age=9999>.

>>i9 is just a marketing name, too, indicating even higher price.
>
>No it is not just that. It advertises considerably more cores/threads

The Core i9-8950HK has 6 cores, which is not more than, e.g. the Core
i7-8700.

>and that is exactly what I want to find out with my parallel
>benchmark. Whether it is worth the price.

For a good benchmark result, there is no such thing as a too-high
price, is there? If you don't agree, it's unclear to me why you would
use that program as the basis for evaluating CPUs.

>You'll probably not the only one shying away from installing
>ciforth.
>You can try it out by unpacking
>https://github.com/albertvanderhorst/ciforth/releases/download/CVS_REL-5-3-0/lina-5.3.0.tar.gz
>in /tmp and then replace
> /usr/bin/lina64
>by
> ./lina64
>in a script like r10par.frt that you want to run.

I downloaded lina8G and r10par.frt (which already contains the change
you suggested), but what I get when I try to run this is a highly
informative

? ciforth ERROR # 8

Albert van der Horst

Jun 17, 2018, 11:21:36 AM

In article <3bcb2d75-7800-4a8e...@googlegroups.com>,

Thanks!
Now *that* is interesting. It means that there is not much
to gain by going to the Intel's with many cores.
Slightly disappointing though.

I presume that you would have mentioned any wrong answers, so
that is a pretty good test for the program.

>
>It was run on Windows 10 1803 using the WSL

So on a virtual machine. Sometimes that limits the
number of threads that can run, have you checked that?

>
>How is the threading system implemented in Lina64?
>Are threads created once and reused or killed and recreated?

They are created and stay busy, and communicate through shared
memory, never stalling on a system call, until all work is done.
(Only user variables are different. )
It is a very basic fork with shared memory, one screen in the library.
( LOCATE THREAD-PET )

Processes work on one big area but different locality,
such that the expectation was that the cache should help to
keep the processes separate.

>
>Peter
>
>
>> The i7's gives dramatically better results than the AMD's.
>> (This means machines that are rated similarly by gamers,
>> factors 3 to 5 are observed.)
>>
>> I'm interested in results on the newer i9 intel processors,
>> especially the speedup compared to no parallelism.
>>

Albert van der Horst

Jun 17, 2018, 12:07:57 PM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

<SNIP>

>>You'll probably not the only one shying away from installing
>>ciforth.
>>You can try it out by unpacking
>>https://github.com/albertvanderhorst/ciforth/releases/download/CVS_REL-5-3-0/lina-5.3.0.tar.gz
>>in /tmp and then replace
>> /usr/bin/lina64
>>by
>> ./lina64
>>in a script like r10par.frt that you want to run.
>
>I downloaded lina8G and r10par.frt (which already contains the change

> recommended) and got the uninformative

>
> ? ciforth ERROR # 8
>

Sorry for that. This https://github.com/albertvanderhorst/primecounting
is very much under construction.
I've added cooperating lina8G and forth.lab and updated the readme.

For those understanding the list comprehensions of python,
r10lico.py contains the base algorithm in 7 lines.
It is illustrative to run it with testoutput uncommented
for P10(100) .

>- anton

Groetjes Albert

peter....@gmail.com

Jun 17, 2018, 12:27:18 PM

Yes they all show the same output.
Remember that this is a 2 processor system. That might have some
impact on memory handling.
When I run with 32 threads I can see all cores/threads in process manager
go to 100% at 3 GHz, so in some way they are used.

>
> >
> >It was run on Windows 10 1803 using the WSL
>
> So on a virtual machine. Sometimes that limits the
> number of threads that can run, have you checked that?
>

WSL is no virtual machine. It is just a translation layer from Linux
syscalls to Windows system calls. Htop shows 32 cores and 128GB RAM.

I tried to run it in a Linux Virtual machine. 16 cores 16 GB ram.
But that segfaults with lina8G. I will try with more ram.

Is it possible to run the program on wina64?

Peter

For CPU workloads it should run the same speed as Linux.

m...@iae.nl

Jun 17, 2018, 2:02:58 PM

On Sunday, June 17, 2018 at 5:21:36 PM UTC+2, Albert van der Horst wrote:
[..]

> Thanks!
> Now *that* is interesting. It means that there is not much
> to gain by going to the Intel's with many cores.
> Slightly disappointing though.

I don't think you should conclude that on the basis of
this program. Current iForth benchmarks show a near
proportional speedup with the number of available
physical cores (tested up to 6). It took us a few
iterations to get there.

-marcel

m...@iae.nl

Jun 17, 2018, 2:09:57 PM

On Sunday, June 17, 2018 at 6:27:18 PM UTC+2, peter....@gmail.com wrote:
> On Sunday, 17 June 2018 17:21:36 UTC+2, Albert van der Horst wrote:
> > In article <3bcb2d75-7800-4a8e...@googlegroups.com>,
> > <peter....@gmail.com> wrote:

[..]

> Remember that this is a 2 processor system. That might have some
> impact on memory handling.
> When I run with 32 threads I can see all cores/threads in process manager
> go to 100% at 3Ghz. so in some way they are used

What is the percentage 'Kernel times' in that? I know some
other programs that put on a nice show, but are not doing
much more than threads waiting on each other. A very well
written program is the VS C compiler (not the assembler/linker).
It shows 100% CPU and nearly no kernel time. (It could of course
be Window dressing :-)

-marcel

peter....@gmail.com

Jun 17, 2018, 3:25:51 PM

How do you see the Kernel time?
Seeing the results it is clear that the threads are not calculating
primes at 100% but doing something else.

Peter

Anton Ertl

Jun 17, 2018, 6:06:43 PM

Ok, I managed to get it to run. What I get is:

       1         2         4         6         8  -p parameter
 14.551s   10.007s    7.730s    7.008s    6.611s  Ryzen 7 1800X (8 cores)
 11.468s    7.808s    6.244s    9.424s    9.148s  Core i7-4790K (4 cores/8 threads)
 31.698s   21.385s   16.161s   14.625s   13.593s  2xXeon 5450 (2 x 4 cores)
 37.432s   25.424s   21.220s                      Athlon X4 845 (4 cores, two modules)
 35.960s   24.064s                                Phenom II X2 560 (2 cores)
 99.112s   69.828s   52.968s                      Celeron J3455 (4 cores)

This program does not parallelize well. The difference in speed
between different CPUs is quite a bit more for this benchmark than for
other benchmarks I have measured.
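
One way to put a number on "does not parallelize well" is to fit Amdahl's law to two of the measurements. This is a simplifying model introduced here, not something claimed in the thread, and it ignores effects such as memory bandwidth or clock throttling. The sketch uses the Ryzen 7 1800X times for -p 1 and -p 8 from the table above.

```python
def serial_fraction(t1, tp, p):
    """Solve Amdahl's law T(p) = T1 * (s + (1 - s) / p) for the serial share s."""
    return (tp / t1 - 1 / p) / (1 - 1 / p)


# Ryzen 7 1800X: 14.551 s with -p 1, 6.611 s with -p 8
s = serial_fraction(14.551, 6.611, 8)
print(f"implied serial fraction: {s:.2f}")
print(f"speedup limit for many processes: {1 / s:.1f}x")
```

Under this model a bit more than a third of the single-process run time behaves as if it were serial, which would cap the speedup below 3x no matter how many cores are added.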

Albert van der Horst

Jun 18, 2018, 5:16:00 AM

In article <29432172-be14-4013...@googlegroups.com>,

My conclusion is only valid for this program.

>
>-marcel

Albert van der Horst

Jun 18, 2018, 5:50:39 AM

In article <29432172-be14-4013...@googlegroups.com>,
<m...@iae.nl> wrote:

This program goes linearly through two input areas and
writes one output area.
The algorithm is in
https://github.com/albertvanderhorst/primecounting/blob/master/dynamic/r10lico.py
(Don't be fooled by the hash table, that is just an elegant formulation.)

Basically, for 10^12 you need a buffer of 2*10^6 cells.
A new buffer is generated based on the previous buffer only,
and each cell could be filled by a separate process in parallel.
Also the access to the old buffer is predictable, sweeping, not
random.
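
The description above (about 2 * sqrt(N) cells, each new buffer computed only from the previous one) matches a well-known dynamic-programming recurrence for counting primes. The sketch below illustrates that recurrence under that assumption. It is not a transcription of r10lico.py or r10par.frt, and all names are invented here.

```python
from math import isqrt


def prime_count(n):
    """Count the primes <= n with a dynamic-programming recurrence, no sieve."""
    if n < 2:
        return 0
    r = isqrt(n)
    # Key values: every distinct n // i, about 2 * sqrt(n) of them.
    keys = [n // i for i in range(1, r + 1)]
    keys += list(range(keys[-1] - 1, 0, -1))
    # old[v] = how many numbers in 2..v survive the primes handled so far.
    old = {v: v - 1 for v in keys}
    for p in range(2, r + 1):
        if old[p] == old[p - 1]:
            continue  # p did not survive, so it is composite
        below = old[p - 1]  # number of primes smaller than p
        p2 = p * p
        # The new buffer reads only the old one, so every cell is
        # independent and could be filled by a separate process.
        old = {v: old[v] - (old[v // p] - below) if v >= p2 else old[v]
               for v in keys}
    return old[n]


print(prime_count(100))
```

The dictionary plays the role of the hash table mentioned above; two flat arrays of sqrt(N) cells each would hold the same data.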

Despite being (designed for) cache friendliness, the algorithm
may just saturate the memory bandwidth.

N.B. the threads never stop, they busy wait for the run condition
to go up. Processors are just 100% busy.

>
>-marcel

Albert van der Horst

Jun 18, 2018, 6:06:41 AM

In article <9a14a362-6032-4b90...@googlegroups.com>,
<peter....@gmail.com> wrote:
<SNIP>

>
>WSL is no virtual machine. It is just a translation layer from Linux
>syscalls to Windows systemcalls. Htop show 32 cores and 128GB ram

I didn't know someone pulled that off, impressive.

>
>I tried to run it in a Linux Virtual machine. 16 cores 16 GB ram.
>But that segfaults with lina8G. I will try with more ram.
>
>Is it possible to run the program on wina64?

Programs are portable across ciforth, except for availability of
library functions.
I would have to reimplement this screen for Windows, which
doesn't seem prohibitive for someone knowledgeable, but
requires some study for me.

( THREAD-PET KILL-PET PAUSE-PET ) CF: ?LI \ B5dec2
"CTA" WANTED "-syscalls-" WANTED HEX
\ Exit a thread. Indeed this is exit().
: EXIT-PET 0 _ _ __NR_exit XOS ;
\ Do a preemptive pause. ( abuse MS )
: PAUSE-PET 1 MS ;
\ Create a thread with dictionary SPACE. Execute XT in thread.
: THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
ROT RSP! 2 CELLS - ( DSP) , ( TASK) , ( pid) 0 ,
DOES> DUP @ >R SWAP OVER CELL+ @ R@ 2! ( clone S: tp,xt)
100 R> _ __NR_clone XOS DUP IF
( Mother) DUP ?ERRUR SWAP 2 CELLS + ! ELSE
( Child) DROP RSP! CATCH DUP IF ERROR THEN EXIT-PET THEN ;
\ Kill a THREAD-PET , preemptively. Throw errors.
: KILL-PET >BODY 2 CELLS + @ 9 _ __NR_kill XOS ?ERRUR ;
DECIMAL
>
>Peter
>
>> Groetjes Albert

peter....@gmail.com

Jun 18, 2018, 6:36:01 AM

On Monday, 18 June 2018 12:06:41 UTC+2, Albert van der Horst wrote:
> In article <9a14a362-6032-4b90...@googlegroups.com>,
> <peter....@gmail.com> wrote:
> <SNIP>
> >
> >WSL is no virtual machine. It is just a translation layer from Linux
> >syscalls to Windows systemcalls. Htop show 32 cores and 128GB ram
>
> I didn't know some one pulled that off, impressive.

Well, it is part of Win 10 so it is Microsoft that pulled it off!

In fact I am very impressed how well it runs. You open a command prompt
and type bash and you have a Linux command line. It is good for
developing for both Windows and Linux.

>
> >
> >I tried to run it in a Linux Virtual machine. 16 cores 16 GB ram.
> >But that segfaults with lina8G. I will try with more ram.
> >
> >Is it possible to run the program on wina64?
>
> Programs are portable accross ciforth, except for availability of
> library functions.
> I would have to reimplement this screen for windows which
> doesn't seem prohibitive for some one knowledgeable, but
> requires some study for me.
>
> 0 ( THREAD-PET KILL-PET PAUSE-PET ) CF: ?LI \ B5dec2
> 1 "CTA" WANTED "-syscalls-" WANTED HEX
> 2 \ Exit a thread. Indeed this is exit().
> 3 : EXIT-PET 0 _ _ __NR_exit XOS ;
> 4 \ Do a preemptive pause. ( abuse MS )
> 5 : PAUSE-PET 1 MS ;
> 6 \ Create a thread with dictionary SPACE. Execute XT in thread.
> 7 : THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
> 8 ROT RSP! 2 CELLS - ( DSP) , ( TASK) , ( pid) 0 ,
> 9 DOES> DUP @ >R SWAP OVER CELL+ @ R@ 2! ( clone S: tp,xt)
> 10 100 R> _ __NR_clone XOS DUP IF
> 11 ( Mother) DUP ?ERRUR SWAP 2 CELLS + ! ELSE
> 12 ( Child) DROP RSP! CATCH DUP IF ERROR THEN EXIT-PET THEN ;
> 13 \ Kill a THREAD-PET , preemptively. Throw errors.
> 14 : KILL-PET >BODY 2 CELLS + @ 9 _ __NR_kill XOS ?ERRUR ;
> 15 DECIMAL

I found that screen too, and I see that some work is needed.

Peter

Anton Ertl

Jun 18, 2018, 11:04:53 AM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:

>This program goes linearly through two input areas and
>write one output area.
>The algorithm is in
>https://github.com/albertvanderhorst/primecounting/blob/master/dynamic/r10lico.py
>(Don't be fooled by the hash table, that is just an elegant formulation.)
>
>Basically for 10^12 you need 2.10^6 cell arrays buffer.
>A new buffer is generated based on the previous buffer only,
>and each cell could be filled by a separate process in parallel.
>Also the access to the old buffer is predictable, sweeping, not
>random.
>
>Despite being (designed for) cache friendliness, the algorithm
>may just saturate the memory bandwidth.
>
>N.B. the threads never stop, they busy wait for the run condition
>to go up. Processors are just 100% busy.

That's not a good way to get good speed: With SMT (aka
Hyperthreading), this means that the other thread(s) on the same core
are slower than otherwise. With power-limited multi-core CPUs (pretty
much every current one), this means that the core(s) doing actual work
is/are clocked slower than they would be otherwise, especially on
many-core CPUs. At the very least, you should use the PAUSE
instruction in your busy-waiting loop.

As for your memory consumption ideas, for the 1e12 parameter, I see
~16MB more memory in use than before. On several machines I had to

echo 1 >/proc/sys/vm/overcommit_memory

because your processes ask for so much virtual memory (and Linux
defaults to an idiotic compromise between overcommit and
no-overcommit).

Concerning the speculations on performance, you suggested "saturate
the memory bandwidth". I expect that some CPUs will spend a lot of
time in indirect branch mispredictions; also, I have not looked at
your code, but communications overhead between the processes (locked
instruction, fences etc.) may be an issue.

I used performance counters to measure some of these things on a few
machines, for the -p1 version:

Celeron J4105 Phenom II Core i7-4790K
2788235 1610199810 1304296 branch-misses
73158302596 73170728154 73159790157 instructions
106617028872 119057351983 49800135148 cpu-cycles
62445888 1937649 3098925 cache-misses
10397081 68195098 L1-dcache-load-misses

The Phenom has a lot of branch mispredictions (probably thanks to
threaded code and an old-fashioned BTB as indirect branch predictor)
that account for maybe 40%-50% of the cycles. The Celeron J4105 does
not have that problem, but there must be something else. Cache misses
(and thus memory bandwidth) do not seem to be the problem on these
machines, not even on the Celeron J4105.
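
To compare the three machines on equal terms, the counters above can be reduced to instructions per cycle (IPC) and branch misses per thousand instructions. The sketch below only rearranges numbers from the table; the dictionary layout and names are illustrative.

```python
# Counter values copied from the table above (-p 1 runs)
counters = {
    "Celeron J4105": {"branch_misses": 2788235,
                      "instructions": 73158302596,
                      "cycles": 106617028872},
    "Phenom II": {"branch_misses": 1610199810,
                  "instructions": 73170728154,
                  "cycles": 119057351983},
    "Core i7-4790K": {"branch_misses": 1304296,
                      "instructions": 73159790157,
                      "cycles": 49800135148},
}

# Instructions per cycle, and branch misses per 1000 instructions
ipc = {cpu: c["instructions"] / c["cycles"] for cpu, c in counters.items()}
mpki = {cpu: 1000 * c["branch_misses"] / c["instructions"]
        for cpu, c in counters.items()}

for cpu in counters:
    print(f"{cpu:14s} IPC={ipc[cpu]:.2f}  branch misses/1000 instr={mpki[cpu]:.2f}")
```

All three execute nearly the same number of instructions, so the run-time differences come almost entirely from IPC and clock rate. The Phenom II stands out with roughly 22 branch misses per thousand instructions, against well under 0.1 for the other two.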

Here are the results of the machines I have measured, including two
additional ones:

       1         2         4         6         8  -p parameter
 14.551s   10.007s    7.730s    7.008s    6.611s  Ryzen 7 1800X (8 cores)
 11.468s    7.808s    6.244s    9.424s    9.148s  Core i7-4790K (4 cores/8 threads)
 31.698s   21.385s   16.161s   14.625s   13.593s  2xXeon 5450 (2 x 4 cores)
 37.432s   25.424s   21.220s                      Athlon X4 845 (4 cores, two modules)
 35.960s   24.064s                                Phenom II X2 560 (2 cores)
 99.112s   69.828s   52.968s                      Celeron J3455 (4 cores)

 42.772s   30.424s   23.420s                      Celeron J4105 (4 cores)
213.940s  145.168s                                Atom 330

The difference between the Celeron J3455 (with Goldmont cores) and
Celeron J4105 (with Goldmont+ cores) is interesting, because Intel did
not tout the improvements of Goldmont+ over Goldmont much.

Albert van der Horst

Jun 18, 2018, 2:23:11 PM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
<SNIP>

>That's not a good way to get good speed: With SMT (aka
>Hyperthreading), this means that the other thread(s) on the same core
>are slower than otherwise. With power-limited multi-core CPUs (pretty
>much every current one), this means that the core(s) doing actual work
>is/are clocked slower than they would be otherwise, especially on
>many-core CPUs. At the very least, you should use the PAUSE
>instruction in your busy-waiting loop.

A PAUSE sets you back milliseconds; that's why I didn't use it.

:I wait BEGIN turn @ ME @ = UNTIL ;
replaced by
:I wait BEGIN PAUSE-PET turn @ ME @ = UNTIL ;

Indeed with PAUSE-PET the user time goes down from 17.2/16.6 to
15.4/12.9 . So you're right.
That is no win because the real time goes up from 17.6 to 28.3/25.3 .
So I was right too.

I have prime95 (GIMPS mprime) running full blast on 8 processors all the
time. If I start r10par.frt on 7 processors, it gains priority over
prime95 and the temperature is about 3 degrees higher while it runs.
Terminating prime95 makes the temperature drop from about 60 to about
25 degrees Celsius.
I hope my machine is not throttling, because that would mean I could
contribute equally much to GIMPS with a cheaper machine and better
cooling.

The threads are very well balanced. The difference should not be
more than a dozen or so busy waits.
That's what I thought, but let's measure.
Measuring it I get that each busy waiting loop cycles about 4500 times
(multiply by 7 and by 80000 primes). That is unhealthy and
unexpected.
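
For a sense of scale, the figures just quoted can be multiplied out. The sketch takes all three numbers from the post as given; "80000 primes" is presumably the number of primes below sqrt(10^12) = 10^6, which is 78498.

```python
spins_per_wait = 4500  # measured busy-wait cycles per wait, from the post
processes = 7          # the -p 7 run
rounds = 80_000        # "80000 primes" in the post

total_spins = spins_per_wait * processes * rounds
print(f"{total_spins:,} busy-wait iterations in total")
```

That comes to about 2.5 billion loop iterations spent waiting rather than computing.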

>As for your memory consumption ideas, for the 1e12 parameter, I see
>~16MB more memory in use than before. On several machines I had to
>
>echo 1 >/proc/sys/vm/overcommit_memory
>
>because your processes ask for so much virtual memory (and Linux
>defaults to an idiotic compromise between overcommit and
>no-overcommit).

10^12 is a relatively modest goal; for that, 30 MB is sufficient.
You can bring the 64 Gbyte of lina8G (not the best of names)
down by growing it a negative amount.

lina8G -g 63970 lina30M
Then use lina30M in the scripts.

>
>Concerning the speculations on performance, you suggested "saturate
>the memory bandwidth". I expect that some CPUs will spend a lot of
>time in indirect branch mispredictions; also, I have not looked at
>your code, but communications overhead between the processes (locked
>instruction, fences etc.) may be an issue.

>I used performance counters to measure some of these things on a few
>machines, for the -p1 version:
>Celeron J4105 Phenom II Core i7-4790K

..
<SNIP>

I saved those results. Thanks, it deserves study.

And I think the algorithm is interesting. It will not be a champion
however until it runs on a fast Forth.

>- anton

Paul Rubin

Jun 18, 2018, 2:30:24 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> At the very least, you should use the PAUSE instruction in your
> busy-waiting loop.

That adds a huge amount of latency in Skylake-X, enough that spinloops
are preferable more of the time than in other architectures.

https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/

Anton Ertl

Jun 19, 2018, 4:10:38 AM

Paul Rubin <no.e...@nospam.invalid> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> At the very least, you should use the PAUSE instruction in your
>> busy-waiting loop.
>
>That adds a huge amount of latency in Skylake-X, enough that spinloops
>are preferable more of the time than in other architectures.

Given that the PAUSE instruction exists for reducing the resource
usage of spinloops, what you write means: By adding latency to
PAUSE, resource usage of spinloops was further reduced, so they became
preferable more of the time. This is probably true, but did you
intend to say that? In any case, the link you give discusses a more
involved issue:

>https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/

As far as I understand, here the system (.NET in this case) uses PAUSE
in spin loops, and after a certain spin count gives up and yields to
another process/thread, as it should. The count has been tuned for
the PAUSE latency of earlier processors, and is too high for Skylake,
i.e., it spins too long before yielding, and blocks the hardware
thread/core during this time.

Given that Albert van der Horst's program never yields and always
spins, this is not an issue for him.
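
The spin-then-yield pattern described above can be sketched as follows. This is only an illustration of the control flow: Python has no access to the x86 PAUSE instruction, the spin limit of 1000 is an arbitrary placeholder, and `time.sleep(0)` stands in for yielding to the scheduler. None of this is taken from .NET or from r10par.frt.

```python
import time


def wait_for(condition, spin_limit=1000, yield_fn=lambda: time.sleep(0)):
    """Spin on condition(); after every spin_limit failed checks, yield once.

    Returns (spins, yields) so the behaviour can be inspected.
    """
    spins = 0
    yields = 0
    while not condition():
        spins += 1
        if spins % spin_limit == 0:
            yield_fn()  # give the core to another thread or process
            yields += 1
    return spins, yields


# Example: a condition that becomes true on the 2501st check
checks = iter([False] * 2500 + [True])
print(wait_for(lambda: next(checks)))
```

If the spin limit was tuned for a short PAUSE and each spin suddenly takes much longer, the loop holds the core correspondingly longer before it yields, which is the effect the linked article describes.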

Anton Ertl

Jun 19, 2018, 4:44:35 AM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:

>In article <2018Jun1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
><SNIP>
>>That's not a good way to get good speed: With SMT (aka
>>Hyperthreading), this means that the other thread(s) on the same core
>>are slower than otherwise. With power-limited multi-core CPUs (pretty
>>much every current one), this means that the core(s) doing actual work
>>is/are clocked slower than they would be otherwise, especially on
>>many-core CPUs. At the very least, you should use the PAUSE
>>instruction in your busy-waiting loop.
>
>A PAUSE sets you back milliseconds,

You obviously don't mean the PAUSE instruction
<https://c9x.me/x86/html/file_module_x86_id_232.html>, which waits 147
cycles on a Goldmont (100ns at 1.5GHz) and less for other cores listed
by Agner Fog.

>:I wait BEGIN turn @ ME @ = UNTIL ;
> replaced by
>:I wait BEGIN PAUSE-PET turn @ ME @ = UNTIL ;
>
>Indeed with PAUSE-pet the user time goes down from 17.2/16.6 to
>15.4/12.9 . So you're right.
>That is no win because the real time goes up from 17.6 to 28.3/25.3 .
>So I was right too.

If you had used the PAUSE instruction, user time and real time would
be similar (as long as no other process shares the same cores); the
PAUSE instruction counts as time consumed by the process/thread as far
as the OS is concerned.

>And I think the algorithm is interesting. It will not be a champion
>however until it runs on a fast Forth.

Convert it to standard Forth, then you can let iForth, Vfx (for small
instances) etc. have a go at it.

john

Jun 19, 2018, 6:18:43 AM

In article <87a7rsa...@nightsong.com>, no.e...@nospam.invalid says...

It seems like a horses-for-courses situation. An on-chip vs. off-chip decision by Intel.
It shouldn't surprise anyone that a different architecture - even
if functionally compliant - would need a different compile strategy
to maintain performance or overcome some other variation. A different
architecture logically has to show a difference somewhere.

--

john

=========================
http://johntech.co.uk

=========================

Albert van der Horst

Jun 19, 2018, 8:09:53 AM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>>In article <2018Jun1...@mips.complang.tuwien.ac.at>,
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>><SNIP>
>>>That's not a good way to get good speed: With SMT (aka
>>>Hyperthreading), this means that the other thread(s) on the same core
>>>are slower than otherwise. With power-limited multi-core CPUs (pretty
>>>much every current one), this means that the core(s) doing actual work
>>>is/are clocked slower than they would be otherwise, especially on
>>>many-core CPUs. At the very least, you should use the PAUSE
>>>instruction in your busy-waiting loop.
>>
>>A PAUSE sets you back milliseconds,
>
>You obviously don't mean the PAUSE instruction
><https://c9x.me/x86/html/file_module_x86_id_232.html>, which waits 147
>cycles on a Goldmont (100ns at 1.5GHz) and less for other cores listed
>by Agner Fog.

Indeed I thought you meant PAUSE as is usual in
Forth. In preemptive multi threading like mine, a Forth PAUSE is
in fact giving up control towards the operating system.
Another thing to try.

>
>>:I wait BEGIN turn @ ME @ = UNTIL ;
>> replaced by
>>:I wait BEGIN PAUSE-PET turn @ ME @ = UNTIL ;
>>
>>Indeed with PAUSE-pet the user time goes down from 17.2/16.6 to
>>15.4/12.9 . So you're right.
>>That is no win because the real time goes up from 17.6 to 28.3/25.3 .
>>So I was right too.
>
>If you had used the PAUSE instruction, user time and real time would
>be similar (as long as no other process shares the same cores); the
>PAUSE instruction counts as time consumed by the process/thread as far
>as the OS is concerned.

>
>>And I think the algorithm is interesting. It will not be a champion
>>however until it runs on a fast Forth.
>
>Convert it to standard Forth, then you can let iForth, Vfx (for small
>instances) etc. have a go at it.

What is the standard Forth way for parallel execution?

I didn't say that making it a champion is a short-term goal.
What I find important is that those interested can try the program
out relatively easily.
Downloading 3 files is not too bad.

You make me realize that parallelism is not the only thing that isn't
standardized: neither is argument handling, nor support for interpret loops.
Also, most Forths require much more than just two files to interpret
a Forth program, and it can be a pain to find out how to have
a gigabyte dictionary.

For Forths that support THREAD-PET, I'll try to port the program to them.

THREAD-PET ( SPACE "name" -- )
Create a thread with its own dictionary, user variables and stacks.
The dictionary has size SPACE, but the main dictionary and its buffers
remain accessible. SPACE may be zero, but then PAD and <# can't be used.

Execution ( xt -- )
Execute xt in a preemptive thread. After execution the thread terminates.
Execution may also terminate by calling EXIT-PET.

KILL-PET ( xt -- )
xt must have been created by THREAD-PET. Preemptively terminate the
thread.

PAUSE-PET
A traditional PAUSE within Forth processes.

>
>- anton

Groetjes Albert

Anton Ertl

Jun 19, 2018, 9:23:53 AM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>In article <2018Jun1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Convert it to standard Forth, then you can let iForth, Vfx (for small
>>instances) etc. have a go at it.
>
>What is the standard Forth way for parallel execution?

There is none yet. Andrew Haley is revving up to a proposal; he
presented something at EuroForth 2017:

Slides:
<http://www.complang.tuwien.ac.at/anton/euroforth/ef17/genproceedings/papers/haley.pdf>

Video:
<https://wiki.forth-ev.de/lib/exe/fetch.php/events:ef2017:multitasking.mp4>

Anyway, your program does not scale well to multiple processes, giving
only a speedup by a factor 2.2 when going from 1 core to 8. Given
that the speedup of iForth and VFX over lina is probably more than a
factor 2.2, you gain more from a fast Forth than from being able to
burn more CPU power.
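To put a number on that, here is a back-of-the-envelope sketch in
Python. It assumes Amdahl's law applies, which is a modelling
assumption and not something measured on r10par.frt; the function
names are made up for illustration.

```python
def parallel_fraction(speedup, cores):
    """Fraction p of the work that parallelizes, from Amdahl's law
    S = 1 / ((1 - p) + p / cores), solved for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)

def speedup_limit(p):
    """Amdahl's upper bound on the speedup with unlimited cores."""
    return 1.0 / (1.0 - p)

# The observed figures: a factor 2.2 when going from 1 core to 8.
p = parallel_fraction(2.2, 8)
print(f"parallel fraction: {p:.3f}")
print(f"speedup limit:     {speedup_limit(p):.2f}")
```

Under that assumption only about 62% of the run time parallelizes, and
no number of cores would push the speedup beyond roughly 2.7.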

>I didn't say that making it a champion is a short term goal.
>What I find important that those interested can try the program
>out relatively easy.
>Downloading 3 files is not too bad.

Downloading one would be better.

>You make me realize that not only parallelism isn't standardized
>but neither are argument handling,

True, but all serious systems can EVALUATE a command-line argument,
and you can pass arguments in that way. I.e., something like

$FORTH "include file.4th 1000000000000 pi . bye"

>or support for interpret loops.

See <http://forth-standard.org/standard/rationale#rat:core:COMPILE,>

>Also most Forth's require much more than just two files to interpret
>a Forth program

In my experience the Forth program is sufficient. In a misguided
appeal to minimalism, some systems do not include FP by default, and
you have to load that separately using arcane incantations if you need
it.

>and it can be a pain to find out how to have
>a gigabyte dictionary.

Apparently that's the case for lina, too, or you would not need to
supply lina8G. The usual way around that is to use ALLOCATE, which is
only limited by the hardware and the OS on most systems.

Albert van der Horst

Jun 19, 2018, 10:57:04 AM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

<SNIP>

>
>>Also most Forth's require much more than just two files to interpret
>>a Forth program
>
>In my experience the Forth program is sufficient. In a misguided

I doubt that. My guess is that you don't count all the files
that belong to the forth system itself.
It may be reasonable to assume (or even require) that a Python
or a perl is installed on a system. Such is not true for
Your Favorite Forth System (tm).

>appeal to minimalism, some systems do not include FP by default, and
>you have to load that separately using arcane incantations if you need
>it.

The nice thing is that we don't need it.

Let's see

----------------------- naive prime counting ---------------------------
WANT PRIME?
: doit 0 1 ARG[] EVALUATE 2 DO I PRIME? IF 1+ THEN LOOP
. CR ;
-------------------

After `` lina -c prime.frt '' I have an executable that I can ship to
anyone, whether or not they have ciforth installed, like so: 1]

prime 1000
168
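For comparison, a rough Python counterpart of the same naive count.
This is only a sketch: PRIME? is a ciforth library word whose
implementation is not shown here, so the trial division in is_prime
is an assumption, and the function names are invented.

```python
def is_prime(n):
    """Trial division; assumed stand-in for the library word PRIME?."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def count_primes_below(limit):
    """Count primes i with 2 <= i < limit, like `limit 2 DO ... LOOP`."""
    return sum(1 for i in range(2, limit) if is_prime(i))
```

count_primes_below(1000) gives 168, matching the run above.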

I always wondered how that would be done by other Forths, such as gforth.
You may assume that you have an incantation available in gforth to get
a word to test for primeness. " REQUIRE prime.fs " or something.

>
>- anton
>--
>M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html

1]
(And the rest of the world can do `` wina -c prime.frt '')

Albert van der Horst

Jun 19, 2018, 11:09:33 AM

In article <29432172-be14-4013...@googlegroups.com>,
<m...@iae.nl> wrote:

I don't doubt that. But if a program saturates the
memory bandwidth, there is not much iForth can do.
This program fetches two cells and stores one cell.
The calculation in between amounts to one subtraction
and one divide. That is an unfavourable ratio.

See also
https://github.com/albertvanderhorst/primecounting \
/blob/master/dynamic/r10lico.py

>
>-marcel

Anton Ertl

Jun 19, 2018, 12:55:48 PM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:

>In article <2018Jun1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
><SNIP>
>>
>>>Also most Forth's require much more than just two files to interpret
>>>a Forth program
>>
>>In my experience the Forth program is sufficient. In a misguided
>
>I doubt that. My guess is that you don't count all the files
>that belong to the forth system itself.

Of course not.

>It may be reasonable to assume (or even require) that a Python
>or a perl is installed on a system. Such is not true for
>Your Favorite Forth System (tm).

Sure it is. And several others.

And that's the way to go. Python and Perl did not get installed on
every system by worrying about how to distribute the Python or Perl
system with every script.

But if you really worry about the number of files distributed, a
Docker container or somesuch may be able to provide everything in one
file.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

Jun 19, 2018, 12:58:00 PM

alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:

>I don't doubt that. But if a program saturates the
>memory bandwidth, there is not much iForth can do.

Your program does not saturate the memory bandwidth, as I showed. And
if it did saturate the memory bandwidth, you would not be seeing a
speedup from parallelization.

Paul Rubin

Jun 19, 2018, 3:02:04 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> Given that the PAUSE instruction exists for reducing the resource
> usage of of spinloops, what you write means: By adding latency to
> PAUSE, resource usage of spinloops was further reduced, so they became
> preferable more of the time. This is probably true, but did you
> intend to say that?

What do you mean about the resource usage of spinloops being reduced?

My (maybe wrong) reading was that the latency and resource usage of
spinloops stayed the same as before, while the latency of PAUSE
increased, by enough to change the tradeoff between spinloops and PAUSE.

So the person was able to get lower latency and maybe higher total
throughput by changing some occurrences of PAUSE to use spinloops
instead, at the possible cost of burning more resources (i.e. power
dissipation and compute cycles that other threads might have been able
to use).

Albert van der Horst

Jun 19, 2018, 4:20:00 PM

In article <2018Jun1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

>alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>>I don't doubt that. But if a program saturates the
>>memory bandwidth, there is not much iForth can do.
>
>Your program does not saturate the memory bandwidth, as I showed. And
>if it did saturate the memory bandwidth, you would not be seeing a
>speedup from parallelization.

If it saturates it would show speedup until a point, consistent
with observation.

I did not see where you drew the conclusion that it couldn't be the
bandwidth.

>
>- anton

Anton Ertl

Jun 20, 2018, 8:30:04 AM

Paul Rubin <no.e...@nospam.invalid> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Given that the PAUSE instruction exists for reducing the resource
>> usage of of spinloops, what you write means: By adding latency to
>> PAUSE, resource usage of spinloops was further reduced, so they became
>> preferable more of the time. This is probably true, but did you
>> intend to say that?
>
>What do you mean about the resource usage of spinloops being reduced?

Consider a spinloop that waits until a memory location is 0 (or until
it times out and yields the hardware thread to another
thread/process). And consider the case when it takes 1500 cycles
until the memory location becomes 0.

Without PAUSE, the spinloop will do a load, a conditional branch, a
count decrement (for timeout) and a conditional branch, in a loop; a
modern Intel or AMD CPU may be able to do this in one cycle, but it
consumes a load unit slot, an ALU slot, and two branch slots in that
cycle, and the other hyperthread on the same core cannot make use of
these resources; all these instruction executions also consume energy.

With PAUSE, in each iteration of the spinloop, the PAUSE waits; on the
Skylake, for 141 cycles, while consuming only 4 micro-ops; so now the
spinloop consumes only 8 micro-ops every 142 cycles, while without
PAUSE it consumed maybe 4/cycle. As a result, the other hyperthread
on this core can run at almost full speed. The cost is that the
spinloop takes, on average, 71 cycles longer to resolve,
but given the typical waiting times, that's a minor issue.
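The arithmetic behind these figures, as a small Python sketch. The
numbers are the ones given above; the function name and the reading
that the flag flips at a uniformly random point within an iteration
(hence half an iteration of extra wait on average) are assumptions
made for illustration.

```python
def spinloop_budget(pause_cycles=141, pause_uops=4, loop_uops=4):
    """Micro-op budget of a spinloop with and without PAUSE."""
    cycles_per_iter = pause_cycles + 1      # PAUSE plus one cycle of loop body
    uops_per_iter = pause_uops + loop_uops  # PAUSE plus load, 2 branches, decrement
    return {
        "uops_per_cycle_with_pause": uops_per_iter / cycles_per_iter,
        "uops_per_cycle_without_pause": float(loop_uops),
        # Assumed: the flag flips halfway through an iteration on average.
        "mean_extra_wait_cycles": cycles_per_iter / 2,
    }

for name, value in spinloop_budget().items():
    print(f"{name}: {value:.3f}")
```

With these inputs the loop issues 8 micro-ops per 142 cycles instead
of 4 per cycle, i.e. about 71 times fewer, at the price of 71 cycles
of extra latency on average.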

>My (maybe wrong) reading was that the latency and resource usage of
>spinloops stayed the same as before, while the latency of PAUSE
>increased, by enough to change the tradeoff between spinloops and PAUSE.

No, PAUSE is not an alternative to spinloops, and that was not the
issue in this case.

m...@iae.nl

Jun 20, 2018, 9:46:48 AM

On Tuesday, June 19, 2018 at 5:09:33 PM UTC+2, Albert van der Horst wrote:
[..]

> But if a program saturates the
> memory bandwidth, there is not much iForth can do.
> This program fetches two cells and stores one cell.
> The calculation in between amounts to one subtraction
> and one divide. That is an unfavourable.

So does each calculation of DDOT in the BLAS.

-marcel

Albert van der Horst

Jun 22, 2018, 9:35:57 AM

In article <1bad26a6-8e36-4748...@googlegroups.com>,

I've done some work and I have come this far:

\ Design issue for threads under MS-Windows
\ If you start using MS handles, you're lost because it is
\ impossible to know what happens.
\ The only thing we can be sure about is that CreateThread
\ passes control to a code address, and that that code can
\ end by a ret instruction. Then cleanup will be matched to startup.
\ So we do all our Forth stuff between a CALL and RET instruction
\ and don't touch MS-Windows business.

\ ----------------------------------------------
\ THREAD-PET in wina.
WANT $-PREFIX K32 ASSEMBLERi86 H. CTA ALIAS

'R# ALIAS ME
"CreateThread" 'K32 DLL-ADDRESS: _Create-Thread _Create-Thread DROP
\ ----------------------------------------------
\ Return via the stack
CODE return RET, END-CODE

\ Run xt from workspace and remember handle . Then return.
: doit ME !
DUP 2 CELLS + @ CATCH DUP IF ERROR THEN
ME @ DSP! return ;

\ Run the thread, a structure as in thread-pet
CODE runthread
\ Remember windows stack pointer as a handle.
MOV, X| T| BX'| R| SP|
\ Get the workspace from the MS-Windows stack
POP, R| AX| POP, R| AX|
\ Install the workspace
MOV, X| T| SP'| BO| [AX] 0 B,
MOV, X| T| BP'| BO| [AX] 1 CELLS B,
PUSH, R| AX|
PUSH, R| BX|
\ Switch to high level code
LEA, SI'| MEM| 'doit >PHA L, NEXT,
END-CODE

\ Run thread as a Windows thread in its workspace, leave thread-id ,
: CREATE-THREAD
CALL[ PARS @ 6 CELLS ERASE
'runthread >CFA @ PAR3 \ Startaddress
( thread) PAR4 \ Parameter
\ CREATE-SUSPENDED PAR5 \ Conditions
_Create-Thread CALL]
DUP 0= 1001 AND THROW \ DUP 1001 ?ERROR
;

\ Create a 4 cell structure,
\ -empty data stack
\ -empty return stack
\ -execution token
\ -thread id

: THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @

ROT RSP! ( DSP) , ( TASK) , ( xt ) 0 , ( pid) 0 ,
DOES> ( xt -- ) \ Execute in parallel
DSP@ H. CR
>R R@ 2 CELLS + ! R@ CREATE-THREAD R> 3 CELLS + !
DSP@ H. CR
\ DELAY
DSP@ H. CR ;

\ ----------- test -----------------------
1000 THREAD-PET JAN
VARIABLE aapje 0 aapje !
: testje 123456789 aapje ! ;
\ ' testje jan

The last test shows that it works.

This would be the solution, but ...

Windows is very helpful. It keeps track of the stacks of
all threads. If it thinks a thread is out of bounds, it
allocates more stack space.
So the main Forth suddenly sees the stack pointer changed
from e.g. UNUSED 33M to UNUSED 20K.
The symptom shows at the three points in THREAD-PET where the
stack pointer is printed ( DSP@ H. CR ).
Two times 33M, then 20K, has been observed many times.

>
>Peter

Albert van der Horst

Jun 23, 2018, 1:35:02 PM

In article <pgitvf$r6i$1...@cherry.spenarnc.xs4all.nl>,

Albert van der Horst <alb...@cherry.spenarnc.xs4all.nl> wrote:
>In article <1bad26a6-8e36-4748...@googlegroups.com>,
> <peter....@gmail.com> wrote:
>>On Monday, 18 June 2018 12:06:41 UTC+2, Albert van der Horst wrote:
>>> In article <9a14a362-6032-4b90...@googlegroups.com>,
>>> <peter....@gmail.com> wrote:
>>> <SNIP>
>>> >
>>> >

>>> >I tried to run it in a Linux Virtual machine. 16 cores 16 GB ram.
>>> >But that segfaults with lina8G. I will try with more ram.
>>> >
>>> >Is it possible to run the program on wina64?

Not even on wina32 when you asked the question.

>>>
>>> Programs are portable accross ciforth, except for availability of
>>> library functions.

I implemented THREAD-PET for wina 32bits , which amounts to adding
2 screens to the forth.lab library, see below.
Now r10par.frt (originally for lina 32/64) runs unaltered under wine
with the -s (scripting) option:
wine wina.exe -s r10par.frt -p 6 2,000,000,000

[Not only is 2G about the largest a 32 bit Forth can do, but the
answer is easy to remember: 98 2222 87 primes. ]
Tests with 20 threads on wine and a few threads on a one core
Celeron under Windows-XP all give correct results, but the speed is terrible.

<SNIP>

>I've done some work and I have come this far:
>
>\ Design issue for threads under MS-Windows
>\ If you start using MS handles, you're lost because it is
>\ impossible to know what happens.
>\ The only thing we can be sure about is that CreateThread
>\ passes control to a code address, and that that code can
>\ end by a ret instruction. Then cleanup will be matched to startup.
>\ So we do all our Forth stuff between a CALL and RET instruction
>\ and don't touch MS-Windows business.
>

This strategy worked after all. As long as no DLL calls are made
in the threads these two screens do the job.

\ ----------------------------------------------
( runthread ) CF: ?WI ?32
WANT ASSEMBLERi86-HIGH
MAX-USER @ DUP USER ME CELL+ MAX-USER !
CODE return RET, END-CODE \ Return via the stack
: EXIT-PET ME @ DSP! return ;
\ Run xt from thread to completion, remember handle .

: doit ME ! DUP 2 CELLS + @ CATCH DUP IF ERROR THEN

EXIT-PET ;
CODE runthread \ Run the thread passed via the MS-stack.

MOV, X| T| BX'| R| SP|

POP, R| AX| POP, R| AX|

MOV, X| T| SP'| BO| [AX] 0 B,
MOV, X| T| BP'| BO| [AX] 1 CELLS B,
PUSH, R| AX| PUSH, R| BX|

LEA, SI'| MEM| 'doit >PHA L, NEXT,

END-CODE TRIM
\ ----------------------------------------------
( THREAD-PET KILL-PET ) CF: ?WI ?32
WANT K32 CTA runthread
"CreateThread" 'K32 DLL-ADDRESS: _Create-Thread
"TerminateThread" 'K32 DLL-ADDRESS: _Terminate-Thread
_Terminate-Thread DROP _Create-Thread DROP
\ Start thread : thread-id .

: CREATE-THREAD CALL[ PARS @ 6 CELLS ERASE

'runthread >CFA @ PAR3 ( thr) PAR4 _Create-Thread CALL]
DUP ?ERRUR ;
\ Use space for "thread" / run xt in thread.

: THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
ROT RSP! ( DSP) , ( TASK) , ( xt ) 0 , ( pid) 0 ,

DOES> >R R@ 2 CELLS + ! R@ CREATE-THREAD
R> 3 CELLS + ! ;
: KILL-PET >BODY 3 CELLS + @ \ Forced kill of thread .
CALL[ PAR1 0 PAR2 _Terminate-Thread CALL] ;
\ ----------------------------------------------

This is the test that succeeds; the CPU usage can be
inspected with `` top '' or an equivalent MS-Windows tool.
\ ----------------------------------------------
1000 THREAD-PET jan

VARIABLE aapje 0 aapje !

: testje 123456789 aapje ! ;

: loopie 123456789 aapje ! BEGIN AGAIN ;

.( expect 0 :) aapje ? CR
'testje jan
100 MS
.( expect 123.. :) aapje ? CR
0 aapje !
.( expect 0 :) aapje ? CR
'loopie jan
100 MS
.( expect 123.. :) aapje ? CR
.( expect 100% usage )
10000 MS
'jan KILL-PET
.( expect 0% usage )

\ -----------------------------

I have someone working on the 64-bit wina now.
The nice thing is of course a multithreading facility that is portable
across 32 and 64 bit Linux and Windows.

P.S. Oops. doit is a terrible name, but at least it is a copy-paste
of working code.