
Hardware simulation in Forth


Brad Eckert

Sep 9, 2017, 2:53:09 PM
Hi All,

I discovered a one-screen discrete event simulation this morning, suitable for simulating hardware in Forth before re-coding in Verilog. Its performance depends on a fast CMOVE.

In VFX, a small hardware simulation (simple counter) ran at ~30 MHz on my laptop. SwiftForth was ~11 MHz, but then it has a simplistic CMOVE. It seems like a good way to do hardware development on components you want to simulate in real time.

\ Discrete Event Simulation for Hardware Sims 9/9/17 BNE
64 value |regs| variable regptr \ 16 32-bit regs
|regs| 2* buffer: regs regs regptr !

: WIRE ( <name> -- ) create 0 , ['] ! , does> @ ;
: r! ( n addr -- ) @ |regs| + ! ;
: REG ( <name> -- ) create regptr @ , ['] r! ,
1 cells regptr +! does> @ @ ;
: /CLK ( -- ) regs dup |regs| + |regs| cmove ;
: CLK/ ( -- ) regs |regs| + regs |regs| cmove ;
: POR ( -- ) regs |regs| erase ;
: (=>) ( addr -- ) dup cell+ @ execute ;
: => ( n <name> -- ) state @
if ' >body postpone literal postpone (=>)
else ' >body (=>) then
; immediate

\ Support for smaller registers
: WIRE8 ( <name> -- ) create 0 , ['] c! , does> c@ ;
: rc! ( n addr -- ) @ |regs| + c! ;
: REG8 ( <name> -- ) create regptr @ , ['] rc! ,
1 regptr +! does> @ c@ ;

\ counter demo
REG tally POR
WIRE foo

: CYCLE ( -- ) \ synchronous logic
/CLK
tally 1+ => tally
tally 9 = if
0 => tally
then
CLK/
;
: cycles ( n -- )
0 ?do cycle tally . loop
;
[undefined] ticks [if] : ticks counter ; [then]
: bench ( -- )
ticks 1000000 0 do cycle loop
ticks swap - . ." ns/cycle"
;

cr bench \ i7 laptop SwiftForth = 11MHz, VFX = 32MHz
cr foo . 11 => foo foo .
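The double-buffered register idea above (reads come from one bank, writes go to the other, the clock edge commits) can be sketched in Python for readers who don't think in Forth; `Sim`, `read`, and `write` are my own illustrative names, not part of Brad's code:

```python
# Sketch of the two-bank register scheme: reads come from the "current"
# bank, writes go to the "next" bank, and the clock edge commits the next
# bank (the role CLK/ plays with its CMOVE in the Forth code above).

class Sim:
    def __init__(self, nregs):
        self.cur = [0] * nregs      # values visible during this cycle
        self.nxt = [0] * nregs      # values being computed for the next cycle

    def read(self, i):
        return self.cur[i]          # combinational inputs see old values

    def write(self, i, value):
        self.nxt[i] = value         # registered outputs land in the new bank

    def clock(self):
        self.cur = self.nxt.copy()  # all updates commit at once on the edge

# counter demo, mirroring CYCLE above: count up and wrap after 9
sim = Sim(1)
TALLY = 0

def cycle():
    sim.write(TALLY, 0 if sim.read(TALLY) == 9 else sim.read(TALLY) + 1)
    sim.clock()

for _ in range(12):
    cycle()
```

Because every write lands in the next bank, the order in which the "concurrent" assignments are simulated within a cycle cannot disturb each other's inputs.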

hughag...@gmail.com

Sep 9, 2017, 3:50:57 PM
I glanced over your code, but didn't examine it carefully.

I did something similar in my simulator for the MiniForth. Up to 5 instructions could be packed into a single opcode and would all execute concurrently in one cycle. The trick I figured out was to have two sets of registers, a SRC and a DST --- the concurrent instructions were simulated sequentially on the desktop computer --- they would use the SRC registers for their inputs and the DST registers for their results. After simulating each opcode, I would move all the DST registers to the SRC registers for the next opcode. Doing this, the concurrent instructions wouldn't mess up each other's registers when they were simulated sequentially. This corresponds to your two blocks of registers in REGS above.

I didn't use CMOVE though --- it would have been too slow to move the data one byte at a time --- there were only a few registers, so a loop wasn't needed; I just used @ and ! explicitly for each register.

Your code is somewhat hard to read. You could have given the two blocks of registers different names (like SRC and DST for example) --- your code could have been better written --- it is not terrible though, so I won't knock it too much.

Jan Coombs

Sep 9, 2017, 4:50:23 PM
On Sat, 9 Sep 2017 11:53:07 -0700 (PDT)
Brad Eckert <hwf...@gmail.com> wrote:

> Hi All,
>
> I discovered a one-screen discrete event simulation this
> morning, suitable for simulating hardware in Forth before
> re-coding in Verilog. Its performance depends on a fast CMOVE.
>
> In VFX, a small hardware simulation (simple counter) ran at
> ~30 MHz on my laptop. SwiftForth was ~11 MHz, but then it has
> a simplistic CMOVE. It seems like a good way to do hardware
> development on components you want to simulate in real time.


For gForth 4.5 MHz, using this:

: ticks ntime 1000000 um/mod nip ;


Can we now build a b16 simulator in Forth?


Jan Coombs
--

rickman

Sep 9, 2017, 5:03:48 PM
You discovered a very crude version of delta time delay. VHDL records
changes to registers as happening a delta time delay after the present time.
Then all changes to all registers at the present time are recorded before
the actual update takes place on the next time point which is one delta
delay later. Delta delays are zero seconds but still have the property of
sequence.
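A minimal Python sketch of that delta-delay rule, where every assignment inside a time step is recorded and applied together one delta later (the class and names are my own illustration, not full VHDL semantics):

```python
# Delta-delay sketch: assignments made during a time step are only
# scheduled; they all take effect together one "delta" later, so every
# process in that step still sees the pre-update values.

class Signals:
    def __init__(self, **initial):
        self.now = dict(initial)       # values visible at the present time
        self.pending = {}              # changes recorded for the next delta

    def __getitem__(self, name):
        return self.now[name]

    def __setitem__(self, name, value):
        self.pending[name] = value     # recorded, not yet visible

    def delta(self):
        self.now.update(self.pending)  # all recorded changes land at once
        self.pending.clear()

s = Signals(a=1, b=2)
s["a"] = s["b"]    # both right-hand sides read the old values...
s["b"] = s["a"]
s.delta()          # ...so a and b swap cleanly: a == 2, b == 1
```

Because both assignments read the pre-update values, `a` and `b` swap cleanly, which is exactly what naive sequential simulation without the delta mechanism would get wrong.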


> I didn't use CMOVE though --- it would have been too slow to move the data one byte at a time --- there were only a few registers, so a loop wasn't needed; I just used @ and ! explicitly for each register.
>
> Your code is somewhat hard to read. You could have given the two blocks of registers different names (like SRC and DST for example) --- your code could have been better written --- it is not terrible though, so I won't knock it too much.

Don't knock it too much, knock it just enough ;)

--

Rick C

Viewed the eclipse at Wintercrest Farms,
on the centerline of totality since 1998

rickman

Sep 9, 2017, 5:04:43 PM
What has been stopping anyone until now?

hughag...@gmail.com

Sep 9, 2017, 11:03:54 PM
On Saturday, September 9, 2017 at 2:03:48 PM UTC-7, rickman wrote:
> hughag...@gmail.com wrote on 9/9/2017 3:50 PM:
> > I did something similar in my simulator for the MiniForth. Up to 5 instructions could be packed into a single opcode and would all execute concurrently in one cycle. The trick I figured out was to have two sets of registers, a SRC and a DST --- the concurrent instructions were simulated sequentially on the desktop computer --- they would use the SRC registers for their inputs and the DST registers for their results. After simulating each opcode, I would move all the DST registers to the SRC registers for the next opcode. Doing this, the concurrent instructions wouldn't mess up each other's registers when they were simulated sequentially. This corresponds to your two blocks of registers in REGS above.
>
> You discovered a very crude version of delta time delay. VHDL records
> changes to registers as happening a delta time delay after the present time.
> Then all changes to all registers at the present time are recorded before
> the actual update takes place on the next time point which is one delta
> delay later. Delta delays are zero seconds but still have the property of
> sequence.

Rickman --- I have told you a dozen times: piss off!

You are a troll --- you have never written any Forth code and know nothing about the subject ---
you troll comp.lang.forth because you know that if you brown-nose Elizabeth Rather
she and the Forth-200x committee will grant you expert status in Forth, just like Passaniti etc.,
and give you a place to make a pest of yourself, which seems to be what you enjoy the most.

Stop pestering me --- go away!

hughag...@gmail.com

Sep 9, 2017, 11:24:04 PM
On Saturday, September 9, 2017 at 1:50:23 PM UTC-7, Jan Coombs wrote:
> Can we now build a b16 simulator in Forth?

You don't need anything like this to simulate the B16 ---
the B16 doesn't have instructions executing concurrently ---
simulating the B16 would be similar to simulating the 6502 or any other processor.

I do have <SWITCH and FAST-SWITCH> in the novice-package (not released yet).
This is better than CASE because it generates a sparse jump-table, so it is orders of magnitude faster.
This would work well for simulating a processor such as the 6502 or whatever.
FAST-SWITCH> works with 16-bit opcodes, so it could simulate the AVR8 etc..
For processors with larger opcodes you would need to use SLOW-SWITCH>
that generates a packed jump-table and then does a BSEARCH each time.

The B16 only has 32 instructions --- I don't think it is worth bothering with.

I designed a processor called the 65ISR that is an experiment in how simple a processor can be
and still be useful. It has 95 instructions.

A processor with only 32 instructions is a toy.
It is not going to be efficient enough for any real-world use.

Cecil Bayona

Sep 10, 2017, 1:31:10 AM
All this CPU emulation might be quite handy, as one could simulate a
processor and work out the instruction set before trying to put it in an
FPGA.

I have some code that I have been working on, a Forth cross compiler that
could easily be modified to do the emulation instead. It's in C++ and
partially translated to the Go language for the core, and the rest is in
Forth. Lately I have been feeling poorly, so work has been slow, but that
health issue is coming under control, so I can be back to being active on
the project soon.

--
Cecil - k5nwa

Cecil Bayona

Sep 10, 2017, 1:31:10 AM
On 9/9/2017 10:24 PM, hughag...@gmail.com wrote:
32 instructions is not too bad if it's a Forth CPU, but I tend to like
having more than 32, more like 64 Forth primitives.

--
Cecil - k5nwa

rickman

Sep 10, 2017, 2:01:33 AM
Dude, you are a total nutcase. You made an intelligent, rational post and I
replied in an intelligent, rational way. Your response was to go schizoid
and start ranting about the Forth-200x committee and people who have nothing
to do with this topic.

Jeeze, if you can't respond rationally to a rational post maybe you would be
better off just ignoring them?

rickman

Sep 10, 2017, 2:16:25 AM
hughag...@gmail.com wrote on 9/9/2017 11:24 PM:
> On Saturday, September 9, 2017 at 1:50:23 PM UTC-7, Jan Coombs wrote:
>> Can we now build a b16 simulator in Forth?
>
> You don't need anything like this to simulate the B16 ---
> the B16 doesn't have instructions executing concurrently ---
> simulating the B16 would be similar to simulating the 6502 or any other processor.

I don't think you understand the idea of parallelism. You need to use the
delta delay concept even when an instruction stream is executed singly if
there are multiple registers updated on the same clock cycle which have
cyclical dependencies. In the same clock cycle the PC is used to fetch an
opcode and updated to point to the next instruction while registers are used
as inputs to operations and written back to the same registers and other
registers can be used as memory addresses and incremented/decremented. This
can all be hard coded to preserve the ordering in a simple enough processor,
but if the instructions have a lot of parallel operations this can be error
prone. I suppose that the B16 isn't so complex, but the parallel nature of
register updates has to be considered when coding.


> I do have <SWITCH and FAST-SWITCH> in the novice-package (not released yet).
> This is better than CASE because it generates a sparse jump-table, so it is orders of magnitude faster.
> This would work well for simulating a processor such as the 6502 or whatever.
> FAST-SWITCH> works with 16-bit opcodes, so it could simulate the AVR8 etc..
> For processors with larger opcodes you would need to use SLOW-SWITCH>
> that generates a packed jump-table and then does a BSEARCH each time.
>
> The B16 only has 32 instructions --- I don't think it is worth bothering with.

Exactly.


> I designed a processor called the 65ISR that is an experiment in how simple a processor can be
> and still be useful. It has 95 instructions.

That doesn't sound so simple. In fact, that sounds pretty complex. Unless
you count all the possible combinations of register-to-register moves and
ALU operations, I'm pretty sure the 8080 and similar ISAs have fewer unique
instructions.


> A processor with only 32 instructions is a toy.
> It is not going to be efficient enough for any real-world use.

LOL! I suppose the F18A is also a toy? Well, you might be right about that
one, but not because of the instruction count.

You actually have no real understanding of the term "efficiency", or more
accurately, it means just what you choose it to mean, neither more nor less.
So how do you define "efficiency"?

rickman

Sep 10, 2017, 2:32:45 AM
Cecil Bayona wrote on 9/10/2017 12:41 AM:
> On 9/9/2017 10:24 PM, hughag...@gmail.com wrote:
>> On Saturday, September 9, 2017 at 1:50:23 PM UTC-7, Jan Coombs wrote:
>>> Can we now build a b16 simulator in Forth?
>>
>> You don't need anything like this to simulate the B16 ---
>> the B16 doesn't have instructions executing concurrently ---
>> simulating the B16 would be similar to simulating the 6502 or any other
>> processor.
>>
>> I do have <SWITCH and FAST-SWITCH> in the novice-package (not released yet).
>> This is better than CASE because it generates a sparse jump-table, so it
>> is orders of magnitude faster.
>> This would work well for simulating a processor such as the 6502 or whatever.
>> FAST-SWITCH> works with 16-bit opcodes, so it could simulate the AVR8 etc..
>> For processors with larger opcodes you would need to use SLOW-SWITCH>
>> that generates a packed jump-table and then does a BSEARCH each time.
>>
>> The B16 only has 32 instructions --- I don't think it is worth bothering
>> with.
>>
>> I designed a processor called the 65ISR that is an experiment in how
>> simple a processor can be
>> and still be useful. It has 95 instructions.
>>
>> A processor with only 32 instructions is a toy.
>> It is not going to be efficient enough for any real-world use.
>>
>
> All this CPU emulation might be quite handy as one could simulate a
> processor and work out the instruction set before trying to put on in a FPGA.

I don't think you understand how to design FPGAs. The process is to
optionally design the architecture in diagrams or in a specification
followed by an HDL design, simulation and finally synthesis and bench
testing. The simulation would be easier to perform than writing your own
simulator in Forth by a long shot. The Forth simulation may run faster
however since it is at a very simple Register Transfer Level (RTL) rather
than being simulated in an HDL simulator that has to deal with all the
details of the HDL language. Verifying the function of the CPU would not be
overly complex in HDL and the simulation support in HDL allows test benches
to easily automate testing of the instruction set functionality.


> I have some code that I been working on, a Forth Cross compiler that could
> easily be modified to do the emulation instead. It's in C++ and partially
> translated to the GO language for the core and the rest is in Forth.

Why reinvent the wheel when simulators are already available? The HDL for
the J1, the b16 and other stack processors is already written. Just provide
a test bench and you are off!


> Lately I been feeling poorly so work has been slow but that health issue coming
> under control so I can be back to being active on the project soon.

Hope you feel better soon.

rickman

Sep 10, 2017, 2:33:22 AM
Why? What is missing from the F18A or b16?

pahihu

Sep 10, 2017, 4:27:42 AM
Nice.

Use the following to speed up somewhat:

: Nregs ( -- n ) regPtr @ regs - ;
: /CLK ( -- ) regs dup |regs| + Nregs cmove ;
: CLK/ ( -- ) regs |regs| + regs Nregs cmove ;

On my 1.4GHz Core2 notebook: 118 ns/cycle vs 352 ns/cycle using SwiftForth.

pahihu

Jan Coombs

Sep 10, 2017, 6:09:54 AM
Similar 40% speedup on gForth, 131ns vs 213ns.


Have moved compatibility layer to top of file:

\ portability layer

[undefined] ticks [if]
[defined] counter [if] : ticks counter ; [then]
[defined] ntime [if] : ticks ntime 1000000 um/mod nip ;
[then] [then]


I'm not sure why we are quoting the counter cycle rate rather
than the simulated clock rate, so modified the bench code to:

: bench ( -- )
ticks 1000000 0 do cycle loop
ticks swap - dup cr . ." ns/CounterCycle"
10 / dup cr . ." ns/SimClockCycle"
1000 swap / cr . ." MHz Sim Clock"
;

Am I mistaken?


Jan Coombs

Jan Coombs

Sep 10, 2017, 6:33:28 AM
On Sat, 9 Sep 2017 12:50:55 -0700 (PDT)
hughag...@gmail.com wrote:

> I glanced over your code, but didn't examine it carefully.
>
> I did something similar in my simulator for the MiniForth. Up
> to 5 instructions could be packed into a single opcode and
> would all execute concurrently in one cycle.

I think that both the subject and content indicated that Brad
was demonstrating code to do low level RTL logic simulation.
There is considerable distance and possibly months of work
between this and your point.

To preserve the value of postings it can be helpful if the
subject is also changed to reflect the content. Your criticism
of the posted code was not appropriate given your verifiable
opening line.

Feel free to email me privately if you wish to share insight on
simulatable and synthesizable processor design techniques. We
could probably replicate the hardware for your work with JH, and
revive your previous leading edge project.

Jan Coombs

pahihu

Sep 10, 2017, 8:57:46 AM
Actually we don't need to move the elements, just keep track of the current
and new register banks.

I've modified /CLK just to switch current bank to 0, and new bank to |regs| apart.
POR uses /CLK to initialise the banks, CLK switches the banks.

pahihu


\ Discrete Event Simulation for Hardware Sims 9/9/17 BNE

64 Value |regs| Variable regPtr \ 16 32-bit regs
|regs| 2* Buffer: regs regs regPtr !

0 Value regsCurr
|regs| Value regsNew

: WIRE ( <name> -- ) Create 0 , ['] ! , DOES> @ ;
: r! ( n addr -- ) @ regsNew + ! ;
: REG ( <name> -- ) Create regPtr @ , ['] r! ,
1 cells regPtr +! DOES> @ regsCurr + @ ;
: /CLK ( -- ) 0 TO regsCurr |regs| TO regsNew ;
: CLK ( -- ) regsCurr regsNew TO regsCurr TO regsNew ;
: POR ( -- ) regs |regs| erase /clk ;
: (=>) ( addr -- ) dup cell+ @ execute ;
: => ( n <name> -- ) state @
IF ' >body POSTPONE literal POSTPONE (=>)
ELSE ' >body (=>) THEN
; IMMEDIATE

\ Support for smaller registers
: WIRE8 ( <name> -- ) Create 0 , ['] c! , DOES> c@ ;
: rc! ( n addr -- ) @ regsNew + c! ;
: REG8 ( <name> -- ) Create regPtr @ , ['] rc! ,
1 regPtr +! DOES> @ regsCurr + c@ ;

\ counter demo
REG tally POR
WIRE foo

: CYCLE ( -- ) \ synchronous logic
tally 1+ => tally
tally 9 = IF
0 => tally
THEN
CLK
;

: cycles ( n -- )
0 ?DO cycle tally . LOOP
;

: bench ( -- )
ticks 1000000 0 DO cycle LOOP
ticks swap - . ." ns/cycle"
;

pahihu

Sep 10, 2017, 9:04:23 AM
When all REGs change their value in the current cycle, it works.
To preserve the values of unmodified REGs we need to copy the state, of course.

pahihu

m...@iae.nl

Sep 10, 2017, 11:54:28 AM
Something must be wrong (apart from division by zero)

4551 ps/CounterCycle
455 ps/SimClockCycle
2.197802 GHz Sim Clock ok
FORTH>

: bench ( -- )
MS? #1000000 #1000 * 0 do cycle loop

MS? swap - dup cr . ." ps/CounterCycle"
#10 / dup cr . ." ps/SimClockCycle"
1000e S>F F/ cr F. ." GHz Sim Clock" ;

FORTH> .ticker-info
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
x0 iTSC, TICKS-GET uses iTSC at 3402MHz

-marcel

rickman

Sep 10, 2017, 12:32:18 PM
Jan Coombs wrote on 9/10/2017 6:33 AM:
> On Sat, 9 Sep 2017 12:50:55 -0700 (PDT)
> hughag...@gmail.com wrote:
>
>> I glanced over your code, but didn't examine it carefully.
>>
>> I did something similar in my simulator for the MiniForth. Up
>> to 5 instructions could be packed into a single opcode and
>> would all execute concurrently in one cycle.
>
> I think that both the subject and content indicated that Brad
> was demonstrating code to do low level RTL logic simulation.
> There is considerable distance and possibly months of work
> between this and your point.

I'm not sure it would take months of work. I didn't go through the
simulation code, so I don't know if it is actually working enough to run
simulations. But once it is, it should not be hard to create a model of the
b16 or any other basic stack processor. Rather than simulate memory as a
large bank of registers with random logic to handle the address decoding, a
memory model can be constructed as a single entity improving the speed and
memory size.


> To preserve the value of postings it can be helpful if the
> subject is also changed to reflect the content. Your criticism
> of the posted code was not appropriate given your verifiable
> opening line.
>
> Feel free to email me privately if you wish to share insight on
> simulatable and synthesizable processor design techniques. We
> could probably replicate the hardware for your work with JH, and
> revive your previous leading edge project.

I don't understand the need for constructing a simulator in Forth. Adding
various features would be needed such as finding multiple drivers on a wire
when it is not valid and resolving states when multiple drivers are valid.

HDL simulators exist and work well. Why reinvent the wheel? Is the idea to
gain efficiency to allow complex simulations to proceed faster?

JUERGEN

Sep 10, 2017, 12:51:17 PM
Thanks Brad, I always wondered why nobody has looked at the next step: converting this Forth code into VHDL or even gates directly. It has been done as C to Gates: see Celoxica and Handel-C, which could be modified to a Handel-Forth; see https://en.wikipedia.org/wiki/C_to_HDL. I had been talking to Ian Page at the time, and he said it would be relatively easy. Interactive VHDL -

JUERGEN

Sep 10, 2017, 1:05:17 PM
Just remembered an article I did in the past about this theme: https://www.design-reuse.com/articles/7656/fast-route-from-system-specification-to-implementation.html or, a bit more scientific, from Oxford: https://core.ac.uk/display/21063760 The nice feature here is that you can have Mixed Language Design: some parts in gates and some parts in SW (Forth).

JUERGEN

Sep 10, 2017, 1:17:47 PM
Actually this link is better, what I really wanted to post, an article done by Ian Page - a veeeery little processor included ...

JUERGEN

Sep 10, 2017, 1:20:37 PM

hughag...@gmail.com

Sep 10, 2017, 1:52:26 PM
On Sunday, September 10, 2017 at 3:33:28 AM UTC-7, Jan Coombs wrote:
> On Sat, 9 Sep 2017 12:50:55 -0700 (PDT)
> hughag...@gmail.com wrote:
>
> > I glanced over your code, but didn't examine it carefully.
> >
> > I did something similar in my simulator for the MiniForth. Up
> > to 5 instructions could be packed into a single opcode and
> > would all execute concurrently in one cycle.
>
> I think that both the subject and content indicated that Brad
> was demonstrating code to do low level RTL logic simulation.
> There is considerable distance and possibly months of work
> between this and your point.
>
> To preserve the value of postings it can be helpful if the
> subject is also changed to reflect the content. Your criticism
> of the posted code was not appropriate given your verifiable
> opening line.

I wasn't criticizing his code because it simulated a low-level circuit.
My criticism of the code was that it was hard to read.
Primarily, this was because the two register sets weren't in separate arrays, but were juxtaposed in one array.
Then he would use an offset from the front of the first array to get the base address of the second array.
That was needlessly complicated. If you have two arrays, just put them in two distinct arrays each with its own name.
I also said that his code wasn't terrible. It was somewhat over-complicated, but not bad really.
I'm not totally knocking his code. He has the basic idea of the two register sets figured out ---
I remember that I struggled to figure that out ---
I remember that the whole concept of concurrency was pretty mind-blowing for me.

Anyway, simulating instructions that execute in parallel presents the same problem whether low-level or high-level.
Essentially, the problem is that some instructions may set a register, and other instructions may use that same register.
If they happen in parallel, then the instruction that uses the register uses the original value.
They have to be simulated in sequence though, because that is how desktop computers work.
So, we have two sets of registers, a SRC and a DST --- everything uses the SRC registers and sets the DST registers.
Afterward, move all the DST values to the SRC registers so the next opcode can be simulated.

hughag...@gmail.com

Sep 10, 2017, 2:12:14 PM
On Sunday, September 10, 2017 at 3:33:28 AM UTC-7, Jan Coombs wrote:
> Feel free to email me privately if you wish to share insight on
> simulatable and synthesizable processor design techniques. We
> could probably replicate the hardware for your work with JH, and
> revive your previous leading edge project.
>
> Jan Coombs

Google Groups isn't giving me the option of sending a private email.
My email is: hughag...@gmail.com
Send me an email so I can obtain your email address, then I will send you an email back with this information.

My 65ISR design is not similar to the MiniForth/RACE at all.
It only supports interrupts, but not a main-program --- so it is only for "event-driven" applications.
I think a Forth compiler can be written for it (I'm going to try), but it would be tough because there is no stack.
VFX does optimization (holding values in registers) to minimize stack use, so I would do this to eliminate stack use.

Note that the MiniForth (now called RACE and upgraded to an FPGA) is proprietary to JH (John Hart),
so it would be illegal or at least unethical to second-source it without his permission.
I don't think he ever will give anybody permission to second-source it --- he will take it to the grave.
He still makes money on it. All of his motion-control boards since 1994 have been based on it ---
so from his point of view, there is no need to "revive" it, as it is still going strong.

hughag...@gmail.com

Sep 10, 2017, 2:17:17 PM
On Sunday, September 10, 2017 at 5:57:46 AM UTC-7, pahihu wrote:
> Actually we don't need to move the elements, just keep track the current and new
> register bank.

That is a good point. I didn't look at your code, but I think I understand what you mean.

Don't move all the values, just swap a pointer to the different arrays.
If you have a lot of registers (16 is a lot), this will be faster.
The MiniForth had very few registers, so I just moved the values themselves.
This was likely faster because there was no indirection needed to access the values in the simulation.

hughag...@gmail.com

Sep 10, 2017, 2:29:09 PM
On Saturday, September 9, 2017 at 10:31:10 PM UTC-7, Cecil - k5nwa wrote:
> All this CPU emulation might be quite handy as one could simulate a
> processor and work out the instruction set before trying to put on in a
> FPGA.
>
> I have some code that I been working on, a Forth Cross compiler that
> could easily be modified to do the emulation instead. It's in C++ and
> partially translated to the GO language for the core and the rest is in
> Forth. Lately I been feeling poorly so work has been slow but that
> health issue coming under control so I can be back to being active on
> the project soon.

Well, all of the software for the MiniForth was developed in simulation.
A physical MiniForth board didn't exist until quite late in the project.

My simulator didn't decode the opcodes while simulating, because that would have been slow.
My assembler would generate machine-code for the target processor, and also generate code for the host machine.
The simulator ran the code for the host machine to do the simulation.
This is very fast! I had a Pentium at work and an 80486 at home, and it was fast enough.
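The scheme described above, assembling once into host-executable form so the simulator never decodes opcodes at run time, can be sketched in Python with closures standing in for generated host code (the tiny two-instruction "target" here is invented for illustration, not the MiniForth):

```python
# Compile-to-host simulation sketch: the assembler emits one host-side
# callable per target instruction, so simulating is just calling them in
# order, with no decode loop.

def assemble(source):
    regs = {"a": 0, "b": 0}
    ops = []
    for line in source:
        mnemonic, *args = line.split()
        if mnemonic == "li":            # load immediate: li reg value
            r, v = args[0], int(args[1])
            ops.append(lambda r=r, v=v: regs.__setitem__(r, v))
        elif mnemonic == "add":         # add src into dst: add dst src
            d, s0 = args
            ops.append(lambda d=d, s0=s0:
                       regs.__setitem__(d, regs[d] + regs[s0]))
        else:
            raise ValueError("unknown mnemonic: " + mnemonic)
    return regs, ops

def simulate(regs, ops):
    for op in ops:        # no decode step: each op is already host code
        op()
    return regs

regs, ops = assemble(["li a 2", "li b 3", "add a b"])
simulate(regs, ops)       # leaves regs["a"] == 5
```

The cost of decoding is paid once at assembly time instead of on every simulated cycle, which is why this approach was fast enough on 1990s hardware.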

You can also make your simulator obtain information about program execution.
This is the kind of information that a profiler would be used to obtain.
This is easier to do in a simulator though, and more accurate because it counts clock cycles.

Note that this only works when you have the source-code to your program.
This would be a bad technique for the 6502 because you don't have source-code (commercial games sold in the 1980s).

hughag...@gmail.com

Sep 10, 2017, 2:42:50 PM
On Saturday, September 9, 2017 at 11:16:25 PM UTC-7, rickman wrote:
> hughag...@gmail.com wrote on 9/9/2017 11:24 PM:
> I don't think you understand the idea of parallelism.
> ...
> You actually have no real understanding of the term "efficiency"

Piss off!

Brad Eckert

Sep 10, 2017, 7:36:18 PM
The basic idea is to simulate a CPU at the RTL level to play around with the instruction set and run it in real-enough time. Possibly in the 10 MHz range. The simulation can generate test vectors to verify that the Verilog/VHDL version is cycle-accurate in all the registers. This is different from an instruction set simulator, which matches hardware only through careful vetting.
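The vector-capture step can be sketched in Python, using the counter from earlier in the thread as the model (the file format, function names, and snapshot layout are my own invention, not Brad's):

```python
# Test-vector capture sketch: run the RTL-level model and record the
# register state after every clock, so an HDL test bench can compare the
# Verilog version cycle by cycle.

def counter_model(cycles):
    tally = 0
    vectors = []
    for cyc in range(cycles):
        tally = 0 if tally == 9 else tally + 1   # same logic as CYCLE above
        vectors.append((cyc, tally))             # one snapshot per clock
    return vectors

def to_vector_file(vectors):
    # one "cycle tally" line per clock, easy to diff against a test bench dump
    return "\n".join("%d %d" % (cyc, t) for cyc, t in vectors)

vecs = counter_model(12)
```

If the Verilog test bench writes the same registers in the same format each cycle, cycle accuracy reduces to a file comparison.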

Jan Coombs

Sep 10, 2017, 7:37:10 PM
On Sat, 9 Sep 2017 11:53:07 -0700 (PDT)
Brad Eckert <hwf...@gmail.com> wrote:

> Hi All,
>
> I discovered a one-screen discrete event simulation this
> morning, suitable for simulating hardware in Forth before
> re-coding in Verilog. Its performance depends on a fast CMOVE.

You're off to a good start! The Python tool that I use ran
this 30,000 times slower, although it is about as fast as any
other free Hardware Description Language simulator [2].

One interesting possibility would be to extract a Verilog
netlist for FPGA place and route tools directly from the Forth
simulation model. Models of FPGA hardware library parts could
be made, maybe starting like this:

: LUT4 ( a b c d MAP -- o ) \
>r 2* + 2* + 2* + \ generates index
negate \ only if -1 used for True
r> swap rshift 1 and ; \ extract result

: LUT4_OR4 ( a b c d -- o ) 0xfffe LUT4 ;

: LUT_XOR2 ( a b -- o ) 0 0 %0000000110 LUT4 ;

: LUT_AND2 ( a b -- o ) 0 0 %0000001000 LUT4 ;

: HalfAdd ( a b -- s c ) 2dup 2>r LUT_XOR2 2r> LUT_AND2 ;

The hardware source must either be written, or processed to a
state so that every component used more than once has a unique
name. Then unique signal names can be derived from these.


Jan Coombs
--
[1] http://myhdl.org/
[2] http://myhdl.org/docs/performance.html


This is the translated hardware description without test harness:

# BradEckert_ForthHardwareSim_TabPretty_2017-09-10.py

from myhdl import always_seq, intbv, Signal, ResetSignal

def CYCLE(tally, clock, POR):

@always_seq(clock.posedge, reset=POR)
def COUNTlogic(): # logic description starts
if tally == 9: tally.next = 0
else: tally.next = tally + 1

return COUNTlogic


# External Signal Declarations

tally = Signal(intbv(0)[4:]) # 4b wide
clock = Signal(bool(0))
POR = ResetSignal(0, active=0, async=True)

# Exports: Verilog, VHDL, Trace Waveform

hughag...@gmail.com

Sep 10, 2017, 8:13:28 PM
On Sunday, September 10, 2017 at 4:36:18 PM UTC-7, Brad Eckert wrote:
> The basic idea is to simulate a CPU at the RTL level to play around with the instruction set and run it in real-enough time. Possibly in the 10 MHz range. The simulation can generate test vectors to verify that the Verilog/VHDL version is cycle-accurate in all the registers. This is different from an instruction set simulator, which matches hardware only through careful vetting.

I think Testra had a simulator at the very low level that you are describing --- I didn't write it --- I didn't do anything in regard to hardware design.

They had their own HDL written in Forth --- LDL (Lattice Design Language) wasn't adequate --- nowadays it is no longer possible to write your own HDL, so they use VHDL for the FPGAs.

I don't know how to use any HDL. I will have to learn Verilog in order to get my 65ISR going --- I don't know anything about the subject at this time though.

Cecil Bayona

Sep 10, 2017, 8:13:57 PM
It won't be a very high performance Forth initially, since it generates
Token Threaded Code, but it will generate code that runs on several CPUs
without re-compiling. Emulating a CPU is not that far away, as the
desktop could run the code locally on the PC to simulate it running on a
target. If it all works out I will then go for the next step, which is to
generate the Virtual Machines for various CPUs that will run the Token
Threaded Code on 16- or 32-bit CPUs. Another possibility for later will
be a JIT compiler that turns the tokens into native code when loaded.
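Token-threaded code as described above can be sketched in Python: the token stream is portable, and each host only needs to supply its own primitive table (the tokens and primitives here are invented for illustration):

```python
# Token-threaded sketch: the compiled program is a stream of small integer
# tokens; each token indexes a table of primitives, so the same token
# stream runs on any host that provides the table.

stack = []

def lit(ip, code):
    stack.append(code[ip])      # push the inline literal that follows
    return ip + 1

def add(ip, code):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return ip

def dup(ip, code):
    stack.append(stack[-1])
    return ip

PRIMS = [lit, add, dup]         # the per-host virtual machine
LIT, ADD, DUP = range(3)

def run(code):
    ip = 0
    while ip < len(code):
        token = code[ip]
        ip += 1
        ip = PRIMS[token](ip, code)   # dispatch through the token table
    return stack

run([LIT, 2, DUP, ADD])         # computes 2 2 + on the simulated stack
```

A JIT, as mentioned above, would replace the `run` loop by translating each token into native code once at load time.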

Besides embedded applications Forth is very handy for creating
assemblers and simulators on PCs. A simulator for an actual CPU is not
that difficult with Forth and later I will look at those tools.

As soon as I feel a little better, which will be soon (Monday I will
pick up new drug prescriptions that seem to work well with the samples I
received), I will start moving the code to the VFX compiler. Due to
ambiguities in the ANS Standard, quite a bit of code works out differently
between VFX and SwiftForth, yet they are both ANS compliant. That is
quite disappointing, to the point that I felt like giving up on Forth,
but I won't; I will just work on tools that do what I need them to do.



--
Cecil - k5nwa

hughag...@gmail.com

Sep 10, 2017, 9:10:34 PM
On Sunday, September 10, 2017 at 5:13:57 PM UTC-7, Cecil - k5nwa wrote:
> It won't be a very high performance Forth initially, since it generates
> Token Threaded Code, but it will generate code that runs on several CPUs
> without re-compiling. Emulating a CPU is not that far away, as the
> desktop could run the code locally on the PC to simulate it running on a
> target. If it all works out I will then go for the next step, which is to
> generate the Virtual Machines for various CPUs that will run the Token
> Threaded Code on 16- or 32-bit CPUs. Another possibility for later will
> be a JIT compiler that turns the tokens into native code when loaded.

What about my suggestion?

I said to use a common assembly-language. Then you can retarget the assembler for different processors.
It is possible with this scheme to write a token-threaded Forth, but I don't recommend doing so.
I would recommend a subroutine-threaded Forth instead, so you can get efficiency.

This is only vaguely related to Forth processors. You could come up with an FPGA processor that exactly matches your universal assembler.
This would work well if your universal assembler was for an existing processor.
We discussed the PDP-11 previously, which is a pretty good choice.
The MSP430 is another good choice. There are FPGA implementations of both already.
Another possibility is a subset of the 16-bit x86 (no segment registers, various other features discarded).

I got part-way through designing my own "hugh-niversal assembly-language" that was a subset of the 16-bit x86.
I'll likely get back on that project again --- it is a pretty good idea.

Cecil Bayona

Sep 10, 2017, 10:59:05 PM
Part of that project is a common assembler, so I have not forgotten it,
but I'm doing it in phases: first tokens, then start converting to a
common assembler once the assembler is fully working. Since I'm limiting
it to 16- and 32-bit CPUs, they will have decent instruction sets and
addressing modes, so once there it should be fairly efficient.

I'm still largely undecided on how to proceed: with tokens it is
fairly simple but slower at execution; with a common assembler it is
more complicated but fairly efficient at execution. So up to now I
have been working on both approaches and doing experiments; I have the
assembler about 30% done and the token VM about 90% done.

My initial thought at the moment is to go with approach #1:

1. Write a VM in Go that executes tokens; it would be easily modified to
create a simulator
2. Use the VM to create a small cross-compiler Forth with about 150 to
200 words
3. Add a Universal Assembler and convert the primitives to assembler.
4. Create a JIT module to convert the tokens to Universal Assembler code
so the code is fast.

I already have task 2 ready, task 1 is almost finished, and I would have
to do tasks 3 and 4 and get the whole thing integrated. That is a
lot of work and kind of convoluted.
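The token VM at the heart of approach #1 above can be sketched in a few lines. This is a hypothetical illustration in Python (the token numbering and the primitives are invented here), not code from Cecil's project:

```python
# A minimal token-threaded dispatch loop: each token is an index into a
# table of primitives, except a LIT token that pushes the next cell.

def run(tokens):
    stack = []
    prims = {
        0: lambda s: s.append(s.pop() + s.pop()),   # ADD
        1: lambda s: s.append(s.pop() * s.pop()),   # MUL
        2: lambda s: s.append(s[-1]),               # DUP
    }
    ip = 0
    while ip < len(tokens):
        tok = tokens[ip]
        if tok == 255:              # LIT: the next cell is a literal
            ip += 1
            stack.append(tokens[ip])
        else:
            prims[tok](stack)       # token-threaded dispatch
        ip += 1
    return stack

# 3 DUP * 4 +   i.e.  3*3 + 4
print(run([255, 3, 2, 1, 255, 4, 0]))
```

A JIT step of the kind mentioned above would walk the same token list once and emit native (or Universal Assembler) code instead of interpreting it.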

Approach #2 is;

1. Write a Universal Common Assembler with a simple cross-compiler
2. Write the primitives in the common assembler
3. Add the rest of the cross-compiling Forth with 150 to 200 words

This is a more complicated approach as it's all or nothing, but overall
the whole effort is simpler because there are fewer tasks to do. I already
have task 3 done, so only tasks 1 and 2 would need to be finished, and of
course the whole thing needs to be tested.

The second approach is the one I believe you favor, and it's the more
traditional way of doing a cross-compiler. With the first approach I
would end up with a working product faster, since after step #2 I would
have working software; it would not generate very fast code, but the
additional steps would make it faster.

What I think I will do is continue working on the Forth VM and the
Universal Common Assembler. If I make a lot of headway with the
assembler, then I will proceed with scheme #2, but if I run into major
issues, then I will proceed with scheme #1, as it would be much simpler
to get something going.

I really like the PDP11, but the FPGA will be rather complex, so it's not
a good first project; it would be rather hard to make the instructions
run in one clock, and quite a few will need multiple clocks to execute.
The MSP430 is a favorite of mine and real hardware is rather inexpensive,
so I was thinking of using it for the initial CPU. I still want a PDP11
in an FPGA in any case, but that is a project for later.

--
Cecil - k5nwa

Anton Ertl

Sep 11, 2017, 5:33:51 AM
Brad Eckert <hwf...@gmail.com> writes:
>Hi All,
>
>I discovered a one-screen discrete event simulation this morning, suitable for simulating hardware in Forth before re-coding in Verilog. Its performance depends on a fast CMOVE.

Using MOVE instead of CMOVE can help a lot. Using the original code
with a 10M-cycle variant of the benchmark, I get (results in Core
i7-4690K cycles):

CMOVE MOVE
2039M 895M vfxlin
3394M 1032M gforth-fast

With pahihu's improvement, there is little difference for this
benchmark, but for a simulation that simulates more registers, you will
see the slowdown effect from cmove again.

I have put the resulting program on
<http://www.complang.tuwien.ac.at/forth/programs/hw-sim.4th>

I leave it to you to package it up for theforth.net.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2017: http://euro.theforth.net/

Brad Eckert

Sep 11, 2017, 12:05:54 PM
On Monday, September 11, 2017 at 2:33:51 AM UTC-7, Anton Ertl wrote:
> Brad Eckert <hwf...@gmail.com> writes:
> >Hi All,
> >
> >I discovered a one-screen discrete event simulation this morning, suitable for simulating hardware in Forth before re-coding in Verilog. Its performance depends on a fast CMOVE.
>
> Using MOVE instead of CMOVE can help a lot. Using the original code
> with a 10M-cycle variant of the benchmark, I get (results in Core
> i7-4690K cycles):
>
> CMOVE MOVE
> 2039M 895M vfxlin
> 3394M 1032M gforth-fast
>
> With pahihu's improvement, there is little difference for this
> benchmark, but for a simulation that simulates more registers, you will
> see the slowdown effect from cmove again.
>
> I have put the resulting program on
> <http://www.complang.tuwien.ac.at/forth/programs/hw-sim.4th>
>
> I leave it to you to package it up for theforth.net.
>
Wow, Thanks! I did change the code to generate inline code for the registers when in SwiftForth. That got it to run faster than VFX. Also, only one MOVE is needed per cycle, not two. The counter demo sim exceeds 100 MHz easily. A CPU simulation should run at least 10 MHz.

\ Discrete Event Simulation for Hardware Sims 9/10/17 BNE

16 value |Q| variable qPtr \ 4 32-bit registers
|Q| 2* buffer: Qbuf Qbuf qPtr !
|Q| 2* buffer: Dbuf Dbuf Qbuf - constant Q>D

: POR ( -- ) Qbuf |Q| erase Dbuf |Q| erase ;

[undefined] DocGen? [if] : SwiftForth ; [then]

[defined] SwiftForth [if]

: $compile ( $ -- ) \ compile counted string
count >r here r@ move r> allot
;
\ Header for register is {Qbuf[n], writer, 'writecode, readcode, writecode}
: (reg!) ( addr -- )
state @ if 2 cells + @ $compile postpone drop
else @ Q>D + ! then
;
: (wire!) ( addr -- )
state @ if 2 cells + @ $compile postpone drop
else ! then
;
: => ( n <name> -- )
' >body dup cell+ @ execute
; immediate

: resolve-count ( a -- ) \ resolve code length
dup here swap - 1- swap c!
;
: compREG ( 'write read write -- )
rot >r
here >r 0 c, [ asm ]
swap [EDI] EBX MOV [ forth ]
r> resolve-count
here r> ! \ resolve 'write
here >r 0 c, [ asm ]
EBX swap [EDI] MOV [ forth ]
r> resolve-count
1 cells qPtr +!
immediate
;

: REG ( <name> -- )
create qPtr @ dup , ['] (reg!) , here swap 0 ,
-origin dup Q>D + compREG
does> state @ if
postpone dup
3 cells + $compile
exit then @ @
;
: WIRE ( <name> -- )
create here -origin >r 0 , ['] (wire!) , here 0 ,
r> dup compREG
does> state @ if
postpone dup
3 cells + $compile
exit then @
;
CODE CLK ( -- )
EBX PUSH
Dbuf ECX ADDR
Qbuf EDX ADDR
|Q| 2 rshift # EBX MOV
BEGIN
0 [ECX] EAX MOV
4 [ECX] ECX LEA
EAX 0 [EDX] MOV
4 [EDX] EDX LEA
EBX DEC
0= UNTIL
EBX POP
RET END-CODE

: ticks counter ;
[else]
: (reg!) ( n addr -- ) @ Q>D + ! ;
: WIRE ( <name> -- ) create 0 , ['] ! , does> @ ;
: REG ( <name> -- ) create qPtr @ , ['] (reg!) ,
1 cells qPtr +! does> @ @ ;
: CLK ( -- ) Dbuf Qbuf |Q| cmove ;
: (=>) ( addr -- ) dup cell+ @ execute ;
: => ( n <name> -- ) ' >body state @
if postpone literal postpone (=>) else (=>) then
; immediate

\ Support for smaller registers
: WIRE8 ( <name> -- ) create 0 , ['] c! , does> c@ ;
: (regC!) ( n addr -- ) @ Q>D + c! ;
: REG8 ( <name> -- ) create qPtr @ , ['] (regC!) ,
1 qPtr +! does> @ c@ ;
[then]

\ counter demo
REG counter POR

: CYCLE ( -- ) \ synchronous logic
counter 1+ => counter
counter 9 = if
0 => counter
then
CLK
;
: cycles ( n -- )
0 ?do cycle counter . loop
;
: bench ( -- )
ticks 2000000 0 do CYCLE CYCLE CYCLE CYCLE CYCLE loop
ticks swap - \ msec for 10M cycles
10000 over 2/ + swap / . ." simulated MHz"
;

\ benchmark results: 92-106 MHz on VFX, 110-140 MHz on inlined SwiftForth
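For readers not fluent in the Forth above, the double-buffered register scheme it implements can be modeled in Python. This is an illustrative sketch, not a translation: reads come from the Q (current) bank, `=>` writes go to the D (next) bank, and CLK copies D into Q, so the order of register updates within a cycle cannot matter.

```python
# Two-phase (Q/D) register model, mirroring REG / => / CLK above.

class Regs:
    def __init__(self, n):
        self.q = [0] * n       # Qbuf: values read during the cycle
        self.d = [0] * n       # Dbuf: values written during the cycle

    def read(self, i):
        return self.q[i]

    def write(self, i, v):     # analogue of  v => reg
        self.d[i] = v

    def clk(self):             # analogue of  CLK
        self.q = list(self.d)

regs = Regs(1)
TALLY = 0

def cycle():
    # synchronous logic: the mod-10 counter from the demo
    regs.write(TALLY, regs.read(TALLY) + 1)
    if regs.read(TALLY) == 9:  # reads the OLD value, as in the Forth code
        regs.write(TALLY, 0)
    regs.clk()

trace = []
for _ in range(12):
    cycle()
    trace.append(regs.read(TALLY))
print(trace)
```

Because the `if` reads the old (Q) value, the counter visits 0 through 9 and wraps, exactly like the Forth CYCLE word.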

Brad Eckert

Sep 11, 2017, 1:37:14 PM
Tested on SwiftForth with pahihu's fix and i7/7500U @ 2.7 GHz. Counter simulates at 270 MHz.

I think the hardware simulation of a Forth computer will have a much higher clock rate than my first desktop computer.

Albert van der Horst

Sep 11, 2017, 2:53:24 PM
In article <2017Sep1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Brad Eckert <hwf...@gmail.com> writes:
>>Hi All,
>>
>>I discovered a one-screen discrete event simulation this morning, suitable for simulating hardware in Forth before re-coding in Verilog. Its performance depends on a fast CMOVE.
>
>Using MOVE instead of CMOVE can help a lot. Using the original code
>with a 10M-cycle variant of the benchmark, I get (results in Core
>i7-4690K cycles):

I'm surprised. On Intel machines I thought both map onto a
REP MOVSB
Now MOVE has to check for overlap of source and destination.
That can only make it slower.
What am I missing here?

>- anton
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

m...@iae.nl

Sep 11, 2017, 3:00:30 PM
I left off the wires because they were not used.

\ Discrete Event Simulation for Hardware Sims 9/10/17 BNE

ANEW -devent

#16 value |Q| variable qPtr \ 4 32-bit registers
create Qbuf |Q| 2* ALLOT Qbuf qPtr !
create Dbuf |Q| 2* ALLOT Dbuf Qbuf - constant Q>D

: POR ( -- ) Qbuf |Q| erase Dbuf |Q| erase ;
:NONAME ( n addr -- ) @ Q>D + ! ; =: REG32!
: REG32 ( <name> -- ) create qPtr @ , REG32! , 1 CELLS qPtr +!
does> @ @ ;

\ Support for smaller registers
:NONAME ( n addr -- ) @ Q>D + c! ; =: REGC!
: REG8 ( <name> -- ) create qPtr @ , REGC! , 1 qPtr +!
does> @ c@ ;

: move4R ( a1 a2 -- ) SWAP @+ ROT !+ SWAP @+ ROT !+
SWAP @+ ROT !+ SWAP @ SWAP ! ;

: CLK ( -- ) Dbuf Qbuf move4R ;
: => ( n <name> -- )
' >body @+ SWAP @ ( -- addr opcode )
CASE
REG32! OF Q>D + ( addr -- ) LITERAL, POSTPONE ! ENDOF
REGC! OF Q>D + ( addr -- ) LITERAL, POSTPONE C! ENDOF
CR H. 1 ABORT" should be (regxx!)"
ENDCASE ; immediate

\ counter demo
REG32 counter POR

: CYCLE ( -- ) \ synchronous logic
counter 1+ => counter
counter 9 = if 0 => counter then
CLK ;

: cycles ( n -- ) 0 ?do cycle counter . loop ;

: bench ( -- )
?ms #20000000 0 do CYCLE CYCLE CYCLE CYCLE CYCLE loop
?ms swap - \ msec for 100M cycles
#100000 over 2/ + swap / . ." simulated MHz"
;

\ benchmark results: 92-106 MHz on VFX, 110-140 MHz on inlined SwiftForth
\ iForth: 389 - 395 simulated MHz ok ( 32bit Forths should be twice faster )

-marcel

Brad Eckert

Sep 11, 2017, 5:46:17 PM
On Monday, September 11, 2017 at 10:37:14 AM UTC-7, Brad Eckert wrote:
> Tested on SwiftForth with pahihu's fix and i7/7500U @ 2.7 GHz. Counter simulates at 270 MHz.
>
> I think the hardware simulation of a Forth computer will have a much higher clock rate than my first desktop computer.

The inliner was bad - I don't know why it worked. I tried a similar inlining trick with VFX, but apparently too many cooks spoil the broth. It gets confused when the application compiles machine code directly. The SwiftForth version works fine:

: compREG ( 'write read write -- )
rot >r \ create inline code
6 c, 8B c, 9F c, swap -origin , \ read code
here r> ! \ resolve 'write
6 c, 89 c, 9F c, -origin , \ write code

Anton Ertl

Sep 12, 2017, 4:44:29 AM
alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>I'm surprised. On Intel machines I thought both map onto a
> REP MOVSB
>Now MOVE has to check for overlap of source and destination.
>That can only make it slower.
>What am I missing here?

REP MOVSB is slow. Very slow. That's because it gives the same
guarantee that CMOVE (or, with the other direction, CMOVE>) gives,
which makes it slow. And apparently the Intel engineers have not
bothered to implement it in a more efficient way; and neither have
SwiftForth nor Gforth for CMOVE. By contrast, MOVE has to check
(which can be done cheaply for the no-overlap case), but once that is
done, it can parallelize the copying as much as is practical. And it
certainly does not use REP MOVSB.
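The semantic difference can be shown with a small Python sketch (illustrative only, not anyone's actual implementation): CMOVE is specified as a low-to-high byte copy, which "smears" data when the regions overlap with destination above source, while a MOVE-style routine spends one cheap comparison to pick a safe copy direction and can then copy as aggressively as it likes.

```python
# CMOVE-style copy vs. overlap-safe MOVE, over a shared bytearray.

def cmove(buf, src, dst, n):
    # always low address to high, as CMOVE guarantees
    for i in range(n):
        buf[dst + i] = buf[src + i]

def move(buf, src, dst, n):
    # one comparison chooses a direction that is safe for overlap
    if dst <= src:
        for i in range(n):
            buf[dst + i] = buf[src + i]
    else:
        for i in reversed(range(n)):
            buf[dst + i] = buf[src + i]

a = bytearray(b"abcdef\0\0")
cmove(a, 0, 2, 6)      # overlapping, dst > src: early bytes propagate
b = bytearray(b"abcdef\0\0")
move(b, 0, 2, 6)       # overlap-safe: a true shifted copy
print(bytes(a), bytes(b))
```

The smearing behavior is exactly what REP MOVSB must also guarantee, which is why implementing it fast is hard; MOVE's check costs almost nothing in the common non-overlapping case.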

I pointed that out to you in
<2015Nov...@mips.complang.tuwien.ac.at> and
<2015Nov...@mips.complang.tuwien.ac.at>, but your reply was
"Nothing changes my mind." So of course you missed all that. And I
guess you will miss it again, for that reason.

Rod Pemberton

Sep 13, 2017, 7:05:20 PM
On Tue, 12 Sep 2017 08:24:51 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> REP MOVSB is slow. Very slow.

Do you have any references for that claim?

(AIR, the programmer's reference manuals for Intel and AMD say the
exact opposite. I.e., that REP MOVS sequence has special circuitry to
make it the fastest way to move data.)


Rod Pemberton
--
To Zuckerberg. The implementation of DACA is illegal. It was
particularly cruel for President Obama to punish young illegal aliens
in the future via deportation after selling them a false hope in the
American Dream.

Coos Haak

Sep 13, 2017, 7:32:04 PM
On Wed, 13 Sep 2017 19:06:30 -0400, Rod Pemberton wrote:

> On Tue, 12 Sep 2017 08:24:51 GMT
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> REP MOVSB is slow. Very slow.
>
> Do you have any references for that claim?
>
> (AIR, the programmer's reference manuals for Intel and AMD say the
> exact opposite. I.e., that REP MOVS sequence has special circuitry to
> make it the fastest way to move data.)
>
>
> Rod Pemberton

The AMD Software Optimization Guide for AMD64 Processors
(# 25112 rev. 3.06, September 2005)
says the opposite in chapter 8.3, Repeated String Instructions.

regards, Coos

Paul Rubin

Sep 14, 2017, 12:50:06 AM
Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
> (AIR, the programmer's reference manuals for Intel and AMD say the
> exact opposite. I.e., that REP MOVS sequence has special circuitry to
> make it the fastest way to move data.)

Maybe in the PC-XT 8088 era. Not since then.

Alex

Sep 14, 2017, 6:12:44 AM
IntelĀ® 64 and IA-32 Architectures Optimization Reference Manual
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)

Beginning with processors based on Intel microarchitecture code name Ivy
Bridge, REP string operation using MOVSB and STOSB can provide both
flexible and high-performance REP string operations for software
in common situations like memory copy and set operations. Processors
that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID
feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1.

For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy
with REP MOVSB will provide even more compact benefits in code size and
better throughput than using the combination of REP MOVSD+B. For
processors based on Intel microarchitecture code name Ivy Bridge,
implementing memcpy using ERMSB might not reach the same level of
throughput as using 256-bit or 128-bit AVX alternatives, depending on
length and alignment factors.

--
Alex

Ron Aaron

Sep 14, 2017, 6:59:08 AM


On 14/09/2017 13:12, Alex wrote:

> 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
...

Huh! Hadn't noticed that before. Very nice to know REP MOVSB is back
in play.

Albert van der Horst

Sep 14, 2017, 12:19:34 PM
With combination you mean first do the bulk with MOVSD, then the remainder
with MOVSB?
It is only natural that Intel makes REP MOVSB / REP MOVSD faster
given the surplus of silicon and the difficulty of putting it to
good use. The 2005 AMD manual still gives the advice to use the
10K+ Microsoft / gcc libc routines for speed.

Within a simple Forth like ciforth I don't want the hassle of
loading a dynamic library or adding 30% to the code for a small gain
in speed; I still think that is the correct trade-off.
(Was I really that rude Anton?)

I measured it: for copying 100 MB, gforth is about 4 times as
fast as lina (both 32-bit versions; gforth takes about 3 seconds).


Regards, Albert

Alex

Sep 14, 2017, 12:45:12 PM
On 14-Sep-17 09:21, Albert van der Horst wrote:
> With combination you mean first do the bulk with MOVSD , then the remainder
> with MOVSB ?

(The text was a direct quote, not mine).

The manual is quite clear; for most small to medium moves, do it all
with REP MOVSB and not REP MOVSD/MOVSB.

"Using ERMSB always delivers better performance than using REP MOVSD+B.
If the length is a multiple of 64, it can produce even higher
performance. For example, copying 65-128 bytes takes 40 cycles, while
copying 128 bytes needs only 35 cycles."

AVX2 may be faster in certain circumstances; mainly larger moves, but
the margin as shown below is small.

A table from page 3-70;

Table 3-4. Relative Performance of Memcpy() Using ERMSB Vs. 128-bit AVX

Range of Lengths (bytes) <128 128 to 2048 2048 to 4096
Memcpy_ERMSB/Memcpy_AVX128 0.7X 1X 1.02X

The complexity of a REP MOVSB is much reduced from the AVX2 version; and
its shortness makes it possible to inline it.

--
Alex

Julian Fondren

Sep 14, 2017, 5:49:45 PM
On Wednesday, September 13, 2017 at 6:05:20 PM UTC-5, Rod Pemberton wrote:
> On Tue, 12 Sep 2017 08:24:51 GMT
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
> > REP MOVSB is slow. Very slow.
>
> Do you have any references for that claim?
>
> (AIR, the programmer's reference manuals for Intel and AMD say the
> exact opposite. I.e., that REP MOVS sequence has special circuitry to
> make it the fastest way to move data.)
>

https://groups.google.com/d/msg/comp.lang.forth/_mHNadHZ9Ys/60EMc4GeBgAJ

Quoting:
----
$ make moves-bench
lina -c moves.frt
bash -c 'time ./moves'
204800

real 0m7.621s
user 0m7.620s
sys 0m0.001s
bash -c 'time sf moves.fs'

204800


real 0m8.237s
user 0m8.235s
sys 0m0.002s
bash -c 'time iforth moves.fs'
204800


real 0m0.878s
user 0m0.820s
sys 0m0.050s
bash -c 'time gforth moves.fs'
204800

real 0m0.676s
user 0m0.673s
sys 0m0.003s

These all just make a bunch of useless MOVEs between two buffers:

: moves ( -- )
buf1 1024 200 * bounds do
buf1 i over - buf2 swap move
#moves ++
loop #moves ? cr bye ;
----



I didn't include all of the code because I didn't think it mattered,
but SwiftForth actually only checks to see which CMOVE it should run -
it doesn't ask whether the buffers even overlap.

So these two programs have reliably different runtimes:

moves1.fs:

1024 200 * buffer: buf1
1024 200 * buffer: buf2
variable #moves

: moves ( -- )
buf1 1024 200 * bounds do
buf1 i over - buf2 swap move
#moves ++
loop #moves ? cr bye ;

moves bye

moves2.fs:

1024 200 * buffer: buf2
1024 200 * buffer: buf1
variable #moves

: moves ( -- )
buf1 1024 200 * bounds do
buf1 i over - buf2 swap move
#moves ++
loop #moves ? cr bye ;

moves bye

timed:

$ time sf moves1.fs
204800


real 0m8.064s
user 0m8.048s
sys 0m0.010s

$ time sf moves2.fs
204800


real 0m11.316s
user 0m11.261s
sys 0m0.029s

Alex

Sep 14, 2017, 11:08:12 PM
On 14-Sep-17 14:49, Julian Fondren wrote:
> On Wednesday, September 13, 2017 at 6:05:20 PM UTC-5, Rod Pemberton wrote:
>> On Tue, 12 Sep 2017 08:24:51 GMT
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> REP MOVSB is slow. Very slow.
>>
>> Do you have any references for that claim?
>>
>> (AIR, the programmer's reference manuals for Intel and AMD say the
>> exact opposite. I.e., that REP MOVS sequence has special circuitry to
>> make it the fastest way to move data.)
>>
>
> https://groups.google.com/d/msg/comp.lang.forth/_mHNadHZ9Ys/60EMc4GeBgAJ
>
> Quoting:
> ----

<snipped>

Without knowing the chipset, these benchmarks aren't very useful.

--
Alex

Julian Fondren

Sep 14, 2017, 11:37:00 PM
Sounds like speculation. Do you have any CPUs that reverse these numbers?

Anton Ertl

Sep 19, 2017, 10:09:41 AM
Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
>On Tue, 12 Sep 2017 08:24:51 GMT
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> REP MOVSB is slow. Very slow.
>
>Do you have any references for that claim?

I was conflating the results of my CMOVE speed tests (which don't use
REP MOVSB, however), with some disappointing experiences that I had
with REP MOVSQ (which was slower than a simple loop for the block size
I used). So I decided to do a more in-depth measurement of REP MOVSB
vs. some alternatives. I wrote a microbenchmark that copies a buffer
to a non-overlapping buffer, with both buffers independently starting
at offsets from 0 to 4095 (for the "aligned" results, offsets are
aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
memmove, and memcpy.
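The shape of such an offset-sweep harness (an assumed, simplified sketch in Python, not Anton's actual C code) looks roughly like this, with memoryview slice assignment standing in for the copy routine under test:

```python
# Copy a block between two buffers at many (source, destination) offset
# pairs within a 4 KB page and report the mean time per copy. A real
# harness would sweep every offset 0..4095 and swap in REP MOVSB,
# memmove, or memcpy as the copy routine.

import time

PAGE = 4096

def sweep(blksz, step=512):
    # buffers big enough for any offset in [0, PAGE) plus the block
    src = bytearray(PAGE + blksz)
    dst = bytearray(PAGE + blksz)
    sv, dv = memoryview(src), memoryview(dst)
    copies = 0
    t0 = time.perf_counter()
    for so in range(0, PAGE, step):          # source offset
        for do in range(0, PAGE, step):      # destination offset
            dv[do:do + blksz] = sv[so:so + blksz]
            copies += 1
    return (time.perf_counter() - t0) / copies

for blksz in (64, 1024, 16384):
    print(blksz, sweep(blksz))
```

Sweeping both offsets independently is what exposes the alignment sensitivity discussed below: an "aligned" run restricts the offsets to multiples of 32.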

You find the benchmark on
<http://www.complang.tuwien.ac.at/anton/move/> (not in a
nice-to-download package yet).

You find the results below, and my observations here:

* REP MOVSB is slower than memcpy for some block sizes (especially
<1KB) on all platforms, and for all block sizes on some platforms
(Penryn, Sandy Bridge, unaligned Ivy Bridge, Zen), and often not
just by a little. In theory the hardware people should know how to
get the best performance out of their hardware, but in practice,
that seems hard to achieve.

* Aligned buffers help REP MOVSB a lot, surprisingly especially at
larger block sizes. I would have expected that hardware can deal
with that better than software, which needs (predicted) branches to
deal with that efficiently. Once you pay for misalignment, an odd
block size does not cost extra.

* Startup overhead is high for REP MOVSB; some are better for one
byte, but are then even slower for 8. On the balance, if I had to
choose between REP MOVSB and an implementation that eschews REP
MOVSB, I would choose the latter, because of the bad performance for
small block sizes. Viewed another way, thanks to the startup
overhead I have to implement something relatively complex for CMOVE
that may use REP MOVSB, but only for large block sizes.

* There is a surprising gap between memcpy and memmove performance;
sometimes memcpy is faster, sometimes memmove. In theory, for this
benchmark memcpy should never be slower than memmove, and memmove
should only be slower by a three-instruction sequence that contains
a predictable loop (so the actual copying code can start right
away). Also, in those cases where REP MOVSB is faster, memmove and
memcpy should use it (in this benchmark), and
the extra cost should just be a few checks.

Looking at these results, it is all the more ridiculous to have a
memcpy separate from memmove. If they spent the effort that they
spend on the separate routines on a memmove that uses rep movsb
where profitable, they would see better performance for both
routines.

* Enhanced REP MOVSB/STOSB (starting with Ivy Bridge; CPU flag erms)
is mentioned as feature in Intel's optimization manual, but the
difference between Sandy Bridge and Ivy Bridge in REP MOVSB
performance is not bigger than other differences that do not get a
separate flag. The biggest difference is seen at the lower counts,
e.g., 53 (Ivy) vs. 173 cycles for blocksize 128.

* repmovsb (unaligned) has a 22x cycle count improvement between
Penryn (2007) and Skylake (2015). The cycle count improvement from
K8 (2003/2005) to Zen (2017) on repmovsb aligned is a factor of 15.
So there is still a lot of progress in some areas.

* The improvement in memmove/memcpy performance from glibc 2.3.6/glibc
2.7 to glibc 2.24 is probably for a good part in the software and
for a smaller part in the hardware. I cannot run a newer statically
linked binary on an older kernel ("Fatal: kernel too old"), so I
built a statically linked binary on the glibc 2.3.6 system, and ran
it on the Zen hardware. The glibc 2.24 memmove is faster by a
factor of about 3 for the larger block sizes, and not quite a factor
of 2 for memcpy. The better memmove/memcpy cycle count over K8 is
due to this software improvement and a factor of almost 2 hardware
improvement.

* It is strange that memmove is close to memcpy on Haswell and
Skylake, but is much slower on Zen. Different code paths at work?

Things that this microbenchmark does not cover, and that may have a
significant influence on performance:

* Using the results; supposedly REP MOVSB has advantage there because
of weaker ordering requirements of the stores (or is that about
independent instructions? the optimization manual is unclear). I
have not seen any benchmark that demonstrates that.

* In real applications other code will compete for I-cache space with
the monstrous implementations of memmove and memcpy in glibc (one
memmove I looked at had 11KB of machine code).

* This microbenchmark uses the same block size all the time, which is
a good case for branch prediction for memmove and memcpy. A less
predictable size may slow down memmove and memcpy (and possibly some
implementations of REP MOVSB).

You can find more discussion on these issues on
<https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy>.

Results are in cycles per iteration (i.e. buffer copying work plus
some loop and call overhead).

Penryn (Xeon 5450), glibc 2.7
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
21 86 104 142 221 378 691 1319 2575 5086 10106 21276 repmovsb
16 30 68 97 97 135 211 362 665 1287 2499 5031 memmove
20 21 39 48 72 120 210 391 853 1685 3360 6773 memcpy
21 85 103 135 175 195 234 314 472 789 1424 2875 repmovsb aligned
16 30 35 39 47 60 94 160 291 554 1105 2646 memmove aligned
20 20 19 20 26 47 81 164 360 653 1239 2693 memcpy aligned
21 86 103 141 220 377 690 1318 2573 5084 10108 21275 repmovsb blksz-1
18 28 56 77 82 120 198 348 651 1276 2499 5015 memmove blksz-1
21 18 29 49 72 120 210 389 851 1682 3357 6771 memcpy blksz-1

Sandy Bridge (Xeon E3-1220) eglibc 2.11.3
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
19 83 100 129 174 183 206 268 398 653 1164 2236 repmovsb
14 28 44 56 79 127 230 430 830 1674 3287 6521 memmove
18 19 29 31 37 49 87 161 261 459 857 1703 memcpy
18 81 100 129 173 179 195 228 301 448 737 1357 repmovsb aligned
15 28 31 35 38 46 76 141 267 550 1075 2151 memmove aligned
19 19 17 17 23 35 65 125 194 314 555 1086 memcpy aligned
18 83 99 128 174 181 205 267 397 651 1162 2233 repmovsb blksz-1
16 26 42 54 77 126 226 426 833 1675 3286 6523 memmove blksz-1
19 16 15 32 36 50 86 161 260 459 858 1705 memcpy blksz-1

Ivy Bridge (Core i3-3227U), glibc 2.23
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
41 41 42 42 54 61 75 117 218 421 838 1658 repmovsb
14 14 15 15 17 45 64 102 173 319 615 1437 memmove
17 19 13 17 20 34 53 90 166 338 647 1439 memcpy
42 41 41 42 53 60 71 96 158 287 557 1093 repmovsb aligned
13 13 14 14 15 27 42 72 136 265 545 1341 memmove aligned
16 18 12 16 18 30 47 79 153 291 551 1241 memcpy aligned
53 41 42 42 54 68 82 123 225 427 833 1656 repmovsb blksz-1
14 14 15 15 18 45 63 102 172 319 614 1434 memmove blksz-1
17 20 13 17 20 34 53 91 166 338 647 1438 memcpy blksz-1

Haswell (Core i7-4690K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
38 38 38 38 45 51 64 100 171 306 576 1135 repmovsb
10 10 11 11 14 30 48 86 149 282 567 1414 memmove
11 12 9 12 15 29 48 86 167 324 628 1415 memcpy
39 39 39 39 46 50 58 74 106 170 298 581 repmovsb aligned
11 11 12 12 13 26 38 67 132 260 531 1362 memmove aligned
12 13 10 15 15 24 37 69 134 277 534 1236 memcpy aligned
50 38 38 38 47 52 66 104 175 310 579 1148 repmovsb blksz-1
10 10 11 11 15 29 47 83 149 280 567 1374 memmove blksz-1
10 11 9 12 15 29 48 86 161 324 628 1417 memcpy blksz-1

Skylake (Core i5-6600K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
33 33 33 33 40 44 54 76 130 237 460 974 repmovsb
10 10 10 10 12 24 40 75 145 302 570 1384 memmove
11 12 8 10 13 26 45 84 160 312 606 1316 memcpy
33 33 33 33 41 45 53 69 101 175 302 564 repmovsb aligned
11 11 11 11 12 24 37 72 141 285 558 1369 memmove aligned
13 14 10 12 15 23 40 75 151 288 562 1267 memcpy aligned
60 33 33 33 43 47 57 78 132 238 460 952 repmovsb blksz-1
10 10 11 11 12 24 40 75 145 301 570 1411 memmove blksz-1
10 11 8 10 13 26 45 84 164 312 606 1347 memcpy blksz-1

Goldmont (Celeron J3455), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
49 48 48 50 54 63 81 123 213 392 831 2681 repmovsb
10 8 8 19 19 37 66 109 206 398 861 2700 memmove
10 8 8 19 19 37 65 109 206 398 863 2699 memcpy
49 48 48 50 54 62 78 111 177 309 635 2130 repmovsb aligned
11 9 9 19 19 37 65 106 197 312 633 2157 memmove aligned
11 9 9 19 19 37 65 106 197 312 634 2157 memcpy aligned
38 53 64 66 70 78 95 137 226 405 831 2689 repmovsb blksz-1
10 9 8 13 19 37 65 109 206 409 835 2714 memmove blksz-1
10 9 8 13 19 37 65 109 206 409 829 2706 memcpy blksz-1

K8 (Athlon 64 X2 4400+), glibc 2.3.6
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
21 28 54 90 162 307 595 1171 2325 4632 9244 18467 repmovsb
17 40 69 80 104 161 253 433 794 1514 2955 5836 memmove
24 31 57 82 98 129 199 323 570 1064 2053 4032 memcpy
21 28 53 87 155 292 566 1113 2206 4394 8768 17516 repmovsb aligned
17 40 33 37 46 68 118 234 451 834 1635 3237 memmove aligned
24 31 56 45 54 72 120 193 338 627 1207 2367 memcpy aligned
17 27 53 89 161 306 594 1171 2325 4629 9248 18461 repmovsb blksz-1
17 37 61 81 105 152 251 433 792 1513 2952 5825 memmove blksz-1
20 30 56 83 100 130 202 325 572 1067 2054 4030 memcpy blksz-1

K10 (Phenom II X2 560), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
15 22 48 84 157 309 566 1080 2107 4161 8270 16487 repmovsb
16 35 56 69 104 152 262 456 839 1604 3135 6201 memmove
16 19 13 19 31 68 114 226 408 774 1505 2968 memcpy
14 21 48 85 158 122 154 219 348 606 1122 2155 repmovsb aligned
16 39 35 38 46 63 95 190 364 664 1268 2583 memmove aligned
19 21 13 20 25 56 89 177 306 566 1084 2121 memcpy aligned
14 21 47 83 155 300 565 1079 2106 4160 8269 16487 repmovsb blksz-1
17 32 55 68 91 156 261 454 837 1602 3131 6190 memmove blksz-1
17 23 13 18 30 69 114 228 411 774 1508 2966 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 33 57 105 110 119 140 184 321 599 1160 2324 repmovsb
13 14 13 14 30 42 65 107 175 325 600 1222 memmove
10 10 11 12 30 43 67 113 185 329 604 1226 memcpy
25 33 57 83 87 95 111 143 207 335 594 1136 repmovsb aligned
12 13 12 13 16 24 40 72 136 264 536 1094 memmove aligned
11 11 12 11 21 27 42 74 139 267 541 1092 memcpy aligned
23 32 56 90 110 120 140 184 321 600 1160 2324 repmovsb blksz-1
13 13 14 13 30 42 67 108 176 325 599 1219 memmove blksz-1
10 10 11 12 31 43 67 113 185 331 604 1221 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.3.6 (-static)
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 32 56 106 111 119 140 184 321 600 1161 2334 repmovsb
10 18 29 36 49 77 132 263 501 940 1816 3581 memmove
26 34 59 80 88 102 133 198 342 599 1114 2182 memcpy
25 33 56 85 89 97 113 145 209 337 595 1145 repmovsb aligned
10 18 20 19 24 40 72 137 286 542 1054 2110 memmove aligned
26 34 59 50 55 70 100 165 311 567 1079 2126 memcpy aligned
22 32 56 90 111 119 142 184 321 600 1161 2338 repmovsb blksz-1
8 16 29 36 49 76 131 261 499 938 1814 3582 memmove blksz-1
24 33 58 82 88 101 134 198 345 602 1117 2184 memcpy blksz-1

Terje Mathisen

unread,
Sep 19, 2017, 10:46:10 AM
to
Anton Ertl wrote:
> Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
>> On Tue, 12 Sep 2017 08:24:51 GMT
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> REP MOVSB is slow. Very slow.
>>
>> Do you have any references for that claim?
>
> I was conflating the results of my CMOVE speed tests (which don't use
> REP MOVSB, however), with some disappointing experiences that I had
> with REP MOVSQ (which was slower than a simple loop for the block size
> I used). So I decided to do a more in-depth measurement of REP MOVSB
> vs. some alternatives. I wrote a microbenchmark that copies a buffer
> to a non-overlapping buffer, with both buffers independently starting
> at offsets from 0 to 4095 (for the "aligned" results, offsets are
> aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
> memmove, and memcpy.
>
> You find the benchmark on
> <http://www.complang.tuwien.ac.at/anton/move/> (not in a
> nice-to-download package yet).
>
> You find the results below, and my observations here:

This is wonderful work Anton, very nice.

The only thing I'm really missing is a few odd block sizes, i.e.
memcpy(a, b, len) where len is 0..31 modulo 32.

Particularly for relatively small lengths of misaligned buffers I would
expect modern ("fast strings") hardware to beat most sw implementations.

Handling a misaligned starting point, a couple of 32-byte blocks and a
misaligned tail end would be somewhat painful in sw. Using masked writes
to handle the tail might even need a lookup table for the mask.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl

unread,
Sep 19, 2017, 11:28:12 AM
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> You find the benchmark on
>> <http://www.complang.tuwien.ac.at/anton/move/> (not in a
>> nice-to-download package yet).

Nice to download now available:

http://www.complang.tuwien.ac.at/anton/move/move.zip

>The only thing I'm really missing is a few odd block sizes, i.e.
>memcpy(a, b, len) where len is 0..31 modulo 32.

The blksz-1 results give 31 mod 32 for the block sizes of 32 and
higher, and also provide len=7. Given that they are close to the
(unaligned blksize) results, I did not bother looking for other odd
block sizes.

But if you want more, just change the Makefile.

>Particularly for relatively small lengths of misaligned buffers I would
>expect modern ("fast strings") hardware to beat most sw implementations.

I had expected that, too, but unfortunately, that's not the case.

>Handling a misaligned starting point, a couple of 32-byte blocks and a
>misaligned tail end would be somewhat painful in sw. Using masked writes
>to handle the tail might even need a lookup table for the mask.

Do we have masked writes before AVX512? Are they efficient?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

unread,
Sep 19, 2017, 1:21:14 PM
to
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> Particularly for relatively small lengths of misaligned buffers I
>> would expect modern ("fast strings") hardware to beat most sw
>> implementations.
>
> I had expected that, too, but unfortunately, that's not the case.
:-(
>
>> Handling a misaligned starting point, a couple of 32-byte blocks
>> and a misaligned tail end would be somewhat painful in sw. Using
>> masked writes to handle the tail might even need a lookup table for
>> the mask.
>
> Do we have masked writes before AVX512? Are they efficient?

Sure!

MASKMOVDQU writes 16 bytes using another 16-byte mask value to determine
which bytes to actually write (using the top bit in each byte), with the
destination address being the same as for STOS/MOVS i.e. the DI/EDI/RDI
register.

This operation has been available since SSE2, so every single 64-bit
capable cpu is guaranteed to have this opcode.

With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.

Since it allows unaligned target addresses it should be possible to use
this directly for both the first and last block of a memcpy() operation.
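As a hedged sketch of that tail handling, using the SSE2 intrinsic `_mm_maskmoveu_si128` (which compiles to MASKMOVDQU): the `masked_tail_store` name and the sliding-window mask table are illustrative, and two caveats apply — the 16-byte load may read past the end of the source tail, and MASKMOVDQU stores carry a non-temporal hint, which may matter for performance.

```c
/* Sketch: store only the low `tail` (0..16) bytes of a 16-byte block
 * with MASKMOVDQU, leaving the bytes past the end untouched.  The
 * sliding window over `ones` is one way to build the byte mask
 * without per-byte branching. */
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_maskmoveu_si128 */

static void masked_tail_store(char *dst, const char *src, unsigned tail)
{
    static const unsigned char ones[32] = {
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        /* remaining 16 bytes are zero */
    };
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    /* bytes whose mask top bit is set are written; the rest skipped */
    __m128i mask = _mm_loadu_si128((const __m128i *)(ones + 16 - tail));
    _mm_maskmoveu_si128(data, mask, dst);
}
```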

Anton Ertl

unread,
Sep 20, 2017, 3:43:28 AM
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>> Do we have masked writes before AVX512? Are they efficient?
>
>Sure!
>
>MASKMOVDQU writes 16 bytes using another 16-byte mask value to determine
>which bytes to actually write (using the top bit in each byte), with the
>destination address being the same as for STOS/MOVS i.e. the DI/EDI/RDI
>register.
>
>This operation has been available since SSE2, so every single 64-bit
>capable cpu is guaranteed to have this opcode.
>
>With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.

According to <http://www.felixcloutier.com/x86/MASKMOVDQU.html>, only
the 128-bit version is allowed. Given that both the SSE2 and the
AVX128 version support 2 operands, I wonder what the difference
between these versions is.

- anton
--

Terje Mathisen

unread,
Sep 20, 2017, 4:28:11 AM
to
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> This operation has been available since SSE2, so every single 64-bit
>> capable cpu is guaranteed to have this opcode.
>>
>> With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.
>
> According to <http://www.felixcloutier.com/x86/MASKMOVDQU.html>, only
> the 128-bit version is allowed. Given that both the SSE2 and the
> AVX128 version support 2 operands, I wonder what the difference
> between these versions is.

Huh? That means the only difference is that you have twice as many
working registers, but that's almost certainly not the limiter for a
fast memcpy() implementation unless you get into really big blocks on a
memory system with a huge turnaround penalty:

32 16-byte regs give you 512 bytes of buffer space, so still not into
page size range...

m...@iae.nl

unread,
Sep 20, 2017, 6:28:03 AM
to
On Tuesday, September 19, 2017 at 4:09:41 PM UTC+2, Anton Ertl wrote:
> Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
> >On Tue, 12 Sep 2017 08:24:51 GMT
> >an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >
> >> REP MOVSB is slow. Very slow.
> >
> >Do you have any references for that claim?
>
> I was conflating the results of my CMOVE speed tests (which don't use
> REP MOVSB, however), with some disappointing experiences that I had
> with REP MOVSQ (which was slower than a simple loop for the block size
> I used). So I decided to do a more in-depth measurement of REP MOVSB
> vs. some alternatives. I wrote a microbenchmark that copies a buffer
> to a non-overlapping buffer, with both buffers independently starting
> at offsets from 0 to 4095 (for the "aligned" results, offsets are
> aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
> memmove, and memcpy.

It seems that the curves flatten out to 8 bytes/cycle, or is
that a coincidence?

To link this with Forth -- I was triggered by Alex's remark
about inlining REP MOVSB. Now *that* would speed up (C)MOVE a
lot, at least on iForth, especially for frequent small
moves (< 64 bytes). Any ideas how that works out?

In iForth the overhead of an OS call is pretty horrendous,
so we only do that for sizes > 64. However, it makes the
size of (C)MOVE too big to inline (looking at the expected gain).

-marcel
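For reference, the inlined REP MOVSB kernel under discussion can be written as a tiny GCC/Clang inline-asm wrapper (x86-64 only; a sketch of the idea, not iForth's actual code generator — a compiler inlining (C)MOVE would emit essentially the three register moves plus the REP MOVSB itself):

```c
/* Sketch: REP MOVSB as an inline-asm helper (GCC/Clang, x86-64).
 * RDI = destination, RSI = source, RCX = count, as the instruction
 * requires; all three are clobbered, hence the "+" constraints. */
#include <stddef.h>

static void rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      : /* no other inputs */
                      : "memory");
}
```

Like CMOVE, this copies low-to-high one byte at a time, so overlapping moves with the destination above the source propagate rather than move.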

Anton Ertl

unread,
Sep 20, 2017, 12:22:04 PM
to
m...@iae.nl writes:
>It seems that the curves flatten out to 8 bytes/cycle, or is
>that a coincidence?

Depends on which results you look at. E.g., repmovsb aligned on the
Skylake approaches 32 bytes per cycle for 16k blocks, and 32
bytes/cycle is the limit of the hardware.

>To link this with Forth -- I was triggered by Alex's remark
>about inlining REP MOVSB. Now *that* would speed up (C)MOVE a
>lot, at least on iForth, especially for frequent small
>moves (< 64bytes). Any ideas how that works out?

Looking at the results for small blocks, the overhead of REP MOVSB is
pretty horrible.

>In iForth the overhead of an OS call is pretty horrendous,
>so we only do that for sizes > 64. However, it makes the
>size of (C)MOVE too big to inline (looking at the expected gain).

Calls and returns are pretty cheap on AMD64 CPUs. memcpy/memmove are
not OS functions, but library functions. If you meant that ABI calls
are slow, that may be a good thing to work at.

Alternatively, you can write a MOVE and CMOVE yourself in assembly
language, but making it fast is quite a bit of work (you can see in
the results I showed that neither Intel nor AMD engineers nor the
glibc maintainers found it easy to make it fast).
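A hedged sketch of the size-dispatch approach from the thread, with the 64-byte cutoff Marcel mentions (the threshold and the `cmove_dispatch` name are illustrative; real tuning is CPU-dependent): short moves get a plain inline byte loop, everything else goes to libc's memmove.

```c
/* Sketch of the size dispatch: inline a byte loop for short moves,
 * hand large ones to libc's tuned memmove.  Note the semantic gap:
 * the byte loop copies ascending, so like CMOVE it propagates when
 * ranges overlap with dst above src, whereas memmove does a true
 * move -- a faithful CMOVE would need to keep the loop for that case. */
#include <string.h>
#include <stddef.h>

static void cmove_dispatch(const unsigned char *src, unsigned char *dst,
                           size_t n)
{
    if (n < 64) {
        while (n--)            /* short move: simple ascending loop */
            *dst++ = *src++;
    } else {
        memmove(dst, src, n);  /* large move: let libc's code run */
    }
}
```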