MOVS and CMPS Performance question

jasond...@nospicedham.gmail.com

Dec 23, 2014, 9:59:01 PM

I have some familiarity with 68K and 6502 Assembly, and not much with
Intel assembly. So, my question is this. What are the performance
advantages of using Intel string instructions such as MOVS and CMPS
as compared with, for instance, just writing characters with MOV in a
simple loop?

My brief investigation into the subject via Google has shown me
contradictory answers to this question.

Thank you for sharing your knowledge.

Jason

Robert Wessel

Dec 24, 2014, 2:00:00 AM

It varies by implementation. On some CPUs, the string instructions
are faster, on others, the explicit loop. On most modern x86 CPUs,
the string instructions are as fast, or faster, than explicit loops
for lengths of a few dozen or more bytes, while explicit code may be
faster for short strings. The crossover point, again, varies,
particularly because of the higher setup/startup overhead of the
string instructions.

In the past it's been more clear cut. On the original x86s (IOW
8086/8), the string instructions were considerably faster than an
explicit loop, for even moderate sized strings. In the 486 and early
Pentium days, the string instructions were usually slower. And
as I said, it's swung back a fair way since then.

It's also remarkably difficult to do an "optimum" version of something
as simple as a memcpy(). While it's rather dated now, check out the
"AMD Athlon Processor x86 Code Optimization Guide", page 178 for AMD
recommended memcpy() implementation:

http://www.ii.uib.no/~osvik/amd_opt/22007k.pdf

All six pages of it.

At the end of the day, it's best to start with clear code, and then
optimize it if you *measure* it to have inadequate performance. On
modern processors, if you're presented with a variable length string
operation, there's a lot to be said for just using the string
instructions. The performance is rarely horrible, and often quite
good.
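As a rough illustration of the "measure first" advice, here is a minimal C sketch (not from the thread) that times a plain byte loop against the library memcpy(). The names copy_loop, copy_memcpy, time_copy and copies_agree are made up for this example, and clock() is a coarse timer, so treat any numbers it yields as a starting point only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: the "simple MOV loop" from the question, in C. */
static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Thin wrapper so memcpy() has the same signature as copy_loop(). */
static void copy_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Time `reps` copies of n bytes; returns CPU seconds as seen by clock(). */
static double time_copy(copy_fn fn, unsigned char *dst,
                        const unsigned char *src, size_t n, int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Usage example: time both routines and check that they produce the
   same bytes. Returns 1 on success, 0 on a mismatch. */
static int copies_agree(void)
{
    unsigned char src[256], a[256], b[256];
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)i;
    double t_loop = time_copy(copy_loop, a, src, sizeof src, 1000);
    double t_lib  = time_copy(copy_memcpy, b, src, sizeof src, 1000);
    (void)t_loop; (void)t_lib;   /* print or compare these in a real run */
    return memcmp(a, src, sizeof src) == 0 && memcmp(b, src, sizeof src) == 0;
}
```

An optimizing compiler may turn the byte loop into a memcpy() call or drop repeated copies altogether, so it is worth checking the generated code before trusting the comparison.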

wolfgang kern

Dec 24, 2014, 3:15:05 AM

Robert already answered about speed-dependency on block-size.
Single block-instructions are always slower than discrete code,
but they come in handy when code-size is of concern.

I measured on my PhenomII that REP MOVSd (dword) is faster than a
discrete loop if the destination is dw-aligned and the iteration
count is > eight.
If both operands are misaligned then there seems to be not much
difference as long as the count is greater than 8.
The crossover point may move towards higher counts on other machines.

REPZ CMPSb may win only on larger counts (I didn't check recently).
__
wolfgang

Bernhard Schornak

Dec 24, 2014, 7:00:20 AM

jasond...@nospicedham.gmail.com wrote:

> I have some familiarity with 68K and 6502 Assembly, and not much with
> Intel assembly. So, my question is this. What is the performance
> advantages of using Intel string instructions such as MOVS and CMPS
> as compared with just for instance writing characters with MOV in a
> simple loop?

Depends on what you want to do. For block moves and
fast simultaneous comparisons of multiple data, the
media instructions (SSE x) are far faster than any
GPR solution. If lower speed is sufficient, but the
code size matters, the repeated string instructions
grant the advantage of reduced instruction size. It
is just a matter of the task - more code and better
speed or slightly slower, but compact executables.

Merry Winter Solstice

Bernhard Schornak

Rod Pemberton

Dec 25, 2014, 2:47:22 AM

Is there a consistent SSE x instruction sequence which can
replace REP MOVSB for block moves and is always faster?

Rod Pemberton

Bernhard Schornak

Dec 25, 2014, 9:33:25 AM

See

http://www.agner.org/optimize/optimizing_assembly.pdf

chapter 16.10 (page 145). Here

http://tinyurl.com/ns8xy46

some test results are discussed.

It is quite hard to find hints which solution is the best
for a specific task. Older AMD optimisation guides listed
code samples for optimised memcpy(), but they disappeared
around the end of the first decade of the 21st century.

(Probably, they assumed everyone should know how to do it
after it was published for many years...)

Greetings from Augsburg

Bernhard Schornak

Terje Mathisen

Dec 25, 2014, 9:48:26 AM

Rather that the effect is disappearing in the noise:

If you only move tiny items, small enough to normally be in $L1, then it
doesn't matter, and if you move blocks that are big enough to miss both
L1 and L2 cache then it really doesn't matter: All code will be able to
keep the memory bus saturated. :-(

For L2-sized working sets you can often find SSE versions that will be
faster for specific patterns, but not as a generalized memcpy() replacement.

Terje
PS. Merry Christmas for those of you who celebrate this time of the
year. :-)
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Bernhard Schornak

Dec 25, 2014, 12:33:42 PM

You can prefetch the portion of memory you will process in the
next iteration at the beginning of the loop body to load a few
cache lines ahead. If the loop body is large enough, execution
can (at least partially) hide main memory latency.

        prefetch 0x00[R10]        # initial prefetch as
        prefetch 0x40[R10]        # far away as possible
        ...
        # align loop entry to instruction queue size
loop:   prefetch 0x80[R10]        # cache lines for the next iteration
        prefetch 0xC0[R10]
        movdqa   xmm0,0x00[R10]   # load 128 bytes (source must be 16-byte aligned)
        movdqa   xmm1,0x10[R10]
        movdqa   xmm2,0x20[R10]
        movdqa   xmm3,0x30[R10]
        movdqa   xmm4,0x40[R10]
        movdqa   xmm5,0x50[R10]
        movdqa   xmm6,0x60[R10]
        movdqa   xmm7,0x70[R10]
        movntdq  0x00[R11],xmm0   # non-temporal store; MOVNTDQA is a load
        movntdq  0x10[R11],xmm1
        movntdq  0x20[R11],xmm2
        movntdq  0x30[R11],xmm3
        movntdq  0x40[R11],xmm4
        movntdq  0x50[R11],xmm5
        movntdq  0x60[R11],xmm6
        movntdq  0x70[R11],xmm7
        add      R10,0x80
        add      R11,0x80
        dec      ECX              # ECX = number of 128-byte blocks
        jne      loop
        ...

Should be a few clocks faster than without prefetching...

Robert Wessel

Dec 26, 2014, 12:04:22 AM

Quite likely that will be no faster, or even a smidge *slower*, on modern
processors, which do considerable prefetching in hardware, which will
certainly pick up simple sequential patterns like that.

The usefulness of explicit prefetching is mainly limited to patterns
which the hardware prefetchers cannot predict (and their capabilities
have increased over time), but you should generally assume that at
least sequential accesses separated by a constant stride *will* be
successfully prefetched by modern hardware.
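To try the prefetch question on a given machine without writing assembly, a sketch along the following lines may help. It is not code from the thread: copy_with_prefetch and prefetch_copy_ok are names invented here, the 64-byte line size is an assumption, and __builtin_prefetch is a GCC/Clang extension rather than standard C. Whether the hint helps or hurts is exactly what has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE 64   /* assumed cache-line size in bytes */

/* Copy n bytes one assumed cache line at a time, hinting the line two
   iterations ahead. The hint is issued only while the hinted address
   still lies inside the source buffer. */
static void copy_with_prefetch(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + LINE <= n; i += LINE) {
        if (i + 2 * LINE < n)
            __builtin_prefetch(src + i + 2 * LINE, 0, 0); /* read, low reuse */
        memcpy(dst + i, src + i, LINE);
    }
    memcpy(dst + i, src + i, n - i);   /* tail shorter than one line */
}

/* Usage example: the prefetching copy must match the source exactly.
   Returns 1 on success, 0 on a mismatch. */
static int prefetch_copy_ok(void)
{
    unsigned char src[1000], dst[1000];
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7);
    memset(dst, 0, sizeof dst);
    copy_with_prefetch(dst, src, sizeof src);
    return memcmp(dst, src, sizeof src) == 0;
}
```

Timing this against a plain memcpy() of the same buffer, for sizes that fit in cache and sizes that do not, is one way to see whether the hardware prefetcher already covers the sequential case on the machine at hand.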

Terje Mathisen

Dec 26, 2014, 3:04:38 AM

Exactly right, the code above will (unfortunately) most likely end up a
little slower than naive code when used inline in many places in a
larger binary:

The increased code size means that the instruction cache will have worse
hit rates. :-(

This is of course something Intel & co have known for decades, they have
suffered from the curse of "having to get it exactly right", i.e. any
internal suggestions to spend some dedicated hw getting rid of the
startup overhead in REP MOVS used to sink on the argument that it had to
cover all cases equally well:

I.e. source can be in L1/L2/L3/RAM, destination ditto.
The memory ranges might be uncacheable/write through/write
back/memory_mapped etc.

Source & destination might be 1/2/4/8/16/32 or 64-byte aligned, both
absolutely and relatively to each other.

The block size can range from 0 to multiple GB.

You end up with a ~1024-way decision table in order to pick the best
approach, it is pretty clear that you will not be able to spend the
transistors to make every one of those a fast path!

The final consideration is of course what you are going to use the
target for: Does it make sense to keep it in (some level of) cache or
should the block copy bypass any cache levels which don't currently
contain the target lines?

Except for the case of uncacheable/memory mapped ram (which will still
use streaming stores) it is impossible for the HW to know what the best
target cache policy would be, you would need heuristics like "if the
size is larger than $L2 bypass target cache", or allow the use of hints,
i.e. typically redundant prefix byte(s) in front of the REP MOVS to
signal that this is a special version.

The FAST STRINGS facility in some later x86 cpus has been an attempt to
at least start on the job I outlined above. :-)

Terje
PS. I haven't even mentioned the case of overlapping ranges which can be
used to fill a block of memory with a recurring pattern of N bytes,
where N can be any random value. On IBM mainframes the standard way to
fill a block was to write the desired value into the first word, then do
an overlapping block move, reading from the just stored word and writing
to the next word!
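The overlapping-move fill described in the PS can be sketched in portable C as below. This is an illustration only: fill_pattern and fill_pattern_ok are hypothetical names, and the loop copies strictly front to back, one byte at a time, which is what makes the pattern propagate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill buf[0..len) with the n-byte pattern already stored in buf[0..n).
   Each byte is copied from n bytes earlier, so bytes written earlier in
   the loop feed later reads. memcpy() must not be used for this (the
   ranges overlap), and memmove() would not replicate the pattern because
   it behaves as if it copied through a temporary buffer. */
static void fill_pattern(unsigned char *buf, size_t len, size_t n)
{
    assert(buf != NULL && n > 0);
    for (size_t i = n; i < len; i++)
        buf[i] = buf[i - n];
}

/* Usage example: a 3-byte pattern spread over 10 bytes.
   Returns 1 if the buffer holds the expected bytes, 0 otherwise. */
static int fill_pattern_ok(void)
{
    unsigned char buf[11] = "abc";   /* pattern first, remaining bytes zero */
    fill_pattern(buf, 10, 3);
    return memcmp(buf, "abcabcabca", 10) == 0 && buf[10] == 0;
}
```

The pattern length n can be any positive value, and len need not be a multiple of n; the last repetition is simply cut short.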

Bernhard Schornak

Dec 30, 2014, 12:01:25 PM

I updated my test tool to get reliable data we can discuss:

https://docs.google.com/file/d/0B1OgMlxNnSNEUy15VjByX01OQ3c/edit?usp=drive_web

Google Drive has some problems with zip files. Click download
to get the archive, then unzip it. The 64 bit Windows program
can be found in the source folder. It only offers English and
German language support via [Program][Language].

Start the program, then select [Test][Evaluation]. The dialog
shows the best, worst and average results of 32 tests run for
four different testees. The test suite ran on a FX8350, AMD's
latest processor for the high end market.

The results confirm what I said: The SSE solution is about 25
percent faster than built-in repeated string instructions. My
suggestion to prefetch cache lines gains some additional percent
over the plain SSE solution without prefetching. As this
test shows as well: REP MOVSB is almost as fast as REP MOVSQ.

It would be interesting to see how recent iNTEL machines behave in
this test. Anyone willing to post some results?

Bernhard Schornak

Dec 31, 2014, 4:33:53 AM

Bernhard Schornak wrote:

> https://docs.google.com/file/d/0B1OgMlxNnSNEUy15VjByX01OQ3c/edit?usp=drive_web

Applied some cosmetic changes, new link is

https://docs.google.com/file/d/0B1OgMlxNnSNEQVlhU1A2VjQ2b1k/edit?usp=drive_web

Results:
--------
          REP MOVSB   REP MOVSQ      MOVDQA    PREFETCH
Fastest  17,612,734  18,150,361  12,615,510  11,186,411
Slowest  19,571,328  20,745,807  14,891,350  13,448,809
Average  18,653,890  19,219,699  13,396,096  11,966,623
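As a cross-check of the "about 25 percent" figure against the averages in the table above, the arithmetic can be written out as a small C sketch. The helper names are invented for this example, and the unit of the posted numbers is whatever the test tool reports; only the ratios matter here.

```c
#include <assert.h>

/* Fraction of time saved by going from `slow` to `fast` (same units). */
static double time_saved(double slow, double fast)
{
    assert(slow > 0.0);
    return 1.0 - fast / slow;
}

/* Average results as posted in the table above. */
static const double AVG_MOVSB    = 18653890.0;
static const double AVG_MOVDQA   = 13396096.0;
static const double AVG_PREFETCH = 11966623.0;

/* Plain SSE loop compared with REP MOVSB. */
static double movdqa_vs_movsb(void)
{
    return time_saved(AVG_MOVSB, AVG_MOVDQA);
}

/* SSE loop with prefetch compared with the plain SSE loop. */
static double prefetch_vs_movdqa(void)
{
    return time_saved(AVG_MOVDQA, AVG_PREFETCH);
}
```

On these averages the plain SSE loop needs roughly 28 percent less time than REP MOVSB, and the prefetching variant roughly a further 11 percent less than the plain SSE loop, which is in line with the summary given earlier in the thread for this one FX8350 machine.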

If you run the tests, you should run them at least twice. The
first testee's first run always is much slower than following
runs - worst case was more than twice as long (the reason why
I thought MOVSQ is faster than MOVSB...).
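The advice about discarding the first run can be sketched as a small C harness: one untimed warm-up copy, then a number of timed runs reduced to fastest, slowest and average. This is not the test tool's code; struct stats, bench_memcpy and bench_is_consistent are hypothetical names, memcpy() stands in for the routines under test, and clock() is far coarser than a cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

struct stats { double best, worst, average; };   /* CPU seconds */

/* Run one untimed warm-up copy, then time `runs` copies of n bytes. */
static struct stats bench_memcpy(unsigned char *dst, const unsigned char *src,
                                 size_t n, int runs)
{
    assert(runs > 0);
    clock_t best = 0, worst = 0, total = 0;

    memcpy(dst, src, n);                 /* warm-up: touches pages and caches */
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        memcpy(dst, src, n);
        clock_t t = clock() - t0;
        if (r == 0 || t < best)  best = t;
        if (r == 0 || t > worst) worst = t;
        total += t;
    }

    struct stats s;
    s.best    = (double)best / CLOCKS_PER_SEC;
    s.worst   = (double)worst / CLOCKS_PER_SEC;
    s.average = (double)total / runs / CLOCKS_PER_SEC;
    return s;
}

/* Usage example: 32 runs over a 1 MB block.
   Returns 1 if best <= average <= worst, 0 otherwise. */
static int bench_is_consistent(void)
{
    static unsigned char src[1 << 20], dst[1 << 20];
    struct stats s = bench_memcpy(dst, src, sizeof src, 32);
    return s.best <= s.average && s.average <= s.worst;
}
```

Without the warm-up copy, the first timed run also pays for page faults and cold caches, which matches the observation that the first testee's first run can take more than twice as long.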

Happy New Year!

Bernhard Schornak

Mel

Dec 31, 2014, 4:24:37 PM

I will do so when I return home tomorrow (i7 4790).
Happy new year!

--
Press any key to continue or any other to quit

Bernhard Schornak

Jan 1, 2015, 8:30:00 AM

Mel wrote:

> I will do when I return to home tommorow (i7 4790).

Thank you in advance!

> Happy new year!

Same to you and all other readers!

Bernhard Schornak

Melzzzzz

Jan 1, 2015, 1:00:55 PM

On Thu, 01 Jan 2015 14:15:24 +0100
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Mel wrote:
>
>
> > I will do when I return to home tommorow (i7 4790).
>
>
> Thank you in advance!

Unfortunately the program does not work under wine/Linux.
It starts ok but when I choose run test, it just freezes
and nothing happens.
I don't have Windows on my new computer, unfortunately.

Bernhard Schornak

Jan 1, 2015, 2:16:08 PM

Melzzzzz wrote:

> Unfortunately program does not works under wine/Linux.
> It starts ok but when I choose run test, it just freezes
> and nothing happens.
> I don't have windows on new computer unfortunately.

The test (4 * 32 test runs) takes a few seconds to finish:
Each of those 128 tests copies one 16 MB block of data. In
general, the program uses the same libraries as STbench.
If the latter runs properly, the former should run as well.

Does the evaluation dialog [Test][Evaluation] show results
of the stored data set? [Test][Statistics] is not present,
but all other menu items should open dialog windows (help,
about, etc.) and selecting a language should switch to the
chosen language immediately.

Melzzzzz

Jan 1, 2015, 3:01:12 PM

On Thu, 01 Jan 2015 20:14:42 +0100
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Melzzzzz wrote:
>
>
> > Unfortunately program does not works under wine/Linux.
> > It starts ok but when I choose run test, it just freezes
> > and nothing happens.
> > I don't have windows on new computer unfortunately.
>
>
> The test (4 * 32 test runs) takes a few seconds to finish:
> Each of those 128 tests copies one 16 MB block of data. In
> general, the program uses the same libraries than STbench.
> If the latter runs properly, the first should run as well.
>
> Does the evaluation dialog [Test][Evaluation] show results
> of the stored data set?

No, it just sits there.

> [Test][Statistics] is not present,
> but all other menu items should open dialog windows (help,
> about, etc.) and selecting a language should switch to the
> chosen language immediately.

Language switch works, about too, but help for test and run
tests do not. I had a problem with another assembler program
from the fasm forum, and the problem was that not all required
registers were preserved before calling win api functions.
Wine is less forgiving than proper Windows in this case.
I guess it is a similar problem here, too.

>
>
> Happy New Year!
> Bernhard Schornak
>

Happy New Year to you, too! ;)

Bernhard Schornak

Jan 1, 2015, 5:31:42 PM

Melzzzzz wrote:

> ... I had problem with other assembler program

> from fasm forum and problem was that not all required registers
> were preserved before calling win api functions.

My libraries have wrappers for most API calls - only functions
exclusively used by a program are called directly. My programs
assume that all registers (except RAX and XMM0 ... XMM3) don't
change their contents when other functions were called. 64 bit
Windows clobbers fewer registers than 64 bit Linux. Might be my
wrappers are not sufficient. Linux clobbers RDI and RSI first,
while 64 bit Windows defined RCX and RDX to pass the first two
parameters - RDI and RSI are not defined to pass parameters in
Windows at all, and therefore neither are used nor restored by
my wrappers.

If the program runs properly, filetimes for 00000003, FFFFFFFC
and FFFFFFFE in the [data] folder should change as long as the
program was normally closed (not terminated by task manager or
wine's equivalent). The Content of those datafields may change
while the program runs. FFFFFFFC holds the current path to the
program, FFFFFFFE the main window's size, position and program
flags, 00000003 the test data sets (which might change, so its
changed flag is permanently set to force saving on exit).

Melzzzzz

Jan 1, 2015, 6:16:48 PM

On Thu, 01 Jan 2015 23:27:20 +0100
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Melzzzzz wrote:
>
>
> > ... I had problem with other assembler program
> > from fasm forum and problem was that not all required registers
> > were preserved before calling win api functions.
>
>
> My libraries have wrappers for most API calls - only functions
> exclusively used by a program are called directly. My programs
> assume that all registers (except RAX and XMM0 ... XMM3) don't
> change their contents when other functions were called. 64 bit
> Windows clobbers less registers than 64 bit Linux. Might be my
> wrappers are not sufficient. Linux clobbers RDI and RSI first,
> while 64 bit Windows defined RCX and RDX to pass the first two
> parameters - RDI and RSI are not defined to pass parameters in
> Windows at all, and therefore neither are used nor restored by
> my wrappers.

Perhaps there is a problem with the window proc? Some register(s) are
changed that should be saved?

>
> If the program runs properly, filetimes for 00000003, FFFFFFFC
> and FFFFFFFE in the [data] folder should change as long as the
> program was normally closed (not terminated by task manager or
> wine's equivalent). The Content of those datafields may change
> while the program runs. FFFFFFFC holds the current path to the
> program, FFFFFFFE the main window's size, position and program
> flags, 00000003 the test data sets (which might change, so its
> changed flag is permanently set to force saving on exit).
>

If I choose 'run tests' I cannot close the program normally.

Bernhard Schornak

Jan 2, 2015, 6:04:41 AM

Melzzzzz wrote:

> On Thu, 01 Jan 2015 23:27:20 +0100
> Bernhard Schornak <scho...@nospicedham.web.de> wrote:
>
>> Melzzzzz wrote:
>>
>>> ... I had problem with other assembler program
>>> from fasm forum and problem was that not all required registers
>>> were preserved before calling win api functions.
>>
>> My libraries have wrappers for most API calls - only functions
>> exclusively used by a program are called directly. My programs
>> assume that all registers (except RAX and XMM0 ... XMM3) don't
>> change their contents when other functions were called. 64 bit
>> Windows clobbers less registers than 64 bit Linux. Might be my
>> wrappers are not sufficient. Linux clobbers RDI and RSI first,
>> while 64 bit Windows defined RCX and RDX to pass the first two
>> parameters - RDI and RSI are not defined to pass parameters in
>> Windows at all, and therefore neither are used nor restored by
>> my wrappers.
>
> Perhaps there is problem with window proc? Some register(s) are
> changed that should be saved?

No. The only culprits could be RDI or RSI. Neither register is
used anywhere in the program except in the two MOVSx loops,
where they have to hold the source (RSI) and target (RDI)
address.

Even if Linux' thread manager interrupts the loop, it should
save the entire register set and the current processor state
before the thread is interrupted and restore all data before
the thread is resumed.

>> If the program runs properly, filetimes for 00000003, FFFFFFFC
>> and FFFFFFFE in the [data] folder should change as long as the
>> program was normally closed (not terminated by task manager or
>> wine's equivalent). The Content of those datafields may change
>> while the program runs. FFFFFFFC holds the current path to the
>> program, FFFFFFFE the main window's size, position and program
>> flags, 00000003 the test data sets (which might change, so its
>> changed flag is permanently set to force saving on exit).
>>
> If I choose 'run tests' I cannot close program normally.

Does the filetime of the mentioned files change if you close
the program without invoking any menu items? If not, there's
a problem with the file handling functions, at least writing
to disk might be corrupted (loading works, otherwise you would
not see a menu - the language stuff is loaded from disk when
the menu is initialised or another language is chosen).

I guess it is a problem caused by wine. Probably they didn't
care about the two different ABIs too much, and there should
be wrappers translating Linux calling conventions to Windows
64 bit standards. Overwritten registers might cause problems
with C(++) or Pascal programs, as well.

If you are interested in searching the cause of the error, I
could insert some register dumps and upload a debug version.
The required dump viewer is included in my PTB (Programmer's
ToolBox).

Melzzzzz

Jan 2, 2015, 6:34:44 AM

On Fri, 02 Jan 2015 11:51:27 +0100

Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Melzzzzz wrote:
>
>
> > On Thu, 01 Jan 2015 23:27:20 +0100
> > Bernhard Schornak <scho...@nospicedham.web.de> wrote:
> >
> >> Melzzzzz wrote:
> >>
> >>> ... I had problem with other assembler program
> >>> from fasm forum and problem was that not all required registers
> >>> were preserved before calling win api functions.
> >>
> >> My libraries have wrappers for most API calls - only functions
> >> exclusively used by a program are called directly. My programs
> >> assume that all registers (except RAX and XMM0 ... XMM3) don't
> >> change their contents when other functions were called. 64 bit
> >> Windows clobbers less registers than 64 bit Linux. Might be my
> >> wrappers are not sufficient. Linux clobbers RDI and RSI first,
> >> while 64 bit Windows defined RCX and RDX to pass the first two
> >> parameters - RDI and RSI are not defined to pass parameters in
> >> Windows at all, and therefore neither are used nor restored by
> >> my wrappers.
> >
> > Perhaps there is problem with window proc? Some register(s) are
> > changed that should be saved?
>
>
> No. The only culprits could be RDI or RSI. Both are not used
> in the entire program - except the two MOVSx loops where RDI
> and RSI have to hold source (RSI) and target (RDI) address -
> none of both registers is used anywhere else.

I think that could be the culprit. Wine uses RDI/RSI in functions
that call the win proc for a particular dialog, and if those get
clobbered it just hangs, I guess.

>
> Even if Linux' thread manager interrupts the loop, it should
> save the entire register set and the current processor state
> before the thread is interrupted and restore all data before
> the thread is resumed.
>
>
> >> If the program runs properly, filetimes for 00000003, FFFFFFFC
> >> and FFFFFFFE in the [data] folder should change as long as the
> >> program was normally closed (not terminated by task manager or
> >> wine's equivalent). The Content of those datafields may change
> >> while the program runs. FFFFFFFC holds the current path to the
> >> program, FFFFFFFE the main window's size, position and program
> >> flags, 00000003 the test data sets (which might change, so its
> >> changed flag is permanently set to force saving on exit).
> >>
> > If I choose 'run tests' I cannot close program normally.
>
>
> Does the filetime of the mentioned files change if you close
> the program without invoking any menu items?

Yes. All three files change modification timestamp
after normal program close.

Bernhard Schornak

Jan 2, 2015, 10:50:21 AM

Implies that wine interrupts REP MOVSx in progress and does
not restore the prior state. The 4 test loops are in ttt.S,
MOVSB is executed in t00, MOVSQ in t01. The prologue stores
RDI/RSI, and all saved registers are restored by XIT before
the test functions return to their caller.

>>>> If the program runs properly, filetimes for 00000003, FFFFFFFC
>>>> and FFFFFFFE in the [data] folder should change as long as the
>>>> program was normally closed (not terminated by task manager or
>>>> wine's equivalent). The Content of those datafields may change
>>>> while the program runs. FFFFFFFC holds the current path to the
>>>> program, FFFFFFFE the main window's size, position and program
>>>> flags, 00000003 the test data sets (which might change, so its
>>>> changed flag is permanently set to force saving on exit).
>>>>
>>> If I choose 'run tests' I cannot close program normally.
>>
>>
>> Does the filetime of the mentioned files change if you close
>> the program without invoking any menu items?
>
> Yes. All three files change modification timestamp
> after normal program close.

Okay. Could you send me a copy of FFFFFFFE or post the con-
tent of the qwords at offset 0x0200, 0x0240 and 0x0280? All
of them should be valid addresses emitted by VirtualAlloc()
(as reference you might use the qwords at 0x0218, 0x0258 or
0x0298; they are offsets to the 'loader table', an internal
structure allocated during program initialisation).

George Neuner

Jan 2, 2015, 12:20:29 PM

On Wed, 31 Dec 2014 10:22:21 +0100, Bernhard Schornak
<scho...@nospicedham.web.de> wrote:

>Bernhard Schornak wrote:
>
>
>> https://docs.google.com/file/d/0B1OgMlxNnSNEUy15VjByX01OQ3c/edit?usp=drive_web
>
>
>Applied some cosmetic changes, new link is
>
>https://docs.google.com/file/d/0B1OgMlxNnSNEQVlhU1A2VjQ2b1k/edit?usp=drive_web
>
>

>Bernhard Schornak

Hi Bernhard,

I have i7-3770, i5-3450 and i3-370M available.

I tried your program on Windows 7 and had the same result as Mel: the
GUI starts and the language switch appears to work, but trying to run
the tests does nothing - there is a brief blip on 1 cpu and then it just
sits there forever with zero CPU activity.

Also, choosing anything in the help menu crashes the application.

>If the program runs properly, filetimes for 00000003, FFFFFFFC
>and FFFFFFFE in the [data] folder should change as long as the
>program was normally closed (not terminated by task manager or
>wine's equivalent). The Content of those datafields may change
>while the program runs. FFFFFFFC holds the current path to the
>program, FFFFFFFE the main window's size, position and program
>flags, 00000003 the test data sets (which might change, so its
>changed flag is permanently set to force saving on exit).

Can't terminate it cleanly if I try to run the tests - it locks up
completely unresponsive and has to be killed.

George

Bernhard Schornak

Jan 2, 2015, 1:05:35 PM

George Neuner wrote:

> On Wed, 31 Dec 2014 10:22:21 +0100, Bernhard Schornak
> <scho...@nospicedham.web.de> wrote:
>>
>> Applied some cosmetic changes, new link is
>>
>> https://docs.google.com/file/d/0B1OgMlxNnSNEQVlhU1A2VjQ2b1k/edit?usp=drive_web
>

> Hi Bernhard,
>
> I have i7-3770, i5-3450 and i3-370M available.
>
> I tried your program on Windows 7 and had the same result as Mel: the
> GUI starts and the language switch appears to work, but trying to run
> the tests does nothing - there a brief blip on 1 cpu and then it just
> sits there forever with zero CPU activity.

Hi George,

first of all: Thanks for your report!

It is not a multithreaded / multicore test (would be overkill for
a 16 MB copy action). I get a short spike with 40 percent load
when I run the test, but it completes properly on my FX8350.

> Also, choosing anything in the help menu crashes the application.

This should not happen at all. I will check the entire program
for bugs. It would be nice if someone was willing to report if
the program works with other AMD processors...

>> If the program runs properly, filetimes for 00000003, FFFFFFFC
>> and FFFFFFFE in the [data] folder should change as long as the
>> program was normally closed (not terminated by task manager or
>> wine's equivalent). The Content of those datafields may change
>> while the program runs. FFFFFFFC holds the current path to the
>> program, FFFFFFFE the main window's size, position and program
>> flags, 00000003 the test data sets (which might change, so its
>> changed flag is permanently set to force saving on exit).
>
> Can't terminate it cleanly if I try to run the tests - it locks up
> completely unresponsive and has to be killed.

The tests should be completed within a few seconds. If there's
no response after fifteen seconds, something is wrong. Can you
post the crash address (Error Offset in the 'Application Error
Log') so I can trace the error?

Bernhard Schornak

Jan 2, 2015, 2:05:41 PM

Sorry for a malfunctioning test program - while updating the
main message procedure, I needed to save more registers, but
left both epilogues (processed exit / default procedure) un-
touched. Hence, some registers weren't restored at all, some
were restored with the wrong content. The original code used
0xC0...0xF0(%rsp), the changed code now starts with 0xA0 and
uses four additional registers.

I hope this was the only error, but I wonder why the program
runs on my machine without a faint hint that something isn't
working properly...

Okay, the new version is available here:

https://docs.google.com/file/d/0B1OgMlxNnSNET19QSGtxcmN0WEU/edit?usp=drive_web

Hope it works on all machines, now! ;)

Mea culpa and such...

Bernhard Schornak

George Neuner

Jan 3, 2015, 12:21:59 AM

Hi Bernhard,

Still no luck on Win7 64-bit. This new version behaves a bit
differently in that the help menu items lock up instead of crashing,
and the test briefly bursts to ~30% for approximately 1 second ...

... but then again it locks up with zero CPU activity and has to be
killed.

WRT event log: this new version isn't crashing so there is no error
entry. However, for the previous version that did crash I have:

Faulting application name: tst.exe, version: 0.0.0.0, time stamp: 0x54a37ca9
Faulting module name: USER32.dll, version: 6.1.7601.17514, time stamp: 0x4ce7c9f1
Exception code: 0xc0000005
Fault offset: 0x0000000000024b33
Faulting process id: 0x143c
Faulting application start time: 0x01d026a9a59d99ef
Faulting application path: <redacted>\tst.exe
Faulting module path: C:\Windows\system32\USER32.dll
Report Id: 3047c5f0-929d-11e4-958d-005056c00008

There are multiple entries corresponding to different runs but all are
the same: 0xc0000005 at 0x0000000000024b33

Hope that helps,
George

Melzzzzz

Jan 3, 2015, 1:09:11 PM

This is my quick test:

[bmaxa@maxa-pc assembler]$ ./rdtscp 4000000
4000000 128 byte blocks, loops:1
rep movsb                       0.04352539184211
rep movsq                       0.02895878605263
movntdq                         0.02523812921053
movntdq prefetch                0.02508215763158
movntdq prefetch ymm            0.02417047026316
[bmaxa@maxa-pc assembler]$ ./rdtscp 400000
400000 128 byte blocks, loops:10
rep movsb                       0.00311163213158
rep movsq                       0.00244263126316
movntdq                         0.00251265031579
movntdq prefetch                0.00257390510526
movntdq prefetch ymm            0.00242973521053
[bmaxa@maxa-pc assembler]$ ./rdtscp 4000
4000 128 byte blocks, loops:1000
rep movsb                       0.00001444596763
rep movsq                       0.00001314468553
movntdq                         0.00002107178763
movntdq prefetch                0.00002129352158
movntdq prefetch ymm            0.00002099912526
[bmaxa@maxa-pc assembler]$ ./rdtscp 40000
40000 128 byte blocks, loops:100
rep movsb                       0.00021878483684
rep movsq                       0.00018026386579
movntdq                         0.00023630260263
movntdq prefetch                0.00024114757105
movntdq prefetch ymm            0.00023099385000
[bmaxa@maxa-pc assembler]$
[bmaxa@maxa-pc assembler]$ cat rdtscp.asm
format elf64 executable 3
include 'import64.inc'
interpreter '/lib64/ld-linux-x86-64.so.2'
needed 'libc.so.6'
import printf,atoi,exit

segment executable
entry $
        mov     r8,100              ; default number of 128-byte blocks
        mov     r10,100000          ; default number of loops
        cmp     dword[rsp],2        ; argc < 2?
        jl      .skip
        mov     rdi,[rsp+16]        ; argv[1]
        call    [atoi]
        mov     r8,rax              ; block count from the command line
        xor     edx,edx
        mov     rax,4000000
        div     r8
        mov     r10,rax             ; loops = 4000000 / block count
.skip:
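        ; --- test 1: rep movsb ---
        rdtscp                      ; start time stamp (also writes ecx)
        shl     rdx,32
        or      rax,rdx
        mov     [r1],rax
        mov     rbx,r10
@@:
        imul    r9,r8,128           ; byte count = blocks * 128
        mov     rcx,r9
        mov     rdi,outbuf
        mov     rsi,inbuf
        rep movsb
        dec     rbx
        jnz     @b

        rdtscp
        shl     rdx,32
        or      rax,rdx
        sub     rax,[r1]            ; elapsed TSC ticks
        cvtsi2sd xmm0,rax
        cvtsi2sd xmm1,r10
        mulsd   xmm1,qword[clock]
        divsd   xmm0,xmm1           ; seconds per loop = ticks / (loops * clock)
        movsd   [r1],xmm0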

        ; --- test 2: rep movsq ---
        rdtscp
        shl     rdx,32
        or      rax,rdx
        mov     [r2],rax
        mov     rbx,r10
@@:
        imul    r9,r8,128/8         ; qword count = blocks * 16
        mov     rcx,r9
        mov     rdi,outbuf
        mov     rsi,inbuf
        rep movsq
        dec     rbx
        jnz     @b

        rdtscp
        shl     rdx,32
        or      rax,rdx
        sub     rax,[r2]            ; elapsed TSC ticks
        cvtsi2sd xmm0,rax
        cvtsi2sd xmm1,r10
        mulsd   xmm1,qword[clock]
        divsd   xmm0,xmm1
        movsd   [r2],xmm0

        ; --- test 3: movdqa loads, movntdq stores ---
        rdtscp
        shl     rdx,32
        or      rax,rdx
        mov     [r3],rax
        mov     rbx,r10
@@:
        mov     rcx,r8              ; one iteration per 128-byte block
        mov     rdi,outbuf
        mov     rsi,inbuf
.L0:
        movdqa  xmm0,[rsi]
        movdqa  xmm1,[rsi+0x10]
        movdqa  xmm2,[rsi+0x20]
        movdqa  xmm3,[rsi+0x30]
        movdqa  xmm4,[rsi+0x40]
        movdqa  xmm5,[rsi+0x50]
        movdqa  xmm6,[rsi+0x60]
        movdqa  xmm7,[rsi+0x70]
        movntdq [rdi],xmm0
        movntdq [rdi+0x10],xmm1
        movntdq [rdi+0x20],xmm2
        movntdq [rdi+0x30],xmm3
        movntdq [rdi+0x40],xmm4
        movntdq [rdi+0x50],xmm5
        movntdq [rdi+0x60],xmm6
        movntdq [rdi+0x70],xmm7
        add     rsi,128
        add     rdi,128
        dec     rcx
        jnz     .L0
        dec     rbx
        jnz     @b

        rdtscp
        shl     rdx,32
        or      rax,rdx
        sub     rax,[r3]            ; elapsed TSC ticks
        cvtsi2sd xmm0,rax
        cvtsi2sd xmm1,r10
        mulsd   xmm1,qword[clock]
        divsd   xmm0,xmm1
        movsd   [r3],xmm0

        ; --- test 4: movdqa/movntdq with prefetch ---
        rdtscp
        shl     rdx,32
        or      rax,rdx
        mov     [r4],rax
        mov     rbx,r10
@@:
        mov     rcx,r8
        mov     rdi,outbuf
        mov     rsi,inbuf
        prefetch [rsi]              ; initial prefetch of the first block
        prefetch [rsi+0x40]
.L1:
        prefetch [rsi+0x40]
        prefetch [rsi+0x80]
        movdqa  xmm0,[rsi]
        movdqa  xmm1,[rsi+0x10]
        movdqa  xmm2,[rsi+0x20]
        movdqa  xmm3,[rsi+0x30]
        movdqa  xmm4,[rsi+0x40]
        movdqa  xmm5,[rsi+0x50]
        movdqa  xmm6,[rsi+0x60]
        movdqa  xmm7,[rsi+0x70]
        movntdq [rdi],xmm0
        movntdq [rdi+0x10],xmm1
        movntdq [rdi+0x20],xmm2
        movntdq [rdi+0x30],xmm3
        movntdq [rdi+0x40],xmm4
        movntdq [rdi+0x50],xmm5
        movntdq [rdi+0x60],xmm6
        movntdq [rdi+0x70],xmm7
        add     rsi,128
        add     rdi,128
        dec     rcx
        jnz     .L1
        dec     rbx
        jnz     @b

        rdtscp
        shl     rdx,32
        or      rax,rdx
        sub     rax,[r4]            ; elapsed TSC ticks
        cvtsi2sd xmm0,rax
        cvtsi2sd xmm1,r10
        mulsd   xmm1,qword[clock]
        divsd   xmm0,xmm1
        movsd   [r4],xmm0

rdtscp
shl rdx,32
or rax,rdx
mov [r5],rax
mov rbx,r10
@@:
mov rcx,r8
mov rdi,outbuf
mov rsi,inbuf
prefetch [rsi]
prefetch [rsi+0x40]
.L2:
prefetch [rsi+0x40]
prefetch [rsi+0x80]
vmovdqa ymm0,[rsi]
vmovdqa ymm1,[rsi+0x20]
vmovdqa ymm2,[rsi+0x40]
vmovdqa ymm3,[rsi+0x60]
vmovntdq [rdi],ymm0
vmovntdq [rdi+0x20],ymm1
vmovntdq [rdi+0x40],ymm2
vmovntdq [rdi+0x60],ymm3
add rsi,128
add rdi,128
dec rcx
jnz .L2
dec rbx
jnz @b

rdtscp
shl rdx,32
or rax,rdx
sub rax,[r5]
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,r10
mulsd xmm1,qword[clock]
divsd xmm0,xmm1
movsd [r5],xmm0

mov rdi,fmth
mov rsi,r8
mov rdx,r10
xor eax,eax
call [printf]

mov rdi,fmt
mov rsi,fmtmovsb
movsd xmm0, [r1]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovsq
movsd xmm0, [r2]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdq
movsd xmm0, [r3]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdqp
movsd xmm0, [r4]
mov eax,1
call [printf]

mov rdi,fmt
mov rsi,fmtmovntdqy
movsd xmm0, [r5]
mov eax,1
call [printf]

call [exit]

segment readable
fmt db '%-32s%16.14f',0ah,0
fmtmovsb db 'rep movsb',0
fmtmovsq db 'rep movsq',0
fmtmovntdq db 'movntdq',0
fmtmovntdqp db 'movntdq prefetch',0
fmtmovntdqy db 'movntdq prefetch ymm',0
fmth db '%d 128 byte blocks, loops:%d',0ah,0
align 8
clock dq 3.8e9
segment writeable
align 32
inbuf rb 4000000*128
outbuf rb 4000000*128
r1 rq 1
r2 rq 1
r3 rq 1
r4 rq 1
r5 rq 1

Melzzzzz

Jan 3, 2015, 1:09:13 PM

On Sat, 03 Jan 2015 00:19:31 -0500
George Neuner <gneu...@nospicedham.comcast.net> wrote:

> On Fri, 02 Jan 2015 19:51:29 +0100, Bernhard Schornak
> <scho...@nospicedham.web.de> wrote:
>
> >Sorry for a malfunctioning test program - while updating the
> >main message procedure, I needed to save more registers, but
> >left both epilogues (processed exit / default procedure) un-
> >touched. Hence, some registers weren't restored at all, some
> >were restored with the wrong content. The original code used
> >0xC0...0xF0(%rsp), the changed code now starts with 0xA0 and
> >uses four additional registers.
> >
> >I hope this was the only error, but I wonder why the program
> >runs on my machine without a faint hint that something isn't
> >working properly...
> >
> >Okay, the new version is available here:
> >
> >https://docs.google.com/file/d/0B1OgMlxNnSNET19QSGtxcmN0WEU/edit?usp=drive_web
> >
> >Hope it works on all machines, now! ;)
> >
> >
> >Mea culpa and such...
> >
> >Bernhard Schornak
>
> Hi Bernhard,
>
> Still no luck on Win7 64-bit.

No luck in Wine, either.

Bernhard Schornak

Jan 3, 2015, 3:09:23 PM

George Neuner wrote:

> Hi Bernhard,
>
> Still no luck on Win7 64-bit. This new version behaves a bit
> differently in that the help menu items lock up instead of crashing,
> and the test briefly bursts to ~30% for approximately 1 second ...
>
> ... but then again it locks up with zero CPU activity and has to be
> killed.

Okay. I corrected one minor bug: Processing 'about' ran into the
program's initialisation rather than jumping to the common exit,
so additional instances of some memory blocks were requested and
the menu was initialised once more. This did not cause problems,
but was quite superfluous. I uploaded a new version

https://docs.google.com/file/d/0B1OgMlxNnSNEU3c4Um0yUWFGUWs/edit?usp=drive_web

and a debug version

https://docs.google.com/file/d/0B1OgMlxNnSNEa3oxNURranVwMUU/edit?usp=drive_web

This debug version generates a file named reg.dmp, providing the
register contents before starting and after finishing the tests,
as well as before/after evaluating the results file. Either send
me a copy of reg.dmp (a 64 KB file containing register dumps and
zeroes for unused entries) or download PTB, my debugging tool

https://docs.google.com/file/d/0B1OgMlxNnSNEc0FqcmRFdUYzaGc/edit?usp=drive_web

To let PTB show something other than an error message:

1. Open PTB and select [Program][Select Source].

2. Push [Add Folder], switch to your 'test' folder and [Open] it
to add it to the internal database.

3. Close PTB.

4. Run 'tst_dbg'. Issue the evaluation, first, then a test run.

5. There should be a 'reg.dmp' in the test folder, now.

6. [Dumps][Register Dump] opens the dump viewer. There should be
at least three dumps, if the test run hangs in the test loops
(or four if test hangs somewhere in a dialog's message loop).

> WRT event log: this new version isn't crashing so there is no error
> entry. However, for the previous version that did crash I have:
>
> Faulting application name: tst.exe, version: 0.0.0.0, time stamp:
> 0x54a37ca9
> Faulting module name: USER32.dll, version: 6.1.7601.17514, time stamp:
> 0x4ce7c9f1
> Exception code: 0xc0000005
> Fault offset: 0x0000000000024b33
> Faulting process id: 0x143c
> Faulting application start time: 0x01d026a9a59d99ef
> Faulting application path: <redacted>\tst.exe
> Faulting module path: C:\Windows\system32\USER32.dll
> Report Id: 3047c5f0-929d-11e4-958d-005056c00008
>
> There are multiple entries corresponding to different runs but all are
> the same: 0xc0000005 at 0x0000000000024b33

Okay, this tells us that USER32.DLL crashed. I have no way to
debug Microsoft's DLL, so knowing this external address leads
to nowhere. At least, USER32.DLL holds most GUI functions, so
the error indicates something went wrong while calling one of
the GUI functions.

> Hope that helps,
> George

Of course! Unfortunately, I cannot trace third party DLLs and
executables, so in this case it did not help too much. Never-
theless, one problem was solved - I am sure this tiny program
will run as it should sooner or later, but I still wonder why
the program runs perfectly on my machine while malfunctioning
on others...

Bernhard Schornak

Jan 3, 2015, 3:09:24 PM

Melzzzzz wrote:

> [bmaxa@maxa-pc assembler]$ ./rdtscp 4000000
> 4000000 128 byte blocks, loops:1
> rep movsb 0.04352539184211
> rep movsq 0.02895878605263
> movntdq 0.02523812921053
> movntdq prefetch 0.02508215763158
> movntdq prefetch ymm 0.02417047026316
> [bmaxa@maxa-pc assembler]$ ./rdtscp 400000
> 400000 128 byte blocks, loops:10
> rep movsb 0.00311163213158
> rep movsq 0.00244263126316
> movntdq 0.00251265031579
> movntdq prefetch 0.00257390510526
> movntdq prefetch ymm 0.00242973521053
> [bmaxa@maxa-pc assembler]$ ./rdtscp 4000
> 4000 128 byte blocks, loops:1000
> rep movsb 0.00001444596763
> rep movsq 0.00001314468553
> movntdq 0.00002107178763
> movntdq prefetch 0.00002129352158
> movntdq prefetch ymm 0.00002099912526
> [bmaxa@maxa-pc assembler]$ ./rdtscp 40000
> 40000 128 byte blocks, loops:100
> rep movsb 0.00021878483684
> rep movsq 0.00018026386579
> movntdq 0.00023630260263
> movntdq prefetch 0.00024114757105
> movntdq prefetch ymm 0.00023099385000

Okay - the task was to copy 16 MB from one to another memory
location where both memory blocks do not overlap... ;)

I guess your results are times in seconds, but they probably
are not really reliable if the processor works with variable
clock speeds for each core. RDTSCP returns reliable measure-
ments, even if the processor changes clock speed or switches
to power saving mode. The speed of AVX moves only depends on
the number of 64 bit busses between processor and memory. It
cannot be faster than the same task performed with SSE (XMM)
registers - the memory interface (not the register size!) is
the bottleneck.

           ST Test    Melzzz 1   Melzzz 2   Melzzz 3   Melzzz 4

MOVSB      153.29 %   173.53 %   120.89 %    67.84 %    90.73 %
MOVSQ      160.73 %   115.46 %    94.90 %    61.73 %    74.75 %
MOVDQA     113.26 %   100.62 %    97.62 %    98.96 %    97.99 %
PREFETCH      100 %      100 %      100 %      100 %      100 %

My test results vary, as well, but the overall error is less
than five percent. Your results for REP MOVSx vary between
67.84 and 173.53 (REP MOVSB) or 61.73 and 115.46 (REP MOVSQ)
percent. Do you believe these results are reliable enough to
decide which copy algorithm shall be implemented for the new
superfast memcopy()?

<snip>

...

> @@:
> mov rcx,r8
> mov rdi,outbuf
> mov rsi,inbuf
> prefetch [rsi]
> prefetch [rsi+0x40]

These prefetches are not required, but

> .L1:
> prefetch [rsi+0x40]
> prefetch [rsi+0x80]

these should be 0x80[RSI] and 0xC0[RSI]. Same applies to the
AVX version later on. Prefetching cache lines only speeds up
execution if it is done very early, so the prefetched memory
is present in L1 whenever the next iteration issues one more
read access. Writes are not that crucial - a write combining
sequence collects multiple write instructions to one and the
same cache line. This (partially) 'hides' write latencies.

> movdqa xmm0,[rsi]
> movdqa xmm1,[rsi+0x10]
> movdqa xmm2,[rsi+0x20]
> movdqa xmm3,[rsi+0x30]
> movdqa xmm4,[rsi+0x40]
> movdqa xmm5,[rsi+0x50]
> movdqa xmm6,[rsi+0x60]
> movdqa xmm7,[rsi+0x70]
> movntdq [rdi],xmm0
> movntdq [rdi+0x10],xmm1
> movntdq [rdi+0x20],xmm2
> movntdq [rdi+0x30],xmm3
> movntdq [rdi+0x40],xmm4
> movntdq [rdi+0x50],xmm5
> movntdq [rdi+0x60],xmm6
> movntdq [rdi+0x70],xmm7
> add rsi,128
> add rdi,128
> dec rcx
> jnz .L1
> dec rbx
> jnz @b

Melzzzzz

Jan 3, 2015, 10:55:44 PM

On Sat, 03 Jan 2015 21:08:08 +0100
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

>
> These prefetches are not required, but
>
>
> > .L1:
> > prefetch [rsi+0x40]
> > prefetch [rsi+0x80]
>
>
> these should be 0x80[RSI] and 0xC0[RSI]. Same applies to the
> AVX version later on. Prefetching cache lines only speeds up
> execution if it is done very early, so the prefetched memory
> is present in L1 whenever the next iteration issues one more
> read access. Writes are not that crucial - a write combining
> sequence collects multiple write instructions to one and the
> same cache line. This (partially) 'hides' write latencies.

Here are three consecutive runs with new prefetch:

[bmaxa@maxa-pc assembler]$ ./rdtscp 131072
131072 128 byte blocks, loops:30
rep movsb 0.00080062400000
rep movsq 0.00075080907895
movntdq 0.00077594236842
movntdq prefetch 0.00077857252632
movntdq prefetch ymm 0.00070758657895
[bmaxa@maxa-pc assembler]$ ./rdtscp 131072
131072 128 byte blocks, loops:30
rep movsb 0.00084068476316
rep movsq 0.00073536339474
movntdq 0.00078112389474
movntdq prefetch 0.00077485573684
movntdq prefetch ymm 0.00072159513158
[bmaxa@maxa-pc assembler]$ ./rdtscp 131072
131072 128 byte blocks, loops:30
rep movsb 0.00081143394737
rep movsq 0.00075160634211
movntdq 0.00078124363158
movntdq prefetch 0.00077630386842
movntdq prefetch ymm 0.00070821252632

Bernhard Schornak

Jan 4, 2015, 8:31:15 AM

I am astonished: AVX is much faster than SSE. Updated version
(I replaced the slowest testee with an AVX loop):

https://docs.google.com/file/d/0B1OgMlxNnSNEMWIyNXdiT2FxT1k/edit?usp=drive_web

Results (screenshot):

https://docs.google.com/file/d/0B1OgMlxNnSNEa194ZmozR3o4SnM/edit?usp=drive_web

Evaluation (screenshot):

https://docs.google.com/file/d/0B1OgMlxNnSNEeVBJUjd2TGQ1Y1k/edit?usp=drive_web

Might be interesting to figure out the optimum prefetch depth
for SSE and AVX.

George Neuner

Jan 4, 2015, 1:32:13 PM

On Sun, 04 Jan 2015 14:18:45 +0100, Bernhard Schornak
<scho...@nospicedham.web.de> wrote:

>I am astonished: AVX is much faster than SSE.

AVX is twice as wide: 256 bits vs 128. If your entire dataset fits
into the 2nd level cache - AVX is much faster than SSE.

But fitting your whole dataset into the cache is crucial. Using AVX
it's very easy to saturate the memory bus and get terrible performance
due to stalls. [Not that it was hard with SSE.]

Gods help us, AVX2 will be 512-bits wide (if and when it appears).

George

Melzzzzz

Jan 4, 2015, 4:17:55 PM

On Sun, 04 Jan 2015 13:23:33 -0500
George Neuner <gneu...@nospicedham.comcast.net> wrote:

> On Sun, 04 Jan 2015 14:18:45 +0100, Bernhard Schornak
> <scho...@nospicedham.web.de> wrote:
>
> >I am astonished: AVX is much faster than SSE.
>
> AVX is twice as wide: 256 bits vs 128. If your entire dataset fits
> into the 2nd level cache - AVX is much faster than SSE.
>
> But fitting your whole dataset into the cache is crucial. Using AVX
> it's very easy to saturate the memory bus and get terrible performance
> due to stalls. [Not that it was hard with SSE.]

Hm, this is not what I observe. Actually due to movntdq SSE/AVX is
slower than rep movs/b/q when whole data set fits in cache but faster
when data set is larger than cache, with AVX being slightly faster
than SSE. At least on Intel.
[bmaxa@maxa-pc assembler]$ ./rdtscp 400
400 128 byte blocks, loops:10000
rep movsb 0.00000159286461
rep movsq 0.00000125410263
movntdq 0.00000209012771
movntdq prefetch 0.00000208670147
movntdq prefetch ymm 0.00000209255061

[bmaxa@maxa-pc assembler]$ ./rdtscp 4000000
4000000 128 byte blocks, loops:1

rep movsb 0.04587493736842
rep movsq 0.02940718184211
movntdq 0.02456892710526
movntdq prefetch 0.02405898868421
movntdq prefetch ymm 0.02165183684211

>
> Gods help us, AVX2 will be 512-bits wide (if and when it appears).

You mean AVX512? AVX2 is already here and it extends AVX to integer
operations. This year, AMD Excavator will have AVX2.
AVX512 will be in Skylake-based Xeons only, as I read it.

>
> George

Terje Mathisen

Jan 5, 2015, 3:04:26 AM

George Neuner wrote:
> On Sun, 04 Jan 2015 14:18:45 +0100, Bernhard Schornak
> <scho...@nospicedham.web.de> wrote:
>
>> I am astonished: AVX is much faster than SSE.
>
> AVX is twice as wide: 256 bits vs 128. If your entire dataset fits
> into the 2nd level cache - AVX is much faster than SSE.

AVX* is faster than SSE if all the involved data paths are wider than
SSE/128 bits.

For data in RAM the only thing that really matters is target location:
Do you want the target to end up in some cache level, or is this data
that you are handing off to some bus master IO device, i.e. "fire & forget"?

If the latter then you should consider using the *NT moves that bypass
cache as well as the usual read-for-ownership. If you can still use
cache-line-sized coalescing write buffers you will get pretty close to
hw optimal performance.

>
> But fitting your whole dataset into the cache is crucial. Using AVX
> it's very easy to saturate the memory bus and get terrible performance
> due to stalls. [Not that it was hard with SSE.]

On modern systems RAM is in exactly the same position as disk/core used
to be on old mainframes: It is a mostly sequential block based backing
store, with on-chip cache as the real "random access memory".

>
> Gods help us, AVX2 will be 512-bits wide (if and when it appears).

AVX512 is an obvious stage, just like Larrabee/MIC/etc: One 64-byte
cache line per register.

Terje
>
> George

George Neuner

Jan 5, 2015, 2:06:10 PM

On Sun, 4 Jan 2015 22:08:33 +0100, Melzzzzz
<m...@nospicedham.zzzzz.com> wrote:

>On Sun, 04 Jan 2015 13:23:33 -0500
>George Neuner <gneu...@nospicedham.comcast.net> wrote:
>
>> But fitting your whole dataset into the cache is crucial. Using AVX
>> it's very easy to saturate the memory bus and get terrible performance
>> due to stalls. [Not that it was hard with SSE.]
>
>Hm, this is not what I observe. Actually due to movntdq SSE/AVX is
>slower than rep movs/b/q when whole data set fits in cache but faster
>when data set is larger than cache, with AVX being slightly faster
>than SSE. At least on Intel.

Maybe string ops are faster at memmove() ... I was processing series
of hundreds of 2Kx2Kx32-bit images with an algorithm that required
(not quite but) essentially random access to 5 images at a time, plus
writable metadata equivalent to 2 more images.

I saw 500+% slowdown when elements of my working set were pushed out
of cache. It took enormous effort to figure out a prefetch strategy
that worked.

George

George Neuner

Jan 5, 2015, 2:21:12 PM

On Mon, 05 Jan 2015 08:50:03 +0100, Terje Mathisen
<terje.m...@nospicedham.tmsw.no> wrote:

>George Neuner wrote:
>> On Sun, 04 Jan 2015 14:18:45 +0100, Bernhard Schornak
>> <scho...@nospicedham.web.de> wrote:
>>
>>> I am astonished: AVX is much faster than SSE.
>>
>> AVX is twice as wide: 256 bits vs 128. If your entire dataset fits
>> into the 2nd level cache - AVX is much faster than SSE.
>
>AVX* is faster than SSE if all the involved data paths are wider than
>SSE/128 bits.

Which essentially is only inside the core. Does any x86 have 256-bit
paths to L2?

>For data in RAM the only thing that really matters is target location:
>Do you want the target to end up in some cache level, or is this data
>that you are handing off to some bus master IO device, i.e. "fire & forget"?
>
>If the latter then you should consider using the *NT moves that bypass
>cache as well as the usual read-for-ownership. If you can still use
>cache-line-sized coalescing write buffers you will get pretty close to
>hw optimal performance.

Absolutely. But there are a lot of algorithms that require RMW targets and
so have to keep them in cache [and preferably close].

George

Terje Mathisen

Jan 6, 2015, 6:23:19 AM

This is probably the majority of all algorithms/code, in which case the
best you can do is often to figure out what is the smallest block size
you work with:

Small enough for L1? Great, do as much work as possible on each chunk,
then write it to the final destination using NT moves.

Fitting in L2: Still pretty good, this is an order of magnitude faster
than code which needs to repeatedly scan through memory buffers
significantly larger than L2.

One example: Many file systems are currently implementing realtime LZ4
compression/decompression. Since the dictionary is the previous 0-64 KB
of decompressed data, it will always be in L2 and quite often in L1.

The block size used to be fixed at 4MB, now it is variable in powers of
4, starting at 64KB and ending at 4MB (via 256KB and 1 MB), which means
that a single block will normally fit in L2 alongside the decompressed data!

Terje

Robert Prins

Jul 5, 2015, 4:59:43 PM

Very old thread, but this PS was interesting:

On 2014-12-26 07:59, Terje Mathisen wrote:
> PS. I haven't even mentioned the case of overlapping ranges which can be used to
> fill a block of memory with a recurring pattern of N bytes, where N can be any
> random value. On IBM mainframes the standard way to fill a block was to write
> the desired value into the first word, then do an overlapping block move,
> reading from the just stored word and writing to the next word!

And this may surprise you, but the current Enterprise PL/I compiler no longer
uses this technique, as on the current z13 (and apparently some of its
predecessors) overlapping MVCs are apparently slower than discrete loops, even
for loops that seem to have a considerable overhead!

See https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=66462

Robert
--
Robert AH Prins
robert(a)prino(d)org

Terje Mathisen

Jul 6, 2015, 2:01:22 AM

That doesn't surprise me at all:

Overlapping MVC in order to emulate RLL store operations _must_ carry a
larger overhead than simply doing the stores by themselves.

It is only when you have special-case hw which detects this case and
turns it into a block store on-the-fly that you avoid all the load
operations.