How does this instruction assemble?

Lars Erdmann

May 26, 2012, 2:50:40 PM

Hello,

I have to use an assembler that can only assemble <= 386 instructions (don't
ask ...).

I want to use the instruction:

movnti es:[bx],eax

from a 16-bit (!) code segment. Does this assemble as:

; movnti es:[bx],eax
db 026h,00fh,0c3h,007h

026h : es segment override prefix
00fh,0c3h: code for "movnti"
007h: for [bx],eax

?
Am I correct in assuming that I do not need an operand-size override prefix
(as "movnti" always uses a 32-bit register)?
I would be very grateful if someone could assemble this for a 16-bit (!)
code segment with an assembler that supports SSE2 or later and provide the
resulting listing.

And last but not least: is there something like:

movnti eax,es:[bx]

?
Funnily enough, I have seen the latter in some Microsoft example code, but the
Intel spec does not list it as possible. If it does not exist, does anyone know
an equivalent for "movnti eax,es:[bx]"?

Lars

Frank Kotler

May 26, 2012, 3:39:23 PM

Using Nasm, this code:

bits 16

movnti [es:bx], eax
o32 movnti [es:bx], eax
; movnti eax, [es:bx] ; nasm reports "invalid combination of opcodes and operands"

Disassembles to:

00000000 260FC307 movnti [es:bx],eax
00000004 26 es
00000005 66 o32
00000006 0FC307 movnti [bx],eax

As to what this "means", I'm afraid it's "above my pay grade". Hope you
find it useful...

Best,
Frank

Lars Erdmann

May 26, 2012, 5:14:59 PM

Yes, that answers my question.
Thanks.

Lars

"Frank Kotler" <fbko...@nospicedham.myfairpoint.net> wrote in message
news:jprble$sf7$1...@speranza.aioe.org...

Philip Lantz

May 27, 2012, 4:45:19 AM

Lars Erdmann wrote:
>
> Hallo,
>
> I have to use an assembler that can only assemble <= 386 instructions (don't
> ask ...).
>
> I want to use instruction:
>
> movnti es:[bx],eax
>
> from a 16-bit (!) code segment. Does this assemble as:
>
> ; movnti es:[bx],eax
> db 026h,00fh,0c3h,007h
>
> 026h : es segment override prefix
> 00fh,0c3h: code for "movnti"
> 007h: for [bx],eax
>
> ?

This looks right to me.
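
For anyone who wants to check the bytes by hand, here is a minimal C sketch of how the ModRM byte is put together for this form. The names `modrm`, `encode_movnti_es_bx` and the two enums are made up for illustration; the field values are the standard 16-bit addressing and register encodings, and only the no-displacement [bx] form is covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r/m field values for 16-bit addressing with mod = 00 (no displacement).
 * Note that r/m = 6 with mod = 00 means a bare disp16, not [bp]. */
enum rm16 {
    RM16_BX_SI = 0, RM16_BX_DI = 1, RM16_BP_SI = 2, RM16_BP_DI = 3,
    RM16_SI = 4, RM16_DI = 5, RM16_DISP16 = 6, RM16_BX = 7
};

/* reg field values for the 32-bit general-purpose registers. */
enum reg32 {
    REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
    REG_ESP = 4, REG_EBP = 5, REG_ESI = 6, REG_EDI = 7
};

/* Pack the three ModRM fields: mod in bits 7:6, reg in 5:3, r/m in 2:0. */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* Hypothetical helper: emit "movnti es:[bx], r32" as raw bytes.
 * Returns the number of bytes written (always 4 for this form). */
size_t encode_movnti_es_bx(uint8_t out[4], enum reg32 src)
{
    out[0] = 0x26;                       /* ES segment override prefix */
    out[1] = 0x0F;                       /* two-byte opcode escape     */
    out[2] = 0xC3;                       /* MOVNTI m32, r32            */
    out[3] = modrm(0, src, RM16_BX);     /* mod=00, reg=src, r/m=[bx]  */
    return 4;
}
```

With `REG_EAX` as the source this yields 26 0F C3 07, which matches both the `db` line in the question and Frank's NASM listing.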

> Do I assume correctly that I do not need an operand size override prefix (as
> "movnti" always uses 32-bit register) ?

I believe so.

> And last but not least: is there something like:
>
> movnti eax,es:[bx]
>
> ?

> If not, does anyone know an
> equivalent for "movnti eax,es:[bx]" ?

I was just talking to some experts about this very question a couple of
weeks ago, and I believe the consensus was that there is no way to
bypass the cache on a load.

Philip Lantz

May 29, 2012, 11:41:27 PM

Philip Lantz wrote:

> Lars Erdmann wrote:
> > And last but not least: is there something like:
> >
> > movnti eax,es:[bx]
> >
> > ?
> > If not, does anyone know an
> > equivalent for "movnti eax,es:[bx]" ?
>
> I was just talking to some experts about this very question a couple of
> weeks ago, and I believe the consensus was that there is no way to
> bypass the cache on a load.

I misremembered: movntdqa will do this. (Maybe the question was how to
do it without SSE*, in which case I was right.)

Another option is to use movnti to write to an address within the cache
line; this will force the cache line to be evicted (and written back, if
it is dirty). If you follow that with mfence, then a subsequent normal
read operation to an address within the same cache line is guaranteed to
come directly from the device. However, you could only use this if there
is an address in the relevant cache line that is safe to write to.

In other words:
movnti es:[bx+N], eax
mfence
mov eax, es:[bx]

where N and the initial value of eax are carefully chosen to be
innocuous, and [bx+N] is within the same cache line as [bx].
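
Whether [bx+N] really lands in the same line as [bx] can be checked with a little arithmetic. A rough C sketch, assuming real mode (linear address = segment * 16 + offset) and a power-of-two line size; `real_mode_linear` and `same_cache_line` are hypothetical helper names, and in protected mode the segment base would have to come from the descriptor instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: real-mode addressing, so the linear address is seg*16 + off. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* True if both linear addresses fall inside the same cache line.
 * line_size must be a non-zero power of two (e.g. 32 or 64). */
bool same_cache_line(uint32_t a, uint32_t b, uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    uint32_t mask = ~(line_size - 1);    /* clears the offset-within-line bits */
    return (a & mask) == (b & mask);
}
```

Since a real-mode segment base is only guaranteed to be 16-byte aligned, the check has to be done on the linear address, not on bx alone.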

I'm not sure there are any advantages to using this approach over using
clflush.

Lars Erdmann

Jun 10, 2012, 4:13:02 AM

Thank you so much for your answer.

The problem with "clflush": it is an optional instruction, and it does not
necessarily come with SSE*.
In fact it has its own identifying bit in the CPUID feature flags.
My CPU does not support it (though the problem reporter's system might).
That renders it kind of useless :-(

The problem with movntdqa: it requires SSE4.1, and my test box only supports
SSE2 ...

That leaves the alternative option. Cache line size is 32 bytes for all Intel
Pentium CPUs, correct?

Lars

"Philip Lantz" <p...@nospicedham.canterey.us> wrote in message
news:MPG.2a2f36a4a...@news.eternal-september.org...

Terje Mathisen

Jun 10, 2012, 5:03:32 AM

Lars Erdmann wrote:
> Thank you so much for your answer.
>
> The problem with "clflush": it is an optional instruction. And it does
> not necessarily come with SSE*.
> In fact it has its own identifying bit in the CPUID feature flags.
> And my CPU does not support it (but maybe the problem reporter's system).
> That renders it kind of useless :-(
>
> Problem with movntdqa: it requires SSE4.1. And my test box only supports
> SSE2 ...
>
> That leaves the alternate option. Cache line size is 32 bytes for all
> Intel Pentium CPUs, correct ?

Line size has been 64 bytes on pretty much all Intel CPUs with cache, afair.

Some of them have had 128-byte lines, but with separate tracking for
each half...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

George Neuner

Jun 11, 2012, 2:13:01 PM

On Sun, 10 Jun 2012 11:03:32 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote:

>Lars Erdmann wrote:
>>
>> That leaves the alternate option. Cache line size is 32 bytes for all
>> Intel Pentium CPUs, correct ?
>
>Line size has been 64 bytes on pretty much all Intel cpus with cache, afair.

The 32-bit Pentium family members all have 32-byte L1 lines. Don't
know offhand about the 64-bit Pentiums.

i486 had 16-byte L1 lines.

George

wolfgang kern

Jun 12, 2012, 12:43:44 PM

George Neuner mentioned:
...

>>Lars Erdmann wrote:

>>> That leaves the alternate option. Cache line size is 32 bytes for all
>>> Intel Pentium CPUs, correct ?

>>Line size has been 64 bytes on pretty much all Intel cpus with cache,
>>afair.

> The 32-bit Pentium family member all have 32 byte L1 lines. Don't
> know offhand about the 64-bit Pentiums.

> i486 had 16 byte L1 lines.

I cannot confirm this (I haven't used Intel chips in many years...).
My early AMD 486 already used a 32-bit cache size for data and code fetch.
AFAIR, a 64-bit data cache came with the AMD K8, while the instruction fetch
(aka I-cache) remained 32-bit. I think Intel's EM64T AMD clones do the same.

It could well be that newer 64-bit architectures use larger cache lines,
even though I cannot see much gain in doing so.
__
wolfgang

George Neuner

Jun 12, 2012, 8:55:15 PM

On Tue, 12 Jun 2012 18:43:44 +0200, "wolfgang kern" <now...@never.at>
wrote:

>
>George Neuner mentioned:

>
>> The 32-bit Pentium family member all have 32 byte L1 lines. Don't
>> know offhand about the 64-bit Pentiums.
>>
>> i486 had 16 byte L1 lines.
>
>I cannot confirm this (because I don't use Intels since many years...)
>My early AMD486 used already 32-bit cache-size for data and code-fetch.
>AFAIR, 64-bit data cache came with AMD-K8, while the instruction-fetch
>(aka Icache) remained 32-bits. Methink Intels EMT64 AMD-clones do same.
>
>Could well be that newer 64-bit architectures use larger cache-lines
>even I cannot see much gain in such an attempt.
>__
>wolfgang

I don't have a cite handy, but my understanding is that the newer
Intel 64-bit chips (Core 2, i3/5/7) use 256-bit (4 word) cache lines.

But in any case, here we are talking about L1 cache. The older chips
did not have on-chip L2 cache. The new multi-cores do have on-chip
L2.

George

Philip Lantz

Jun 13, 2012, 3:01:14 AM

Lars Erdmann wrote:
> Philip Lantz schrieb...

> The problem with "clflush": it is an optional instruction. And it does not
> necessarily come with SSE*.
> In fact it has its own identifying bit in the CPUID feature flags.
> And my CPU does not support it (but maybe the problem reporter's system).
> That renders it kind of useless :-(

It is true that clflush has a separate bit in the CPUID feature flags,
but it was introduced at the same time as movnti (SSE2), and I don't
believe that Intel has ever made any processors that have one but not
the other. If you have an Intel processor with movnti, I'm pretty sure
it has clflush too.

> That leaves the alternate option. Cache line size is 32 bytes for all
> Intel Pentium CPUs, correct ?

I think some are 32 and some are 64. You can get the cache line size
from CPUID leaf 1, EBX bits 15:8 (the CLFLUSH line size, reported in
units of 8 bytes).
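
To make the bit positions concrete, here is a small C sketch that decodes values already read from CPUID leaf 1. It deliberately does not execute CPUID itself, so the register values have to be supplied by the caller, and the function names are only illustrative. The bit positions used are EDX bit 19 (CLFSH), EDX bit 26 (SSE2) and EBX bits 15:8 (CLFLUSH line size in 8-byte units); as far as I recall, the line-size field is only meaningful when the CLFSH bit is set, so treat the result with care on a CPU that reports no clflush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX bit 19 (CLFSH): the clflush instruction is available. */
bool has_clflush(uint32_t edx)
{
    return ((edx >> 19) & 1u) != 0;
}

/* CPUID leaf 1, EDX bit 26: SSE2 (which includes movnti) is available. */
bool has_sse2(uint32_t edx)
{
    return ((edx >> 26) & 1u) != 0;
}

/* CPUID leaf 1, EBX bits 15:8 hold the CLFLUSH line size in units of
 * 8 bytes; multiply by 8 to get the line size in bytes. */
unsigned clflush_line_bytes(uint32_t ebx)
{
    return ((ebx >> 8) & 0xFFu) * 8u;
}
```

A field value of 8 therefore means 64-byte lines and a value of 4 means 32-byte lines.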

wolfgang kern

Jun 13, 2012, 4:22:27 AM

George Neuner replied:

>>> The 32-bit Pentium family member all have 32 byte L1 lines. Don't
>>> know offhand about the 64-bit Pentiums.

>>> i486 had 16 byte L1 lines.

>>I cannot confirm this (because I don't use Intels since many years...)
>>My early AMD486 used already 32-bit cache-size for data and code-fetch.
>>AFAIR, 64-bit data cache came with AMD-K8, while the instruction-fetch
>>(aka Icache) remained 32-bits. Methink Intels EMT64 AMD-clones do same.

>>Could well be that newer 64-bit architectures use larger cache-lines
>>even I cannot see much gain in such an attempt.

> I don't have a cite handy, but my understanding is that the newer
> Intel 64-bit chips (core 2, i3/5/7) use 256-bit (4 word) cache lines.

> But in any case here we are talking about L1 cache. The older chips
> did not have on chip L2 cache. The new multi-cores do have on chip
> L2.

Oh yeah, thanks to Bill Gates and his bloatware factory, memory became
larger, faster and affordable :) [8MB L1 and 256MB L2 per core here].

About L1: I don't remember when (with which CPU) the L1 cache was split
into separate instruction-fetch and data caches. Optimisation was quite a
bit easier with the old combined code-and-data L1 cache architecture, JMHO.
__
wolfgang