Load/Store Variable Number of Bytes to/from an XMM register?

Paul K. McKneely

May 13, 2011, 4:18:33 PM

Hi,

Is anyone here an expert on SSE?

I have been studying the AMD64 manual and I am trying to
find instructions that will load/store a variable number of bytes
to/from an XMM register. The arbitrary number can be known
at assembly time but it should take any number from 1 to 16.
This is necessary because a "vector" of bytes (either unsigned
or signed) will be specified by the programmer. My plan is to
load and store only the number of bytes being used starting with
the first. All bytes are contiguous leaving unused bytes to pad to
the end of the register. Results from operations on the unused
bytes will be thrown away. So far I can only find instructions
that load or store the whole or half of an XMM register. That is
not good enough. For example: Let's say I have a vector of 7
unsigned bytes. I could load 16 bytes with a movdqu but I risk
having an access fault on the last 9 bytes since I am reading into
what follows. If I store the first seven bytes with movdqu then I
will either overwrite anything that lies within the 9 fill bytes that
follow or I will have an access fault if it runs over the end of
mapped virtual memory.

Does this make sense?

Terje Mathisen

May 13, 2011, 5:02:06 PM

Paul K. McKneely wrote:
> Hi,
>
> Is anyone here an expert on SSE?
>
> I have been studying the AMD64 manual and I am trying to
> find instructions that will load/store a variable number of bytes
> to/from an XMM register. The arbitrary number can be known
> at assembly time but it should take any number from 1 to 16.
> This is necessary because a "vector" of bytes (either unsigned
> or signed) will be specified by the programmer. My plan is to
> load and store only the number of bytes being used starting with
> the first. All bytes are contiguous leaving unused bytes to pad to
> the end of the register. Results from operations on the unused
> bytes will be thrown away. So far I can only find instructions
> that load or store the whole or half of an XMM register. That is
> not good enough. For example: Let's say I have a vector of 7

Too bad: that's how SIMD code in general, and SSE specifically, works.

> unsigned bytes. I could load 16 bytes with a movdqu but I risk
> having an access fault on the last 9 bytes since I am reading into
> what follows. If I store the first seven bytes with movdqu then I
> will either overwrite anything that lies within the 9 fill bytes that
> follow or I will have an access fault if it runs over the end of
> mapped virtual memory.
>
> Does this make sense?

The desire to have a loadN() SSE operation is very understandable, but
such an operation is very seldom needed:

Unless you're loading directly from some hardware device, you can always
safely load from an aligned address that starts at or below the desired
starting point, then use a byte shuffle to align the desired bytes in
the low N positions.

You can also use an unaligned load, but then you're responsible for not
accessing beyond the current segment/page end.
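
The page-end check for an unaligned load can be done with a simple address test. What follows is a minimal sketch in C with SSE2 intrinsics, not code from this thread: the helper name load_n_guarded and the 4096-byte page size are assumptions for illustration, and it presumes a flat address space with page-granular protection. Reading past the end of an object is outside what the C standard promises, so memory checkers may still complain about the fast path.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* Hypothetical helper: load the first n (1..16) bytes at p into the low
   lanes of an XMM register. Lanes n..15 are returned as zero. */
static __m128i load_n_guarded(const unsigned char *p, size_t n)
{
    assert(n >= 1 && n <= 16);
    __m128i v;

    if (((uintptr_t)p & (PAGE_SIZE_ASSUMED - 1)) <= PAGE_SIZE_ASSUMED - 16) {
        /* All 16 bytes lie in the page that holds p[0], so a full
           unaligned load (movdqu) cannot touch an unmapped page. */
        v = _mm_loadu_si128((const __m128i *)p);
    } else {
        /* The load would cross into the next page: copy only the
           n valid bytes through a zeroed local buffer instead. */
        unsigned char tmp[16] = {0};
        memcpy(tmp, p, n);
        v = _mm_loadu_si128((const __m128i *)tmp);
    }

    /* Zero the lanes at index >= n so both paths return the same value. */
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    const __m128i keep = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)n));
    return _mm_and_si128(v, keep);
}
```

The slow path is only taken for addresses in the last 15 bytes of a page, so most calls end up as a single movdqu plus a mask.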

On the output side there is in fact a MASKMOVDQU operation which can
write any, none or all of the 16 bytes to the corresponding target
address, and the store can start at an arbitrary address.
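
For illustration, MASKMOVDQU is reachable from C through the SSE2 intrinsic _mm_maskmoveu_si128. The sketch below is not from this thread and the helper name store_n_masked is hypothetical; it builds a byte mask whose high bit is set in the first n lanes and lets the instruction write only those bytes. Note that MASKMOVDQU stores with a non-temporal hint.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_maskmoveu_si128 */
#include <stddef.h>

/* Hypothetical helper: store the low n (1..16) bytes of v at dst using
   MASKMOVDQU. Bytes dst[n..15] are not written. */
static void store_n_masked(unsigned char *dst, __m128i v, size_t n)
{
    assert(n >= 1 && n <= 16);
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    /* Lane i becomes 0xFF when i < n. MASKMOVDQU writes a byte only
       if the high bit of the matching mask byte is set. */
    const __m128i mask = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)n));
    _mm_maskmoveu_si128(v, mask, (char *)dst);
}
```

A call such as store_n_masked(out, v, 7) writes out[0] through out[6] and leaves the following bytes as they were.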

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

DSF

May 20, 2011, 10:40:08 PM

The bad part being that MASKMOVDQU is slower than frozen molasses. I
used it to replace a series of progressively smaller stores (8 bytes,
4 bytes, etc.) and was surprised to find it about four times as slow!
(Surprised until I looked up the info for it. Latency of 43 clocks
according to AMD and 64 macro-ops according to Agner Fog's research.)
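
A series of progressively smaller stores like the one described above could look roughly like this in C with SSE2 intrinsics. This is a sketch under assumptions, not the code that was actually measured: the helper name store_n_pieces is hypothetical, and it relies on x86 being little-endian so that truncating the low dword yields the low bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store the low n (1..16) bytes of v at dst with
   8-, 4-, 2- and 1-byte pieces, never writing past dst[n-1]. */
static void store_n_pieces(unsigned char *dst, __m128i v, size_t n)
{
    assert(n >= 1 && n <= 16);
    if (n == 16) {                      /* whole register: one movdqu */
        _mm_storeu_si128((__m128i *)dst, v);
        return;
    }
    if (n & 8) {                        /* low 8 bytes */
        _mm_storel_epi64((__m128i *)dst, v);
        v = _mm_srli_si128(v, 8);       /* shift remaining bytes down */
        dst += 8;
    }
    if (n & 4) {                        /* next 4 bytes */
        uint32_t d = (uint32_t)_mm_cvtsi128_si32(v);
        memcpy(dst, &d, 4);
        v = _mm_srli_si128(v, 4);
        dst += 4;
    }
    if (n & 2) {                        /* next 2 bytes */
        uint16_t w = (uint16_t)_mm_cvtsi128_si32(v);
        memcpy(dst, &w, 2);
        v = _mm_srli_si128(v, 2);
        dst += 2;
    }
    if (n & 1) {                        /* last byte */
        *dst = (unsigned char)_mm_cvtsi128_si32(v);
    }
}
```

When n is known at assembly or compile time, the branches disappear and only the stores for the set bits of n remain, which is at most four stores for n = 15.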

DSF

Paul K. McKneely

Jun 9, 2011, 9:18:34 AM

Hi All,

I came up with a solution which is not very pretty but it works
and will be reliable. It may not be very fast, which brings into
question how useful xmm registers are for speeding up execution.
To load or store 1 to 16 contiguous bytes of storage requires
one to seven instructions. The best cases are loading or storing
one dword, qword or dqword as in:

movd xmm0,DWORD PTR [Var]
movd xmm0,QWORD PTR [Var]
movdqu xmm0,XMMWORD PTR [Var]

The worst case is to load or store 15 bytes. The
following example loads 15 contiguous bytes:

mov al,BYTE PTR 14[A]
shl eax,16
mov ax,WORD PTR 12[A]
shl rax,32
mov eax,DWORD PTR 8[A]
movd xmm0,rax
movlhps xmm0,xmm0
movd xmm0,QWORD PTR [A]

The alternative is to always load the full register with
movdqu and "hope" that you are not reading from an
address that is not legal. That sort of practice seems
all too common in the PC industry as with Shoddy
Software Inc. who tells me "Well, it works most of the
time."

Paul K. McKneely

Jun 9, 2011, 2:46:04 PM

I just realized that the 15-byte load example will not
work because the last movd clears the upper 64 bits,
destroying the value you just put in there. You need
two xmm registers:

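mov al,BYTE PTR 14[A]      ; byte 14
shl eax,16
mov ax,WORD PTR 12[A]      ; bytes 12-13
shl rax,32                 ; move bytes 12-14 into the upper dword
mov ecx,DWORD PTR 8[A]     ; bytes 8-11 go through rcx: writing eax here
or rax,rcx                 ; would zero bits 32-63 of rax in 64-bit mode

movd xmm1,rax              ; upper 8 bytes (byte 15 is padding)
movd xmm0,QWORD PTR [A]    ; lower 8 bytes
movlhps xmm0,xmm1          ; xmm0 = bytes 0-14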
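
For comparison, the same 15-byte load can be sketched in C with SSE2 intrinsics. This is an illustration only, with a hypothetical helper name load15, and the compiler is free to choose different instructions than the hand-written sequence. Each piece is read into its own zero-extended variable and the pieces are combined with OR, because in 64-bit mode any write to a 32-bit register clears the upper half of the 64-bit register.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load exactly 15 bytes at a into an XMM register
   without reading a[15]. Byte 15 of the result is zero. */
static __m128i load15(const unsigned char *a)
{
    uint64_t lo;   /* bytes 0..7   */
    uint32_t d;    /* bytes 8..11  */
    uint16_t w;    /* bytes 12..13 */

    assert(a != NULL);
    memcpy(&lo, a, 8);
    memcpy(&d, a + 8, 4);
    memcpy(&w, a + 12, 2);

    /* Build the upper qword from zero-extended pieces (little-endian):
       bits 0-31 = bytes 8-11, bits 32-47 = bytes 12-13, bits 48-55 = byte 14. */
    uint64_t hi = (uint64_t)d | ((uint64_t)w << 32) | ((uint64_t)a[14] << 48);

    /* First argument is the upper qword, second the lower qword. */
    return _mm_set_epi64x((long long)hi, (long long)lo);
}
```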

Terje Mathisen

Jun 9, 2011, 4:02:58 PM

Loading one or two aligned 16-byte blocks, then shifting and merging
them will _always_ work, unless you are reading from a memory-mapped
device with destructive read operations, something that we stopped doing
after the EGA adapter in 1984.

Your code is a horrible set of partial register stalls. :-(
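
The approach of loading one or two aligned 16-byte blocks and then shifting and merging them can be sketched as follows. This is only an illustration in C, not code from this thread: the helper name load_n_aligned is hypothetical, it assumes SSSE3 (PSHUFB via _mm_shuffle_epi8) and a GCC- or Clang-style target attribute, and it relies on page-granular protection, since the aligned loads may read bytes outside the object, which the C standard does not sanction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (also pulls in SSE2) */

/* Hypothetical helper: load n (1..16) bytes starting at p using only
   aligned 16-byte loads. Lanes n..15 of the result are zero. */
static __attribute__((target("ssse3")))
__m128i load_n_aligned(const unsigned char *p, size_t n)
{
    assert(n >= 1 && n <= 16);
    const uintptr_t off = (uintptr_t)p & 15;   /* offset inside the block */
    const __m128i *base = (const __m128i *)((uintptr_t)p - off);

    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    /* pos[i] = off + i: position of result lane i counted from base (0..30). */
    const __m128i pos = _mm_add_epi8(idx, _mm_set1_epi8((char)off));

    /* First block: adding 0x70 keeps the low 4 bits for positions 0..15
       and sets bit 7 for positions 16..30, which makes PSHUFB write 0. */
    __m128i v = _mm_shuffle_epi8(_mm_load_si128(base),
                                 _mm_adds_epu8(pos, _mm_set1_epi8(0x70)));

    if (off + n > 16) {
        /* The data straddles two blocks. Only now touch the second one,
           which is guaranteed to contain at least one valid byte.
           pos - 16 has bit 7 set for positions 0..15 (lane zeroed). */
        const __m128i ctl = _mm_sub_epi8(pos, _mm_set1_epi8(16));
        v = _mm_or_si128(v, _mm_shuffle_epi8(_mm_load_si128(base + 1), ctl));
    }

    /* Clear the lanes at index >= n. */
    return _mm_and_si128(v, _mm_cmplt_epi8(idx, _mm_set1_epi8((char)n)));
}
```

Each aligned block that is read contains at least one byte of the requested range, so no load touches a page that the requested bytes do not already occupy.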