[jnc +1? stosb/stosw are only one-byte opcodes, "AA/AB"]
Yes, pre- and post-aligning string operations is
the main speed gain in my OS. It works with 32-bit
reduction/extension for any odd start and size,
and I also align source or destination to quad bounds.
...            ;adjust for an aligned loop start here
SHR ecx,1      ;ECX=0 gives CF=0, so no action at all
JNC +2         ;skip the LODSW, which is two bytes in use32
LODSW          ;(66h operand-size prefix + opcode ADh)
REP LODSD      ;falls through if ECX=0
And with similar dummy reads up front and at the end, it
can part-read disk sectors at any offset and size.
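
For readers who prefer C, here is a rough model of the same idea. It is only a sketch under my own assumptions, not the code from my OS: word_source, next_word, skip_words, read_part and SECTOR_WORDS are hypothetical names, the sector is assumed to be 512 bytes delivered as 16-bit words, and offset and size are counted in words rather than bytes. skip_words mirrors the SHR/JNC/LODSW/REP LODSD sequence: one odd word first, then whole dwords.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_WORDS 256u  /* assumption: 512-byte sector as 16-bit words */

/* Hypothetical stand-in for the data source: a cursor over 16-bit words. */
typedef struct {
    const uint16_t *data;
    size_t pos;
} word_source;

static uint16_t next_word(word_source *s)
{
    return s->data[s->pos++];
}

/* Dummy-read (discard) `words` words. An odd count costs one single word
   first, the rest goes in pairs (one dword each). A count of 0 reads
   nothing, like the asm falling through. */
static void skip_words(word_source *s, size_t words)
{
    if (words & 1u)                 /* CF set after SHR ecx,1 */
        (void)next_word(s);         /* the lone LODSW */
    for (words >>= 1; words != 0; words--) {
        (void)next_word(s);         /* one dword = two words, REP LODSD */
        (void)next_word(s);
    }
}

/* Copy `count` words that start `first` words into the current sector.
   The rest of the sector is consumed by dummy reads so the source ends
   up at the next sector boundary. Returns 0 on success, -1 if the
   requested range does not fit inside one sector. */
static int read_part(word_source *s, uint16_t *dst, size_t first, size_t count)
{
    size_t i;

    if (first > SECTOR_WORDS || count > SECTOR_WORDS - first)
        return -1;

    skip_words(s, first);                          /* dummy reads up front */
    for (i = 0; i < count; i++)
        dst[i] = next_word(s);                     /* the wanted part */
    skip_words(s, SECTOR_WORDS - first - count);   /* dummy reads at end */
    return 0;
}
```

Calling read_part(&src, buf, 3, 5) on a fresh sector would discard three words, keep the next five in buf, and drain the remaining 248, leaving src positioned at the start of the following sector.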