Any more graceful way to saturate adds?

Jim Leonard

unread,

Apr 10, 2008, 5:59:55 PM4/10/08

to

I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
adding two 8-bit numbers to each other and saturating the result.
Here is the best I could come up with:

; AL and DL have unsigned 8-bit values to add
; DL is actually extended into DX (ie. 00xxh)
cbw ; Make AL a signed word
add dx,ax ; Do word sized add, into DX
jns @@HIGH ; jump if result not signed/negative
xor dx,dx ; Underflowed; make 0
@@HIGH:
or dh,dh ; Did it overflow?
jz @@OK ; if not, we're ok
mov dx,00FFh ; Overflowed; force our saturation value (FFh)
@@OK:

This is part of an ADPCM inner loop.

Is this about as fast as I can expect? Or am I missing an obvious
technique? (I am not an 808x assembler expert)

Dick Wesseling

unread,

Apr 10, 2008, 10:32:55 PM4/10/08

to

In article <e9aa9511-4087-485f...@c65g2000hsa.googlegroups.com>,

Jim Leonard <spam...@crayne.org> writes:
> I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
> adding two 8-bit numbers to each other and saturating the result.
> Here is the best I could come up with:
>
> ; AL and DL have unsigned 8-bit values to add

And the result should be unsigned also I suppose. In that case you
only need to test for overflow. Unsigned overflow sets the carry
flag. Extend the carry flag to all ones (FFh) for overflow and
zero for the normal case. Then it goes something like this:

add dl,al ; do byte size add
sbb dh,dh ; carry -> FFh no carry -> 0
or dl,dh ; saturate
; xor dh.dh ; Want word size result???

Terje Mathisen

unread,

Apr 11, 2008, 3:41:42 AM4/11/08

to

What's wrong with the obvious approach?

add dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff

OK?

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Jim Leonard

unread,

Apr 11, 2008, 2:05:19 PM4/11/08

to

On Apr 11, 2:41 am, Terje Mathisen <spamt...@crayne.org> wrote:
> Jim Leonard wrote:
> > I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
> > adding two 8-bit numbers to each other and saturating the result.
> > Here is the best I could come up with:
>
> > ; AL and DL have unsigned 8-bit values to add
> > ; DL is actually extended into DX (ie. 00xxh)
> > cbw ; Make AL a signed word
> > add dx,ax ; Do word sized add, into DX
> > jns @@HIGH ; jump if result not signed/negative
> > xor dx,dx ; Underflowed; make 0
> > @@HIGH:
> > or dh,dh ; Did it overflow?
> > jz @@OK ; if not, we're ok
> > mov dx,00FFh ; Overflowed; force our saturation value (FFh)
> > @@OK:
>
> > This is part of an ADPCM inner loop.
>
> > Is this about as fast as I can expect? Or am I missing an obvious
> > technique? (I am not an 808x assembler expert)
>
> What's wrong with the obvious approach?
>
> add dl,al ; Carry set if it overflows!
> sbb al,al ; AL = 0xff if overflow
> or dl,al ; Turns any overflow into 0xff
>
> OK?

Yes -- I was missing the forest for the trees, again. Thanks! (and
thanks to Mr. Wesseling as well).

Terence

unread,

Apr 11, 2008, 7:40:10 PM4/11/08

to

I wouldn't do it that way.
I would add the two supposedly unsigned 8-bit integers and test the 9
th bit for 1, which if present, the addition has overflowed.

Note that any non-overflow result of 255 is for your purposes
indistinguishable from an overflow,
So you just add, test 9th bit, jump around or load or OR 255h.

Terence

unread,

Apr 12, 2008, 2:28:22 AM4/12/08

to

Make that last "255h" into 255 or else 0FFh! (Sigh!)

Wolfgang Kern

unread,

Apr 12, 2008, 6:51:54 AM4/12/08

to

Terence wrote:

>> I wouldn't do it that way.
>> I would add the two supposedly unsigned 8-bit integers and test the 9
>> th bit for 1, which if present, the addition has overflowed.

>> Note that any non-overflow result of 255 is for your purposes
>> indistinguishable from an overflow,
>> So you just add, test 9th bit, jump around or load or OR 255h.

> Make that last "255h" into 255 or else 0FFh! (Sigh!)

ok. :)

First I thought: 'what's wrong by using the Carry-flag ? ',
but now I may see the flaw in the whole story.
The result can turn out to be FF(h) without an overflow,
so using FF(h) as overflow indication may lead to errors.

;just use carry-flag as indication:
ADD DL,AL
JC error

;or expand it to a true result
MOV DH,0 ;
ADD DL,AL ;also possible: MOVZX DX,AL
ADC DH,0 ;DH NZ meens overflow but DX holds true value

;or use just bit0 of DH for overflow (a bit shorter/faster):
ADD DL,AL
SETC DH ;DH NZ means overflow

;or if DL should just be clamped to FF(h):
ADD DL,AL
SETC DH ;becomes 0 or 1
NEG DH ;becomes 0 or FF
OR DL,DH ;DL keeps value or becomes FF(h)

SETcc can avoid branches and gain speed by this,
but it will not work on older CPUs.
(my example suffers from partial register stalls and
dependencies, so it should by used within other pairing
instructions for full speed),

__
wolfgang

Robert Redelmeier

unread,

Apr 12, 2008, 3:29:18 PM4/12/08

to

Terence <spam...@crayne.org> wrote in part:

> On Apr 12, 9:40 am, Terence <spamt...@crayne.org> wrote:
>> I wouldn't do it that way. I would add the two supposedly
>> unsigned 8-bit integers and test the 9 th bit for 1,
>> which if present, the addition has overflowed.

CF is also set with two byte adds.

>> Note that any non-overflow result of 255 is for your
>> purposes indistinguishable from an overflow, So you just
>> add, test 9th bit, jump around or load or OR 255h.
> Make that last "255h" into 255 or else 0FFh! (Sigh!)

Well, the OP did specify an old 808x which gives a more
complex answer:

Terje's clever SBB/OR is fastest on modern CPUs by a
great deal since there are no poorly predictable branches.

Your JC saves one instruction (two byte-transfers),
but only on the non-saturate case. For the saturate case,
it costs two more instr ( MOV/JMP ) [4 bytes].

Optimum will depend on the data.

-- Robert

Richard Russell

unread,

Apr 12, 2008, 4:52:03 PM4/12/08

to

On 11 Apr, 08:41, Terje Mathisen <spamt...@crayne.org> wrote:
> What's wrong with the obvious approach?
>

> add al,dl ; Carry set if it overflows!

> sbb al,al ; AL = 0xff if overflow
> or dl,al ; Turns any overflow into 0xff

Very neat. Now, what about the signed case (saturating at +127 or
-128)? The best I've managed is:

add al,dl ; Signed addition
jno ok ; Jump if no overflow
sbb al,al ; al is 0x00 or 0xff
xor al,127 ; Saturate to -128 or +127
ok:

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Terence

unread,

Apr 12, 2008, 8:03:07 PM4/12/08

to

On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:

> CF is also set with two byte adds.

But this never occurs with adding one byte to one bye in two-byte
registers (with zero bits to the left). The objective is to count to
no more than 254 since 255 itself is the overflow signal.

Rod Pemberton

unread,

Apr 12, 2008, 11:10:57 PM4/12/08

to

"Terence" <spam...@crayne.org> wrote in message
news:55e84590-158c-4042...@p25g2000pri.googlegroups.com...

Oh, you didn't like the prior two changes? (inc doesn't update carry...)

Do you like these two changes (instead) to "Terje's solution"?

add al, 1
adc dl,al ; Carry set if it overflows!

sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff

Rod Pemberton

unread,

Apr 12, 2008, 11:06:56 PM4/12/08

to

"Terence" <spam...@crayne.org> wrote in message
news:55e84590-158c-4042...@p25g2000pri.googlegroups.com...

Do you like these two changes to "Terje's solution"?

inc al
adc dl,al ; Carry set if it overflows!

sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff

Rod Pemberton

Terje Mathisen

unread,

Apr 13, 2008, 9:00:06 AM4/13/08

to

That's a red herring:

It really doesn't matter if you say:

if ((sum = a+b) > 254) sum = 255;

or if you write:

if ((sum = a+b) > 255) sum = 255;

Please notice that as long as the sum is above 254, saturation will
always return 255, and at that point there is no way to determine if
this was the result of saturation after overflow, or simply because the
sum actually was 255!

Terje Mathisen

unread,

Apr 13, 2008, 9:31:59 AM4/13/08

to

Richard Russell wrote:
> On 11 Apr, 08:41, Terje Mathisen <spamt...@crayne.org> wrote:
>> What's wrong with the obvious approach?
>>
>> add al,dl ; Carry set if it overflows!
>> sbb al,al ; AL = 0xff if overflow
>> or dl,al ; Turns any overflow into 0xff
>
> Very neat. Now, what about the signed case (saturating at +127 or
> -128)? The best I've managed is:
>
> add al,dl ; Signed addition
> jno ok ; Jump if no overflow
> sbb al,al ; al is 0x00 or 0xff
> xor al,127 ; Saturate to -128 or +127
> ok:

That is pretty neat code, I like it!

The generic way to handle arbitrary clamping values is something like this:

;; if ( x < MIN) x = MIN;
;; else if (x > MAX) x = MAX;

cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result

At this point there are 4 possible combinations:

second 0 second 1
first 0 OVFL OK
first 1 impossible UNDFL

These four corresponds to EBX = 0, 1, -2, -1 which means that it is
possible to store the original value into a four-element table, then use
EBX to retrieve either the same value back or the clamped results:

cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result
mov clamp_table[4], eax
mov eax,clamp_table[ebx*4+8]

This code will only be faster than the naive approach if clamped values
are both common and non-predictable, if this isn't true then it is much
better to simply use the generic C-style if/else if/else approach!

Rod Pemberton

unread,

Apr 13, 2008, 11:27:29 AM4/13/08

to

"Terje Mathisen" <spam...@crayne.org> wrote in message
news:rrSdnYmBeZ1Knp_V...@giganews.com...

> Terence wrote:
> > On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> >
> >> CF is also set with two byte adds.
> >
> > But this never occurs with adding one byte to one bye in two-byte
> > registers (with zero bits to the left). The objective is to count to
> > no more than 254 since 255 itself is the overflow signal.
>
> That's a red herring:
>
> It really doesn't matter if you say:
>
> if ((sum = a+b) > 254) sum = 255;
>
> or if you write:
>
> if ((sum = a+b) > 255) sum = 255;
>
> Please notice that as long as the sum is above 254, saturation will
> always return 255, and at that point there is no way to determine if
> this was the result of saturation after overflow, or simply because the
> sum actually was 255!
>
> OK?
>

Let's plug some values into the code you posted:

> add dl,al ; Carry set if it overflows!
> sbb al,al ; AL = 0xff if overflow
> or dl,al ; Turns any overflow into 0xff

if (al+dl) = 0, result al=0, dl=0
if 1< (al+dl) < 254, result al=0, dl=dl<>0
if (al+dl) = 255, result al=0, dl=dl=0xff
if 255 < (al+dl), result al=0xff, dl=0xff

dl is 0xff at sum > 254, but al is 0xff at sum > 255 ...

> Please notice that as long as the sum is above 254, saturation will
> always return 255,

False, if one uses al. True, if one uses dl.

> and at that point there is no way to determine if
> this was the result of saturation after overflow, or simply because the
> sum actually was 255!

False, if one uses al. True, if one uses dl.

If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
is sufficient:

add dl,al ; Carry set if it overflows!

sbb dl,dl ; DL = 0xff if overflow

if (al+dl) = 0, result dl=0, al=al
if 1< (al+dl) < 254, result dl=0, al=al
if (al+dl) = 255, result dl=0, al=al
if 255 < (al+dl), result dl=0xff, al=al

Rod Pemberton

Alexei A. Frounze

unread,

Apr 13, 2008, 8:19:25 PM4/13/08

to

On Apr 13, 8:27 am, "Rod Pemberton" <spamt...@crayne.org> wrote:
> "Terje Mathisen" <spamt...@crayne.org> wrote in message
...

While we're at it, how about complementing this with a similar
solution for signed saturation (-128 to +127) of signed bytes for the
fun of it? :)

Alex

Rod Pemberton

unread,

Apr 14, 2008, 2:41:38 AM4/14/08

to

"Alexei A. Frounze" <spam...@crayne.org> wrote in message
news:58afc87d-dcb4-4688...@s13g2000prd.googlegroups.com...

??? How's your server?

http://groups.google.com/group/comp.lang.asm.x86/msg/36af00ed6b8201ae
http://groups.google.com/group/comp.lang.asm.x86/msg/b994b1ae492f9833

RP

Jim Leonard

unread,

Apr 14, 2008, 9:58:12 AM4/14/08

to

On Apr 13, 10:27 am, "Rod Pemberton" <spamt...@crayne.org> wrote:
> If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
> is sufficient:

I do not need to distinguish 255 from overflow.

The quick background for this is that it is the clamping function for
adding a signed 8-bit integer to an unsigned 8-bit integer. The
function is an adapted ADPCM routine; whereas most ADPCM works on
signed 16-bit samples, I have a need for it to work on unsigned 8-bit
samples.
The inner loop of the routine has a previous unsigned sample, and a
new signed sample, and needs to add them together to get the new
sample. The result needs to clamp at both 0 and 255.

With this in mind, what is the best solution, assuming the previous
unsigned sample is in DL and the new signed sample is in AL?

Alexei A. Frounze

unread,

Apr 14, 2008, 6:39:12 AM4/14/08

to

On Apr 13, 11:41 pm, "Rod Pemberton" <spamt...@crayne.org> wrote:
> "Alexei A. Frounze" <spamt...@crayne.org> wrote in messagenews:58afc87d-dcb4-4688...@s13g2000prd.googlegroups.com...

>
> > On Apr 13, 8:27 am, "Rod Pemberton" <spamt...@crayne.org> wrote:
> > > "Terje Mathisen" <spamt...@crayne.org> wrote in message
> > ...
>
> > While we're at it, how about complementing this with a similar
> > solution for signed saturation (-128 to +127) of signed bytes for the
> > fun of it? :)
>
> ??? How's your server?
>
> http://groups.google.com/group/comp.lang.asm.x86/msg/36af00ed6b8201aehttp://groups.google.com/group/comp.lang.asm.x86/msg/b994b1ae492f9833
>
> RP

Missed the posts. Still I was wondering if someone came up with a
compact method not involving jumps and LUTs.

Alex

pete

unread,

Apr 15, 2008, 1:11:19 AM4/15/08

to

In article <7f1061af-6435-47c0...@a22g2000hsc.googlegroups.com>
spam...@crayne.org "Jim Leonard " writes:

Just to be absolutely clear about the problem (and without any
attempt at optimising the code), can you confirm that the below
is what you're after?

(unsigned sample in DL, signed "update" in AL)

sub dh, dh ;DH = 0
cbw ;sign-extend AL to AX
add dx, ax ;signed addition
Now:
if DH == 0, nothing to do: ans in DL
if DH == FF, -ve overflow so clamp to 0: set DL = 0
if DH == 1, +ve overflow so clamp to FF: set DL = FF

Is that correct?

Pete
--
"We have not inherited the earth from our ancestors,
we have borrowed it from our descendants."

Jim Leonard

unread,

Apr 15, 2008, 10:30:34 AM4/15/08

to

On Apr 15, 12:11 am, pete <spamt...@crayne.org> wrote:
> Just to be absolutely clear about the problem (and without any

> attempt at optimizing the code), can you confirm that the below

> is what you're after?
>
> (unsigned sample in DL, signed "update" in AL)
>
> sub dh, dh ;DH = 0
> cbw ;sign-extend AL to AX
> add dx, ax ;signed addition
> Now:
> if DH == 0, nothing to do: ans in DL
> if DH == FF, -ve overflow so clamp to 0: set DL = 0
> if DH == 1, +ve overflow so clamp to FF: set DL = FF
>
> Is that correct?

Yes, that's exactly correct. My beginner's attempt at applying the
clamping has been done with JMPs, which are really costly in an inner
loop (especially on my lowly 808x) and I feel that I'm missing
something obvious.

Was Terje's solution:

add dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff

...the right one? I see overflow handling but not underflow...

pete

unread,

Apr 17, 2008, 1:05:43 PM4/17/08

to

In article <9c1c6a51-9572-4ffb...@l42g2000hsc.googlegroups.com>
spam...@crayne.org "Jim Leonard " writes:

Well, I'd guess that Terje has more expertise in his little
finger that I in my entire body :-) but below is the best I can
come up with -- still a jmp in there though...

; unsigned sample in DL, signed "update" in AL
sub dh, dh
cbw
add dx, ax
or dh, dh
jz done
; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
sar dh, 1 ;now DH is 0 or FF, but the wrong way round
not dh ;invert it
mov dl, dh
done:

Perhaps (certainly!) someone can improve on that?

Terje Mathisen

unread,

Apr 17, 2008, 2:14:08 PM4/17/08

to

pete wrote:
> In article <9c1c6a51-9572-4ffb...@l42g2000hsc.googlegroups.com>

>> Was Terje's solution:
>>
>> add dl,al ; Carry set if it overflows!
>> sbb al,al ; AL = 0xff if overflow
>> or dl,al ; Turns any overflow into 0xff
>>
>> ...the right one? I see overflow handling but not underflow...

That didn't even try to handle a unsigned/signed mix.

>
> Well, I'd guess that Terje has more expertise in his little

This only means I have more scars and have made more mistakes than most
of the rest of you.

> finger that I in my entire body :-) but below is the best I can
> come up with -- still a jmp in there though...
>
> ; unsigned sample in DL, signed "update" in AL
> sub dh, dh
> cbw
> add dx, ax
> or dh, dh
> jz done
> ; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
> sar dh, 1 ;now DH is 0 or FF, but the wrong way round
> not dh ;invert it
> mov dl, dh
> done:

What about converting the problem (sort of) to an unsigned case?

; The two first instructions can be skipped if the high
; halfs are known to be zero
and dx,255
and ax,255
; Move the signed update into the [0-255] range:
add ax,128

add ax,dx ; Result must be in [0..510] range, of which
; 128..383 is in range

; The trivial way to clamp this is to adjust & branch:
sub ax,128
jc underflow
cmp ax,256
jnc overflow

Can we do this differently? Sure, but only if clamping is both common
and non-predictable:

sub ax,128
mov dx,0xff00
cmovc al,dl
cmp ax,256
cmovnc al,dh

How about doing it without conditional moves?

sub ax,128
sbb dx,dx ; -1 if underflow
cmp ax,256
sbb bx,bx
xor dx,-1

and ax,dx
or ax,bx

Anyway, unless clamping is very common, it is hard to beat the branch
predictors!

Danjel McGougan

unread,

Apr 18, 2008, 5:31:30 AM4/18/08

to

On 17 Apr, 20:14, Terje Mathisen <spamt...@crayne.org> wrote:

> pete wrote:
> >> Was Terje's solution:
>
> >> add dl,al ; Carry set if it overflows!
> >> sbb al,al ; AL = 0xff if overflow
> >> or dl,al ; Turns any overflow into 0xff
>
> >> ...the right one? I see overflow handling but not underflow...
>

[...]

>
> Anyway, unless clamping is very common, it is hard to beat the branch
> predictors!
>

I don't think the 808x CPU has a branch predictor. I guess one could
just add up the cycles used for each instruction on such an old CPU
and come up with the best performing code for various statistics of
the data. It should be pretty much deterministic (does it even use
cache memory?).

BR,
Danjel McGougan

Terje Mathisen

unread,

Apr 18, 2008, 2:21:01 PM4/18/08

to

Danjel McGougan wrote:
> On 17 Apr, 20:14, Terje Mathisen <spamt...@crayne.org> wrote:
>> pete wrote:
>>>> Was Terje's solution:
>>>> add dl,al ; Carry set if it overflows!
>>>> sbb al,al ; AL = 0xff if overflow
>>>> or dl,al ; Turns any overflow into 0xff
>>>> ...the right one? I see overflow handling but not underflow...
> [...]
>> Anyway, unless clamping is very common, it is hard to beat the branch
>> predictors!
>>
>
> I don't think the 808x CPU has a branch predictor. I guess one could

Ouch, I forgot about the 8088/8086 target!

> just add up the cycles used for each instruction on such an old CPU
> and come up with the best performing code for various statistics of
> the data. It should be pretty much deterministic (does it even use
> cache memory?).

The only cache at all is a 6/8 byte prefetch queue for instruction
bytes, but on 8088 that queue is very often empty since it has to
compete with all memory accesses.

The performance is easy to calculate though:

Without taken branches:
4 cycles per byte touched, code & data.

Taken branches were something like 4-8 cycles afair, might be wrong.

; Input: AL = signed byte, DL = unsigned byte
; Output: AL unsigned clamped byte

xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
js underflow ; 2
test ah,ah ; 2
jnz overflow ; 2
store:
stosb ; 1+1 Store clamped result
jmp next ; 2+t

This looks like 14 bytes total, i.e. about 56 cycles for a nonclamped
result.

underflow:
xor ax,ax ; 2
stosb ; 1+1
jmp next ; 2+t

Underflow adds a taken branch but get rid of 2 code bytes, so we're
talking about +/- zero extra cost.

overflow:
mov al,255 ; 3
stosb ; 1+1
jmp next ; 2+t

Overflow requires 3 extra code bytes and a taken branch, i.e.
approximately 16+ cycles.

If most samples turn out to be in range, then we should simplify the
code to optimize this path even further:

xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
test ah,0ffh
jnz clamp
stosb
jmp next

clamp:
js underflow
mov al,255
stosb
jmp next

underflow:
xor ax,ax
stosb
jmp next

dave

unread,

Apr 18, 2008, 4:57:11 PM4/18/08

to

On the real 8088 IBM PC the cycle count is total bytes in the
instruction and data being referenced times the wait states. The PC
had so many waits on memory access that only a few instructions such
as multiply and divide would take more CPU cycles than the fetch.

Novosad, Stephan R.

unread,

Apr 21, 2008, 9:57:57 AM4/21/08

to

Hello,

   Given the 8088/8086 target, and looking at the code posted,
I realized that an undocumented instruction could be used.
Ever since I heard of these instructions that were not mentioned
by Intel, I have been trying to find a use for them. This may
be one. Maybe Jim will time the various solutions posted, and
mention which he chose.

   The effort behind my entry was to have no jumps and no extra
memory access (immediate data, variables, or PUSH/POP). My 8088 is
dead, so I have not programed for it for quite some time. But for
my 80186 machine, those two rules have sped up some programs.

   Anyway, here is an example of using SALC.

Regards,

Steve N.

; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SALC    MACRO           ; Set AL to Carry (not for NEC V20).
        DB      0D6H    ; One byte instead of the two byte SBB reg,reg.
        ENDM

CLAMP:                  ; Setup
        XOR     DH,DH   ; Convert unsigned byte to word.
        CBW             ; Convert signed byte to word.
        MOV     BL,AL   ; Save for unsigned byte calc.

                        ; Signed word add:
        ADD     AX,DX
        SAR     AH,1    ; \ Make mask for zero clamp.
        NOT     AH      ; /

                        ; Unsigned byte add:
        ADD     BL,DL   ; (Could be DL,BL if result in DL wanted.)
        SALC            ; Make mask for 0FFH clamp.

        OR      BL,AL   ; Clamp high.
        AND     BL,AH   ; Clamp low.

        RET

; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Jim Leonard

unread,

Apr 21, 2008, 10:23:39 AM4/21/08

to

On Apr 18, 3:57 pm, dave <spamt...@crayne.org> wrote:
> On the real 8088 IBM PC the cycle count is total bytes in the
> instruction and data being referenced times the wait states. The PC
> had so many waits on memory access that only a few instructions such
> as multiply and divide would take more CPU cycles than the fetch.

...but they still had to execute. Just because an opcode takes 4 or 8
cycles to fetch doesn't mean its execution time is magically erased.
Since the (tiny) prefetch queue can be full after a slow operation,
optimization still matters.

I originally asked about the topic because a jump clears the prefetch
queue, and was trying to find a way to clamp without jumping. I have
to ask, though: The penalty on my CPU for jumping is 1. clearing the
prefetch queue and 2. 17 cycles. I will have to look at Terje's
suggestions and see if the solutions he provided actually beat that,
which may have been what he was referring to when he wrote "unless

Jim Leonard

unread,

Apr 21, 2008, 10:33:56 AM4/21/08

to

On Apr 18, 1:21 pm, Terje Mathisen <spamt...@crayne.org> wrote:
> The only cache at all is a 6/8 byte prefetch queue for instruction
> bytes, but on 8088 that queue is very often empty since it has to
> compete with all memory accesses.

4 bytes, actually, on 8088. It's 6 on 8086 and NEC V20/30.

> Taken branches were something like 4-8 cycles afair, might be wrong.

17, actually. That's where the detective work comes in; to "beat" the
penalty of a jump, the alternate solution has to be 16 cycles or less
(including opcode fetch as you noted).

I think, based on what you wrote, that 17 cycles is not that big a
deal when alternate solutions would most likely be longer/slower.

> If most samples turn out to be in range, then we should simplify the
> code to optimize this path even further:

"Most" is subjective, but in an ADPCM decoder (what this code is for)
I would agree that the ratio of needing clamping to not needing is
about 1:8, so yes, most samples will be in range.

> xor dx,dx
> next: ; bytes
> mov dl,[bx+si] ; 2+1
> lodsb ; 1+1
> cbw ; 1 -128 to 127
> add ax,dx ; 2 -128 to 383
> test ah,0ffh
> jnz clamp
> stosb
> jmp next
>
> clamp:
> js underflow
> mov al,255
> stosb
> jmp next
>
> underflow:
> xor ax,ax
> stosb
> jmp next
>
> Terje

Looks like a winner; thanks for all your help!

Esra Sdrawkcab

unread,

Apr 21, 2008, 4:07:15 PM4/21/08

to

Might it be "better" (at least smaller code) to place the stosb after
the "next" label? but then you'd need a setup JMP to start,
[and a "cx=0" test at some point!]

xor dx,dx
jmp short start

next: ; bytes
stosb
start:

mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
test ah,0ffh

jz next

clamp:
js underflow
mov al,255

jmp next

underflow:
xor ax,ax
jmp next

aku ankka

unread,

Apr 22, 2008, 3:38:05 AM4/22/08

to

Hello. You can use carry to generate a saturation mask. The idea
revolves around this concept:

0 - 1 = -1 (all bits set)
0 - 0 = 0

add al, bl ; this is ur addition
xor cl, cl ; zero d cl register
sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
or al, cl ; this will saturate; the cl = 0xff if carry was set.. else
it is 0x00

w00t++ w00t++

Saturating w/o branching is child's play; enjoy!

Terje Mathisen

unread,

Apr 22, 2008, 2:03:00 PM4/22/08

to

aku ankka wrote:
> Hello. You can use carry to generate a saturation mask. The idea
> revolves around this concept:
>
> 0 - 1 = -1 (all bits set)
> 0 - 0 = 0
>
> add al, bl ; this is ur addition
> xor cl, cl ; zero d cl register

This XOR will overwrite the flags from the previous ADD!
INC/DEC are the only arithmetic opcodes which doesn't modify the carry flag.

> sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
> or al, cl ; this will saturate; the cl = 0xff if carry was set.. else
> it is 0x00
>
> w00t++ w00t++
>
> Saturating w/o branching is child's play; enjoy!

Children often make simple mistakes. :-)

Terje Mathisen

unread,

Apr 22, 2008, 2:01:06 PM4/22/08

to

Esra Sdrawkcab wrote:
> Might it be "better" (at least smaller code) to place the stosb after
> the "next" label? but then you'd need a setup JMP to start,
> [and a "cx=0" test at some point!]
>
> xor dx,dx
> jmp short start
>
> next: ; bytes
> stosb
> start:
> mov dl,[bx+si] ; 2+1
> lodsb ; 1+1
> cbw ; 1 -128 to 127
> add ax,dx ; 2 -128 to 383
> test ah,0ffh
> jz next

Yes indeed, this is a very common/useful optimization, I didn't do this
simply because I didn't know what the surrounding code looked like, I
just wrote a minimal wrapper around the core clamping code.

BTW, there is a very sneaky way to avoid that initial jump:

Instead of

jmp short start
next:
stosb
start:

you can instead write:

mov al,0AAh
org $-1 ; Back up one byte into the immediate constant above
next:
stosb

I.e. on the first iteration we do a dummy two-byte instruction with a
one-byte immediate, which we then make sure contains the STOSB opcode byte.

The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
a taken branch, so the actual cost of rotating the loop like this is 8
cycles on the first iteration, instead of 17, and each iteration saves
two opcode bytes or 8 cycles.

aku ankka

unread,

Apr 23, 2008, 2:43:47 AM4/23/08

to

On Apr 22, 9:03 pm, Terje Mathisen <spamt...@crayne.org> wrote:
> aku ankka wrote:
> > Hello. You can use carry to generate a saturation mask. The idea
> > revolves around this concept:
>
> > 0 - 1 = -1 (all bits set)
> > 0 - 0 = 0
>
> > add al, bl ; this is ur addition
> > xor cl, cl ; zero d cl register
>
> This XOR will overwrite the flags from the previous ADD!
> INC/DEC are the only arithmetic opcodes which doesn't modify the carry flag.
>
> > sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
> > or al, cl ; this will saturate; the cl = 0xff if carry was set.. else
> > it is 0x00
>
> > w00t++ w00t++
>
> > Saturating w/o branching is child's play; enjoy!
>
> Children often make simple mistakes. :-)
>
> Terje
>
> --

> - <Terje.Mathi...@hda.hydro.com>

> "almost all programming can be viewed as an exercise in caching"

Yea; d xor was unnecessary, I was about to post follow-up but didn't
see dat post o mine because of moderation lag. But since u tried to to
correct tha sh1t, here's what I wanted to post about five seconds
after hitting d send button:

a - a - carry = 0 - carry

In other words the register doesn't need to be zeroed to begin wid.
I'm baffled u didn't see dat one.. I did but wanted to quote miself so
there u go.. thxbye

Esra Sdrawkcab

unread,

Apr 23, 2008, 4:20:41 AM4/23/08

to

Terje Mathisen wrote:

> you can instead write:
>
> mov al,0AAh
> org $-1 ; Back up one byte into the immediate constant above
> next:
> stosb
>
> I.e. on the first iteration we do a dummy two-byte instruction with a
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
>
> The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
> a taken branch, so the actual cost of rotating the loop like this is 8
> cycles on the first iteration, instead of 17, and each iteration saves
> two opcode bytes or 8 cycles.
>

Neat.

Jim Leonard

unread,

May 1, 2008, 4:13:06 PM5/1/08

to

On Apr 22, 1:01 pm, Terje Mathisen <spamt...@crayne.org> wrote:
> BTW, there is a very sneaky way to avoid that initial jump:
>
> Instead of
>
> jmp short start
> next:
> stosb
> start:
>
> you can instead write:
>
> mov al,0AAh
> org $-1 ; Back up one byte into the immediate constant above
> next:
> stosb
>
> I.e. on the first iteration we do a dummy two-byte instruction with a
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
>
> The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
> a taken branch, so the actual cost of rotating the loop like this is 8
> cycles on the first iteration, instead of 17, and each iteration saves
> two opcode bytes or 8 cycles.

I hate to seem more daft than I already am, but after re-reading the
above for a week I still don't get the trick. I thought ORG was for
specifying where the assembler assumes the binary code will get
loaded; how does that save two opcode bytes per iteration?

pete

unread,

May 2, 2008, 12:38:05 AM5/2/08

to

In article <25740676-bb54-4871...@j22g2000hsf.googlegroups.com>
spam...@crayne.org "Jim Leonard" writes:

> On Apr 22, 1:01 pm, Terje Mathisen <spamt...@crayne.org> wrote:
> > BTW, there is a very sneaky way to avoid that initial jump:
> >
> > Instead of
> >
> > jmp short start
> > next:
> > stosb
> > start:
> >
> > you can instead write:
> >
> > mov al,0AAh
> > org $-1 ; Back up one byte into the immediate constant above
> > next:
> > stosb
> >
> > I.e. on the first iteration we do a dummy two-byte instruction with a
> > one-byte immediate, which we then make sure contains the STOSB opcode byte.
> >
> > The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
> > a taken branch, so the actual cost of rotating the loop like this is 8
> > cycles on the first iteration, instead of 17, and each iteration saves
> > two opcode bytes or 8 cycles.
>
> I hate to seem more daft than I already am, but after re-reading the
> above for a week I still don't get the trick.

The "trick" is to locate the next: label in the middle of the
instruction 'mov al,imm8' -- the 'org $-1' winds back the program
counter one location and assembles the STOSB opcode in the
location occupied by the imm8 value of the previous instruction.
This results in a benign 'mov al,0AAh' first time through the
code (benign in the sense that AL will soon after be loaded with
the proper value) but when looping back to the next: label the
code will find a STOSB (i.e. jumps into the middle of the mov
instruction).

> I thought ORG was for
> specifying where the assembler assumes the binary code will get
> loaded; how does that save two opcode bytes per iteration?

This saving comes from having a single STOSB at the "top" of the
loop; the alternative earlier loop had 3 STOSB instructions (one
in each branch) -- hence the saving of 2 x STOSB.

Very neat and goes some way to justify my earlier comment about
Terje's little finger ;-)

Pete

Incidentally, I've used the 'org $-1' technique in the past for
self-modifying code to jump to alternative addresses (like a
switch() in C). Something like

jmp short +2
org $-1
disp db ?

proc1: some code
ret
proc2: more code etc...
ret

Then, at some point before the jmp you just load up the
displacement byte (disp) with 0, (proc2-proc1), (proc3-proc1),
etc. according to which section of code you want to jump to.
Not much use on modern processors I guess, but it worked well on
the 8086 and z80!

Terje Mathisen

unread,

May 2, 2008, 3:11:22 AM5/2/08

to

Pete wrote a good explanation, using ORG together with $ is for relative
instead of absolute addresses.

The two bytes saved per iteration was the unconditional jump from the
bottom of the loop, getting rid of those required the loop to be
rotated, i.e. moving the STOSB to the top, and getting rid of the
initial jump past this opcode was the reason for my dummy immediate
instruction.

H. Peter Anvin

unread,

Jun 2, 2008, 12:43:53 PM6/2/08

to

Terje Mathisen wrote:
>
> BTW, there is a very sneaky way to avoid that initial jump:
>
> Instead of
>
> jmp short start
> next:
> stosb
> start:
>
> you can instead write:
>
> mov al,0AAh
> org $-1 ; Back up one byte into the immediate constant above
> next:
> stosb
>
> I.e. on the first iteration we do a dummy two-byte instruction with a
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
>

It's worth noting that this will cause an L1 cache miss on some platforms.

-hpa

Terje Mathisen

unread,

Jun 2, 2008, 1:53:52 PM6/2/08

to

Which was why I did note it in one of my posts, and returned here when
it was clear that the target was an original 808x, which really didn't
care about stuff like cache coherence or partially decoded instructions. :-)

The trick was a win on every x86 up to and including the 486 which was
the last time I used it.