; AL and DL have unsigned 8-bit values to add
; DL is actually extended into DX (ie. 00xxh)
cbw ; Make AL a signed word
add dx,ax ; Do word sized add, into DX
jns @@HIGH ; jump if result not signed/negative
xor dx,dx ; Underflowed; make 0
@@HIGH:
or dh,dh ; Did it overflow?
jz @@OK ; if not, we're ok
mov dx,00FFh ; Overflowed; force our saturation value (FFh)
@@OK:
This is part of an ADPCM inner loop.
Is this about as fast as I can expect? Or am I missing an obvious
technique? (I am not an 808x assembler expert)
And the result should be unsigned also I suppose. In that case you
only need to test for overflow. Unsigned overflow sets the carry
flag. Extend the carry flag to all ones (FFh) for overflow and
zero for the normal case. Then it goes something like this:
add dl,al ; do byte size add
sbb dh,dh ; carry -> FFh no carry -> 0
or dl,dh ; saturate
; xor dh.dh ; Want word size result???
What's wrong with the obvious approach?
add dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff
OK?
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Yes -- I was missing the forest for the trees, again. Thanks! (and
thanks to Mr. Wesseling as well).
Note that any non-overflow result of 255 is for your purposes
indistinguishable from an overflow,
So you just add, test 9th bit, jump around or load or OR 255h.
Make that last "255h" into 255 or else 0FFh! (Sigh!)
>> I wouldn't do it that way.
>> I would add the two supposedly unsigned 8-bit integers and test the 9
>> th bit for 1, which if present, the addition has overflowed.
>> Note that any non-overflow result of 255 is for your purposes
>> indistinguishable from an overflow,
>> So you just add, test 9th bit, jump around or load or OR 255h.
> Make that last "255h" into 255 or else 0FFh! (Sigh!)
ok. :)
First I thought: 'what's wrong by using the Carry-flag ? ',
but now I may see the flaw in the whole story.
The result can turn out to be FF(h) without an overflow,
so using FF(h) as overflow indication may lead to errors.
;just use carry-flag as indication:
ADD DL,AL
JC error
;or expand it to a true result
MOV DH,0 ;
ADD DL,AL ;also possible: MOVZX DX,AL
ADC DH,0 ;DH NZ meens overflow but DX holds true value
;or use just bit0 of DH for overflow (a bit shorter/faster):
ADD DL,AL
SETC DH ;DH NZ means overflow
;or if DL should just be clamped to FF(h):
ADD DL,AL
SETC DH ;becomes 0 or 1
NEG DH ;becomes 0 or FF
OR DL,DH ;DL keeps value or becomes FF(h)
SETcc can avoid branches and gain speed by this,
but it will not work on older CPUs.
(my example suffers from partial register stalls and
dependencies, so it should by used within other pairing
instructions for full speed),
__
wolfgang
CF is also set with two byte adds.
>> Note that any non-overflow result of 255 is for your
>> purposes indistinguishable from an overflow, So you just
>> add, test 9th bit, jump around or load or OR 255h.
> Make that last "255h" into 255 or else 0FFh! (Sigh!)
Well, the OP did specify an old 808x which gives a more
complex answer:
Terje's clever SBB/OR is fastest on modern CPUs by a
great deal since there are no poorly predictable branches.
Your JC saves one instruction (two byte-transfers),
but only on the non-saturate case. For the saturate case,
it costs two more instr ( MOV/JMP ) [4 bytes].
Optimum will depend on the data.
-- Robert
Very neat. Now, what about the signed case (saturating at +127 or
-128)? The best I've managed is:
add al,dl ; Signed addition
jno ok ; Jump if no overflow
sbb al,al ; al is 0x00 or 0xff
xor al,127 ; Saturate to -128 or +127
ok:
Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.
> CF is also set with two byte adds.
But this never occurs with adding one byte to one bye in two-byte
registers (with zero bits to the left). The objective is to count to
no more than 254 since 255 itself is the overflow signal.
Oh, you didn't like the prior two changes? (inc doesn't update carry...)
Do you like these two changes (instead) to "Terje's solution"?
add al, 1
adc dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff
Rod Pemberton
Do you like these two changes to "Terje's solution"?
inc al
adc dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff
Rod Pemberton
That's a red herring:
It really doesn't matter if you say:
if ((sum = a+b) > 254) sum = 255;
or if you write:
if ((sum = a+b) > 255) sum = 255;
Please notice that as long as the sum is above 254, saturation will
always return 255, and at that point there is no way to determine if
this was the result of saturation after overflow, or simply because the
sum actually was 255!
That is pretty neat code, I like it!
The generic way to handle arbitrary clamping values is something like this:
;; if ( x < MIN) x = MIN;
;; else if (x > MAX) x = MAX;
cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result
At this point there are 4 possible combinations:
second 0 second 1
first 0 OVFL OK
first 1 impossible UNDFL
These four corresponds to EBX = 0, 1, -2, -1 which means that it is
possible to store the original value into a four-element table, then use
EBX to retrieve either the same value back or the clamped results:
cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result
mov clamp_table[4], eax
mov eax,clamp_table[ebx*4+8]
This code will only be faster than the naive approach if clamped values
are both common and non-predictable, if this isn't true then it is much
better to simply use the generic C-style if/else if/else approach!
Let's plug some values into the code you posted:
> add dl,al ; Carry set if it overflows!
> sbb al,al ; AL = 0xff if overflow
> or dl,al ; Turns any overflow into 0xff
if (al+dl) = 0, result al=0, dl=0
if 1< (al+dl) < 254, result al=0, dl=dl<>0
if (al+dl) = 255, result al=0, dl=dl=0xff
if 255 < (al+dl), result al=0xff, dl=0xff
dl is 0xff at sum > 254, but al is 0xff at sum > 255 ...
> Please notice that as long as the sum is above 254, saturation will
> always return 255,
False, if one uses al. True, if one uses dl.
> and at that point there is no way to determine if
> this was the result of saturation after overflow, or simply because the
> sum actually was 255!
False, if one uses al. True, if one uses dl.
If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
is sufficient:
add dl,al ; Carry set if it overflows!
sbb dl,dl ; DL = 0xff if overflow
if (al+dl) = 0, result dl=0, al=al
if 1< (al+dl) < 254, result dl=0, al=al
if (al+dl) = 255, result dl=0, al=al
if 255 < (al+dl), result dl=0xff, al=al
Rod Pemberton
While we're at it, how about complementing this with a similar
solution for signed saturation (-128 to +127) of signed bytes for the
fun of it? :)
Alex
??? How's your server?
http://groups.google.com/group/comp.lang.asm.x86/msg/36af00ed6b8201ae
http://groups.google.com/group/comp.lang.asm.x86/msg/b994b1ae492f9833
RP
I do not need to distinguish 255 from overflow.
The quick background for this is that it is the clamping function for
adding a signed 8-bit integer to an unsigned 8-bit integer. The
function is an adapted ADPCM routine; whereas most ADPCM works on
signed 16-bit samples, I have a need for it to work on unsigned 8-bit
samples.
The inner loop of the routine has a previous unsigned sample, and a
new signed sample, and needs to add them together to get the new
sample. The result needs to clamp at both 0 and 255.
With this in mind, what is the best solution, assuming the previous
unsigned sample is in DL and the new signed sample is in AL?
Missed the posts. Still I was wondering if someone came up with a
compact method not involving jumps and LUTs.
Alex
Just to be absolutely clear about the problem (and without any
attempt at optimising the code), can you confirm that the below
is what you're after?
(unsigned sample in DL, signed "update" in AL)
sub dh, dh ;DH = 0
cbw ;sign-extend AL to AX
add dx, ax ;signed addition
Now:
if DH == 0, nothing to do: ans in DL
if DH == FF, -ve overflow so clamp to 0: set DL = 0
if DH == 1, +ve overflow so clamp to FF: set DL = FF
Is that correct?
Pete
--
"We have not inherited the earth from our ancestors,
we have borrowed it from our descendants."
Yes, that's exactly correct. My beginner's attempt at applying the
clamping has been done with JMPs, which are really costly in an inner
loop (especially on my lowly 808x) and I feel that I'm missing
something obvious.
Was Terje's solution:
add dl,al ; Carry set if it overflows!
sbb al,al ; AL = 0xff if overflow
or dl,al ; Turns any overflow into 0xff
...the right one? I see overflow handling but not underflow...
Well, I'd guess that Terje has more expertise in his little
finger that I in my entire body :-) but below is the best I can
come up with -- still a jmp in there though...
; unsigned sample in DL, signed "update" in AL
sub dh, dh
cbw
add dx, ax
or dh, dh
jz done
; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
sar dh, 1 ;now DH is 0 or FF, but the wrong way round
not dh ;invert it
mov dl, dh
done:
Perhaps (certainly!) someone can improve on that?
That didn't even try to handle a unsigned/signed mix.
>
> Well, I'd guess that Terje has more expertise in his little
This only means I have more scars and have made more mistakes than most
of the rest of you.
> finger that I in my entire body :-) but below is the best I can
> come up with -- still a jmp in there though...
>
> ; unsigned sample in DL, signed "update" in AL
> sub dh, dh
> cbw
> add dx, ax
> or dh, dh
> jz done
> ; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
> sar dh, 1 ;now DH is 0 or FF, but the wrong way round
> not dh ;invert it
> mov dl, dh
> done:
What about converting the problem (sort of) to an unsigned case?
; The two first instructions can be skipped if the high
; halfs are known to be zero
and dx,255
and ax,255
; Move the signed update into the [0-255] range:
add ax,128
add ax,dx ; Result must be in [0..510] range, of which
; 128..383 is in range
; The trivial way to clamp this is to adjust & branch:
sub ax,128
jc underflow
cmp ax,256
jnc overflow
Can we do this differently? Sure, but only if clamping is both common
and non-predictable:
sub ax,128
mov dx,0xff00
cmovc al,dl
cmp ax,256
cmovnc al,dh
How about doing it without conditional moves?
sub ax,128
sbb dx,dx ; -1 if underflow
cmp ax,256
sbb bx,bx
xor dx,-1
and ax,dx
or ax,bx
Anyway, unless clamping is very common, it is hard to beat the branch
predictors!
I don't think the 808x CPU has a branch predictor. I guess one could
just add up the cycles used for each instruction on such an old CPU
and come up with the best performing code for various statistics of
the data. It should be pretty much deterministic (does it even use
cache memory?).
BR,
Danjel McGougan
Ouch, I forgot about the 8088/8086 target!
> just add up the cycles used for each instruction on such an old CPU
> and come up with the best performing code for various statistics of
> the data. It should be pretty much deterministic (does it even use
> cache memory?).
The only cache at all is a 6/8 byte prefetch queue for instruction
bytes, but on 8088 that queue is very often empty since it has to
compete with all memory accesses.
The performance is easy to calculate though:
Without taken branches:
4 cycles per byte touched, code & data.
Taken branches were something like 4-8 cycles afair, might be wrong.
; Input: AL = signed byte, DL = unsigned byte
; Output: AL unsigned clamped byte
xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
js underflow ; 2
test ah,ah ; 2
jnz overflow ; 2
store:
stosb ; 1+1 Store clamped result
jmp next ; 2+t
This looks like 14 bytes total, i.e. about 56 cycles for a nonclamped
result.
underflow:
xor ax,ax ; 2
stosb ; 1+1
jmp next ; 2+t
Underflow adds a taken branch but get rid of 2 code bytes, so we're
talking about +/- zero extra cost.
overflow:
mov al,255 ; 3
stosb ; 1+1
jmp next ; 2+t
Overflow requires 3 extra code bytes and a taken branch, i.e.
approximately 16+ cycles.
If most samples turn out to be in range, then we should simplify the
code to optimize this path even further:
xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
test ah,0ffh
jnz clamp
stosb
jmp next
clamp:
js underflow
mov al,255
stosb
jmp next
underflow:
xor ax,ax
stosb
jmp next
Hello,
Given the 8088/8086 target, and looking at the code posted,
I realized that an undocumented instruction could be used.
Ever since I heard of these instructions that were not mentioned
by Intel, I have been trying to find a use for them. This may
be one. Maybe Jim will time the various solutions posted, and
mention which he chose.
The effort behind my entry was to have no jumps and no extra
memory access (immediate data, variables, or PUSH/POP). My 8088 is
dead, so I have not programed for it for quite some time. But for
my 80186 machine, those two rules have sped up some programs.
Anyway, here is an example of using SALC.
Regards,
Steve N.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SALC MACRO ; Set AL to Carry (not for NEC V20).
DB 0D6H ; One byte instead of the two byte SBB reg,reg.
ENDM
CLAMP: ; Setup
XOR DH,DH ; Convert unsigned byte to word.
CBW ; Convert signed byte to word.
MOV BL,AL ; Save for unsigned byte calc.
; Signed word add:
ADD AX,DX
SAR AH,1 ; \ Make mask for zero clamp.
NOT AH ; /
; Unsigned byte add:
ADD BL,DL ; (Could be DL,BL if result in DL wanted.)
SALC ; Make mask for 0FFH clamp.
OR BL,AL ; Clamp high.
AND BL,AH ; Clamp low.
RET
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
...but they still had to execute. Just because an opcode takes 4 or 8
cycles to fetch doesn't mean its execution time is magically erased.
Since the (tiny) prefetch queue can be full after a slow operation,
optimization still matters.
I originally asked about the topic because a jump clears the prefetch
queue, and was trying to find a way to clamp without jumping. I have
to ask, though: The penalty on my CPU for jumping is 1. clearing the
prefetch queue and 2. 17 cycles. I will have to look at Terje's
suggestions and see if the solutions he provided actually beat that,
which may have been what he was referring to when he wrote "unless
4 bytes, actually, on 8088. It's 6 on 8086 and NEC V20/30.
> Taken branches were something like 4-8 cycles afair, might be wrong.
17, actually. That's where the detective work comes in; to "beat" the
penalty of a jump, the alternate solution has to be 16 cycles or less
(including opcode fetch as you noted).
I think, based on what you wrote, that 17 cycles is not that big a
deal when alternate solutions would most likely be longer/slower.
> If most samples turn out to be in range, then we should simplify the
> code to optimize this path even further:
"Most" is subjective, but in an ADPCM decoder (what this code is for)
I would agree that the ratio of needing clamping to not needing is
about 1:8, so yes, most samples will be in range.
> xor dx,dx
> next: ; bytes
> mov dl,[bx+si] ; 2+1
> lodsb ; 1+1
> cbw ; 1 -128 to 127
> add ax,dx ; 2 -128 to 383
> test ah,0ffh
> jnz clamp
> stosb
> jmp next
>
> clamp:
> js underflow
> mov al,255
> stosb
> jmp next
>
> underflow:
> xor ax,ax
> stosb
> jmp next
>
> Terje
Looks like a winner; thanks for all your help!
Might it be "better" (at least smaller code) to place the stosb after
the "next" label? but then you'd need a setup JMP to start,
[and a "cx=0" test at some point!]
xor dx,dx
jmp short start
next: ; bytes
stosb
start:
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
test ah,0ffh
jz next
clamp:
js underflow
mov al,255
jmp next
underflow:
xor ax,ax
jmp next
0 - 1 = -1 (all bits set)
0 - 0 = 0
add al, bl ; this is ur addition
xor cl, cl ; zero d cl register
sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
or al, cl ; this will saturate; the cl = 0xff if carry was set.. else
it is 0x00
w00t++ w00t++
Saturating w/o branching is child's play; enjoy!
This XOR will overwrite the flags from the previous ADD!
INC/DEC are the only arithmetic opcodes which doesn't modify the carry flag.
> sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
> or al, cl ; this will saturate; the cl = 0xff if carry was set.. else
> it is 0x00
>
> w00t++ w00t++
>
> Saturating w/o branching is child's play; enjoy!
Children often make simple mistakes. :-)