Possible to emulate PUSH imm16 on 808x?

Jim Leonard

Mar 31, 2012, 8:45:38 PM

I'm trying to make changes to a 286-specific size-optimized program to
get it to compile and run on 8088/8086 and have written macros for
things like PUSHA/POPA and SHR reg,imm8, but I ran across a PUSH imm16
and I'm stumped as to how I'm going to emulate it. The code uses it
pushing function locations, like this:

push MyFunction

(target assembler is A86) If I were writing the code, I'd push
immediate values by saving AX in a scratch register, MOV AX,imm16,
PUSH AX, then restore AX. However, all registers are in use at the
time of the PUSH. I can't save AX onto the stack because it's the
stack I need AX to manipulate!

Need a little help here. How would you emulate PUSH imm16 as an 808x-
specific macro? Halt interrupts and manipulate the stack directly, I
guess, but how to do that preserving all registers?

Rod Pemberton

Apr 1, 2012, 3:25:12 AM

"Jim Leonard" <moby...@nospicedham.gmail.com> wrote in message
news:31532971-2c67-4f28...@m16g2000yqc.googlegroups.com...

Disabling interrupts is probably a good idea. I wouldn't want a simulated
"PUSH" to be interrupted.

Does the replacement code need to be the same size in bytes? (no?)

Does the replacement code need to be fast? (no?)

Well, I'm not too familiar with x86 limitations.

I'm not familiar with A86's syntax either.

Perhaps something like one of these would work (unchecked):

a) move through a memory location

MOV [mem], imm16
PUSH [mem]

b) save and restore AX etc to memory

MOV [mem], AX
MOV AX, imm16
PUSH AX
MOV AX, [mem]

c) if you know one of your registers is always zero

ADD reg, imm16
PUSH reg
XOR reg, reg

d) if you know BP is set to SP

MOV [BP-2], imm16
SUB SP, 2

HTH,

Rod Pemberton

Terje Mathisen

Apr 1, 2012, 4:05:04 AM

Rod Pemberton wrote:
> d) if you know BP is set to SP
>
> MOV [BP-2], imm16
> SUB SP, 2

The last one is BAD!

Get an interrupt between the MOV and the SUB, and the code breaks!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Apr 1, 2012, 4:02:05 AM

The easiest approach is probably by using BP:

push bp ;; Room for the value to be pushed
push bp
mov bp, sp
mov word ptr [bp+2], MyFunction
pop bp

I suspect this is also the smallest (in code bytes) macro to do it...
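
For anyone who wants to trace the sequence by hand, a small Python model of the stack may help. This is only a sketch: the `Cpu` class, the starting register values and the address 0x1234 standing in for MyFunction are made up for illustration, and segments and flags are ignored.

```python
# Toy model of the 8086 stack: just enough to trace the
# push bp / push bp / mov bp,sp / mov [bp+2],imm / pop bp sequence.
# All names and starting values here are illustrative assumptions.

class Cpu:
    def __init__(self, sp=0x0100, bp=0xBEEF):
        self.sp = sp
        self.bp = bp
        self.mem = {}          # stack memory: address -> 16-bit word

    def push(self, value):
        self.sp = (self.sp - 2) & 0xFFFF
        self.mem[self.sp] = value & 0xFFFF

    def pop(self):
        value = self.mem[self.sp]
        self.sp = (self.sp + 2) & 0xFFFF
        return value


def push_imm16(cpu, imm):
    cpu.push(cpu.bp)                                # push bp  (reserves the slot)
    cpu.push(cpu.bp)                                # push bp  (saves BP)
    cpu.bp = cpu.sp                                 # mov bp, sp
    cpu.mem[(cpu.bp + 2) & 0xFFFF] = imm & 0xFFFF   # mov word ptr [bp+2], imm
    cpu.bp = cpu.pop()                              # pop bp   (restores BP)


cpu = Cpu()
push_imm16(cpu, 0x1234)                # 0x1234 stands in for MyFunction
print(hex(cpu.sp), hex(cpu.bp), hex(cpu.mem[cpu.sp]))
```

In this model SP ends up two bytes lower, BP is back to its original value, and the word at the top of the stack is the immediate, which is what a real PUSH imm16 would leave behind.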

pe...@nospam.demon.co.uk

Apr 1, 2012, 4:51:15 AM

In article <31532971-2c67-4f28...@m16g2000yqc.googlegroups.com>

Off the top of the head (not always a good thing, especially this
early in the morning!):

PushIM MACRO value ;'push immediate'
push ax ;anything, slot for receiving 'value'
push bp ;save BP
mov bp, sp
mov word ptr 2[bp], value
pop bp ;restore BP and reset SP
ENDM

You could replace the "push ax" with "sub sp,2" -- 3 bytes vs 1 but
executes in 4 clocks as opposed to 11/15 (according to Helppc).

Pete
--
Believe those who are seeking the truth.
Doubt those who find it. - André Gide

Rod Pemberton

Apr 1, 2012, 5:44:50 AM

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message
news:gqel49-...@ntp6.tmsw.no...

> Rod Pemberton wrote:
> > d) if you know BP is set to SP
> >
> > MOV [BP-2], imm16
> > SUB SP, 2
>
> The last one is BAD!
>
> Get an interrupt between the MOV and the SUB, and the code breaks!
>

Rod Pemberton also wrote:

RP> Disabling interrupts is probably a good idea. I wouldn't
RP> want a simulated "PUSH" to be interrupted.

Does that nullify the "BAD!" disclaimer from Terje? I sure hope so ... :-)

Or, do you mean he should reverse the two instructions for safety?

RP

Rod Pemberton

Apr 1, 2012, 6:01:17 AM

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message

news:5lel49-...@ntp6.tmsw.no...

Well, I get ten bytes for that and Pete's variation of it too. That's the
same size as my a) and b). My other two are shorter (-4 and -1), but less
generic.

I suspect the Terje/Pete solution is the fastest of the current three ten
byte solutions because of only one memory instruction.

Rod Pemberton

Terje Mathisen

Apr 1, 2012, 9:02:26 AM

pe...@nospam.demon.co.uk wrote:
> PushIM MACRO value ;'push immediate'
> push ax ;anything, slot for receiving 'value'
> push bp ;save BP
> mov bp, sp
> mov word ptr 2[bp], value
> pop bp ;restore BP and reset SP
> ENDM

That's identical to my suggestion. :-)

>
> You could replace the "push ax" with "sub sp,2" -- 3 bytes vs 1 but
> executes in 4 clocks as opposed to 11/15 (according to Helppc).

Don't believe it:

SUB SP,2 has to take at the very minimum 4 clocks for each of the 3 code
bytes, so 12 clocks is the actual timing.

OTOH, "PUSH AX" also transfers 3 bytes: 1 code byte plus 2 bytes written
to the stack, so this version has the same 12 cycle minimum.

With the same amount of data, and effectively the same timing, I'd
prefer the version with shorter code!
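
The rule of thumb used here can be written down as a few lines of Python. This is a rough sketch, assuming the "4 clocks per bus byte" lower bound described above; the helper name is made up, and the byte counts are the ones given in this post.

```python
# Rule-of-thumb estimate from the thread: on an 8088 every byte that
# crosses the bus (code fetched, data read or written) costs about
# 4 clocks. The helper name and the table below are illustrative.

CYCLES_PER_BUS_BYTE = 4

def estimate_cycles(code_bytes, data_bytes=0):
    """Lower-bound estimate: 4 clocks per byte moved over the 8-bit bus."""
    return CYCLES_PER_BUS_BYTE * (code_bytes + data_bytes)

candidates = {
    "sub sp,2": estimate_cycles(code_bytes=3),                 # 3 code bytes, no data
    "push ax":  estimate_cycles(code_bytes=1, data_bytes=2),   # 1 code byte + 2 bytes written
}

for name, cycles in candidates.items():
    print(f"{name:10s} >= {cycles} clocks")
```

Both entries come out at the same lower bound, which is the point being made: the shorter encoding costs nothing extra on the bus.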

Terje Mathisen

Apr 1, 2012, 9:06:00 AM

Rod Pemberton wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"@giganews.com> wrote in message
> news:gqel49-...@ntp6.tmsw.no...
>> Rod Pemberton wrote:
>>> d) if you know BP is set to SP
>>>
>>> MOV [BP-2], imm16
>>> SUB SP, 2
>>
>> The last one is BAD!
>>
>> Get an interrupt between the MOV and the SUB, and the code breaks!
>>
>
> Rod Pemberton also wrote:
>
> RP> Disabling interrupts is probably a good idea. I wouldn't
> RP> want a simulated "PUSH" to be interrupted.

I would not do so, see below.

>
> Does that nullify the "BAD!" disclaimer from Terje? I sure hope so ... :-)

It would make it work, so yeah. :-)

>
> Or, do you mean he should reverse the two instructions for safety?

1) Reversing is the proper approach.

2) No need for CLI/STI wrappers.

3) There is no expectation among 8088 programs that items below the
stack pointer can carry any kind of dependable value. This area is, after
all, overwritten on every interrupt, so at least 18.2 times per second
just from the timer IRQ.
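
To see why the order of the two instructions matters, here is a small Python model. It is a sketch under simplifying assumptions: the hypothetical `interrupt` function only models the three words (FLAGS, CS, IP) that the CPU pushes when an interrupt is taken, and all addresses and values are invented.

```python
# Toy model: an interrupt pushes FLAGS, CS and IP at the current SP,
# overwriting whatever sits just below the stack pointer.
# Function names, addresses and values are illustrative assumptions.

def interrupt(state):
    """Model the three words the CPU pushes when an interrupt is taken.

    The handler's IRET pops them again, so SP comes back unchanged,
    but the memory below SP has been overwritten."""
    sp = state["sp"]
    for word in (0xF202, 0xC000, 0x0100):   # made-up FLAGS, CS, IP
        sp -= 2
        state["mem"][sp] = word


def write_then_sub(state, imm, irq_between):
    """MOV [BP-2],imm ; SUB SP,2  (the order called BAD above)."""
    state["mem"][state["sp"] - 2] = imm     # value written below SP
    if irq_between:
        interrupt(state)                    # clobbers the word just written
    state["sp"] -= 2


def sub_then_write(state, imm, irq_between):
    """SUB SP,2 ; MOV [BP-2],imm  (the reversed order)."""
    state["sp"] -= 2                        # slot is now at SP, not below it
    if irq_between:
        interrupt(state)                    # only touches memory below the slot
    state["mem"][state["sp"]] = imm


bad = {"sp": 0x0100, "mem": {}}
write_then_sub(bad, 0x1234, irq_between=True)

good = {"sp": 0x0100, "mem": {}}
sub_then_write(good, 0x1234, irq_between=True)

print(hex(bad["mem"][bad["sp"]]), hex(good["mem"][good["sp"]]))
```

In the first ordering the interrupt frame lands on top of the freshly written word; in the reversed ordering the slot is already above SP when the interrupt arrives, so no CLI/STI is needed.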

Terje Mathisen

Apr 1, 2012, 9:08:00 AM

Rod Pemberton wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"@giganews.com> wrote in message

>> The easiest approach is probably by using BP:
>>
>> push bp ;; Room for the value to be pushed
>> push bp
>> mov bp, sp
>> mov word ptr [bp+2], MyFunction
>> pop bp
>>
>> I suspect this is also the smallest (in code bytes) macro to do it...
>>
>
> Well, I get ten bytes for that and Pete's variation of it too. That's the
> same size as my a) and b). My other two are shorter (-4 and -1), but less
> generic.
>
> I suspect the Terje/Pete solution is the fastest of the current three ten
> byte solutions because of only one memory instruction.

See my other post: any PUSH/POP instruction carries the 8-cycle penalty
caused by the two bytes pushed/popped. Implicit memory operations are
still memory operations.

pe...@nospam.demon.co.uk

Apr 1, 2012, 12:16:24 PM

In article <380m49-...@ntp6.tmsw.no>

"terje.mathisen at tmsw.no"@giganews.com "Terje Mathisen" writes:

> pe...@nospam.demon.co.uk wrote:
> > PushIM MACRO value ;'push immediate'
> > push ax ;anything, slot for receiving 'value'
> > push bp ;save BP
> > mov bp, sp
> > mov word ptr 2[bp], value
> > pop bp ;restore BP and reset SP
> > ENDM
>
> That's identical to my suggestion. :-)

Phew -- good to know that my early morning brain is in good company!

> > You could replace the "push ax" with "sub sp,2" -- 3 bytes vs 1 but
> > executes in 4 clocks as opposed to 11/15 (according to Helppc).
>
> Don't believe it:
>
> SUB SP,2 has to take at the very minimum 4 clocks for each of the 3 code
> bytes, so 12 clocks is the actual timing.

I defer to your knowledge and experience, though it does seem
counterintuitive given that "sub sp,2" is a sub-op of what "push ax" does
behind the scenes...

> OTOH, "PUSH AX" also transfers 3 bytes: 1 code byte plus 2 bytes written
> to the stack, so this version has the same 12 cycle minimum.
>
> With the same amount of data, and effectively the same timing, I'd
> prefer the version with shorter code!
>
> Terje

Absolutely!

NimbUs

Apr 2, 2012, 7:42:26 AM

Terje Mathisen <"terje.mathisen at tmsw.no"@giganews.com> had
this to say:

>> You could replace the "push ax" with "sub sp,2" -- 3 bytes vs 1 but
>> executes in 4 clocks as opposed to 11/15 (according to Helppc).

> Don't believe it:

> SUB SP,2 has to take at the very minimum 4 clocks for each of the 3 code
> bytes, so 12 clocks is the actual timing.

This is completely false. The 8086 did not have to wait for an
instruction to complete before it could fetch the next one from
the memory store (fortunately!). Code bytes would be prefetched
and predecoded in advance of execution.

The 8086/8088 was an /advanced/ microprocessor (don't laugh!).
Things ran in parallel, with anticipation. Specifically there
are 2 internal queues involved in actual instruction
timing/thruput calculations, viz the prefetch queue (into which
instruction bytes are accumulated whenever the bus is not
occupied by other transfers such as fetching of operands and
storing back results), and the predecode queue which does some
decoding of the actually prefetched instructions in advance,
these two activities : prefetching and predecoding taking place
in parallel with actual execution of instructions.

Because of this the time taken by one instruction to complete is
more easily determined, in general, by measuring than using the
tables provided by Intel. In the case of a simple SUB SP,n
however, there are no data to load from or store to memory, and
as recalled above instruction bytes were prefetched (and
decoded) well in advance, so that the time from the table would
probably match the actual measured execution time as part of
your program.

HTH. By the way, as part of Intel's struggle against speed
bottlenecks, both prefetch and predecode queues got larger and
larger from 8088 to 8086 to 286 to 386 (starting with the 486,
the architecture became very different).

--
Nimbus

Bob Masta

Apr 2, 2012, 8:25:21 AM

On 02 Apr 2012 11:42:26 GMT, NimbUs
<nim...@nospicedham.XXX.invalid> wrote:

>Terje Mathisen <"terje.mathisen at tmsw.no"@giganews.com> had
>this to say:
>
>>> You could replace the "push ax" with "sub sp,2" -- 3 bytes vs 1 but
>>> executes in 4 clocks as opposed to 11/15 (according to Helppc).
>
>> Don't believe it:
>
>> SUB SP,2 has to take at the very minimum 4 clocks for each of the 3 code
>> bytes, so 12 clocks is the actual timing.
>
>This is completely false. The 8086 did not have to wait for an
>instruction to complete before it could fetch the next one from
>the memory store (fortunately!). Code bytes would be prefetched
>and predecoded in advance of execution.

That was the general idea. It just turns out that the
instruction queue is almost always empty, except after a
slow instruction (MUL, DIV, etc). By actual timing, the 4
clocks per byte limit almost always wins.

>The 8086/8088 was an /advanced/ microprocessor (don't laugh!).
>Things ran in parallel, with anticipation. Specifically there
>are 2 internal queues involved in actual instruction
>timing/thruput calculations, viz the prefetch queue (into which
>instruction bytes are accumulated whenever the bus is not
>occupied by other transfers such as fetching of operands and
>storing back results), and the predecode queue which does some
>decoding of the actually prefetched instructions in advance,
>these two activities : prefetching and predecoding taking place
>in parallel with actual execution of instructions.
>
>Because of this the time taken by one instruction to complete is
>more easily determined, in general, by measuring than using the
>tables provided by Intel. In the case of a simple SUB SP,n
>however, there are no data to load from or store to memory, and
>as recalled above instruction bytes were prefetched (and
>decoded) well in advance, so that the time from the table would
>probably match the actual measured execution time as part of
>your program.
>
>HTH. By the way, as part of Intel's struggle against speed
>bottlenecks, both prefetch and predecode queues got larger and
>larger from 8088 to 8086 to 286 to 386 (starting with the 486,
>the architecture became very different).
>
>
>--
>Nimbus
>

Bob Masta

DAQARTA v6.02
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI
Science with your sound card!

Robert Redelmeier

Apr 2, 2012, 8:58:42 AM

Bob Masta <N0S...@daqarta.com> wrote in part:
> On 02 Apr 2012 11:42:26 GMT, NimbUs wrote:
>>Terje Mathisen had this to say:

>>> SUB SP,2 has to take at the very minimum 4 clocks for each of
>>> the 3 code bytes, so 12 clocks is the actual timing.
>>
>>This is completely false. The 8086 did not have to wait for an
>>instruction to complete before it could fetch the next one from
>>the memory store (fortunately!). Code bytes would be prefetched
>>and predecoded in advance of execution.
>
> That was the general idea. It just turns out that the
> instruction queue is almost always empty, except after
> a slow instruction (MUL, DIV, etc). By actual timing,
> the 4 clocks per byte limit almost always wins.

At least on the 8088 in the common IBM PC & compatibles.
The 8086 found on a very few machines was not so fetch bound.

-- Robert

Terje Mathisen

Apr 2, 2012, 9:23:11 AM

NimbUs wrote:
> Terje Mathisen<"terje.mathisen at tmsw.no"@giganews.com> had

>> SUB SP,2 has to take at the very minimum 4 clocks for each of the 3 code
>> bytes, so 12 clocks is the actual timing.
>
> This is completely false. The 8086 did not have to wait for an
> instruction to complete before it could fetch the next one from
> the memory store (fortunately!). Code bytes would be prefetched
> and predecoded in advance of execution.
>
> The 8086/8088 was an /advanced/ microprocessor (don't laugh!).

BZZT! Wrong.

If you had said just 8086 above, I would have let you get away with that
statement, but by including the 8088 you've shown that you simply never
did any serious asm programming with that generation of machines.

As Bob M and Robert R have already told you, the 8088 was so extremely
fetch limited that for anything except MUL/DIV and other really slow
microcoded instructions, the prefetch buffer was _always_ empty.

The 16-bit bus version (8086) otoh grabbed twice as many instruction
bytes per bus cycle (4 cpu cycles), so it did indeed achieve overlap
between prefetch and execution for many codes.

> Things ran in parallel, with anticipation. Specifically there
> are 2 internal queues involved in actual instruction
> timing/thruput calculations, viz the prefetch queue (into which
> instruction bytes are accumulated whenever the bus is not
> occupied by other transfers such as fetching of operands and
> storing back results), and the predecode queue which does some
> decoding of the actually prefetched instructions in advance,
> these two activities : prefetching and predecoding taking place
> in parallel with actual execution of instructions.

This is exactly how the 8086 designers intended the cpu to work, but
since IBM chose the 8088 (both to be able to use very cheap 8-bit
peripheral chips and to limit the performance to reduce the competition
with their own minicomputer systems?) real life performance was
extremely well modeled by using the "4 cycles per byte" rule of thumb.

Jim Leonard

Apr 2, 2012, 10:01:40 AM

On Apr 1, 2:25 am, "Rod Pemberton"
<do_not_h...@nospicedham.noavailemail.cmm> wrote:
> a) move through a memory location
>
> MOV [mem], imm16
> PUSH [mem]
>
> b) save and restore AX etc to memory
>
> MOV [mem], AX
> MOV AX, imm16
> PUSH AX
> MOV AX, [mem]

These are pretty obvious, aren't they? I initially avoided
thinking about them because I was unsure whether there was space in the
program I could use for this (remember, it's space-optimized, which
makes it a bit wonky, with lots of goofy tricks all over the place).
Then it dawned on me that I can use space in the copyright string for
this (it's printed on program start but then apparently unused after
that).

Thanks for the second pair of eyes!

> c) if you know one of your registers is always zero
>
> ADD reg, imm16
> PUSH reg
> XOR reg, reg

All registers are unknown at all times, hence my question.

> d) if you know BP is set to SP

It's not, but it's not free either (used as a general-purpose reg in
the program).

Jim Leonard

Apr 2, 2012, 10:25:14 AM

On Apr 1, 3:02 am, Terje Mathisen
<"terje.mathisen at tmsw.no"@giganews.com> wrote:
> The easiest approach is probably by using BP:
>
> push bp ;; Room for the value to be pushed
> push bp
> mov bp, sp
> mov word ptr [bp+2], MyFunction
> pop bp

Forget my other post, THIS is great and is how I'm going to do it --
thanks!

Terje Mathisen

Apr 2, 2012, 12:24:21 PM

If you want to include this as a macro, so that you can conditionally
get either the macro or the 186+ immediate push, then I would make one
further step:

Assuming your current source code often pushes several constants, why
not make a macro that can do the same (or one macro for each count of
arguments to make it easier):

push3 macro imm1, imm2, imm3
sub sp,6
push bp
mov bp,sp
mov word ptr [bp+2],imm3
mov word ptr [bp+4],imm2
mov word ptr [bp+6],imm1
pop bp
endm

In 186+ this simply becomes

push3 macro imm1, imm2, imm3
push imm1
push imm2
push imm3
endm

I.e. this way you only suffer the extra overhead once instead of three
times.
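
A quick way to check that the two versions of push3 leave the same words in the same order is a toy stack model in Python. This is a sketch only: the dictionary-based state, the helper names and the values 1, 2, 3 are illustrative assumptions, and flags and segments are not modelled.

```python
# Toy stack model comparing three PUSH imm16 instructions (186+) with the
# 8088 push3 sequence. Names and values are illustrative assumptions;
# flags and segments are not modelled.

def push(state, value):
    state["sp"] -= 2
    state["mem"][state["sp"]] = value

def pop(state):
    value = state["mem"][state["sp"]]
    state["sp"] += 2
    return value

def push3_186(state, imm1, imm2, imm3):
    push(state, imm1)                       # push imm1
    push(state, imm2)                       # push imm2
    push(state, imm3)                       # push imm3

def push3_8088(state, imm1, imm2, imm3):
    state["sp"] -= 6                        # sub sp,6
    push(state, state["bp"])                # push bp
    state["bp"] = state["sp"]               # mov bp,sp
    state["mem"][state["bp"] + 2] = imm3    # mov [bp+2],imm3
    state["mem"][state["bp"] + 4] = imm2    # mov [bp+4],imm2
    state["mem"][state["bp"] + 6] = imm1    # mov [bp+6],imm1
    state["bp"] = pop(state)                # pop bp

def top_three(state):
    sp = state["sp"]
    return [state["mem"][sp], state["mem"][sp + 2], state["mem"][sp + 4]]

a = {"sp": 0x0100, "bp": 0x5555, "mem": {}}
b = {"sp": 0x0100, "bp": 0x5555, "mem": {}}
push3_186(a, 1, 2, 3)
push3_8088(b, 1, 2, 3)
print(top_three(a), top_three(b), a["sp"] == b["sp"], b["bp"] == 0x5555)
```

The last-pushed value (imm3) sits at [bp+2] because that is where SP points once BP has been popped again.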

wolfgang kern

Apr 2, 2012, 2:52:45 PM

Jim Leonard asked:

Why doesn't PUSH imm16 work for you?
If your compiler has a problem with it, then you could try:

push ax ;save it
mov ax,MyFunction ;hopefully an already defined label
xchg ax,[esp] ;this works in my tools for modern CPUs

But if you really work on old 8086 hardware then it will look like:

push ax
push ax ;make space for the one pushed data
mov bp,sp ;overwrites BP, so BP must be free
mov ax,MyFunction
mov [bp+2],ax ;really old 8086 hardware may need [bp+4] here
pop ax ;restore ax and have 'MyFunction' on the stack

You could also use BX/SI/DI instead of BP, but then with an SS override.
__
wolfgang

Czerno

Apr 3, 2012, 4:41:53 AM

Terje Mathisen <"terje.mathisen at tmsw.no"@giganews.com>

écrivait news:1rlo49-...@ntp6.tmsw.no:

> NimbUs wrote:
>>> bytes, so 12 clocks is the actual timing.

>> This is completely false. The 8086 did not have to wait for an
>> instruction to complete before it could fetch the next one from
>> the memory store (fortunately!). Code bytes would be prefetched
>> and predecoded in advance of execution.
>>
>> The 8086/8088 was an /advanced/ microprocessor (don't laugh!).

> BZZT! Wrong.
>
> If you had said just 8086 above, I would have let you get away with that
> statement, but by including the 8088 you've shown that you simply never
> did any serious asm programming with that generation of machines.

Caught with my pants down, am I? I was, partly. My personal
experience was a short one with the 80186 and a rather intimate
acquaintance with the much more powerful 80286. I should have added a
caveat about the 8088's limitations, which have been well reported by
several sources in the past. My 'excuse' for this short sight
will be the OP mentioning 8088/8086, so I copied that lazily :=(

> As Bob M and Robert R have already told you, the 8088 was so extremely
> fetch limited that for anything except MUL/DIV and other really slow
> microcoded instructions, the prefetch buffer was _always_ empty.

I'll defer to your expertise, furthermore I've read that story
several times elsewhere.

> The 16-bit bus version (8086) otoh grabbed twice as many instruction
> bytes per bus cycle (4 cpu cycles), so it did indeed achieve overlap
> between prefetch and execution for many codes.

In addition, I think I remember, the 8088 queue was shortened vs
the 8086's.

>> Things ran in parallel, with anticipation. Specifically there
>> are 2 internal queues

This is the original intent by Intel engineers and ISTM it was
'advanced' indeed (for microprocessors). That some
implementations failed to perform may be attributable to their
'bean counters'...

Whatever... I apologise to the OP and everybody else for not
grasping the specific question and replying with generalities
instead.

--
NimbUs

Dick Wesseling

Apr 4, 2012, 3:30:26 PM

In article <31532971-2c67-4f28...@m16g2000yqc.googlegroups.com>,

Jim Leonard <moby...@nospicedham.gmail.com> writes:
> I'm trying to make changes to a 286-specific size-optimized program to
> get it to compile and run on 8088/8086 and have written macros for
> things like PUSHA/POPA and SHR reg,imm8, but I ran across a PUSH imm16
> and I'm stumped as to how I'm going to emulate it. The code uses it
> pushing function locations, like this:
>
> push MyFunction

The 8086 can push [mem], so:

MyFunctionIm dw MyFunction
...
push [MyFunctionIm]

(you may need a cs segment prefix).

> (target assembler is A86)

I'm not familiar with A86. In nasm the macro would be something
like:

%macro pushimm 1
section .rodata
%%imm: dw %1
section .text
push word [%%imm]
%endmacro

However, this is not perfect. If you push the same immediate
value more than once as in:

pushimm 77
pushimm 77

you get two locations in the rodata section with the same value.
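
One way around the duplication is to collect the constants first and emit each one only once. The Python sketch below does that for a list of operands and prints NASM-style lines. The `WORD_` label scheme and the `build_pool` name are assumptions made for this example, and it only handles operands that are valid as part of a label (plain numbers or symbol names).

```python
# Sketch of a constant pool: every distinct immediate gets one "dw" line,
# and each push refers to the shared label. The WORD_ label scheme and the
# function name are assumptions made for this example. Operands must be
# plain numbers or symbol names so that they form a valid label.

def build_pool(immediates):
    """Return (push_lines, data_lines) for a list of immediate operands."""
    labels = {}                       # operand text -> label, in first-seen order
    push_lines = []
    for imm in immediates:
        operand = str(imm)
        if operand not in labels:
            labels[operand] = f"WORD_{operand}"
        push_lines.append(f"push word [{labels[operand]}]")
    data_lines = [f"{label}: dw {operand}" for operand, label in labels.items()]
    return push_lines, data_lines


pushes, data = build_pool([77, 77, "MyFunction"])
print("\n".join(pushes))
print("\n".join(data))
```

Here the two pushes of 77 share a single data word, so the pool holds two entries for three pushes.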

James Harris

Apr 5, 2012, 12:44:11 PM

On Apr 2, 3:25 pm, Jim Leonard <mobyga...@nospicedham.gmail.com>
wrote:

> On Apr 1, 3:02 am, Terje Mathisen <"terje.mathisen at

> > The easiest approach is probably by using BP:
>
> > push bp ;; Room for the value to be pushed
> > push bp
> > mov bp, sp
> > mov word ptr [bp+2], MyFunction
> > pop bp
>
> Forget my other post, THIS is great and is how I'm going to do it --
> thanks!

I came up with almost the same as Terje (but not quite as good) but
have you seen Dick Wesseling's suggestion posted just yesterday? He
seems to have come up with the best option. Based on his suggestion
you could convert

push 53

into just one instruction rather than five

push word [WORD_53]

and have a data section with

WORD_53: dw 53

I don't think I would bother with a macro. If you convert all the
pushes to the above form first then your assembler will tell you which
words need to be defined in the data section.

You mentioned the code was space optimised for the 286. I don't know
if it needs to be space optimised for the 8086 you are converting to
but if it does then Terje's code will take ten bytes and Dick's just
six.

As Dick mentions you could place your constants with the code and
address off the CS register, adding another byte but still three bytes
shorter. And the number of instructions is fewer: one fifth as many.

James

Jim Leonard

Apr 5, 2012, 4:45:36 PM

On Apr 3, 3:41 am, Czerno <cze...@nospicedham.czerno.tk.invalid>
wrote:

> My 'excuse' for this short sight
> will be the OP mentioning 8088/8086 so I copied that lazily :=(

No worries; all my (hobby) targets are 8088 so I've gotten used to 4-
cycle counting. I have an 8086 clone I test on as well, but I don't
bother optimizing for it since I just consider the larger prefetch
queue and the word fetches as "gravy".

> In addition, I think I remember, the 8088 queue was shortened vs
> the 8086's.

Yep. 8086 was 6 bytes; 8088 was 4. Insult to injury.

Jim Leonard

Apr 5, 2012, 4:42:13 PM

On Apr 5, 11:44 am, James Harris
<james.harri...@nospicedham.gmail.com> wrote:
> You mentioned the code was space optimised for the 286. I don't know
> if it needs to be space optimised for the 8086 you are converting to
> but if it does then Terje's code will take ten bytes and Dick's just
> six.

Thanks for the info. The goal is to get the program to run at all,
and not necessarily preserve its space optimization on 808x platforms.

Terje's method was the most portable and worked "right first time". I
now have some additional changes to make as the program works on the
original PC but not a clone I have set up next to it, so the debugging
continues. I blame the display subsystem in the clone :-) making this
no longer an x86 asm issue.

> As Dick mentions you could place your constants with the code and
> address off the CS register, adding another byte but still three bytes
> shorter. And the number of instructions is fewer: one fifth as many.

The original program was 4088 bytes; my semi-functional port is up to
4200 bytes. Until I hit 9*512=4068 bytes, I'm not worried about
size. But I'll remember this technique in the future; thanks (and
thanks Dick!)

Terje Mathisen

Apr 6, 2012, 4:34:49 AM

James Harris wrote:
> into just one instruction rather than five
>
> push word [WORD_53]
>
> and have a data section with
>
> WORD_53: dw 53

That is definitely an option. I would normally not consider it since it
causes both a read and a write of the 16-bit constant, but that might
still be OK.:-)

> You mentioned the code was space optimised for the 286. I don't know
> if it needs to be space optimised for the 8086 you are converting to
> but if it does then Terje's code will take ten bytes and Dick's just
> six.

Right, even though it needs to both read and write the immediate word
this still ends up as just 10 bytes of total bus traffic, vs 18 for my code.

>
> As Dick mentions you could place your constants with the code and
> address off the CS register, adding another byte but still three bytes
> shorter. And the number of instructions is fewer: one fifth as many.

The number of instructions doesn't matter at all for 8088 code, but this
is still far faster than my version.

When pushing multiple words at once the difference does drop quite a bit
(each additional word needs one byte less in the "mov word ptr
[bp+n],imm" form than "push word ptr [1234]"), but break-even would
require 5 words to be pushed.

Re. total bus bytes (and actual speed): the BP-relative moves generate 7
bytes per extra word pushed, saving 3 bytes, so with 4 or more words the
bus traffic is less.
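
The break-even arithmetic can be replayed in a few lines of Python. This is a sketch that simply plugs in the per-word figures quoted in this thread (10 and 18 bytes for the first word, 5 and 7 per extra word, 6 and 10 per memory push); none of them are measured here, the names are made up, and treating the multi-word macro as "one-word cost plus a fixed amount per extra word" is a simplification.

```python
# Break-even arithmetic using the per-word figures quoted in the thread.
# The numbers are taken from the posts above, not measured; the function
# and constant names are illustrative.

# Code size in bytes
BP_FIRST_WORD_CODE = 10     # five-instruction BP sequence for one word
BP_EXTRA_WORD_CODE = 5      # each additional "mov word ptr [bp+n],imm"
MEM_PUSH_CODE = 6           # "push word ptr [mem]" plus its 2-byte constant

# Bus traffic in bytes (code fetched plus data read/written)
BP_FIRST_WORD_BUS = 18
BP_EXTRA_WORD_BUS = 7
MEM_PUSH_BUS = 10

def bp_cost(n, first, extra):
    """Cost of pushing n words with the BP-relative method."""
    return first + extra * (n - 1)

def first_n_where(predicate, limit=16):
    """Smallest word count in 1..limit for which predicate holds."""
    return next(n for n in range(1, limit + 1) if predicate(n))

size_break_even = first_n_where(
    lambda n: bp_cost(n, BP_FIRST_WORD_CODE, BP_EXTRA_WORD_CODE) <= MEM_PUSH_CODE * n)
bus_break_even = first_n_where(
    lambda n: bp_cost(n, BP_FIRST_WORD_BUS, BP_EXTRA_WORD_BUS) < MEM_PUSH_BUS * n)

print(size_break_even, bus_break_even)
```

With these inputs the code-size lines meet at 5 words and the bus-traffic comparison tips at 4 words, matching the figures in the post.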

Jim Leonard

Apr 6, 2012, 10:08:30 AM

On Apr 5, 3:42 pm, Jim Leonard <mobyga...@nospicedham.gmail.com>
wrote:

> The original program was 4088 bytes; my semi-functional port is up to
> 4200 bytes. Until I hit 9*512=4068 bytes, I'm not worried about
> size. But I'll remember this technique in the future; thanks (and
> thanks Dick!)

That should be 4608 bytes of course, sorry.

Dick Wesseling

Apr 6, 2012, 11:51:07 PM

In article <aem259-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"@giganews.com> writes:
>
> When pushing multiple words at once the difference does drop quite a bit
> (each additional word needs one byte less in the "mov word ptr
> [bp+n],imm" form than "push word ptr [1234]"), but break-even would
> require 5 words to be pushed.

The semantics of your multiple words version is different from push
immediate. "sub sp,6" modifies the flags register, push does not.
You can solve that by replacing it with 3 * "push bp", but that shifts
the break-even point.
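
The difference can be shown with a toy model in Python. This is a sketch under stated assumptions: only ZF and CF are modelled, the starting flag values are invented, and the function names are hypothetical.

```python
# Toy model of the flag side effect: SUB updates the arithmetic flags,
# PUSH leaves them alone. Only ZF and CF are modelled; names and starting
# values are illustrative assumptions.

def sub16(state, amount):
    """SUB SP,amount: 16-bit subtract that also rewrites ZF and CF."""
    result = (state["sp"] - amount) & 0xFFFF
    state["cf"] = state["sp"] < amount      # borrow out of bit 15
    state["zf"] = result == 0
    state["sp"] = result

def push16(state):
    """PUSH reg: moves SP down by 2 and does not touch the flags."""
    state["sp"] = (state["sp"] - 2) & 0xFFFF

# Pretend earlier code left ZF and CF set for whoever runs after the pushes.
with_sub = {"sp": 0x0100, "zf": True, "cf": True}
sub16(with_sub, 6)

with_pushes = {"sp": 0x0100, "zf": True, "cf": True}
for _ in range(3):
    push16(with_pushes)

print(with_sub, with_pushes)
```

Both variants reserve the same six bytes, but only the PUSH variant hands the incoming flags through unchanged.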

Terje Mathisen

Apr 7, 2012, 4:02:11 AM

You are right of course.

I tend to disregard flag contents before a call, as it is almost always
the case that flags only transfer information back from the call, not
into it.