
Instruction Cycle of an 8086 statement


k3xji

Mar 20, 2007, 7:47:55 AM
Hi all,

Is there a way of determining the Instruction Cycle Count of a
statement in 8086 processors? I don't want to calculate the exact time
of a statement, I only want to measure how many cycles a statement
will take. For example, fetching an instruction is 2 cycles and
moving a register to memory is 4 cycles, etc. Are there documented
values for specific statements?


Regards...

Jake Waskett

Mar 20, 2007, 3:09:10 PM

Are you asking about the original Intel 8086 or about the family of
8086-compatible processors?

If the former, there is some data here:
http://library.n0i.net/hardware/i8086opcodes/#2

If the latter, it's really an unanswerable question, and even for a single
implementation the timing may depend upon the internal state of the CPU at
the time.

Rick Hodgin

Mar 20, 2007, 10:06:46 PM
> Is there a way of determining the Instruction Cycle Count of a
> statement in 8086 processors?... I only want to measure how
> many cycles a statement will take.

The original 8086 and compatibles had stringently documented timings. They
were not out-of-order processors, though I believe some of the later
compatibles could issue reads/writes in parallel for the nearest
instruction.

If you look at Wikipedia you can see some of the basics that you'll find
elsewhere:
http://en.wikipedia.org/wiki/8086

- Rick C. Hodgin

Bob Masta

Mar 21, 2007, 9:20:22 AM

Stringently documented, perhaps, but sometimes in misleading
ways. The 8088 was notorious for giving specific timings that
were pure fiction. The reason was that the true timing was
limited by the instruction fetch time, since the 8088 had a very
short prefetch queue. As a rule of thumb, the true t-state count was
4 times the number of bytes in the instruction, regardless of what
the Intel docs said, except for really slow instructions like MUL
and DIV. Later processors with longer queues were much less
likely to show this discrepancy.
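
A minimal sketch of that rule of thumb in C (an illustration, not code
from this thread; the MOV AX,BX figures are my own example):

    /* Bob's rule of thumb for the 8088: the effective T-state count is
       roughly 4 cycles per instruction byte, regardless of the
       published execution time, except for long-running instructions
       such as MUL and DIV. */
    unsigned rule_of_thumb_cycles(unsigned instruction_bytes)
    {
        return 4u * instruction_bytes;  /* ~4 bus cycles per byte fetched */
    }
    /* e.g. the 2-byte MOV AX,BX comes out at about 8 cycles on an 8088,
       even though the documented execution time is 2. */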

Best regards,


Bob Masta

D A Q A R T A
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Signal Generator
Science with your sound card!

Jim Leonard

Mar 21, 2007, 12:21:40 PM
On Mar 21, 8:20 am, NoS...@daqarta.com (Bob Masta) wrote:
> Stringently documented, perhaps, but sometimes in misleading
> ways. The 8088 was notorious for giving specific timings that
> were pure fiction.

Oh, come on. Name one single published timing that was "pure
fiction". See below.

> As a rule of thumb, the true t-state count was
> 4 times the number of bytes in the instruction, regardless what
> the Intel docs said, except for really slow instructions like MUL
> and DIV.

4 times the number of opcode bytes?? That's a blatantly incorrect
statement. The true cycle count is this:

published instruction timing + (number of opcode bytes * 4)

...because each byte fetch took 4 cycles on 8088. (On 8086, you could
fetch two bytes in 4 cycles, so that was one reason 8086 was faster
right off the bat.) If it was already fetched, then the published
timings are indeed accurate.

So, for example, how long does XLAT take to execute on 8088? XLAT's
published timing is 11 cycles, and it's a one-byte opcode, so if
it's prefetched it takes 11 cycles to execute, and if it is not
already in the prefetch queue it takes 15. (However, it would be
extremely unlikely that XLAT wouldn't be prefetched, since it's only a
single-byte opcode.)

I know this to be true because I've timed it myself using CTC channel
2. In fact, I've used the CTC to find the "break-even" point for
bit-twiddling opcodes like ROR, to see where the diminishing return
between the reg,1 and reg,CL forms kicks in (for the record, the
"ROR reg,CL" form is faster if CL>4). So I take exception to your
claim of "pure fiction" on the
part of Intel.
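
Jim doesn't show his timing code, but the channel 2 approach he
describes looks roughly like the sketch below (assumptions on my part:
a DOS C compiler with Turbo C style outportb()/inportb() from <dos.h>,
the standard PC wiring of the 8253 with channel 2 on port 42h gated by
bit 0 of port 61h, and a fragment under test short enough that the
16-bit counter doesn't wrap):

    #include <dos.h>
    #include <stdio.h>

    /* Latch and read the current channel 2 count (it counts down). */
    static unsigned read_channel2(void)
    {
        unsigned lo, hi;
        outportb(0x43, 0x80);               /* latch command, channel 2 */
        lo = inportb(0x42);
        hi = inportb(0x42);
        return (hi << 8) | lo;
    }

    int main(void)
    {
        unsigned start, stop;

        outportb(0x43, 0xB4);               /* ch 2, lo/hi byte, mode 2 */
        outportb(0x42, 0xFF);               /* initial count = 0xFFFF   */
        outportb(0x42, 0xFF);
        outportb(0x61, inportb(0x61) | 1);  /* open the channel 2 gate  */

        start = read_channel2();
        /* ... instruction sequence under test goes here ...            */
        stop  = read_channel2();

        /* The PIT runs at 1.19 MHz, one tick per 4 CPU cycles on a
           4.77 MHz PC/XT, so multiply ticks by 4 to estimate T-states. */
        printf("elapsed ticks: %u\n", (start - stop) & 0xFFFFu);
        return 0;
    }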

spam...@crayne.org

Mar 21, 2007, 1:42:21 PM
One factor that made the number of bytes in the instructions more
critical to timing than the instruction cycle counts was IBM's decision
to run memory with extra wait states. I forget how many were used, but
it was very conservative. I do remember that on the AT there was a
utility that would delay refresh, and it was effective at increasing
processing speed.


Terje Mathisen

Mar 21, 2007, 4:11:56 PM
Jim Leonard wrote:
> On Mar 21, 8:20 am, NoS...@daqarta.com (Bob Masta) wrote:
>> As a rule of thumb, the true t-state count was
>> 4 times the number of bytes in the instruction, regardless what
>> the Intel docs said, except for really slow instructions like MUL
>> and DIV.
>
> 4 times the number of opcode bytes?? That's a blatantly incorrect
> statement. The true instruction count is this:
>
> published instruction timing + (number of opcode bytes * 4)

> I know this to be true because I've timed it myself using CTC channel
> 2. In fact, I've used the CTC to find the "break-even" point for
> bittwiddling opcodes like ROR to see where the reg vs. immed.
> diminishing return is (for the record, the "ROR reg,CL" form is faster
> if CL>4). So I take exception to your claim of "pure fiction" on the
> part of Intel.
>

I tend to agree with Bob. I've written a _lot_ of 8088 asm code, and I
always found that the best estimate of actual speed was obtained like this:

For each non-branch instruction in the program/loop/algorithm:

Add the size of the instruction + the number of bytes read + the number
of bytes written: This is the total number of memory bus transfers needed.

Take this number and multiply by about 4.05 (the fraction is for the bus
cycles lost to DRAM refresh); this gives you the _minimum_ possible
number of cycles.

As long as you are only executing instructions whose published cycle
count is less than or equal to what you just calculated, the calculated
value is correct.

If you use an opcode which takes longer than this, then the cpu can use
the slack time to prefetch upcoming opcode bytes, and the next
instruction(s) can/will run a little faster, since the opcode bytes can
be skipped from the bus transfer calculations.
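
A sketch of that estimate in C (my paraphrase, not Terje's code; the
ADD AX,BX figures are my own illustration):

    /* Terje's estimate for the 8088: every instruction byte fetched
       plus every data byte read or written is one bus transfer of
       about 4.05 cycles (the 0.05 covers cycles lost to DRAM refresh).
       The result is a floor; an instruction whose published time
       exceeds it is execution-limited instead. */
    double bus_limited_cycles(unsigned instr_bytes,
                              unsigned bytes_read,
                              unsigned bytes_written)
    {
        return (instr_bytes + bytes_read + bytes_written) * 4.05;
    }
    /* e.g. ADD AX,BX is 2 instruction bytes with no memory operands,
       so about 8 cycles minimum, whatever the published figure says. */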

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Bob Masta

Mar 22, 2007, 9:35:01 AM
On 21 Mar 2007 09:21:40 -0700, "Jim Leonard" <spam...@crayne.org>
wrote:

>On Mar 21, 8:20 am, NoS...@daqarta.com (Bob Masta) wrote:
>> Stringently documented, perhaps, but sometimes in misleading
>> ways. The 8088 was notorious for giving specific timings that
>> were pure fiction.
>
>Oh, come on. Name one single published timing that was "pure
>fiction". See below.

Every published timing that was less than 4 times the byte count.
For example, ADD reg,reg is listed as taking 3 clocks
in "The iAPX 86,88 and iAPX 186,188 Architecture and
Instructions" (Intel, 1986). But it's a 2-byte opcode
and takes 8 clocks on an 8088. The vast majority
of opcodes are in this category.


>> As a rule of thumb, the true t-state count was
>> 4 times the number of bytes in the instruction, regardless what
>> the Intel docs said, except for really slow instructions like MUL
>> and DIV.
>
>4 times the number of opcode bytes?? That's a blatantly incorrect
>statement. The true instruction count is this:
>
> published instruction timing + (number of opcode bytes * 4)
>
>...because each byte fetch took 4 cycles on 8088. (On 8086, you could
>fetch two bytes in 4 cycles, so that was one reason 8086 was faster
>right off the bat.) If it was already fetched, then the published
>timings are indeed accurate

Instead of the sum, your formula should be the *greater of* the
published timing *or* (4 times bytes). The problem with the
published timings was precisely as you state: They assumed the
instruction had already been fetched. But the 8088 was
intensely bus-limited and almost *never* had a prefetched
instruction ready, except after a short, slow instruction (MUL, DIV,
AAM, etc.)
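
In code, that correction amounts to something like this sketch (my
wording, not Bob's, using the ADD and XLAT figures quoted in this
thread):

    /* Effective 8088 timing per Bob: the larger of the published
       execution time and the fetch-limited time of 4 cycles per
       instruction byte. */
    unsigned effective_cycles(unsigned published, unsigned instr_bytes)
    {
        unsigned fetch_limited = 4u * instr_bytes;
        return published > fetch_limited ? published : fetch_limited;
    }
    /* ADD reg,reg : max(3, 2*4)  = 8   -- fetch-limited, docs say 3
       XLAT        : max(11, 1*4) = 11  -- the published figure holds */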

>So, for example, how long does XLAT take to execute on 8088? XLAT's
>published timings are 11 cycles, and it's a one-byte opcode, so if
>it's prefetched it takes 11 cycles to execute, and if it is not
>already in the prefetch queue it takes 15. (However it would be
>extremely unlikely that XLAT wouldn't be prefetched since it's only a
>single byte opcode.)

My point exactly: XLAT is a "short, slow" instruction. Since the
published time is more than 4 clocks (for a 1-byte instruction),
the published timing is correct.

>I know this to be true because I've timed it myself using CTC channel
>2. In fact, I've used the CTC to find the "break-even" point for
>bittwiddling opcodes like ROR to see where the reg vs. immed.
>diminishing return is (for the record, the "ROR reg,CL" form is faster
>if CL>4). So I take exception to your claim of "pure fiction" on the
>part of Intel.
>

"Back in the day" I spent a lot of time (!) timing instructions, and
only the "short, slow" instructions matched the published timings.
Have a look through the old manual and see how many instructions
fall into each category. My rule of thumb applies not only to
artificial single-instruction loops, but also to normal code blocks,
since they tend to be dominated by simple instructions (MOV, ADD,
etc) that suck the queue dry. Those few MULs and DIVs don't
make up the difference: Only the very next instruction benefits
from the prefetch, and even then only if it is a 1- or 2-byte
instruction. All memory accesses are much longer. So in
actual practice the rule of thumb is pretty close.

As noted by Terje, there is also overhead due to memory refresh.
I usually would shut off the refresh during my timing experiments to
avoid any chance for misleading results. (No, it doesn't crash the
system, if you are careful!) But that was overkill; just add the
refresh overhead fraction and you will be very close.

Jim Leonard

Mar 23, 2007, 1:20:50 PM
On Mar 21, 3:11 pm, Terje Mathisen <spamt...@crayne.org> wrote:
> I tend to agree with Bob, I've written a _lot_ of 8088 asm code, and I
> always found that the best estimate of actual speed was gotten like this:
>
> For each non-branch instruction in the program/loop/algorithm:
>
> Add the size of the instruction + the number of bytes read + the number
> of bytes written: This is the total number of memory bus transfers needed.
>
> Take this number and multiply by about 4.05 (the fraction is for the bus
> cycles lost to DRAM refresh), this gives you the _minimum_ possible
> number of cycles.

I don't think you, I, or Bob are disagreeing, just selectively
choosing what to agree on. A memory access is "4.05" cycles, but it
is a fact that the prefetch queue does indeed work in some cases,
otherwise SHR reg,CL would always be faster than multiple SHR reg,1.
The prefetch queue is only 4 bytes, but it does indeed function in some
cases, and I've gotten speedups by rearranging my code (putting short/
fast instructions after slow ones).

The only thing I took exception to was Bob's apparent claim that the
prefetch queue was useless; otherwise I think we're all saying pretty
much the same thing.
