cycle-accurate emulation revisited.

William Cattey

Jun 21, 2025, 10:36:43 AM
to PiDP-8
How time flies when one has a busy life.  I've been hoping to move the conversation about cycle-accurate emulation forward for some time, but have experienced delays. 

I finally took some time to think more about the obstacles to getting cycle-accurate emulation deployed. The primary one seems to be, "We don't like the performance hit we get, so are there ways to turn it off, either at compile time or run time?"

The solution to disabling cycle-accurate emulation is ugly: Make two copies of the main instruction loop, and have conditional compilation include one or the other, or have a flag choose one or the other. "We don't like that either."

I spent some time comparing the two code lines most recently visible on tangentsoft in trunk and cycle-accurate. The primary question I pursued was, "Can we make cycle-accurate run faster?"

I think we can.

I noticed that the instruction decoding changed, perhaps more than it needed to:

The OPR instructions used to execute with shallow decoding:

1. switch ((IR >> 7) & 037)  gets you to OPR
2. switch ((IR >> 4) & 017) for group 1 or if ((IR & 01) == 0) for group 2/3
3. only if group 2/3: switch ((IR >> 3) & 017)
4. only if group 2/3: skip instructions, one more if statement.
5. only if group 3 EAE instructions several if/then statements.

Cycle accurate is a more time consuming decode:

1. Add switch on Major_State (required for cycle accurate)
2. op_code = (IR >> 9) & 07; switch(op_code) gets you to OPR
3. if (!(IR & 00400)) {group 1}  else if (IR & 00400 && !(IR & 00001)) {group 2} else {group 3}
4. If group 1: if (IR & 0200) ...
    4.1  if (IR & 0100) ...
    4.2 if (IR & 0040) ...
    4.3 if (IR & 0020)...
    4.4 if (IR & 0001)...
    4.5 switch (IR & 00016) ...
4. If group 2: switch (IR & 00170)
    4.1 Only if skip instructions, one more if statement.
4. If group 3: EAE stuff
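
As a rough illustration of the two decode shapes listed above, here is a minimal sketch that classifies an OPR instruction both ways. The function names and the enum are made up for this example and are not identifiers from pdp8_cpu.c; only the bit tests are taken from the lists above.

```c
#include <stdint.h>

/* Hypothetical labels for this sketch; not identifiers from pdp8_cpu.c. */
enum opr_group { NOT_OPR = 0, OPR_GROUP1, OPR_GROUP2, OPR_GROUP3 };

/* Trunk-style: one switch on the top five bits of IR reaches OPR directly. */
enum opr_group classify_flat(uint16_t IR)
{
    switch ((IR >> 7) & 037) {
    case 034: case 035:                 /* opcode 7, bit 0400 clear */
        return OPR_GROUP1;
    case 036: case 037:                 /* opcode 7, bit 0400 set */
        return (IR & 01) == 0 ? OPR_GROUP2 : OPR_GROUP3;
    default:
        return NOT_OPR;
    }
}

/* Cycle-accurate-style: 3-bit opcode first, then a chain of bit tests. */
enum opr_group classify_nested(uint16_t IR)
{
    int op_code = (IR >> 9) & 07;

    if (op_code != 7)
        return NOT_OPR;
    if (!(IR & 00400))
        return OPR_GROUP1;
    else if (!(IR & 00001))
        return OPR_GROUP2;
    else
        return OPR_GROUP3;
}
```

Both functions give the same answer for every 12-bit word; the only difference is how many tests sit between the instruction word and the handler.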

Similarly with MRI instructions, where the original single switch also decodes
    page zero / current page
    direct / indirect
the cycle-accurate change takes us from:
    
1. switch ((IR >> 7) & 037) dispatches directly to MRI instruction and decodes page zero/current page and direct/indirect
2. Extra work for JMP/JMS

to:

1. Add switch on Major_State (required for cycle accurate)
2. op_code = (IR >> 9) & 07; switch(op_code) gets you to MRI
3. if (IR & 0200)  {current page} else {zero page}
4.  if (IR & 0400) {set Defer major state} else {set execute major state}
5. Extra work for JMP/JMS

I think going back to the original decoding, and just adding the Major_State outer switch with some slight cleverness around dispatching MRI work into subsequent major states, would result in almost everything taking just one more switch to dispatch. I think that might speed up emulation to the point where we're comfortable with it.
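
A minimal sketch of that idea, assuming a heavily simplified CPU: the struct, the step() function, and the TAD-only instruction set are hypothetical and exist only for this illustration (no IF/DF fields, no auto-indexing). The outer switch on the major state is the one added dispatch; inside FETCH the trunk-style five-bit switch decodes opcode, page, and indirection in one go.

```c
#include <stdint.h>

/* Hypothetical, simplified CPU state for this sketch. */
enum major_state { FETCH_state, DEFER_state, EXECUTE_state };

struct cpu {
    uint16_t M[4096];                   /* 12-bit memory words */
    uint16_t PC, MA, IR, LAC;           /* LAC = link + accumulator (13 bits) */
    enum major_state next_Major_State;
};

/* Run exactly one major state. Only TAD is modelled; every other
   instruction is treated as completing in FETCH with no effect. */
void step(struct cpu *c)
{
    uint16_t here;

    switch (c->next_Major_State) {      /* the one extra switch */
    case FETCH_state:
        here = c->PC;
        c->IR = c->M[here] & 07777;
        c->PC = (here + 1) & 07777;
        switch ((c->IR >> 7) & 037) {   /* trunk-style flat decode */
        case 004:                       /* TAD, page zero, direct */
            c->MA = c->IR & 0177;
            c->next_Major_State = EXECUTE_state;
            break;
        case 005:                       /* TAD, current page, direct */
            c->MA = (here & 07600) | (c->IR & 0177);
            c->next_Major_State = EXECUTE_state;
            break;
        case 006:                       /* TAD, page zero, indirect */
            c->MA = c->IR & 0177;
            c->next_Major_State = DEFER_state;
            break;
        case 007:                       /* TAD, current page, indirect */
            c->MA = (here & 07600) | (c->IR & 0177);
            c->next_Major_State = DEFER_state;
            break;
        default:                        /* not modelled: stays in FETCH */
            break;
        }
        break;
    case DEFER_state:                   /* fetch the indirect address */
        c->MA = c->M[c->MA] & 07777;
        c->next_Major_State = EXECUTE_state;
        break;
    case EXECUTE_state:                 /* TAD: add operand into link + AC */
        c->LAC = (c->LAC + (c->M[c->MA] & 07777)) & 017777;
        c->next_Major_State = FETCH_state;
        break;
    }
}
```

A direct TAD takes two calls to step() and an indirect TAD takes three, which mirrors the Fetch/Execute and Fetch/Defer/Execute sequences.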

William Cattey

Jun 21, 2025, 2:11:46 PM
to PiDP-8
I have a draft of pdp8_cpu.c that makes this change.  However, I've never done the basic bootstrapping testing of an updated SIMH.  (I've only ever run the top-level pidp8i build which presumes the SIMH emulator works well enough to actually do everything.)
pdp8_cpu.c

William Cattey

Jun 26, 2025, 12:35:13 AM
to PiDP-8
Today I learned how justme did his benchmarking, and ran his scripts. Everything was fine on my Mac, but when I went over to the real pidp8i, I discovered I'd missed a semicolon.

Here's a corrected version of pdp8_cpu.c for any who wish to try it.
pdp8_cpu.c

William Cattey

Jun 26, 2025, 1:52:41 AM
to PiDP-8
I learned how to run justme's benchmarking and have done it.

I thought I'd identified places where additional instructions were being executed, and that by returning to the earlier mode of instruction decoding, I'd get back some performance. It looks like my re-merge may not have bought us anything. Indeed it seems to run a teeny bit slower. However, I still need to test the old cycle-accurate code on a non-blinking-light platform.

Below are benchmark logs from 5 tests:
1. Trunk with instruction accurate emulation. Running on a pidp8i with a Pi3b.
Tick Time: 27.05 seconds; Ticks: 1,624; Current Insts Per Tick: 159,514
2. Previous cycle-accurate emulation, most recently from Vince Slyngstad. Running on a pidp8i with a Pi3b.
Tick Time: 62.08 seconds; Ticks: 3,726; Current Insts Per Tick: 70,645
3. My reversion to some of the earlier instruction decoding, integrated into cycle-accurate emulation. Running on a pidp8i with a Pi3b.
Tick Time: 64.9 seconds; Ticks: 3,895; Current Insts Per Tick: 67,801
4. Trunk with instruction accurate emulation. Running on an M2 Mac. (No gpio)
Tick Time: 2.8 seconds; Ticks: 169; Current Insts Per Tick: 6,293,658
5. My reversion version of cycle-accurate emulation. Running on an M2 Mac. (No gpio)
Tick Time: 3.18 seconds; Ticks: 192; Current Insts Per Tick: 4,107,841

High level analysis:

On a Pi3b, with blinking lights, my cycle-accurate version takes about 2.4 times as long to run the benchmark (64.9 seconds versus 27.05 seconds), takes about 2.4 times as many ticks, and runs about 42% of the instructions per tick.

On my Mac, with no blinking lights, cycle accurate takes only about 14% more clock time (3.18 seconds versus 2.8 seconds), about 14% more ticks, and runs 65% of the instructions per tick.

Conclusions:

Phoning out the additional states to the PiDP-8 gpio introduces significant slowness.  However, it must be admitted that it does enable the functionality of single step, which is a real front panel action on nearly all of the PDP-8 Family. (The exception being the PDP-8/a.)

I have come to believe that, if we're not taking the time to phone out additional light state, the overhead of cycle accurate emulation is not all that bad!

I will run the benchmark against the old cycle-accurate code and see if I really did make any improvement.

Detail:

1. The trunk with instruction accurate emulation clocks in as follows:

Script started, file is /home/pi/Bench-2025-06-26-trunk-1.log
pi@pidp8:~/src/pidp8i/trunk $ ./Benchmark1
spawn bin/pidp8i-sim SimStart
PiDP-8/I trunk:id[3fee766e68] [pinctrl I/O] [ils] [stdpcb] [gpio] [rt]
PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01210 (JMP 1207)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 9,570,840 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 9,671,179
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   4
  Seconds Running:           27 (27 seconds)
  Calibration Opportunities: 27
  Instruction Time:          249595079
  Real Time:                 2859048153
  Virtual Time:              2859048153
  Next Interval:             1,000
  Base Tick Delay:           159,514
  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    159,514
  Initializations:           2
  Ticks:                     1,624
  Tick Time:                 27.05 seconds
  Initialize Base Time:      00:05:11.319
  Tick Start Time:           00:05:11.340
  Wall Clock Time Now:       00:05:37.703


----
2. The previous cycle-accurate emulation clocks in as follows:

Script started, file is /home/pi/Bench-2025-06-26-cycle-old-2.log
pi@pidp8:~/src/pidp8i/cycle $ ./Benchmark1
spawn bin/pidp8i-sim SimStart
PiDP-8/I pi5-ils2-bworm-cyclerealistic:id[9de1597bf4] [pinctrl I/O] [ils] [stdpcb] [gpio] [rt]
PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6
PiDP-8/I initial throttle = 1000000.000000 IPS
PiDP-8/I initial throttle = 1000000.000000 IPS

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01210 (JMP 1207)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 4,238,700 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 1,648,532
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   6
  Seconds Running:           62 (1:02 minutes)
  Calibration Opportunities: 62
  Instruction Time:          248008054
  Real Time:                 2858929822
  Virtual Time:              2858929822
  Next Interval:             1,000
  Base Tick Delay:           70,645
  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    70,645
  Initializations:           2
  Ticks:                     3,726
  Tick Time:                 1:02.083333 minutes
  Initialize Base Time:      00:02:40.542
  Tick Start Time:           00:02:40.547
  Wall Clock Time Now:       00:03:39.400

-----
3. My new merge clocks in as follows:

pi@pidp8:~/src/pidp8i/cycle $ date; ./Benchmark1
Thu 26 Jun 2025 12:44:46 AM EDT
spawn bin/pidp8i-sim SimStart
PiDP-8/I pi5-ils2-bworm-cyclerealistic:id[c9676ea6c5] [pinctrl I/O] [ils] [stdpcb] [gpio] [rt]
PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6
PiDP-8/I initial throttle = 1000000.000000 IPS
PiDP-8/I initial throttle = 1000000.000000 IPS

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01210 (JMP 1207)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 4,068,060 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 1,653,439
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   55
  Seconds Running:           64 (1:04 minutes)
  Calibration Opportunities: 64
  Instruction Time:          244678496
  Real Time:                 2861461305
  Virtual Time:              2861461307
  Next Interval:             1,002
  Base Tick Delay:           67,666
  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    67,801
  Initializations:           2
  Ticks:                     3,895
  Tick Time:                 1:04.9 minutes
  Initialize Base Time:      00:44:49.971
  Tick Start Time:           00:44:49.977
  Wall Clock Time Now:       00:45:51.713

4. Trunk on my mac.

Script started, output file is ../Bench-trunk-mac-1.log

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$ cp -p ~/PDP-8/PiDP-8/mkatz-benchmarks/Benchmark/Benchmark.rk05 bin
bash-3.2$ ./Benchmark1
spawn bin/pidp8i-sim SimStart

PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01207 (KSF)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 377,619,480 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 43,859,649
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   49
  Seconds Running:           2 (2 seconds)
  Calibration Opportunities: 2
  Instruction Time:          3971965
  Real Time:                 2862539409
  Virtual Time:              2862541138
  Next Interval:             1,500
  Base Tick Delay:           4,195,772
  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    6,293,658
  Initializations:           2
  Ticks:                     169
  Tick Time:                 2.8 seconds
  Initialize Base Time:      01:03:48.864
  Tick Start Time:           01:03:48.867
  Wall Clock Time Now:       01:03:50.104
sim> bash-3.2$ exit

----
5. My stab at cycle-accurate on my mac:

Script started, output file is ../Bench-cycle-mac-1.log

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$ cp -p ~/PDP-8/PiDP-8/mkatz-benchmarks/Benchmark/Benchmark.rk05 bin
bash-3.2$ ./Benchmark1
spawn bin/pidp8i-sim SimStart

PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01207 (KSF)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 246,470,460 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 44,642,857
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   12
  Seconds Running:           3 (3 seconds)
  Calibration Opportunities: 3
  Instruction Time:          237870477
  Real Time:                 2862502013
  Virtual Time:              2862503243
  Next Interval:             1,500
  Base Tick Delay:           2,738,561
  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    4,107,841
  Initializations:           2
  Ticks:                     192
  Tick Time:                 3.183333 seconds
  Initialize Base Time:      01:03:10.037
  Tick Start Time:           01:03:10.041
  Wall Clock Time Now:       01:03:11.803
sim> bash-3.2$ exit


William Cattey

Jun 26, 2025, 2:14:31 AM
to PiDP-8
I ran the benchmark against the original cycle-accurate code on my Mac.

Interestingly, mine seems a teeny bit faster.  Here is output from two runs.
I can't explain why mine runs a bit slower on the pi and a bit faster on the mac, since the gpio calls are what I kept the SAME across the two code lines.

Tick time: 3.28 or 3.25 secs instead of 3.18
Total ticks 198 or 196 instead of 192.
Insts per tick of 4,000,989 or 4,105,437 instead of 4,107,841

Script started, output file is ../Bench-cycle-old-mac-1.log


The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$ cp -p ~/PDP-8/PiDP-8/mkatz-benchmarks/Benchmark/Benchmark.rk05 bin
bash-3.2$ ./Benchmark1
spawn bin/pidp8i-sim SimStart

PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01210 (JMP 1207)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 240,059,340 instructions/sec
Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 43,103,448

Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   18

  Seconds Running:           3 (3 seconds)
  Calibration Opportunities: 3
  Instruction Time:          211197802
  Real Time:                 2866274942
  Virtual Time:              2866276229
  Next Interval:             1,500
  Base Tick Delay:           2,667,326

  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    4,000,989
  Initializations:           2
  Ticks:                     198
  Tick Time:                 3.283333 seconds
  Initialize Base Time:      02:06:03.093
  Tick Start Time:           02:06:03.097
  Wall Clock Time Now:       02:06:04.898
sim> bash-3.2$ exit

Script done, output file is ../Bench-cycle-old-mac-1.log
Mac-mini:cycle wdc$ script ../Bench-cycle-old-mac-12.log
Script started, output file is ../Bench-cycle-old-mac-12.log


The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$ cp -p ~/PDP-8/PiDP-8/mkatz-benchmarks/Benchmark/Benchmark.rk05 bin
bash-3.2$ ./Benchmark1
spawn bin/pidp8i-sim SimStart

PDP-8 simulator Open SIMH V4.1-0 Current        git commit id: ffe537a6

.SET SYS NO INIT

.R P,PASCAL,PRIME1
P8RTS V1-0-F E
P8COM V1-0-F F
 NO ERRORS DETECTED.

.R P,PRIME1
P8RTS V1-0-F E

.
Simulation stopped, PC: 01207 (KSF)
sim> show clocks
Minimum Host Sleep Time:        1 ms (1000Hz)
Host Clock Resolution:          1 ms
Execution Rate:                 246,326,220 instructions/sec

Calibrated Timer:               CLK
Pre-Calibration Estimated Rate: 43,859,649
Calibration:                    Always
Asynchronous Clocks:            Available

PDP-8 clock device is CLK
Calibrated Timer 0:
  Running at:                60 Hz
  Tick Size:                 16.667 msecs
  Ticks in current second:   16

  Seconds Running:           3 (3 seconds)
  Calibration Opportunities: 3
  Instruction Time:          218846065
  Real Time:                 2866361682
  Virtual Time:              2866362970
  Next Interval:             1,500
  Base Tick Delay:           2,736,958

  Initial Insts Per Tick:    8,000
  Current Insts Per Tick:    4,105,437
  Initializations:           2
  Ticks:                     196
  Tick Time:                 3.25 seconds
  Initialize Base Time:      02:07:29.820
  Tick Start Time:           02:07:29.823
  Wall Clock Time Now:       02:07:31.570
sim> bash-3.2$ exit

Script done, output file is ../Bench-cycle-old-mac-12.log

William Cattey

Jun 27, 2025, 12:31:35 AM
to PiDP-8
I did more benchmarking.  I think that my attempt to revert from my version of pdp8_cpu.c may have failed, and so I repeated that test, re-extracting Vince's cycle-accurate version.

Here's what I learned:

When you take the blinking lights out of the equation, you only take a 30% performance hit with the cycle-accurate code.

However, I have to admit that the performance enhancement I thought I was making did not come to pass. Vince's earlier checkin ran faster than mine.

Measuring by elapsed time in ticks, tick count, or instructions per tick, I saw this:

Trunk:     Tick time: 27.05 secs; Ticks: 1,624; Insts Per Tick: 159,514 (baseline performance)
Old/Vince: Tick time: 62.13 secs; Ticks: 3,729; Insts Per Tick: 70,886 (44% of baseline)
New/Bill:  Tick time: 64.90 secs; Ticks: 3,895; Insts Per Tick: 67,801 (42% of baseline)

NoPi:
Trunk:     Tick time: 12.03 secs; Ticks: 723; Insts Per Tick: 365,375 (baseline performance)
Old/Vince: Tick time: 17.08 secs (70% of base); Ticks: 1,026 (70% of base); Insts Per Tick: 254,202 (70% of base)
New/Bill:  Tick time: 21.45 secs (56% of base); Ticks: 1,288 (56% of base); Insts Per Tick: 200,501 (55% of base)

So if the blinking lights are not part of the system, cycle accurate is only a 30% performance hit, not a 50% or 100% performance hit.

To me this means we should advocate for migration of upstream to cycle-accurate emulation.

I WOULD like to understand why my redo is less performant.  I thought I was replacing a bunch of If statements with a single switch statement, and then doing everything else the same.

Attached is a file with the full benchmark results from the above 6 test cases on my Pi3b.

-Bill
Comp-Bench.txt

Steve Tockey

Jun 27, 2025, 1:10:47 PM
to PiDP-8
Bill,
Thanks for taking the lead on this. I've been buried in other stuff so no time to dedicate on this front, sorry.

"I WOULD like to understand why my redo is less performant.  I thought I was replacing a bunch of If statements with a single switch statement, and then doing everything else the same."

These days a lot probably depends on the specific compiler and the optimizations it chooses to apply. When given a switch-case in source code, there are different ways to turn it into object code, one of which is to basically unravel it back into a series of if() statements, meaning you wouldn't necessarily see any performance difference at all. Under the conditions in the SIMH source code, I would think that it would usually translate switch-case into a jump table, but since that can end up taking more machine instructions than a simple if() statement, that could be where the performance hit is coming from. The only way to know for sure what's going on, however, is to either examine the .o files or see the actual machine instructions in the .exe file.
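
To make that comparison concrete, here is a small sketch with the same decode written once as a switch and once as an if() chain. Both function names are invented for this example. Compiling a file like this with something like cc -O2 -S and reading the two function bodies side by side shows which form a given compiler picks; nothing here predicts what any particular compiler will emit.

```c
#include <stdint.h>

/* Opcode decode written as a switch; a compiler may or may not
   turn this into a jump table. */
const char *mnemonic_switch(uint16_t IR)
{
    switch ((IR >> 9) & 07) {
    case 0:  return "AND";
    case 1:  return "TAD";
    case 2:  return "ISZ";
    case 3:  return "DCA";
    case 4:  return "JMS";
    case 5:  return "JMP";
    case 6:  return "IOT";
    default: return "OPR";
    }
}

/* The same decode written as a chain of if() tests. */
const char *mnemonic_if(uint16_t IR)
{
    unsigned op = (IR >> 9) & 07;

    if (op == 0) return "AND";
    if (op == 1) return "TAD";
    if (op == 2) return "ISZ";
    if (op == 3) return "DCA";
    if (op == 4) return "JMS";
    if (op == 5) return "JMP";
    if (op == 6) return "IOT";
    return "OPR";
}
```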


-- steve


William Cattey

Jun 27, 2025, 10:36:56 PM
to PiDP-8
I spent some time with cc -S -fverbose-asm, comparing the vince code line to my code line.

Alas I could make no sense of what the compiler was doing.

I tried some swapping back and forth of different approaches to do certain things. After a couple of hours of this, I've concluded that whatever Vince checked in seems to have totally lucked out on what gcc creates on both my M2 Mac and my Pi3.

Particularly perplexing was the use of both the variable:

op_code = (IR >> 9) & 07;
switch (op_code) {

and the in-line

switch ((IR >> 9) & 07) {

Eliminate all use of the variable, including the line

int op_code = 0;

and the code runs slower.

Replace the current in-line uses in the instruction loop with op_code, and the code runs... SLOWER!

I give up.

Warren Young

Jun 28, 2025, 12:29:24 AM
to William Cattey, PiDP-8
On Fri, Jun 27, 2025 at 20:36 William Cattey <bill....@gmail.com> wrote:
> I spent some time with cc -S -fverbose-asm,

Do yourself a favor and go hit the easy button: 

William Cattey

Jun 28, 2025, 8:26:32 PM
to PiDP-8
Hi Warren,

That's a VERY interesting system -- meld for assembler output.

I learned how to use it. Their support community in discord is AMAZING!  They fixed a bug that prevented my being able to compare assembler output like I needed.

Unfortunately, the results I'm getting make no sense.  I VERY carefully experimented with the effect of changing

    (IR >> 9) & 07
to 
   op_code

Here's what I learned:

The early declaration and initialization, and then utilization:

    int op_code = 0;
...

            op_code = (IR >> 9) & 07;
            switch (op_code) {

is completely equivalent to leaving the declaration and initialization out and just saying:

     switch ((IR >> 9) & 07) {

This makes sense to me because I accept that the compiler optimizes out the superfluous work, and uses registers efficiently.

What does NOT make sense is this:

Down in the code for the EXECUTE State, we have this:

        case EXECUTE_state:
            if (((IR >> 9) & 07) < 4) {                     /* AND .. DCA, or is it JMS? */
                if (IR & 00400)                             /* it is AND .. DCA, direct or indirect? */
                    MA = DF | (MA & 07777);                 /* indirect, use DF */
                else
                    MA = IF | (MA & 07777);                 /* direct, use IF */
                MB = M[MA];                                 /* get the data word */

                switch ((IR >> 9) & 07) {

If I change it to:

        case EXECUTE_state:
            if (op_code < 4) {                     /* AND .. DCA, or is it JMS? */
                if (IR & 00400)                             /* it is AND .. DCA, direct or indirect? */
                    MA = DF | (MA & 07777);                 /* indirect, use DF */
                else
                    MA = IF | (MA & 07777);                 /* direct, use IF */
                MB = M[MA];                                 /* get the data word */
                switch (op_code) {

Compiler Explorer with either gcc 8.2.0 ARM 32 or 64 bit says that yes, the first if goes from one instruction to two instructions, but the second switch goes from 7 instructions to 3.
But when I build pidp8i-sim with that code and run the benchmark, performance drops from 15,252,120 instructions/sec and 17.03 seconds execution time to 13,387,860 instructions/sec and 19.28 seconds execution time.

The code analyzer says there are fewer instructions. The loop is executed the same way. But performance went down. I tried this multiple times.

I will note in passing that with just this bit of fiddling, gcc does massive re-arrangement of code. I had to rely extensively on Compiler Explorer's "Reveal Linked Code" and live highlighting of the instruction blocks to tune into changes that were not the renaming of all symbols because a block of code moved.

So, I tried again, but I'm feeling quite disheartened.

-Bill

William Cattey

Jun 28, 2025, 11:42:01 PM
to PiDP-8
Here are some musings on cycle-accurate emulation performance and value:

Although the PDP-8 has 3 major states, they don't apply to all instructions.

  • All register-only instructions complete in the Fetch major state.
  • Most Memory Reference instructions complete in Fetch/Execute states.
  • Only Memory Reference instructions using indirect addressing require all three Fetch/Defer/Execute states.
This means that, if we had a run-time test for cycle-accurate emulation, we would always incur the overhead of testing that situation, as opposed to sometimes taking more time to run Memory Reference, and Indirectly addressed instructions.
As I mentioned at the beginning of this thread, the big performance hit comes only when running on PiDP-8/i hardware when we do the additional procedure calls to report the status changes out to gpio.  Without that extra work, the performance hit is a consequence of the instruction mix.
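
A minimal sketch of how the instruction mix drives the cost, assuming the state counts described above. The function names are hypothetical. The treatment of JMP (one state when direct, two when indirect) is an assumption drawn from the EXECUTE_state code quoted earlier in the thread, which only handles AND through JMS.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of major states one instruction passes through.
   Assumption for this sketch: JMP never reaches EXECUTE. */
int major_states(uint16_t IR)
{
    unsigned op = (IR >> 9) & 07;
    int indirect = (IR & 00400) != 0;

    if (op >= 6)                        /* IOT and OPR: Fetch only */
        return 1;
    if (op == 5)                        /* JMP: Fetch, plus Defer if indirect */
        return indirect ? 2 : 1;
    return indirect ? 3 : 2;            /* AND, TAD, ISZ, DCA, JMS */
}

/* Total major states for a sequence of instruction words. */
unsigned long total_major_states(const uint16_t *prog, size_t n)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++)
        total += (unsigned long)major_states(prog[i]);
    return total;
}
```

Dividing total_major_states() by the instruction count gives the average number of trips through the outer major-state switch per instruction for a given program, which is the part of the overhead that depends on the mix.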

Also as I said earlier in this thread, most of the PDP-8 family has the "Single Step" control on the front panel, and it is intended for use in debugging. Bringing that sub-step out to an external interface has definite value for our PiDP-8/i community, and could have value to other SIMH PDP-8 communities. With the way Moore's law seems to keep giving our target platforms greater performance, I think a 30% performance hit, with the added functionality of, "You can get full front-panel operation" is something we probably can sell to upstream SIMH. The compiler seems to have done a decent job of recognizing the FETCH leg of the switch as the most frequently used, and needing to execute with minimal overhead.

I may do some investigation into how to make conditional compilation to remove even the overhead of:

    this_Major_State = next_Major_State;
    switch (this_Major_State) {

Having tried a merge of the original instruction decoding with the new major state code, I might be able to do better than in-lining the DEFER and EXECUTE major state code if MAJOR_STATES was undefined.
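
One way such conditional compilation could look, as a sketch only: the cpu struct, the helper functions, and the TAD-only instruction set are hypothetical, and only the MAJOR_STATES macro name and the state names come from this thread. With MAJOR_STATES defined, step() runs one major state per call; with it undefined, the DEFER and EXECUTE work is in-lined after the fetch and the outer switch disappears.

```c
#include <stdint.h>

#define MAJOR_STATES 1    /* remove this line to build the in-lined variant */

enum major_state { FETCH_state, DEFER_state, EXECUTE_state };

/* Hypothetical, simplified CPU state (no IF/DF, no auto-indexing). */
struct cpu {
    uint16_t M[4096];
    uint16_t PC, MA, IR, LAC;
    enum major_state next_Major_State;
};

/* The work of the later major states, shared by both variants. */
static void defer_work(struct cpu *c)
{
    c->MA = c->M[c->MA] & 07777;
}

static void execute_work(struct cpu *c)     /* TAD only in this sketch */
{
    c->LAC = (c->LAC + (c->M[c->MA] & 07777)) & 017777;
}

void step(struct cpu *c)
{
    uint16_t here;

#ifdef MAJOR_STATES
    switch (c->next_Major_State) {
    case DEFER_state:
        defer_work(c);
        c->next_Major_State = EXECUTE_state;
        return;
    case EXECUTE_state:
        execute_work(c);
        c->next_Major_State = FETCH_state;
        return;
    case FETCH_state:
        break;                              /* fall out to the fetch below */
    }
#endif
    here = c->PC;
    c->IR = c->M[here] & 07777;
    c->PC = (here + 1) & 07777;
    if (((c->IR >> 9) & 07) != 1)
        return;                             /* only TAD is modelled */
    c->MA = ((c->IR & 0200) ? (here & 07600) : 0) | (c->IR & 0177);
#ifdef MAJOR_STATES
    c->next_Major_State = (c->IR & 0400) ? DEFER_state : EXECUTE_state;
#else
    if (c->IR & 0400)
        defer_work(c);                      /* in-lined DEFER */
    execute_work(c);                        /* in-lined EXECUTE */
#endif
}
```

Keeping the DEFER and EXECUTE work in small static helpers means the instruction semantics exist in one place, rather than in two copies of the main loop.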

-Bill

William Cattey

Jun 29, 2025, 4:39:35 PM
to PiDP-8
So I did a version of pdp8_cpu.c where I #ifdef out the
    switch (this_Major_State) {
and in-line the DEFER and EXECUTE blocks where they previously did:
    next_Major_State =...
On my Pi3b with blinky lights disabled, the benchmark completes in 16.95 seconds with 15,427,440 instructions/sec, compared to 17.08 seconds with 15,252,120 instructions/sec for Vince's cycle-accurate code, and 12.03 seconds with 21,922,500 instructions/sec for trunk (instruction accurate).

I then removed the save/restore of saved_Major_State:
    saved_Major_State = next_Major_State;
through the instruction loop.

Now the benchmark completes in 14.45 seconds with 18,122,700 instructions/sec.

Perhaps there's something to be discovered about the instruction decoding of trunk versus cycle-accurate.

Or perhaps there's another bit of state save/restore I've missed.

William Cattey

Jun 30, 2025, 2:54:44 AM
to PiDP-8
I think bench checking C code is easier than reading assembler output and trying to guess what the compiler is doing...

I think I found another source of lower performance in the cycle-accurate code:  Additional setting of the MB register.  In the trunk version of pdp8_cpu.c, MB is written in exactly 4 places, but for only one thing: the ISZ instruction.  Nowhere else is MB assigned.

In the cycle-accurate code, MB is set in six places, but for several different things:

  1. The contents of memory when the instruction is first fetched.
  2. The value of an indirect address when it is obtained from memory.
  3. The contents of memory when the data word is fetched in an MRI instruction.
  4. The result of an MRI ISZ or DCA operation.
  5. The destination address of a JMS instruction.
Question for Steve Tockey or Vince Slyngstad:  Am I right in believing these assignments were oversights in the original instruction-accurate implementation, and correct operation of that version requires these assignments?

If so, correct operation imposes a 15% performance hit on the old emulation, and cycle-accurate imposes another 15%.  In which case, I think the analysis is done, and the performance hit with cycle-accurate as the default is reasonable. (Though I do have a version that turns it off at compile time to get that 15% back.)

And now I'm going to try and un-obsess about this for a while...

-Bill

Vincent Slyngstad

Jun 30, 2025, 6:33:33 AM
to PiDP-8
>  Question for Steve Tockey or Vince Slyngstad:  Am I right in believing these assignments were oversights in the original instruction-accurate implementation, and correct operation of that version requires these assignments?

Apologies for being absent. I am back at home now, and better able to respond. Thanks for all your work on this!

Technically, I think it is correct that MB should be set at the end of fetch (1), defer (2), and execute (3, 4, 5). Those last three could be mutually exclusive.

I'm not sure that much veracity in the display would be lost if only the "last" modification of MB were pushed to the lights. The lights are a sort of blurry approximation during run state, so it may also be possible to merge the changes or even leave the very transient changes out. It seems the trunk version is already leaving the intermediate updates out, and the display has been heretofore acceptable.

A separate issue is the accuracy of the display in the stopped state, which I perceive as the main advantage of the cycle accurate version. So it would be essential to push correct state to the lights whenever the machine is in a stoppable/observable state and actually stopping.
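
A sketch of that split between running and stopped display, under stated assumptions: the panel struct, the function name, and the stopping and sample_due flags are all hypothetical stand-ins, not the PiDP-8/I gpio interface. MB itself is still computed in every major state; only the push to the lights is skipped unless the machine is stopping or a periodic sample is due.

```c
#include <stdint.h>

/* Hypothetical stand-in for the front-panel display state. */
struct panel {
    uint16_t mb_lights;     /* value currently shown on the MB lamps */
    unsigned long pushes;   /* how many display updates were made */
};

/* Called at the end of each major state with the current MB value.
   The display is updated only when the machine is about to stop
   (HLT, single step) or when a periodic sample is due. */
void end_of_major_state(struct panel *p, uint16_t MB,
                        int stopping, int sample_due)
{
    if (stopping || sample_due) {
        p->mb_lights = MB & 07777;
        p->pushes++;
    }
}
```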

Vince

William Cattey

Jun 30, 2025, 11:02:02 AM
to PiDP-8
Hi Vince,

Welcome home, and welcome back to the discussion.

Thanks for the sanity check.  It sounds like there may be a code change inspired by your reply, but I'm not quite sure what it is.

Are you saying that, if the compile time option to disable MAJOR_STATES was implemented, that some of the assignments to MB in the current code line could be removed?

or

Are you saying that there are a couple redundant assignments to MB in the current code that can be moved to the end of the 3 Major state sections?

-Bill

Steve Tockey

Jun 30, 2025, 2:05:01 PM
to PiDP-8

I don't believe that any of those assignments are redundant. They may not be fully necessary when the CPU is running because the front panel lights are just periodically sampled. On the other hand, when the CPU is halted, and particularly when the Sing Step switch is used, those assignments are required to have the MB show the right value at the end of each major state. Without those assignments, the MB would be incorrect when Sing Step-ping through machine code.


-- steve

William Cattey

Jun 30, 2025, 4:02:49 PM
to PiDP-8
That clarification helps.

When next I get a moment to play with my "compile time elimination of cycle-accurate" experiment, I'll remove the MB updates that happen within the removed major states. I expect I'll get something that will clock in FASTER than the existing instruction-accurate codeline.

-Bill

William Cattey

Jul 1, 2025, 6:38:37 PM
to PiDP-8
I've had a moment to review the cycle-accurate code. I think the updates to MB present in that code are actually the minimum required:

  1. Set MB at the beginning of an instruction. Needed in both instruction-accurate and cycle-accurate emulation. Not previously set at all.
  2. Set MB with the final result of an MRI instruction, to wit: the data word used as input to AND and TAD, the result to be stored by ISZ and DCA, the return address to be stored for JMS, or the destination address of JMP. These appear to me to be needed in both instruction-accurate and cycle-accurate emulation. Previously MB was only set for ISZ.
I see no additional tracking of MB that is specific to cycle-accurate operation.
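
The second item can be sketched as a small C helper that returns the value MB ends up holding for each memory reference instruction. This is only an illustration of the list above: the function name, the `OP_*` constants, and the parameter list are assumptions made for the example and do not appear in pdp8_cpu.c.

```c
#include <stdint.h>

/* PDP-8 memory reference opcodes (top three bits of the instruction). */
enum { OP_AND = 0, OP_TAD = 1, OP_ISZ = 2, OP_DCA = 3, OP_JMS = 4, OP_JMP = 5 };

/* Hypothetical helper: the value MB holds at the end of an MRI.
 *   operand  memory word at the effective address before the instruction
 *   ea       effective address
 *   ac       accumulator before the instruction
 *   pc_next  address of the instruction following this one
 * Returns -1 for opcodes 6 and 7 (IOT and OPR), which are not MRIs. */
int mb_after_mri(unsigned opcode, uint16_t operand, uint16_t ea,
                 uint16_t ac, uint16_t pc_next)
{
    switch (opcode) {
    case OP_AND:
    case OP_TAD:
        return operand & 07777;          /* data word used as input */
    case OP_ISZ:
        return (operand + 1) & 07777;    /* incremented word written back */
    case OP_DCA:
        return ac & 07777;               /* accumulator being deposited */
    case OP_JMS:
        return pc_next & 07777;          /* return address being stored */
    case OP_JMP:
        return ea & 07777;               /* destination address */
    default:
        return -1;                       /* not a memory reference instruction */
    }
}
```

Keeping the rule in one place like this makes it easy to see that none of the cases depends on major states; whether the real code is better served by a helper or by inline assignments is a separate question.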

Steve, Vince, would you please sanity check my review?

-Bill

William Cattey

Jul 3, 2025, 12:19:11 AM
to PiDP-8
I took a stab at adding the reporting out into the MB major register.  Attached to this message is a context diff against the current trunk pdp8_cpu.c (instruction accurate).

As I suspected, adding the MB reporting introduced a 9% to 16% performance hit (depending on the metric):

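Reminder: No Blinking trunk performance on Pi3b (baseline), followed by the same test with MB reporting added (percentages are relative to baseline):
Tick time: 12.03 secs; Ticks: 723; Insts Per Tick: 365,375; Insts/sec: 21,922,500 (baseline)
Tick time: 13.27 secs, 91%; Ticks: 797, 91%; Insts Per Tick: 305,705, 84%; Insts/sec: 18,342,300, 84% (with MB reporting)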

So... correct instruction-accurate operation involves a performance hit of 9-16%; cycle-accurate operation involves another 14-21% performance hit on top of that (net 30%).
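
For anyone who wants to re-derive the percentages from the raw figures, here is a small C sketch of the arithmetic. The function names are made up for this example. Note that the time-based figure divides baseline by measured (a longer run is slower), while the rate-based figure divides measured by baseline.

```c
#include <assert.h>

/* Percent of baseline speed, from elapsed times (smaller time is faster). */
double pct_from_times(double baseline_secs, double measured_secs)
{
    assert(baseline_secs > 0.0 && measured_secs > 0.0);
    return 100.0 * baseline_secs / measured_secs;
}

/* Percent of baseline speed, from rates such as instructions per second. */
double pct_from_rates(double baseline_rate, double measured_rate)
{
    assert(baseline_rate > 0.0);
    return 100.0 * measured_rate / baseline_rate;
}

/* Performance hit in percent, given percent of baseline speed. */
double hit_pct(double pct_of_baseline)
{
    return 100.0 - pct_of_baseline;
}
```

Calling `pct_from_times(12.03, 13.27)` and `pct_from_rates(21922500.0, 18342300.0)` gives values that round to the 91% and 84% quoted above, so the two ends of the "9% to 16%" range come from the tick-time metric and the instructions-per-second metric respectively.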

I can add the compile-time disablement of cycle-accurate operation to get that back down to a 15% performance hit.

How do we build consensus around the question:

Is it worth a 15% performance hit to make cycle-accurate the sole codeline?

-Bill
mb_trunk.diff

William Cattey

Jul 3, 2025, 1:06:19 AM
to PiDP-8
Feeling quite comfortable in my understanding of the situation, I've sent a comment to the OpenSIMH issue I opened last November testing the waters for accepting a pull request to add support for Major States. In that note I include the LibreOffice spreadsheet attached herewith, which gives the benchmark results and the relative performance hits.

I advocate for "Support for Major States" without run-time or compile time conditionals to disable it:
  • Run-time checking would add an extra if to all instructions, not just MRI instructions.
  • Support for Major States gets us full support of the "Single Step" switch that is present on nearly all members of the Family of 8.
  • It's really only a 15% performance hit when you add in the proper MB support, which is something that even the instruction-accurate code should have.
What do others think?

-Bill
Benchmarks.ods

Warren Young

Jul 3, 2025, 11:32:36 AM
to Bill Cattey, PiDP-8
On Jul 2, 2025, at 22:19, William Cattey <bill....@gmail.com> wrote:

Is it worth a 15% performance hit to make cycle-accurate the sole codeline?

Yes. Even on the old Pi 1 B+, prior testing shows that we were running 5.6x faster than realtime. We’ve got the perf to spare.

The costs of making this conditional aren’t worth it. It’s not merely the runtime cost of a C “if” or the configuration complexity bought by an #ifdef, it’s also the need to change everything that compares simulated IPS to that of real machines.

One such is the lib/pidp8i/ips.py.in file, from which I got the above test result. We don’t want to make that constant conditional. If we’re moving to cycle-accurate, the thing to do is repeat the test under the new simulator and update the raspberry_pi_b_plus constant accordingly.

I recall my Pi 3 running flat-out ~24x faster than a real PDP-8/I at one point, so a 15% haircut could be expected to run at 20x the speed of real hardware for a pure CPU test, faster still when I/O is involved. If that isn’t fast enough, there’s the Pi 4 and 5, for which I am not aware of any current benchmarks, but which have to be even faster.

Those who think even that isn’t fast enough will be running mainstream SIMH on something better than a Pi 5, and won’t care about switches and incandescent lamp simulations and cycle accuracy anyway.

On the other end of the scale, those who do care about cycle accuracy are likely throttling their simulated CPU, where we should have plenty of spare host-side cycles before we peg a single CPU core.

Let’s also keep in mind that this is a retrocomputing project, which tends to attract the type of person who isn’t unduly swayed by the new-and-shiny. If such a user feels the need for a faster simulator, a suggestion to revert to the older version should not be viewed as unreasonable.

William Cattey

Jul 6, 2025, 6:14:37 PM
to PiDP-8
On the strength of input in private emails from Steve, Vince, and Mike, in addition to the above affirmation from Warren, I've done the following:

1. Merged cycle-accurate into trunk (I made a couple cosmetic changes.)

Everything tests out fine on my Pi and my Mac.

2. Crafted a version of pdp8_cpu.c taken from the cycle-accurate version in trunk that removes all the PiDP8 stuff.

That tests out fine on my Mac.

3. I submitted a pull request to the OpenSIMH upstream with pdp8_cpu.c mentioned in #2 above. See: https://github.com/poetnerd/simh/pull/1

This would not have been possible without the work of Steve Tockey for the original cycle-accurate work, follow-on integration and refinement by Vince Slyngstad and Heinz-Bernd Eggenstein, benchmarking invention by Mike Katz, and of course the original and ongoing work of Warren Young.

-Bill

William Cattey

Jul 11, 2025, 1:45:44 PM
to PiDP-8
There was a teeny portability improvement made, and now I've submitted the pull request for real upstream at:

William Cattey

Aug 24, 2025, 9:56:01 PM
to PiDP-8
The pull request has been rejected for inclusion "for the primary emulator" by the Open SIMH Steering Group and Bob Supnik himself. However, Paul Koning also said, "But we allow multiple emulators for a given machine (you can see this in the pdp10 series of emulators, for example) so if you want to submit this as an alternate emulator (perhaps "pdp8-ms") that would be an acceptable contribution."

I looked in the PDP10 subdir and in CMakeLists I found this:


option(PANDA_LIGHTS
       "Enable (=1)/disable (=0) KA-10/KI-11 simulator's Panda display. (def: disabled)"
       FALSE)
option(PIDP10
       "Enable (=1)/disable (=0) PIDP10 display options (def: disabled)"
       FALSE)

This looks like a VERY interesting opportunity to quit doing the simh-merge thing, and instead add our PiDP-8/i stuff as an option in the OpenSIMH tree!

I think that our pdp8_cpu.c really should be the compile-time version we build for the PiDP-8/i hardware (having slogged through the various other options).

I'd also very much like to add code that enables us to actually set the PDP-8/i instruction set (but as a run-time option since we haven't explored the ramifications of turning off the 8e instructions with our deployed software.)

What do others think about pushing the whole PiDP-8/i enchilada over to OpenSIMH as an optional build?

-Bill

Steve Tockey

unread,
Aug 25, 2025, 2:34:36 AM
to PiDP-8

Bill,
That sounds to me like a good option. Let me know what I can do to help.


-- steve


