Background: One growing hobby of late is “retro-computing”, or
hobbyist vintage computing, where people dust off their old 1980s PCs
and use them, write new programs for them, etc. One problem that has
cropped up recently in this hobby is the lack of a means to properly
identify and benchmark systems, not only “real” ones (i.e. true IBM
PC/XT, AT, etc.) but even more so unmarked clones. There is also a need
for something like this for emulator writers, so that they can attempt
to get their code cycle-exact, and also just for regular people who,
for example, just want to play games at the right speed in DOSBox.
Everyone is familiar with the old Norton SI and Landmark CPU Speed
benchmarks, but they are horribly misleading and generally incomplete
test suites. Other benchmarks, such as C&T MIPS.COM, are much better,
but they aren’t realtime (the test takes 30 seconds) and only offer
three machine classes to compare to. So, I have volunteered to write a
benchmark that would meet the above needs. The goals would be
relatively simple:
- Take a performance measurement of a machine and store it locally in
a tiny database that accompanies the program
- Allow comparison of the current machine’s metric to the database,
and bring up close matches for comparison
- Perform the measurement/comparison continuously, so that running it
inside an emulator would allow you to immediately see the results of
tuning the emulator speed. (For example, this would allow people to
“dial” the speed of the emulator to match a target machine.)
Now the problems:
I’m having trouble coming up with a decent metric and/or way of
profiling a machine that not only works on ANY PC (i.e. even a PC/XT,
where there is no RTC or RDTSC available, only the 8253) but *also*
works as high up as, say, a Pentium @ 166MHz (but not much higher,
as there is no target audience for this benchmark above that
platform).
The basic idea I had was to run through every single 808x-compatible
instruction (except POP CS which would hang 286 and later, and aad/aam
with a custom divisor because those hang NEC V20/V30) and time it,
then perform some memory moves/fills in system RAM, then the same to
video adapter RAM, and then print out the closest matches in the
database for all three measurements. Optionally, also output some
sort of combined score (like a “fingerprint” for the machine) so that
one generic clone can be compared to other generic clones and/or known
machine performance profiles.
I was planning on using the 8253 at full resolution to perform the
timing, using Abrash’s Zen timer code, which I am very familiar with.
The problem with this method, as far as I can tell, is that once I hit
486 and later, L1 caching becomes a problem. This is not because
caching is an “unfair” speed boost (if anything, I definitely WANT
caching to affect speed as a true test of how fast a system is), but
rather because of how small the test suite is: it would fit entirely in
cache and, coupled with pipelining on Pentium and later, would execute
faster than the 1.19 MHz 8253 would be able to detect! I.e. the entire
test suite could execute in a single tick of the 8253.
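To put a rough number on that fear, here is a back-of-the-envelope sketch in Python (not part of the benchmark itself). The 8253 input clock of 1,193,182 Hz is the standard PC value; the CPU clocks and the 500-cycle suite length are illustrative assumptions, and the model ignores wait states, cache misses and pipelining.

```python
# Standard PC 8253 input clock in Hz. The CPU clocks and the 500-cycle
# suite length used below are illustrative assumptions, not measurements.
PIT_HZ = 1_193_182

def cycles_per_tick(cpu_hz):
    """CPU clock cycles that elapse during one 8253 count."""
    return cpu_hz / PIT_HZ

def ticks_for(suite_cycles, cpu_hz):
    """8253 counts that elapse while the CPU spends suite_cycles clocks."""
    return suite_cycles * PIT_HZ / cpu_hz

for name, hz in [("PC/XT 4.77 MHz", 4_772_727),
                 ("486 @ 33 MHz", 33_000_000),
                 ("Pentium @ 166 MHz", 166_000_000)]:
    print(f"{name:18s} {cycles_per_tick(hz):7.1f} cycles/tick, "
          f"500-cycle suite = {ticks_for(500, hz):6.1f} ticks")
```

At 166 MHz one 8253 count spans roughly 139 CPU clocks, so under these assumptions any code sequence shorter than that could indeed finish within a single tick.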
Questions:
- Is this a reasonable fear, or am I overestimating how much
pipelining and cache will speed things up? (Remember, there is no
target audience for this benchmark beyond a Pentium)
- Should I look into some sort of alternate timing method, such as
running the test suite multiple times in a certain time period? If
so, what would a reasonable time boundary be? (no more than a full
second, I hope... Remember, one of the primary goals of the benchmark
is to run “realtime” such that adjusting an emulator, or popping a
“turbo” button on/off, would be noticeable; I'm also worried about
having interrupts turned off for a long period of time)
- Has this problem already been solved by a benchmark utility I am not
yet aware of?
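One way to approach the second question is to repeat the suite until a timed batch spans enough 8253 counts to be meaningful. The Python sketch below only models that idea: `fake_machine` is a hypothetical stand-in for real timing code, and the 500-cycle suite, the CPU clocks and the 1000-count threshold are all assumptions chosen for illustration.

```python
PIT_HZ = 1_193_182  # standard PC 8253 input clock, in Hz

def fake_machine(suite_cycles, cpu_hz):
    """Hypothetical stand-in for real timing code: returns a function that
    reports how many 8253 counts n back-to-back suite runs would take."""
    def measure_ticks(n):
        return n * suite_cycles * PIT_HZ // cpu_hz
    return measure_ticks

def calibrate_repeats(measure_ticks, min_ticks=1000, max_repeats=1 << 20):
    """Double the repeat count until one timed batch spans >= min_ticks."""
    repeats = 1
    while repeats < max_repeats and measure_ticks(repeats) < min_ticks:
        repeats *= 2
    return repeats

pentium = fake_machine(suite_cycles=500, cpu_hz=166_000_000)
xt = fake_machine(suite_cycles=500, cpu_hz=4_772_727)
print(calibrate_repeats(pentium), calibrate_repeats(xt))  # prints "512 8"
```

A batch of 1000 counts lasts about 0.84 ms, far less than the roughly 55 ms between default timer interrupts, so a threshold of that order would keep both the update rate and the interrupts-off window small.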
Thanks for reading this far :-) Any and all thoughts regarding this
are appreciated.
I'm not sure of the utility of this utility (!). In particular, I
don't think a one-number result that includes all opcodes would be
very useful, since most of them rarely get used and might swamp out
differences in those that really do get used a lot.
When I wrote my SYSTEST utility (a "few" years ago),
I opted for a full screen of output, showing separate values for
things I thought were important. (Comparing video RAM and system RAM
moves turned out to be a real eye-opener... seemed at the time that
a video move took about 1 usec no matter how fast the CPU, due to
video limitations.)
You can download the latest version (ca 1999, I think) at
<http://www.daqarta.com/systest.zip>
I didn't encounter any cache issues, but then I wasn't looking too
hard.
I experimented with running the timing loops with and without refresh.
As I recall, it didn't make that much difference (about as expected),
but I believe I finally ended up reporting refresh-off times.
I don't quite understand the "real-time" issue. Why not run from the
command line? You might have to run it a couple of times to home in
on the tuning values you want, but I guess I don't see this as any big
deal. And I'm not sure what the alternative is, unless you do find a
single-value metric you could show as a "meter". If this is running
under real-mode DOS anyway, you'd have to awaken the meter via a
system timer to do a periodic update. But then you would have to deal
with the fact that you are messing with that same timer to do your
benchmark timing. Not impossible, but an added layer of effort.
I never had a problem running SYSTEST with and without Turbo,
etc. It allows you to redirect output to a file, so I usually would
just save files and compare them later.
Best regards,
Bob Masta
DAQARTA v3.50
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, FREE Signal Generator
Science with your sound card!
...
>>Questions:
>>- Is this a reasonable fear, or am I overestimating how much
>>pipelining and cache will speed things up? (Remember, there is no
>>target audience for this benchmark beyond a Pentium)
>>- Should I look into some sort of alternate timing method, such as
>>running the test suite multiple times in a certain time period? If
>>so, what would a reasonable time boundary be? (no more than a full
>>second, I hope... Remember, one of the primary goals of the benchmark
>>is to run “realtime” such that adjusting an emulator, or popping a
>>“turbo” button on/off, would be noticeable; I'm also worried about
>>having interrupts turned off for a long period of time)
>>- Has this problem already been solved by a benchmark utility I am not
>>yet aware of?
I checked 'DOSBOX0.72' (link posted by Rod in AOD) recently, and it
seems to run old games at their intended speed under Win98 and XP as well.
>>Thanks for reading this far :-) Any and all thoughts regarding this
>>are appreciated.
> I'm not sure of the utility of this utility (!). In particular, I
> don't think a one-number result that includes all opcodes would be
> very useful, since most of them rarely get used and might swamp out
> differences in those that really do get used a lot.
>
> When I wrote my SYSTEST utility (a "few" years ago),
> I opted for a full screen of output, showing separate values for
> things I thought were important. (Comparing video RAM and system RAM
> moves turned out to be a real eye-opener... seemed at the time that
> a video move took about 1 usec no matter how fast the CPU, due to
> video limitations.)
I can confirm this limitation for direct VRAM writes within the AGP range.
Currently I'm stuck at an access rate of 32 CPU cycles because I couldn't
find out where the hell nVidia may have located its UDMA bus master.
> You can download the latest version (ca 1999, I think) at
> <http://www.daqarta.com/systest.zip>
> I didn't encounter any cache issues, but then I wasn't looking too
> hard.
Besides this, I ran into the physical limit of the PCI bus and/or
the AGP bridges. Direct screen-write timing seems not to be affected
by any AGP settings (they seem to apply to bus masters only).
> I experimented with running the timing loops with and without refresh.
> As I recall, it didn't make that much difference (about as expected),
> but I believe I finally ended up reporting refresh-off times.
Yeah, modern RAM modules don't need to be told when it's time for a
refresh read, so we measure these burst reads within our RDTSC figures,
alongside cache burst reads, without even being notified.
> I don't quite understand the "real-time" issue. Why not run from the
> command line? You might have to run it a couple of times to home in
> on the tuning values you want, but I guess I don't see this as any big
> deal. And I'm not sure what the alternative is, unless you do find a
> single-value metric you could show as a "meter". If this is running
> under real-mode DOS anyway, you'd have to awaken the meter via a
> system timer to do a periodic update. But then you would have to deal
> with the fact that you are messing with that same timer to do your
> benchmark timing. Not impossible, but an added layer of effort.
I agree that an 'exact' meter is impossible, but my OS actually checks
the timing of code modules to either grant a full time slice or split it
up into parts. Even if it's just a coarse estimate (worst case on the
first run anyway), it becomes useful for optimising the whole job queue.
> I never had a problem running SYSTEST with and without Turbo,
> etc. It allows you to redirect output to a file, so I usually would
> just save files and compare them later.
:) turbo-switch emulation seems to be just a delaying feature today.
__
wofgang
The one-number result wouldn't be the only reported metric. The more
useful metrics would be the x86 opcode test, followed by the system
RAM test and the video adapter RAM test. The combined metric is
definitely not meant to be the main use of the benchmark, but rather a
way to fulfill curiosity about "what already-profiled machine is
closest to this one?"
> When I wrote my SYSTEST utility (a "few" years ago),
> I opted for a full screen of output, showing separate values for
> things I thought were important. (Comparing video RAM and system RAM
> moves turned out to be a real eye-opener... seemed at the time that
> a video move took about 1 usec no matter how fast the CPU, due to
> video limitations.)
It's even worse on an old XT; scrolling by moving video RAM around
tops out at 160KB/s, but if you can maintain a second buffer in system
RAM, updated during idle periods, you can REP MOVSW at up to 240KB/s.
The advantage I remember on "newer" systems (we're talking 1992-1995
here) is that, since the video was the limiting factor, you could
interleave something else during the memory move.
> You can download the latest version (ca 1999, I think) at
> <http://www.daqarta.com/systest.zip>
Very nice! I have a lot of CPU identifying code; is the code for
yours available? I can detect NEC V20/V30 but I'm curious how you
detect 8086 (prefetch queue size?)
> I don't quite understand the "real-time" issue.
When you run an adjustable x86 emulator such as DOSBox, you have the
ability to give it more or less CPU time while it is running. (In
DOSBox, I believe it is CTRL+F11 and CTRL+F12.) The problem with
DOSBox, just as an example, is that it specifies "speed" as
"number of cycles to run per timeslice", which is just a single number,
i.e. CYCLES=2000 or something. If you are trying to emulate a specific
speed target, it is a very lengthy trial-and-error session of running
a benchmark, waiting 30 seconds or more, adjusting, and repeating.
> And I'm not sure what the alternative is, unless you do find a
> single-value metric you could show as a "meter".
Exactly, this was the point of the combined metric number. A quick
way of "dialing" the speed of an emulator to match a particular target
machine.
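As a rough illustration of how "closest already-profiled machine" could be computed from the three measurements, here is a small Python sketch. The metric names, the log-ratio distance and the database entries are all assumptions made up for the example; the numbers are placeholders, not measurements of any real machine.

```python
import math

def distance(a, b):
    """Sum of squared log-ratios across the metrics; 0 means identical.
    Log-ratios treat "twice as fast" and "half as fast" symmetrically."""
    return sum(math.log(a[k] / b[k]) ** 2 for k in a)

def closest(current, database, n=3):
    """Names of the n database entries nearest to the current machine."""
    return sorted(database, key=lambda name: distance(current, database[name]))[:n]

# Placeholder profiles (hypothetical values, arbitrary units).
database = {
    "machine A": {"opcodes": 100.0, "ram": 100.0, "video": 100.0},
    "machine B": {"opcodes": 230.0, "ram": 180.0, "video": 110.0},
    "machine C": {"opcodes": 900.0, "ram": 700.0, "video": 120.0},
}
current = {"opcodes": 220.0, "ram": 175.0, "video": 105.0}
print(closest(current, database, n=2))  # ['machine B', 'machine A']
```

Re-running `closest` after every measurement pass would give the "dial" behaviour: as the emulator speed changes, the top match changes with it.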
> If this is running
> under real-mode DOS anyway, you'd have to awaken the meter via a
> system timer to do a periodic update. But then you would have to deal
> with the fact that you are messing with that same timer to do your
> benchmark timing. Not impossible, but an added layer of effort.
I've already solved that problem :-) but thanks for the reminder, and
for the advice thus far.
>> You can download the latest version (ca 1999, I think) at
>> <http://www.daqarta.com/systest.zip>
>
>Very nice! I have a lot of CPU identifying code; is the code for
>yours available? I can detect NEC V20/V30 but I'm curious how you
>detect 8086 (prefetch queue size?)
Jim:
Yes, I do use prefetch size for the 86/88 determination. Here is the
complete ID code from SYSTEST:
;-----------------------------------------------------
GET_CPU PROC NEAR
;Determine CPU type:
        CLI
        XOR     BX,BX                   ;Assume 8088, BL = 0
        PUSH    SP
        POP     AX
        CMP     SP,AX
        JE      TEST_286
        CALL    GET_FLAGS
;Test for 80188/80186 using test from Robert Collins' CPUID. According to
;Intel docs, 8018x family will use linear addressing for a word written to
;FFFF, while the 8088/8086 will wrap the high byte to 0:
        MOV     AX,DS:[0FFFFh]          ;Get original data
        MOV     WORD PTR DS:[0FFFFh],0AAAAh ;Write signature at test location
        CMP     BYTE PTR DS:[0],0AAh    ;8086 if it wrapped to 0
        MOV     DS:[0FFFFh],AX          ;Restore original data
        JE      TEST_V20
        MOV     BL,1
        JMP     GOT_CPU
TEST_V20:
;Here we rely on the 8088/8086 behavior of skipping the byte after INSB, which
;they don't support but the V20 does. NOTE: Robert Collins' CPUID method of
;testing for V20 based upon MUL flags does not work... 8086 (at least) seems to
;set ZF as does V20 (he claimed it always set ZF=0 (NZ)).
INSB$   EQU     6Ch
        MOV     BH,20                   ;Assume NEC V20
        MOV     DI,OFFSET TEST_V20      ;Overwrite 1st byte of this code with port
        XOR     DX,DX                   ;Port 0
        DB      INSB$                   ;Port 0 into ES:[DI] if not 8088, else skip
        INC     DX                      ;Skip this if 8088, leave DX = 0
        TEST    DX,DX                   ; Else set DX = 1 if INSB supported
        JNZ     GOT_CPU
;Test 8088/8086 based upon prefetch queue length. 8088 = 4-byte, 8086 = 6-byte.
;Code modified from Szilagyi, EDN Design Ideas, Apr 26, 1990, pg 232.
        CLI
        MOV     BH,8                    ;Assume 8088
        MOV     SI,OFFSET QUE_CODE + 1  ;Target code ahead of or in queue
        JMP     SHORT $+2               ;Reset que (important!)
        MOV     [SI],BH                 ;2 bytes. Modify targ if 8088, not if 8086
        NOP                             ;1 byte
        NOP                             ;1 byte - end of 8088 queue
QUE_CODE:
        MOV     BH,6                    ;6 is outside 8088 queue, just inside 8086
        STI
        JMP     SHORT GOT_CPU
TEST_286:
        INC     BX                      ;Not 8088, assume 286
        INC     BX
        CALL    GET_FLAGS               ;Also formats to LO_FLAGS_MSG
        OR      AX,4000h                ;Attempt to set Nested Task flag
        CALL    SET_FLAGS               ;PUSH AX, POPF workaround for early 286s
        PUSHF
        POP     AX
        TEST    AX,4000h                ;Did we set it?
        JZ      GOT_CPU                 ;Can't set it on 286 in real mode
        INC     BX                      ;Not 286, assume 386
;Check for 386 by attempting to toggle EFLAGS register Alignment Check bit.
;Can't be changed on a 386. Also tries to toggle ID bit 21 for 486+ test.
        CLI
        DB      66h
        PUSHF
        DB      66h
        PUSHF
        DB      66h
        POP     AX
        DB      66h
        MOV     CX,AX
        DB      66h
        XOR     AX,0                    ;XOR EAX,00240000h toggle bits 18 and 21
        DW      24h
        DB      66h
        PUSH    AX
        DB      66h
        POPF                            ;POPF is OK here, since 286s already trapped
        DB      66h
        PUSHF
        DB      66h
        POP     AX
        DB      66h
        POPF
        STI
        DB      66h
        XOR     AX,CX                   ;Compare bits (1 if toggled, else 0)
;Check for bit 18 toggled:
        DB      66h
        TEST    AX,0                    ;TEST EAX,00040000h test Alignment Check bit 18
        DW      4
        JNZ     GOT_486_PLUS
        TEST    MSW_WORD,1              ;Protected mode?
        JNZ     GOT_CPU                 ;Don't try this test if so
        DB      0Fh, 00100000b, 11000000b ;MOV EAX,CR0
        AND     AL,NOT 10000b           ;Attempt to clear CR0 bit 4
        DB      0Fh, 00100010b, 11000000b ;MOV CR0,EAX write new value
        DB      0Fh, 00100000b, 11000000b ;MOV EAX,CR0 get it back
        TEST    AL,10000b               ;Did it clear?
        JZ      GOT_CPU
        MOV     BH,1
        JMP     SHORT GOT_CPU           ;Must be 386 if it wasn't toggled
GOT_486_PLUS:
        INC     BX                      ;Assume 486
;Check for ID bit 21 toggled:
        DB      66h
        TEST    AX,0
        DW      20h
        JZ      GOT_CPU
;486DX4, Pentium, or higher which has CPUID instruction. CAUTION: Some 486s
;less than DX4 also allow bit 21 toggle, but return 0 for CPU here. (?)
        DB      66h
        MOV     AX,1                    ;MOV EAX,1
        DW      0
        DB      0Fh, 0A2h               ;CPUID returns family digit in bits 8-11
        MOV     CPU_ID,AX
        MOV     BL,AH                   ;Assume proper CPU number
        AND     BL,0Fh                  ;Isolate bits 8-11
GOT_CPU:
        STI
        MOV     WORD PTR CPU_TYPE,BX
        RET
GET_CPU ENDP
;-------------------------------------
This is exactly right, I measured this when writing my terminal
emulator/file transfer program around 1983, and decided that I
absolutely had to use a RAM-based frame buffer, then only update the
real video frame after the end of the current batch of input characters.
This allowed me to handle stuff like a series of LF characters (often
used to clear the screen) and still keep up with 9600 baud inputs
without dropping anything.
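A quick worked calculation shows why the RAM buffer was needed. This Python sketch takes the 160KB/s video-move figure quoted earlier in the thread and assumes 10 bits per character on the wire (8N1 framing) and an 80x25 text screen where a scroll moves 24 rows of 80 character/attribute pairs; all three are assumptions for illustration.

```python
def scrolls_per_second(bytes_per_scroll, video_bytes_per_sec):
    """How many full-screen scrolls per second the video memory can sustain."""
    return video_bytes_per_sec / bytes_per_scroll

chars_per_sec = 9600 / 10          # 9600 baud, assuming 10 bits per character
scroll_bytes = 24 * 80 * 2         # 24 rows moved up, 2 bytes per text cell
rate = scrolls_per_second(scroll_bytes, 160 * 1024)

print(f"incoming characters per second: {chars_per_sec:.0f}")
print(f"sustainable scrolls per second: {rate:.1f}")
```

Under these assumptions the line can deliver 960 LF characters per second while the video memory can absorb only about 43 scrolls per second, so scrolling the real frame once per LF cannot keep up; scrolling in system RAM and copying once per batch can.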
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
This is interesting; my own 8086/88 code used the opposite approach:
Instead of flushing the queue with a JMP, I allow it to fill completely
by doing a dummy DIV instruction:
This opcode takes so long (~20-40 cycles?) that the prefetch queue will
always fill completely. I then overwrite parts of it using a REP STOSB
instruction to replace INC opcodes with NOPs.
The number of INCs executed is directly correlated to the size of the
queue, all the way up to the 386, which had a 32-byte buffer here.
(I didn't use this feature to determine if I had a 386 of course; the
standard ways for 286/386/486/CPUID work much more easily!)
This is fascinating, I've got different methods than you. For
example:
> TEST_V20:
> ;Here we rely on the 8088/8086 behavior of skipping the byte after INSB, which
> ;they don't support but the V20 does.
I never knew about that! Instead, I half-trip 8080 compatibility mode
("break for emulation") and see what happens:
; On the 8088, 0Fh performs a POP CS.
; On the V20/V30, it is the start of a number of multi-byte instructions.
; With the byte string 0F 14 C3 the CPU will perform the following:
;   8088/8086          V20/V30
;   pop cs             set1 bl, cl
;   adc al, 0C3h
xor al, al ; clear al and carry flag
push cs ; in case POP CS is successful!
db 0Fh, 14h, 0C3h ; instructions (see above)
cmp al, 0C3h ; if al is C3h then 8088/8086
jne upV20
mov ax, 0 ; set 8088/8086 flag
jmp Exit
upV20:
pop ax ; correct for lack of POP CS
mov ax, 200h ; set V20/V30 flag
jmp Exit
I'm afraid I don't quite understand the prefetch queue code:
> ;Test 8088/8086 based upon prefetch queue length. 8088 = 4-byte, 8086 = 6-byte.
> ;Code modified from Szilagyi, EDN Design Ideas, Apr 26, 1990, pg 232.
> CLI
> MOV BH,8 ;Assume 8088
> MOV SI,OFFSET QUE_CODE + 1 ;Target code ahead of or in queue
> JMP SHORT $+2 ;Reset que (important!)
> MOV [SI],BH ;2 bytes. Modify targ if 8088, not if 8086
> NOP ;1 byte
> NOP ;1 byte - end of 8088 queue
>
> QUE_CODE:
> MOV BH,6 ;6 is outside 8088 queue, just inside 8086
> STI
> JMP SHORT GOT_CPU
I don't quite understand the above; if the JMP SHORT clears the
prefetch queue, how does this work?
I'm sorry, I was confused; what I'm actually doing up there is running
an NEC V20/V30-specific instruction, SET1 (set a specific bit). Here's
the reference for that particular opcode:
Mnemonic: SET1 reg/mem,CL/immediate
Opcode  : SET1 r/m8,CL    : 0F 14 [mod:000:r/m]     (4/13 clocks)
          SET1 r/m8,imm3  : 0F 1C [mod:000:r/m] imm (5/14 clocks)
          SET1 r/m16,CL   : 0F 15 [mod:000:r/m]     (4/13 clocks)
          SET1 r/m16,imm4 : 0F 1D [mod:000:r/m] imm (5/14 clocks)
          SET1 CY         : F9 (NEC nomenclature for Intel's STC)
          SET1 DIR        : FD (NEC nomenclature for Intel's STD)
Sets the specified bit in the register/memory operand. The bit number
(CL or immediate) is ANDed with 07 (for 8-bit operands) or 0F (for 16-bit
operands) to get a valid bit number. No flags are affected by this
operation, except the Carry and Direction Flag with SET1 CY and SET1 DIR.
The first (smaller) clock count in each pair is for register operands.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.
(Information supplied by Anthony Naggs)
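The bit-number masking described in that reference can be modelled in a few lines of Python. This is only a sketch of the documented register/memory behaviour; the function name and signature are made up for the example.

```python
def set1(value, bit, width=8):
    """Model of NEC SET1 on a register/memory operand: set one bit, with
    the bit number masked to 3 bits (8-bit) or 4 bits (16-bit operands)."""
    if width not in (8, 16):
        raise ValueError("width must be 8 or 16")
    mask = 0x07 if width == 8 else 0x0F
    return (value | (1 << (bit & mask))) & ((1 << width) - 1)

print(hex(set1(0x00, 3)))       # 0x8
print(hex(set1(0x00, 11)))      # 0x8   (11 AND 07 = 3)
print(hex(set1(0x00, 11, 16)))  # 0x800
```

In the detection snippet above, this is why `set1 bl, cl` is harmless on a V20/V30: it only sets a bit in BL and leaves AL at zero, so the later compare against 0C3h fails and the V20 branch is taken.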
Here is another CPU/FPU test (from the German book PC Intern, published by Data Becker):
;═══════════════════════════════════════════════════════════════
; PROZ.ASM (c) S.L.
; Module determines processor type and coprocessor
;═══════════════════════════════════════════════════════════════
.model tiny
.code
; --------------------------------------------------------------
; Constants
; --------------------------------------------------------------
Cpu86 = 0
Cpu286 = 2
Cpu386 = 4
Cpu486 = 6
Cpu586 = 8
FpuNone = 0
FpuYes = 2
FpuEmul = 4
public GetCpu
GetCpu proc near
        mov     ax, Cpu86
        xor     bx, bx
        push    bx              ; zero onto the stack
        popf                    ; zero into the flags register
        pushf
        pop     bx              ; back into bx
        and     bh, 0F0h
        cmp     bh, 0F0h        ; if equal, then 8086
        je      @CpuOk
.286
        mov     ax, Cpu286
        push    7000h           ; same again with 7000h
        popf
        pushf
        pop     bx
        and     bh, 70h
        jz      @CpuOk          ; if zero, then 286
.386
        mov     ax, Cpu386
        mov     edx, esp
        and     esp, 0FFFCh     ; address divisible by four
        pushfd
        pop     ebx
        mov     ecx, ebx
        btc     ebx, 18         ; flip bit 18
        push    ebx
        popfd
        pushfd
        pop     ebx
        push    ecx             ; restore old flags
        popfd
        mov     esp, edx        ; restore stack
        cmp     ecx, ebx        ; if equal, then 386
        jz      @CpuOk
        mov     ax, Cpu486
        btc     ecx, 21
        push    ecx
        popfd
        pushfd
        pop     ebx
        cmp     ebx, ecx        ; if not equal, then 486
        jnz     @CpuOk
        mov     ax, Cpu586
.8086
@CpuOk:
        mov     dx, FpuNone     ; assume no FPU
        mov     byte ptr cs:[@1], 90h
        mov     byte ptr cs:[@2], 90h
@1:     finit                   ; initialize
@2:     fstcw   cs:[aword]      ; store control word
        cmp     byte ptr cs:[aword+1], 3
        jnz     @FpuOk
        mov     dx, FpuYes
        cmp     ax, Cpu86       ; no emulation on 8086
        jz      @FpuOk
.286p
        smsw    bx              ; MSW into BX
        shr     bx, 1
        and     bx, 3           ; mask bits 0 and 1
        cmp     bx, 2           ; bit 2=1 -> emulation
        jnz     @FpuOk
        mov     dx, FpuEmul
.8086
@FpuOk:
        sti
        ret
aword   word 0
GetCpu endp
end
;-------------
Dirk
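The CPU part of that listing boils down to a four-step ladder of FLAGS/EFLAGS probes. The Python sketch below restates only the decision logic; the boolean parameter names are invented for the example, and each one stands for the outcome of the corresponding PUSHF/POPF probe in the assembly.

```python
def classify_cpu(high_bits_stuck_set, can_set_bits_12_14,
                 can_toggle_bit_18, can_toggle_bit_21):
    """Decision ladder matching the order of tests in the listing above."""
    if high_bits_stuck_set:        # FLAGS bits 12-15 read back as 1s after POPF 0
        return "8086"
    if not can_set_bits_12_14:     # 7000h does not stick
        return "286"
    if not can_toggle_bit_18:      # EFLAGS bit 18 cannot be flipped
        return "386"
    if not can_toggle_bit_21:      # EFLAGS bit 21 cannot be flipped
        return "486"
    return "586"

print(classify_cpu(False, True, True, False))  # 486
```

Each probe is only attempted once the earlier ones have ruled out the older CPUs, which is why the 32-bit instructions never run on a machine that would not understand them.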
>I'm afraid I don't quite understand the prefetch queue code:
>
>> ;Test 8088/8086 based upon prefetch queue length. 8088 = 4-byte, 8086 = 6-byte.
>> ;Code modified from Szilagyi, EDN Design Ideas, Apr 26, 1990, pg 232.
>> CLI
>> MOV BH,8 ;Assume 8088
>> MOV SI,OFFSET QUE_CODE + 1 ;Target code ahead of or in queue
>> JMP SHORT $+2 ;Reset que (important!)
>> MOV [SI],BH ;2 bytes. Modify targ if 8088, not if 8086
>> NOP ;1 byte
>> NOP ;1 byte - end of 8088 queue
>>
>> QUE_CODE:
>> MOV BH,6 ;6 is outside 8088 queue, just inside 8086
>> STI
>> JMP SHORT GOT_CPU
>
>I don't quite understand the above; if the JMP SHORT clears the
>prefetch queue, how does this work?
>
The MOV BH,6 instruction is changed to MOV BH,8 by MOV [SI],BH.
That really happens on an 8088 since the [SI] target is outside of the
4-byte queue. But on an 8086 or higher, that instruction was already
in the queue at the time of MOV [SI],BH and so it was too late to
change it... the MOV BH,6 was already a done deal.
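The explanation can be mimicked with a toy model in Python. It follows the byte counting in the listing's comments: after the JMP empties the queue, the queue is assumed to hold the next `queue_bytes` bytes of code by the time the store reaches memory. That is a simplification, not a cycle-accurate simulation, and the function name is made up for the example.

```python
TARGET = 5  # offset of the immediate byte of MOV BH,6, counted from the JMP target

def run_queue_test(queue_bytes):
    """Return the value BH ends up with for a given prefetch queue size."""
    memory = bytearray([0x88, 0x3C,   # MOV [SI],BH  (2 bytes)
                        0x90,         # NOP
                        0x90,         # NOP
                        0xB7, 0x06])  # MOV BH,6     (opcode, immediate)
    queue = bytes(memory[:queue_bytes])  # bytes already prefetched
    memory[TARGET] = 8                   # the store: BH (8) written over the 6
    # The CPU executes prefetched bytes from the queue, the rest from memory.
    return queue[TARGET] if TARGET < len(queue) else memory[TARGET]

print(run_queue_test(4))  # 8 -> modification seen, 4-byte queue (8088)
print(run_queue_test(6))  # 6 -> stale copy executed, 6-byte queue (8086)
```

The JMP matters because it guarantees a known starting point: without the flush, the number of bytes already sitting in the queue when the store executes would depend on whatever ran before.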