
Concerns about performance w/Python, Psyco on Pentiums


Peter Hansen - Mar 5, 2003, 9:50:27 AM
(This is an exploratory inquiry to see if anyone has suggestions
for things I can experiment with, or thoughts on what I might
be doing wrong. If necessary, I will later be able to post more
concrete information.)

We've spiked a simulator for the Motorola 68HC12 microcontroller.
The code is pure Python and, other than an initial data load,
there is no I/O until the program completes, and no threads, no
GUI, and no extension modules or anything except pure Python.
(For those who know about these things, we're loading an .S19
file with an image of the HC12 code, and simulating the CPU core
to dispatch on individual opcodes).
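(For anyone unfamiliar with the format: an S19 "S1" record is hex text carrying a byte count, a 16-bit load address, data bytes, and a ones'-complement checksum. A minimal parser in modern Python, just to illustrate the record layout; this is a sketch, not our actual loader:)

```python
def parse_s1_record(line):
    """Parse one Motorola S-record of type S1 (16-bit address).

    Returns (address, data_bytes). Raises ValueError on a bad checksum.
    """
    assert line.startswith('S1')
    count = int(line[2:4], 16)                   # bytes that follow: addr + data + checksum
    body = bytes.fromhex(line[2:4 + count * 2])  # count byte plus payload
    addr = (body[1] << 8) | body[2]
    data, checksum = body[3:-1], body[-1]
    # checksum is the ones' complement of the low byte of the sum of
    # all preceding bytes (count, address, data)
    if (sum(body[:-1]) + checksum) & 0xFF != 0xFF:
        raise ValueError('bad S-record checksum')
    return addr, data
```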

On a Pentium 266 MMX, the simulator executes roughly 15000
simulated HC12 clock cycles per second. On a P3 866 MHz
chip, it can do about 85000 cycles per second. On a P4 2GHz
it can do about 115000 cycles per second. The real CPU
runs 8 million of these clock cycles per second, and we
were hoping for significantly better performance than we've
seen so far.

More interesting to me, however, is the poor relative performance
of the faster machines. I can believe the P3 866 should be about
5.5+ times faster than the old P266, but the P4 is only 35%
faster than it! (Note, the P3 is running Win98SE, the P4 is running
Redhat 7.3, both with their "vanilla" Python 2.2 installations,
in the case of Linux that being the RPM from python.org.)

I had high hopes for Psyco, so I installed it on the P266MMX
machine and took a first stab at binding the core functions
(basically the dispatch routine, plus all opcode functions)
but achieved only a 12% speedup, substantially below my hopes
and expectations based on others' reports using Psyco.

The core code consists of a loop which grabs a byte, does
a dictionary lookup to find the opcode function to call, and
calls it, passing in a reference to the CPU object. Other
than twiddling bits, doing a lot of "& 0xFFFF" operations, and
the odd addition or multiplication, not much is going on.
At a first approximation, I'd guess most of the time is going
into function calls (I'll profile at some point of course).
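(Boiled down, the core loop looks something like the following; a simplified stand-in with made-up names, not the real code:)

```python
class Op(object):
    """Stand-in opcode record: handler, instruction length, cycle cost."""
    def __init__(self, execute, length, cycles):
        self.execute, self.length, self.cycles = execute, length, cycles

class CPU(object):
    def __init__(self, memory, opcodes):
        self.memory, self.opcodes = memory, opcodes
        self.PC = 0
        self.cycles = 0
        self.A = 0

def run(cpu, max_cycles):
    # the hot loop: grab a byte, dict-lookup the handler, call it,
    # passing in a reference to the CPU object
    while cpu.cycles < max_cycles:
        op = cpu.opcodes[cpu.memory[cpu.PC]]
        op.execute(cpu)              # one Python function call per opcode
        cpu.PC += op.length
        cpu.cycles += op.cycles
```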

So my questions are these:

1. Any thoughts on why the Linux-based P4 2GHz machine is so
pathetically little faster than a machine it ought to be
twice as fast as? Is it because we're running code (maybe
both Linux and Python) that isn't optimized for Pentiums?
In that case, why is Win98SE so much faster? Does it self-
adjust for faster CPUs, installing optimized modules when
it detects a non-386 chip?

2. Any thoughts on why Psyco provides such a small speedup?
Is it likely I'm using it wrong? Or is it ineffective on
code where the bottleneck is Python function calls? Should
I consider Pyrex instead?

At the moment, it's actually "fast enough", so this isn't an
urgent concern. On the other hand, our intention is to use this
simulator to allow true test-driven development of embedded
system code (which, I believe, may well be a "first"), and
as we grow the number of tests we will doubtless become interested
in better performance. With the P3 machine we can run at 1/100
the native CPU speed, but I'd like to see something an order of
magnitude faster. In fact, my initial estimate was that on the
fast CPUs and with Psyco we could probably achieve parity (using a
2GHZ chip and Python to simulate a lowly 16MHz chip) but I'm losing
hope on that one.

Any input is welcome. By the way, it's my firm intention that
the simulator and (I hope) the test framework itself will be
released as an open-source project. (Note also, that the entire
simulator and framework is itself being test-driven, so it will
come complete with a full suite of unit and acceptance tests.)

Thanks.

-Peter

Gerhard Häring - Mar 5, 2003, 10:34:44 AM
Peter Hansen <pe...@engcorp.com> wrote:
> [hardware simulator in Python]

> On a Pentium 266 MMX, the simulator executes roughly 15000
> simulated HC12 clock cycles per second. On a P3 866 MHz
> chip, it can do about 85000 cycles per second. On a P4 2GHz
> it can do about 115000 cycles per second. The real CPU
> runs 8 million of these clock cycles per second, and we
> were hoping for significantly better performance than we've
> seen so far.
>
> More interesting to me, however, is the poor relative performance
> of the faster machines. I can believe the P3 866 should be about
> 5.5+ times faster than the old P266, but the P4 is only 35%
> faster than it!

It may well be that the real bottleneck of your simulator is memory access
and not CPU. In that case you could try rewriting your Python code so that
it uses contiguous memory blocks. If, for example, you use lists for
simulating the RAM, you could try to use a buffer object instead.
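(A sketch of the idea using a contiguous byte buffer; in current Python a `bytearray` is the closest equivalent, one block of real bytes rather than 65536 separate int objects. Names are illustrative:)

```python
# simulated 64K RAM as one contiguous block of bytes,
# instead of a 65536-element list of Python int objects
ram = bytearray(65536)

def write_byte(addr, value):
    ram[addr & 0xFFFF] = value & 0xFF

def read_word(addr):
    # big-endian 16-bit read, as on the HC12, wrapping at the 64K boundary
    addr &= 0xFFFF
    return (ram[addr] << 8) | ram[(addr + 1) & 0xFFFF]
```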

(Note, the P3 is running Win98SE, the P4 is running
> Redhat 7.3, both with their "vanilla" Python 2.2 installations,
> in the case of Linux that being the RPM from python.org.)

These use different compilers: Microsoft Visual C++ vs. (most likely) gcc
2.95.x.

It should be possible to squeeze more performance out of the P4 by using
gcc 3.2 or the Intel compiler instead. Expect a -5 to 15 % improvement
rather than an order of magnitude, though ;-)

Still this sounds like memory is the bottleneck ... You could try to look
at the cache misses for your process, if that is at all possible.

-- Gerhard

Anton Muhin - Mar 5, 2003, 11:22:56 AM
Array or Numeric modules might lead to better performance.

HTH,
Anton.

Stephen Kellett - Mar 5, 2003, 11:38:50 AM
>These use different compilers: Microsoft Visual C++ vs. (most likely) gcc
>2.95.x.

The Python on my Windows box is compiled using the Intel compiler, not
the Microsoft compiler.

Stephen
--
Stephen Kellett
Object Media Limited http://www.objmedia.demon.co.uk
RSI Information: http://www.objmedia.demon.co.uk/rsi.html

phil hunt - Mar 5, 2003, 1:00:51 PM
On Wed, 05 Mar 2003 09:50:27 -0500, Peter Hansen <pe...@engcorp.com> wrote:
>
>At the moment, it's actually "fast enough", so this isn't an
>urgent concern. On the other hand, our intention is to use this
>simulator to allow true test-driven development of embedded
>system code (which, I believe, may well be a "first"), and
>as we grow the number of tests we will doubtless become interested
>in better performance. With the P3 machine we can run at 1/100
>the native CPU speed, but I'd like to see something an order of
>magnitude faster.

I'm sure the thought of re-coding it in C/C++ has occurred to you.

Writing a machine code emulator isn't something I'd
consider Python to be the natural choice for. And I doubt if the C
code would be much (if any) more complex.

--
|*|*| Philip Hunt <ph...@cabalamat.org> |*|*|
|*|*| "Memes are a hoax; pass it on" |*|*|

Gerhard Haering - Mar 5, 2003, 4:05:13 PM
* Stephen Kellett <sn...@objmedia.demon.co.uk> [2003-03-05 16:38 +0000]:

> >These use different compilers: Microsoft Visual C++ vs. (most likely) gcc
> >2.95.x.
>
> The Python on my Windows box is compiled using the Intel compiler, not
> the Microsoft compiler.

It may well be, if you compiled it yourself. But if you're using the one from
python.org, it's compiled using MSVC6. Same for ActiveState last time I looked
at it.

-- Gerhard

Chris Liechti - Mar 5, 2003, 5:58:19 PM
ph...@cabalamat.org (phil hunt) wrote in
news:slrnb6ceqj...@cabalamat.uklinux.net:

> On Wed, 05 Mar 2003 09:50:27 -0500, Peter Hansen <pe...@engcorp.com>
> wrote:
>>
>>At the moment, it's actually "fast enough", so this isn't an
>>urgent concern. On the other hand, our intention is to use this
>>simulator to allow true test-driven development of embedded
>>system code (which, I believe, may well be a "first"), and
>>as we grow the number of tests we will doubtless become interested
>>in better performance. With the P3 machine we can run at 1/100
>>the native CPU speed, but I'd like to see something an order of
>>magnitude faster.
>
> I'm sure the thought of re-coding it in C/C++ has occurred to you.
>
> Writing a machine code emulator isn't something I'd
> consider Python to be the natural choice for. And I doubt if the C
> code would be much (if any) more complex.

hehe, it was my first thought to use python.... i wrote a simulator for an
MSP430 embedded 16 bit processor ;-)

my approach was very modular, so that the peripheral modules can be added
later and simulated too (e.g. the RAM-mapped multiplier works :-)
i also have different watches and i'm using the observer pattern for the
CPU registers, so that a GUI is easily built. so i don't expect too much
performance from it, and i have not profiled it...

chris

--
Chris <clie...@gmx.net>

Chris Liechti - Mar 5, 2003, 6:11:55 PM
Peter Hansen <pe...@engcorp.com> wrote in
news:3E660EB3...@engcorp.com:

> More interesting to me, however, is the poor relative performance
> of the faster machines. I can believe the P3 866 should be about
> 5.5+ times faster than the old P266, but the P4 is only 35%
> faster than it! (Note, the P3 is running Win98SE, the P4 is running
> Redhat 7.3, both with their "vanilla" Python 2.2 installations,
> in the case of Linux that being the RPM from python.org.)

the P4 does not seem to be equally efficient with its cycles as a P3 ;-)

see e.g. here, the second diagram and the text below it:
http://www.sysopt.com/reviews/pentium4/index4.html

in addition to that, different compilers (and options) might explain a
bigger difference. i assume that you used the same python versions and no
other CPU-hogging processes...

chris

--
Chris <clie...@gmx.net>

Stephen Kellett - Mar 5, 2003, 5:38:28 PM
In message <mailman.1046894831...@python.org>, Gerhard
Haering <gerhard...@gmx.de> writes

Nope, I downloaded it from Python.org.

The relevant function returns "MSC 32 bit (Intel)" - which I interpret
to mean the Intel compiler.

I've only ever heard the Microsoft compiler identified by people as
either "MSVC", "Developer Studio" or "Visual Studio", the last two being
inaccurate, but identifying the main product.

Thinking about this, I'm wrong, as Python22.lib is linked against
the Microsoft multithreaded CRT DLL, for which Intel would of course
provide their own equivalent. So that pretty much proves it's the
Microsoft compiler.

Peter Hansen - Mar 5, 2003, 9:47:10 PM (to Gerhard Häring)
Gerhard Häring wrote:
>
> Peter Hansen <pe...@engcorp.com> wrote:
> > [hardware simulator in Python]
> > More interesting to me, however, is the poor relative performance
> > of the faster machines. I can believe the P3 866 should be about
> > 5.5+ times faster than the old P266, but the P4 is only 35%
> > faster than it!
>
> It may well be that the real bottleneck of your simulator is memory access
> and not CPU. In this case you could try rewriting your Python code so that
> it uses continuous memory blocks. If for example you use lists for
> simulating the RAM, you could try to use the buffer module instead.

That's an interesting thought, although I might run into trouble with
the second idea. A buffer can only hold primitives, whereas the main
reason to simulate memory with a list is that you can then stick
"memory-like" objects in it to simulate the special registers of the
hardware, such as a UART register which interfaces to the serial port
on the PC, or a timer register which can trigger simulated interrupts
at the appropriate time.
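(Roughly what I mean, as a toy sketch; the register class and its address are made up:)

```python
class TimerRegister(object):
    """A 'memory-like' object living at one address of the memory list."""
    def __init__(self):
        self.count = 0
    def read(self):
        self.count = (self.count + 1) & 0xFF   # pretend the timer ticks on read
        return self.count
    def write(self, value):
        self.count = value & 0xFF

memory = [0] * 65536
memory[0x0046] = TimerRegister()   # hypothetical timer-count address

def read_byte(memory, addr):
    cell = memory[addr]
    if isinstance(cell, int):
        return cell          # plain RAM location
    return cell.read()       # special register: delegate to the object
```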

> Still this sounds like memory is the bottleneck ... You could try to look
> at the cache misses for your process, if that is at all possible.

Not knowing that much about Linux I'd be at a loss to do this on my
own. Any suggestions for where to look? (And I'm just going to
assume outright that it's infeasible under Win98. :-)

Thanks for the thought. I'll see what I can think of to prove or
disprove it.

-Peter

Peter Hansen - Mar 5, 2003, 9:48:03 PM
Anton Muhin wrote:
>
> Array or Numeric modules might lead to better performance.

Hmm... I don't see how. Unless you are thinking along the lines
of Gerhard's suggestions about using a buffer instead of a list
to simulate memory. I really think this is a small aspect of the
problem, though without profiling I guess I won't know.

-Peter

Peter Hansen - Mar 5, 2003, 9:56:56 PM
phil hunt wrote:
>
> On Wed, 05 Mar 2003 09:50:27 -0500, Peter Hansen <pe...@engcorp.com> wrote:
> >
> >At the moment, it's actually "fast enough", so this isn't an
> >urgent concern.
>
> I'm sure the thought of re-coding it in C/C++ has occurred to you.

That thought should be considered by anyone using Python and concerned
about performance, of course. Not that recoding most of it would
be of interest, but the core portions could certainly be done that way.
As I said, it's fast enough for now. I'd guess that we could use it
effectively for a year or so, and flesh out all the features before
we seriously consider that step. C is so much less malleable than Python.

> Writing a machine code emulator isn't something I'd
> consider Python to be the natural choice for. And I doubt if the C
> code would be much (if any) more complex.

You're probably quite right about the complexity... the relevant functions
are three or four lines apiece and quite simple, *especially* for a
language that understands bit-banging like C does. (That's part of
why I was surprised to see practically no speedup with Psyco. Maybe
it's just not targeted at that kind of code.)

As for the "natural choice", we think along different lines. For
me, Python is clearly the natural choice for something that inherently
needs to be interfaced with a test framework, and which needs a wide
variety of complex components (the special registers, for example,
and custom hardware used in our products) simulated as well. I
would consider C much later, but initially it definitely should be
Python, until the code stops or greatly slows its evolution.
An "emulator" in the traditional sense of course usually has performance
as a primary criterion. For our purposes, it's near the bottom of
the list, even if we still feel some concern.

(Also, I think you might take a different approach if you did this
stuff test-driven instead of the traditional way (not sure if you've
caught the TDD bug yet...). Python is *so* effective for this kind
of approach that it's hard to imagine being as productive with a
non-dynamic language.)

-Peter

Peter Hansen - Mar 5, 2003, 10:02:23 PM
Chris Liechti wrote:
>
> Peter Hansen <pe...@engcorp.com> wrote:
> > More interesting to me, however, is the poor relative performance
> > of the faster machines.
>
> the P4 does not seem to be equally efficient with its cycles as a P3 ;-)
>
> see e.g. here, the second diagram and the text below it:
> http://www.sysopt.com/reviews/pentium4/index4.html

Thanks for that link! I did spend some time googling but didn't
manage to find anything that showed things that clearly. The pages
I found vaguely gave the impression I should expect a 100% boost,
not just 25-30%. That explains part of the problem (but not the Psyco
part).

> in addition to that, different compilers (and options) might explain a
> bigger difference. i assume that you used the same python versions and no
> other CPU-hogging processes...

Same versions (2.2) and no load other than the test, definitely. I
think you hit the nail on the head with this one, as the test appears
to me (in spite of Gerhard's good suggestion) to be very nearly *just*
raw CPU usage.

I'll profile a bit to learn more... but does anyone have any
comments on the Psyco front?

-Peter

Gerhard Häring - Mar 6, 2003, 6:37:42 AM
Peter Hansen <pe...@engcorp.com> wrote:
> Gerhard Häring wrote:
>> Peter Hansen <pe...@engcorp.com> wrote:
>> > [hardware simulator in Python]
>> It may well be that the real bottleneck of your simulator is memory access
>> and not CPU. [...]

> Not knowing that much about Linux I'd be at a loss to do this on my
> own. Any suggestions for where to look?

A quick googling shows that there are several performance counter
patches out there. But maybe you can get the information you need out
of the standard profiling information from /proc/profile using
readprofile(1).

You'll need to add a kernel commandline option in your bootloader
(GRUB/LILO). This is an important part of the man page:

"""
To enable profiling, the kernel must be rebooted, because
no profiling module is available, and it wouldn't be easy
to build. To enable profiling, you can specify "profile=2"
(or another number) on the kernel commandline. The number
you specify is the two-exponent used as profiling step.
"""

I've never done anything of the above myself, though ;-)

-- Gerhard

Gerhard Häring - Mar 6, 2003, 6:49:10 AM
Peter Hansen <pe...@engcorp.com> wrote:
> Gerhard Häring wrote:
>> Still this sounds like memory is the bottleneck ... You could try to
>> look at the cache misses for your process, if that is at all possible.
>
> Not knowing that much about Linux I'd be at a loss to do this on my
> own. Any suggestions for where to look? (And I'm just going to
> assume outright that it's infeasible under Win98. :-)

The tools I proposed do little more than query CPU-specific performance
registers. So there is no reason why this shouldn't be possible, even on
Win98. It may well be much easier on Windows 98, because you need
special privileges to query these registers (on Linux, only the kernel can
do so). Being the toy it is, I'd not be surprised if even a user-space
program could query the performance registers on Windows 9x/ME :)

-- Gerhard

Michael Hudson - Mar 6, 2003, 8:16:26 AM
Peter Hansen <pe...@engcorp.com> writes:

> > Still this sounds like memory is the bottleneck ... You could try to look
> > at the cache misses for your process, if that is at all possible.
>
> Not knowing that much about Linux I'd be at a loss to do this on my
> own. Any suggestions for where to look?

Googling for cachegrind might lead to enlightenment. It comes with
valgrind. Interpreting its output is probably something of a black
art.

Cheers,
M.

--
All parts should go together without forcing. You must remember that
the parts you are reassembling were disassembled by you. Therefore,
if you can't get them together again, there must be a reason. By all
means, do not use a hammer. -- IBM maintenance manual, 1925

Anders J. Munch - Mar 6, 2003, 11:16:41 AM
"Peter Hansen" <pe...@engcorp.com> wrote:
>
> I'll profile a bit to learn more... but does anyone have any
> comments on the Psyco front?

When working with partial evaluators, sometimes dumbing down code can
make it faster, because it makes the code more accessible to analysis.

You were switching by means of a dict of functions, right? An elegant
and natural Python solution, but my bet is that psyco can't see
through it.

Try replacing the dict with a long if-elif-elif-elif-chain ordered by
expected execution frequency. Or better yet, a tree of if statements
performing an inlined binary search, if you don't mind the
maintenance.

Straight interpretation will be dog-slow of course (at least for the
simple elif chain), but there is the distinct possibility that psyco
will have a field day.
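(Concretely, the suggested rewrite trades the dict for something like this; the handler names and the third opcode value are made up:)

```python
def dispatch(opcode_byte):
    # ordered by expected execution frequency, most common first;
    # a static chain like this is easier for a specializer to analyze
    # than a dict of first-class function objects
    if opcode_byte == 0xCC:
        return 'LDD'       # stand-in for calling execute_LDD(cpu)
    elif opcode_byte == 0xA7:
        return 'NOP'
    elif opcode_byte == 0x04:
        return 'DBNE'      # hypothetical third opcode
    else:
        raise ValueError('unimplemented opcode $%02X' % opcode_byte)
```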

not-that-I-really-know-anything-about-psyco-ly y'rs, Anders


Skip Montanaro - Mar 6, 2003, 11:03:16 AM
Gerhard> Peter Hansen <pe...@engcorp.com> wrote:
>> Not knowing that much about Linux I'd be at a loss to do this on my
>> own. Any suggestions for where to look? (And I'm just going to
>> assume outright that it's infeasible under Win98. :-)

Gerhard> The tools I proposed do little else than query CPU-specific
Gerhard> performance registers.

Jeremy Hylton created a patch for the Python virtual machine (floating
around the SF patch subsystem somewhere) which does something similar for
just Python programs. It gives you an idea how many clock cycles each
virtual machine instruction consumes. You might find it useful.

Skip

Michael Hudson - Mar 6, 2003, 11:51:57 AM
"Anders J. Munch" <ande...@dancontrol.dk> writes:

> You were switching by means of a dict of functions, right? An elegant
> and natural Python solution, but my bet is that psyco can't see
> through it.
>
> Try replacing the dict with a long if-elif-elif-elif-chain ordered by
> expected execution frequency. Or better yet, a tree of if statements
> performing an inlined binary search, if you don't mind the
> maintenance.

You might also try a list of functions rather than a dict. I think
psyco knows more about lists than dicts. But ICBW.

Cheers,
M.

--
My hat is lined with tinfoil for protection in the unlikely event
that the droid gets his PowerPoint presentation working.
-- Alan W. Frame, alt.sysadmin.recovery

Stephen Kellett - Mar 6, 2003, 11:42:17 AM
>do). Being the toy it is, I'd not be surprised if even a user-space program
>could query performance registers on Windows 9x/ME :)

I can't comment for Win9x, but Win NT/2000/XP programs can happily call
cpuid and rdtsc without causing any exceptions. These are user mode
programs, not drivers sitting in kernel space.

Peter Hansen - Mar 6, 2003, 12:26:05 PM
Michael Hudson wrote:
>
> "Anders J. Munch" <ande...@dancontrol.dk> writes:
>
> > You were switching by means of a dict of functions, right? An elegant
> > and natural Python solution, but my bet is that psyco can't see
> > through it.
> >
> > Try replacing the dict with a long if-elif-elif-elif-chain ordered by
> > expected execution frequency. Or better yet, a tree of if statements
> > performing an inlined binary search, if you don't mind the
> > maintenance.
>
> You might also try a list of functions rather than a dict. I think
> psyco knows more about lists than dicts. But ICBW.

Ah, thanks. My lateral-thinking cap must have been off. Although
dicts are fast in Python in general, in this case the items in the
list are tightly constrained (opcodes, from 0 to 255) and a list
is quite feasible. I'll give it a shot - thanks! :)
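(For the record, the change is essentially this; string stand-ins here instead of the real Opcode objects:)

```python
opcodes_dict = {0xA7: 'NOP', 0xCC: 'LDD'}   # stand-ins for Opcode instances

# the one-line change: a 256-slot list indexed directly by the opcode byte,
# so lookup is list[int] instead of a dict hash
opcodes_list = [None] * 256
for code, op in opcodes_dict.items():
    opcodes_list[code] = op
```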

-Peter

Michael Hudson - Mar 6, 2003, 12:39:51 PM
Peter Hansen <pe...@engcorp.com> writes:

> Michael Hudson wrote:
> >
> > You might also try a list of functions rather than a dict. I think
> > psyco knows more about lists than dicts. But ICBW.
>
> Ah, thanks. My lateral-thinking cap must have been off. Although
> dicts are fast in Python in general, in this case the items in the
> list are tightly constrained (opcodes, from 0 to 255) and a list
> is quite feasible. I'll give it a shot - thanks! :)

In which case, I'd expect lists to be faster than dicts w/o psyco too,
'cause of the special casing for list[int] in ceval.c.

Cheers,
M.

--
Finding a needle in a haystack is a lot easier if you burn down
the haystack and scan the ashes with a metal detector.
-- the Silicon Valley Tarot (another one nicked from David Rush)

Peter Hansen - Mar 6, 2003, 2:35:14 PM
Michael Hudson wrote:
>
> Peter Hansen <pe...@engcorp.com> writes:
>
> > Michael Hudson wrote:
> > >
> > > You might also try a list of functions rather than a dict. I think
> > > psyco knows more about lists than dicts. But ICBW.
> >
> > Ah, thanks. My lateral-thinking cap must have been off. Although
> > dicts are fast in Python in general, in this case the items in the
> > list are tightly constrained (opcodes, from 0 to 255) and a list
> > is quite feasible. I'll give it a shot - thanks! :)
>
> In which case, I'd expect lists to be faster than dicts w/o psyco too,
> 'cause of the special casing for list[int] in ceval.c.

Interesting... I tried the (one-line!) change to use a list
instead of a dict... below are some results. For reference, for those
still interested, here are a few snippets showing sample code, with some
extraneous stuff removed for clarity:

------------------

class Opcode:
    opcodes = {}    # or use [None] * 256 to store in a list

    def __init__(self, code, name, length, cycles, mode):
        self.code = code
        self.name = name
        self.length = length
        self.execute = globals()['execute' + self.name]
        self.cycles = cycles
        self.mode = mode

        self.opcodes[self.code] = self


# example opcode: "no operation"
def executeNOP(cpu):
    pass

# example opcode: "load D register"
def executeLDD(cpu):
    cpu.setD(cpu.readUword(cpu.effectiveAddress))
    cpu.CCR_N = bool(cpu.D & 0x8000)
    cpu.CCR_Z = (cpu.D == 0)

# create some opcodes, storing references in class
Opcode(0xA7, 'NOP', 1, 1, 'INH')
Opcode(0xCC, 'LDD', 3, 2, 'IMM')

class Cpu:
    ....
    def __init__(self, name):
        self.name = name

        self.setCCR('sxhinzvc')
        self.D = 0
        self.A = 0
        self.B = 0
        self.X = 0
        self.Y = 0
        self.SP = 0
        self.PC = 0
        self.memory = [0] * 65536
        self.opcodes = Opcode.opcodes


    def step(self):
        opcodeByte = self.readByte(self.PC)
        try:
            opcode = self.opcodes[opcodeByte]
        except KeyError:
            raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
                self.readMemory(self.PC + 1, 5)))
        else:
            deltaPC, self.effectiveAddress = self.resolveAddress(opcode.mode)
            newPC = opcode.execute(self)

            if newPC is not None:
                self.PC = newPC
            else:
                self.PC += deltaPC + opcode.length

            self.cycles += opcode.cycles
    ...


Summarizing, opcodes are instantiated objects tracked in a class
variable in the Opcode class. They are dispatched through their
execute attribute, which is bound on instantiation to the global
function of the appropriate name. The Cpu class, instantiated
once, has a step() method that is called repeatedly from a run()
method. It grabs the opcode byte from memory, looks it up in
the single Opcode class list/dict, and calls the execute function,
passing itself as a parameter.

Here are the results, this time for a P3 730MHz machine.
(Results are repeatable +/- about 500Hz.)

Original code w/dict, no Psyco: 69350 Hz (A= baseline)
Original code w/dict, use Psyco: 94250 Hz (B= A + 36%)

Variation with list, no Psyco: 70725 Hz (C = A + 2%)
Original code w/list, use Psyco: 96190 Hz (D = C + 36%, or A + 39%)

So the switch to use a list provides a minimal (2%) speedup,
while Psyco, properly used (*), manages to speed either approach
up by 36%.

* I apparently screwed up the first time I used Psyco, or maybe
it doesn't work nearly as well on a P266MMX, because on this machine
it's doing much better than 12%. I'm not willing to claim that
I'm actually using it "correctly" yet, since this is time #2 for me
using Psyco...

Tentative conclusion: although it's at the bottom of the range of
claimed improvements from Psyco, I'll take my 36% and run. I'll
switch to lists, because dicts have zero advantages in this case,
though the speedup is minor.

I'll take Chris L's research as conclusive about the somewhat
dubious value of a superfast Pentium 4 chip compared to the lowly
P3 at a lower clock rate, and stop worrying about it.

And I doubt I'll bother playing around any more without
(a) deciding the thing is "too slow" (which it's not, yet) and
(b) actually finishing the code, and profiling it as one
always should before optimizing. ;-)

I very much appreciate all the input received to date, and those
responses still to come.

Cheers,
-Peter

Pedro Rodriguez - Mar 7, 2003, 2:48:55 AM
On Thu, 06 Mar 2003 20:35:14 +0100, Peter Hansen wrote:

> def step(self):
> opcodeByte = self.readByte(self.PC)
> try:
> opcode = self.opcodes[opcodeByte]
> except KeyError:
> raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
> self.readMemory(self.PC + 1, 5)))
> else:

> ...

if you are using a dictionary for opcodes, wouldn't you gain
time by dropping the try/except clause and instead going straight
to self.opcodes.get(...) and checking the returned value
against None?
import time

l = {}
for i in range(256):
    l[i] = i

t0 = time.time()
for i in range(1000000):
    x = l.get(i)
    if x is None:
        pass
    else:
        pass
t1 = time.time()
print "get", t1 - t0

t0 = time.time()
for i in range(1000000):
    try:
        x = l[i]
    except KeyError:
        pass
    else:
        pass
t1 = time.time()
print "try", t1 - t0

$ time python t.py
get 2.30093002319
try 13.6490590572

real 0m15.995s
user 0m15.650s
sys 0m0.110s


Pedro

Andrew Bennetts - Mar 7, 2003, 3:24:04 AM
On Fri, Mar 07, 2003 at 08:48:55AM +0100, Pedro Rodriguez wrote:
> On Thu, 06 Mar 2003 20:35:14 +0100, Peter Hansen wrote:
>
> >     def step(self):
> >         opcodeByte = self.readByte(self.PC)
> >         try:
> >             opcode = self.opcodes[opcodeByte]
> >         except KeyError:
> >             raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
> >                 self.readMemory(self.PC + 1, 5)))
> >         else:
> >             ...
>
> if you are using a dictionnary for opcodes, wouldn't you gain
> time by not putting a try/except clause but by going straight
> with self.opcodes.get(...) and checking the returned value
> against None.

Depends on how frequently the simulator attempts to look up a non-existent
opcode. My guess is that it would only happen rarely, in which case
try/except will be faster.

As always, the only way to know for sure is to try it on the actual code and
time it :)
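(One way to time exactly that with the timeit module, on keys that always hit; the numbers you get will vary by machine:)

```python
import timeit

setup = "table = dict((i, i) for i in range(256))"

# dict.get on a key that always exists
t_get = timeit.timeit("x = table.get(0xA7)", setup, number=100000)

# try/except on the same always-present key: no exception is ever raised,
# so no exception-handling work is actually done
t_try = timeit.timeit(
    "try:\n"
    "    x = table[0xA7]\n"
    "except KeyError:\n"
    "    pass",
    setup, number=100000)
```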

-Andrew.


Pedro - Mar 7, 2003, 4:47:13 AM

"Andrew Bennetts" <andrew-p...@puzzling.org> wrote in message
news:mailman.104702469...@python.org...

> On Fri, Mar 07, 2003 at 08:48:55AM +0100, Pedro Rodriguez wrote:
> > On Thu, 06 Mar 2003 20:35:14 +0100, Peter Hansen wrote:
> >
> > >     def step(self):
> > >         opcodeByte = self.readByte(self.PC)
> > >         try:
> > >             opcode = self.opcodes[opcodeByte]
> > >         except KeyError:
> > >             raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
> > >                 self.readMemory(self.PC + 1, 5)))
> > >         else:
> > >             ...
> >
> > if you are using a dictionnary for opcodes, wouldn't you gain
> > time by not putting a try/except clause but by going straight
> > with self.opcodes.get(...) and checking the returned value
> > against None.
>
> > Depends on how frequently the simulator attempts to look up a non-existent
> > opcode. My guess is that it would only happen rarely, in which case
> > try/except will be faster.
>
You're totally right, and replacing the 'i' reference in my example with '1'
(which always succeeds) proves your point (timing on Windows this time):
get 1.89100003242
try 0.84399998188

And getting rid of the try/except apparently doesn't buy much
(timed at approx 0.7, which is still some 20% gain). If the exception case
MUST never happen (except while debugging) and the lookup frequency is
very high, it may be worth moving the try/except clause around the 'loop'
to catch some fatal errors.

I would even consider having self.opcodes[...] return a
tuple containing (mode, length, ...) instead of going through attribute
lookup for opcode.xxx.
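(That is, something along these lines; illustrative stand-ins only:)

```python
# instead of an Opcode object read via per-field attribute lookups,
# store a plain tuple per opcode and unpack it in one statement
opcodes = [None] * 256
opcodes[0xCC] = ('IMM', 3, 2, 'LDD')   # (mode, length, cycles, handler) stand-in

mode, length, cycles, execute = opcodes[0xCC]   # one indexed load, one unpack
```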

> As always, the only way to know for sure is to try it on the actual code
and
> time it :)
>

Yes.

> -Andrew.
>
Thanks for correcting my point.

Pedro

CezaryB - Mar 7, 2003, 5:26:38 AM
On 3/6/03 8:35 PM, Peter Hansen wrote:
> Michael Hudson wrote:
>
>>Peter Hansen <pe...@engcorp.com> writes:

> Interesting... I tried the (one-line!) change to use a list
> instead of a dict... below are some results. For reference, for those
> still interested, here are a few snippets showing sample code, with some
> extraneous stuff removed for clarity:

[...]


> class Cpu:
>     ....
>     def __init__(self, name):
>         self.name = name
>
>         self.setCCR('sxhinzvc')
>         self.D = 0
>         self.A = 0
>         self.B = 0
>         self.X = 0
>         self.Y = 0
>         self.SP = 0
>         self.PC = 0
>         self.memory = [0] * 65536
>         self.opcodes = Opcode.opcodes

Tuples are faster than lists. Try this:
self.opcodes = tuple( Opcode.opcodes )

>     def step(self):
>         opcodeByte = self.readByte(self.PC)
>         try:
>             opcode = self.opcodes[opcodeByte]
>         except KeyError:
>             raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
>                 self.readMemory(self.PC + 1, 5)))
>         else:
>             deltaPC, self.effectiveAddress = self.resolveAddress(opcode.mode)
>             newPC = opcode.execute(self)
>
>             if newPC is not None:
>                 self.PC = newPC
>             else:
>                 self.PC += deltaPC + opcode.length
>
>             self.cycles += opcode.cycles
>     ...

Inline "step" in your main loop, and use local variables instead of self.PC, self.opcodes,
self.resolveAddress. It should help Psyco.
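A minimal sketch of the hoisting CezaryB suggests (the Cpu layout here is a toy stand-in, not Peter's actual class): alias the hot attributes to locals once, so each loop iteration uses fast local-variable access instead of repeated `self.<attr>` lookups.

```python
class Cpu:
    def __init__(self):
        self.PC = 0
        self.memory = [0] * 256

    def run(self, steps):
        memory = self.memory   # one attribute lookup, reused every pass
        pc = self.PC
        for _ in range(steps):
            opcode_byte = memory[pc & 0xFF]   # stand-in for real dispatch
            pc += 1
        self.PC = pc           # write the register back once, after the loop

cpu = Cpu()
cpu.run(1000)
```

Locals are looked up by array index in the frame, while `self.PC` goes through the instance dictionary on every access, which is why this tends to help tight loops.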

CezaryB


Andrew Bennetts

unread,
Mar 7, 2003, 6:05:26 AM3/7/03
to

Regarding inlining, I thought psyco would do it automatically?

Hmm, another thought... does this code use __slots__? Do they make any
measurable difference to performance? (I'm wondering if they improve
locality and memory footprint enough to have noticeable cache benefits.)
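For reference, `__slots__` replaces the per-instance `__dict__` with fixed descriptors, which is where any footprint win would come from. A quick check with toy classes (not the simulator's):

```python
# With __slots__, instances carry no __dict__ at all; attributes live in
# fixed slots, so each instance is smaller.

class Regular:
    def __init__(self):
        self.PC = 0
        self.A = 0

class Slotted:
    __slots__ = ('PC', 'A')
    def __init__(self):
        self.PC = 0
        self.A = 0

r, s = Regular(), Slotted()
print(hasattr(r, '__dict__'), hasattr(s, '__dict__'))  # True False
```

Whether the smaller instances translate into measurable cache benefits would, as Andrew says, have to be timed on the actual code.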

-Andrew.


Peter Hansen

unread,
Mar 7, 2003, 8:34:30 AM3/7/03
to
Andrew Bennetts wrote:
>
> On Fri, Mar 07, 2003 at 08:48:55AM +0100, Pedro Rodriguez wrote:
> > On Thu, 06 Mar 2003 20:35:14 +0100, Peter Hansen wrote:
> >
> > > def step(self):
> > > opcodeByte = self.readByte(self.PC)
> > > try:
> > > opcode = self.opcodes[opcodeByte]
> > > except KeyError:
> > > raise UnimplementedOpcode('$%02X %s' % (opcodeByte,
> > > self.readMemory(self.PC + 1, 5)))
> > > else:
> > > ...
> >
> > if you are using a dictionary for opcodes, wouldn't you gain
> > time by not putting in a try/except clause but by going straight
> > with self.opcodes.get(...) and checking the returned value
> > against None.
>
> Depends on how frequently the simulator attempts to look up a non-existent
> opcode. My guess is that it would only happen rarely, in which case
> try/except will be faster.

Andrew is quite right. Naturally, unimplemented opcodes are in effect
*never* encountered in valid code, and that feature is there for now
only because 34 out of about 200 opcodes have been implemented.

-Peter

Peter Hansen

unread,
Mar 7, 2003, 8:40:41 AM3/7/03
to
CezaryB wrote:
>
> Tuples are faster than lists. Try this:
> self.opcodes = tuple( Opcode.opcodes )

An interesting point, if true. I'm surprised by that. I would
have thought that the difference, if any, would be so small as
to be immeasurable in anything but code which only did tuple
lookups... I'll have to experiment to see, if only to prove
your point to myself.

> > def step(self):


>
> > if newPC is not None:
> > self.PC = newPC
> > else:
> > self.PC += deltaPC + opcode.length
> >
> > self.cycles += opcode.cycles
> > ...
> Inline "step" in your main loop, and use local variables instead of self.PC, self.opcodes,
> self.resolveAddress. It should help Psyco.

That's a possibility, of course, although then the step() method is
not available for use when I actually want it available, unless I
modify run() (which currently calls step()) so I can ask it to
stop after only one instruction (and then rewrite step() to call
run() with that flag, of course).

If performance were really that important, these changes would be
exactly the route I would consider taking. Performance isn't that
important, however, and maintainability is far more important than
the smallish gains that would be found so far.

I appreciate the suggestions, though, and may play with a few just
for kicks.

The real issue regarding Psyco for me was simply what I could get
*without* hand-tuning the code first. In my mind, Psyco would
provide negligible value if I could get significant speed gains
only by inlining code, using locals, and so forth. What's needed
is something that lets me write the code in the most straight-
forward maintainable fashion, without much regard to performance
(except in basic algorithmic terms), and then slap in a bind()
or two and see a big speedup. If that's not feasible, then I'm
personally happy with staying vanilla Python as it's lots
"fast enough" in even this case.

But tuples faster than lists? Hmmm.... ;-)

-Peter

Michael Hudson

unread,
Mar 7, 2003, 9:04:31 AM3/7/03
to
CezaryB <cez...@bigfoot.com> writes:

> Tuples are faster than lists. Try this:

You sure about that?

/>> def f(l, i, n):
|.. from time import time
|.. r = xrange(n)
|.. T = time()
|.. for o in r:
|.. pass
|.. t0 = time() - T
|.. T = time()
|.. for o in r:
|.. l[i] # 1
|.. l[i] # 2
|.. l[i] # 3
|.. l[i] # 4
|.. l[i] # 5
|.. l[i] # 6
|.. l[i] # 7
|.. l[i] # 8
|.. l[i] # 9
|.. l[i] # 10
|.. t1 = time() - T - t0
|.. return t1/n/10
\__
->> listtime = f(range(1000), 500, int(1e6))
->> tupletime = f(tuple(range(1000)), 500, int(1e6))
->> print listtime, tupletime, tupletime/listtime - 1.0
2.08104205132e-07 2.37756896019e-07 0.142489628543

So simple subscription seems to be ~14% slower for tuples.

This is Python from CVS of about a week ago, fwiw.

Cheers,
M.

--
Have you considered downgrading your arrogance to a reasonable level?
-- Erik Naggum, comp.lang.lisp, to yet another C++-using troll

Michael Hudson

unread,
Mar 7, 2003, 9:06:37 AM3/7/03
to
Peter Hansen <pe...@engcorp.com> writes:

> Tentative conclusion: although it's at the bottom of the range of
> claimed improvements from Psyco, I'll take my 36% and run. I'll
> switch to lists, because dicts have zero advantages in this case,
> though the speedup is minor.

Have you tried Python 2.3a2? You might like what you see (pymalloc is
probably the main reason for the improvement, but there have been
many, many little performance tweaks).

Cheers,
M.

--
In short, just business as usual in the wacky world of floating
point <wink>. -- Tim Peters, comp.lang.python

Anders J. Munch

unread,
Mar 7, 2003, 9:28:06 AM3/7/03
to
"Michael Hudson" <m...@python.net> wrote:
> CezaryB <cez...@bigfoot.com> writes:
>
> > Tuples are faster than lists. Try this:
>
> You sure about that?
[...]

> So simple subscription seems to be ~14% slower for tuples.

Surprised to see any difference at all.
I would expect what's faster for tuples is creating them, not using them.

- Anders


Tim Peters

unread,
Mar 7, 2003, 9:42:59 AM3/7/03
to
[CezaryB]

>> Tuples are faster than lists. Try this:

[Michael Hudson]
> You sure about that?
> ... [timing code] ...


> So simple subscription seems to be ~14% slower for tuples.
>
> This is Python from CVS of about a week ago, fwiw.

Note that the eval loop special-cases list, but not tuple, subscripts, in
the BINARY_SUBSCR opcode. list[i] is done inline, tuple[i] ends up going
thru the generic PyObject_GetItem. If there's a surprise here, then, it's
that tuple[i] isn't a *lot* slower than list[i].
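One way to measure the list-vs-tuple subscript gap Tim describes on a current interpreter (the numbers will differ from the 2003 CVS build, and the ceval special-casing may have changed since):

```python
# Time the same subscript against a list and a tuple of equal size.
import timeit

lst = list(range(1000))
tup = tuple(lst)

t_list = timeit.timeit('seq[500]', globals={'seq': lst}, number=500_000)
t_tuple = timeit.timeit('seq[500]', globals={'seq': tup}, number=500_000)
print('list: %.4fs  tuple: %.4fs' % (t_list, t_tuple))
```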


Michael Hudson

unread,
Mar 7, 2003, 10:08:16 AM3/7/03
to
"Anders J. Munch" <ande...@dancontrol.dk> writes:

> "Michael Hudson" <m...@python.net> wrote:
> > CezaryB <cez...@bigfoot.com> writes:
> >
> > > Tuples are faster than lists. Try this:
> >
> > You sure about that?
> [...]
> > So simple subscription seems to be ~14% slower for tuples.
>
> Surprised to see any difference at all.

list[int] is special cased in Python/ceval.c, tuple[int] goes through
a (C) function call. Both are pretty quick -- 4-and-change million of
either a second -- so this is unlikely to be the bottleneck.

> I would expect what's faster for tuples is creating them, not using them.

Maybe a little, as creating a list involves two allocations. Timing
anything that so obviously gets the memory hierarchy into play is more
effort than I can be bothered with today.

Cheers,
M.

--
When physicists speak of a TOE, they don't really mean a theory
of *everything*. Taken literally, "Everything" covers a lot of
ground, including biology, art, decoherence and the best way to
barbecue ribs. -- John Baez, sci.physics.research

Cezary Biernacki

unread,
Mar 7, 2003, 11:17:14 AM3/7/03
to
On 3/7/03 4:08 PM, Michael Hudson wrote:
> "Anders J. Munch" <ande...@dancontrol.dk> writes:
>
>
>>"Michael Hudson" <m...@python.net> wrote:
>>
>>>CezaryB <cez...@bigfoot.com> writes:
>>>
>>>
>>>>Tuples are faster than lists. Try this:
>>>
>>>You sure about that?
>>[...]

I was :-(. I tested it years ago, but it seems I am wrong now.
It looks like lists are 10% faster than tuples [Python 2.2.2].

Sorry for the bad advice.

CB


John Machin

unread,
Mar 7, 2003, 10:57:11 AM3/7/03
to
Pedro Rodriguez <pedro_r...@club-internet.fr> wrote in message news:<b49iot$r78$1...@s1.read.news.oleane.net>...

> On Thu, 06 Mar 2003 20:35:14 +0100, Peter Hansen wrote:
>
> > def step(self):
> > opcodeByte = self.readByte(self.PC)
> > try:
> > opcode = self.opcodes[opcodeByte]
> > except KeyError:
> > raise UnimplementedOpcode('$%02X %s' % (opcodeByte,

> if you are using a dictionary for opcodes, wouldn't you gain
> time by not putting in a try/except clause but by going straight
> with self.opcodes.get(...) and checking the returned value
> against None.

Rule 1: optimize for frequent events, not for infrequent events. In
emulating a CPU, which is more frequent, valid opcode or invalid
opcode? In your get-versus-try example, you do "valid" 256 times and
"invalid" (1000000 - 256) times.

Below is a slightly more realistic test, which seems to indicate that
the try-except caper is about 50% faster than dict.get() if you never
hit the "except" clause.

Cheers,
John

8<------

import time

def getter(q):
    l = {}
    for i in range(256):
        l[i] = i
    t0 = time.time()
    npass = nfail = 0
    lget = l.get
    for i in xrange(1000000):
        x = lget(q)
        if x is None:
            nfail += 1
        else:
            npass += 1
    t1 = time.time()
    print "get", q, t1 - t0, npass, nfail

def tryer(q):
    l = {}
    for i in range(256):
        l[i] = i
    t0 = time.time()
    npass = nfail = 0
    for i in xrange(1000000):
        try:
            x = l[q]
        except KeyError:
            nfail += 1
        else:
            npass += 1
    t1 = time.time()
    print "try", q, t1 - t0, npass, nfail

for q in (0, 1, 255, -1, 256, 257):
    getter(q)
    tryer(q)

8<-------
=== output (Python 2.2 Windows 32-bit version, on a 1.4 GHz Athlon)

get 0 1.30199992657 1000000 0
try 0 0.871000051498 1000000 0
get 1 1.27199995518 1000000 0
try 1 0.881000041962 1000000 0
get 255 1.36199998856 1000000 0
try 255 0.950999975204 1000000 0
get -1 1.31200003624 0 1000000
try -1 7.77199995518 0 1000000
get 256 1.32100009918 0 1000000
try 256 7.77199995518 0 1000000
get 257 1.30200004578 0 1000000
try 257 7.79100000858 0 1000000

A. Lloyd Flanagan

unread,
Mar 7, 2003, 3:06:42 PM3/7/03
to
Tim Peters <tim...@comcast.net> wrote in message news:<mailman.1047048253...@python.org>...

>
> Note that the eval loop special-cases list, but not tuple, subscripts, in
> the BINARY_SUBSCR opcode. list[i] is done inline, tuple[i] ends up going
> thru the generic PyObject_GetItem. If there's a surprise here, then, it's
> that tuple[i] isn't a *lot* slower than list[i].

Mostly out of curiosity, is there a reason for that difference, or
did it sort of 'just work out that way'?

Peter Hansen

unread,
Mar 7, 2003, 5:22:02 PM3/7/03
to
Michael Hudson wrote:
>
> Peter Hansen <pe...@engcorp.com> writes:
>
> > Tentative conclusion: although it's at the bottom of the range of
> > claimed improvements from Psyco, I'll take my 36% and run. I'll
> > switch to lists, because dicts have zero advantages in this case,
> > though the speedup is minor.
>
> Have you tried Python 2.3a2? You might like what you see (pymalloc is
> probably the main reason for the improvement, but there have been
> many, many little performance tweaks).

I feel bad for not having tried the alpha on more stuff, so I took
your suggestion and started running a variety of things through it.
(No bugs yet! :-)

Here is a summary of the results, this time from my P266MMX machine.
The units are "pseudo-Hz" which are meaningless in real-world terms,
but the performance relative to baseline is shown too.

Python 2.2 results
------------------
Py2.2, dict, no Psyco: 15198, 100% (== 2.2 baseline)
Py2.2, list, no Psyco: 15320, 101%
Py2.2, tuple, no Psyco: 14918, 98%
Py2.2, list, Psyco: 18629, 123%

For Python 2.3 (values relative to 2.2 baseline)
--------------
Py2.3, dict, no Psyco: 17900, 118% (== 2.3 baseline)
Py2.3, list, no Psyco: 17923, 118%
Py2.3, tuple, no Psyco: 17149, 113% (or 96% of 2.3 baseline)

(Of course, Psyco is not released for 2.3 yet (unless it's in CVS)
so I don't have those results.)

I don't think I expected a large increase with 2.3, as the program
pre-allocates almost everything and does little, except creating
the odd integer value beyond 100 as the program runs.

I also tried inlining the step() method in the run() method,
as CezaryB suggested. (Using locals in the way he suggested is not
possible, as the instance variables are required by the various
methods that are called.)

Py2.2, list, inlined: 16468, 108%
Py2.3, list, inlined: 19074, 126%
Py2.2, list, inlined, Psyco: 9963, 66% (!)

Not sure what that last one indicates, except that under certain
circumstances (perhaps when used by someone who hasn't even read
its documentation? :-), Psyco can produce *worse* results.

-Peter

Tim Peters

unread,
Mar 7, 2003, 5:32:45 PM3/7/03
to
[Tim]

> Note that the eval loop special-cases list, but not tuple,
> subscripts, in the BINARY_SUBSCR opcode. list[i] is done inline, tuple[i]
> ends up going thru the generic PyObject_GetItem. If there's a surprise
> here, then, it's that tuple[i] isn't a *lot* slower than list[i].

[A. Lloyd Flanagan]


> Mostly out of curiousity, is there a reason for that difference, or
> did it sort of 'just work out that way'?

I'm not sure what you're asking. BINARY_SUBSCR is generated in response to
any subexpression that looks like

expression1[expression2]

and all such subexpressions can be evaluated by PyObject_GetItem(). The
special case for when expression1 turns out (at runtime) to be a list, and
expression2 turns out (at runtime) to be an int, is (of course) deliberately
aimed at speeding list indexing. Tuple indexing is surely much rarer, and
every special case slows down all cases it doesn't apply to (e.g., the
special case for list indexing slows down tuple indexing, by the amount of
time it takes to determine that it's not the special case of list indexing).


Skip Montanaro

unread,
Mar 7, 2003, 5:41:24 PM3/7/03
to

Peter> Py2.2, list, inlined: 16468, 108%
Peter> Py2.3, list, inlined: 19074, 126%
Peter> Py2.2, list, inlined, Psyco: 9963, 66% (!)

Peter> Not sure what that last one indicates, except that under certain
Peter> circumstances (perhaps when used by someone who hasn't even read
Peter> its documentation? :-), Psyco can produce *worse* results.

How exactly are you using Psyco? If you let it figure out which routines to
specialize, it generally does a poor job. If you know where your hotspots
are, you can tell it to just specialize those routines.

You might want to play around with the pystone benchmark a little. It's
been a while since I ran the experiment, but I believe the critical routine
is pystone.Proc8. If you tell it to specialize just that one routine, the
speedup is much better than if you simply call psyco.jit() and let Psyco
figure out what to do.
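For illustration, selective binding looks roughly like this (Psyco only ever ran on Python 2, so the import is guarded here, and `hot_loop` is a hypothetical stand-in for a real hotspot such as Cpu.step or pystone.Proc8):

```python
# Sketch: bind just the routine you know is hot, instead of letting
# psyco.jit()/psyco.full() guess which functions to specialize.
try:
    import psyco            # Python-2-only; absent on modern interpreters
    HAVE_PSYCO = True
except ImportError:
    HAVE_PSYCO = False

def hot_loop(n):
    # stand-in for a known hotspot
    total = 0
    for i in range(n):
        total += i & 0xFF
    return total

if HAVE_PSYCO:
    psyco.bind(hot_loop)    # specialize only the routine we name

print(hot_loop(1000))
```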

Skip

Peter Hansen

unread,
Mar 7, 2003, 7:57:22 PM3/7/03
to
Skip Montanaro wrote:
>
> Peter> Py2.2, list, inlined: 16468, 108%
> Peter> Py2.3, list, inlined: 19074, 126%
> Peter> Py2.2, list, inlined, Psyco: 9963, 66% (!)
>
> Peter> Not sure what that last one indicates, except that under certain
> Peter> circumstances (perhaps when used by someone who hasn't even read
> Peter> its documentation? :-), Psyco can produce *worse* results.
>
> How exactly are you using Psyco? If you let it figure out which routines to
> specialize, it generally does a poor job. If you know where your hotspots
> are, you can tell it to just specialize those routines.

At first I tried binding all the opcode handler functions, and the
Opcode class, and the Cpu class. A little experimentation showed that
ditching the opcode handlers increased performance, so now I just
do the two classes. Are you suggesting picking only some methods
in the classes would help?

At this point, actually, I need to do one thing before going any
further makes sense: profile the code. And read the Psyco manual.
Two things I must do before it makes sense. Profile, RTFM, and
develop a _need_ for greater performance. *Three* things... there
are three things I need. Oh bugger...

-Peter

Aahz

unread,
Mar 7, 2003, 8:35:32 PM3/7/03
to
In article <3E691B8A...@engcorp.com>,
Peter Hansen <pe...@engcorp.com> wrote:
>
> [...]

BTW, I don't see any evidence that you're using python -O; if not, that
would account for the primary speedup of 2.3a2.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

Register for PyCon now! http://www.python.org/pycon/reg.html

Skip Montanaro

unread,
Mar 7, 2003, 10:51:51 PM3/7/03
to
Peter> Are you suggesting picking only some methods in the classes would
Peter> help?

Yes, it might be worth a try.

Skip

Pedro Rodriguez

unread,
Mar 8, 2003, 6:22:33 AM3/8/03
to

I understand, and agree with, the reasoning behind Andrew's and John's
remarks. Just for the purpose of the exercise, I will add some points.

I understand that you can live with performance as it is, but if some
basic optimisation or tool can buy you some time, you may make the
effort of using it as long as it keeps the code simple and maintainable.

Let's consider that current performance is a problem, without losing
focus on maintainability. By design you'll have to consider which functions
are critical; using a profiler, for instance, will show you which functions
are heavily used or which ones are too expensive.

Optimization can be done by redesigning some aspects of the application
in regard with data organization and data flow.

If the method "step" were the time-critical function of your code,
what could be done to improve it?

If I focus on the "try/except", and pick up your comment that falling into
the exception case should be infrequent: why clutter your code with this
test (or any test, like the one I posted)? Just consider that all opcodes
are valid, but that the action for some will raise the exception
(def executeINVALIDOPCODE(): raise ...).

You'll simplify, and do some basic optimization of, the flow of your
code, without adding extra complexity, while making the best use of your
data organization. I don't know what your real CPU will do in this case,
but you may even implement the proper behaviour.

What about the next test, with the returned new PC: why not have the
opcode method do the proper job only when needed (by passing the
opcode mode to the execute method, or just knowing, by design of the
opcode method, how it will affect the CPU PC)?

Just-thinking-loud'ly y'rs,
Pedro

Peter Hansen

unread,
Mar 8, 2003, 2:42:23 PM3/8/03
to
Pedro Rodriguez wrote:
>
> If I focus on the "try/except", and pick up your comment that falling into
> the exception case should be infrequent: why clutter your code with this
> test (or any test, like the one I posted)? Just consider that all opcodes
> are valid, but that the action for some will raise the exception
> (def executeINVALIDOPCODE(): raise ...).

That's an excellent idea that I hadn't considered. Although the
HC12 itself has effectively no unimplemented opcodes, and my
exception is there just until more than 34 are implemented, it's
likely I will never implement some 50 or so of the opcodes as our
real code doesn't use them. I'll throw in something to auto-build
256 "dummy" opcodes on startup, then allow them to be substituted
with the real ones as they are defined. Thanks. :)
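A sketch of the auto-built dummy table Peter describes (the Opcode layout here is hypothetical, modeled loosely on the step() snippet earlier in the thread): pre-fill all 256 slots with handlers that raise, then overwrite slots as real opcodes are implemented, so the dispatch loop needs no per-step existence check.

```python
class UnimplementedOpcode(Exception):
    pass

class Opcode:
    def __init__(self, mode, length, cycles, execute):
        self.mode, self.length, self.cycles = mode, length, cycles
        self.execute = execute

def make_dummy(byte):
    # Dummy handler: raises with the offending opcode byte, as the old
    # try/except KeyError path did.
    def execute(cpu):
        raise UnimplementedOpcode('$%02X' % byte)
    return Opcode('INH', 1, 1, execute)

opcodes = [make_dummy(b) for b in range(256)]

# A real handler simply replaces its slot as it is defined:
def execute_nop(cpu):
    return None               # None meaning "advance PC normally" in step()

opcodes[0xA7] = Opcode('INH', 1, 1, execute_nop)   # $A7 is the HC12 NOP
```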

> What about the next test, with the returned new PC: why not have the
> opcode method do the proper job only when needed (by passing the
> opcode mode to the execute method, or just knowing, by design of the
> opcode method, how it will affect the CPU PC)?

Something like that will probably happen, eventually, but right now
the opcode handler functions are actually unaware of the specific
opcode being executed and therefore don't know which addressing
mode should be used. I could pass in the opcode, but as far as I
can tell I'm then making the opcode handlers do something that
the CPU itself is supposed to do (thus increasing the amount of
duplicated code, thus reducing maintainability), *plus* I'm now
passing in two parameters instead of just one when I call execute().

Thanks for the ideas. The thoughts on restructuring are good, but
I'm going to trust in the test-driven development (TDD) approach
to lead the design. If the code needs to go in the direction
you suggest, it will lead me there on its own.

-Peter

Peter Hansen

unread,
Mar 8, 2003, 2:44:08 PM3/8/03
to
Aahz wrote:
>
> In article <3E691B8A...@engcorp.com>,
> Peter Hansen <pe...@engcorp.com> wrote:
> >
> > [...]
>
> BTW, I don't see any evidence that you're using python -O; if not, that
> would account for the primary speedup of 2.3a2.

You're quite correct. I basically never use -O, mostly because
I never think of it (nor care about performance enough). A quick
test shows that with Python 2.2 and -O, I can get roughly half
the speedup that I get going to Python 2.3. I'll investigate
more thoroughly on a work machine next week.

-Peter
