Ann: Revival of the BytecodeHacks

Christian Tismer

Jul 29, 2004, 9:08:49 PM

to pytho...@python.org, python-ann...@python.org, pypy-dev

Dear community,

I could not resist making this announcement, although
this project belongs to Michael Hudson.
Mike, please forgive me. I owe you a beer.

In a single day-and-night session, I hacked on bytecodehacks
to upgrade it from Python 1.5.2 to Python 2.3.
This was quite some work, although not too much, thanks to
the excellent basic layout which Michael created.

What are the Bytecodehacks?
---------------------------

The bytecodehacks allow you to make certain modifications to
compiled Python code objects. There are lots of applications
included, like macro expansion and function inlining,
things which Python does not provide and is not supposed
to provide. This kind of madness exists for people who ask
for it, without worrying those who don't care.
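
As a rough illustration of what "modifying a compiled code object" means, here is a minimal sketch that uses only the standard library of a modern Python 3 (3.8 or later, for `code.replace`). It does not use the bytecodehacks API, which targeted Python 1.5.2 and 2.3; the helper name `replace_const` is hypothetical and exists only for this example.

```python
import types


def greet():
    return "hello"


def replace_const(func, old, new):
    """Return a copy of func whose code object has the constant old swapped for new.

    Hypothetical helper for illustration; not part of bytecodehacks.
    """
    code = func.__code__
    # Constants live in a tuple on the code object; build a patched copy.
    new_consts = tuple(
        new if (type(c) is type(old) and c == old) else c
        for c in code.co_consts
    )
    new_code = code.replace(co_consts=new_consts)
    # Wrap the patched code object in a fresh function; the original is untouched.
    return types.FunctionType(
        new_code, func.__globals__, func.__name__,
        func.__defaults__, func.__closure__,
    )


patched = replace_const(greet, "hello", "goodbye")
print(greet(), patched())
```

The point is only that a function's compiled form is an ordinary object that can be inspected and rebuilt; macro expansion and inlining work on the instruction stream as well, which this sketch does not touch.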

Again, this great package was written by Michael Hudson in early
2000, and to my knowledge, it was never ported to the more
recent Python versions. Michael told me that this is a package
he is no longer too fond of, since it was written in his early days.
He also told me that he is not keen on supporting it so much,
because he would be tempted to give it a whole rewrite.

Now, I'm thinking differently. I got the package to work, after
about 12 hours of hacking, and it simply works. Since I didn't
write the initial version, I have a different relationship to it.
In other words: It is easier to maintain a foreign package than
your own, since you are not married to it.

Why do I dig into foreign areas?
--------------------------------

Well, I have enough work with my Stackless package.
Stackless is almost ready. (Almost, in the way your toy
railway gets ready; it never really will.)
I just built minimal Psyco support into it, because
I'm basically always after as much speed as I can get.
But there are limits with the regular Python interpreter.

So my idea is to use these crazy other projects to get more
performance, and to support them directly.
My first idea to accelerate Psyco using Stackless was
to provide Stackless with extra hardware stacks, which can
be switched at light-speed. I still have this idea in mind,
but the implementation is not so trivial.

Comparatively, replacing generators (yield statements) with
a couple of save/restores of tuples *is* almost trivial,
as I'm probably going to show tomorrow.
In plain Python, these "fake generators" would be noticeably
slower.
But the fact that they are then Psyco-enabled makes
them really, really fast, and also completely inlineable.
I am thinking of naming the module "renegate". :-)
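
To make the idea of a "fake generator" concrete, here is a hand-written sketch at source level, in modern Python 3. It is not the announced module and says nothing about how the real transformation is done on bytecode; the names `countdown` and `FakeCountdown` are invented for this illustration. The generator's suspended frame is replaced by a tuple that is restored on entry and saved before each return.

```python
def countdown(n):
    """Ordinary generator: the interpreter suspends the frame at each yield."""
    while n > 0:
        yield n
        n -= 1


class FakeCountdown:
    """Same sequence without yield: the loop's local state lives in a tuple."""

    def __init__(self, n):
        self._state = (n,)           # saved "locals" of the loop

    def __iter__(self):
        return self

    def __next__(self):
        (n,) = self._state           # restore the saved locals
        if n <= 0:
            raise StopIteration      # the generator body would have ended here
        self._state = (n - 1,)       # save the locals for the next resume
        return n                     # the value the generator would have yielded


print(list(countdown(3)), list(FakeCountdown(3)))
```

A generator with several yield points would additionally need to record which point to resume at, which is where a goto-like jump comes in.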

Why do I want to revive this package?
-------------------------------------

Well, I am a pragmatic guy, and I have a really good reason why I
need the bytecodehacks. I am writing a sophisticated package
which involves parsing of PDF files, and I want to do it all in
Python. In order to get this PDF processor to almost C speed,
I used Armin Rigo's wonderful Psyco package.
Unfortunately, Psyco has a few limitations, which act as
show-stoppers:

- generators are not supported. That means, whenever I use
a generator, Psyco will not accelerate it, but will act as a
small slow-down.

- Psyco is great at optimizing simple structures like lists, tuples,
numbers and strings. It is less able to enhance things like
object properties. Using self frequently disables almost all
of Psyco's capabilities.

- Psyco has difficulties with inlining. Simple functions *are*
inlined, but when they contain a conditional branch or they
exceed some limit, inlining is disabled. This *could* be changed,
but with a lot of effort by changing C code. This is not going to
happen, because all of this stuff will be enhanced and
re-implemented during the PyPy project.

Now, by combining the re-animated bytecodehacks project with
Psyco, I am almost sure that I can remove certain restrictions
from Psyco, by turning problematic Python structures into simple
ones, which the current Psyco can handle natively.

Poor man's PyPy
---------------

Although I am a member of the PyPy project, and I am one of the
people who initiated it, I am impatient, and I want
to get a few of the expected PyPy results right now. Psyco is
fantastic but not perfect, and it needs some help to reach maximum
performance.
By adding Bytecodehacks in the right manner, I think I can fill
this gap. With BCH, I can replace generators by ordinary methods
of a class (plus a few bytecode instructions which have no real
Python equivalent, like goto). By inspecting the data flow of a
self.attribute, I can prove that it is invisible outside and
replace it by a simple local variable in many cases. By using
Bytecodehacks for proper inlining of functions, I can relieve
Psyco of this difficult task.
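
The attribute-to-local idea can be shown by hand at source level. The following is only a sketch in modern Python 3 with invented names (`Summer`, `add_all_slow`, `add_all_fast`); it is not output of Bytecodehacks. The rewrite is valid only under the assumption stated above: nothing outside the method can observe `self.total` while the loop runs.

```python
class Summer:
    def __init__(self):
        self.total = 0

    def add_all_slow(self, values):
        # Every iteration reads and writes an instance attribute.
        for v in values:
            self.total += v
        return self.total

    def add_all_fast(self, values):
        # Same result, but the attribute is loaded once into a local,
        # updated locally, and written back once at the end.
        total = self.total
        for v in values:
            total += v
        self.total = total
        return total


a, b = Summer(), Summer()
print(a.add_all_slow([1, 2, 3]), b.add_all_fast([1, 2, 3]))
```

Note that the two versions differ if the loop raises part-way through: the first leaves a partial sum in `self.total`, the second does not. An automatic transformation would have to prove that this difference cannot be observed.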

The expected result
-------------------

By consistently applying the methods I sketched above,
I expect that I can make almost every existing application
noticeably faster. I will provide this as a service for
customers and charge them for relevant acceleration.
The software will stay open source. These are just a few
add-ons to Psyco and Bytecodehacks, and I'm not the author
of those. I just found out how nicely they can fit together.

My guess is an overall acceleration of at least a factor of
five for almost any native Python application. There is no proof
yet; this is all voodoo from my gut. But this gut tends
to be quite reliable.

Mike, please forgive me for this announcement. You should have
written it, but I was so very inspired.

Getting the bytecodehacks for Python 2.3
----------------------------------------

The current source code is available via anonymous CVS:

cvs -d:pserver:anon...@cvs.sourceforge.net:/cvsroot/bytecodehacks login

cvs -z3 -d:pserver:anon...@cvs.sourceforge.net:/cvsroot/bytecodehacks co bytecodehacks

cheers -- chris
--
Christian Tismer :^) <mailto:tis...@stackless.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/

Radovan Garabik

Jul 30, 2004, 2:47:50 AM

Christian Tismer <tis...@stackless.com> wrote:
...

> need the bytecodehacks. I am writing a sophisticated package
> which involves parsing of PDF files, and I want to do it all in
> Python. In order to get this PDF processor to almost C speed,

What license is your pdf parser going to have?
Do you have a working version?
My work plans include writing a pdf parser (and I prefer python for it),
if your package is going to be open source, I would rather not
duplicate the effort.

thanks

--
-----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Andreas Lobinger

Jul 30, 2004, 5:35:58 AM

Aloha,

Radovan Garabik wrote:

> Christian Tismer <tis...@stackless.com> wrote:
>>need the bytecodehacks. I am writing a sophisticated package
>>which involves parsing of PDF files, and I want to do it all in
>>Python. In order to get this PDF processor to almost C speed,

When you need hacks to get reasonable speed for full parsing PDF,
your algorithms are not very efficiently designed...
BTDT
Even Acrobat sometimes works for seconds reading files.

> What license is your pdf parser going to have?
> Do you have a working version?
> My work plans include writing a pdf parser (and I prefer python for it),
> if your package is going to be open source

Can we make a list here of how many people have started writing
PDF processing software?

I have been working for ~20 months now (I do this for enjoyment, not for a
specific task) on a low-level PDF library that simply reads and
parses a .pdf into Python data types and also writes the in-memory
data back to a file. And *yes*, there is an intent to go public.

A few observations from the work on the library show that the
problem is not getting, e.g., a PDF tokenizer running (and running fast);
the real problem is understanding PDF as a structured document
in depth. I changed some code back and forth because I wasn't sure
which in-memory representation would fit. (E.g., a .pdf is simply a collection
of objects; is it a list, or a dict with the object number as key?)

Wishing a happy day
LOBI

Christian Tismer

Jul 30, 2004, 5:55:10 AM

to Andreas Lobinger, pytho...@python.org

Andreas Lobinger wrote:
> Aloha,
>
> Radovan Garabik wrote:
>
>> Christian Tismer <tis...@stackless.com> wrote:
>>
>>> need the bytecodehacks. I am writing a sophisticated package
>>> which involves parsing of PDF files, and I want to do it all in
>>> Python. In order to get this PDF processor to almost C speed,
>
>
> When you need hacks to get reasonable speed for full parsing PDF,
> your algorithms are not very efficiently designed...

I'm not sure what you are talking about.
By speed, I'm thinking of reaching almost the
speed of a pure C implementation.
For that reason, I use Psyco, and the primitive
routines are implemented in a way that Psyco
optimizes best.
But design-wise, the code looks nicer if things like
data streams and token sequences are implemented using
generators.
Generators are not supported by Psyco.
With Bytecodehacks, I can make Psyco support generators,
by applying code transformations which turn the yield
statement into something different, but semantically
identical.
The goal is to have things both fast and nicely readable.

Christopher T King

Jul 30, 2004, 8:50:54 AM

On Fri, 30 Jul 2004, Christian Tismer wrote:

> With Bytecodehacks, I can make Psyco support generators,
> by applying code transformations which turn the yield
> statement into something different, but semantically
> identical.
> The goal is to have things both fast and nicely readable.

Wouldn't the ideal solution be to make Psyco support generators natively?

Correct me if I'm wrong, but wouldn't that amount to no more than some C
stack trickery (i.e. pop the stack used by the current function, store it
& the current IP into the generator structure, return to caller)?

Christian Tismer

Jul 30, 2004, 9:02:19 AM

to Radovan Garabik, pytho...@python.org

Radovan Garabik wrote:

> Christian Tismer <tis...@stackless.com> wrote:
> ...
>
>>need the bytecodehacks. I am writing a sophisticated package
>>which involves parsing of PDF files, and I want to do it all in
>>Python. In order to get this PDF processor to almost C speed,
>
>
> What license is your pdf parser going to have?

I don't know, yet. From time to time I write things
which are not open source, to make a living.

> Do you have a working version?

Yes, quite.

> My work plans include writing a pdf parser (and I prefer python for it),
> if your package is going to be open source, I would rather not
> duplicate the effort.

My targets are a bit special. I think Dinu Gherman might
have something you'd like to start playing with.

ciao - chris

Michael Hudson

Jul 30, 2004, 10:03:19 AM

Christopher T King <squi...@WPI.EDU> writes:

> On Fri, 30 Jul 2004, Christian Tismer wrote:
>
> > With Bytecodehacks, I can make Psyco support generators,
> > by applying code transformations which turn the yield
> > statement into something different, but semantically
> > identical.
> > The goal is to have things both fast and nicely readable.
>
> Wouldn't the ideal solution be to make Psyco support generators
> natively?

Yes. You make it sound easy, though.

> Correct me if I'm wrong, but wouldn't that amount to no more than some C
> stack trickery (i.e. pop the stack used by the current function, store it
> & the current IP into the generator structure, return to caller)?

No, it's harder than that. How much do you know about how psyco works?

Cheers,
mwh

--
<glyph> we need PB for C#
* moshez squishes glyph
<moshez> glyph: squishy insane person
-- from Twisted.Quotes

Christopher T King

Jul 30, 2004, 10:16:43 AM

On Fri, 30 Jul 2004, Michael Hudson wrote:

> No, it's harder than that. How much do you know about how psyco works?

Not much ;) I'm very interested in having generator support in Psyco
though, and if/when I find the time, I'm going to look into how to go
about implementing it.

Michael Hudson

Jul 30, 2004, 10:27:56 AM

Christopher T King <squi...@WPI.EDU> writes:

OK, but I think Armin, psyco's author, is interested too and hasn't
done it yet... this might tell you something :-)

Cheers,
mwh

--
<dash> wow. this code does something highly entertaining, but
nowhere near correct -- from Twisted.Quotes

Christopher T King

Jul 30, 2004, 10:44:48 AM

On Fri, 30 Jul 2004, Michael Hudson wrote:

> OK, but I think Armin, psyco's author, is interested too and hasn't
> done it yet... this might tell you something :-)

IIRC, the Psyco web site says something to the effect of "I'll look into
implementing generators if there's sufficient demand"; I just figured
there wasn't sufficient demand yet.

Christian Tismer

Jul 30, 2004, 10:48:40 AM

to Christopher T King, pytho...@python.org

Christopher T King wrote:

I tell you this is not trivial.
The problem is equivalent to the problem
of making Psyco stackless. I looked into
Psyco for quite a while, but finally
took the easier way, not least because
Armin wants to stop hacking on this complicated C code.
If there exists a bytecode sequence which does
the equivalent thing, why should I try the harder way?

ciao - chris

Christopher T King

Jul 30, 2004, 12:20:00 PM

On Fri, 30 Jul 2004, Christian Tismer wrote:

> I tell you this is not trivial.

What makes it so difficult? Does Psyco not keep its state on the stack,
like CPython does? I'm trying to form a picture of the Psyco internals in
my mind, but I can't find any in-depth documentation of them (aside from
the source, of course), so I'm basing my thoughts on (possibly
naive) assumptions.

Implementing generators in C is trivial: since all the state of a C
function is contained on the stack, you need only use setjmp/longjmp (or
equivalent) augmented with stack saving/restoring routines to make it
work. I just fail to see how implementing them in generated machine code
can be that much different, unless local state isn't kept on the stack
(which you seem to imply that it is), or the dynamic profiling / code
generation routines make a mess of this whole scheme (in which case I'll
keep quiet).

David Boddie

Jul 30, 2004, 1:04:51 PM

Andreas Lobinger <andreas....@netsurf.de> wrote in message news:<ced4pv$m5d$1...@news.mch.sbs.de>...

> Can we make a list here of how many people have started writing
> PDF processing software?

Time to jump in here. I started one about three years ago:

http://www.boddie.org.uk/david/Projects/Python/pdftools/

It still isn't finished...

> I have been working for ~20 months now (I do this for enjoyment, not for a
> specific task) on a low-level PDF library that simply reads and
> parses a .pdf into Python data types and also writes the in-memory
> data back to a file. And *yes*, there is an intent to go public.

I didn't get to the part where I could write the data out to a file.
My original aim was to be able to view files rather than modify them.

> A few observations from the work on the library show that the
> problem is not getting, e.g., a PDF tokenizer running (and running fast);
> the real problem is understanding PDF as a structured document
> in depth. I changed some code back and forth because I wasn't sure
> which in-memory representation would fit. (E.g., a .pdf is simply a collection
> of objects; is it a list, or a dict with the object number as key?)

You should think of it as a dictionary, I believe. I don't think that
the order of objects in the file really matters; just the order in the
list of used objects in the file's trailer.
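
A minimal sketch of the dictionary view, in modern Python 3. The `Ref` class, the `resolve` helper and the tiny document below are all invented for illustration and are not taken from pdftools or any other library mentioned here; the sketch assumes only that each PDF indirect object is identified by an object number and a generation number, and that the trailer names the root object.

```python
from typing import Any, Dict, Tuple

ObjId = Tuple[int, int]  # (object number, generation number)


class Ref:
    """An indirect reference, written "2 0 R" in PDF syntax."""

    def __init__(self, num: int, gen: int = 0) -> None:
        self.key: ObjId = (num, gen)


# The file's objects, keyed by identity rather than by position in the file.
objects: Dict[ObjId, Any] = {
    (1, 0): {"Type": "Catalog", "Pages": Ref(2)},
    (2, 0): {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    (3, 0): {"Type": "Page", "Parent": Ref(2)},
}
trailer = {"Root": Ref(1), "Size": 4}


def resolve(value: Any) -> Any:
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, Ref):
        value = objects[value.key]
    return value


catalog = resolve(trailer["Root"])
pages = resolve(catalog["Pages"])
first_page = resolve(pages["Kids"][0])
print(first_page["Type"])
```

With a dict, lookup by reference is direct and the order in which objects appear in the file plays no role, which matches the point made above.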

Good luck,

David

Armin Rigo

Jul 30, 2004, 3:10:55 PM

to Christopher T King

Hi,

Christopher T King wrote:
> What makes it so difficult?

A lot of things. But discussing the internals of Psyco on this
newsgroup isn't a very good idea. Please come to the psyco-devel
mailing list.

> Implementing generators in C is trivial:

I don't think I agree with this sentence, either.

This said, I don't think that implementing real support for generators
in Psyco is particularly difficult, in theory. It's the practical
problems that keep me running away from the task. If you really want to
invest some time in the problem, please stop musing aloud on this
newsgroup and come talk in the psyco-devel mailing list.

BTW I just put on http://psyco.sourceforge.net/ the paper I am going to
present at the ACM SIGPLAN 2004 conference, which gives a (partially
theoretical) description of how Psyco works.

Armin