We all know Python loves its objects. It really loves strings. The other
day I profiled Python and checked malloc calls in the CGI script that runs
on our site most frequently. With about a 3-second run-time it called
malloc over 66,000 times with half being from inside stringobject.c. (Lists
were the next most frequently allocated objects, accounting for about 10,000
malloc calls.) I figured it was a place that could use a little help.
I wrote a possibly better (read: not evaluated in a performance way yet)
string object allocator last night that treats stringobjects of <= 128 bytes
(including the stringobject header) special. The allocator rounds up
requests for small objects to the next higher power of two (<=32 -> 32, <=64
-> 64, and <=128 -> 128). No stringobject can contain less than 20 bytes by
default (stringobject headers are normally 20 bytes). 16kbyte blocks for
32-, 64-, and 128-byte stringobjects are then allocated as necessary, and a
free list is maintained, currently chained together through the ob_type
The net effect is to cut way back on the number of malloc calls. I doubt
that by itself this is such a big performance win on my system, but malloc
needs to add some bookkeeping overhead of its own, so I figure while I might
only be saving a small amount in time, I'm also saving a small amount in
space, and it all adds up. (Tim Peters observed that ten 1% improvements
would probably net you a bigger gain than a single 10% improvement.) On
systems with less industrial-strength mallocs, cutting way back on malloc
calls may be a big win.
I then added some conditionally compiled counters to stringobject.c and a
Python-visible interface in stropmodule.c that counts the number of string
allocations and deallocations for each of the four categories, returning a
list of each. On that 3-second or so CGI script I got:
<= 32 30950 10824
<= 64 6440 3170
<= 128 1394 1076
> 128 2025 707
This was measured right at the end, just before sys.stdout was closed by the
script. I figure most of the large allocations were probably related to
imports. I haven't yet had a chance to try this with my database server.
It uses small strings as dict keys *heavily* however, so I expect more
I'll post a copy of the modified stringobject.c and stropmodule.c to my web
I'm late for dinner, but I'll try and get them over there tonight or
(By the way, there's nothing string-specific about this small object
allocator. I was thinking of chaining the free list through another field
and setting the type field once, but that would make it string-specific.
With a little work it could be the basis for a general small object
allocator as it now exists.)
I installed modified versions of stringobject.c and stropmodule.c on our web
server. They are accessible via
Warning: This is just a first attempt. I've done some testing, but not a
bunch. Use this on an experimental basis only. The code isn't yet properly
packaged, and there are some definite warts on it. Feedback is welcome.
One thing I haven't yet figured out is this: Given an arbitrary number, n,
return the power of two that is equal to or greater than n *without writing
a loop*. I can clearly do something like:
for (i = 1; i < n; i <<= 1);
Can it be done using a single bit-twiddling expression though?
Cool. I read your description and am very pleased with your
approach. Did you benchmark it with pystone yet? (I'm still waiting
for a better benchmark, but nobody has given me one yet... ;-)
> Warning: This is just a first attempt. I've done some testing, but not a
> bunch. Use this on an experimental basis only. The code isn't yet properly
> packaged, and there are some definite warts on it. Feedback is welcome.
I immediately went to resizestring (to check that you'd taken care of
it properly -- you have) and noticed that there's one easy
optimization there: when the new and old sizes fall in the same
bucket, you don't need to do anything except change the ob_size field.
> One thing I haven't yet figured out is this: Given an arbitrary number, n,
> return the power of two that is equal to or greater than n *without writing
> a loop*. I can clearly do something like:
> for (i = 1; i < n; i <<= 1);
> Can it be done using a single bit-twiddling expression though?
For numbers < 256 you can do it with a single table lookup, if you can
spare 256 bytes for the table. For larger numbers you can quickly
find the highest byte in the number that's non-zero and use that to
index the table and add 8* the byte number (if you count from the
right end ;_)
--Guido van Rossum (home page: http://www.python.org/~guido/)
* untested code*
It's not a single bittwiddle, but I think:
x |= x >> 1; // 00100100 -> 00100100 | 00010010 = 00110110
x |= x >> 2; // 00110110 -> 00110110 | 00001101 = 00111111
x |= x >> 4; // 00111111 -> 00111111 | 00000011 = 00111111
x = x+1; // 00111111 -> 01000000
almost works for 8 bits. Two more steps and you've got 32 bits. I
assume a decent compiler should be able to make this happen very
quickly.... Unfortunately it gives the power of two <greater than> x.
I think you'd need something like:
x = y;
(above code inserted)
if (x >> 1 == y) x = y;
To make it work for the power of two <greater than or equal to> x.
'Waste Please' - Exhortation on bins in DisneyWorld....
Peter Hart: ze...@one.net, 10604...@compuserve.com. Voice: +1-513-579-9957
135, Garfield Place Apt#423, Cincinnati OH45202, USA.
Well, you could always just decrement x before doing the twiddle.
Saves you a branch (and possible pipeline stall); and since a compare
is the same as a subtract, it shouldn't slow you down any...