programming language benchmark

srepmub

unread,

Jun 18, 2011, 2:24:18 PM6/18/11

to shedskin-discuss

hi all,

just found this, a comparison of different languages and language
implementations (pypy, ironpython, shedskin..) for 7 different
benchmarks.

http://attractivechaos.github.com/plb/

the first three benchmarks are fine, but then shedskin apparently
fails on two, and is slower than pypy for the last two. if anyone
might be interested in looking into this..

thanks,
mark.

srepmub

unread,

Jun 18, 2011, 4:22:37 PM6/18/11

to shedskin-discuss

>
> http://attractivechaos.github.com/plb/
>
> the first three benchmarks are fine, but then shedskin apparently
> fails on two, and is slower than pypy for the last two. if anyone
> might be interested in looking into this..

actually, the two failure benchmarks are the same, and the last two
benchmarks are also for a single program (time and memory size).

so shedskin is by far the fastest python implementation for the first
three benchmarks (if one considers it a python implementation, fails
one test probably for a silly reason, and is twice as slow as cpython
for one benchmark..

thanks,
mark.

Enzo Erbano

unread,

Jun 18, 2011, 4:50:06 PM6/18/11

to shedskin...@googlegroups.com

Hi Mark,

Really interesting!

But, it is obvious that Intel Compiler, in average, is the best. But it is only because

it is capable of optimizing codes taking advantage of vectorization [codes are compiled

for intel using SSE4.1, and the CPU is a Xeon, which is stronger for SSE], also considering

some other optimizations types.

For the Intel compiler:

First test = ICC 1°

Second test = ICC 1°

Third test = ICC 4°

Fourth test = ICC 2°

Fifth test = ICC 3°

Sixth test = ICC 1°

Seventh test = ICC 1°

Average = 1,85

For Shedskin:

First test = SS 7°

Second test = SS 8°

Thirty test = SS 11°

Fourty test = SS 24°

Fifth test = SS 24°

Sixth test = SS 18°

Seventh test = SS 21°

Average = 16,14

Considering that shedskin process python, the points is quite good,

and with minor changes, codes generated by shedskin can reach at

least points near GCC.

By the way, if you could implement code transformations for optimizations

as post-processing in C++ codes, it would be wonderful!

If it is of your interest.... I have a link that maybe will be attractive for you.

http://strategoxt.org/

Best regards.

2011/6/18 srepmub <mark....@gmail.com>

--
You received this message because you are subscribed to the Google Groups "shedskin-discuss" group.
To post to this group, send email to shedskin...@googlegroups.com.
To unsubscribe from this group, send email to shedskin-discu...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/shedskin-discuss?hl=en.

Mark Dufour

unread,

Jun 18, 2011, 5:09:31 PM6/18/11

to shedskin...@googlegroups.com

For Shedskin:
First test = SS 7°
Second test = SS 8°

Thirty test = SS 11°
Fourty test = SS 24°
Fifth test = SS 24°
Sixth test = SS 18°

Seventh test = SS 21°

Average = 16,14

note that the third and last tests measure memory, so not very interesting I think, unless there are huge differences. and the fourth, fifth and sixth tests will likely go much better with some minor improvements. my hands are itching, but I'd like to wait and see if anyone else looks into them first..

By the way, if you could implement code transformations for optimizations

as post-processing in C++ codes, it would be wonderful!

well, to be honest I've got my hands full with the current approach.. there are lots of options for high level optimizations, and escape analysis is my favorite, but I prefer to keep on improving type inference myself.. anyone is welcome to look into additional analyses, and I'd be happy to assist of course.

thanks!
mark.
--

http://www.youtube.com/watch?v=E6LsfnBmdnk

Mark Dufour

unread,

Jun 18, 2011, 5:11:49 PM6/18/11

to shedskin...@googlegroups.com

btw, combining shedskin with ICC might be an interesting experiment of course. I don't seem to recall anyone doing this.. GCC 4.6 may also perform better than the used GCC 4.3 of course..

thanks,
mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Enzo Erbano

unread,

Jun 18, 2011, 6:11:12 PM6/18/11

to shedskin...@googlegroups.com

note that the third and last tests measure memory, so not very interesting I think, unless there are huge differences. and the fourth, fifth and sixth tests will likely go much better with some minor improvements. my hands are itching, but I'd like to wait and see if anyone else looks into them first..

Well, I never got a deep look at the benchmarks, and I dont know if it is the same case, but memory transfers optimized by MMX/SSE could give a great boost, and particularly, I have tested myself a version of memcpy/memfil and brothers optimized with this, and I got an average of 2900Mb/s of speed in my current PC. I think it worth a lot, not?

Also, I have read the Intel compiler manuals, and I have mentioned here on the group some time ago some optimizations and their boost benefits, and that optimizations are exactly, or almost, what Intel compiler try to optimize. Here, also, I have my own reasons to opt doing some optimizations directly in the code than hoping the compiler will do, for example, optimizations done directly on the C++ code can be more safe and controllable, also, if happens that you cant use one or another compiler versions or environment, on a C++ code already optimized, you will get stable speed all the time, regardless the compiler version you use!

well, to be honest I've got my hands full with the current approach.. there are lots of options for high level optimizations, and escape analysis is my favorite, but I prefer to keep on improving type inference myself.. anyone is welcome to look into additional analyses, and I'd be happy to assist of course.

Well, I have few knowledge on shedskin and python, and I am better at assembly x86 + SSE and C/C++ [I have done some python code, but it does not made me a specialist, hehe], if I can, I could help a little [you would need to point me the right point also], but I think...

It is not better to implement transformation optimizations for C++ to C++ separately than directly in shedskin?

I mean, there would be more benefits, since, you could take c++ codes from many sources to transform and optimize.

There are optimizations that are inherently to C++ and others that is to Python/Shedskin.... so I think it will dont worth mixing them together... is not? [this is just a thought]

btw, combining shedskin with ICC might be an interesting experiment of course. I don't seem to recall anyone doing this.. GCC 4.6 may also perform better than the used GCC 4.3 of course..

Indeed, you are right! It could be interesting.

Fahrzin Hemmati

unread,

Jun 18, 2011, 6:40:12 PM6/18/11

to shedskin...@googlegroups.com

Alright, I fixed the patmch code. There were two problems caused by
shedskin:

1. sys.argv's last argument doesn't get the newline stripped from the
end like it does for CPython.

2. re.compile(regex).search(data) doesn't accept an empty data string,
so the if statement has to be "if line[:-1] and r.search(line[:-1]):"

With those fixed, CPython takes 19 seconds and shedskin takes 11.5
seconds on my (old) desktop. As a ratio, this gives SS a score of 3.5
for patmch:1t. Then with 58 for CPython and 71 for SS, that's a score of
15.4 for patmch:2t.

I was also surprised by the slower speed for 2t, but I'll send the fixed
code to attractivechaos and see what he comes up with.

--fahhem

Mark Dufour

unread,

Jun 19, 2011, 4:36:03 AM6/19/11

to shedskin...@googlegroups.com

thanks!

1. sys.argv's last argument doesn't get the newline stripped from the end like it does for CPython.

hmm, I don't see this..? :) (tested windows as well)

2. re.compile(regex).search(data) doesn't accept an empty data string, so the if statement has to be "if line[:-1] and r.search(line[:-1]):"

which version of python are you using? this seems to works fine here. the following makes re.search accept empty strings, though I'm not sure if it's the correct fix.

--- a/shedskin/lib/re.cpp
+++ b/shedskin/lib/re.cpp
@@ -455,6 +455,8 @@ __iter<match_object *> *re_object::finditer(str *subj, __ss_

match_object *re_object::__exec(str *subj, __ss_int pos, __ss_int endpos, __ss_
{
+ if(subj->unit.size() == 0) return (match_object *)NULL;
+

With those fixed, CPython takes 19 seconds and shedskin takes 11.5 seconds on my (old) desktop. As a ratio, this gives SS a score of 3.5 for patmch:1t. Then with 58 for CPython and 71 for SS, that's a score of 15.4 for patmch:2t.

not too bad, but I *think* shedskin may still uffer from its slow file IO implementation here, because we're ploughing through quite a big file. same thing for the dict benchmark. perhaps that explains why that one is slower than cpython.

I was also surprised by the slower speed for 2t, but I'll send the fixed code to attractivechaos and see what he comes up with.

thanks! :) would be nice in any case not to have '999' there..
mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Fahrzin Hemmati

unread,

Jun 19, 2011, 4:42:57 AM6/19/11

to shedskin...@googlegroups.com

On 6/19/2011 1:36 AM, Mark Dufour wrote:

thanks!

1. sys.argv's last argument doesn't get the newline stripped from the end like it does for CPython.

hmm, I don't see this..? :) (tested windows as well)

Oops, I guess I must have imagined it in the debugging process. My fix wouldn't do anything if there was no newline so it wouldn't have made any real effect. I'll let attractivechaos know.

2. re.compile(regex).search(data) doesn't accept an empty data string, so the if statement has to be "if line[:-1] and r.search(line[:-1]):"

which version of python are you using? this seems to works fine here. the following makes re.search accept empty strings, though I'm not sure if it's the correct fix.

Yes, it works fine in CPython, but shedskin's search() would error out.

--- a/shedskin/lib/re.cpp
+++ b/shedskin/lib/re.cpp
@@ -455,6 +455,8 @@ __iter<match_object *> *re_object::finditer(str *subj, __ss_

match_object *re_object::__exec(str *subj, __ss_int pos, __ss_int endpos, __ss_
{
+ if(subj->unit.size() == 0) return (match_object *)NULL;
+

I don't think that's valid, since the regex might match an empty string: ^a?$

With those fixed, CPython takes 19 seconds and shedskin takes 11.5 seconds on my (old) desktop. As a ratio, this gives SS a score of 3.5 for patmch:1t. Then with 58 for CPython and 71 for SS, that's a score of 15.4 for patmch:2t.

not too bad, but I *think* shedskin may still uffer from its slow file IO implementation here, because we're ploughing through quite a big file. same thing for the dict benchmark. perhaps that explains why that one is slower than cpython.

Yeah, that's likely going to cause it since the file is 38MB.

I was also surprised by the slower speed for 2t, but I'll send the fixed code to attractivechaos and see what he comes up with.

thanks! :) would be nice in any case not to have '999' there..
mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Mark Dufour

unread,

Jun 19, 2011, 5:20:17 AM6/19/11

to shedskin...@googlegroups.com

>I don't think that's valid, since the regex might match an empty string: ^a?$

right, thanks.. I just committed this fix, and added your regex to tests/197.py:

http://gitorious.org/shedskin/mainline/commit/7d47f989c783be3777482ff17b384a9a681940d7

thanks again!
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

srepmub

unread,

Jun 19, 2011, 3:22:22 PM6/19/11

to shedskin-discuss

hi fahrzin,

> With those fixed, CPython takes 19 seconds and shedskin takes 11.5
> seconds on my (old) desktop. As a ratio, this gives SS a score of 3.5
> for patmch:1t. Then with 58 for CPython and 71 for SS, that's a score of
> 15.4 for patmch:2t.
>
> I was also surprised by the slower speed for 2t, but I'll send the fixed
> code to attractivechaos and see what he comes up with.

francois and I are discussing some approaches to improving standard
file I/O performance. I just tested a possible improvement, and it
makes patmch1 about 3 times faster.. :-) I will let you know once we
commit something.

thanks,
mark.

srepmub

unread,

Jun 20, 2011, 1:52:31 PM6/20/11

to shedskin-discuss

> francois and I are discussing some approaches to improving standard
> file I/O performance. I just tested a possible improvement, and it
> makes patmch1 about 3 times faster.. :-) I will let you know once we
> commit something.

I just committed francois's patch to use "unlocked stdio" functions,
making patmch almost three times faster on my PC. nice to see libpcre
is holding its own here, though perl is still a lot faster. I'm quite
sure the dict test will be a lot faster now as well..

mark.

srepmub

unread,

Jun 21, 2011, 1:17:00 AM6/21/11

to shedskin-discuss

the author has updated the page after these fixes.. :-)

I will have a look if shedskin suffers another silly problem for the
dict benchmark..

thanks,
mark.

srepmub

unread,

Jun 22, 2011, 6:07:37 AM6/22/11

to shedskin-discuss

> I will have a look if shedskin suffers another silly problem for the
> dict benchmark..

the problem still exists when we simplify the program to this:

h = {}
for l in sys.stdin:
h[l] = 1

adding some timers shows that the slowness comes in phases:

h, m = {}, 0
count = 0
t0 = t00 = time.time()
for l in sys.stdin:
count += 1
if count % 100000 == 0:
print 'tick', time.time()-t0, time.time()-t00
t0 = time.time()
h[l] = 1
print(len(h), m)

->

tick 0.0799760851078 4.41815991863
tick 0.0799441015115 4.49813691864
tick 0.590437019709 5.08860491868
tick 0.0756970762741 5.16434291867
tick 0.0758000835776 5.24017791869

it looks like the GC takes about half a second for the middle 'tick'
here, which seems a bit much since the program only allocates about
250 MB here..?

if I add up all these longish ticks, I get about the difference as
seen with cpython on the benchmark page..

on the bright side, perhaps some improvement in dealing with the GC
for dicts and sets (which are quite similar) could give a nice general
speed boost for dict/set intensive programs.. :)

I'm still a bit tired from releasing 0.8, but if anyone would like to
look into this problem further, please see the boehm gc homepage for
some possible tricks. some I could find:

-it is recommended to use GC_MALLOC_IGNORE_OFF_PAGE for allocations
over >100 KB.. here we're dealing with 30 MB.. :-)

-as FFAO suggested before (author of the dict and set
implementations), perhaps gc_typed.h can be used to avoid the GC from
scanning the most part of dictentry and setentry structs.. it seemed
too complicated to me before, but perhaps I underestimated its
potential. it's possible the hash values in here are causing havoc
within the GC (since they seem to point all over the place, nothing
can be deallocated perhaps).

-I tried to manually GC_FREE the old dict contents on resize, but that
only made things slower.

thanks,
mark.

Enzo Erbano

unread,

Jun 22, 2011, 10:42:48 AM6/22/11

to shedskin...@googlegroups.com

Hi mark,

Can you post only the generated C++ code?

And the gcc environment you are using, what is it?

I can compile it and fiddle with asm code very well to see what is the problem. :-)

If there exist a problem, it will become well visible when checking the asm code.

Best regards.

2011/6/22 srepmub <mark....@gmail.com>

mark.

Enzo Erbano

unread,

Jun 22, 2011, 8:36:38 PM6/22/11

to shedskin...@googlegroups.com

I got a problem here when using GCC from your build.

It is missing a file called libgmp-10.dll. What is it, and what it means?

2011/6/22 Enzo Erbano <enzo....@gmail.com>

Sorry,

I saw here that the latest version for windows already have the compiler...

I got the codes for dict and I am going to try it now.

2011/6/22 Enzo Erbano <enzo....@gmail.com>

Enzo Erbano

unread,

Jun 22, 2011, 4:48:48 PM6/22/11

to shedskin...@googlegroups.com

Sorry,

I saw here that the latest version for windows already have the compiler...

I got the codes for dict and I am going to try it now.

2011/6/22 Enzo Erbano <enzo....@gmail.com>

Hi mark,

srepmub

unread,

Jun 24, 2011, 9:35:27 AM6/24/11

to shedskin-discuss

hm, the dict benchmark seems to have been changed to optimize the
cpython case, and as a strange side-effect shedskin is faster than
pypy here as well now. I'm still curious if shedskin can go fastere
here with a bit of tuning of the GC.

srepmub

unread,

Jun 24, 2011, 9:38:48 AM6/24/11

to shedskin-discuss

> I saw here that the latest version for windows already have the compiler...
>
> I got the codes for dict and I am going to try it now.
>

:-)

I don't think you'll find much to optimize here, because the problem
seems to be in the GC process, probably caused by having it scan too
much data, for which we know it doesn't contain pointers.. at least,
that's my guess of why the dict benchmark goes slower than when using
cpython. though I'd be interested if you can get the dict code to run
faster of course..

thanks,
mark.

Enzo Erbano

unread,

Jun 24, 2011, 11:02:54 AM6/24/11

to shedskin...@googlegroups.com

Hi mark,

My plan is to reverse the code compiled by GC, and see where exactly

it is "over-interpreting" the code, and we can se what we can replace in

the generated code by shedskin to avoid the problem.

I think it could be really interesting and profitable, and I have here all the

tools and experience with asm to reverse the codes, but, I am having

problem to compile with the GCC in the shedskin package. When I do

the make command, it said it is missing that DLL and abort.

So, I am going to try another thing.... I am downloading the latest codeblocks

as it have also an embedded GCC.

2011/6/24 srepmub <mark....@gmail.com>

thanks,
mark.

Enzo Erbano

unread,

Jun 24, 2011, 1:04:55 PM6/24/11

to shedskin...@googlegroups.com

Hi mark,

I also have tried to compile the codes to use in MSVC, and latter to

have support for Intel C++. The problem I found is that the code links

to GCC very specific headers, and I found no replacement for MSVC.

The headers are:

#include <gc/gc_allocator.h>

#include <gc/gc_cpp.h>

2011/6/24 Enzo Erbano <enzo....@gmail.com>

Jérémie Roquet

unread,

Jun 24, 2011, 1:14:19 PM6/24/11

to shedskin...@googlegroups.com

Hi Enzo,

2011/6/24 Enzo Erbano <enzo....@gmail.com>:

> I also have tried to compile the codes to use in MSVC, and latter to
> have support for Intel C++. The problem I found is that the code links
> to GCC very specific headers, and I found no replacement for MSVC.
> The headers are:
> #include <gc/gc_allocator.h>
> #include <gc/gc_cpp.h>

These files are those of boehm gc¹ on which Shedskin depends. You
have to add them to your include path whatever compiler you use.

Best regards,

¹ http://www.hpl.hp.com/personal/Hans_Boehm/gc/

--
Jérémie

Enzo Erbano

unread,

Jun 24, 2011, 11:22:30 PM6/24/11

to shedskin...@googlegroups.com

Hi Jeremie,

Yes, I noticed this!

Well, here I tried to compile the code in 3 ways:

- first with the latest codeblocks [version 10, that still use an older GCC], in this case, it seems

that the include SS uses does not exist in the GCC version of this codeblocks, that means these

includes is not backward compatible with older GCC, at least here.

- second, I tried to compile with MSVC 2008 express, and, it seems that the GCC includes SS

uses is also not compatible with MSVC, at least here. And it also seems MSVC have some other

incompatibilities, that I dont saw yet.

- I also tried GCC embedded on the SS for windows and it is giving me problem when I use the

make command.

So, the only thing that comes on my mind to solve the problem would be the use of plain old

functions to replace the ones from the include [gc/gc_allocator.h, gc/gc_cpp.h], since, most of

them are free, alloc, and brothers, that are very common, I guess. Doing this, will make it

also compatible to MSVC, right?

Correct me if I am wrong.

I would be very happy to get it compiled with GCC, at least.

Best Regards.

2011/6/24 Jérémie Roquet <arka...@gmail.com>

--

gcc.png

gcc-SS.png

ffao

unread,

Jun 24, 2011, 11:42:00 PM6/24/11

to shedskin-discuss

> -as FFAO suggested before (author of the dict and set
> implementations), perhaps gc_typed.h can be used to avoid the GC from
> scanning the most part of dictentry and setentry structs.. it seemed
> too complicated to me before, but perhaps I underestimated its
> potential. it's possible the hash values in here are causing havoc
> within the GC (since they seem to point all over the place, nothing
> can be deallocated perhaps).

At the time I agreed the simple allocation was the right thing to do,
as using typed allocation was too much extra complexity for too little
gain. Even now, I don't know if this is especially helpful in this
case, but I could always play with it if I have some time during the
week. I'm not guaranteeing anything though, and if someone more
experienced with the GC were to fiddle with it, it would probably be
even better. :)

Sal

unread,

Jun 25, 2011, 2:16:08 AM6/25/11

to shedskin...@googlegroups.com

Enzo, can you try a one line 'print "hello"' and make? Downloading the win32 shedskin with gcc works for me like that well. From their maybe I can help troubleshoot.

The gc/* files are boehm gc lib, afaik they should compile under msvc as well ... If not the shed versions ones from the boehm website might work? Bu I've found the gcc route most reliable so far myself...

Hope it helps a little!!

<gcc.png>

<gcc-SS.png>

Mark Dufour

unread,

Jun 25, 2011, 1:57:46 PM6/25/11

to shedskin...@googlegroups.com

The gc/* files are boehm gc lib, afaik they should compile under msvc as well ... If not the shed versions ones from the boehm website might work? Bu I've found the gcc route most reliable so far myself...

note there is a hidden shedskin -v option to generate an msvc makefile, and someone got all the unit tests to pass using MSVC some time ago:

http://code.google.com/p/shedskin/issues/detail?id=63

no idea what the status of this is now..

thanks,
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Enzo Erbano

unread,

Jun 25, 2011, 8:16:22 PM6/25/11

to shedskin...@googlegroups.com

Hi mark,

I dont used the -v option at the first time I tried, but now, I tried the -v option, and I am getting problem because the unordered_map only

exist on MSVC 2010, not on older versions.

I also replaced the GC and PCRE with the latest ones I found, to try compile, it compiled the code, but not linked.

The error name is "undefined reference", and, as I see about it, It seems to be not linking because mixing c++ with C code,

[I know it is quite obvious], and it is appointing all the times the GC header files [gc_allocator.H and gc_cpp.H]. But,

I really cant understand why it is happening, as I dont changed the codes, and my GCC seems to be configured to

compile C as C++.

Maybe I have doing something very wrong here... Or there is an issue with Windows and GCC.

Someone have any idea?

2011/6/25 Mark Dufour <mark....@gmail.com>

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

--

Enzo Erbano

unread,

Jun 26, 2011, 12:37:41 AM6/26/11

to shedskin...@googlegroups.com

Hi again,

I finally got the embedded GCC-SS to work, I simple started it in a clean box, it worked very well,

but I would like to compile it in a IDE, like MSVC or at least codeblocks.

But, for now, it will serve my purpose.

I did a disassembly on the codes, and I get the pattern of the compiled codes.

Here, it seems that GC is used all over the app, even where it is not necessary!

The total occurrences of GC_malloc calls in the code [not considering loops] is ~510,

also, in the same chain of GC_malloc it calls throw_logic_error [it is a std exception

handling I guess] all the time.

The total occurrences of GC_free is around ~90 , and near it is used throw_out_of_range some

[more exception handling] times.

In general, the total occurrences of trow exceptions are around ~100 times.

Humm, it seems to having a lot of overhead.

For optimization purposes, I guess would be better to use directly the functions [malloc or free, etc],

or, replace for a better garbage collector, some time ago I heard that the one used on

Google Chrome is one of the best available, and it seems to be true as Google Chrome beats

all browsers in benchmarks [take a look on this for proof http://clients.futuremark.com/peacekeeper/index.action].

Best regards.

2011/6/25 Enzo Erbano <enzo....@gmail.com>

Mark Dufour

unread,

Jun 26, 2011, 4:34:33 AM6/26/11

to shedskin...@googlegroups.com

I dont used the -v option at the first time I tried, but now, I tried the -v option, and I am getting problem because the unordered_map only

exist on MSVC 2010, not on older versions.

this problem should be fixed in GIT. the unordered_map is not used anymore, so I removed the include.

thanks,
mark.
--

http://www.youtube.com/watch?v=E6LsfnBmdnk

Mark Dufour

unread,

Jun 26, 2011, 4:53:22 AM6/26/11

to shedskin...@googlegroups.com

hi enzo,

Here, it seems that GC is used all over the app, even where it is not necessary!

would it be possible to be more specific about what it is doing exactly which is not necessary..? I'm guessing that for this case, the main problem is that it is fully scanning the memory allocated for the dictionary, including integers and hash values. hash values are particularly bad because they seem to point all over memory, so the GC cannot free anything and it's just GC hell.

In general, the total occurrences of trow exceptions are around ~100 times.

you mean all the exceptions together..? exceptions in C++ are very slow, but 100 of them doesn't take much time..

or, replace for a better garbage collector, some time ago I heard that the one used on

Google Chrome is one of the best available, and it seems to be true as Google Chrome beats
all browsers in benchmarks [take a look on this for proof http://clients.futuremark.com/peacekeeper/index.action].

I'm afraid this is not a GC you can just plug into C++.. I really haven't seen an good alternative for the Boehm GC so far. but I think with some tuning (to make it behave less conservatively), it can perform pretty well. it also apparently performs better on 64-bit systems, which should become more and more common.

thanks!
mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Enzo Erbano

unread,

Jun 26, 2011, 2:01:02 PM6/26/11

to shedskin...@googlegroups.com

To be more specific,

When I said it is used where it is not necessary, is because it is used when it is really not

necessary, for example, in the whole app, in all places that needs memory allocation, and

not only for the dictionary functions!

The exception handling is in the same "chain" of the GC_malloc, it means more or less

that each time the GC_malloc is used, it also checks to certify if the memory was allocated

I guess, so, for example, in the case that GC_malloc is used in a loop, or in a indirect

loop, the calls to GC_malloc+exceptions is multiplied by the time loops happen!

If you want, I can prepare a pseudo code from the asm codes I got, and I will send it by email

if you prefer.

Best regards.

2011/6/26 Mark Dufour <mark....@gmail.com>

François Boutines

unread,

Jun 26, 2011, 3:56:17 PM6/26/11

to shedskin...@googlegroups.com

On Sun, Jun 26, 2011 at 8:01 PM, Enzo Erbano <enzo....@gmail.com> wrote:
> To be more specific,
> When I said it is used where it is not necessary, is because it is used when
> it is really not
> necessary, for example, in the whole app, in all places that needs memory
> allocation, and
> not only for the dictionary functions!
> The exception handling is in the same "chain" of the GC_malloc, it means
> more or less
> that each time the GC_malloc is used, it also checks to certify if the
> memory was allocated
> I guess, so, for example, in the case that GC_malloc is used in a loop, or
> in a indirect
> loop, the calls to GC_malloc+exceptions is multiplied by the time loops
> happen!
> If you want, I can prepare a pseudo code from the asm codes I got, and I
> will send it by email
> if you prefer.

Hi,

Try this, but I'm afraid replacing GC_malloc/GC_free by malloc/free
would actually slow down programs on the average, since they would
then leak memory, and have to swap to disk after a certain amount of
running time. Using shared_ptrs should be slightly better, but would
suffer from the same problems in some corner cases (or lead to a much
complex program to avoid memory leaks).

As pointed out by Mark, the task comes down to a space optimization
problem. Since the incriminated program takes roughly 600Mb of RAM for
a dataset of 90Mb (which could theoretically be reduced to ~20Mb), and
is essentially page faulting at each lookup (eventually multiple
times, idem when scanning).

A linear probing hash table (with a special value for unused slots)
would give slightly better results here. Optimizing for memory usage
too.

François.

Enzo Erbano

unread,

Jun 26, 2011, 4:43:25 PM6/26/11

to shedskin...@googlegroups.com

Hi François,

The problem is that I compiled the clean example, it is not generating the dataset

to test the implementation, and it is not loading or comparing any data!

Also, I did only a static analysis of the compiled code only.

2011/6/26 François Boutines <francois...@gmail.com>

François Boutines

unread,

Jun 26, 2011, 6:35:44 PM6/26/11

to shedskin...@googlegroups.com

On Sun, Jun 26, 2011 at 10:43 PM, Enzo Erbano <enzo....@gmail.com> wrote:
> Hi François,
> The problem is that I compiled the clean example, it is not generating the
> dataset
> to test the implementation, and it is not loading or comparing any data!
> Also, I did only a static analysis of the compiled code only.

So good luck, cause you're basically trying to optimize either GC or GCC.

According to gprof : the main Shed Skin bottleneck is dict::lookup
(and str::__eq__...), allocations account for only 1% of user time.

François.

srepmub

unread,

Jun 27, 2011, 10:04:27 AM6/27/11

to shedskin-discuss

hi guys,

I think the most important question here is, why is the GC not able to
keep memory usage down? because it doesn't, it has to scan way too
much memory each GC round..

I don't think a static analysis or profile will help here.. we want to
look at memory patterns. unfortunately valgrind does not seem to work
with libgc, but I guess we can get all the necessary information from
the GC itself:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/debugging.html

I'm still betting this will show the GC misinterprets the hash values
inside the memory allocated for the dict..

so the problem would be that the GC is used too conservatively here,
with all the bad consequences that brings. fortunately, I think we can
avoid conservative scanning completely for shedskin-generate code. for
example, using gc_typed.h.

thanks,
mark.

On Jun 27, 12:35 am, François Boutines <francois.bouti...@gmail.com>
wrote:

srepmub

unread,

Jun 27, 2011, 11:17:03 AM6/27/11

to shedskin-discuss

also interesting:

http://www.opensource.apple.com/source/gcc/gcc-5479/boehm-gc/doc/README.environment?txt

might be nice to try with GC_DONT_GC for example.

thanks,
mark.

Enzo Erbano

unread,

Jun 27, 2011, 11:47:35 AM6/27/11

to shedskin...@googlegroups.com

Wow,

Seems to be interesting, disabling the garbage collector will make it only use the basics, right?

It might give some boost, have you tried it?

2011/6/27 srepmub <mark....@gmail.com>

François Boutines

unread,

Jun 27, 2011, 11:54:38 AM6/27/11

to shedskin...@googlegroups.com

On Mon, Jun 27, 2011 at 5:17 PM, srepmub <mark....@gmail.com> wrote:
> also interesting:
>
> http://www.opensource.apple.com/source/gcc/gcc-5479/boehm-gc/doc/README.environment?txt
>
> might be nice to try with GC_DONT_GC for example.

Hi,

I tried with the last gc-7.2alpha6,

./configured with --enable-cplusplus --enable-parallel-mark
--enable-threads=pthreads --enable-thread-local-alloc

The code is way faster on multicore machines this way...

François.

srepmub

unread,

Jun 27, 2011, 1:52:01 PM6/27/11

to shedskin-discuss

> Seems to be interesting, disabling the garbage collector will make it only
> use the basics, right?
>
> It might give some boost, have you tried it?

I just tried this, and it speeds up the program by about 30%..
bringing it close to cpython performance, but not quite there yet
(cpython 3.3 seconds, shedskin 4.3 seconds).

thanks,
mark.

srepmub

unread,

Jun 27, 2011, 1:54:31 PM6/27/11

to shedskin-discuss

> I tried with the last gc-7.2alpha6,
>
> ./configured with --enable-cplusplus --enable-parallel-mark
> --enable-threads=pthreads --enable-thread-local-alloc
>
> The code is way faster on multicore machines this way...

that's great to hear, I was hoping parallel marking would help.. :D

what speed do you see compared to cpython now..?

thanks,
mark.

Enzo Erbano

unread,

Jun 27, 2011, 2:03:55 PM6/27/11

to shedskin...@googlegroups.com

Hi mark,

It is great! :-)

For what I saw here in the codes, I really believe that there is some overhead...

I need some more time to check it and I will let you know when I get full details.

But, giving some info about what I found here, there is used c++ exceptions on the

lookup... maybe it is where the overhead is happening?!?

2011/6/27 srepmub <mark....@gmail.com>

François Boutines

unread,

Jun 27, 2011, 2:12:39 PM6/27/11

to shedskin...@googlegroups.com

Within 13% of Python 2.6.5:
- cpython: 4.2s
- shedskin-gc7.2-multi-proc: 4.8s
- shedskin-gc7.2-single-proc: 5.5s
- shedskin-gc6.8 (gitorious version): 6.6s

François.

Mark Dufour

unread,

Jun 27, 2011, 2:29:09 PM6/27/11

to shedskin...@googlegroups.com

Within 13% of Python 2.6.5:
- cpython: 4.2s
- shedskin-gc7.2-multi-proc: 4.8s
- shedskin-gc7.2-single-proc: 5.5s
- shedskin-gc6.8 (gitorious version): 6.6s

very nice, thanks :) awesome to see 7.2 being so much faster than 6.8 here..

I will mention in the tutorial that for ultimate performance, one should use a recent libgc..

on my PC:

- cpython: 3.3s
- shedskin-gc7.2-multi-proc: 5.5s
- shedskin-gc7.2-single-proc: 5.1s
- shedskin-gc6.8: 6.2s

again, 7.2 a lot faster than 6.8.. but what's going on with the multi-proc version.. and it's nowhere near cpython. are you on a 64-bit system perhaps?

thanks!
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

srepmub

unread,

Jun 27, 2011, 2:49:12 PM6/27/11

to shedskin-discuss

>
> But, giving some info about what I found here, there is used c++ exceptions
> on the
> lookup... maybe it is where the overhead is happening?!?

afaics, there's only an exception thrown from dict.__getitem__.. we
could move that to a separate function perhaps,

mark.

srepmub

unread,

Jun 27, 2011, 2:50:50 PM6/27/11

to shedskin-discuss

> afaics, there's only an exception thrown from dict.__getitem__.. we
> could move that to a separate function perhaps,

but I don't think that will help much.

thanks,
mark.

Enzo Erbano

unread,

Jun 27, 2011, 3:07:47 PM6/27/11

to shedskin...@googlegroups.com

Hi mark,

Here, what I see is a lot of exceptions. I think GCC is generating them automatic.

Maybe, if there exist a way to disable or diminish the use of c++ exceptions by

compiler flag, I think it could help [I am not aware if GCC permit to disable

exceptions, but MSVC permit disabling buffer check and some types of exceptions]

Here, I will take some days to finish my work. I am experimenting compiler flags, and

what the real effects of them on the final code, and I am also finding places where is

possible to optimize in the codes used, mainly on RE code.

Maybe, it is also better if I send you the final report when I finish, with detailed data.

This way, you can decide better on what can be done, or not.

Best Regards.

2011/6/27 srepmub <mark....@gmail.com>

mark.

François Boutines

unread,

Jun 27, 2011, 6:26:55 PM6/27/11

to shedskin...@googlegroups.com

Yup.

Got it on par with Python by using a linear probing hash table, and
tweaking dict resizing policy a little bit.

See attached patch for the curious.

Don't think it worth applying though,
since it doesn't preserve collision order and test 194 will regress here:

_hextochr = dict(('%02x' % i, chr(i)) for i in range(256))
_hextochr.update(('%02X' % i, chr(i)) for i in range(256))
print(repr(_hextochr))

(sorted output is OK though).

I may be inclined to provide a more cautious patch if really needed.

François.

dict_faster.patch

Fahrzin Hemmati

unread,

Jun 27, 2011, 8:03:49 PM6/27/11

to shedskin...@googlegroups.com

dict-ordering is not guaranteed in the python spec since it's based on
hashing which could change between even similar computers (x86 vs x64).

> Fran�ois.
>

François Boutines

unread,

Jun 28, 2011, 1:11:19 AM6/28/11

to shedskin...@googlegroups.com

On Tue, Jun 28, 2011 at 2:03 AM, Fahrzin Hemmati <fah...@gmail.com> wrote:
> dict-ordering is not guaranteed in the python spec since it's based on
> hashing which could change between even similar computers (x86 vs x64).

Thanks Fahrzin, I was surprised too, but this test just ensures that.

There is an interesting dictnotes.txt file in the Python distribution
which explains the rationale behind the hash table implementation. It
is optimized for small dictionaries (up to 16 elements) which are very
common in Python (attribute storage, string-formatting, etc). We could
maybe take some liberalities here if it doesn't break anything.

François.

Mark Dufour

unread,

Jun 28, 2011, 2:03:04 AM6/28/11

to shedskin...@googlegroups.com

> dict-ordering is not guaranteed in the python spec since it's based on
> hashing which could change between even similar computers (x86 vs x64).

Thanks Fahrzin, I was surprised too, but this test just ensures that.

yeah, I think this test is not good, so I changed it to print sorted(_hextochr).

Mark Dufour

unread,

Jun 28, 2011, 6:04:38 AM6/28/11

to shedskin...@googlegroups.com

On Jun 27, 8:29 pm, Mark Dufour <mark.duf...@gmail.com> wrote:
> > Within 13% of Python 2.6.5:
> > - cpython: 4.2s
> > - shedskin-gc7.2-multi-proc: 4.8s
> > - shedskin-gc7.2-single-proc: 5.5s
> > - shedskin-gc6.8 (gitorious version): 6.6s

on my PC:

- cpython: 3.3s
- shedskin-gc7.2-multi-proc: 5.5s
- shedskin-gc7.2-single-proc: 5.1s
- shedskin-gc6.8: 6.2s

and on my 64-bit work PC:

- cpython: 4.9
- gc 6.8 (ubuntu package): 9.7s
- gc 7.2 (configured as ubuntu package): 9.9s
- gc 7.2 parallel 9.1s

conclusion so far:

- parallel marking helps 2/3 times
- 7.2 helps 2/3 times

I'd be very interested in hearing other people's results.

Mark Dufour

unread,

Jun 28, 2011, 6:32:48 AM6/28/11

to shedskin...@googlegroups.com

Got it on par with Python by using a linear probing hash table, and
tweaking dict resizing policy a little bit.

See attached patch for the curious.

great, thanks for digging into this.. :-)

I note you also compare hash values in str.__eq__. that seems a very good idea, so I went ahead and committed this already.

hm, I guess if we could somehow cache short strings, this would have extra effect, because once a string is used as a key, all 'shared' strings receive a hash value that will be used here.

not sure what cpython does for short strings with length > 1.

I may be inclined to provide a more cautious patch if really needed.

I see a 3% speed improvement for the str.__eq__ change, and a further 8% improvement after the resize/linear probing change (for the dict benchmark). so that's quite interesting. I'm a bit hesitant to leave the way cpython does things though, because that's really well optimized for many cases, not just one benchmark. sure it's perhaps optimized a bit toward small dicts, but I would really like to see a thorough comparison or good explanation why this is a general optimization. and I'd really like to hear FFAO's opinion about this as well, since he wrote the original dict implementation, and probably knows a lot more about hash tables than I do.. :-)

thanks again in any case!

mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Fahrzin Hemmati

unread,

Jun 28, 2011, 7:32:50 AM6/28/11

to shedskin...@googlegroups.com

Static analysis might be able determine if the dictionary is small
(statically defined or assigned to often) so it can choose between
algorithms.

> Fran�ois.
>

François Boutines

unread,

Jun 28, 2011, 1:04:13 PM6/28/11

to shedskin...@googlegroups.com

On Tue, Jun 28, 2011 at 12:04 PM, Mark Dufour <mark....@gmail.com> wrote:
> On Jun 27, 8:29 pm, Mark Dufour <mark.duf...@gmail.com> wrote:
>> > Within 13% of Python 2.6.5:
>> > - cpython: 4.2s
>> > - shedskin-gc7.2-multi-proc: 4.8s
>> > - shedskin-gc7.2-single-proc: 5.5s
>> > - shedskin-gc6.8 (gitorious version): 6.6s
>
> on my PC:
>
> - cpython: 3.3s
> - shedskin-gc7.2-multi-proc: 5.5s
> - shedskin-gc7.2-single-proc: 5.1s
> - shedskin-gc6.8: 6.2s
>
> and on my 64-bit work PC:
>
> - cpython: 4.9
> - gc 6.8 (ubuntu package): 9.7s
> - gc 7.2 (configured as ubuntu package): 9.9s
> - gc 7.2 parallel 9.1s

mmm, something must be wrong Mark,

don't know how the Ubuntu package was compiled,

are your sure you're pointing to the good library ?

did you try to compile it yourself, with the flags I provided yesterday ?

eventually adding --enable-large-config and exporting
CXXFLAGS="-march=native -O3" ?

François.

Mark Dufour

unread,

Jun 28, 2011, 3:40:53 PM6/28/11

to shedskin...@googlegroups.com

did you try to compile it yourself, with the flags I provided yesterday ?

eventually adding --enable-large-config and exporting
CXXFLAGS="-march=native -O3" ?

hmm perhaps I forgot this last step, yes.. :-) will try again tomorrow..

François Boutines

unread,

Jun 28, 2011, 4:54:27 PM6/28/11

to shedskin...@googlegroups.com

On Tue, Jun 28, 2011 at 9:40 PM, Mark Dufour <mark....@gmail.com> wrote:
>
>> did you try to compile it yourself, with the flags I provided yesterday ?
>>
>> eventually adding --enable-large-config and exporting
>> CXXFLAGS="-march=native -O3" ?
>>
>
> hmm perhaps I forgot this last step, yes.. :-) will try again tomorrow..

Please, double check your LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH=/usr/lib/ # gc 6.8
$ time ./dico < in
(1227153, 18)

real 0m4.952s
user 0m4.770s
sys 0m0.180s

$ export LD_LIBRARY_PATH=/usr/local/lib/ # gc 7.2
$ time ./dico < in
(1227153, 18)

real 0m4.235s
user 0m6.550s
sys 0m0.300s

Notice how user time is different, effectively taking advantage of multi-cores.

François.

ffao

unread,

Jun 28, 2011, 6:12:02 PM6/28/11

to shedskin-discuss

Well, as I expected, tinkering with the GC didn't make a lot of
difference here, such as typing dictentries. One thing is that I don't
get that longish tick behavior, the ticks are just constantly slow
here (0.24 compared to CPython's 0.05). If you are really going to
change the dict implementation, you might want to change the set one
similarly for consistency.

Hoping for an ever-faster shedskin,
ffao

ffao

unread,

Jun 28, 2011, 6:44:38 PM6/28/11

to shedskin-discuss

Sorry for the double post, didn't notice there was an extra page in
the topic.

> I see a 3% speed improvement for the str.__eq__ change, and a further 8%
> improvement after the resize/linear probing change (for the dict benchmark).
> so that's quite interesting.

Linear probing has the obvious advantage of being significantly more
memory-friendly, but it does place more stress in the hash function as
there is a bigger tendency to create clusters, sequences of contiguous
occupied elements in the table, which are not good for lookups
(therefore, you want the hashes to be generally far apart from each
other).

As for the resize policy, I don't know if resizing less often is
advantageous for smaller dicts or not. For this huge dict the memory
saved probably more than makes up for the higher load factor. Nothing
can be set in stone, it's about what works better in a given context,
I guess...

On a related note, as CPython stores pointers to objects instead of
the objects themselves, empty and dummy are stored as special pointer
values, which helps in the memory department by eliminating the "use"
field as well. But that can't be done for shedskin as it would cause
conflicts when using ints...

François Boutines

unread,

Jun 28, 2011, 7:40:23 PM6/28/11

to shedskin...@googlegroups.com

On Wed, Jun 29, 2011 at 12:44 AM, ffao <fet...@gmail.com> wrote:
> Sorry for the double post, didn't notice there was an extra page in
> the topic.
>
>> I see a 3% speed improvement for the str.__eq__ change, and a further 8%
>> improvement after the resize/linear probing change (for the dict benchmark).
>> so that's quite interesting.
>
> Linear probing has the obvious advantage of being significantly more
> memory-friendly, but it does place more stress in the hash function as
> there is a bigger tendency to create clusters, sequences of contiguous
> occupied elements in the table, which are not good for lookups
> (therefore, you want the hashes to be generally far apart from each
> other).

Well this really starts to get annoying when load factor becomes
superior to 60~70%. Which has no chance to append in my version (or if
you use a really bad hash function...).

> As for the resize policy, I don't know if resizing less often is
> advantageous for smaller dicts or not. For this huge dict the memory
> saved probably more than makes up for the higher load factor. Nothing
> can be set in stone, it's about what works better in a given context,
> I guess...

True. My version also uses more memory, so it should be a little
longer to iterate (16.7% on average, but more in the worst case).

> On a related note, as CPython stores pointers to objects instead of
> the objects themselves, empty and dummy are stored as special pointer
> values, which helps in the memory department by eliminating the "use"
> field as well. But that can't be done for shedskin as it would cause
> conflicts when using ints...

Well its C++! We can specialize the template for value / pointer semantics.

But those builtin.* files starts to get very big... What about
switching to a directory structure: builtin/dict.hpp etc? I'm also a
bit annoyed by these public members accessed all over the place which
impair the great refactorings we may be interested in. Maybe this
should be done in another project? I don't know.

François.

srepmub

unread,

Jun 29, 2011, 1:50:07 PM6/29/11

to shedskin-discuss

> Please, double check your LD_LIBRARY_PATH:
>

> Notice how user time is different, effectively taking advantage of multi-cores.
>

I tried this again on my home PC, now more carefully, and got
precisely the same results as before.. --enable-parallel-mark only
makes things slower. -O2/-O3/march=native don't seem to have much
effect.. still it's nice to see 7.2 being faster in 2/3 cases, and
that parallel marking can actually help sometimes.. :-)

thanks,
mark.

srepmub

unread,

Jun 29, 2011, 2:08:15 PM6/29/11

to shedskin-discuss

thanks a lot for looking into this! :-)

I'm not sure whether I saw those long ticks on a 32 or 64 bit
system..

well, I'm planning to also play a bit with this in the weekend, or at
least see if I can understand some of the statistics/dumps that one
can get from the GC. so this probably won't lead to an improvement,
but I think it will be a useful exercise.

I'm not looking forward to change the set/dict implementations.. they
were tuned for 20 years by the python developers (together with the
used hash functions), and I think changing them is likely to make
things worse (in general). plus I really like to be consistent with
CPython.. :-)

it may well be that in the generated C++ we typically use less small
dictionaries, and that some tuning could lead to a general improvement
in this regard, but I would really like to see some convincing
evidence before considering changing them in a fundamental way.

thanks again,
mark.

srepmub

unread,

Jun 29, 2011, 2:14:25 PM6/29/11

to shedskin-discuss

> But those builtin.* files starts to get very big... What about
> switching to a directory structure: builtin/dict.hpp etc? I'm also a

yeah, it's on my list for 0.9 to finally split those up. a lib/builtin
directory sounds logical. patches are welcome if you'd like to beat me
to it.. :-)

> bit annoyed by these public members accessed all over the place which
> impair the great refactorings we may be interested in. Maybe this

so let's start cleaning that up.. ;-)

> should be done in another project? I don't know.

uh, you don't mean to develop the builtins as a project separate from
shedskin..? :-)

mark.

Enzo Erbano

unread,

Jun 29, 2011, 2:41:48 PM6/29/11

to shedskin...@googlegroups.com

Hi mark,

Have you tried to use this?

-fno-rtti

-fno-exceptions

It is supposed to disable c++ exceptions on GCC.

Best regards.

2011/6/29 srepmub <mark....@gmail.com>

François Boutines

unread,

Jun 29, 2011, 2:52:27 PM6/29/11

to shedskin...@googlegroups.com

On Wed, Jun 29, 2011 at 8:41 PM, Enzo Erbano <enzo....@gmail.com> wrote:
> Hi mark,
> Have you tried to use this?
> -fno-rtti
> -fno-exceptions
> It is supposed to disable c++ exceptions on GCC.

Enzo,

try it for yourself,

this won't compile!

even if you get it to compile, this will break Shed Skin...

François.

François Boutines

unread,

Jun 29, 2011, 2:54:01 PM6/29/11

to shedskin...@googlegroups.com

On Wed, Jun 29, 2011 at 8:14 PM, srepmub <mark....@gmail.com> wrote:
>
>> But those builtin.* files starts to get very big... What about
>> switching to a directory structure: builtin/dict.hpp etc? I'm also a
>
> yeah, it's on my list for 0.9 to finally split those up. a lib/builtin
> directory sounds logical. patches are welcome if you'd like to beat me
> to it.. :-)

Perfect Mark!!

Good decision.

>> bit annoyed by these public members accessed all over the place which
>> impair the great refactorings we may be interested in. Maybe this
>
> so let's start cleaning that up.. ;-)

Yeah! What do we do first?

Splitting then canonicalizing seems logical?

Let all volunteers come forward!

>> should be done in another project? I don't know.
>
> uh, you don't mean to develop the builtins as a project separate from
> shedskin..? :-)

Nope I meant forking Shed Skin as a whole ;)

srepmub

unread,

Jun 30, 2011, 2:52:21 PM6/30/11

to shedskin-discuss

> Yeah! What do we do first?
>
> Splitting then canonicalizing seems logical?

yes, that sounds logical. would you like to do this..? putting
everything under lib/builtin/, and the including all the parts from
lib/builtin.?pp sounds fine to me.

thanks,
mark.

srepmub

unread,

Jun 30, 2011, 3:10:45 PM6/30/11

to shedskin-discuss

On Jun 29, 12:12 am, ffao <fet...@gmail.com> wrote:

regardless of performance, there seems to be a memory leak here.. if I
cat the input 10 times, shedskin memory usage goes through the roof
(about 2 GB here, after which the GC quits with an error), while
cpython barely goes over 70 MB. of course fixing such a leak can also
only be good for performance..

mark.

François Boutines

unread,

Jun 30, 2011, 4:06:44 PM6/30/11

to shedskin...@googlegroups.com

There may be a memory leak, but …. the great difference I see is:
Python shares strings since they are immutable.
So here, you basically have 50 millions strings, whereas Python
manages 1.2 million of them constantly...

That's precisely why I want to refactor the builtins ;)

François.

ffao

unread,

Jun 30, 2011, 7:23:45 PM6/30/11

to shedskin-discuss

> regardless of performance, there seems to be a memory leak here.. if I
> cat the input 10 times, shedskin memory usage goes through the roof
> (about 2 GB here, after which the GC quits with an error), while
> cpython barely goes over 70 MB. of course fixing such a leak can also
> only be good for performance..

I don't know how to accurately measure memory consumption on windows,
but for a quick test you can type dictentries by including gc_typed.h
and changing myallocate<K,V> from GC_MALLOC(n) to

typedef dictentry<K,V> T;
GC_descr T_descr;
GC_word T_bitmap[GC_BITMAP_SIZE(T)] = {0};
GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,key));
GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,value));
T_descr = GC_make_descriptor(T_bitmap, GC_WORD_LEN(T));

return GC_CALLOC_EXPLICITLY_TYPED(n/sizeof(T), sizeof(T), T_descr);

(note that this does not allocate int keys or values atomically and
creates a new descriptor for each allocation, but for this specific
test it shouldn't matter much.)

So you could run a second test to see how much difference it makes. I
don't really know how Python manages the str type, but if there is a
50-fold difference to the number of strings stored I guess it's
worthwhile to look into that...

François Boutines

unread,

Jun 30, 2011, 8:12:16 PM6/30/11

to shedskin...@googlegroups.com

On Fri, Jul 1, 2011 at 1:23 AM, ffao <fet...@gmail.com> wrote:
>> regardless of performance, there seems to be a memory leak here.. if I
>> cat the input 10 times, shedskin memory usage goes through the roof
>> (about 2 GB here, after which the GC quits with an error), while
>> cpython barely goes over 70 MB. of course fixing such a leak can also
>> only be good for performance..
>
> I don't know how to accurately measure memory consumption on windows,
> but for a quick test you can type dictentries by including gc_typed.h
> and changing myallocate<K,V> from GC_MALLOC(n) to
>
> typedef dictentry<K,V> T;
> GC_descr T_descr;
> GC_word T_bitmap[GC_BITMAP_SIZE(T)] = {0};
> GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,key));
> GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,value));
> T_descr = GC_make_descriptor(T_bitmap, GC_WORD_LEN(T));
>
> return GC_CALLOC_EXPLICITLY_TYPED(n/sizeof(T), sizeof(T), T_descr);
>
> (note that this does not allocate int keys or values atomically and
> creates a new descriptor for each allocation, but for this specific
> test it shouldn't matter much.)

Interesting... this appears to consume twice less memory.

> So you could run a second test to see how much difference it makes. I
> don't really know how Python manages the str type, but if there is a
> 50-fold difference to the number of strings stored I guess it's
> worthwhile to look into that...

Indeed. Is this sufficient as an evidence?

$ python -c'a="foo";b="foo";c="foo";assert id(a)==id(b)==id(c)'

Mark Dufour

unread,

Jul 1, 2011, 4:03:47 AM7/1/11

to shedskin...@googlegroups.com

I don't know how to accurately measure memory consumption on windows,
but for a quick test you can type dictentries by including gc_typed.h
and changing myallocate<K,V> from GC_MALLOC(n) to

typedef dictentry<K,V> T;
GC_descr T_descr;
GC_word T_bitmap[GC_BITMAP_SIZE(T)] = {0};
GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,key));
GC_set_bit(T_bitmap, GC_WORD_OFFSET(T,value));
T_descr = GC_make_descriptor(T_bitmap, GC_WORD_LEN(T));

return GC_CALLOC_EXPLICITLY_TYPED(n/sizeof(T), sizeof(T), T_descr);

thanks for this! I will test check if this solves the memory leak, and if not, what other problems we may have. it's actually seems quite simple to get an overview from the GC of what remains allocated and why..

if anyone else would like to try this:

-build the GC with ./configure --enable-gc-debug
-add #define GC_DEBUG before including the gc header files
-export GC_BACKTRACES=n (how many random traces should it dump after each GC round)
-let a memory leak go wild for a while, so there's a good chance these random traces show why the leak is occurring.

I think I will write a little script today to summarize the output.

don't really know how Python manages the str type, but if there is a
50-fold difference to the number of strings stored I guess it's
worthwhile to look into that...

if they are actually stored, it's bad, I agree. but here we only store a fraction of the total amount in the dict, and the rest can be GC'ed. quite efficiently as well, I would think..

thanks,
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Mark Dufour

unread,

Jul 1, 2011, 4:13:02 AM7/1/11

to shedskin...@googlegroups.com

> So you could run a second test to see how much difference it makes. I
> don't really know how Python manages the str type, but if there is a
> 50-fold difference to the number of strings stored I guess it's
> worthwhile to look into that...

Indeed. Is this sufficient as an evidence?

$ python -c'a="foo";b="foo";c="foo";assert id(a)==id(b)==id(c)'

I just added support for 'id' to shedskin for you, so you can try this with shedskin.. ;-)

the following I think also shows that string sharing 'detection' comes with overhead, so it cannot be used always:

a = 'a'
b = 'b'
print id('ab') == id(a+b)

l = 'hoppa hoppa'.split()
print id(l[0]) == id(l[1])

f = open('test','w')
f.write('beheh\n')
f.write('beheh\n')
f.close()

l = list(file('test'))
print id(l[0]) == id(l[1])

at least on my system, this gives three times False..

(btw, note that compiling this with shedskin gives a strange IO error.. perhaps you could have a look at that?)

so it's not entirely clear to my what heuristic cpython uses for string sharing (for strings with len >1 that is - 1-length strings are a very important case, and shared both in cpython and shedskin.)

even if it didn't share any strings, reference counting would still make sure cpython only keeps the strings in the dict allocated..

Mark Dufour

unread,

Jul 1, 2011, 7:00:47 AM7/1/11

to shedskin...@googlegroups.com

I think I will write a little script today to summarize the output.

added scripts/gcgraph.py to GIT. very ugly/hacky, and specialized for dictentries for now, but easily modified to analyze another situation later.. see attachment for output for the dict benchmark, and instructions in the top of the file to reproduce.

note that what is summarized are relatively few, random GC "reachability" traces. the amount of traces can be set via environment variable GC_BACKTRACES. they are very slow to gather, so I used 100 now. so that's 100 random traces per GC round.

the nodes in the output are all objects of a certain size ('small' should be mostly strings here). the edges are offsets (modulo 16 for large objects, sizeof(dictentry<K,V>) here) within those objects where they seem to point to other objects, with a counter how often this occurs.

so I guess this shows indeed that the hash values make it so that the 'old' dict objects after resizing cannot be deallocated by the GC. but yeah, this is only 33 MB extra or so, so it doesn't seem to explain the memory leak.. let me look at the script again..

gcgraph.png

Mark Dufour

unread,

Jul 1, 2011, 8:41:29 AM7/1/11

to shedskin...@googlegroups.com

interestingly, using typed allocation, I get almost the same graph.. meaning the GC still seems to be scanning the hash values..!?

francois, what did you do exactly to see the memory usage drop in half..? :-)

ffao

unread,

Jul 1, 2011, 10:21:07 AM7/1/11

to shedskin-discuss

I did a bit of preliminary testing using GC_get_heap_size to print the
heap size after each allocation... Here's what I get for the last two
allocations:

For GC_MALLOC:
2486272
9629696

For GC_MALLOC_IGNORE_OFF_PAGE:
2486272
7221248

For GC_CALLOC_EXPLICITLY_TYPED:
2830336
9756672

If the type information is getting ignored, that is pretty weird.
Documentation is pretty much non-existant, so I have no idea what's
wrong here...

srepmub

unread,

Jul 1, 2011, 10:23:22 AM7/1/11

to shedskin-discuss

On Jul 1, 10:13 am, Mark Dufour <mark.duf...@gmail.com> wrote:
> > > So you could run a second test to see how much difference it makes. I
> > > don't really know how Python manages the str type, but if there is a
> > > 50-fold difference to the number of strings stored I guess it's
> > > worthwhile to look into that...
>
> > Indeed. Is this sufficient as an evidence?
>
> > $ python -c'a="foo";b="foo";c="foo";assert id(a)==id(b)==id(c)'
>
> I just added support for 'id' to shedskin for you, so you can try this with
> shedskin.. ;-)
>
> the following I think also shows that string sharing 'detection' comes with
> overhead, so it cannot be used always:
>
> a = 'a'
> b = 'b'
> print id('ab') == id(a+b)
>
> l = 'hoppa hoppa'.split()
> print id(l[0]) == id(l[1])
>
> f = open('test','w')
> f.write('beheh\n')
> f.write('beheh\n')
> f.close()
>
> l = list(file('test'))
> print id(l[0]) == id(l[1])
>
> at least on my system, this gives three times False..

just found out about the 'intern' builtin in python. might be
interesting in this context.

>>> a = 'a'
>>> b = 'b'

>>> c = a+b
>>> d = a+b
>>> c is d
False
>>> c2 = intern(c)
>>> d2 = intern(d)
>>> c2 is d2
True

thanks,
mark.

François Boutines

unread,

Jul 1, 2011, 10:33:25 AM7/1/11

to shedskin...@googlegroups.com

On Fri, Jul 1, 2011 at 2:41 PM, Mark Dufour <mark....@gmail.com> wrote:
>
> interestingly, using typed allocation, I get almost the same graph.. meaning
> the GC still seems to be scanning the hash values..!?
>
> francois, what did you do exactly to see the memory usage drop in half..?
> :-)

Printed the pid,

looked at it with top.

What strange IO error do you get with the id() example ?

Mark Dufour

unread,

Jul 1, 2011, 11:04:43 AM7/1/11

to shedskin...@googlegroups.com

even if it didn't share any strings, reference counting would still ma

ke sure cpython only keeps the strings in the dict allocated..

I think this shows python doesn't share the strings here, actually:

import collections
import sys
ids = collections.defaultdict(list)
for line in sys.stdin:
ids[id(line)].append(line)
print len(ids)

other than refcounting, it probably also helps that it recycles memory for strings.

Mark Dufour

unread,

Jul 1, 2011, 11:06:42 AM7/1/11

to shedskin...@googlegroups.com

What strange IO error do you get with the id() example ?

just compiling and running the examples gives the following:

False
False
IOError: [Errno 26] Text file busy: 'test'

Mark Dufour

unread,

Jul 2, 2011, 5:39:02 AM7/2/11

to shedskin...@googlegroups.com

For GC_MALLOC_IGNORE_OFF_PAGE:
2486272
7221248

thanks for reminding me of this one :-)

I tested this with the dict test (5 million strings), and see three advantages (gc 7.2, 32-bit):

- average speedup of 5%
- average memory reduction of 10%
- no more GC warnings.. ^^

but do you think it's safe to use this one? from the sparse documentation, this is not entirely clear to me..

thanks,
mark.
--

http://www.youtube.com/watch?v=E6LsfnBmdnk

ffao

unread,

Jul 2, 2011, 1:48:51 PM7/2/11

to shedskin-discuss

> but do you think it's safe to use this one? from the sparse documentation,
> this is not entirely clear to me..

From what I gather from the documentation, when you use
IGNORE_OFF_PAGE you're assuring that a pointer to the first 512 bytes
of the object will be kept. As the dict object keeps a pointer to the
very beginning of the table, I don't see why this wouldn't be true as
long as the table is still useful. It does say that it's best to
declare the pointer volatile to keep it from being fiddled with by the
compiler, but I don't know what sorts of impacts this could have...

I wanted to do it the typed way, but that seems to be extremely
difficult to understand, presuming it works correctly. I do know that
it's not completely ignoring typing information, though, as removing
key from the list of pointers causes the program to segfault almost
immediately.

Mark Dufour

unread,

Jul 2, 2011, 2:03:09 PM7/2/11

to shedskin...@googlegroups.com

From what I gather from the documentation, when you use
IGNORE_OFF_PAGE you're assuring that a pointer to the first 512 bytes
of the object will be kept. As the dict object keeps a pointer to the
very beginning of the table, I don't see why this wouldn't be true as
long as the table is still useful. It does say that it's best to
declare the pointer volatile to keep it from being fiddled with by the
compiler, but I don't know what sorts of impacts this could have...

ah, I was thinking about the pointer to the dict.. the pointer to the dictentries is only in one place I guess, so it's probably easy to declare that as volatile if necessary. I don't really expect a problem if we don't declare it as such though.. I can't imagine it being optimized away.

I will benchmark this a bit again on monday, on a 64-bit system, and using libgc 6.8 and 7.2.

thanks!

mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Mark Dufour

unread,

Jul 2, 2011, 2:14:59 PM7/2/11

to shedskin-discuss

]regardless of performance, there seems to be a memory leak here.. if I

cat the input 10 times, shedskin memory usage goes through the roof
(about 2 GB here, after which the GC quits with an error), while
cpython barely goes over 70 MB. of course fixing such a leak can also
only be good for performance..

btw, but this seems to only be the case for the standard libgc 6.8.. using 7.2 there's no problem (it maxes out at about 400MB).

mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Reply all

Reply to author

Forward